Web Scraping with C# and HtmlAgilityPack: A Practical 2026 Tutorial

If you’re a .NET developer, C# is an excellent language for web scraping:

  • HttpClient is fast and battle-tested
  • you get great tooling (debuggers, LINQ, structured logging)
  • parsers like HtmlAgilityPack make HTML extraction straightforward

This tutorial is intentionally practical. We’ll build a scraper that:

  • fetches a list page
  • parses items
  • follows pagination
  • visits detail pages
  • exports clean data to CSV

And we’ll do it in a way you can ship:

  • request timeouts
  • retries
  • respectful crawl delays
  • optional proxy support (ProxiesAPI-ready)


Make C# scrapers more reliable with ProxiesAPI

When you move from a handful of pages to thousands, stability becomes a networking problem. ProxiesAPI can help reduce blocks and intermittent failures so your C# scraper keeps shipping data.


When HtmlAgilityPack is the right tool (and when it isn’t)

HtmlAgilityPack (HAP) is ideal when:

  • the site is mostly server-rendered HTML
  • you can see the data in “View Source”
  • you only need GET requests and HTML parsing

It’s not ideal when:

  • content is rendered client-side (React/Vue) and not present in HTML
  • data loads via XHR calls that require auth tokens
  • the site uses heavy bot detection that serves challenges

In those cases you may need:

  • to scrape the JSON APIs the site uses internally
  • or a real browser automation tool (Playwright)

But don’t start with a headless browser by default. Start with HTTP + HTML.
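When the data lives in an internal JSON endpoint, you can often skip HTML parsing entirely and deserialize with System.Text.Json. A minimal sketch, assuming a hypothetical endpoint that returns an array of objects with `name` and `price` fields (the record shape and endpoint are illustrative, not from a real site):

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Threading.Tasks;

// Hypothetical shape of an internal catalog endpoint's response.
public record ApiItem(
    [property: JsonPropertyName("name")] string? Name,
    [property: JsonPropertyName("price")] decimal Price
);

public static class JsonScraper
{
    // Deserialize the raw JSON payload; returns an empty list on a null body.
    public static List<ApiItem> ParseItems(string json) =>
        JsonSerializer.Deserialize<List<ApiItem>>(json) ?? new();

    // Fetch and parse in one call.
    public static async Task<List<ApiItem>> FetchItemsAsync(HttpClient http, string url)
    {
        var json = await http.GetStringAsync(url);
        return ParseItems(json);
    }
}
```

Find these endpoints in your browser's network tab; they're usually faster and more stable than the rendered HTML.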


Project setup (.NET 8 console app)

dotnet new console -n ScrapeDemo
cd ScrapeDemo

dotnet add package HtmlAgilityPack

Optional but recommended:

dotnet add package Polly

We’ll use Polly for retries.


Step 1: Build a solid HTTP layer (timeouts + headers)

Create Http.cs:

using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class Http
{
    public static HttpClient CreateClient(IWebProxy? proxy = null)
    {
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
            Proxy = proxy,
            UseProxy = proxy != null
        };

        var client = new HttpClient(handler)
        {
            Timeout = TimeSpan.FromSeconds(30)
        };

        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
        );
        client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml");
        client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");

        return client;
    }
}

Proxy / ProxiesAPI hook

If your ProxiesAPI account exposes a proxy endpoint, you can route requests through it by setting an HTTP proxy.

using System;
using System.Net;

public static class ProxyConfig
{
    public static IWebProxy? FromEnv()
    {
        // Example format: http://user:pass@host:port
        var proxyUrl = Environment.GetEnvironmentVariable("PROXIESAPI_PROXY_URL");
        if (string.IsNullOrWhiteSpace(proxyUrl)) return null;

        return new WebProxy(new Uri(proxyUrl));
    }
}

Then:

var proxy = ProxyConfig.FromEnv();
var http = Http.CreateClient(proxy);

If you don’t set PROXIESAPI_PROXY_URL, the scraper runs directly.


Step 2: Pick a demo target you’re allowed to scrape

Use a site with:

  • static HTML list pages
  • clear pagination
  • stable markup

For example:

  • a documentation directory
  • a public catalog
  • a blog archive

In code, we’ll write the scraper to be generic:

  • it starts at a StartUrl
  • it parses item links with XPath queries
  • it follows a Next link until none remains

Step 3: Parse HTML with HtmlAgilityPack

Create Parser.cs:

using HtmlAgilityPack;
using System;
using System.Net;

public static class Parser
{
    public static HtmlDocument Load(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    public static string? Text(HtmlNode? node)
    {
        if (node == null) return null;
        var t = node.InnerText?.Trim();
        return string.IsNullOrWhiteSpace(t) ? null : WebUtility.HtmlDecode(t);
    }

    public static string? Attr(HtmlNode? node, string attr)
    {
        return node?.GetAttributeValue(attr, null);
    }
}

Now you can do:

var doc = Parser.Load(html);
var title = Parser.Text(doc.DocumentNode.SelectSingleNode("//h1"));
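One HtmlAgilityPack gotcha worth knowing before we build the crawl loop: SelectNodes returns null, not an empty collection, when nothing matches. A small self-contained sketch:

```csharp
using HtmlAgilityPack;
using System;
using System.Linq;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><h1>Title</h1></body></html>");

// SelectNodes returns null (not an empty collection) when nothing matches.
var items = doc.DocumentNode.SelectNodes("//li[@class='item']");
Console.WriteLine(items == null); // True

// Guard with ?? before iterating to avoid a NullReferenceException.
foreach (var li in items ?? Enumerable.Empty<HtmlNode>())
    Console.WriteLine(li.InnerText);
```

The `?.` chains in the Scraper class below exist for exactly this reason.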

Step 4: A real scraping loop (pagination + details)

We’ll scrape:

  • list pages → collect item URLs
  • item pages → extract fields

Create Models.cs:

public record ItemRow(
    string Url,
    string? Title,
    string? Price,
    string? Availability
);

Create Scraper.cs:

using HtmlAgilityPack;
using Polly;
using Polly.Retry;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class Scraper
{
    private readonly HttpClient _http;
    private readonly AsyncRetryPolicy<HttpResponseMessage> _retry;

    public Scraper(HttpClient http)
    {
        _http = http;

        _retry = Policy
            .HandleResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500 || r.StatusCode == HttpStatusCode.TooManyRequests)
            .Or<HttpRequestException>()
            .WaitAndRetryAsync(
                retryCount: 4,
                sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Min(12, Math.Pow(2, attempt)))
            );
    }

    public async Task<string> GetHtmlAsync(string url, CancellationToken ct)
    {
        var response = await _retry.ExecuteAsync(() => _http.GetAsync(url, ct));
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync(ct);
    }

    public async Task<List<ItemRow>> CrawlAsync(string startUrl, CancellationToken ct)
    {
        var rows = new List<ItemRow>();
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        string? pageUrl = startUrl;

        while (!string.IsNullOrWhiteSpace(pageUrl))
        {
            Console.WriteLine($"LIST {pageUrl}");
            var html = await GetHtmlAsync(pageUrl, ct);
            var doc = Parser.Load(html);

            // Customize these selectors for your target site.
            // Example: list item links
            var links = doc.DocumentNode.SelectNodes("//a[@href]")
                ?.Select(n => n.GetAttributeValue("href", null))
                ?.Where(h => !string.IsNullOrWhiteSpace(h))
                ?.Distinct()
                ?.ToList() ?? new List<string>();

            // Keep only links that look like item pages.
            // Replace this with your own URL filter.
            var itemLinks = links.Where(h => h.Contains("/item/", StringComparison.OrdinalIgnoreCase)).ToList();

            foreach (var href in itemLinks)
            {
                var abs = new Uri(new Uri(pageUrl), href).ToString();
                if (!seen.Add(abs)) continue;

                var item = await ScrapeItemAsync(abs, ct);
                rows.Add(item);

                // Respectful delay
                await Task.Delay(TimeSpan.FromMilliseconds(400), ct);
            }

            // Pagination: find “next” link (customize)
            var nextNode = doc.DocumentNode.SelectSingleNode("//a[contains(translate(normalize-space(.), 'NEXT', 'next'), 'next')]");
            var nextHref = nextNode?.GetAttributeValue("href", null);

            pageUrl = string.IsNullOrWhiteSpace(nextHref) ? null : new Uri(new Uri(pageUrl), nextHref).ToString();

            await Task.Delay(TimeSpan.FromSeconds(1), ct);
        }

        return rows;
    }

    private async Task<ItemRow> ScrapeItemAsync(string url, CancellationToken ct)
    {
        Console.WriteLine($"ITEM {url}");
        var html = await GetHtmlAsync(url, ct);
        var doc = Parser.Load(html);

        // Customize these selectors for your target item page.
        var title = Parser.Text(doc.DocumentNode.SelectSingleNode("//h1"));
        var price = Parser.Text(doc.DocumentNode.SelectSingleNode("//*[contains(@class,'price')]"));

        // Note: a bare //*[contains(., 'In stock')] would match <html> first,
        // because every ancestor "contains" the text. Restrict to a leaf element.
        var availability = Parser.Text(doc.DocumentNode.SelectSingleNode(
            "//p[contains(., 'In stock') or contains(., 'Out of stock')]"));

        return new ItemRow(url, title, price, availability);
    }
}

This code is intentionally “pattern-based”: you must customize the XPath filters for your target site.

The important production lessons are:

  • keep network code isolated
  • retry on transient failures
  • dedupe URLs
  • crawl list pages → item pages

Step 5: Export to CSV

Add Csv.cs:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class Csv
{
    public static void Write(string path, IEnumerable<ItemRow> rows)
    {
        using var sw = new StreamWriter(path);
        sw.WriteLine("url,title,price,availability");

        foreach (var r in rows)
        {
            var line = string.Join(",",
                Escape(r.Url),
                Escape(r.Title),
                Escape(r.Price),
                Escape(r.Availability)
            );
            sw.WriteLine(line);
        }

        static string Escape(string? s)
        {
            s ??= "";
            s = s.Replace("\"", "\"\"");
            return $"\"{s}\"";
        }
    }
}

And in Program.cs:

using System;
using System.Threading;

var proxy = ProxyConfig.FromEnv();
var http = Http.CreateClient(proxy);

var scraper = new Scraper(http);

var startUrl = "https://example.com/catalog";

using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));
var rows = await scraper.CrawlAsync(startUrl, cts.Token);

Csv.Write("output.csv", rows);
Console.WriteLine($"Wrote output.csv ({rows.Count} rows)");

Practical anti-blocking basics (2026 edition)

To reduce failures without doing anything sketchy:

  • Use timeouts and retries
  • Keep request rate reasonable (add delay)
  • Reuse a single HttpClient
  • Rotate IPs only when necessary (ProxiesAPI)
  • Cache pages during development
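The “cache pages during development” point is worth automating, since re-running a crawl while tweaking selectors is the most common way to hammer a site by accident. A minimal disk-cache sketch; the `cache` directory name and the SHA-256 key scheme are arbitrary choices here, not a library feature:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

public static class PageCache
{
    private const string CacheDir = "cache";

    // Return cached HTML if we've fetched this URL before; otherwise
    // fetch it once and store the body on disk. Delete the folder to refresh.
    public static async Task<string> GetAsync(HttpClient http, string url)
    {
        Directory.CreateDirectory(CacheDir);
        var key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(url)));
        var path = Path.Combine(CacheDir, key + ".html");

        if (File.Exists(path))
            return await File.ReadAllTextAsync(path);

        var html = await http.GetStringAsync(url);
        await File.WriteAllTextAsync(path, html);
        return html;
    }
}
```

Swap `GetHtmlAsync` for `PageCache.GetAsync` while you iterate on selectors, then remove it for production runs.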

If a site requires JavaScript to render data, switch strategies:

  • scrape the underlying JSON endpoints
  • or use Playwright

FAQ

Is HtmlAgilityPack still good in 2026?

Yes. It’s not trendy, but it’s stable and widely used.

Can I use ProxiesAPI with C#?

Yes—most proxy services work with C# by configuring HttpClientHandler.Proxy. Keep your proxy logic in one place so the rest of your scraper stays unchanged.


Next upgrades

  • add structured logging (Serilog)
  • add concurrency with a bounded queue
  • store results in SQLite
  • write a “resume” mechanism (so long crawls can restart)
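For the bounded-concurrency upgrade, one common pattern is a SemaphoreSlim gate around the fetch. A sketch under the assumption that your fetch function is independent per URL (the class and method names here are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedCrawl
{
    // Run fetch over all URLs, but allow at most maxConcurrent in flight at once.
    public static async Task<List<TResult>> RunAsync<TResult>(
        IEnumerable<string> urls,
        Func<string, Task<TResult>> fetch,
        int maxConcurrent = 4)
    {
        using var gate = new SemaphoreSlim(maxConcurrent);
        var tasks = urls.Select(async url =>
        {
            await gate.WaitAsync();
            try { return await fetch(url); }
            finally { gate.Release(); }
        });
        // Task.WhenAll preserves input order in its result array.
        return (await Task.WhenAll(tasks)).ToList();
    }
}
```

Keep `maxConcurrent` low (2–4) against a single host; the point is throughput without bursts.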