Web Scraping with C# and HtmlAgilityPack: A Practical 2026 Tutorial

If you’re searching for “web scraping with C#”, you usually want one of two things:

  1. a real example that works end-to-end (not a fragment)
  2. a way to keep it reliable when you’re scraping more than a couple of pages

This guide is a practical 2026 tutorial on building a C# scraper using:

  • HttpClient for requests
  • HtmlAgilityPack for parsing HTML
  • pagination crawling
  • exporting data to CSV and JSON
  • basic reliability patterns (timeouts, retries, respectful delays)

We’ll scrape a simple, static target so you can focus on the fundamentals.

Example target used here: https://quotes.toscrape.com/ (a public demo site for scraping practice)

When your C# scraper scales, stabilize fetches with ProxiesAPI

C# is excellent for reliable scrapers — but at scale you still hit throttling, geo-variance, and intermittent blocks. ProxiesAPI helps keep your network layer stable so your parsers see consistent HTML.


1) When C# is a great choice for web scraping

C#/.NET is underrated for scraping. It gives you:

  • a fast, strongly-typed language
  • excellent HTTP tooling (HttpClient)
  • great JSON support (System.Text.Json)
  • easy concurrency (Tasks)
  • good packaging/deployment options (containers, Windows services, etc.)

The tradeoff: you have to be slightly more explicit than in Python in a few places.


2) Project setup (dotnet + HtmlAgilityPack)

Create a new console app:

dotnet new console -n QuoteScraper
cd QuoteScraper

Add HtmlAgilityPack:

dotnet add package HtmlAgilityPack

3) Fetching HTML with HttpClient (timeouts + headers)

Many sites treat a default HttpClient differently from a real browser. At minimum:

  • set a reasonable timeout
  • set a User-Agent
  • handle non-200 responses

Create a file HttpFetch.cs:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HttpFetch
{
    private static readonly HttpClient _http = new HttpClient(new HttpClientHandler
    {
        AutomaticDecompression = System.Net.DecompressionMethods.GZip |
                                 System.Net.DecompressionMethods.Deflate |
                                 System.Net.DecompressionMethods.Brotli
    })
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    static HttpFetch()
    {
        _http.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
            "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36");
        _http.DefaultRequestHeaders.Accept.ParseAdd(
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        _http.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en-US,en;q=0.9");
    }

    public static async Task<string> GetStringAsync(string url)
    {
        using var resp = await _http.GetAsync(url);
        if (!resp.IsSuccessStatusCode)
        {
            var msg = $"HTTP {(int)resp.StatusCode} {resp.ReasonPhrase} for {url}";
            throw new HttpRequestException(msg);
        }

        return await resp.Content.ReadAsStringAsync();
    }
}

4) Parsing HTML with HtmlAgilityPack

HtmlAgilityPack gives you an HTML DOM plus XPath queries.

We’ll scrape:

  • quote text
  • author
  • tags

Each quote block on quotes.toscrape.com looks like:

  • div.quote
    • span.text
    • small.author
    • div.tags a.tag

Create a file Quote.cs:

using System.Collections.Generic;

public record Quote(string Text, string Author, List<string> Tags);

Create a parser QuoteParser.cs:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class QuoteParser
{
    public static List<Quote> ParseQuotes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var outList = new List<Quote>();
        var quoteNodes = doc.DocumentNode.SelectNodes("//div[contains(@class,'quote')]")
                         ?? new HtmlNodeCollection(null);

        foreach (var q in quoteNodes)
        {
            // InnerText does not decode HTML entities; DeEntitize turns
            // e.g. the encoded curly quotes on this site into real characters
            var text = HtmlEntity.DeEntitize(
                q.SelectSingleNode(".//span[@class='text']")?.InnerText ?? "").Trim();
            var author = HtmlEntity.DeEntitize(
                q.SelectSingleNode(".//small[@class='author']")?.InnerText ?? "").Trim();

            var tags = q.SelectNodes(".//div[@class='tags']//a[contains(@class,'tag')]")
                        ?.Select(n => n.InnerText.Trim())
                        .Where(s => !string.IsNullOrWhiteSpace(s))
                        .ToList()
                        ?? new List<string>();

            if (!string.IsNullOrWhiteSpace(text) && !string.IsNullOrWhiteSpace(author))
                outList.Add(new Quote(text!, author!, tags));
        }

        return outList;
    }
}

5) Pagination: crawling multiple pages safely

Most scraping jobs become “crawl list pages → follow links → scrape details”.

For our demo site:

  • page 1: https://quotes.toscrape.com/
  • page 2: https://quotes.toscrape.com/page/2/

We’ll crawl until there’s no “Next” link.

Create Crawler.cs:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class Crawler
{
    public static async Task<List<Quote>> CrawlAllQuotesAsync()
    {
        var results = new List<Quote>();

        var pageUrl = "https://quotes.toscrape.com/";

        while (true)
        {
            var html = await HttpFetch.GetStringAsync(pageUrl);
            results.AddRange(QuoteParser.ParseQuotes(html));

            // find Next page link
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var next = doc.DocumentNode.SelectSingleNode("//li[@class='next']/a");
            if (next == null) break;

            var href = next.GetAttributeValue("href", null);
            if (string.IsNullOrWhiteSpace(href)) break;

            pageUrl = new Uri(new Uri(pageUrl), href).ToString();

            // be polite
            await Task.Delay(600);
        }

        return results;
    }
}

6) Export to CSV and JSON

Create Export.cs:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;

public static class Export
{
    public static void ToJson(string path, List<Quote> quotes)
    {
        var json = JsonSerializer.Serialize(quotes, new JsonSerializerOptions
        {
            WriteIndented = true
        });

        File.WriteAllText(path, json, Encoding.UTF8);
    }

    public static void ToCsv(string path, List<Quote> quotes)
    {
        var sb = new StringBuilder();
        sb.AppendLine("text,author,tags");

        foreach (var q in quotes)
        {
            // escape the joined tag string once, as a single CSV field —
            // escaping each tag individually would put quotes mid-field
            var tags = Escape(string.Join("|", q.Tags));
            sb.AppendLine($"{Escape(q.Text)},{Escape(q.Author)},{tags}");
        }

        File.WriteAllText(path, sb.ToString(), Encoding.UTF8);
    }

    private static string Escape(string s)
    {
        if (s == null) return "";
        var needs = s.Contains(',') || s.Contains('"') || s.Contains('\n') || s.Contains('\r');
        var t = s.Replace("\"", "\"\"");
        return needs ? $"\"{t}\"" : t;
    }
}

And wire it up in Program.cs:

using System;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main()
    {
        var quotes = await Crawler.CrawlAllQuotesAsync();

        Console.WriteLine($"quotes: {quotes.Count}");
        Export.ToJson("quotes.json", quotes);
        Export.ToCsv("quotes.csv", quotes);

        Console.WriteLine("wrote quotes.json and quotes.csv");
    }
}

Run it:

dotnet run

You should see output like:

quotes: 100
wrote quotes.json and quotes.csv

7) Reliability upgrades you’ll want in real scrapers

Once you move beyond a demo site, add these patterns:

  1. Retry policy for transient errors (429/5xx)
  2. Rate limiting (don’t exceed a certain QPS)
  3. Caching (don’t refetch pages you already processed)
  4. Robust selectors (multiple fallbacks; don’t assume one XPath always works)
  5. Block detection (CAPTCHA pages, “unusual traffic”, login walls)

In .NET, retries are often done with Polly (a widely used resilience library). If you’re avoiding dependencies, implement a small exponential backoff loop yourself.
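If you’d rather not take a dependency, a backoff loop can be as small as the sketch below. It wraps the `HttpFetch.GetStringAsync` helper from section 3; the name `FetchWithRetryAsync` and the attempt/jitter numbers are illustrative choices, and a real version should distinguish retryable statuses (429/5xx) from permanent ones (404) rather than retrying every `HttpRequestException`:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class Retry
{
    // Minimal exponential backoff: retry transient failures with growing delays.
    public static async Task<string> FetchWithRetryAsync(string url, int maxAttempts = 4)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await HttpFetch.GetStringAsync(url);
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // 1s, 2s, 4s, ... plus jitter so parallel workers don't retry in lockstep
                var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt - 1))
                          + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 250));
                await Task.Delay(delay);
            }
        }
    }
}
```

On the final attempt the `when` filter no longer matches, so the exception propagates to the caller instead of being swallowed.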


8) Where ProxiesAPI fits

C# is excellent for building reliable scrapers, but at scale you still hit:

  • IP-based throttling
  • geo-dependent pages
  • intermittent 403/429
  • different markup when the site suspects automation

A proxy-backed fetch layer can help.

A simple pattern is to keep your application the same, but route requests through a proxy/API at the HTTP layer.
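At the plain-proxy level, .NET supports this natively via HttpClientHandler, so the rest of your code is unchanged. The host, port, and credentials below are placeholders for whatever your provider gives you (an API-style service would instead have you rewrite the request URL per its own documentation):

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class ProxiedFetch
{
    // Build an HttpClient that routes every request through an HTTP proxy.
    public static HttpClient Create()
    {
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy("http://proxy.example.com:8080")
            {
                Credentials = new NetworkCredential("user", "pass")
            },
            UseProxy = true
        };
        return new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(30) };
    }
}
```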


9) Checklist: production-ready “web scraping with C#”

  • HttpClient has timeouts and decompression
  • You send a realistic User-Agent
  • Parsers handle missing nodes without crashing
  • Pagination stops correctly (no infinite loops)
  • You export clean, escaped CSV
  • You have retries/backoff and a throttle delay

Before you start, check whether your target site is server-rendered or JS-heavy — that determines the right C# architecture: HtmlAgilityPack for static HTML, Playwright for pages that render in the browser.
