Web Scraping with C# and HtmlAgilityPack: A Practical 2026 Tutorial
If you’re searching for “web scraping with C#”, you usually want one of two things:
- a real example that works end-to-end (not a fragment)
- a way to keep it reliable when you’re scraping more than a couple pages
This guide is a practical 2026 tutorial on building a C# scraper using:
- HttpClient for requests
- HtmlAgilityPack for parsing HTML
- pagination crawling
- exporting data to CSV and JSON
- basic reliability patterns (timeouts, retries, respectful delays)
We’ll scrape a simple, static target so you can focus on the fundamentals.
Example target used here: https://quotes.toscrape.com/ (a public demo site for scraping practice)
C# is excellent for reliable scrapers — but at scale you still hit throttling, geo-variance, and intermittent blocks. ProxiesAPI helps keep your network layer stable so your parsers see consistent HTML.
1) When C# is a great choice for web scraping
C#/.NET is underrated for scraping. It gives you:
- a fast, strongly-typed language
- excellent HTTP tooling (HttpClient)
- great JSON support (System.Text.Json)
- easy concurrency (Tasks)
- good packaging/deployment options (containers, Windows services, etc.)
The tradeoff: you have to be a bit more explicit than in Python (types, null handling, and async signatures are all spelled out).
2) Project setup (dotnet + HtmlAgilityPack)
Create a new console app:
dotnet new console -n QuoteScraper
cd QuoteScraper
Add HtmlAgilityPack:
dotnet add package HtmlAgilityPack
3) Fetching HTML with HttpClient (timeouts + headers)
Most sites will treat a default client differently from a browser. At minimum:
- set a reasonable timeout
- set a User-Agent
- handle non-200 responses
Create a file HttpFetch.cs:
using System;
using System.Net.Http;
using System.Threading.Tasks;
public static class HttpFetch
{
private static readonly HttpClient _http = new HttpClient(new HttpClientHandler
{
AutomaticDecompression = System.Net.DecompressionMethods.GZip |
System.Net.DecompressionMethods.Deflate |
System.Net.DecompressionMethods.Brotli
})
{
Timeout = TimeSpan.FromSeconds(30)
};
static HttpFetch()
{
_http.DefaultRequestHeaders.UserAgent.ParseAdd(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
"(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36");
_http.DefaultRequestHeaders.Accept.ParseAdd(
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
_http.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en-US,en;q=0.9");
}
public static async Task<string> GetStringAsync(string url)
{
using var resp = await _http.GetAsync(url);
if (!resp.IsSuccessStatusCode)
{
var msg = $"HTTP {(int)resp.StatusCode} {resp.ReasonPhrase} for {url}";
throw new HttpRequestException(msg);
}
return await resp.Content.ReadAsStringAsync();
}
}
4) Parsing HTML with HtmlAgilityPack
HtmlAgilityPack gives you an HTML DOM plus XPath queries.
We’ll scrape:
- quote text
- author
- tags
Each quote block on quotes.toscrape.com looks like:
div.quote
  span.text
  small.author
  div.tags a.tag
Create a file Quote.cs:
using System.Collections.Generic;
public record Quote(string Text, string Author, List<string> Tags);
Create a parser QuoteParser.cs:
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
public static class QuoteParser
{
public static List<Quote> ParseQuotes(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var outList = new List<Quote>();
var quoteNodes = doc.DocumentNode.SelectNodes("//div[contains(@class,'quote')]")
?? new HtmlNodeCollection(null);
foreach (var q in quoteNodes)
{
var text = q.SelectSingleNode(".//span[@class='text']")?.InnerText?.Trim();
var author = q.SelectSingleNode(".//small[@class='author']")?.InnerText?.Trim();
var tags = q.SelectNodes(".//div[@class='tags']//a[contains(@class,'tag')]")
?.Select(n => n.InnerText.Trim())
.Where(s => !string.IsNullOrWhiteSpace(s))
.ToList()
?? new List<string>();
if (!string.IsNullOrWhiteSpace(text) && !string.IsNullOrWhiteSpace(author))
outList.Add(new Quote(text!, author!, tags));
}
return outList;
}
}
5) Pagination: crawling multiple pages safely
Most scraping jobs become “crawl list pages → follow links → scrape details”.
For our demo site:
- page 1: https://quotes.toscrape.com/
- page 2: https://quotes.toscrape.com/page/2/
We’ll crawl until there’s no “Next” link.
Create Crawler.cs:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using HtmlAgilityPack;
public static class Crawler
{
public static async Task<List<Quote>> CrawlAllQuotesAsync()
{
var results = new List<Quote>();
var pageUrl = "https://quotes.toscrape.com/";
while (true)
{
var html = await HttpFetch.GetStringAsync(pageUrl);
results.AddRange(QuoteParser.ParseQuotes(html));
// find Next page link
var doc = new HtmlDocument();
doc.LoadHtml(html);
var next = doc.DocumentNode.SelectSingleNode("//li[@class='next']/a");
if (next == null) break;
var href = next.GetAttributeValue("href", null);
if (string.IsNullOrWhiteSpace(href)) break;
pageUrl = new Uri(new Uri(pageUrl), href).ToString();
// be polite
await Task.Delay(600);
}
return results;
}
}
6) Export to CSV and JSON
Create Export.cs:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;
public static class Export
{
public static void ToJson(string path, List<Quote> quotes)
{
var json = JsonSerializer.Serialize(quotes, new JsonSerializerOptions
{
WriteIndented = true
});
File.WriteAllText(path, json, Encoding.UTF8);
}
public static void ToCsv(string path, List<Quote> quotes)
{
var sb = new StringBuilder();
sb.AppendLine("text,author,tags");
foreach (var q in quotes)
{
var tags = string.Join("|", q.Tags.Select(Escape));
sb.AppendLine($"{Escape(q.Text)},{Escape(q.Author)},{tags}");
}
File.WriteAllText(path, sb.ToString(), Encoding.UTF8);
}
private static string Escape(string s)
{
if (s == null) return "";
var needs = s.Contains(",") || s.Contains("\"") || s.Contains("\n");
var t = s.Replace("\"", "\"\"");
return needs ? $"\"{t}\"" : t;
}
}
And wire it up in Program.cs:
using System;
using System.Threading.Tasks;
public class Program
{
public static async Task Main()
{
var quotes = await Crawler.CrawlAllQuotesAsync();
Console.WriteLine($"quotes: {quotes.Count}");
Export.ToJson("quotes.json", quotes);
Export.ToCsv("quotes.csv", quotes);
Console.WriteLine("wrote quotes.json and quotes.csv");
}
}
Run it:
dotnet run
You should see output like:
quotes: 100
wrote quotes.json and quotes.csv
7) Reliability upgrades you’ll want in real scrapers
Once you move beyond a demo site, add these patterns:
- Retry policy for transient errors (429/5xx)
- Rate limiting (don’t exceed a certain QPS; a small sketch follows this list)
- Caching (don’t refetch pages you already processed)
- Robust selectors (multiple fallbacks; don’t assume one XPath always works)
- Block detection (CAPTCHA pages, “unusual traffic”, login walls)
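For the rate-limiting item, one dependency-free approach is to funnel every request through a shared gate that enforces a minimum interval between requests, even when several crawl tasks run in parallel. Here is a minimal sketch; the RateGate class name and the 600 ms pace are illustrative choices, not part of the files above:
using System;
using System.Threading;
using System.Threading.Tasks;
// Illustrative helper: lets one caller through at a time and enforces a minimum
// gap between requests, shared across all tasks in the process.
public static class RateGate
{
    private static readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private static readonly TimeSpan _minInterval = TimeSpan.FromMilliseconds(600);
    private static DateTime _lastRequest = DateTime.MinValue;
    public static async Task WaitAsync()
    {
        await _gate.WaitAsync();
        try
        {
            var sinceLast = DateTime.UtcNow - _lastRequest;
            if (sinceLast < _minInterval)
                await Task.Delay(_minInterval - sinceLast);
            _lastRequest = DateTime.UtcNow;
        }
        finally
        {
            _gate.Release();
        }
    }
}
Call await RateGate.WaitAsync(); right before each HttpFetch.GetStringAsync call; with the gate in place, the fixed Task.Delay(600) in the crawler loop becomes redundant.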
In .NET, retries are often done with Polly (a widely used resilience library). If you’re avoiding dependencies, implement a small exponential backoff loop yourself.
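If you go the hand-rolled route, here is a minimal sketch of such a loop. The Resilience class and FetchWithRetryAsync name are illustrative; the method wraps the HttpFetch.GetStringAsync helper from section 3 and retries transient failures with a doubling delay:
using System;
using System.Net.Http;
using System.Threading.Tasks;
public static class Resilience
{
    // Illustrative retry wrapper: exponential backoff for transient failures.
    public static async Task<string> FetchWithRetryAsync(string url, int maxAttempts = 4)
    {
        var delay = TimeSpan.FromSeconds(1);
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await HttpFetch.GetStringAsync(url);
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                // Note: GetStringAsync throws for any non-2xx status, so a real scraper
                // should propagate the status code and skip retries for permanent
                // errors like 404, instead of treating every failure as transient.
                if (attempt >= maxAttempts) throw;
                Console.WriteLine($"attempt {attempt}/{maxAttempts} failed for {url}: {ex.Message}");
                await Task.Delay(delay);
                delay += delay; // double the wait before the next attempt
            }
        }
    }
}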
8) Where ProxiesAPI fits
C# is excellent for building reliable scrapers, but at scale you still hit:
- IP-based throttling
- geo-dependent pages
- intermittent 403/429
- different markup when the site suspects automation
A proxy-backed fetch layer can help.
A simple pattern is to keep your application the same, but route requests through a proxy/API at the HTTP layer.
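For illustration, HttpClientHandler can route all traffic through a proxy endpoint. The host, port, and credentials below are placeholders, not real ProxiesAPI values; substitute whatever endpoint and credentials your provider gives you. This sketch is written as a standalone top-level-statements Program.cs you can drop into a scratch console app:
using System;
using System.Net;
using System.Net.Http;
// Sketch: same HttpClient usage as before, but every request goes through a proxy.
// Host, port, and credentials are placeholders for your provider's values.
var handler = new HttpClientHandler
{
    Proxy = new WebProxy("http://proxy.example.com:8080")
    {
        Credentials = new NetworkCredential("USERNAME", "PASSWORD")
    },
    UseProxy = true,
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
};
var http = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(30) };
var html = await http.GetStringAsync("https://quotes.toscrape.com/");
Console.WriteLine($"fetched {html.Length} characters through the proxy");
Because the proxy lives entirely in the handler, the parsing, pagination, and export code from earlier sections stays unchanged.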
9) Checklist: production-ready “web scraping with C#”
- HttpClient has timeouts and decompression
- You send a realistic User-Agent
- Parsers handle missing nodes without crashing
- Pagination stops correctly (no infinite loops)
- You export clean, escaped CSV
- You have retries/backoff and a throttle delay
Finally, match the architecture to the target site: if pages are server-rendered, HttpClient plus HtmlAgilityPack (as shown here) is usually enough; if they are JS-heavy, reach for a headless browser such as Playwright for .NET and keep the same parsing and export layers.