Web Scraping with C# and HtmlAgilityPack: A Practical 2026 Tutorial
If you’re a .NET developer, C# is an excellent language for web scraping:
- HttpClient is fast and battle-tested
- you get great tooling (debuggers, LINQ, structured logging)
- parsers like HtmlAgilityPack make HTML extraction straightforward
This tutorial is intentionally practical. We’ll build a scraper that:
- fetches a list page
- parses items
- follows pagination
- visits detail pages
- exports clean data to CSV
And we’ll do it in a way you can ship:
- request timeouts
- retries
- respectful crawl delays
- optional proxy support (ProxiesAPI-ready)
When HtmlAgilityPack is the right tool (and when it isn’t)
HtmlAgilityPack (HAP) is ideal when:
- the site is mostly server-rendered HTML
- you can see the data in “View Source”
- you only need GET requests and HTML parsing
It’s not ideal when:
- content is rendered client-side (React/Vue) and not present in HTML
- data loads via XHR calls that require auth tokens
- the site uses heavy bot detection that serves challenges
In those cases you may need:
- to scrape the JSON APIs the site uses internally
- or a real browser automation tool (Playwright)
But don’t start with a headless browser by default. Start with HTTP + HTML.
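A quick way to decide is to check whether the data you want is already present in the raw HTML that HttpClient returns, before any JavaScript runs. A minimal sketch of that check (the HTML fragments and marker text are placeholder examples; in practice you'd pass the result of client.GetStringAsync(url)):

```csharp
using System;

// Heuristic: is the target text already in the raw, server-rendered HTML?
// If it only appears after JavaScript runs, HtmlAgilityPack alone won't see it.
static bool LooksServerRendered(string rawHtml, string marker) =>
    rawHtml.Contains(marker, StringComparison.OrdinalIgnoreCase);

Console.WriteLine(LooksServerRendered("<h1>Widget 3000</h1>", "widget 3000"));  // True
Console.WriteLine(LooksServerRendered("<div id=\"app\"></div>", "widget 3000")); // False
```

This mirrors the "View Source" test above: if the marker is missing from the raw HTML but visible in the browser, the page is client-rendered and you need one of the alternatives listed.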
Project setup (.NET 8 console app)
dotnet new console -n ScrapeDemo
cd ScrapeDemo
dotnet add package HtmlAgilityPack
Optional but recommended:
dotnet add package Polly
We’ll use Polly for retries.
Step 1: Build a solid HTTP layer (timeouts + headers)
Create Http.cs:
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
public static class Http
{
public static HttpClient CreateClient(IWebProxy? proxy = null)
{
var handler = new HttpClientHandler
{
AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
Proxy = proxy,
UseProxy = proxy != null
};
var client = new HttpClient(handler)
{
Timeout = TimeSpan.FromSeconds(30)
};
client.DefaultRequestHeaders.TryAddWithoutValidation(
"User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
);
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml");
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");
return client;
}
}
Proxy / ProxiesAPI hook
If your ProxiesAPI account exposes a proxy endpoint, you can route requests through it by setting an HTTP proxy.
using System;
using System.Net;
public static class ProxyConfig
{
public static IWebProxy? FromEnv()
{
    // Example format: http://user:pass@host:port
    var proxyUrl = Environment.GetEnvironmentVariable("PROXIESAPI_PROXY_URL");
    if (string.IsNullOrWhiteSpace(proxyUrl)) return null;
    var uri = new Uri(proxyUrl);
    var proxy = new WebProxy(uri);
    // WebProxy does not pick up user:pass from the URI automatically,
    // so set credentials explicitly when they are present.
    if (!string.IsNullOrEmpty(uri.UserInfo))
    {
        var parts = uri.UserInfo.Split(':', 2);
        proxy.Credentials = new NetworkCredential(
            Uri.UnescapeDataString(parts[0]),
            parts.Length > 1 ? Uri.UnescapeDataString(parts[1]) : "");
    }
    return proxy;
}
}
Then:
var proxy = ProxyConfig.FromEnv();
var http = Http.CreateClient(proxy);
If you don’t set PROXIESAPI_PROXY_URL, the scraper runs directly.
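Before a long crawl, it's worth a one-off sanity check that the handler is actually configured the way you expect. A self-contained sketch (the IP-echo idea is an assumption; any endpoint that reports the caller's IP, such as https://httpbin.org/ip, would confirm the proxy's IP appears instead of yours):

```csharp
using System;
using System.Net;
using System.Net.Http;

// Build a handler from PROXIESAPI_PROXY_URL (format: http://user:pass@host:port).
// With the variable unset, requests go direct; set it and re-run to route
// through the proxy.
var proxyUrl = Environment.GetEnvironmentVariable("PROXIESAPI_PROXY_URL");
var handler = new HttpClientHandler
{
    Proxy = string.IsNullOrWhiteSpace(proxyUrl) ? null : new WebProxy(new Uri(proxyUrl)),
};
handler.UseProxy = handler.Proxy != null;
using var http = new HttpClient(handler);
Console.WriteLine($"Proxy enabled: {handler.UseProxy}");
```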
Step 2: Pick a demo target you’re allowed to scrape
Use a site with:
- static HTML list pages
- clear pagination
- stable markup
For example:
- a documentation directory
- a public catalog
- a blog archive
In code, we’ll write the scraper to be generic:
- it starts at a StartUrl
- it parses item links with XPath queries
- it follows a Next link until none remains
Step 3: Parse HTML with HtmlAgilityPack
Create Parser.cs:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net; // for WebUtility.HtmlDecode
public static class Parser
{
public static HtmlDocument Load(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
return doc;
}
public static string? Text(HtmlNode? node)
{
if (node == null) return null;
var t = node.InnerText?.Trim();
return string.IsNullOrWhiteSpace(t) ? null : WebUtility.HtmlDecode(t);
}
public static string? Attr(HtmlNode? node, string attr)
{
return node?.GetAttributeValue(attr, null);
}
}
Now you can do:
var doc = Parser.Load(html);
var title = Parser.Text(doc.DocumentNode.SelectSingleNode("//h1"));
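The same pattern extends to multiple nodes with SelectNodes. Here is the list-page pattern in miniature, run against an inline HTML fragment (the fragment itself is made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

// SelectNodes + GetAttributeValue: the core moves the crawler makes
// on every real list page.
var doc = new HtmlDocument();
doc.LoadHtml("<ul><li><a href='/item/1'>Widget</a></li><li><a href='/item/2'>Gadget</a></li></ul>");
foreach (var a in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    Console.WriteLine($"{a.InnerText.Trim()} -> {a.GetAttributeValue("href", "")}");
}
// Widget -> /item/1
// Gadget -> /item/2
```

Note that SelectNodes returns null (not an empty collection) when nothing matches, which is why the crawler below guards it with `?.`.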
Step 4: A real scraping loop (pagination + details)
We’ll scrape:
- list pages → collect item URLs
- item pages → extract fields
Create Models.cs:
public record ItemRow(
string Url,
string? Title,
string? Price,
string? Availability
);
Create Scraper.cs:
using HtmlAgilityPack;
using Polly;
using Polly.Retry;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
public class Scraper
{
private readonly HttpClient _http;
private readonly AsyncRetryPolicy<HttpResponseMessage> _retry;
public Scraper(HttpClient http)
{
_http = http;
_retry = Policy
.HandleResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500 || r.StatusCode == HttpStatusCode.TooManyRequests)
.Or<HttpRequestException>()
.WaitAndRetryAsync(
retryCount: 4,
sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Min(12, Math.Pow(2, attempt)))
);
}
public async Task<string> GetHtmlAsync(string url, CancellationToken ct)
{
var response = await _retry.ExecuteAsync(() => _http.GetAsync(url, ct));
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync(ct);
}
public async Task<List<ItemRow>> CrawlAsync(string startUrl, CancellationToken ct)
{
var rows = new List<ItemRow>();
var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
string? pageUrl = startUrl;
while (!string.IsNullOrWhiteSpace(pageUrl))
{
Console.WriteLine($"LIST {pageUrl}");
var html = await GetHtmlAsync(pageUrl, ct);
var doc = Parser.Load(html);
// Customize these selectors for your target site.
// Example: list item links
var links = doc.DocumentNode.SelectNodes("//a[@href]")
?.Select(n => n.GetAttributeValue("href", null))
?.Where(h => !string.IsNullOrWhiteSpace(h))
?.Distinct()
?.ToList() ?? new List<string>();
// Keep only links that look like item pages.
// Replace this with your own URL filter.
var itemLinks = links.Where(h => h.Contains("/item/", StringComparison.OrdinalIgnoreCase)).ToList();
foreach (var href in itemLinks)
{
var abs = new Uri(new Uri(pageUrl), href).ToString();
if (!seen.Add(abs)) continue;
var item = await ScrapeItemAsync(abs, ct);
rows.Add(item);
// Respectful delay
await Task.Delay(TimeSpan.FromMilliseconds(400), ct);
}
// Pagination: find “next” link (customize)
var nextNode = doc.DocumentNode.SelectSingleNode("//a[contains(translate(normalize-space(.), 'NEXT', 'next'), 'next')]");
var nextHref = nextNode?.GetAttributeValue("href", null);
pageUrl = string.IsNullOrWhiteSpace(nextHref) ? null : new Uri(new Uri(pageUrl), nextHref).ToString();
await Task.Delay(TimeSpan.FromSeconds(1), ct);
}
return rows;
}
private async Task<ItemRow> ScrapeItemAsync(string url, CancellationToken ct)
{
Console.WriteLine($"ITEM {url}");
var html = await GetHtmlAsync(url, ct);
var doc = Parser.Load(html);
// Customize these selectors for your target item page
var title = Parser.Text(doc.DocumentNode.SelectSingleNode("//h1"));
var price = Parser.Text(doc.DocumentNode.SelectSingleNode("//*[contains(@class,'price')]"));
// Use text() rather than '.' so we match the element whose own text mentions
// stock status, not an ancestor (like <body>) that contains the whole page.
var availability = Parser.Text(doc.DocumentNode.SelectSingleNode("//*[contains(text(), 'In stock') or contains(text(), 'Out of stock')]"));
return new ItemRow(url, title, price, availability);
}
}
This code is intentionally “pattern-based”: you must customize the XPath filters for your target site.
The important production lessons are:
- keep network code isolated
- retry on transient failures
- dedupe URLs
- crawl list pages → item pages
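The dedupe and URL-resolution step is worth seeing in isolation. A small sketch of what the crawl loop does with its HashSet (the base URL and hrefs are made-up examples):

```csharp
using System;
using System.Collections.Generic;

// Resolve relative hrefs against the page URL, then dedupe.
// The case-insensitive comparer matches the crawler's HashSet.
var baseUri = new Uri("https://example.com/catalog?page=2");
var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
var hrefs = new[] { "/item/1", "item/2", "/ITEM/1", "https://example.com/item/3" };
foreach (var href in hrefs)
{
    var abs = new Uri(baseUri, href).ToString();
    if (seen.Add(abs)) Console.WriteLine(abs);
}
// https://example.com/item/1
// https://example.com/item/2
// https://example.com/item/3
```

The Uri(baseUri, href) constructor handles absolute, root-relative, and page-relative hrefs uniformly, which is why the crawler never string-concatenates URLs.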
Step 5: Export to CSV
Add Csv.cs:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
public static class Csv
{
public static void Write(string path, IEnumerable<ItemRow> rows)
{
using var sw = new StreamWriter(path);
sw.WriteLine("url,title,price,availability");
foreach (var r in rows)
{
var line = string.Join(",",
Escape(r.Url),
Escape(r.Title),
Escape(r.Price),
Escape(r.Availability)
);
sw.WriteLine(line);
}
static string Escape(string? s)
{
s ??= "";
s = s.Replace("\"", "\"\"");
return $"\"{s}\"";
}
}
}
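A quick check of the quoting rule used by Csv.Write: every field is wrapped in double quotes and embedded quotes are doubled, so commas, quotes, and even newlines in scraped text survive the round trip. The sample strings are illustrative:

```csharp
using System;

// Same escaping rule as Csv.Write, shown on values that would
// otherwise break a naive comma-join.
static string Escape(string? s)
{
    s ??= "";
    return $"\"{s.Replace("\"", "\"\"")}\"";
}

var line = string.Join(",", Escape("Widget, large"), Escape("the \"best\" one"));
Console.WriteLine(line); // "Widget, large","the ""best"" one"
```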
And in Program.cs:
using System;
using System.Threading;
var proxy = ProxyConfig.FromEnv();
var http = Http.CreateClient(proxy);
var scraper = new Scraper(http);
var startUrl = "https://example.com/catalog";
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));
var rows = await scraper.CrawlAsync(startUrl, cts.Token);
Csv.Write("output.csv", rows);
Console.WriteLine($"Wrote output.csv ({rows.Count} rows)");
Practical anti-blocking basics (2026 edition)
To reduce failures without doing anything sketchy:
- Use timeouts and retries
- Keep request rate reasonable (add delay)
- Reuse a single HttpClient
- Rotate IPs only when necessary (ProxiesAPI)
- Cache pages during development
If a site requires JavaScript to render data, switch strategies:
- scrape the underlying JSON endpoints
- or use Playwright
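Scraping a JSON endpoint is often simpler than parsing rendered HTML: find the request in the browser's Network tab, then deserialize the response with System.Text.Json. A sketch with a made-up payload shape (real endpoints differ per site; in practice the string would come from `await http.GetStringAsync(endpoint)`):

```csharp
using System;
using System.Text.Json;

// Parse a hypothetical XHR payload instead of scraping the HTML it feeds.
var json = """{"items":[{"title":"Widget","price":"9.99"}]}""";
using var doc = JsonDocument.Parse(json);
foreach (var item in doc.RootElement.GetProperty("items").EnumerateArray())
{
    Console.WriteLine($"{item.GetProperty("title").GetString()} @ {item.GetProperty("price").GetString()}");
}
// Widget @ 9.99
```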
FAQ
Is HtmlAgilityPack still good in 2026?
Yes. It’s not trendy, but it’s stable and widely used.
Can I use ProxiesAPI with C#?
Yes—most proxy services work with C# by configuring HttpClientHandler.Proxy. Keep your proxy logic in one place so the rest of your scraper stays unchanged.
Next upgrades
- add structured logging (Serilog)
- add concurrency with a bounded queue
- store results in SQLite
- write a “resume” mechanism (so long crawls can restart)
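For the concurrency upgrade, SemaphoreSlim is the simplest way to bound how many pages are in flight at once without pulling in a queue library. A minimal sketch, with Task.Delay standing in for a real GetHtmlAsync call:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Bounded concurrency: at most 3 simulated fetches run at a time.
var gate = new SemaphoreSlim(3);
var completed = 0;
var tasks = Enumerable.Range(1, 10).Select(async i =>
{
    await gate.WaitAsync();
    try
    {
        await Task.Delay(50); // stand-in for an HTTP request
        Interlocked.Increment(ref completed);
    }
    finally { gate.Release(); }
});
await Task.WhenAll(tasks);
Console.WriteLine($"completed {completed} pages");
```

Keep the limit low (2-4) to stay within the "respectful crawl" guidance above; raising it mostly raises your block rate, not your throughput.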
When you move from a handful of pages to thousands, stability becomes a networking problem. ProxiesAPI can help reduce blocks and intermittent failures so your C# scraper keeps shipping data.