Web Scraping with Rust: reqwest + scraper Crate Tutorial
Rust is an excellent language for web scraping when you care about:
- performance (fast parsing + concurrency)
- reliability (type safety and fewer runtime surprises)
- building scrapers as long-running services
The ecosystem is mature enough that you can build production crawlers with a small set of crates:
- reqwest for HTTP
- scraper for CSS-selector-based HTML parsing
- serde for JSON
In this tutorial you’ll build a real scraper that:
- fetches a list page
- parses repeated items using CSS selectors
- follows pagination
- exports clean JSON
- supports proxies (including ProxiesAPI) via environment variables
Once your Rust scraper grows from dozens to thousands of URLs, a proxy layer can stabilize throughput. ProxiesAPI provides a consistent proxy endpoint you can plug into reqwest via standard proxy settings.
Project setup
Create a new Rust binary:
```bash
cargo new rust_scraper
cd rust_scraper
```
Add dependencies:
```bash
cargo add reqwest --features blocking,gzip,brotli,deflate,json
cargo add scraper
cargo add serde --features derive
cargo add serde_json
cargo add anyhow
cargo add url
```
Tip: This tutorial uses reqwest in blocking mode for simplicity. You can migrate to async later.
The target: a paginated list of items
To keep the example broadly applicable, we’ll scrape a generic “list page” shape:
- a page with repeated item cards
- each card has a title + link
- pagination via a “Next” link
You can adapt this to:
- blog index pages
- product category pages
- directory listings
Step 1: Build an HTTP client with timeouts
```rust
use anyhow::Result;
use reqwest::blocking::Client;
use std::time::Duration;

fn build_client() -> Result<Client> {
    let client = Client::builder()
        .timeout(Duration::from_secs(30))
        .user_agent("Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)")
        .build()?;
    Ok(client)
}
```
Proxy support (ProxiesAPI)
reqwest supports proxies through Proxy::all().
We’ll read a single proxy URL from an environment variable. If it’s set, we route all traffic through it.
```rust
use reqwest::Proxy;

fn build_client() -> Result<Client> {
    let mut builder = Client::builder()
        .timeout(Duration::from_secs(30))
        .user_agent("Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)");

    if let Ok(proxy_url) = std::env::var("PROXIESAPI_PROXY_URL") {
        // Example format: http://USER:PASS@proxy.yourprovider.com:1234
        builder = builder.proxy(Proxy::all(&proxy_url)?);
    }

    Ok(builder.build()?)
}
```
This pattern works with ProxiesAPI if you have a proxy endpoint URL.
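With the variable set, every request from the client is routed through the proxy; leave it unset to connect directly. For example, using the same placeholder format as the comment above: `PROXIESAPI_PROXY_URL=http://USER:PASS@proxy.yourprovider.com:1234 cargo run`.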
Step 2: Fetch HTML
```rust
fn fetch_html(client: &Client, url: &str) -> Result<String> {
    let resp = client.get(url).send()?;

    // Optional: retry logic can be added if you want to treat 403/429/5xx specially.
    let status = resp.status();
    if !status.is_success() {
        anyhow::bail!("HTTP {} for {}", status, url);
    }

    Ok(resp.text()?)
}
```
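The comment above mentions retries. Here is a minimal sketch of a retry wrapper with exponential backoff that treats 403, 429, and 5xx as retryable, matching the advice later in this guide; the `fetch_html_with_retries` name, the attempt count, and the delays are assumptions to tune for your target, not part of the core tutorial code:

```rust
use std::thread;

// Assumes Client, Duration, and anyhow::Result are imported as in Step 1.
fn fetch_html_with_retries(client: &Client, url: &str, max_attempts: u32) -> Result<String> {
    let mut delay = Duration::from_secs(1);

    for attempt in 1..=max_attempts {
        match client.get(url).send() {
            // Success: return the body.
            Ok(resp) if resp.status().is_success() => return Ok(resp.text()?),
            // Throttling, blocks, or server errors: log and retry.
            Ok(resp) if matches!(resp.status().as_u16(), 403 | 429) || resp.status().is_server_error() => {
                eprintln!("attempt {}: HTTP {} for {}", attempt, resp.status(), url);
            }
            // Anything else (e.g. 404): give up immediately.
            Ok(resp) => anyhow::bail!("HTTP {} for {}", resp.status(), url),
            // Network-level errors: log and retry.
            Err(e) => eprintln!("attempt {}: request error for {}: {}", attempt, url, e),
        }
        if attempt < max_attempts {
            thread::sleep(delay);
            delay *= 2; // exponential backoff
        }
    }

    anyhow::bail!("giving up on {} after {} attempts", url, max_attempts)
}
```

Swapping this in for fetch_html inside the crawl loop later is a one-line change.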
Step 3: Parse items with scraper
The scraper crate uses CSS selectors.
We’ll define a small struct for items:
```rust
use serde::Serialize;

#[derive(Debug, Serialize, Clone)]
struct Item {
    title: String,
    url: String,
}
```
Now parse from HTML:
```rust
use scraper::{Html, Selector};
use url::Url;

fn parse_items(html: &str, base_url: &str) -> Result<Vec<Item>> {
    let doc = Html::parse_document(html);

    // Update these selectors for your target site.
    // scraper's selector error type doesn't convert cleanly via `?`, so map it into anyhow.
    let card_sel = Selector::parse("article").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;
    let link_sel = Selector::parse("a").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;

    let base = Url::parse(base_url)?;
    let mut out = Vec::new();

    for card in doc.select(&card_sel) {
        if let Some(a) = card.select(&link_sel).next() {
            let title = a.text().collect::<Vec<_>>().join(" ").trim().to_string();
            if title.is_empty() {
                continue;
            }
            if let Some(href) = a.value().attr("href") {
                let joined = base.join(href)?.to_string();
                out.push(Item { title, url: joined });
            }
        }
    }

    Ok(out)
}
```
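To sanity-check the selectors as markup evolves, a small unit test with an inline HTML fixture is handy. The fixture below is invented for illustration; it simply mirrors the generic "article card with a link" shape assumed above:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_item_cards() {
        // Hypothetical fixture matching the "article card with a link" shape.
        let html = r#"
            <html><body>
              <article><a href="/posts/first">First post</a></article>
              <article><a href="/posts/second">Second post</a></article>
            </body></html>
        "#;

        let items = parse_items(html, "https://example.com/blog").unwrap();
        assert_eq!(items.len(), 2);
        assert_eq!(items[0].title, "First post");
        assert_eq!(items[0].url, "https://example.com/posts/first");
    }
}
```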
Why we keep selectors simple
For scraping at scale, you want:
- a small number of selectors
- a single place to update them
Markup changes are inevitable.
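One lightweight way to get that single place is a module of selector string constants, so a markup change is a one-line edit. The module and constant names below are illustrative, not part of the code above:

```rust
// A hypothetical selectors module: the only place selector strings live.
pub mod selectors {
    pub const ITEM_CARD: &str = "article";
    pub const ITEM_LINK: &str = "a";
    pub const NEXT_PAGE: &str = "a[rel='next']";
}
```

parse_items and find_next_page would then call Selector::parse(selectors::ITEM_CARD) and so on, instead of repeating string literals.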
Step 4: Pagination: follow the “Next” link
```rust
fn find_next_page(html: &str, base_url: &str) -> Result<Option<String>> {
    let doc = Html::parse_document(html);

    // Common pattern: <a rel="next" href="...">Next</a>
    let next_sel = Selector::parse("a[rel='next']").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;
    if let Some(a) = doc.select(&next_sel).next() {
        if let Some(href) = a.value().attr("href") {
            let base = Url::parse(base_url)?;
            return Ok(Some(base.join(href)?.to_string()));
        }
    }

    // Fallback: text match "Next" (lightweight)
    let a_sel = Selector::parse("a").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;
    for a in doc.select(&a_sel) {
        let text = a.text().collect::<Vec<_>>().join(" ").to_lowercase();
        if text.contains("next") {
            if let Some(href) = a.value().attr("href") {
                let base = Url::parse(base_url)?;
                return Ok(Some(base.join(href)?.to_string()));
            }
        }
    }

    Ok(None)
}
```
Step 5: Crawl N pages and de-duplicate
```rust
use std::collections::HashSet;

fn crawl(start_url: &str, max_pages: usize) -> Result<Vec<Item>> {
    let client = build_client()?;

    let mut url = start_url.to_string();
    let mut page = 0usize;
    let mut seen = HashSet::new();
    let mut all = Vec::new();

    while page < max_pages {
        page += 1;

        let html = fetch_html(&client, &url)?;
        let items = parse_items(&html, &url)?;

        let mut new_count = 0;
        for it in items {
            if seen.insert(it.url.clone()) {
                all.push(it);
                new_count += 1;
            }
        }

        println!("page {}: total={} new={} url={}", page, all.len(), new_count, url);

        if let Some(next) = find_next_page(&html, &url)? {
            url = next;
        } else {
            break;
        }
    }

    Ok(all)
}
```
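If you want to spread load while crawling, the simplest option is a short pause between pages inside the loop above; the 500 ms figure is an arbitrary starting point, not a recommendation from the crate:

```rust
// At the end of each iteration of the while loop in crawl():
std::thread::sleep(std::time::Duration::from_millis(500));
```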
Step 6: Export JSON
```rust
use std::fs::File;
use std::io::Write;

fn write_json(path: &str, items: &[Item]) -> Result<()> {
    let json = serde_json::to_string_pretty(items)?;
    let mut f = File::create(path)?;
    f.write_all(json.as_bytes())?;
    Ok(())
}
```
Main program:
```rust
fn main() -> Result<()> {
    let start = "https://example.com/blog"; // replace
    let items = crawl(start, 5)?;

    write_json("items.json", &items)?;
    println!("wrote items.json: {} items", items.len());

    Ok(())
}
```
Comparison: Rust vs Python for scraping
| Dimension | Rust | Python |
|---|---|---|
| Speed | Excellent | Good |
| Safety | High | Medium |
| Ecosystem | Smaller but solid | Huge |
| Iteration speed | Medium | Fast |
| Concurrency | Great | Great (async) |
Rust is a strong choice when you’re building a scraper as a service (not a one-off script).
Practical advice for production Rust scrapers
- Keep selectors in one module and version them
- Add retries/backoff for 403/429/5xx
- Implement caching so you don’t re-fetch unchanged pages
- Store provenance (URL + fetch timestamp); see the sketch after this list
- Respect rate limits and spread load
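For the provenance point, one minimal sketch is to wrap each Item with the page it was found on and a fetch timestamp before writing it out. The Record struct, the field names, and the with_provenance helper are illustrative, not part of the tutorial code:

```rust
use serde::Serialize;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical wrapper adding provenance to each scraped Item.
#[derive(Debug, Serialize)]
struct Record {
    item: Item,
    source_url: String,
    fetched_at_unix: u64,
}

fn with_provenance(item: Item, source_url: &str) -> Record {
    let fetched_at_unix = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);

    Record {
        item,
        source_url: source_url.to_string(),
        fetched_at_unix,
    }
}
```

You would then serialize a Vec<Record> instead of a Vec<Item> in the export step.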
Where ProxiesAPI fits (honestly)
ProxiesAPI helps at the network layer:
- reduce per-IP throttling impact
- smooth out transient blocks
- improve completion rate for large URL queues
It doesn’t replace:
- responsible crawl behavior
- monitoring
- compliance and ToS review
Next upgrades
- Move to async (tokio, async reqwest) for higher throughput; a minimal sketch follows this list
- Add a queue + retry store (SQLite/Postgres)
- Integrate a headless browser only for JS-heavy pages (keep most fetches HTML-first)
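As a first step toward async, the same fetch can be written against the async reqwest client under tokio. This standalone sketch is not wired into the crawler above and assumes you add tokio yourself:

```rust
// Requires: cargo add tokio --features macros,rt-multi-thread
use anyhow::Result;

#[tokio::main]
async fn main() -> Result<()> {
    let client = reqwest::Client::builder()
        .timeout(std::time::Duration::from_secs(30))
        .build()?;

    let body = client
        .get("https://example.com/blog")
        .send()
        .await?
        .text()
        .await?;

    println!("fetched {} bytes", body.len());
    Ok(())
}
```

From there, tokio tasks or a bounded set of concurrent futures can fan out requests while the parsing code stays unchanged.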
Once your Rust scraper grows from dozens to thousands of URLs, a proxy layer can stabilize throughput. ProxiesAPI provides a consistent proxy endpoint you can plug into reqwest via standard proxy settings.