Web Scraping with Rust: reqwest + scraper Crate Tutorial (2026)
If you’ve done web scraping in Python, moving to Rust feels like switching from a Swiss Army knife to a purpose-built tool:
- performance and low memory overhead
- type safety (fewer “NoneType has no attribute” surprises)
- excellent async support for high-concurrency crawlers
This tutorial shows a practical baseline for web scraping with Rust using reqwest for HTTP and the scraper crate for HTML parsing.
We’ll build a scraper that:
- fetches HTML with timeouts
- parses items from a page using CSS selectors
- follows pagination
- exports JSONL
Rust makes your scraper fast and correct — but networks still fail. ProxiesAPI can provide a stable proxy layer (rotation, retries, geo) so your reqwest client can keep moving with fewer code changes.
Project setup
Create a new project:
cargo new rust_scraper
cd rust_scraper
Add dependencies in Cargo.toml:
[dependencies]
reqwest = { version = "0.12", features = ["json", "gzip", "brotli", "deflate", "rustls-tls"] }
scraper = "0.20"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tokio = { version = "1", features = ["macros", "rt-multi-thread", "time"] }
anyhow = "1"
Step 1: Build an HTTP client with timeouts
A common beginner mistake is to use the default client with no timeout.
A scraper without timeouts will eventually hang.
use std::time::Duration;

use anyhow::Result;
use reqwest::header::{HeaderMap, HeaderValue, USER_AGENT, ACCEPT, ACCEPT_LANGUAGE};
use reqwest::Client;

fn build_client() -> Result<Client> {
    let mut headers = HeaderMap::new();
    headers.insert(
        USER_AGENT,
        HeaderValue::from_static(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
        ),
    );
    headers.insert(ACCEPT, HeaderValue::from_static("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"));
    headers.insert(ACCEPT_LANGUAGE, HeaderValue::from_static("en-US,en;q=0.9"));

    let client = Client::builder()
        .default_headers(headers)
        .connect_timeout(Duration::from_secs(10))
        .timeout(Duration::from_secs(35))
        .build()?;
    Ok(client)
}
Proxy support (optional)
If you have an HTTP proxy endpoint (for example via ProxiesAPI), reqwest can route through it:
use reqwest::Proxy;

fn build_client_with_proxy(proxy_url: &str) -> Result<Client> {
    let mut headers = HeaderMap::new();
    headers.insert(
        USER_AGENT,
        HeaderValue::from_static("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"),
    );

    let client = Client::builder()
        .default_headers(headers)
        .proxy(Proxy::all(proxy_url)?)
        .connect_timeout(Duration::from_secs(10))
        .timeout(Duration::from_secs(35))
        .build()?;
    Ok(client)
}
Keep this as a switch (env var) so your code works locally without any proxy.
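The switch itself needs nothing from reqwest. A minimal sketch of the lookup (PROXY_URL is this tutorial's own convention, not a variable reqwest reads on its own):

```rust
use std::env;

// Returns the proxy URL only when PROXY_URL is set and non-empty.
fn proxy_from_env() -> Option<String> {
    env::var("PROXY_URL").ok().filter(|s| !s.is_empty())
}

fn main() {
    match proxy_from_env() {
        // In the scraper, this branch would call build_client_with_proxy(&url).
        Some(url) => println!("using proxy: {}", url),
        // ...and this one would call build_client().
        None => println!("no proxy configured"),
    }
}
```

Because the decision lives in one place, the rest of the scraper never has to know whether a proxy is in play.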
Step 2: Fetch HTML with a small retry loop
In production you might use a full retry library.
For a tutorial, a bounded exponential backoff loop is enough:
use tokio::time::sleep;

async fn fetch_html(client: &Client, url: &str) -> Result<String> {
    let mut attempt: u32 = 0;
    loop {
        attempt += 1;
        match client.get(url).send().await {
            Ok(r) => {
                let status = r.status();
                if !status.is_success() {
                    // retry on transient server errors (5xx)
                    if attempt < 5 && status.as_u16() >= 500 {
                        let backoff_ms = 500_u64 * 2_u64.pow(attempt.min(4));
                        sleep(std::time::Duration::from_millis(backoff_ms)).await;
                        continue;
                    }
                    anyhow::bail!("HTTP {} for {}", status, url);
                }
                return Ok(r.text().await?);
            }
            Err(e) => {
                // retry on connection/timeout errors
                if attempt < 5 {
                    let backoff_ms = 500_u64 * 2_u64.pow(attempt.min(4));
                    sleep(std::time::Duration::from_millis(backoff_ms)).await;
                    continue;
                }
                return Err(e.into());
            }
        }
    }
}
Step 3: Parse HTML using the scraper crate
scraper uses CSS selectors similar to BeautifulSoup / Cheerio.
Here’s an example that scrapes "article cards" from a page.
(Replace selectors with your target site’s structure.)
use scraper::{Html, Selector};
use serde::Serialize;

#[derive(Debug, Serialize)]
struct Item {
    title: String,
    url: String,
}

fn parse_items(base_url: &str, html: &str) -> Vec<Item> {
    let doc = Html::parse_document(html);
    let card_sel = Selector::parse("article a").unwrap();

    let mut out = Vec::new();
    for a in doc.select(&card_sel) {
        let title = a.text().collect::<Vec<_>>().join(" ").trim().to_string();
        let href = a.value().attr("href").unwrap_or("#");
        let abs = if href.starts_with("http") {
            href.to_string()
        } else {
            format!("{}{}", base_url.trim_end_matches('/'), href)
        };
        if title.len() < 3 || abs == "#" {
            continue;
        }
        out.push(Item { title, url: abs });
    }
    out
}
Tip: keep selectors close to the content
Avoid depending on volatile CSS class names. Prefer:
- semantic tags (article, h1, time)
- attribute-based selectors (a[href*="/company/"])
- stable ids/data attributes (if present)
Step 4: Pagination crawl + JSONL export
Let’s combine fetch + parse + pagination into a runnable program.
use anyhow::Result;
use std::fs::File;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<()> {
    let base = "https://example.com";

    // Optional: route through a proxy when PROXY_URL is set
    let client = if let Ok(proxy_url) = std::env::var("PROXY_URL") {
        build_client_with_proxy(&proxy_url)?
    } else {
        build_client()?
    };

    let mut file = File::create("items.jsonl")?;
    for page in 1..=3 {
        let url = format!("{}/page/{}", base, page);
        let html = fetch_html(&client, &url).await?;
        let items = parse_items(base, &html);
        eprintln!("page {} -> {} items", page, items.len());

        for it in items {
            let line = serde_json::to_string(&it)?;
            file.write_all(line.as_bytes())?;
            file.write_all(b"\n")?;
        }
    }
    Ok(())
}
Run:
cargo run
Concurrency: async scraping without melting your target
Rust makes it easy to fire off 1,000 requests.
Don’t.
Instead:
- cap concurrency (e.g., 10–50)
- add per-request jitter
- keep backoff on failures
A simple pattern is a semaphore:
use std::sync::Arc;
use tokio::sync::Semaphore;

let sem = Arc::new(Semaphore::new(20));

// before each request: acquire a permit (held until dropped)
let _permit = sem.clone().acquire_owned().await?;
// do request here
Where ProxiesAPI fits (honestly)
Rust won’t save you from networking reality.
When you scale, you typically need:
- multiple egress IPs
- request stability across retries
- geographic variance (depending on the site)
ProxiesAPI can act as the proxy layer while Rust handles:
- concurrency
- parsing
- data export
The clean boundary: proxy configuration lives in the client builder; the scraping logic stays unchanged.
Practical checklist
- Always set timeouts
- Retry only a small number of times
- Keep selectors conservative
- Export JSONL so you can stream/process later
- Add a proxy layer only when you need it
Next upgrades
- Polite rate limiting (token bucket)
- Robots + site policy checks
- Incremental crawling (ETags/If-Modified-Since where possible)
- Store results in SQLite/Postgres