Web Scraping with Rust: reqwest + scraper Crate Tutorial (2026)
If you’ve done web scraping in Python, moving to Rust feels like switching from a Swiss Army knife to a purpose-built tool:
- performance and low memory overhead
- type safety (fewer “NoneType has no attribute” surprises)
- excellent async support for high-concurrency crawlers
This tutorial shows a practical baseline for web scraping with Rust using reqwest for HTTP and the scraper crate for HTML parsing.
We’ll build a scraper that:
- fetches HTML with timeouts
- parses items from a page using CSS selectors
- follows pagination
- exports JSONL
Rust makes your scraper fast and correct — but networks still fail. ProxiesAPI can provide a stable proxy layer (rotation, retries, geo) so your reqwest client can keep moving with fewer code changes.
Project setup
Create a new project:
cargo new rust_scraper
cd rust_scraper
Add dependencies in Cargo.toml:
[dependencies]
reqwest = { version = "0.12", features = ["json", "gzip", "brotli", "deflate", "rustls-tls"] }
scraper = "0.20"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tokio = { version = "1", features = ["macros", "rt-multi-thread", "time"] }
anyhow = "1"
Step 1: Build an HTTP client with timeouts
A common beginner mistake is to use the default client with no timeout.
A scraper without timeouts will eventually hang.
use std::time::Duration;

use anyhow::Result;
use reqwest::header::{HeaderMap, HeaderValue, USER_AGENT, ACCEPT, ACCEPT_LANGUAGE};
use reqwest::Client;

fn build_client() -> Result<Client> {
    let mut headers = HeaderMap::new();
    headers.insert(
        USER_AGENT,
        HeaderValue::from_static(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
        ),
    );
    headers.insert(ACCEPT, HeaderValue::from_static("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"));
    headers.insert(ACCEPT_LANGUAGE, HeaderValue::from_static("en-US,en;q=0.9"));

    let client = Client::builder()
        .default_headers(headers)
        .connect_timeout(Duration::from_secs(10))
        .timeout(Duration::from_secs(35))
        .build()?;
    Ok(client)
}
Proxy support (optional)
If you have an HTTP proxy endpoint (for example via ProxiesAPI), reqwest can route through it:
use reqwest::Proxy;

fn build_client_with_proxy(proxy_url: &str) -> Result<Client> {
    let mut headers = HeaderMap::new();
    headers.insert(
        USER_AGENT,
        HeaderValue::from_static("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"),
    );

    let client = Client::builder()
        .default_headers(headers)
        .proxy(Proxy::all(proxy_url)?)
        .connect_timeout(Duration::from_secs(10))
        .timeout(Duration::from_secs(35))
        .build()?;
    Ok(client)
}
Keep this as a switch (env var) so your code works locally without any proxy.
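The switch itself needs nothing from reqwest. A minimal sketch of the lookup (PROXY_URL is this tutorial's own convention, not a variable reqwest reads on its own):

```rust
use std::env;

// Returns the proxy URL only when PROXY_URL is set and non-empty.
fn proxy_from_env() -> Option<String> {
    env::var("PROXY_URL").ok().filter(|s| !s.is_empty())
}

fn main() {
    match proxy_from_env() {
        // In the scraper, this branch would call build_client_with_proxy(&url).
        Some(url) => println!("using proxy: {}", url),
        // ...and this one would call build_client().
        None => println!("no proxy configured"),
    }
}
```

Because the decision lives in one place, the rest of the scraper never has to know whether a proxy is in play.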
Step 2: Fetch HTML with a small retry loop
In production you might use a full retry library.
For a tutorial, a bounded exponential backoff loop is enough:
use tokio::time::sleep;

async fn fetch_html(client: &Client, url: &str) -> Result<String> {
    let mut attempt: u32 = 0;
    loop {
        attempt += 1;
        match client.get(url).send().await {
            Ok(r) => {
                let status = r.status();
                if !status.is_success() {
                    // retry on transient server errors (5xx)
                    if attempt < 5 && status.as_u16() >= 500 {
                        let backoff_ms = 500_u64 * 2_u64.pow(attempt.min(4));
                        sleep(std::time::Duration::from_millis(backoff_ms)).await;
                        continue;
                    }
                    anyhow::bail!("HTTP {} for {}", status, url);
                }
                return Ok(r.text().await?);
            }
            Err(e) => {
                // retry on connection/timeout errors
                if attempt < 5 {
                    let backoff_ms = 500_u64 * 2_u64.pow(attempt.min(4));
                    sleep(std::time::Duration::from_millis(backoff_ms)).await;
                    continue;
                }
                return Err(e.into());
            }
        }
    }
}
Step 3: Parse HTML using the scraper crate
scraper uses CSS selectors similar to BeautifulSoup / Cheerio.
Here’s an example that scrapes "article cards" from a page.
(Replace selectors with your target site’s structure.)
use scraper::{Html, Selector};
use serde::Serialize;

#[derive(Debug, Serialize)]
struct Item {
    title: String,
    url: String,
}

fn parse_items(base_url: &str, html: &str) -> Vec<Item> {
    let doc = Html::parse_document(html);
    let card_sel = Selector::parse("article a").unwrap();

    let mut out = Vec::new();
    for a in doc.select(&card_sel) {
        let title = a.text().collect::<Vec<_>>().join(" ").trim().to_string();
        let href = a.value().attr("href").unwrap_or("#");
        let abs = if href.starts_with("http") {
            href.to_string()
        } else {
            format!("{}{}", base_url.trim_end_matches('/'), href)
        };
        if title.len() < 3 || abs == "#" {
            continue;
        }
        out.push(Item { title, url: abs });
    }
    out
}
Tip: keep selectors close to the content
Avoid depending on volatile CSS class names. Prefer:
- semantic tags (article, h1, time)
- attribute-based selectors (a[href*="/company/"])
- stable ids/data attributes (if present)
Step 4: Pagination crawl + JSONL export
Let’s combine fetch + parse + pagination into a runnable program.
use anyhow::Result;
use std::fs::File;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<()> {
    let base = "https://example.com";

    // Optional: route through a proxy when PROXY_URL is set
    let client = if let Ok(proxy_url) = std::env::var("PROXY_URL") {
        build_client_with_proxy(&proxy_url)?
    } else {
        build_client()?
    };

    let mut file = File::create("items.jsonl")?;
    for page in 1..=3 {
        let url = format!("{}/page/{}", base, page);
        let html = fetch_html(&client, &url).await?;
        let items = parse_items(base, &html);
        eprintln!("page {} -> {} items", page, items.len());

        for it in items {
            let line = serde_json::to_string(&it)?;
            file.write_all(line.as_bytes())?;
            file.write_all(b"\n")?;
        }
    }
    Ok(())
}
Run:
cargo run
Concurrency: async scraping without melting your target
Rust makes it easy to fire off 1,000 requests.
Don’t.
Instead:
- cap concurrency (e.g., 10–50)
- add per-request jitter
- keep backoff on failures
A simple pattern is a semaphore:
use std::sync::Arc;
use tokio::sync::Semaphore;

let sem = Arc::new(Semaphore::new(20));

// before each request: acquire a permit (held until dropped)
let _permit = sem.clone().acquire_owned().await?;
// do request here
Where ProxiesAPI fits (honestly)
Rust won’t save you from networking reality.
When you scale, you typically need:
- multiple egress IPs
- request stability across retries
- geographic variance (depending on the site)
ProxiesAPI can act as the proxy layer while Rust handles:
- concurrency
- parsing
- data export
The clean boundary: proxy configuration lives in the client builder; the scraping logic stays unchanged.
Practical checklist
- Always set timeouts
- Retry only a small number of times
- Keep selectors conservative
- Export JSONL so you can stream/process later
- Add a proxy layer only when you need it
Next upgrades
- Polite rate limiting (token bucket)
- Robots + site policy checks
- Incremental crawling (ETags/If-Modified-Since where possible)
- Store results in SQLite/Postgres