Web Scraping with Rust: reqwest + scraper Crate Tutorial
Rust is an excellent language for web scraping when you care about:
- performance (fast parsing + concurrency)
- reliability (type safety and fewer runtime surprises)
- building scrapers as long-running services
The ecosystem is mature enough that you can build production crawlers with a small set of crates:
- reqwest for HTTP
- scraper for CSS-selector-based HTML parsing
- serde for JSON
In this tutorial you’ll build a real scraper that:
- fetches a list page
- parses repeated items using CSS selectors
- follows pagination
- exports clean JSON
- supports proxies (including ProxiesAPI) via environment variables
Once your Rust scraper grows from dozens to thousands of URLs, a proxy layer can stabilize throughput. ProxiesAPI provides a consistent proxy endpoint you can plug into reqwest via standard proxy settings.
Project setup
Create a new Rust binary:
```bash
cargo new rust_scraper
cd rust_scraper
```
Add dependencies:
```bash
cargo add reqwest --features blocking,gzip,brotli,deflate,json
cargo add scraper
cargo add serde --features derive
cargo add serde_json
cargo add anyhow
cargo add url
```
Tip: This tutorial uses reqwest in blocking mode for simplicity. You can migrate to async later.
The target: a paginated list of items
To keep the example broadly applicable, we’ll scrape a generic “list page” shape:
- a page with repeated item cards
- each card has a title + link
- pagination via a “Next” link
You can adapt this to:
- blog index pages
- product category pages
- directory listings
Step 1: Build an HTTP client with timeouts
```rust
use anyhow::Result;
use reqwest::blocking::Client;
use std::time::Duration;

fn build_client() -> Result<Client> {
    let client = Client::builder()
        .timeout(Duration::from_secs(30))
        .user_agent("Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)")
        .build()?;
    Ok(client)
}
```
Proxy support (ProxiesAPI)
reqwest supports proxies through Proxy::all().
We’ll read a single proxy URL from an environment variable. If it’s set, we route all traffic through it.
```rust
use reqwest::Proxy;

fn build_client() -> Result<Client> {
    let mut builder = Client::builder()
        .timeout(Duration::from_secs(30))
        .user_agent("Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)");

    if let Ok(proxy_url) = std::env::var("PROXIESAPI_PROXY_URL") {
        // Example format: http://USER:PASS@proxy.yourprovider.com:1234
        builder = builder.proxy(Proxy::all(&proxy_url)?);
    }

    Ok(builder.build()?)
}
```
This pattern works with ProxiesAPI if you have a proxy endpoint URL.
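With the variable set, every request from the client is routed through the proxy; leave it unset to connect directly. For example, using the same placeholder format as the comment above: `PROXIESAPI_PROXY_URL=http://USER:PASS@proxy.yourprovider.com:1234 cargo run`.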
Step 2: Fetch HTML
```rust
fn fetch_html(client: &Client, url: &str) -> Result<String> {
    let resp = client.get(url).send()?;

    // Optional: retry logic can be added if you want to treat 403/429/5xx specially.
    let status = resp.status();
    if !status.is_success() {
        anyhow::bail!("HTTP {} for {}", status, url);
    }

    Ok(resp.text()?)
}
```
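The comment above mentions retries. Here is a minimal sketch of a retry wrapper with exponential backoff that treats 403, 429, and 5xx as retryable, matching the advice later in this guide; the `fetch_html_with_retries` name, the attempt count, and the delays are assumptions to tune for your target, not part of the core tutorial code:

```rust
use std::thread;

// Assumes Client, Duration, and anyhow::Result are imported as in Step 1.
fn fetch_html_with_retries(client: &Client, url: &str, max_attempts: u32) -> Result<String> {
    let mut delay = Duration::from_secs(1);

    for attempt in 1..=max_attempts {
        match client.get(url).send() {
            // Success: return the body.
            Ok(resp) if resp.status().is_success() => return Ok(resp.text()?),
            // Throttling, blocks, or server errors: log and retry.
            Ok(resp) if matches!(resp.status().as_u16(), 403 | 429) || resp.status().is_server_error() => {
                eprintln!("attempt {}: HTTP {} for {}", attempt, resp.status(), url);
            }
            // Anything else (e.g. 404): give up immediately.
            Ok(resp) => anyhow::bail!("HTTP {} for {}", resp.status(), url),
            // Network-level errors: log and retry.
            Err(e) => eprintln!("attempt {}: request error for {}: {}", attempt, url, e),
        }
        if attempt < max_attempts {
            thread::sleep(delay);
            delay *= 2; // exponential backoff
        }
    }

    anyhow::bail!("giving up on {} after {} attempts", url, max_attempts)
}
```

Swapping this in for fetch_html inside the crawl loop later is a one-line change.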
Step 3: Parse items with scraper
The scraper crate uses CSS selectors.
We’ll define a small struct for items:
```rust
use serde::Serialize;

#[derive(Debug, Serialize, Clone)]
struct Item {
    title: String,
    url: String,
}
```
Now parse from HTML:
```rust
use scraper::{Html, Selector};
use url::Url;

fn parse_items(html: &str, base_url: &str) -> Result<Vec<Item>> {
    let doc = Html::parse_document(html);

    // Update these selectors for your target site.
    // scraper's selector error type doesn't convert cleanly via `?`, so map it into anyhow.
    let card_sel = Selector::parse("article").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;
    let link_sel = Selector::parse("a").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;

    let base = Url::parse(base_url)?;
    let mut out = Vec::new();

    for card in doc.select(&card_sel) {
        if let Some(a) = card.select(&link_sel).next() {
            let title = a.text().collect::<Vec<_>>().join(" ").trim().to_string();
            if title.is_empty() {
                continue;
            }
            if let Some(href) = a.value().attr("href") {
                let joined = base.join(href)?.to_string();
                out.push(Item { title, url: joined });
            }
        }
    }

    Ok(out)
}
```
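To sanity-check the selectors as markup evolves, a small unit test with an inline HTML fixture is handy. The fixture below is invented for illustration; it simply mirrors the generic "article card with a link" shape assumed above:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_item_cards() {
        // Hypothetical fixture matching the "article card with a link" shape.
        let html = r#"
            <html><body>
              <article><a href="/posts/first">First post</a></article>
              <article><a href="/posts/second">Second post</a></article>
            </body></html>
        "#;

        let items = parse_items(html, "https://example.com/blog").unwrap();
        assert_eq!(items.len(), 2);
        assert_eq!(items[0].title, "First post");
        assert_eq!(items[0].url, "https://example.com/posts/first");
    }
}
```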
Why we keep selectors simple
For scraping at scale, you want:
- a small number of selectors
- a single place to update them
Markup changes are inevitable.
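One lightweight way to get that single place is a module of selector string constants, so a markup change is a one-line edit. The module and constant names below are illustrative, not part of the code above:

```rust
// A hypothetical selectors module: the only place selector strings live.
pub mod selectors {
    pub const ITEM_CARD: &str = "article";
    pub const ITEM_LINK: &str = "a";
    pub const NEXT_PAGE: &str = "a[rel='next']";
}
```

parse_items and find_next_page would then call Selector::parse(selectors::ITEM_CARD) and so on, instead of repeating string literals.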
Step 4: Pagination: follow the “Next” link
```rust
fn find_next_page(html: &str, base_url: &str) -> Result<Option<String>> {
    let doc = Html::parse_document(html);

    // Common pattern: <a rel="next" href="...">Next</a>
    let next_sel = Selector::parse("a[rel='next']").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;
    if let Some(a) = doc.select(&next_sel).next() {
        if let Some(href) = a.value().attr("href") {
            let base = Url::parse(base_url)?;
            return Ok(Some(base.join(href)?.to_string()));
        }
    }

    // Fallback: text match "Next" (lightweight)
    let a_sel = Selector::parse("a").map_err(|e| anyhow::anyhow!("bad selector: {e:?}"))?;
    for a in doc.select(&a_sel) {
        let text = a.text().collect::<Vec<_>>().join(" ").to_lowercase();
        if text.contains("next") {
            if let Some(href) = a.value().attr("href") {
                let base = Url::parse(base_url)?;
                return Ok(Some(base.join(href)?.to_string()));
            }
        }
    }

    Ok(None)
}
```
Step 5: Crawl N pages and de-duplicate
```rust
use std::collections::HashSet;

fn crawl(start_url: &str, max_pages: usize) -> Result<Vec<Item>> {
    let client = build_client()?;

    let mut url = start_url.to_string();
    let mut page = 0usize;
    let mut seen = HashSet::new();
    let mut all = Vec::new();

    while page < max_pages {
        page += 1;

        let html = fetch_html(&client, &url)?;
        let items = parse_items(&html, &url)?;

        let mut new_count = 0;
        for it in items {
            if seen.insert(it.url.clone()) {
                all.push(it);
                new_count += 1;
            }
        }

        println!("page {}: total={} new={} url={}", page, all.len(), new_count, url);

        if let Some(next) = find_next_page(&html, &url)? {
            url = next;
        } else {
            break;
        }
    }

    Ok(all)
}
```
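If you want to spread load while crawling, the simplest option is a short pause between pages inside the loop above; the 500 ms figure is an arbitrary starting point, not a recommendation from the crate:

```rust
// At the end of each iteration of the while loop in crawl():
std::thread::sleep(std::time::Duration::from_millis(500));
```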
Step 6: Export JSON
```rust
use std::fs::File;
use std::io::Write;

fn write_json(path: &str, items: &[Item]) -> Result<()> {
    let json = serde_json::to_string_pretty(items)?;
    let mut f = File::create(path)?;
    f.write_all(json.as_bytes())?;
    Ok(())
}
```
Main program:
```rust
fn main() -> Result<()> {
    let start = "https://example.com/blog"; // replace
    let items = crawl(start, 5)?;

    write_json("items.json", &items)?;
    println!("wrote items.json: {} items", items.len());

    Ok(())
}
```
Comparison: Rust vs Python for scraping
| Dimension | Rust | Python |
|---|---|---|
| Speed | Excellent | Good |
| Safety | High | Medium |
| Ecosystem | Smaller but solid | Huge |
| Iteration speed | Medium | Fast |
| Concurrency | Great | Great (async) |
Rust is a strong choice when you’re building a scraper as a service (not a one-off script).
Practical advice for production Rust scrapers
- Keep selectors in one module and version them
- Add retries/backoff for 403/429/5xx
- Implement caching so you don’t re-fetch unchanged pages
- Store provenance (URL + fetch timestamp); see the sketch after this list
- Respect rate limits and spread load
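For the provenance point, one minimal sketch is to wrap each Item with the page it was found on and a fetch timestamp before writing it out. The Record struct, the field names, and the with_provenance helper are illustrative, not part of the tutorial code:

```rust
use serde::Serialize;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical wrapper adding provenance to each scraped Item.
#[derive(Debug, Serialize)]
struct Record {
    item: Item,
    source_url: String,
    fetched_at_unix: u64,
}

fn with_provenance(item: Item, source_url: &str) -> Record {
    let fetched_at_unix = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);

    Record {
        item,
        source_url: source_url.to_string(),
        fetched_at_unix,
    }
}
```

You would then serialize a Vec<Record> instead of a Vec<Item> in the export step.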
Where ProxiesAPI fits (honestly)
ProxiesAPI helps at the network layer:
- reduce per-IP throttling impact
- smooth out transient blocks
- improve completion rate for large URL queues
It doesn’t replace:
- responsible crawl behavior
- monitoring
- compliance and ToS review
Next upgrades
- Move to async (tokio, async reqwest) for higher throughput; a minimal sketch follows this list
- Add a queue + retry store (SQLite/Postgres)
- Integrate a headless browser only for JS-heavy pages (keep most fetches HTML-first)
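As a first step toward async, the same fetch can be written against the async reqwest client under tokio. This standalone sketch is not wired into the crawler above and assumes you add tokio yourself:

```rust
// Requires: cargo add tokio --features macros,rt-multi-thread
use anyhow::Result;

#[tokio::main]
async fn main() -> Result<()> {
    let client = reqwest::Client::builder()
        .timeout(std::time::Duration::from_secs(30))
        .build()?;

    let body = client
        .get("https://example.com/blog")
        .send()
        .await?
        .text()
        .await?;

    println!("fetched {} bytes", body.len());
    Ok(())
}
```

From there, tokio tasks or a bounded set of concurrent futures can fan out requests while the parsing code stays unchanged.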
Once your Rust scraper grows from dozens to thousands of URLs, a proxy layer can stabilize throughput. ProxiesAPI provides a consistent proxy endpoint you can plug into reqwest via standard proxy settings.