Web Scraping with Rust: reqwest + scraper Crate Tutorial

Rust is an excellent language for web scraping when you care about:

  • performance (fast parsing + concurrency)
  • reliability (type safety and fewer runtime surprises)
  • building scrapers as long-running services

The ecosystem is mature enough that you can build production crawlers with a small set of crates:

  • reqwest for HTTP
  • scraper for CSS-selector-based HTML parsing
  • serde for JSON

In this tutorial you’ll build a real scraper that:

  1. fetches a list page
  2. parses repeated items using CSS selectors
  3. follows pagination
  4. exports clean JSON
  5. supports proxies (including ProxiesAPI) via environment variables


Project setup

Create a new Rust binary:

cargo new rust_scraper
cd rust_scraper

Add dependencies:

cargo add reqwest --features blocking,gzip,brotli,deflate,json
cargo add scraper
cargo add serde --features derive
cargo add serde_json
cargo add anyhow
cargo add url

Tip: This tutorial uses reqwest in blocking mode for simplicity. You can migrate to async later.
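
After those commands, the [dependencies] section of Cargo.toml should look roughly like this (the exact versions depend on when you run cargo add):

[dependencies]
reqwest = { version = "0.12", features = ["blocking", "gzip", "brotli", "deflate", "json"] }
scraper = "0.23"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1"
url = "2"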


The target: a paginated list of items

To keep the example broadly applicable, we’ll scrape a generic “list page” shape:

  • a page with repeated item cards
  • each card has a title + link
  • pagination via a “Next” link

You can adapt this to:

  • blog index pages
  • product category pages
  • directory listings

Step 1: Build an HTTP client with timeouts

use anyhow::Result;
use reqwest::blocking::Client;
use std::time::Duration;

fn build_client() -> Result<Client> {
    let client = Client::builder()
        .timeout(Duration::from_secs(30))
        .user_agent("Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)")
        .build()?;
    Ok(client)
}

Proxy support (ProxiesAPI)

reqwest supports proxies through Proxy::all(), which routes every request (HTTP and HTTPS) through the given proxy.

We’ll read a single proxy URL from an environment variable. If it’s set, we route all traffic through it. Update build_client accordingly:

use reqwest::Proxy;

fn build_client() -> Result<Client> {
    let mut builder = Client::builder()
        .timeout(Duration::from_secs(30))
        .user_agent("Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)");

    if let Ok(proxy_url) = std::env::var("PROXIESAPI_PROXY_URL") {
        // Example format: http://USER:PASS@proxy.yourprovider.com:1234
        builder = builder.proxy(Proxy::all(&proxy_url)?);
    }

    Ok(builder.build()?)
}

This pattern works with ProxiesAPI if you have a proxy endpoint URL.
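
To enable it at run time, export the variable first (the host and credentials below are placeholders):

export PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.example.com:1234"
cargo run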


Step 2: Fetch HTML

fn fetch_html(client: &Client, url: &str) -> Result<String> {
    let resp = client.get(url).send()?;

    // Optional: retry logic can be added if you want to treat 403/429/5xx specially.
    let status = resp.status();
    if !status.is_success() {
        anyhow::bail!("HTTP {} for {}", status, url);
    }

    Ok(resp.text()?)
}
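
The comment above hints at retries. A minimal wrapper (the attempt count and backoff values are arbitrary here; tune them for your workload) could look like this:

fn fetch_html_with_retry(client: &Client, url: &str, max_attempts: u32) -> Result<String> {
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match fetch_html(client, url) {
            Ok(html) => return Ok(html),
            Err(e) => {
                eprintln!("attempt {attempt}/{max_attempts} failed for {url}: {e}");
                last_err = Some(e);
                // Linear backoff: 2s, 4s, 6s, ...
                std::thread::sleep(std::time::Duration::from_secs(2 * attempt as u64));
            }
        }
    }
    Err(last_err.expect("max_attempts should be at least 1"))
}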

Step 3: Parse items with scraper

The scraper crate uses CSS selectors.

We’ll define a small struct for items:

use serde::Serialize;

#[derive(Debug, Serialize, Clone)]
struct Item {
    title: String,
    url: String,
}

Now parse from HTML:

use scraper::{Html, Selector};
use url::Url;

fn parse_items(html: &str, base_url: &str) -> Result<Vec<Item>> {
    let doc = Html::parse_document(html);

    // Update these selectors for your target site.
    // Note: scraper's selector error borrows its input and isn't Send + Sync,
    // so it can't go through `?` into anyhow directly; map it by hand.
    let card_sel = Selector::parse("article")
        .map_err(|e| anyhow::anyhow!("invalid selector: {e}"))?;
    let link_sel = Selector::parse("a")
        .map_err(|e| anyhow::anyhow!("invalid selector: {e}"))?;

    let base = Url::parse(base_url)?;

    let mut out = Vec::new();

    for card in doc.select(&card_sel) {
        if let Some(a) = card.select(&link_sel).next() {
            let title = a.text().collect::<Vec<_>>().join(" ").trim().to_string();
            if title.is_empty() {
                continue;
            }

            if let Some(href) = a.value().attr("href") {
                let joined = base.join(href)?.to_string();
                out.push(Item { title, url: joined });
            }
        }
    }

    Ok(out)
}

Why we keep selectors simple

For scraping at scale, you want:

  • a small number of selectors
  • a single place to update them

Markup changes are inevitable.
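
One lightweight way to get that single place (just a sketch; the module and constant names are ours):

mod selectors {
    pub const CARD: &str = "article";
    pub const LINK: &str = "a";
    pub const NEXT: &str = "a[rel='next']";
}

// Then use Selector::parse(selectors::CARD) etc. in the parsing functions.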


Step 4: Pagination: follow “Next”

fn find_next_page(html: &str, base_url: &str) -> Result<Option<String>> {
    let doc = Html::parse_document(html);

    // Common pattern: <a rel="next" href="...">Next</a>
    let next_sel = Selector::parse("a[rel='next']")
        .map_err(|e| anyhow::anyhow!("invalid selector: {e}"))?;

    if let Some(a) = doc.select(&next_sel).next() {
        if let Some(href) = a.value().attr("href") {
            let base = Url::parse(base_url)?;
            return Ok(Some(base.join(href)?.to_string()));
        }
    }

    // Fallback: text match on "Next" (lightweight)
    let a_sel = Selector::parse("a")
        .map_err(|e| anyhow::anyhow!("invalid selector: {e}"))?;
    for a in doc.select(&a_sel) {
        let text = a.text().collect::<Vec<_>>().join(" ").to_lowercase();
        if text.contains("next") {
            if let Some(href) = a.value().attr("href") {
                let base = Url::parse(base_url)?;
                return Ok(Some(base.join(href)?.to_string()));
            }
        }
    }

    Ok(None)
}

Step 5: Crawl N pages and de-duplicate

use std::collections::HashSet;

fn crawl(start_url: &str, max_pages: usize) -> Result<Vec<Item>> {
    let client = build_client()?;

    let mut url = start_url.to_string();
    let mut page = 0usize;

    let mut seen = HashSet::new();
    let mut all = Vec::new();

    while page < max_pages {
        page += 1;
        let html = fetch_html(&client, &url)?;

        let items = parse_items(&html, &url)?;
        let mut new_count = 0;

        for it in items {
            if seen.insert(it.url.clone()) {
                all.push(it);
                new_count += 1;
            }
        }

        println!("page {}: total={} new={} url={}", page, all.len(), new_count, url);

        if let Some(next) = find_next_page(&html, &url)? {
            url = next;
        } else {
            break;
        }
    }

    Ok(all)
}
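
If the target is rate-sensitive, a small pause between pages keeps the crawl polite (500 ms is just an example value):

// At the bottom of the while loop in crawl(), after resolving the next URL:
std::thread::sleep(std::time::Duration::from_millis(500));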

Step 6: Export JSON

use std::fs::File;
use std::io::Write;

fn write_json(path: &str, items: &[Item]) -> Result<()> {
    let json = serde_json::to_string_pretty(items)?;
    let mut f = File::create(path)?;
    f.write_all(json.as_bytes())?;
    Ok(())
}
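
If you would rather append results incrementally, JSON Lines (one object per line) is a small variation; writeln! comes from the std::io::Write import above:

fn write_jsonl(path: &str, items: &[Item]) -> Result<()> {
    let mut f = File::create(path)?;
    for it in items {
        writeln!(f, "{}", serde_json::to_string(it)?)?;
    }
    Ok(())
}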

Main program:

fn main() -> Result<()> {
    let start = "https://example.com/blog"; // replace
    let items = crawl(start, 5)?;
    write_json("items.json", &items)?;
    println!("wrote items.json: {} items", items.len());
    Ok(())
}
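
Run it (replace the start URL first):

cargo run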

Comparison: Rust vs Python for scraping

Dimension        | Rust              | Python
-----------------|-------------------|---------------
Speed            | Excellent         | Good
Safety           | High              | Medium
Ecosystem        | Smaller but solid | Huge
Iteration speed  | Medium            | Fast
Concurrency      | Great             | Great (async)

Rust is a strong choice when you’re building a scraper as a service (not a one-off script).


Practical advice for production Rust scrapers

  • Keep selectors in one module and version them
  • Add retries/backoff for 403/429/5xx
  • Implement caching so you don’t re-fetch unchanged pages
  • Store provenance (URL + fetch timestamp); a sketch follows this list
  • Respect rate limits and spread load
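
For the provenance point, a thin wrapper struct is enough; the field names here are illustrative, not part of the code above:

use serde::Serialize;
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Debug, Serialize)]
struct ProvenancedItem {
    title: String,
    url: String,
    fetched_from: String, // the list page this item was parsed from
    fetched_at_unix: u64, // seconds since the Unix epoch
}

fn unix_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}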

Where ProxiesAPI fits (honestly)

ProxiesAPI helps at the network layer:

  • reduce per-IP throttling impact
  • smooth out transient blocks
  • improve completion rate for large URL queues

It doesn’t replace:

  • responsible crawl behavior
  • monitoring
  • compliance and ToS review

Next upgrades

  • Move to async (tokio, reqwest async) for higher throughput; a skeleton is sketched after this list
  • Add a queue + retry store (SQLite/Postgres)
  • Integrate a headless browser only for JS-heavy pages (keep most fetches HTML-first)
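
For the async upgrade, the skeleton changes roughly like this (a sketch: you’d add tokio to Cargo.toml and drop reqwest’s blocking feature):

use anyhow::Result;

#[tokio::main]
async fn main() -> Result<()> {
    let client = reqwest::Client::builder()
        .timeout(std::time::Duration::from_secs(30))
        .build()?;

    let html = client
        .get("https://example.com/blog") // replace
        .send()
        .await?
        .error_for_status()?
        .text()
        .await?;

    println!("fetched {} bytes", html.len());
    Ok(())
}
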
Scale Rust crawlers with ProxiesAPI

Once your Rust scraper grows from dozens to thousands of URLs, a proxy layer can stabilize throughput. ProxiesAPI provides a consistent proxy endpoint you can plug into reqwest via standard proxy settings.
