Web Crawling Explained: How to Build Scalable Crawlers Without Wasting Requests

Jun 29, 2026 · guide · #web crawling, #web scraping, #architecture, #python, #data-pipelines, #proxiesapi

If you search for web crawling, you will find two bad extremes:

vague definitions that say crawling is just "visiting websites automatically"
over-engineered architectures that assume you need a mini search engine on day one

Most teams need something in the middle: a crawler that is selective, polite, resumable, and cheap to run.

Here is the simplest useful definition:

Web crawling is the process of systematically discovering and fetching many pages by following links, sitemaps, feeds, or known URL patterns.

That is different from a one-page scraper. A scraper extracts fields from a page. A crawler decides which pages to visit next.

A crawler is only as good as its fetch layer

Once your queue and dedupe logic are correct, the biggest crawler problems are usually failed fetches, retries, and blocked requests. ProxiesAPI helps on that layer without forcing you to rebuild the crawler itself.


Web crawling vs web scraping

The terms overlap, but they solve different problems.

Task	Best tool
Pull title, price, and rating from one product page	Scraper
Discover all category pages and every product underneath them	Crawler
Revisit only changed pages every day	Crawler + scraper
Export structured fields from the final HTML	Scraper

A good mental model:

the crawler expands your frontier of URLs
the scraper extracts data from the pages the crawler fetched

If you skip that separation, you usually waste requests, because every discovery pass ends up refetching pages you only needed to parse once.

When web crawling is the right choice

Use web crawling when:

you do not know all the target URLs in advance
new pages appear continuously
you need broad coverage across categories, tags, or pagination
you want regular refreshes instead of a one-time export

Avoid a crawler when:

you already have a clean API
a sitemap or feed gives you the exact pages you need
you only care about 20 known URLs

The fastest crawler is the one you did not need to build.

The minimum crawler architecture

A practical web crawling system has five parts:

seeds
frontier / queue
fetcher
parser / link extractor
storage and state

1. Seeds

Seeds are your starting URLs. They might be:

homepage sections
category pages
XML sitemaps
RSS or Atom feeds
a known list of seller, product, or article URLs

Good seeds reduce wasted discovery.
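As a rough illustration of seeding from an XML sitemap, the sketch below parses a sitemap that has already been downloaded and returns URL and lastmod pairs. The function name `parse_sitemap` and the `SAMPLE` document are hypothetical. It assumes the standard sitemaps.org namespace and does not handle sitemap index files or gzip compression.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[tuple[str, str | None]]:
    """Return (loc, lastmod) pairs from a <urlset> sitemap; lastmod may be None."""
    root = ET.fromstring(xml_text)
    seeds = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            seeds.append((loc, lastmod.strip() if lastmod else None))
    return seeds


# Hypothetical sample document used only to exercise the parser.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2026-06-01</lastmod></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

seeds = parse_sitemap(SAMPLE)
```

Keeping lastmod next to each seed lets the crawler skip entries that have not changed since the previous run.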

2. Frontier / queue

This is the list of URLs still to visit. The frontier should track:

URL
depth
source page
scheduled time
priority

3. Fetcher

The fetcher handles:

timeouts
retries
headers
proxy routing if needed
response validation
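A minimal sketch of the retry part of a fetcher follows. To keep it independent of any HTTP library, it takes a hypothetical `do_request` callable that returns a status code and body; with `requests` you would wrap `session.get` and also catch `requests.RequestException` inside that wrapper. The set of retryable status codes and the backoff schedule are assumptions to tune per site.

```python
from __future__ import annotations

import time
from typing import Callable

# Assumption: statuses worth retrying; everything else is returned as-is.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_retries(
    do_request: Callable[[str], tuple[int, str]],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Call do_request(url) up to max_attempts times with exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(max_attempts):
        try:
            status, body = do_request(url)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # transport failure: retry
        else:
            if status not in RETRYABLE_STATUSES:
                return status, body  # success, or a final error such as 404
            last_error = RuntimeError(f"retryable status {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error
```

The attempt cap is what keeps retries selective: a URL that keeps failing is reported once instead of being requeued forever.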

4. Parser / link extractor

The parser decides:

what fields to extract
which new links are worth enqueuing
which links to ignore

5. Storage and state

At minimum store:

visited URLs
canonicalized URL fingerprints
HTTP status
last fetch time
extracted record count

That is enough to pause, resume, and debug.
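As a sketch of what that state could look like, the example below stores the fields listed above in SQLite through the standard library `sqlite3` module. The table name `pages`, its column names, and the helper functions are hypothetical, and the fingerprint assumes the URL has already been canonicalized.

```python
from __future__ import annotations

import hashlib
import sqlite3
import time


def fingerprint(canonical_url: str) -> str:
    """Stable fingerprint for a URL that has already been canonicalized."""
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the crawl state database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            fingerprint  TEXT PRIMARY KEY,
            url          TEXT NOT NULL,
            http_status  INTEGER,
            last_fetched REAL,
            record_count INTEGER
        )
        """
    )
    return conn


def record_fetch(conn: sqlite3.Connection, url: str, status: int,
                 record_count: int, fetched_at: float | None = None) -> None:
    """Insert or overwrite the row for this URL after a fetch."""
    when = fetched_at if fetched_at is not None else time.time()
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?)",
        (fingerprint(url), url, status, when, record_count),
    )
    conn.commit()


def already_visited(conn: sqlite3.Connection, url: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM pages WHERE fingerprint = ?", (fingerprint(url),)
    ).fetchone()
    return row is not None
```

Passing a file path instead of `:memory:` is what makes the crawl resumable across restarts.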

How good crawlers avoid wasting requests

The biggest mistake in web crawling is fetching pages just because you can.

Here is how disciplined crawlers stay efficient.

Canonicalize URLs

Treat URLs that differ only in the following as the same page, when appropriate:

tracking parameters like utm_*
duplicate slashes
#fragment suffixes
equivalent sort parameters you do not care about

If you do not normalize URLs, your crawler will revisit the same page under many shapes.
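The normalization rules above might be sketched like this with `urllib.parse`. The `IGNORED_PARAMS` set is an assumption: which parameters are safe to drop depends on the target site, and dropping one that changes the page content would hide real pages.

```python
from __future__ import annotations

import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that never change page content on the target site.
IGNORED_PARAMS = {"sort", "ref"}
TRACKING_PREFIXES = ("utm_",)


def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    # Collapse duplicate slashes in the path; an empty path becomes "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    # Sort the remaining parameters so ?a=1&b=2 and ?b=2&a=1 match.
    query = urlencode(sorted(kept))
    # The fragment is dropped; scheme and host are lowercased.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))
```

For example, `HTTPS://Example.com//shop//shoes?utm_source=x&page=2&sort=asc#reviews` and `https://example.com/shop/shoes?page=2` map to the same canonical URL, so only one of them is fetched.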

Filter links before enqueueing

Do not add every anchor tag to the queue.

Filter by:

domain allowlist
path patterns
content type expectations
depth limit
language or country segment
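A filter along those lines could look like the sketch below. Every constant here (`ALLOWED_HOSTS`, `ALLOWED_PATHS`, `SKIPPED_EXTENSIONS`, `MAX_DEPTH`) is a hypothetical value for an imagined shop site, and guessing content type from the file extension is only a cheap approximation.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"example.com", "www.example.com"}    # hypothetical allowlist
ALLOWED_PATHS = re.compile(r"^/(category|product)/")  # hypothetical path patterns
SKIPPED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip", ".css", ".js")
MAX_DEPTH = 3


def should_enqueue(url: str, depth: int) -> bool:
    """Decide whether an extracted link deserves a slot in the queue."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # mailto:, javascript:, tel:, ...
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    if depth > MAX_DEPTH:
        return False
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return False  # probably not an HTML page
    return bool(ALLOWED_PATHS.match(parts.path))
```

A language or country segment can be handled the same way by tightening the path pattern, for instance requiring a locale prefix.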

Respect change signals

If a site exposes them, use:

ETag
Last-Modified
sitemap lastmod
feed timestamps

Those signals are much cheaper than blind refetching.
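For ETag and Last-Modified, the usual mechanism is a conditional request: send the stored validators back in `If-None-Match` and `If-Modified-Since`, and treat a 304 Not Modified response as "nothing to do". The helper names below are hypothetical, and the sketch assumes you saved the header values from the previous response.

```python
from __future__ import annotations


def conditional_headers(etag: str | None, last_modified: str | None) -> dict[str, str]:
    """Build request headers from validators saved on the previous fetch."""
    headers: dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_reparse(status_code: int) -> bool:
    """304 Not Modified means the stored copy is still current."""
    return status_code != 304
```

With `requests`, the result can be passed as `session.get(url, headers=conditional_headers(etag, last_modified))`. A 304 response carries no body, so it costs far less than a full refetch, though not every server honors these headers.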

Separate discovery from extraction

This is one of the best optimizations.

Example:

crawl category pages every hour
crawl product detail pages only when they are newly discovered or changed

Because detail pages usually far outnumber listing pages, that alone can cut request volume substantially.
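One way to sketch that decision: after each listing crawl, compare a cheap change marker taken from the listing page (a price or an updated-at string, say) with the marker recorded at the last detail fetch. The function name and both dictionaries are hypothetical, and the approach assumes the listing page exposes something that changes whenever the detail page does.

```python
from __future__ import annotations


def plan_detail_fetches(
    listing_items: dict[str, str],
    known_versions: dict[str, str],
) -> list[str]:
    """Return detail URLs that are new or whose change marker differs.

    listing_items maps detail URL -> marker scraped from the listing page.
    known_versions maps detail URL -> marker seen at the last detail fetch.
    """
    return [
        url
        for url, marker in listing_items.items()
        if known_versions.get(url) != marker
    ]
```

Detail pages whose marker is unchanged are skipped entirely, so the hourly listing crawl only triggers fetches for new or updated items.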

A small Python crawler skeleton

You do not need Kafka to learn the idea. This small example shows a polite breadth-first crawler with basic dedupe.

from __future__ import annotations

import time
from collections import deque
from dataclasses import dataclass
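from urllib.parse import parse_qsl, urlencode, urljoin, urlparse, urlunparse

import requests
from bs4 import BeautifulSoup

ALLOWED_HOST = "example.com"
TIMEOUT = (10, 20)  # (connect, read) timeouts in seconds

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})


@dataclass
class QueueItem:
    url: str
    depth: int


def canonicalize(url: str) -> str:
    # Drop the fragment and utm_* tracking parameters, but keep the rest of
    # the query string so paginated URLs such as ?page=2 stay distinct.
    parsed = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parsed.query) if not k.startswith("utm_")]
    clean = parsed._replace(fragment="", query=urlencode(sorted(params)))
    return urlunparse(clean).rstrip("/")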


def same_host(url: str) -> bool:
    return urlparse(url).netloc == ALLOWED_HOST


def fetch(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def extract_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links = []
    for a in soup.select("a[href]"):
        full = canonicalize(urljoin(base_url, a["href"]))
        if same_host(full):
            links.append(full)
    return links


def crawl(seed_urls: list[str], max_pages: int = 100, max_depth: int = 2) -> list[str]:
    queue = deque(QueueItem(canonicalize(url), 0) for url in seed_urls)
    seen: set[str] = set()
    fetched: list[str] = []

    while queue and len(fetched) < max_pages:
        item = queue.popleft()
        if item.url in seen or item.depth > max_depth:
            continue

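        seen.add(item.url)
        try:
            html = fetch(item.url)
        except requests.RequestException:
            # One failed page should not abort the whole crawl.
            time.sleep(1.0)  # stay polite even after a failure
            continue
        fetched.append(item.url)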

        for link in extract_links(html, item.url):
            if link not in seen:
                queue.append(QueueItem(link, item.depth + 1))

        time.sleep(1.0)  # politeness

    return fetched

This is not production-ready, but it demonstrates the important split:

queue policy
fetch policy
parse policy

That split is what lets a crawler grow without turning into spaghetti.

Politeness rules that actually matter

Every web crawling guide mentions politeness. Fewer explain what that means operationally.

Use this checklist:

Rule	Why it matters
Limit concurrency per host	Prevent accidental overload
Add delays or token buckets	Smooth request bursts
Cache aggressively during development	Avoid repeated fetches while debugging
Respect robots.txt where appropriate for your use case and legal posture	Reduce obvious abuse patterns
Retry selectively, not endlessly	Prevent retry storms
Validate response bodies	A `200` can still be a block page

The hidden one is response validation.

Crawlers can burn a large share of their request budget on login pages, challenge pages, empty pages, and soft blocks that technically returned 200 OK.
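A minimal sketch of body validation is shown below. The marker phrases, the minimum length, and the `required_marker` idea are all assumptions; real block pages vary by site, so the markers should come from responses you have actually observed, and a legitimate page that happens to contain one of these phrases would be misclassified.

```python
from __future__ import annotations

# Assumption: phrases seen on block or challenge pages for the target site.
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human", "please log in")
MIN_BODY_CHARS = 500  # hypothetical floor; real pages on the target are larger


def classify_response(status: int, body: str, required_marker: str | None = None) -> str:
    """Return 'ok', 'http_error', 'empty', 'blocked', or 'unexpected'."""
    if status != 200:
        return "http_error"
    if len(body.strip()) < MIN_BODY_CHARS:
        return "empty"
    lowered = body.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return "blocked"
    if required_marker is not None and required_marker not in body:
        return "unexpected"  # e.g. the element you plan to scrape is missing
    return "ok"
```

Only responses classified as `ok` should reach the parser or be marked as successfully fetched; the other outcomes are better logged and rescheduled than stored.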

How to prioritize the queue

Not all pages deserve the same freshness.

A sensible priority model:

Page type	Priority	Refresh cadence
Category / listing page	High	Frequent
Newly discovered detail page	High	Immediate
Stable evergreen article	Medium	Periodic
Old detail page with no changes in months	Low	Rare
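The table can be turned into a small scheduling policy, sketched below. The page type keys, the numeric priorities, and especially the intervals are illustrative assumptions: the table gives relative cadences, not concrete numbers. A lower priority number means more urgent in this sketch.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# Assumption: example values only; tune them to how fast the target site changes.
REFRESH_POLICY = {
    "listing":      (0, timedelta(hours=1)),   # High, frequent
    "new_detail":   (0, timedelta(0)),         # High, immediate
    "evergreen":    (1, timedelta(days=7)),    # Medium, periodic
    "stale_detail": (2, timedelta(days=30)),   # Low, rare
}


def next_fetch(
    page_type: str, last_fetched: datetime | None, now: datetime
) -> tuple[int, datetime]:
    """Return (priority, time at which the page becomes due)."""
    priority, interval = REFRESH_POLICY[page_type]
    if last_fetched is None:
        return priority, now  # never fetched: due immediately
    # Overdue pages are scheduled for now rather than for a time in the past.
    return priority, max(now, last_fetched + interval)


now = datetime(2026, 6, 29, 12, 0, tzinfo=timezone.utc)
priority, due = next_fetch("listing", now - timedelta(minutes=30), now)
```

The returned pair maps directly onto the priority and scheduled time fields that the frontier tracks.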