Web Crawling Explained: How to Build Scalable Crawlers Without Wasting Requests

If you search for web crawling, you will find two bad extremes:

  • vague definitions that say crawling is just "visiting websites automatically"
  • over-engineered architectures that assume you need a mini search engine on day one

Most teams need something in the middle: a crawler that is selective, polite, resumable, and cheap to run.

Here is the simplest useful definition:

Web crawling is the process of systematically discovering and fetching many pages by following links, sitemaps, feeds, or known URL patterns.

That is different from a one-page scraper. A scraper extracts fields from a page. A crawler decides which pages to visit next.

A crawler is only as good as its fetch layer

Once your queue and dedupe logic are correct, the biggest crawler problems are usually failed fetches, retries, and blocked requests. ProxiesAPI helps on that layer without forcing you to rebuild the crawler itself.


Web crawling vs web scraping

The terms overlap, but they solve different problems.

| Task | Best tool |
| --- | --- |
| Pull title, price, and rating from one product page | Scraper |
| Discover all category pages and every product underneath them | Crawler |
| Revisit only changed pages every day | Crawler + scraper |
| Export structured fields from the final HTML | Scraper |

A good mental model:

  • the crawler expands your frontier of URLs
  • the scraper extracts data from the pages the crawler fetched

If you skip that separation, you usually waste requests.


When web crawling is the right choice

Use web crawling when:

  • you do not know all the target URLs in advance
  • new pages appear continuously
  • you need broad coverage across categories, tags, or pagination
  • you want regular refreshes instead of a one-time export

Avoid a crawler when:

  • you already have a clean API
  • a sitemap or feed gives you the exact pages you need
  • you only care about 20 known URLs

The fastest crawler is the one you did not need to build.


The minimum crawler architecture

A practical web crawling system has five parts:

  1. seeds
  2. frontier / queue
  3. fetcher
  4. parser / link extractor
  5. storage and state

1. Seeds

Seeds are your starting URLs. They might be:

  • homepage sections
  • category pages
  • XML sitemaps
  • RSS or Atom feeds
  • a known list of seller, product, or article URLs

Good seeds reduce wasted discovery.
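
As one illustration of seeding from an XML sitemap, the sketch below parses a sitemap document with Python's standard library and returns each URL together with its lastmod value when one is present. The function name parse_sitemap and the sample document are assumptions made for this example, and downloading the sitemap is left out.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (url, lastmod) pairs from a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


# Hypothetical sitemap content used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-05</lastmod></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

seeds = parse_sitemap(SAMPLE)
```

Keeping lastmod next to each seed lets the scheduler skip pages that have not changed since the last run.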

2. Frontier / queue

This is the list of URLs still to visit. The frontier should track:

  • URL
  • depth
  • source page
  • scheduled time
  • priority
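
A frontier that tracks those fields could be sketched as a small priority queue. This is a minimal in-memory version; the class names FrontierEntry and Frontier, the default values, and the "lower number means sooner" convention are assumptions for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrontierEntry:
    url: str
    depth: int
    source: Optional[str] = None  # page on which the link was found
    scheduled_at: float = 0.0     # earliest fetch time, in epoch seconds
    priority: int = 5             # lower number = fetched sooner


class Frontier:
    """Priority queue of URLs still to visit, with queue-level dedupe."""

    def __init__(self) -> None:
        self._heap: list = []
        self._tiebreak = itertools.count()  # keeps insertion order stable
        self._queued: set = set()           # URLs stay here after pop()

    def push(self, entry: FrontierEntry) -> bool:
        if entry.url in self._queued:
            return False  # enqueue each URL at most once per run
        self._queued.add(entry.url)
        key = (entry.priority, entry.scheduled_at, next(self._tiebreak))
        heapq.heappush(self._heap, (key, entry))
        return True

    def pop(self) -> Optional[FrontierEntry]:
        if not self._heap:
            return None
        _key, entry = heapq.heappop(self._heap)
        return entry

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier()
frontier.push(FrontierEntry("https://example.com/old-item", depth=2, priority=9))
frontier.push(FrontierEntry("https://example.com/category", depth=0, priority=1))
first = frontier.pop()  # the category page, because its priority number is lower
```

A production frontier would normally live in a database or a message queue, but the fields and the ordering rule stay the same.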

3. Fetcher

The fetcher handles:

  • timeouts
  • retries
  • headers
  • proxy routing if needed
  • response validation
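
The retry and validation responsibilities can be sketched independently of any HTTP library. In the example below, get is a hypothetical callable that returns a (status, body) pair; in a real crawler it would wrap something like a requests session call with timeouts and headers already applied. The retryable status codes and the backoff values are assumptions.

```python
import time
from typing import Callable, Optional, Tuple

# Status codes treated as temporary failures (an assumption for this sketch).
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_retries(
    url: str,
    get: Callable[[str], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the body on success, or None once the URL is given up on."""
    for attempt in range(1, max_attempts + 1):
        try:
            status, body = get(url)
        except OSError:
            # Timeouts and connection errors: treat as retryable.
            status, body = None, ""
        if status == 200 and body.strip():
            return body
        if status is not None and status not in RETRYABLE:
            return None  # e.g. 404, or an empty 200: retrying will not help
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return None
```

Passing get and sleep in as parameters keeps the retry policy testable without touching the network.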

4. Parser / link extractor

The parser decides:

  • what fields to extract
  • which new links are worth enqueuing
  • which links to ignore
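
The link-selection half of that job might look like the sketch below. It uses the standard library's HTMLParser rather than a third-party parser so that it runs without extra packages; collect_links and the wanted callback are names invented for this example.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urldefrag, urljoin, urlsplit


class _AnchorParser(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(html: str, base_url: str, wanted: Callable[[str], bool]) -> List[str]:
    """Return absolute http(s) links accepted by `wanted`, without duplicates."""
    parser = _AnchorParser()
    parser.feed(html)
    kept: List[str] = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, tel: and similar
        if wanted(absolute) and absolute not in kept:
            kept.append(absolute)
    return kept
```

Field extraction would sit next to this function, but the two stay separate so that discovery pages can be parsed for links only.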

5. Storage and state

At minimum store:

  • visited URLs
  • canonicalized URL fingerprints
  • HTTP status
  • last fetch time
  • extracted record count

That is enough to pause, resume, and debug.
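
A minimal way to hold that state is a single SQLite table. The sketch below is one possible layout, not a prescribed schema: the table name, column names, and helper functions are assumptions, and the fingerprint is simply a SHA-256 hash of the already canonicalized URL.

```python
import hashlib
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    fingerprint  TEXT PRIMARY KEY,
    url          TEXT NOT NULL,
    http_status  INTEGER,
    last_fetch   REAL,
    record_count INTEGER
)
"""


def fingerprint(canonical_url: str) -> str:
    """Stable key for a URL that has already been canonicalized."""
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_fetch(conn, url: str, status: int, record_count: int,
                 fetched_at: Optional[float] = None) -> None:
    """Insert or overwrite the row for this URL."""
    when = time.time() if fetched_at is None else fetched_at
    conn.execute(
        "INSERT OR REPLACE INTO pages "
        "(fingerprint, url, http_status, last_fetch, record_count) "
        "VALUES (?, ?, ?, ?, ?)",
        (fingerprint(url), url, status, when, record_count),
    )
    conn.commit()


def last_fetch(conn, url: str) -> Optional[tuple]:
    """Return (http_status, last_fetch, record_count), or None if never fetched."""
    return conn.execute(
        "SELECT http_status, last_fetch, record_count FROM pages WHERE fingerprint = ?",
        (fingerprint(url),),
    ).fetchone()
```

Pointing open_state at a file path instead of ":memory:" is what makes the crawl resumable after a restart.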


How good crawlers avoid wasting requests

The biggest mistake in web crawling is fetching pages just because you can.

Here is how disciplined crawlers stay efficient.

Canonicalize URLs

Treat these as the same when appropriate:

  • tracking parameters like utm_*
  • duplicate slashes
  • #fragment suffixes
  • equivalent sort parameters you do not care about

If you do not normalize URLs, your crawler will revisit the same page under many shapes.
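
A sketch of those normalization rules, using only urllib.parse, is shown below. The function name canonical_url and the IGNORED_PARAMS set are assumptions; which parameters are safe to drop depends on the site, so treat the list as a placeholder.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed not to change page content on the target site.
IGNORED_PARAMS = {"sort", "ref"}


def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    # Collapse duplicate slashes; an empty path becomes "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Drop utm_* and ignored parameters, then sort so order does not matter.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in IGNORED_PARAMS
    )
    # The empty last component removes the #fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), ""))


a = canonical_url("https://Example.com//shop//item?utm_source=x&page=2&sort=asc#reviews")
b = canonical_url("https://example.com/shop/item?page=2")
# a and b are the same string, so the page is fetched once.
```

Note that meaningful parameters such as page are kept; stripping the whole query string would merge every paginated listing into one URL.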

Filter links before enqueueing

Do not add every anchor tag to the queue.

Filter by:

  • domain allowlist
  • path patterns
  • content type expectations
  • depth limit
  • language or country segment
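
Most of these filters fit in one predicate that runs before a link enters the queue. The sketch below covers the host, path, file-type, and depth checks; every constant in it (hosts, path pattern, extensions, depth limit) is a hypothetical value to be replaced with the rules for your target site, and a language or country filter would be one more path check.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules for an example shop; adjust for the real target.
ALLOWED_HOSTS = {"example.com", "www.example.com"}
ALLOWED_PATHS = re.compile(r"^/(category|product)/")
SKIPPED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip", ".css", ".js")
MAX_DEPTH = 3


def should_enqueue(url: str, depth: int) -> bool:
    parts = urlsplit(url)
    path = parts.path.lower()
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname not in ALLOWED_HOSTS:  # hostname is lowercased, or None
        return False
    if depth > MAX_DEPTH:
        return False
    if path.endswith(SKIPPED_EXTENSIONS):    # cheap content-type guess
        return False
    return bool(ALLOWED_PATHS.match(path))
```

The extension check is only a guess at the content type; the fetcher should still verify the Content-Type header of the response.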

Respect change signals

If a site gives you them, use:

  • ETag
  • Last-Modified
  • sitemap lastmod
  • feed timestamps

Those signals are much cheaper than blind refetching.
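
For ETag and Last-Modified, the usual mechanism is an HTTP conditional request: send the stored validators back, and a 304 Not Modified response means the stored copy is still current. The helper names below are assumptions for this sketch, and the header dictionaries stand in for whatever your HTTP client provides.

```python
from typing import Dict, Mapping, Optional, Tuple


def extract_validators(response_headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Pull (etag, last_modified) out of a response, ignoring header case."""
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return lowered.get("etag"), lowered.get("last-modified")


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build request headers that ask the server to skip unchanged content."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def is_unchanged(status: int) -> bool:
    """304 Not Modified: no body was sent, so keep the stored copy."""
    return status == 304


# First fetch: remember the validators. Next fetch: send them back.
etag, modified = extract_validators(
    {"ETag": '"abc123"', "Last-Modified": "Wed, 03 Jan 2024 10:00:00 GMT"}
)
next_request_headers = conditional_headers(etag, modified)
```

Not every server honors these headers, so a crawler still needs a fallback such as comparing a hash of the body.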

Separate discovery from extraction

This is one of the best optimizations.

Example:

  • crawl category pages every hour
  • crawl product detail pages only when they are newly discovered or changed

Because detail pages usually far outnumber listing pages, that alone can cut requests dramatically.
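
One way to express that rule in code is to compare what the listing pages show against what was stored at the last detail fetch. In this sketch the "signature" is any cheap value visible on the listing page, such as a price string or a lastmod date; the function name and the sample data are hypothetical.

```python
from typing import Dict, List


def plan_detail_fetches(listing_snapshot: Dict[str, str], known: Dict[str, str]) -> List[str]:
    """Return detail URLs that are new or whose listing signature changed.

    listing_snapshot: detail URL -> signature seen on the listing page now
    known:            detail URL -> signature stored at the last detail fetch
    """
    return [url for url, signature in listing_snapshot.items() if known.get(url) != signature]


# Hypothetical state: two products already fetched, one price has changed,
# and one product is new.
known = {"/product/1": "19.99", "/product/2": "5.00"}
snapshot = {"/product/1": "19.99", "/product/2": "4.50", "/product/3": "12.00"}
to_fetch = plan_detail_fetches(snapshot, known)
```

Only two of the three detail pages are fetched here; on a large catalog with few daily changes the saving is proportionally larger.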


A small Python crawler skeleton

You do not need Kafka to learn the idea. This small example shows a polite breadth-first crawler with basic dedupe.

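from __future__ import annotations

import time
from collections import deque
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urljoin, urlparse, urlunparse

import requests
from bs4 import BeautifulSoup  # the "lxml" parser used below also needs the lxml package

ALLOWED_HOST = "example.com"
TIMEOUT = (10, 20)   # (connect, read) timeouts in seconds
DELAY_SECONDS = 1.0  # pause between requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})


@dataclass
class QueueItem:
    url: str
    depth: int


def canonicalize(url: str) -> str:
    """Drop the fragment, utm_* parameters, and any trailing slash on the path.

    Other query parameters are kept so that paginated URLs stay distinct.
    """
    parsed = urlparse(url)
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)
         if not k.lower().startswith("utm_")]
    )
    clean = parsed._replace(path=parsed.path.rstrip("/"), query=query, fragment="")
    return urlunparse(clean)


def same_host(url: str) -> bool:
    return urlparse(url).netloc == ALLOWED_HOST


def fetch(url: str) -> str | None:
    """Return the page HTML, or None if the request fails or returns an error status."""
    try:
        r = session.get(url, timeout=TIMEOUT)
        r.raise_for_status()
    except requests.RequestException:
        return None  # one bad page should not abort the whole crawl
    return r.text


def extract_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links = []
    for a in soup.select("a[href]"):
        full = canonicalize(urljoin(base_url, a["href"]))
        if same_host(full):
            links.append(full)
    return links


def crawl(seed_urls: list[str], max_pages: int = 100, max_depth: int = 2) -> list[str]:
    queue = deque(QueueItem(canonicalize(url), 0) for url in seed_urls)
    seen: set[str] = set()
    fetched: list[str] = []

    while queue and len(fetched) < max_pages:
        item = queue.popleft()
        if item.url in seen or item.depth > max_depth:
            continue

        seen.add(item.url)
        html = fetch(item.url)
        if html is not None:
            fetched.append(item.url)
            # Links found at max_depth would be dropped later, so skip them now.
            if item.depth < max_depth:
                for link in extract_links(html, item.url):
                    if link not in seen:
                        queue.append(QueueItem(link, item.depth + 1))

        time.sleep(DELAY_SECONDS)  # politeness: pause after every request, failed or not

    return fetched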

This is not production-ready, but it demonstrates the important split:

  • queue policy
  • fetch policy
  • parse policy

That split is what lets a crawler grow without turning into spaghetti.


Politeness rules that actually matter

Every web crawling guide mentions politeness. Fewer explain what that means operationally.

Use this checklist:

| Rule | Why it matters |
| --- | --- |
| Limit concurrency per host | Prevent accidental overload |
| Add delays or token buckets | Smooth request bursts |
| Cache aggressively during development | Avoid repeated fetches while debugging |
| Respect robots.txt where appropriate for your use case and legal posture | Reduce obvious abuse patterns |
| Retry selectively, not endlessly | Prevent retry storms |
| Validate response bodies | A 200 can still be a block page |
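
The per-host delay rule can be sketched as a small throttle that remembers when each host may next be contacted. This is a simpler mechanism than a token bucket, named plainly as such; the class name HostThrottle and the one-second default are assumptions, and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class HostThrottle:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted; return the delay used."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay:
            self._sleep(delay)
        # The next request to this host must wait a full interval from now.
        self._next_allowed[host] = now + delay + self.min_interval
        return delay
```

Calling throttle.wait(url) immediately before each fetch keeps one slow host from being hammered while other hosts proceed at full speed.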

The hidden one is response validation.

A crawler can waste a large share of its budget fetching login pages, challenge pages, empty pages, and soft-blocks that technically returned 200 OK.
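
A basic body check might look like the sketch below. It is a heuristic, not a reliable block detector: the marker phrases, the minimum length, and the function name are all assumptions, and a legitimate page that happens to mention one of the phrases would be rejected.

```python
from typing import Optional, Tuple

# Phrases assumed to indicate a block or login page (tune per site).
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human", "sign in to continue")


def validate_response(
    status: int,
    body: str,
    required_marker: Optional[str] = None,
    min_length: int = 500,
) -> Tuple[bool, str]:
    """Return (is_valid, reason). A 200 status alone is not enough."""
    if status != 200:
        return False, f"status {status}"
    text = body.lower()
    if len(text.strip()) < min_length:
        return False, "body too short"
    for marker in BLOCK_MARKERS:
        if marker in text:
            return False, f"block marker: {marker}"
    if required_marker and required_marker.lower() not in text:
        return False, "required marker missing"
    return True, "ok"
```

The required_marker argument is often the most useful part: if every real product page contains a known element or phrase, its absence is a strong sign that the response is not the page you asked for.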


How to prioritize the queue

Not all pages deserve the same freshness.

A sensible priority model:

| Page type | Priority | Refresh cadence |
| --- | --- | --- |
| Category / listing page | High | Frequent |
| Newly discovered detail page | High | Immediate |
| Stable evergreen article | Medium | Periodic |
| Old detail page with no changes in months | Low | Rare |

That is why web crawling is not just "follow every link." It is scheduling.
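
The priority model can be turned into a small scheduling function. The intervals below (hourly, weekly, monthly) are placeholder numbers chosen for illustration, since the model above only says frequent, periodic, and rare; the page-type keys and function names are also assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# page type -> (priority, refresh interval); lower priority number = sooner.
# The intervals are illustrative assumptions, not recommendations.
REFRESH_POLICY = {
    "listing": (0, timedelta(hours=1)),
    "new_detail": (0, timedelta(0)),
    "evergreen": (1, timedelta(days=7)),
    "stale_detail": (2, timedelta(days=30)),
}


def next_fetch(page_type: str, last_fetch: Optional[datetime], now: datetime) -> Tuple[int, datetime]:
    """Return (priority, due time) for a page of the given type."""
    priority, interval = REFRESH_POLICY[page_type]
    due = now if last_fetch is None else last_fetch + interval
    return priority, due


def is_due(page_type: str, last_fetch: Optional[datetime], now: datetime) -> bool:
    _priority, due = next_fetch(page_type, last_fetch, now)
    return due <= now
```

Sorting candidate URLs by the (priority, due time) pair gives a queue order that matches the model: fresh listings and new detail pages first, stale detail pages last.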


When a crawler should stop

A crawler without stop conditions will happily spend your whole budget.

Common stop rules:

  • maximum pages
  • maximum depth
  • domain or path boundaries
  • freshness windows
  • no-new-links threshold
  • response-quality threshold

If you define those upfront, you avoid the classic situation where a crawler spends most of its time on low-value archive pages.
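
Several of these stop rules can be checked in one place after every fetch. The sketch below covers maximum pages, a no-new-links threshold, and a response-quality threshold; the class names, field names, and default limits are assumptions, and depth and path boundaries are assumed to be enforced when links are enqueued.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopPolicy:
    max_pages: int = 1000
    max_pages_without_new_links: int = 50
    min_valid_ratio: float = 0.5
    min_sample: int = 20  # do not judge quality on too few pages


@dataclass
class CrawlStats:
    fetched: int = 0
    valid: int = 0
    pages_without_new_links: int = 0

    def record(self, valid: bool, new_links: int) -> None:
        """Update counters after one fetch."""
        self.fetched += 1
        self.valid += int(valid)
        if new_links:
            self.pages_without_new_links = 0
        else:
            self.pages_without_new_links += 1


def stop_reason(stats: CrawlStats, policy: StopPolicy) -> Optional[str]:
    """Return why the crawl should stop, or None to keep going."""
    if stats.fetched >= policy.max_pages:
        return "maximum pages reached"
    if stats.pages_without_new_links >= policy.max_pages_without_new_links:
        return "no new links"
    if stats.fetched >= policy.min_sample and stats.valid / stats.fetched < policy.min_valid_ratio:
        return "response quality too low"
    return None
```

Returning a reason string rather than a boolean makes the crawl log explain itself when a run ends early.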


Where ProxiesAPI fits in a crawler

ProxiesAPI is not the crawl strategy. It is part of the fetch layer.

That means it helps most when:

  • you already know which pages are worth fetching
  • your dedupe and queue are working
  • network instability or blocking becomes the bottleneck

It does not fix:

  • bad seed selection
  • duplicate URLs
  • missing canonicalization
  • poor queue priorities

Those are architecture problems, not proxy problems.


The simplest good web crawling strategy

If you want a practical starting point, do this:

  1. Start from the smallest seed set that covers the target area.
  2. Canonicalize URLs before enqueueing them.
  3. Keep separate logic for discovery pages and detail pages.
  4. Re-fetch only pages that changed or are likely to change.
  5. Add ProxiesAPI only when the fetch layer becomes unreliable.

That is enough to build a crawler that scales much better than a giant pile of ad hoc requests.get() calls.

Web crawling is not about collecting the most URLs. It is about collecting the right URLs with the fewest wasted requests.
