Web Crawling Explained: How to Build Scalable Crawlers Without Wasting Requests
If you search for web crawling, you will find two bad extremes:
- vague definitions that say crawling is just "visiting websites automatically"
- over-engineered architectures that assume you need a mini search engine on day one
Most teams need something in the middle: a crawler that is selective, polite, resumable, and cheap to run.
Here is the simplest useful definition:
Web crawling is the process of systematically discovering and fetching many pages by following links, sitemaps, feeds, or known URL patterns.
That is different from a one-page scraper. A scraper extracts fields from a page. A crawler decides which pages to visit next.
Once your queue and dedupe logic are correct, the biggest crawler problems are usually failed fetches, retries, and blocked requests. ProxiesAPI helps on that layer without forcing you to rebuild the crawler itself.
Web crawling vs web scraping
The terms overlap, but they solve different problems.
| Task | Best tool |
|---|---|
| Pull title, price, and rating from one product page | Scraper |
| Discover all category pages and every product underneath them | Crawler |
| Revisit only changed pages every day | Crawler + scraper |
| Export structured fields from the final HTML | Scraper |
A good mental model:
- the crawler expands your frontier of URLs
- the scraper extracts data from the pages the crawler fetched
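If you skip that separation, you usually waste requests: the crawler fetches pages that hold nothing to extract, or the scraper refetches pages the crawler already downloaded.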
When web crawling is the right choice
Use web crawling when:
- you do not know all the target URLs in advance
- new pages appear continuously
- you need broad coverage across categories, tags, or pagination
- you want regular refreshes instead of a one-time export
Avoid a crawler when:
- you already have a clean API
- a sitemap or feed gives you the exact pages you need
- you only care about 20 known URLs
The fastest crawler is the one you did not need to build.
The minimum crawler architecture
A practical web crawling system has five parts:
- seeds
- frontier / queue
- fetcher
- parser / link extractor
- storage and state
1. Seeds
Seeds are your starting URLs. They might be:
- homepage sections
- category pages
- XML sitemaps
- RSS or Atom feeds
- a known list of seller, product, or article URLs
Good seeds reduce wasted discovery.
2. Frontier / queue
This is the list of URLs still to visit. The frontier should track:
- URL
- depth
- source page
- scheduled time
- priority
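The fields above can be sketched as a small priority frontier. This is a minimal in-memory illustration, not a production queue: the class names `FrontierEntry` and `Frontier`, the method `pop_due`, and the "lower number means more urgent" convention are all assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FrontierEntry:
    url: str
    depth: int
    source: Optional[str]  # page where the link was found; None for seeds
    scheduled_at: float    # earliest time (Unix seconds) the URL may be fetched
    priority: int          # assumed convention: lower number = more urgent


class Frontier:
    """In-memory frontier ordered by priority, then by scheduled time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, float, int, FrontierEntry]] = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, entry: FrontierEntry) -> None:
        key = (entry.priority, entry.scheduled_at, next(self._counter), entry)
        heapq.heappush(self._heap, key)

    def pop_due(self, now: float) -> Optional[FrontierEntry]:
        """Return the most urgent entry whose scheduled time has passed."""
        deferred = []
        found = None
        while self._heap:
            item = heapq.heappop(self._heap)
            if item[1] <= now:
                found = item[3]
                break
            deferred.append(item)  # not due yet, put it back afterwards
        for item in deferred:
            heapq.heappush(self._heap, item)
        return found

    def __len__(self) -> int:
        return len(self._heap)
```

A real crawler would likely persist this queue so it survives restarts, but the same five fields carry over.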
3. Fetcher
The fetcher handles:
- timeouts
- retries
- headers
- proxy routing if needed
- response validation
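One way to keep retries selective is to retry only on errors that are plausibly temporary. The sketch below assumes a hypothetical `get` callable that returns a `(status, body)` pair, so it stays independent of any HTTP library. The set of retryable status codes and the backoff values are assumptions to tune per site.

```python
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (status code, body)

# Assumed set of "probably temporary" statuses.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_retries(
    get: Callable[[str], Response],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Response]:
    """Call get(url) up to max_attempts times with exponential backoff.

    Returns the first non-retryable response (including 404s, which are
    not worth retrying), or None when every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            status, body = get(url)
        except OSError:
            # Network-level failure (connection refused, timeout, ...).
            status, body = None, ""
        if status is not None and status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None
```

With `requests`, the `get` callable could wrap `session.get(url, timeout=...)` and return `(r.status_code, r.text)`; `requests.RequestException` derives from `OSError`, so the same `except` clause would catch it.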
4. Parser / link extractor
The parser decides:
- what fields to extract
- which new links are worth enqueuing
- which links to ignore
5. Storage and state
At minimum store:
- visited URLs
- canonicalized URL fingerprints
- HTTP status
- last fetch time
- extracted record count
That is enough to pause, resume, and debug.
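As an illustration, that state fits in a single SQLite table. The table name `pages`, the column names, and the helper functions below are hypothetical, and SHA-1 is just one reasonable choice of fingerprint.

```python
import hashlib
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    fingerprint  TEXT PRIMARY KEY,  -- hash of the canonical URL
    url          TEXT NOT NULL,
    status       INTEGER,
    fetched_at   REAL,
    record_count INTEGER DEFAULT 0
)
"""


def fingerprint(canonical_url: str) -> str:
    return hashlib.sha1(canonical_url.encode("utf-8")).hexdigest()


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_fetch(
    conn: sqlite3.Connection,
    url: str,
    status: int,
    record_count: int,
    fetched_at: Optional[float] = None,
) -> None:
    """Insert or overwrite the row for this URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?)",
        (fingerprint(url), url, status, fetched_at or time.time(), record_count),
    )
    conn.commit()


def already_visited(conn: sqlite3.Connection, url: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM pages WHERE fingerprint = ?", (fingerprint(url),)
    ).fetchone()
    return row is not None
```

Passing a file path instead of `":memory:"` is what makes the crawl resumable after a restart.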
How good crawlers avoid wasting requests
The biggest mistake in web crawling is fetching pages just because you can.
Here is how disciplined crawlers stay efficient.
Canonicalize URLs
Treat these as the same when appropriate:
- tracking parameters like utm_*
- duplicate slashes
- #fragment suffixes
- equivalent sort parameters you do not care about
If you do not normalize URLs, your crawler will revisit the same page under many shapes.
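A minimal sketch of those normalization rules, using only the standard library. The function name `normalize_url` and the `IGNORED_PARAMS` set are assumptions; which parameters are safe to drop depends entirely on the target site.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
IGNORED_PARAMS = {"sort", "order"}  # hypothetical: params that do not change content


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    # Collapse duplicate slashes in the path; an empty path becomes "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()  # stable parameter order so ?a=1&b=2 equals ?b=2&a=1
    # The empty last element drops the #fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )
```

For example, `normalize_url("https://Example.com//shoes/?utm_source=x&page=2&sort=asc#reviews")` returns `https://example.com/shoes/?page=2`, so the pagination parameter survives while the noise is dropped.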
Filter links before enqueueing
Do not add every anchor tag to the queue.
Filter by:
- domain allowlist
- path patterns
- content type expectations
- depth limit
- language or country segment
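These filters can be combined into one predicate that runs before anything reaches the queue. The host list, path pattern, extension list, and depth limit below are placeholder values chosen for the example, not recommendations.

```python
import re
from urllib.parse import urlsplit

# All values below are illustrative assumptions.
ALLOWED_HOSTS = {"example.com", "www.example.com"}
# Path pattern that also pins the language segment to /en/ or /de/.
ALLOWED_PATHS = re.compile(r"^/(en|de)/(category|product)/")
# Cheap content-type guess: skip URLs that are obviously not HTML.
SKIPPED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip", ".css", ".js")
MAX_DEPTH = 3


def should_enqueue(url: str, depth: int) -> bool:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    if depth > MAX_DEPTH:
        return False
    path = parts.path.lower()
    if path.endswith(SKIPPED_EXTENSIONS):
        return False
    return ALLOWED_PATHS.match(path) is not None
```

Checking the file extension is only a guess at the content type; the fetcher should still confirm the `Content-Type` header on the response.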
Respect change signals
If a site gives you them, use:
- ETag
- Last-Modified
- sitemap lastmod
- feed timestamps
Those signals are much cheaper than blind refetching.
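For ETag and Last-Modified, the mechanism is the HTTP conditional request: send back the validators you stored, and a server that supports them answers 304 Not Modified with no body. A minimal sketch, with hypothetical helper names:

```python
from typing import Dict, Optional


def conditional_headers(
    etag: Optional[str], last_modified: Optional[str]
) -> Dict[str, str]:
    """Build request headers from the validators stored on the last fetch."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def is_unchanged(status_code: int) -> bool:
    """304 means the stored copy is still current; skip parsing entirely."""
    return status_code == 304
```

With `requests`, these would be passed as `session.get(url, headers=conditional_headers(etag, last_modified))`, and the stored values refreshed from `r.headers.get("ETag")` and `r.headers.get("Last-Modified")` after each 200 response. Not every server honors these headers, so treat a 200 as a normal full fetch.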
Separate discovery from extraction
This is one of the best optimizations.
Example:
- crawl category pages every hour
- crawl product detail pages only when they are newly discovered or changed
That alone can cut requests dramatically, because detail pages usually far outnumber the listing pages that link to them.
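One way to sketch the "newly discovered or changed" decision is to compare a change marker taken from the listing page against the marker stored at the last detail fetch. The marker itself is an assumption here: it could be a price, an updated-at string, or a hash of the listing snippet, depending on what the site exposes.

```python
from typing import Dict, List


def detail_pages_to_fetch(
    listing_items: Dict[str, str], known: Dict[str, str]
) -> List[str]:
    """Return detail URLs that are new or whose marker changed.

    listing_items: detail URL -> marker seen on the listing page just now
    known:         detail URL -> marker stored when the page was last fetched
    """
    return [
        url
        for url, marker in listing_items.items()
        if known.get(url) != marker  # missing (new) or different (changed)
    ]
```

Everything not returned by this function is skipped for this cycle, which is where the savings come from.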
A small Python crawler skeleton
You do not need Kafka to learn the idea. This small example shows a polite breadth-first crawler with basic dedupe.
from __future__ import annotations

import time
from collections import deque
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urljoin, urlparse, urlunparse

import requests
from bs4 import BeautifulSoup

ALLOWED_HOST = "example.com"
TIMEOUT = (10, 20)  # (connect, read) timeouts in seconds

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})


@dataclass
class QueueItem:
    url: str
    depth: int


def canonicalize(url: str) -> str:
    """Drop the fragment and utm_* parameters; keep other query parameters."""
    parsed = urlparse(url)
    # Keeping non-tracking parameters means ?page=2 is not merged with page 1.
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parsed.query) if not k.startswith("utm_")]
    )
    clean = parsed._replace(fragment="", query=query)
    return urlunparse(clean).rstrip("/")


def same_host(url: str) -> bool:
    return urlparse(url).netloc == ALLOWED_HOST


def fetch(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def extract_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links = []
    for a in soup.select("a[href]"):
        # Resolve relative hrefs against the page URL before canonicalizing.
        full = canonicalize(urljoin(base_url, a["href"]))
        if same_host(full):
            links.append(full)
    return links


def crawl(seed_urls: list[str], max_pages: int = 100, max_depth: int = 2) -> list[str]:
    queue = deque(QueueItem(canonicalize(url), 0) for url in seed_urls)
    seen: set[str] = set()
    fetched: list[str] = []

    while queue and len(fetched) < max_pages:
        item = queue.popleft()
        if item.url in seen or item.depth > max_depth:
            continue
        seen.add(item.url)

        time.sleep(1.0)  # politeness: pause before every request
        try:
            html = fetch(item.url)
        except requests.RequestException:
            continue  # skip failed pages instead of aborting the whole crawl
        fetched.append(item.url)

        for link in extract_links(html, item.url):
            if link not in seen:
                queue.append(QueueItem(link, item.depth + 1))

    return fetched
This is not production-ready, but it demonstrates the important split:
- queue policy
- fetch policy
- parse policy
That split is what lets a crawler grow without turning into spaghetti.
Politeness rules that actually matter
Every web crawling guide mentions politeness. Fewer explain what that means operationally.
Use this checklist:
| Rule | Why it matters |
|---|---|
| Limit concurrency per host | Prevent accidental overload |
| Add delays or token buckets | Smooth request bursts |
| Cache aggressively during development | Avoid repeated fetches while debugging |
| Respect robots.txt where appropriate for your use case and legal posture | Reduce obvious abuse patterns |
| Retry selectively, not endlessly | Prevent retry storms |
| Validate response bodies | A 200 can still be a block page |
The hidden one is response validation.
Many crawlers waste a large share of their budget fetching login pages, challenge pages, empty pages, and soft blocks that technically returned 200 OK.
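A rough sketch of what response validation can look like. The marker phrases, the minimum body size, and the function name `classify_response` are assumptions; real block pages vary by site, and phrase matching can produce false positives on legitimate pages that mention these words.

```python
# Assumed phrases that often appear on challenge or block pages.
BLOCK_MARKERS = (
    "captcha",
    "access denied",
    "verify you are human",
    "unusual traffic",
)
MIN_BODY_CHARS = 500  # assumed threshold for "suspiciously empty"


def classify_response(status: int, body: str, required_marker: str = "") -> str:
    """Label a response as ok, http_error, empty, blocked, or unexpected.

    required_marker is a snippet every good page of this type should contain,
    for example a CSS class used by the product template.
    """
    if status != 200:
        return "http_error"
    if len(body) < MIN_BODY_CHARS:
        return "empty"
    lowered = body.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return "blocked"
    if required_marker and required_marker not in body:
        return "unexpected"
    return "ok"
```

Only responses labeled `ok` should be parsed or counted as successful fetches; the other labels are useful as metrics for spotting when a site starts blocking.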
How to prioritize the queue
Not all pages deserve the same freshness.
A sensible priority model:
| Page type | Priority | Refresh cadence |
|---|---|---|
| Category / listing page | High | Frequent |
| Newly discovered detail page | High | Immediate |
| Stable evergreen article | Medium | Periodic |
| Old detail page with no changes in months | Low | Rare |
That is why web crawling is not just "follow every link." It is scheduling.
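The table above can be turned into a small scheduling function. The page-type keys, priority numbers, and refresh intervals below are illustrative assumptions standing in for "Frequent", "Periodic", and "Rare"; lower priority numbers are treated as more urgent.

```python
from typing import Optional, Tuple

HOUR = 3600
DAY = 24 * HOUR

# page type -> (priority, refresh interval in seconds); values are assumptions.
REFRESH_POLICY = {
    "listing": (0, HOUR),
    "new_detail": (0, 0),
    "evergreen": (1, 7 * DAY),
    "stale_detail": (2, 30 * DAY),
}


def schedule(
    page_type: str, last_fetched: Optional[float], now: float
) -> Tuple[int, float]:
    """Return (priority, next fetch time) for a page."""
    priority, interval = REFRESH_POLICY[page_type]
    if last_fetched is None:
        return priority, now  # never fetched: eligible immediately
    # Never schedule in the past, even if the page is long overdue.
    return priority, max(now, last_fetched + interval)
```

The returned pair maps directly onto the priority and scheduled-time fields of a frontier entry.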
When a crawler should stop
A crawler without stop conditions will happily spend your whole budget.
Common stop rules:
- maximum pages
- maximum depth
- domain or path boundaries
- freshness windows
- no-new-links threshold
- response-quality threshold
If you define those upfront, you avoid the classic situation where a crawler spends most of its time on low-value archive pages.
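Several of these rules can be checked in one place on every loop iteration. The sketch below covers maximum pages, the no-new-links threshold, and the response-quality threshold; the class names, field names, and default values are assumptions. Depth and domain boundaries are per-URL checks and belong in the link filter instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopRules:
    max_pages: int = 1000
    max_pages_without_new_links: int = 50
    min_ok_ratio: float = 0.5  # stop if fewer than half of responses are usable
    min_sample: int = 20       # do not judge quality on a handful of pages


@dataclass
class CrawlStats:
    fetched: int = 0
    ok: int = 0
    pages_since_new_link: int = 0


def stop_reason(rules: StopRules, stats: CrawlStats) -> Optional[str]:
    """Return why the crawl should stop, or None to keep going."""
    if stats.fetched >= rules.max_pages:
        return "max_pages"
    if stats.pages_since_new_link >= rules.max_pages_without_new_links:
        return "no_new_links"
    if (
        stats.fetched >= rules.min_sample
        and stats.ok / stats.fetched < rules.min_ok_ratio
    ):
        return "low_response_quality"
    return None
```

Logging the returned reason at the end of each run makes it easier to tell a finished crawl from one that was cut off by blocking.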
Where ProxiesAPI fits in a crawler
ProxiesAPI is not the crawl strategy. It is part of the fetch layer.
That means it helps most when:
- you already know which pages are worth fetching
- your dedupe and queue are working
- network instability or blocking becomes the bottleneck
It does not fix:
- bad seed selection
- duplicate URLs
- missing canonicalization
- poor queue priorities
Those are architecture problems, not proxy problems.
The simplest good web crawling strategy
If you want a practical starting point, do this:
- Start from the smallest seed set that covers the target area.
- Canonicalize URLs before enqueueing them.
- Keep separate logic for discovery pages and detail pages.
- Re-fetch only pages that changed or are likely to change.
- Add ProxiesAPI only when the fetch layer becomes unreliable.
That is enough to build a crawler that scales much better than a giant pile of ad hoc requests.get() calls.
Web crawling is not about collecting the most URLs. It is about collecting the right URLs with the fewest wasted requests.