Python Web Crawler Tutorial: Build Your First Crawler (URLs, Robots, Rate Limits)
A web crawler is just a loop:
- take a URL from a queue
- fetch it
- extract links
- add new URLs back into the queue
But in the real world, everything around that loop matters:
- URLs duplicate endlessly (tracking params, fragments, redirect loops)
- sites publish robots.txt rules you should respect
- rate limits and politeness keep you from getting blocked
- transient network errors happen constantly
In this tutorial you’ll build a small-but-serious crawler in Python that covers the fundamentals:
- URL normalization + canonicalization
- domain scoping
- robots.txt checks
- a queue with persistent storage (SQLite)
- rate limiting + retries/backoff
- optional ProxiesAPI integration at the fetch layer
By the end you’ll have a crawler you can extend into a site auditor, docs indexer, price monitor, or content discovery bot.
Once your crawler grows beyond a handful of pages, network failures and throttling become the bottleneck. ProxiesAPI helps keep fetches stable (rotation, retries, higher success rates) while your crawler logic stays clean.
Before you crawl: ethics + scope
A crawler can create real load. Set constraints up front:
- one domain only (at first)
- max pages per run
- delay between requests
- respect robots.txt
- identify yourself (User-Agent)
Also: don’t crawl pages behind logins or paywalls unless you have permission.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll also use the standard library:
- sqlite3 for storage
- urllib.parse for URL handling
- robotparser for robots.txt
Architecture (simple and extendable)
We’ll structure the crawler in four layers:
- Frontier (queue): which URLs to visit next
- Fetcher: HTTP requests with retries
- Parser: extract links and any data you care about
- Storage: keep visited state so you can resume
Step 1: URL normalization (stop duplicates)
URL normalization is what prevents your crawler from exploding:
- remove fragments (#section)
- drop common tracking params (utm_*)
- normalize scheme/hostname casing
- resolve relative links
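These rules map directly onto urllib.parse primitives. A quick sanity check of the two that trip people up most (relative resolution and fragments), using a made-up example.com URL:

```python
from urllib.parse import urljoin, urlparse

# Relative links resolve against the page they were found on
assert urljoin("https://example.com/docs/a.html", "../b.html") == "https://example.com/b.html"

# Fragments point within a page; they never identify a new document
assert urlparse("https://example.com/page#install").fragment == "install"
```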
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

TRACKING_KEYS_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}

def normalize_url(base_url: str, href: str) -> str | None:
    if not href:
        return None
    # Resolve relative URLs
    abs_url = urljoin(base_url, href)
    p = urlparse(abs_url)
    if p.scheme not in ("http", "https"):
        return None
    # Strip fragments
    fragmentless = p._replace(fragment="")
    # Remove common tracking parameters
    q = []
    for k, v in parse_qsl(fragmentless.query, keep_blank_values=True):
        lk = k.lower()
        if lk in TRACKING_KEYS:
            continue
        if any(lk.startswith(pref) for pref in TRACKING_KEYS_PREFIXES):
            continue
        q.append((k, v))
    cleaned = fragmentless._replace(query=urlencode(q, doseq=True))
    # Normalize host casing
    netloc = cleaned.netloc.lower()
    cleaned = cleaned._replace(netloc=netloc)
    return urlunparse(cleaned)
Step 2: Robots.txt (be polite by default)
Python includes a robots parser in the standard library.
import urllib.robotparser

def robots_parser_for(site_root: str, user_agent: str):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(site_root.rstrip("/") + "/robots.txt")
    try:
        rp.read()
    except Exception:
        # If robots.txt fails to load, you decide policy.
        # Conservative approach: treat as allowed, but keep rate limits strict.
        # (Without this, an unread parser answers False to every can_fetch call.)
        rp.allow_all = True
    return rp
Later we’ll call:
rp.can_fetch(USER_AGENT, url)
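To see the parser in action without hitting the network, you can feed robots.txt lines to parse() directly (the User-agent string and paths here are invented for illustration):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts robots.txt lines directly, handy for offline tests
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
assert rp.can_fetch("MyBot/1.0", "https://example.com/public/page")
assert not rp.can_fetch("MyBot/1.0", "https://example.com/private/page")
```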
Step 3: Fetcher with retries + optional ProxiesAPI
This is the single place to integrate ProxiesAPI.
import os
import time
import random
import requests

TIMEOUT = (10, 30)
USER_AGENT = "ProxiesAPI-Guides-Crawler/1.0 (+https://proxiesapi.com)"
PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

session = requests.Session()

def fetch(url: str, *, use_proxiesapi: bool = False, max_retries: int = 4) -> str:
    last_err = None
    headers = {
        "User-Agent": USER_AGENT,
        "Accept": "text/html,application/xhtml+xml",
    }
    for attempt in range(1, max_retries + 1):
        try:
            if use_proxiesapi:
                if not PROXIESAPI_KEY:
                    raise RuntimeError("Missing PROXIESAPI_KEY")
                r = session.get(
                    "https://api.proxiesapi.com",
                    params={"auth_key": PROXIESAPI_KEY, "url": url},
                    headers=headers,
                    timeout=TIMEOUT,
                )
            else:
                r = session.get(url, headers=headers, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            if attempt < max_retries:
                # Exponential backoff with jitter, capped at 20s
                time.sleep(min(20, (2 ** (attempt - 1)) + random.random()))
    raise RuntimeError(f"fetch failed: {url}") from last_err
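The sleep in the retry loop is exponential backoff with jitter. Factored out into a helper (a sketch mirroring the same formula, with the helper name my own), the schedule is easy to inspect:

```python
import random

def backoff_delay(attempt: int, cap: float = 20.0) -> float:
    # 2^(attempt-1) seconds plus up to 1s of random jitter, capped
    return min(cap, (2 ** (attempt - 1)) + random.random())

# Base delays grow 1, 2, 4, 8 seconds before jitter
delays = [backoff_delay(a) for a in range(1, 5)]
assert all(delays[i] >= 2 ** i for i in range(4))
```

The jitter matters: without it, many workers that fail together retry together, hammering the server in synchronized waves.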
Step 4: Parse links from HTML
Keep the parser boring:
- extract <a href>
- normalize
- filter by domain scope
from bs4 import BeautifulSoup

def extract_links(page_url: str, html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        n = normalize_url(page_url, href)
        if n:
            links.append(n)
    return links
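If you'd rather avoid third-party parsers, the stdlib html.parser can do the minimal `<a href>` walk. A rough equivalent (less robust than BeautifulSoup against broken markup; the class name is my own):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects absolute hrefs from <a> tags as the parser streams tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

c = LinkCollector("https://example.com/docs/")
c.feed('<p><a href="intro.html">Intro</a> <a href="/about">About</a></p>')
assert c.links == ["https://example.com/docs/intro.html", "https://example.com/about"]
```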
Step 5: Persisted queue with SQLite (resume safely)
A crawler without persistence is a one-off script. SQLite makes it resumable.
Schema:
urls(url PRIMARY KEY, status, depth, last_error, fetched_at)
Status values:
- queued
- fetching
- done
- error
import sqlite3
from datetime import datetime

def db_connect(path: str = "crawler.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS urls (
            url TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            depth INTEGER NOT NULL,
            last_error TEXT,
            fetched_at TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_urls_status ON urls(status)")
    return conn

def enqueue(conn, url: str, depth: int):
    conn.execute(
        "INSERT OR IGNORE INTO urls(url,status,depth) VALUES(?,?,?)",
        (url, "queued", depth),
    )

def next_queued(conn):
    row = conn.execute(
        "SELECT url, depth FROM urls WHERE status='queued' ORDER BY depth ASC LIMIT 1"
    ).fetchone()
    return row

def mark(conn, url: str, status: str, err: str | None = None):
    conn.execute(
        "UPDATE urls SET status=?, last_error=?, fetched_at=? WHERE url=?",
        (status, err, datetime.utcnow().isoformat(), url),
    )
Step 6: Rate limiting + crawl loop
We’ll implement politeness delay:
- a global delay between requests
- a per-domain delay can be added later
We’ll also limit:
- max_pages
- max_depth
from urllib.parse import urlparse

def crawl(
    start_url: str,
    *,
    max_pages: int = 200,
    max_depth: int = 3,
    delay_s: float = 1.0,
    use_proxiesapi: bool = False,
):
    conn = db_connect()
    start = normalize_url(start_url, start_url)
    if not start:
        raise ValueError("Invalid start URL")
    root = urlparse(start)
    site_root = f"{root.scheme}://{root.netloc}"
    rp = robots_parser_for(site_root, USER_AGENT)
    enqueue(conn, start, depth=0)
    conn.commit()
    fetched = 0
    while fetched < max_pages:
        nxt = next_queued(conn)
        if not nxt:
            break
        url, depth = nxt
        if depth > max_depth:
            mark(conn, url, "done", err="max_depth")
            conn.commit()
            continue
        if not rp.can_fetch(USER_AGENT, url):
            mark(conn, url, "done", err="robots_disallow")
            conn.commit()
            continue
        mark(conn, url, "fetching")
        conn.commit()
        try:
            html = fetch(url, use_proxiesapi=use_proxiesapi)
            links = extract_links(url, html)
            # Scope: stay on the same host
            for link in links:
                if urlparse(link).netloc != root.netloc:
                    continue
                enqueue(conn, link, depth=depth + 1)
            mark(conn, url, "done")
            conn.commit()
            fetched += 1
            print("done", fetched, url, "new_links", len(links))
        except Exception as e:
            mark(conn, url, "error", err=str(e)[:500])
            conn.commit()
        time.sleep(delay_s)
    print("crawl finished. fetched", fetched)
Run it:
if __name__ == "__main__":
    crawl(
        "https://example.com",
        max_pages=100,
        max_depth=2,
        delay_s=1.5,
        use_proxiesapi=False,
    )
Comparison: crawler vs scraper (quick mental model)
- Crawler: discovers URLs (graph traversal)
- Scraper: extracts structured fields from known pages
Most real projects combine both:
- crawler discovers product/detail pages
- scraper extracts price/title/etc.
Common upgrades (what to do next)
- Per-host rate limits (token bucket)
- Content-type filtering (skip PDFs/images)
- URL allow/deny patterns (only /docs/)
- Sitemaps: ingest sitemap.xml before crawling
- Incremental crawl: re-check only changed pages
- Storage: store HTML hashes + extracted data
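As a starting point for the per-host rate limit upgrade, here's a minimal token-bucket sketch (class name and parameters are my own, not from any library): each host gets a bucket that refills at a steady rate, allowing short bursts up to its capacity.

```python
import time

class TokenBucket:
    """Per-host rate limiter: allows bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self) -> bool:
        # Refill based on elapsed time, then try to spend one token
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)
assert bucket.acquire()      # burst of 2 is allowed
assert bucket.acquire()
assert not bucket.acquire()  # bucket drained, caller must wait
```

In the crawl loop you'd keep a dict mapping hostname to bucket and sleep (or requeue the URL) whenever acquire() returns False.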
Where ProxiesAPI helps (honestly)
For a single small domain, you may not need proxies at all.
But as soon as you crawl:
- multiple sites,
- higher request volume,
- or targets with strict throttling,
…the fetch layer becomes the failure point.
ProxiesAPI helps keep that layer consistent (retries, rotation, higher success rates), so your crawler can focus on correctness: URL logic, robots, and data quality.