Python Web Crawler Tutorial: Build Your First Crawler (URLs, Robots, Rate Limits)
A web crawler is just a loop:
- take a URL from a queue
- fetch it
- extract links
- add new URLs back into the queue
But in the real world, everything around that loop matters:
- URLs duplicate endlessly (tracking params, fragments, redirect loops)
- sites publish robots.txt rules you should respect
- rate limits and politeness keep you from getting blocked
- transient network errors happen constantly
In this tutorial you’ll build a small-but-serious crawler in Python that covers the fundamentals:
- URL normalization + canonicalization
- domain scoping
- robots.txt checks
- a queue with persistent storage (SQLite)
- rate limiting + retries/backoff
- optional ProxiesAPI integration at the fetch layer
By the end you’ll have a crawler you can extend into a site auditor, docs indexer, price monitor, or content discovery bot.
Once your crawler grows beyond a handful of pages, network failures and throttling become the bottleneck. ProxiesAPI helps keep fetches stable (rotation, retries, higher success rates) while your crawler logic stays clean.
Before you crawl: ethics + scope
A crawler can create real load. Set constraints up front:
- one domain only (at first)
- max pages per run
- delay between requests
- respect robots.txt
- identify yourself (User-Agent)
Also: don’t crawl pages behind logins or paywalls unless you have permission.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll also use the standard library:
- sqlite3 for storage
- urllib.parse for URL handling
- robotparser for robots.txt
Architecture (simple and extendable)
We’ll structure the crawler in four layers:
- Frontier (queue): which URLs to visit next
- Fetcher: HTTP requests with retries
- Parser: extract links and any data you care about
- Storage: keep visited state so you can resume
Step 1: URL normalization (stop duplicates)
URL normalization is what prevents your crawler from exploding:
- remove fragments (#section)
- drop common tracking params (utm_*)
- normalize scheme/hostname casing
- resolve relative links
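These rules map directly onto urllib.parse primitives. A quick sanity check of the two that trip people up most (relative resolution and fragments), using a made-up example.com URL:

```python
from urllib.parse import urljoin, urlparse

# Relative links resolve against the page they were found on
assert urljoin("https://example.com/docs/a.html", "../b.html") == "https://example.com/b.html"

# Fragments point within a page; they never identify a new document
assert urlparse("https://example.com/page#install").fragment == "install"
```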
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

TRACKING_KEYS_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}

def normalize_url(base_url: str, href: str) -> str | None:
    if not href:
        return None
    # Resolve relative URLs
    abs_url = urljoin(base_url, href)
    p = urlparse(abs_url)
    if p.scheme not in ("http", "https"):
        return None
    # Strip fragments
    fragmentless = p._replace(fragment="")
    # Remove common tracking parameters
    q = []
    for k, v in parse_qsl(fragmentless.query, keep_blank_values=True):
        lk = k.lower()
        if lk in TRACKING_KEYS:
            continue
        if any(lk.startswith(pref) for pref in TRACKING_KEYS_PREFIXES):
            continue
        q.append((k, v))
    cleaned = fragmentless._replace(query=urlencode(q, doseq=True))
    # Normalize host casing
    netloc = cleaned.netloc.lower()
    cleaned = cleaned._replace(netloc=netloc)
    return urlunparse(cleaned)
Step 2: Robots.txt (be polite by default)
Python includes a robots parser in the standard library.
import urllib.robotparser

def robots_parser_for(site_root: str, user_agent: str):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(site_root.rstrip("/") + "/robots.txt")
    try:
        rp.read()
    except Exception:
        # If robots.txt fails to load, you decide policy.
        # Conservative approach: treat as allowed, but keep rate limits strict.
        # (Without this, an unread parser answers False to every can_fetch call.)
        rp.allow_all = True
    return rp
Later we’ll call:
rp.can_fetch(USER_AGENT, url)
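To see the parser in action without hitting the network, you can feed robots.txt lines to parse() directly (the User-agent string and paths here are invented for illustration):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts robots.txt lines directly, handy for offline tests
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
assert rp.can_fetch("MyBot/1.0", "https://example.com/public/page")
assert not rp.can_fetch("MyBot/1.0", "https://example.com/private/page")
```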
Step 3: Fetcher with retries + optional ProxiesAPI
This is the single place to integrate ProxiesAPI.
import os
import time
import random
import requests

TIMEOUT = (10, 30)
USER_AGENT = "ProxiesAPI-Guides-Crawler/1.0 (+https://proxiesapi.com)"
PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

session = requests.Session()

def fetch(url: str, *, use_proxiesapi: bool = False, max_retries: int = 4) -> str:
    last_err = None
    headers = {
        "User-Agent": USER_AGENT,
        "Accept": "text/html,application/xhtml+xml",
    }
    for attempt in range(1, max_retries + 1):
        try:
            if use_proxiesapi:
                if not PROXIESAPI_KEY:
                    raise RuntimeError("Missing PROXIESAPI_KEY")
                r = session.get(
                    "https://api.proxiesapi.com",
                    params={"auth_key": PROXIESAPI_KEY, "url": url},
                    headers=headers,
                    timeout=TIMEOUT,
                )
            else:
                r = session.get(url, headers=headers, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            if attempt < max_retries:
                # Exponential backoff with jitter, capped at 20s
                time.sleep(min(20, (2 ** (attempt - 1)) + random.random()))
    raise RuntimeError(f"fetch failed: {url}") from last_err
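The sleep in the retry loop is exponential backoff with jitter. Factored out into a helper (a sketch mirroring the same formula, with the helper name my own), the schedule is easy to inspect:

```python
import random

def backoff_delay(attempt: int, cap: float = 20.0) -> float:
    # 2^(attempt-1) seconds plus up to 1s of random jitter, capped
    return min(cap, (2 ** (attempt - 1)) + random.random())

# Base delays grow 1, 2, 4, 8 seconds before jitter
delays = [backoff_delay(a) for a in range(1, 5)]
assert all(delays[i] >= 2 ** i for i in range(4))
```

The jitter matters: without it, many workers that fail together retry together, hammering the server in synchronized waves.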
Step 4: Parse links from HTML
Keep the parser boring:
- extract <a href>
- normalize
- filter by domain scope
from bs4 import BeautifulSoup

def extract_links(page_url: str, html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        n = normalize_url(page_url, href)
        if n:
            links.append(n)
    return links
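If you'd rather avoid third-party parsers, the stdlib html.parser can do the minimal `<a href>` walk. A rough equivalent (less robust than BeautifulSoup against broken markup; the class name is my own):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects absolute hrefs from <a> tags as the parser streams tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

c = LinkCollector("https://example.com/docs/")
c.feed('<p><a href="intro.html">Intro</a> <a href="/about">About</a></p>')
assert c.links == ["https://example.com/docs/intro.html", "https://example.com/about"]
```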
Step 5: Persisted queue with SQLite (resume safely)
A crawler without persistence is a one-off script. SQLite makes it resumable.
Schema:
urls(url PRIMARY KEY, status, depth, last_error, fetched_at)
Status values:
- queued
- fetching
- done
- error
import sqlite3
from datetime import datetime

def db_connect(path: str = "crawler.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS urls (
            url TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            depth INTEGER NOT NULL,
            last_error TEXT,
            fetched_at TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_urls_status ON urls(status)")
    return conn

def enqueue(conn, url: str, depth: int):
    conn.execute(
        "INSERT OR IGNORE INTO urls(url,status,depth) VALUES(?,?,?)",
        (url, "queued", depth),
    )

def next_queued(conn):
    row = conn.execute(
        "SELECT url, depth FROM urls WHERE status='queued' ORDER BY depth ASC LIMIT 1"
    ).fetchone()
    return row

def mark(conn, url: str, status: str, err: str | None = None):
    conn.execute(
        "UPDATE urls SET status=?, last_error=?, fetched_at=? WHERE url=?",
        (status, err, datetime.utcnow().isoformat(), url),
    )
Step 6: Rate limiting + crawl loop
We’ll implement politeness delay:
- a global delay between requests
- a per-domain delay can be added later
We’ll also limit:
- max_pages
- max_depth
from urllib.parse import urlparse

def crawl(
    start_url: str,
    *,
    max_pages: int = 200,
    max_depth: int = 3,
    delay_s: float = 1.0,
    use_proxiesapi: bool = False,
):
    conn = db_connect()
    start = normalize_url(start_url, start_url)
    if not start:
        raise ValueError("Invalid start URL")
    root = urlparse(start)
    site_root = f"{root.scheme}://{root.netloc}"
    rp = robots_parser_for(site_root, USER_AGENT)
    enqueue(conn, start, depth=0)
    conn.commit()
    fetched = 0
    while fetched < max_pages:
        nxt = next_queued(conn)
        if not nxt:
            break
        url, depth = nxt
        if depth > max_depth:
            mark(conn, url, "done", err="max_depth")
            conn.commit()
            continue
        if not rp.can_fetch(USER_AGENT, url):
            mark(conn, url, "done", err="robots_disallow")
            conn.commit()
            continue
        mark(conn, url, "fetching")
        conn.commit()
        try:
            html = fetch(url, use_proxiesapi=use_proxiesapi)
            links = extract_links(url, html)
            # Scope: stay on the same host
            for link in links:
                if urlparse(link).netloc != root.netloc:
                    continue
                enqueue(conn, link, depth=depth + 1)
            mark(conn, url, "done")
            conn.commit()
            fetched += 1
            print("done", fetched, url, "new_links", len(links))
        except Exception as e:
            mark(conn, url, "error", err=str(e)[:500])
            conn.commit()
        time.sleep(delay_s)
    print("crawl finished. fetched", fetched)
Run it:
if __name__ == "__main__":
    crawl(
        "https://example.com",
        max_pages=100,
        max_depth=2,
        delay_s=1.5,
        use_proxiesapi=False,
    )
Comparison: crawler vs scraper (quick mental model)
- Crawler: discovers URLs (graph traversal)
- Scraper: extracts structured fields from known pages
Most real projects combine both:
- crawler discovers product/detail pages
- scraper extracts price/title/etc.
Common upgrades (what to do next)
- Per-host rate limits (token bucket)
- Content-type filtering (skip PDFs/images)
- URL allow/deny patterns (only /docs/)
- Sitemaps: ingest sitemap.xml before crawling
- Incremental crawl: re-check only changed pages
- Storage: store HTML hashes + extracted data
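As a starting point for the per-host rate limit upgrade, here's a minimal token-bucket sketch (class name and parameters are my own, not from any library): each host gets a bucket that refills at a steady rate, allowing short bursts up to its capacity.

```python
import time

class TokenBucket:
    """Per-host rate limiter: allows bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self) -> bool:
        # Refill based on elapsed time, then try to spend one token
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)
assert bucket.acquire()      # burst of 2 is allowed
assert bucket.acquire()
assert not bucket.acquire()  # bucket drained, caller must wait
```

In the crawl loop you'd keep a dict mapping hostname to bucket and sleep (or requeue the URL) whenever acquire() returns False.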
Where ProxiesAPI helps (honestly)
For a single small domain, you may not need proxies at all.
But as soon as you crawl:
- multiple sites,
- higher request volume,
- or targets with strict throttling,
…the fetch layer becomes the failure point.
ProxiesAPI helps keep that layer consistent (retries, rotation, higher success rates), so your crawler can focus on correctness: URL logic, robots, and data quality.