Python Web Crawler Tutorial: Build Your First Crawler (URLs, Robots, Rate Limits)

A web crawler is just a loop:

  1. take a URL from a queue
  2. fetch it
  3. extract links
  4. add new URLs back into the queue
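That loop can be sketched in a few lines. Here `PAGES` is a toy in-memory "web" standing in for fetch + link extraction, so the snippet runs without any network access:

```python
from collections import deque

# Toy "web": each URL maps to the links found on that page.
# This stands in for the fetch + extract steps.
PAGES = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": ["/"],
}

def crawl_loop(start: str) -> list[str]:
    """The four-step loop: pop a URL, 'fetch' it, extract links, enqueue new ones."""
    queue = deque([start])
    seen = {start}
    visited = []
    while queue:
        url = queue.popleft()          # 1. take a URL from the queue
        links = PAGES.get(url, [])     # 2-3. fetch the page + extract links (simulated)
        visited.append(url)
        for link in links:             # 4. add new URLs back into the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl_loop("/"))  # ['/', '/about', '/blog', '/blog/post-1']
```

Using a deque gives breadth-first order; swap it for a stack and you get depth-first. Everything in the rest of this tutorial hardens this loop.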

But in the real world, everything around that loop matters:

  • URLs duplicate endlessly (tracking params, fragments, redirect loops)
  • sites publish robots.txt rules you should respect
  • rate limits and politeness keep you from getting blocked
  • transient network errors happen constantly

In this tutorial you’ll build a small-but-serious crawler in Python that covers the fundamentals:

  • URL normalization + canonicalization
  • domain scoping
  • robots.txt checks
  • a queue with persistent storage (SQLite)
  • rate limiting + retries/backoff
  • optional ProxiesAPI integration at the fetch layer

By the end you’ll have a crawler you can extend into a site auditor, docs indexer, price monitor, or content discovery bot.

Make your crawler resilient with ProxiesAPI

Once your crawler grows beyond a handful of pages, network failures and throttling become the bottleneck. ProxiesAPI helps keep fetches stable (rotation, retries, higher success rates) while your crawler logic stays clean.


Before you crawl: ethics + scope

A crawler can create real load. Set constraints up front:

  • one domain only (at first)
  • max pages per run
  • delay between requests
  • respect robots.txt
  • identify yourself (User-Agent)

Also: don’t crawl pages behind logins or paywalls unless you have permission.
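One way to keep yourself honest is to write those constraints down as a single config object. The names below are illustrative (not part of the crawler we build later), but each field maps to one constraint above:

```python
from dataclasses import dataclass

# Illustrative sketch: a frozen config object encoding the constraints above.
@dataclass(frozen=True)
class CrawlPolicy:
    allowed_domain: str                  # one domain only (at first)
    max_pages: int = 200                 # max pages per run
    delay_s: float = 1.0                 # delay between requests
    respect_robots: bool = True          # respect robots.txt
    user_agent: str = "MyCrawler/1.0 (+https://example.com/bot)"  # identify yourself

policy = CrawlPolicy(allowed_domain="example.com")
print(policy)
```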


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll also use the standard library:

  • sqlite3 for storage
  • urllib.parse for URL handling
  • urllib.robotparser for robots.txt

Architecture (simple and extendable)

We’ll structure the crawler in four layers:

  1. Frontier (queue): which URLs to visit next
  2. Fetcher: HTTP requests with retries
  3. Parser: extract links and any data you care about
  4. Storage: keep visited state so you can resume

Step 1: URL normalization (stop duplicates)

URL normalization is what prevents your crawler from exploding:

  • remove fragments (#section)
  • drop common tracking params (utm_*)
  • normalize scheme/hostname casing
  • resolve relative links

from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

TRACKING_KEYS_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}


def normalize_url(base_url: str, href: str) -> str | None:
    if not href:
        return None

    # Resolve relative URLs
    abs_url = urljoin(base_url, href)
    p = urlparse(abs_url)

    if p.scheme not in ("http", "https"):
        return None

    # Strip fragments
    fragmentless = p._replace(fragment="")

    # Remove common tracking parameters
    q = []
    for k, v in parse_qsl(fragmentless.query, keep_blank_values=True):
        lk = k.lower()
        if lk in TRACKING_KEYS:
            continue
        if any(lk.startswith(pref) for pref in TRACKING_KEYS_PREFIXES):
            continue
        q.append((k, v))

    cleaned = fragmentless._replace(query=urlencode(q, doseq=True))

    # Normalize host casing
    netloc = cleaned.netloc.lower()
    cleaned = cleaned._replace(netloc=netloc)

    return urlunparse(cleaned)
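To sanity-check the behavior, here is a condensed, standalone repeat of the same steps (resolve, strip the fragment, drop utm_*/gclid/fbclid params, lowercase the host) with an example input:

```python
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

# Condensed version of normalize_url, for a quick standalone check.
def normalize(base: str, href: str) -> str:
    p = urlparse(urljoin(base, href))._replace(fragment="")
    q = [(k, v) for k, v in parse_qsl(p.query, keep_blank_values=True)
         if k.lower() not in {"gclid", "fbclid"} and not k.lower().startswith("utm_")]
    p = p._replace(query=urlencode(q, doseq=True), netloc=p.netloc.lower())
    return urlunparse(p)

print(normalize("https://Example.com/blog/", "post?id=7&utm_source=x#top"))
# → https://example.com/blog/post?id=7
```

The relative href is resolved against the page it was found on, the tracking param and fragment are gone, and the host is lowercased, so duplicates collapse to one canonical key.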

Step 2: Robots.txt (be polite by default)

Python includes a robots parser in the standard library.

import urllib.robotparser


def robots_parser_for(site_root: str, user_agent: str):
    # user_agent is unused here (can_fetch takes it later), but keeping it
    # in the signature leaves room for per-agent policies.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(site_root.rstrip("/") + "/robots.txt")
    try:
        rp.read()
    except Exception:
        # If robots.txt fails to load, you decide the policy.
        # Permissive: treat as allowed, but keep rate limits strict.
        # Conservative: refuse to crawl until robots.txt is reachable.
        pass
    return rp

Later we’ll call:

rp.can_fetch(USER_AGENT, url)
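You can check the parser's behavior offline: `RobotFileParser.parse()` accepts robots.txt lines directly, so no network call is needed. The rules below are made up for the demonstration:

```python
import urllib.robotparser

# Parse a hand-written robots.txt, then test can_fetch without any HTTP request.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("MyBot", "https://example.com/public/page"))     # True
print(rp.can_fetch("MyBot", "https://example.com/private/secret"))  # False
print(rp.crawl_delay("MyBot"))                                      # 2
```

If a site publishes a Crawl-delay, it's polite to use it as your minimum delay even though `can_fetch` alone won't enforce it.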

Step 3: Fetcher with retries + optional ProxiesAPI

This is the single place to integrate ProxiesAPI.

import os
import time
import random
import requests

TIMEOUT = (10, 30)
USER_AGENT = "ProxiesAPI-Guides-Crawler/1.0 (+https://proxiesapi.com)"

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

session = requests.Session()


def fetch(url: str, *, use_proxiesapi: bool = False, max_retries: int = 4) -> str:
    last_err = None

    for attempt in range(1, max_retries + 1):
        try:
            if use_proxiesapi:
                if not PROXIESAPI_KEY:
                    raise RuntimeError("Missing PROXIESAPI_KEY")

                r = session.get(
                    "https://api.proxiesapi.com",
                    params={
                        "auth_key": PROXIESAPI_KEY,
                        "url": url,
                    },
                    headers={
                        "User-Agent": USER_AGENT,
                        "Accept": "text/html,application/xhtml+xml",
                    },
                    timeout=TIMEOUT,
                )
            else:
                r = session.get(
                    url,
                    headers={
                        "User-Agent": USER_AGENT,
                        "Accept": "text/html,application/xhtml+xml",
                    },
                    timeout=TIMEOUT,
                )

            r.raise_for_status()
            return r.text

        except Exception as e:
            last_err = e
            if attempt < max_retries:
                # exponential backoff with jitter, capped at 20s;
                # no sleep after the final attempt
                time.sleep(min(20, (2 ** (attempt - 1)) + random.random()))

    raise RuntimeError(f"fetch failed: {url}") from last_err
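The backoff schedule is worth making inspectable. This helper reproduces the fetcher's delay formula (exponential base delays of 1s, 2s, 4s, ... plus up to one second of jitter, capped at 20s) so you can see what a given retry budget costs in wall-clock time:

```python
import random

# The fetcher's retry delays, computed up front: 2**(attempt-1) seconds
# plus jitter in [0, 1), clamped to a 20-second ceiling.
def backoff_delays(max_retries: int = 4, cap: float = 20.0) -> list[float]:
    return [min(cap, (2 ** (attempt - 1)) + random.random())
            for attempt in range(1, max_retries + 1)]

delays = backoff_delays(6)
print([round(d, 2) for d in delays])
# base delays are 1, 2, 4, 8, 16, 32 → the last one is clamped to the 20s cap
```

The jitter matters: without it, many crawler instances that fail at the same moment all retry at the same moment, hammering the server in lockstep.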

Step 4: Parser (extract links)

Keep the parser boring:

  • extract <a href>
  • normalize
  • filter by domain scope

from bs4 import BeautifulSoup


def extract_links(page_url: str, html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")

    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        n = normalize_url(page_url, href)
        if n:
            links.append(n)

    return links

Step 5: Persisted queue with SQLite (resume safely)

A crawler without persistence is a one-off script. SQLite makes it resumable.

Schema:

  • urls(url PRIMARY KEY, status, depth, last_error, fetched_at)

Status values:

  • queued
  • fetching
  • done
  • error

import sqlite3
from datetime import datetime, timezone


def db_connect(path: str = "crawler.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS urls (
            url TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            depth INTEGER NOT NULL,
            last_error TEXT,
            fetched_at TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_urls_status ON urls(status)")
    return conn


def enqueue(conn, url: str, depth: int):
    conn.execute(
        "INSERT OR IGNORE INTO urls(url,status,depth) VALUES(?,?,?)",
        (url, "queued", depth),
    )


def next_queued(conn):
    row = conn.execute(
        "SELECT url, depth FROM urls WHERE status='queued' ORDER BY depth ASC LIMIT 1"
    ).fetchone()
    return row


def mark(conn, url: str, status: str, err: str | None = None):
    conn.execute(
        "UPDATE urls SET status=?, last_error=?, fetched_at=? WHERE url=?",
        # timezone-aware timestamp (datetime.utcnow() is deprecated)
        (status, err, datetime.now(timezone.utc).isoformat(), url),
    )

Step 6: Rate limiting + crawl loop

We’ll implement politeness delay:

  • a global delay between requests
  • a per-domain delay can be added later

We’ll also limit:

  • max_pages
  • max_depth

from urllib.parse import urlparse


def crawl(
    start_url: str,
    *,
    max_pages: int = 200,
    max_depth: int = 3,
    delay_s: float = 1.0,
    use_proxiesapi: bool = False,
):
    conn = db_connect()

    start = normalize_url(start_url, start_url)
    if not start:
        raise ValueError("Invalid start URL")

    root = urlparse(start)
    site_root = f"{root.scheme}://{root.netloc}"
    rp = robots_parser_for(site_root, USER_AGENT)

    enqueue(conn, start, depth=0)
    conn.commit()

    fetched = 0

    while fetched < max_pages:
        nxt = next_queued(conn)
        if not nxt:
            break

        url, depth = nxt

        if depth > max_depth:
            mark(conn, url, "done", err="max_depth")
            conn.commit()
            continue

        if not rp.can_fetch(USER_AGENT, url):
            mark(conn, url, "done", err="robots_disallow")
            conn.commit()
            continue

        mark(conn, url, "fetching")
        conn.commit()

        try:
            html = fetch(url, use_proxiesapi=use_proxiesapi)
            links = extract_links(url, html)

            # scope: stay on same host
            for link in links:
                if urlparse(link).netloc != root.netloc:
                    continue
                enqueue(conn, link, depth=depth + 1)

            mark(conn, url, "done")
            conn.commit()

            fetched += 1
            print("done", fetched, url, "new_links", len(links))

        except Exception as e:
            mark(conn, url, "error", err=str(e)[:500])
            conn.commit()

        time.sleep(delay_s)

    print("crawl finished. fetched", fetched)

Run it:

if __name__ == "__main__":
    crawl(
        "https://example.com",
        max_pages=100,
        max_depth=2,
        delay_s=1.5,
        use_proxiesapi=False,
    )

Comparison: crawler vs scraper (quick mental model)

  • Crawler: discovers URLs (graph traversal)
  • Scraper: extracts structured fields from known pages

Most real projects combine both:

  1. crawler discovers product/detail pages
  2. scraper extracts price/title/etc.

Common upgrades (what to do next)

  1. Per-host rate limits (token bucket)
  2. Content-type filtering (skip PDFs/images)
  3. URL allow/deny patterns (only /docs/)
  4. Sitemaps: ingest sitemap.xml before crawling
  5. Incremental crawl: re-check only changed pages
  6. Storage: store HTML hashes + extracted data
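Upgrade #1 is the most common next step, so here is a minimal sketch: a per-host minimum-interval limiter (the simplest degenerate form of a token bucket). The class and method names are illustrative, not part of the tutorial code above; you would call `wait(url)` just before each `fetch(url)`:

```python
import time
from urllib.parse import urlparse

# Sketch: enforce a minimum interval between requests *per host*,
# instead of one global delay for all hosts.
class PerHostLimiter:
    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self._last: dict[str, float] = {}  # host -> monotonic time of last request

    def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        now = time.monotonic()
        next_ok = self._last.get(host, 0.0) + self.min_interval_s
        if now < next_ok:
            time.sleep(next_ok - now)  # block until this host is allowed again
        self._last[host] = time.monotonic()

limiter = PerHostLimiter(min_interval_s=0.05)
t0 = time.monotonic()
limiter.wait("https://a.example/1")
limiter.wait("https://b.example/1")   # different host: no wait
limiter.wait("https://a.example/2")   # same host: sleeps until 0.05s have passed
print(round(time.monotonic() - t0, 3))
```

With this in place you can crawl several hosts concurrently while staying polite to each one individually.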

Where ProxiesAPI helps (honestly)

For a single small domain, you may not need proxies at all.

But as soon as you crawl:

  • multiple sites,
  • higher request volume,
  • or targets with strict throttling,

…the fetch layer becomes the failure point.

ProxiesAPI helps keep that layer consistent (retries, rotation, higher success rates), so your crawler can focus on correctness: URL logic, robots, and data quality.

Related guides

Async Web Scraping in Python: asyncio + aiohttp (Concurrency Without Getting Banned)
Learn production-grade async scraping in Python with asyncio + aiohttp: bounded concurrency, per-host limits, retry/backoff, timeouts, and proxy rotation patterns. Includes a complete working crawler template.
Web Scraping Tools (2026): The Buyer’s Guide — What to Use and When
A practical guide to choosing web scraping tools in 2026: browser automation vs frameworks vs no-code extractors vs hosted scraping APIs — plus cost, reliability, and when proxies matter.
eBay Price Tracker: How to Monitor Prices Automatically
End-to-end tracker blueprint: URLs → scrape → normalize → alerting, with practical rate limiting + proxies.
Web Scraping Dynamic Content: How to Handle JavaScript-Rendered Pages
Decision tree for JS sites: XHR capture, HTML endpoints, or headless—plus when proxies matter.