Retries, Timeouts, and Backoff for Web Scraping (Python): Production Defaults That Work

Most scraping tutorials die the moment you scale:

  • request #937 times out
  • half your crawl silently returns “empty” HTML
  • you retry too aggressively and start getting 429s

Parsing isn’t the hard part.

Networking behavior is.

In this post, we’ll set up production-grade defaults for Python scraping:

  • timeouts you can defend
  • a retry policy that won’t melt the target
  • exponential backoff + jitter
  • a “soft block” detector to catch fake-success responses

Stop losing crawls to flaky networking

If you’re already doing retries and still seeing random failures, it’s usually proxy/network variability. ProxiesAPI is built to make large crawls boring and predictable.


1) The timeout mistake everyone makes

Most code does this:

requests.get(url)

Which means:

  • you have no timeout (can hang forever)
  • you have no separation between connect/read

A sane starting point

If you’re scraping the normal web, these defaults work surprisingly well:

  • connect timeout: 5–10s
  • read timeout: 20–40s
  • total per request: ≤ 60s

In requests, you can’t set a total timeout directly, but you can set connect+read:

import requests

TIMEOUT = (10, 30)  # (connect, read)

r = requests.get(url, timeout=TIMEOUT)
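Since requests has no total-timeout knob, one way to honor the "≤ 60s per request" budget is to enforce a deadline yourself across retries. A minimal sketch (the `get_within_budget` name and the callable-free structure are my own, not a requests API):

```python
import time

import requests

TIMEOUT = (10, 30)  # (connect, read)

def get_within_budget(url: str, budget: float = 60.0, attempts: int = 3) -> requests.Response:
    """Retry a GET, but never let the retry loop outlive `budget` seconds."""
    deadline = time.monotonic() + budget
    last_err = None
    for _ in range(attempts):
        if time.monotonic() >= deadline:
            break  # total budget spent: stop even if attempts remain
        try:
            return requests.get(url, timeout=TIMEOUT)
        except requests.RequestException as e:
            last_err = e
    raise TimeoutError(f"no response for {url} within {budget:.0f}s ({last_err})")
```

The deadline check runs before each attempt, so a slow connect+read pair can still push you slightly past the budget; for scraping that looseness is usually acceptable.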

2) Retry policy: what to retry vs what to fail fast

Retry these

  • 429 (rate-limited)
  • 408 (request timeout)
  • 5xx (server-side issues)
  • transient connection errors / DNS hiccups

Don’t retry these (usually)

  • 404 (not found)
  • 401/403 (auth/blocked — retries usually waste time)

Exception: if 403 is intermittent and you have evidence it’s a transient edge block, retry once with a longer delay.
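The policy above can be encoded in one small predicate so every call site agrees on it. A sketch (the `should_retry` name and the `FAIL_FAST` set are my own; it treats 403 as fail-fast, so adjust if you have evidence of intermittent edge blocks):

```python
RETRYABLE = {408, 429, 500, 502, 503, 504}
FAIL_FAST = {400, 401, 403, 404, 410}

def should_retry(status: int) -> bool:
    """Return True only for statuses worth retrying."""
    if status in RETRYABLE:
        return True
    if status in FAIL_FAST:
        return False
    # Unknown 5xx: treat as transient; everything else fails fast.
    return 500 <= status < 600
```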


3) Exponential backoff + jitter (drop-in)

This is the single highest-leverage upgrade for scraper stability.

import random
import time

def backoff_seconds(attempt: int, base: float = 0.8, cap: float = 30.0) -> float:
    """Exponential backoff with jitter.

    attempt: 1,2,3...
    """
    exp = min(cap, base * (2 ** (attempt - 1)))
    jitter = random.uniform(0, exp * 0.2)
    return exp + jitter

for attempt in range(1, 6):
    delay = backoff_seconds(attempt)
    print(attempt, round(delay, 2))

Example output (varies):

1 0.88
2 1.72
3 3.54
4 7.12
5 14.44

4) Soft-block detection (catch fake success)

A lot of “blocks” are HTTP 200 but the HTML is useless.

Common signatures:

  • “enable javascript”
  • “access denied”
  • suspiciously tiny HTML

Here’s a pragmatic detector:

import re

SOFT_BLOCK_PATTERNS = [
    r"enable javascript",
    r"access denied",
    r"unusual traffic",
    r"verify you are human",
]

def looks_soft_blocked(html: str) -> bool:
    if not html:
        return True

    # tiny pages are rarely real content pages
    if len(html) < 2000:
        return True

    low = html.lower()
    for p in SOFT_BLOCK_PATTERNS:
        if re.search(p, low):
            return True

    return False

This doesn’t “solve” blocks.

It prevents the worse failure mode: quietly accepting garbage as success.


5) A real, production-style fetch() with retries

import time

import requests

RETRY_STATUSES = {408, 429, 500, 502, 503, 504}

session = requests.Session()

def fetch(url: str, *, max_attempts: int = 5) -> str:
    """Fetch with bounded retries, backoff, and soft-block detection.

    Assumes backoff_seconds() and looks_soft_blocked() from above.
    """
    last_err = None

    for attempt in range(1, max_attempts + 1):
        try:
            r = session.get(
                url,
                timeout=(10, 30),
                headers={"User-Agent": "Mozilla/5.0"},
            )

            # retryable statuses: back off and try again
            if r.status_code in RETRY_STATUSES:
                delay = backoff_seconds(attempt)
                print(f"retryable status {r.status_code} attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            # any other non-2xx (404, 401, 403...) raises HTTPError below
            r.raise_for_status()

            html = r.text
            if looks_soft_blocked(html):
                last_err = "soft-block suspected"
                delay = backoff_seconds(attempt)
                print(f"soft-block suspected attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            return html

        except requests.HTTPError:
            # non-retryable status (e.g. 404): surface immediately, don't retry.
            # This must be caught before RequestException, its parent class,
            # or the generic handler below would retry it.
            raise
        except requests.RequestException as e:
            last_err = e
            delay = backoff_seconds(attempt)
            print(f"request error attempt={attempt} sleep={delay:.2f}s err={e}")
            time.sleep(delay)

    raise RuntimeError(f"Failed after {max_attempts} attempts: {last_err}")

Terminal simulation (typical)

retryable status 429 attempt=1 sleep=0.92s
soft-block suspected attempt=2 sleep=1.78s
request error attempt=3 sleep=3.39s err=HTTPSConnectionPool(...): Read timed out

Where ProxiesAPI fits (without overpromising)

A proxy API helps when your crawl starts failing due to:

  • inconsistent IP reputation
  • unreliable proxy pools
  • request variability at scale

ProxiesAPI’s job is not “magic bypass”.

It’s making large crawls predictable.

Minimal integration sketch

# Pseudocode: adapt to ProxiesAPI’s exact proxy endpoint format.
PROXY = "http://USER:PASS@proxy.proxiesapi.com:PORT"

def fetch_via_proxy(url: str) -> str:
    r = session.get(
        url,
        timeout=(10, 30),
        proxies={"http": PROXY, "https": PROXY},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    r.raise_for_status()
    return r.text
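A common pattern is to escalate: try the cheap direct fetch first and only spend proxy bandwidth when it fails. A sketch with the fetchers injected as callables so it is easy to test; in practice you would pass the post's `fetch` and `fetch_via_proxy` (the `fetch_with_fallback` name is my own):

```python
from typing import Callable

def fetch_with_fallback(
    url: str,
    direct: Callable[[str], str],
    via_proxy: Callable[[str], str],
) -> str:
    """Try the direct fetcher first; escalate to the proxy only on failure."""
    try:
        return direct(url)
    except RuntimeError:
        # direct attempts exhausted (fetch() raises RuntimeError): escalate
        return via_proxy(url)
```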

The QA checklist I use

  • Every request has connect+read timeouts
  • Retries are bounded (max_attempts)
  • Backoff includes jitter
  • 404 is not retried
  • 429/5xx are retried
  • Soft-blocks are detected and treated as failure
  • Logs include attempt count + delay reason
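Some of these checklist items can be verified offline, with no network. A sketch that checks the backoff properties (`backoff_seconds` is re-stated from the section above so the snippet runs standalone):

```python
import random

# Re-stated from the backoff section so this check runs standalone.
def backoff_seconds(attempt: int, base: float = 0.8, cap: float = 30.0) -> float:
    exp = min(cap, base * (2 ** (attempt - 1)))
    return exp + random.uniform(0, exp * 0.2)

# "Backoff includes jitter": each delay lands in a known window [exp, 1.2*exp].
for attempt in range(1, 6):
    d = backoff_seconds(attempt)
    lo = min(30.0, 0.8 * (2 ** (attempt - 1)))
    assert lo <= d <= lo * 1.2

# "Retries are bounded": five capped attempts can never sleep more than ~3 min total.
worst_case = sum(min(30.0, 0.8 * (2 ** (a - 1))) * 1.2 for a in range(1, 6))
assert worst_case < 180
```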

Next step

A natural next step is to collect these pieces into a small scrape_utils.py you can reuse across all your scrapers.


Related guides

Soft-Block Detection for Web Scraping (Python): Catch ‘HTTP 200 but Wrong Page’
Most scrapers fail silently: the request succeeds but the HTML is a block/consent/login page. Here’s how to detect soft-blocks before parsing.

How to Scrape IMDb Top 250 with Python (Without Guessing Selectors)
A real-world IMDb scraping tutorial covering browser-rendered HTML, verified selectors, sample output, and why naive requests can fail.

How to Scrape MDN Docs Pages with Python
Extract headings and table-of-contents structure from MDN docs pages with Python and BeautifulSoup.

How to Scrape the Python Docs Module Index with Python
Build a searchable dataset from the Python docs module index using Python and BeautifulSoup.