Retries, Timeouts, and Backoff for Web Scraping (Python): Production Defaults That Work
Most scraping tutorials die the moment you scale:
- request #937 times out
- half your crawl silently returns “empty” HTML
- you retry too aggressively and start getting 429s
Parsing isn’t the hard part.
Networking behavior is.
In this post, we’ll set up production-grade defaults for Python scraping:
- timeouts you can defend
- a retry policy that won’t melt the target
- exponential backoff + jitter
- a “soft block” detector to catch fake-success responses
If you’re already doing retries and still seeing random failures, it’s usually proxy/network variability. ProxiesAPI is built to make large crawls boring and predictable.
1) The timeout mistake everyone makes
Most code does this:
requests.get(url)
Which means:
- you have no timeout (can hang forever)
- you have no separation between connect/read
A sane starting point
If you’re scraping the normal web, these defaults work surprisingly well:
- connect timeout: 5–10s
- read timeout: 20–40s
- total per request: ≤ 60s
In requests, you can’t set a total timeout directly, but you can set connect+read:
import requests

TIMEOUT = (10, 30)  # (connect, read)
r = requests.get(url, timeout=TIMEOUT)
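One caveat worth knowing: requests' read timeout applies per socket read, not to the whole response, so a server that drips one byte every 29 seconds never trips `timeout=(10, 30)`. If you need a hard total cap, one sketch (the `get_with_deadline` helper below is my own, not a requests feature) is to stream the body against a monotonic deadline:

```python
import time

import requests


def get_with_deadline(url: str, deadline: float = 60.0) -> bytes:
    """Fetch url, aborting once `deadline` seconds of wall clock pass.

    The (10, 30) connect/read timeouts still guard individual socket
    operations; the monotonic check caps the transfer as a whole.
    """
    start = time.monotonic()
    r = requests.get(url, timeout=(10, 30), stream=True)
    chunks = []
    for chunk in r.iter_content(chunk_size=8192):
        if time.monotonic() - start > deadline:
            r.close()
            raise TimeoutError(f"exceeded {deadline:.0f}s total budget for {url}")
        chunks.append(chunk)
    return b"".join(chunks)
```

The 60s default matches the "total per request: ≤ 60s" guideline above; tune it per target.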
2) Retry policy: what to retry vs what to fail fast
Retry these
- 429 (rate-limited)
- 408 (request timeout)
- 5xx (server-side issues)
- transient connection errors / DNS hiccups
Don’t retry these (usually)
- 404 (not found)
- 401/403 (auth/blocked — retries usually waste time)
Exception: if 403 is intermittent and you have evidence it’s a transient edge block, retry once with a longer delay.
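As a sketch, the policy above condenses into a single predicate. The `RETRYABLE` and `FAIL_FAST` names are mine; adjust the sets to your targets:

```python
# Status sets mirroring the lists above; tune per target.
RETRYABLE = {408, 429} | set(range(500, 600))
FAIL_FAST = {401, 403, 404}


def should_retry(status: int) -> bool:
    """True if the status is worth another attempt with backoff."""
    if status in FAIL_FAST:
        # Exception noted above: an intermittent 403 might deserve one
        # extra attempt with a longer delay; handle that case explicitly
        # if you have evidence it's a transient edge block.
        return False
    return status in RETRYABLE
```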
3) Exponential backoff + jitter (drop-in)
This is the single highest-leverage upgrade for scraper stability.
import random
import time


def backoff_seconds(attempt: int, base: float = 0.8, cap: float = 30.0) -> float:
    """Exponential backoff with jitter.

    attempt: 1, 2, 3...
    The exponential term is capped at `cap`; jitter adds up to 20% on top.
    """
    exp = min(cap, base * (2 ** (attempt - 1)))
    jitter = random.uniform(0, exp * 0.2)
    return exp + jitter


for attempt in range(1, 6):
    delay = backoff_seconds(attempt)
    print(attempt, round(delay, 2))
Example output (varies):
1 0.88
2 1.72
3 3.54
4 7.12
5 14.44
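If you'd rather not hand-roll this, urllib3's Retry class (which requests uses under the hood) gives you bounded retries with exponential backoff. A sketch, assuming a reasonably recent urllib3 (the `allowed_methods` name needs 1.26+; plain `backoff_factor` has no jitter unless your urllib3 is a 2.x release with the `backoff_jitter` option):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,
    backoff_factor=0.8,  # sleeps grow roughly as 0.8 * 2**(n-1)
    status_forcelist=[408, 429, 500, 502, 503, 504],
    allowed_methods=["GET", "HEAD"],  # don't blindly retry POSTs
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)
```

One trade-off: these retries happen inside the transport layer, so you can't run a soft-block detector between attempts. The hand-rolled fetch() in section 5 keeps that control.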
4) Soft-block detection (catch fake success)
A lot of “blocks” are HTTP 200 but the HTML is useless.
Common signatures:
- “enable javascript”
- “access denied”
- suspiciously tiny HTML
Here’s a pragmatic detector:
import re

SOFT_BLOCK_PATTERNS = [
    r"enable javascript",
    r"access denied",
    r"unusual traffic",
    r"verify you are human",
]


def looks_soft_blocked(html: str) -> bool:
    if not html:
        return True
    # tiny pages are rarely real content pages
    if len(html) < 2000:
        return True
    low = html.lower()
    for p in SOFT_BLOCK_PATTERNS:
        if re.search(p, low):
            return True
    return False
This doesn’t “solve” blocks.
It prevents the worse failure mode: quietly accepting garbage as success.
5) A real, production-style fetch() with retries
import time

import requests

RETRY_STATUSES = {408, 429, 500, 502, 503, 504}

session = requests.Session()


def fetch(url: str, *, max_attempts: int = 5) -> str:
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            r = session.get(
                url,
                timeout=(10, 30),
                headers={"User-Agent": "Mozilla/5.0"},
            )

            # Retryable statuses: back off and try again.
            if r.status_code in RETRY_STATUSES:
                delay = backoff_seconds(attempt)
                print(f"retryable status {r.status_code} attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            # Non-retryable 4xx (404, 401, 403): fail fast.
            r.raise_for_status()

            html = r.text
            if looks_soft_blocked(html):
                delay = backoff_seconds(attempt)
                print(f"soft-block suspected attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            return html
        except requests.HTTPError:
            raise  # non-retryable status: don't burn attempts on it
        except requests.RequestException as e:
            last_err = e
            delay = backoff_seconds(attempt)
            print(f"request error attempt={attempt} sleep={delay:.2f}s err={e}")
            time.sleep(delay)
    raise RuntimeError(f"Failed after {max_attempts} attempts: {last_err}")
Terminal simulation (typical)
retryable status 429 attempt=1 sleep=0.92s
soft-block suspected attempt=2 sleep=1.78s
request error attempt=3 sleep=3.39s err=HTTPSConnectionPool(...): Read timed out
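One refinement fetch() above skips: on 429 and 503, many servers send a Retry-After header telling you exactly how long to wait, and honoring it beats guessing. A small helper (`retry_after_delay` is a name I'm introducing) that prefers the header over computed backoff, handling only the numeric form:

```python
def retry_after_delay(headers: dict, fallback: float) -> float:
    """Prefer a numeric Retry-After header over computed backoff.

    Retry-After can also be an HTTP date; this sketch ignores that
    form and falls back to the backoff delay instead.
    """
    raw = headers.get("Retry-After", "").strip()
    if raw.isdigit():
        return float(raw)
    return fallback


# In fetch(), before sleeping on a 429, you could swap in:
# delay = retry_after_delay(r.headers, backoff_seconds(attempt))
```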
Where ProxiesAPI fits (without overpromising)
A proxy API helps when your crawl starts failing due to:
- inconsistent IP reputation
- unreliable proxy pools
- request variability at scale
ProxiesAPI’s job is not “magic bypass”.
It’s making large crawls predictable.
Minimal integration sketch
# Pseudocode: adapt to ProxiesAPI’s exact proxy endpoint format.
PROXY = "http://USER:PASS@proxy.proxiesapi.com:PORT"
def fetch_via_proxy(url: str) -> str:
    r = session.get(
        url,
        timeout=(10, 30),
        proxies={"http": PROXY, "https": PROXY},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    r.raise_for_status()
    return r.text
The QA checklist I use
- Every request has connect+read timeouts
- Retries are bounded (max_attempts)
- Backoff includes jitter
- 404 is not retried
- 429/5xx are retried
- Soft-blocks are detected and treated as failure
- Logs include attempt count + delay reason
Next step
Wrap these helpers into a small scrape_utils.py and reuse it across all your scraping projects.