Retries, Timeouts, and Backoff for Web Scraping (Python): Production Defaults That Work
Most scraping tutorials die the moment you scale:
- request #937 times out
- half your crawl silently returns “empty” HTML
- you retry too aggressively and start getting 429s
Parsing isn’t the hard part.
Networking behavior is.
In this post, we’ll set up production-grade defaults for Python scraping:
- timeouts you can defend
- a retry policy that won’t melt the target
- exponential backoff + jitter
- a “soft block” detector to catch fake-success responses
If you’re already doing retries and still seeing random failures, it’s usually proxy/network variability. ProxiesAPI is built to make large crawls boring and predictable.
1) The timeout mistake everyone makes
Most code does this:
requests.get(url)
Which means:
- you have no timeout (can hang forever)
- you have no separation between connect/read
A sane starting point
If you’re scraping the normal web, these defaults work surprisingly well:
- connect timeout: 5–10s
- read timeout: 20–40s
- total per request: ≤ 60s
In requests, you can’t set a total timeout directly, but you can set connect+read:
import requests

TIMEOUT = (10, 30)  # (connect, read)
r = requests.get(url, timeout=TIMEOUT)
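One caveat worth knowing: requests' read timeout applies per socket read, not to the whole response, so a server that drips one byte every 29 seconds never trips `timeout=(10, 30)`. If you need a hard total cap, one sketch (the `get_with_deadline` helper below is my own, not a requests feature) is to stream the body against a monotonic deadline:

```python
import time

import requests


def get_with_deadline(url: str, deadline: float = 60.0) -> bytes:
    """Fetch url, aborting once `deadline` seconds of wall clock pass.

    The (10, 30) connect/read timeouts still guard individual socket
    operations; the monotonic check caps the transfer as a whole.
    """
    start = time.monotonic()
    r = requests.get(url, timeout=(10, 30), stream=True)
    chunks = []
    for chunk in r.iter_content(chunk_size=8192):
        if time.monotonic() - start > deadline:
            r.close()
            raise TimeoutError(f"exceeded {deadline:.0f}s total budget for {url}")
        chunks.append(chunk)
    return b"".join(chunks)
```

The 60s default matches the "total per request: ≤ 60s" guideline above; tune it per target.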
2) Retry policy: what to retry vs what to fail fast
Retry these
- 429 (rate-limited)
- 408 (request timeout)
- 5xx (server-side issues)
- transient connection errors / DNS hiccups
Don’t retry these (usually)
- 404 (not found)
- 401/403 (auth/blocked — retries usually waste time)
Exception: if 403 is intermittent and you have evidence it’s a transient edge block, retry once with a longer delay.
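As a sketch, the policy above condenses into a single predicate. The `RETRYABLE` and `FAIL_FAST` names are mine; adjust the sets to your targets:

```python
# Status sets mirroring the lists above; tune per target.
RETRYABLE = {408, 429} | set(range(500, 600))
FAIL_FAST = {401, 403, 404}


def should_retry(status: int) -> bool:
    """True if the status is worth another attempt with backoff."""
    if status in FAIL_FAST:
        # Exception noted above: an intermittent 403 might deserve one
        # extra attempt with a longer delay; handle that case explicitly
        # if you have evidence it's a transient edge block.
        return False
    return status in RETRYABLE
```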
3) Exponential backoff + jitter (drop-in)
This is the single highest-leverage upgrade for scraper stability.
import random
import time


def backoff_seconds(attempt: int, base: float = 0.8, cap: float = 30.0) -> float:
    """Exponential backoff with jitter.

    attempt: 1, 2, 3...
    The exponential term is capped at `cap`; jitter adds up to 20% on top.
    """
    exp = min(cap, base * (2 ** (attempt - 1)))
    jitter = random.uniform(0, exp * 0.2)
    return exp + jitter


for attempt in range(1, 6):
    delay = backoff_seconds(attempt)
    print(attempt, round(delay, 2))
Example output (varies):
1 0.88
2 1.72
3 3.54
4 7.12
5 14.44
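If you'd rather not hand-roll this, urllib3's Retry class (which requests uses under the hood) gives you bounded retries with exponential backoff. A sketch, assuming a reasonably recent urllib3 (the `allowed_methods` name needs 1.26+; plain `backoff_factor` has no jitter unless your urllib3 is a 2.x release with the `backoff_jitter` option):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,
    backoff_factor=0.8,  # sleeps grow roughly as 0.8 * 2**(n-1)
    status_forcelist=[408, 429, 500, 502, 503, 504],
    allowed_methods=["GET", "HEAD"],  # don't blindly retry POSTs
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)
```

One trade-off: these retries happen inside the transport layer, so you can't run a soft-block detector between attempts. The hand-rolled fetch() in section 5 keeps that control.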
4) Soft-block detection (catch fake success)
A lot of “blocks” are HTTP 200 but the HTML is useless.
Common signatures:
- “enable javascript”
- “access denied”
- suspiciously tiny HTML
Here’s a pragmatic detector:
import re

SOFT_BLOCK_PATTERNS = [
    r"enable javascript",
    r"access denied",
    r"unusual traffic",
    r"verify you are human",
]


def looks_soft_blocked(html: str) -> bool:
    if not html:
        return True
    # tiny pages are rarely real content pages
    if len(html) < 2000:
        return True
    low = html.lower()
    for p in SOFT_BLOCK_PATTERNS:
        if re.search(p, low):
            return True
    return False
This doesn’t “solve” blocks.
It prevents the worse failure mode: quietly accepting garbage as success.
5) A real, production-style fetch() with retries
import time

import requests

RETRY_STATUSES = {408, 429, 500, 502, 503, 504}

session = requests.Session()


def fetch(url: str, *, max_attempts: int = 5) -> str:
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            r = session.get(
                url,
                timeout=(10, 30),
                headers={"User-Agent": "Mozilla/5.0"},
            )

            # Retryable statuses: back off and try again.
            if r.status_code in RETRY_STATUSES:
                delay = backoff_seconds(attempt)
                print(f"retryable status {r.status_code} attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            # Non-retryable 4xx (404, 401, 403): fail fast.
            r.raise_for_status()

            html = r.text
            if looks_soft_blocked(html):
                delay = backoff_seconds(attempt)
                print(f"soft-block suspected attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            return html
        except requests.HTTPError:
            raise  # non-retryable status: don't burn attempts on it
        except requests.RequestException as e:
            last_err = e
            delay = backoff_seconds(attempt)
            print(f"request error attempt={attempt} sleep={delay:.2f}s err={e}")
            time.sleep(delay)
    raise RuntimeError(f"Failed after {max_attempts} attempts: {last_err}")
Terminal simulation (typical)
retryable status 429 attempt=1 sleep=0.92s
soft-block suspected attempt=2 sleep=1.78s
request error attempt=3 sleep=3.39s err=HTTPSConnectionPool(...): Read timed out
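One refinement fetch() above skips: on 429 and 503, many servers send a Retry-After header telling you exactly how long to wait, and honoring it beats guessing. A small helper (`retry_after_delay` is a name I'm introducing) that prefers the header over computed backoff, handling only the numeric form:

```python
def retry_after_delay(headers: dict, fallback: float) -> float:
    """Prefer a numeric Retry-After header over computed backoff.

    Retry-After can also be an HTTP date; this sketch ignores that
    form and falls back to the backoff delay instead.
    """
    raw = headers.get("Retry-After", "").strip()
    if raw.isdigit():
        return float(raw)
    return fallback


# In fetch(), before sleeping on a 429, you could swap in:
# delay = retry_after_delay(r.headers, backoff_seconds(attempt))
```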
Where ProxiesAPI fits (without overpromising)
A proxy API helps when your crawl starts failing due to:
- inconsistent IP reputation
- unreliable proxy pools
- request variability at scale
ProxiesAPI’s job is not “magic bypass”.
It’s making large crawls predictable.
Minimal integration sketch
# Pseudocode: adapt to ProxiesAPI’s exact proxy endpoint format.
PROXY = "http://USER:PASS@proxy.proxiesapi.com:PORT"
def fetch_via_proxy(url: str) -> str:
    r = session.get(
        url,
        timeout=(10, 30),
        proxies={"http": PROXY, "https": PROXY},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    r.raise_for_status()
    return r.text
The QA checklist I use
- Every request has connect+read timeouts
- Retries are bounded (max_attempts)
- Backoff includes jitter
- 404 is not retried
- 429/5xx are retried
- Soft-blocks are detected and treated as failure
- Logs include attempt count + delay reason
Next step
Wrap these helpers into a small scrape_utils.py and reuse it across all your scraping projects.