Retry Policies for Web Scrapers: What to Retry vs Fail Fast
When a scraper fails, the instinct is usually wrong.
People either:
- retry everything forever
- fail on the first timeout
- or silently accept empty HTML as success
That is how you end up with bad datasets, fake ranking drops, and jobs that look green in the dashboard while your outputs are garbage.
A good retry policy is not about “trying harder.” It is about being selective.
In this guide, we’ll build a retry policy for Python scrapers that answers the only question that matters:
What should you retry, and what should fail fast?
Once your retry logic is sane, the next bottleneck is network consistency. ProxiesAPI gives you a simple fetch endpoint you can plug into the same retry wrapper without rebuilding your whole scraper.
The core idea
Not every error means the same thing.
A scraper failure usually falls into one of four buckets:
- Transient network failure — DNS error, connection reset, read timeout
- Temporary upstream failure — 502, 503, 504, occasional 429
- Permanent response — 404, 410, malformed URL, bad auth
- Soft block / fake success — HTTP 200 but the HTML is useless
Your policy should treat each bucket differently.
If you retry permanent failures, you waste time and hammer the target.
If you do not retry transient failures, you create false negatives.
If you accept soft-blocked HTML as success, you poison your own data.
What to retry vs fail fast
Here is the practical matrix I recommend for most scrapers.
| Condition | Default action | Why |
|---|---|---|
| Connection error / timeout | Retry | Often transient |
| HTTP 408 | Retry | Request timeout usually recovers |
| HTTP 429 | Retry with longer delay | You were rate limited |
| HTTP 500 / 502 / 503 / 504 | Retry | Upstream instability |
| HTTP 404 | Fail fast | Usually permanent |
| HTTP 410 | Fail fast | Explicitly gone |
| HTTP 401 / 403 | Usually fail fast | Often auth or block issue |
| HTTP 200 with block page | Retry a limited number of times | It is not real content |
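The matrix above can be encoded as one small function. This is a sketch, not a library API — the names `RetryAction` and `action_for_status` are mine:

```python
from enum import Enum

class RetryAction(Enum):
    RETRY = "retry"
    RETRY_SLOW = "retry_slow"   # retry, but with a longer delay (429)
    FAIL_FAST = "fail_fast"

def action_for_status(status: int) -> RetryAction:
    """Map a final HTTP status to the action in the matrix above."""
    if status == 429:
        return RetryAction.RETRY_SLOW
    if status in {408, 500, 502, 503, 504}:
        return RetryAction.RETRY
    if status == 403:
        # Default: fail fast. Retry only with evidence it recovers.
        return RetryAction.FAIL_FAST
    if status in {400, 401, 404, 410, 422}:
        return RetryAction.FAIL_FAST
    # Unknown statuses: treat 5xx as upstream trouble, everything else as final.
    return RetryAction.RETRY if status >= 500 else RetryAction.FAIL_FAST
```

Centralizing the decision like this means the fetch loop never grows ad-hoc `if status == ...` branches.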
That “usually” on 403 matters.
Sometimes a 403 is transient on a site sitting behind a flaky edge rule. But you should only retry it if you have evidence that it occasionally succeeds on the same workflow. Otherwise, repeated 403 retries are just noise.
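If you do want evidence-based 403 retries, one way to collect that evidence is a short per-host history of outcomes. A sketch — `Http403Gate` is a hypothetical name, not a real library:

```python
from collections import defaultdict, deque

class Http403Gate:
    """Retry a 403 only if the same host has succeeded recently.

    Keeps a short per-host window of outcomes; a 403 is treated as
    transient only when that window contains at least one success.
    """

    def __init__(self, window: int = 20):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, host: str, ok: bool) -> None:
        self.history[host].append(ok)

    def should_retry_403(self, host: str) -> bool:
        return any(self.history[host])

gate = Http403Gate()
gate.record("example.com", True)   # a recent success on this host
gate.record("example.com", False)  # then a 403
```

With no recorded successes for a host, the gate says fail fast — which is exactly the "otherwise it's just noise" default.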
Start with explicit timeouts
A retry policy without timeouts is fake reliability.
If you do this:
requests.get(url)
that request can hang forever.
Use explicit connect and read timeouts instead:
TIMEOUT = (10, 30) # connect timeout, read timeout
That means:
- if the connection cannot start within 10 seconds, bail
- if the server goes silent for 30 seconds between bytes, bail
Those are sane defaults for most scraping jobs.
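Wired into an actual request, that looks like this. A minimal sketch — the helper name `fetch_with_timeout` is mine, and returning `None` on timeout is just one possible convention:

```python
from typing import Optional

import requests
from requests.exceptions import ConnectionError, Timeout

TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

def fetch_with_timeout(url: str) -> Optional[str]:
    """Return the response body, or None if the request could not complete in time."""
    try:
        response = requests.get(url, timeout=TIMEOUT)
        response.raise_for_status()
        return response.text
    except (Timeout, ConnectionError):
        # Transient by default; the caller decides whether to retry.
        return None
```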
A reusable retry helper in Python
This example uses requests, exponential backoff, and a soft-block detector. It is designed to be dropped into a normal scraper without extra dependencies.
import random
import re
import time
import requests
from requests import Response
from requests.exceptions import RequestException, Timeout, ConnectionError
TIMEOUT = (10, 30)
RETRY_STATUSES = {408, 429, 500, 502, 503, 504}
FAIL_FAST_STATUSES = {400, 401, 403, 404, 410, 422}
SOFT_BLOCK_PATTERNS = [
    r"enable javascript",
    r"access denied",
    r"verify you are human",
    r"unusual traffic",
    r"temporarily unavailable",
]
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0"
})
def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    exp = min(cap, base * (2 ** (attempt - 1)))
    jitter = random.uniform(0, exp * 0.25)
    return exp + jitter
def looks_soft_blocked(html: str) -> bool:
    if not html or len(html.strip()) < 500:
        return True
    lowered = html.lower()
    for pattern in SOFT_BLOCK_PATTERNS:
        if re.search(pattern, lowered):
            return True
    return False
def fetch_html(url: str, max_attempts: int = 5) -> str:
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response: Response = session.get(url, timeout=TIMEOUT)
            status = response.status_code
            if status in FAIL_FAST_STATUSES:
                raise RuntimeError(f"fail-fast status {status} for {url}")
            if status in RETRY_STATUSES:
                last_error = RuntimeError(f"retryable status {status}")
                delay = backoff_seconds(attempt)
                print(f"retryable status={status} attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue
            response.raise_for_status()
            html = response.text
            if looks_soft_blocked(html):
                last_error = RuntimeError("soft block suspected")
                delay = backoff_seconds(attempt)
                print(f"soft-block suspected attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue
            return html
        except (Timeout, ConnectionError) as exc:
            last_error = exc
            delay = backoff_seconds(attempt)
            print(f"network error attempt={attempt} sleep={delay:.2f}s err={exc}")
            time.sleep(delay)
        except RequestException as exc:
            last_error = exc
            delay = backoff_seconds(attempt)
            print(f"request error attempt={attempt} sleep={delay:.2f}s err={exc}")
            time.sleep(delay)
    raise RuntimeError(f"failed after {max_attempts} attempts: {last_error}")
Example terminal output
Here is the kind of output you actually want during a flaky crawl:
retryable status=429 attempt=1 sleep=1.13s
retryable status=502 attempt=2 sleep=2.32s
soft-block suspected attempt=3 sleep=4.79s
That output is useful because it tells you:
- what failed
- which attempt you are on
- how long the scraper is pausing
A silent retry loop is dangerous. If you do not log retries, you cannot distinguish “slow but healthy” from “quietly broken.”
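In production, those `print()` calls are better replaced with the standard `logging` module so retries can be filtered, timestamped, and shipped with the rest of your logs. A minimal sketch:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("scraper.retry")

def log_retry(reason: str, attempt: int, delay: float) -> None:
    # Same information as the print() calls, now filterable by logger name.
    log.warning("retry reason=%s attempt=%d sleep=%.2fs", reason, attempt, delay)

log_retry("status=429", 1, 1.13)
```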
Why 404 should usually fail fast
This is one of the most common mistakes in scraper codebases.
People write broad retry wrappers that treat every non-200 as retryable.
That is wrong.
If a page is genuinely gone, retrying five times does not improve reliability. It increases latency and hides the real issue.
For example, if you are scraping product detail pages from a catalog and a product is deleted, your correct outcome is:
- mark the URL as missing
- store that result cleanly
- move on
Not:
- retry for 45 seconds
- then throw a generic error
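A sketch of the "record it cleanly" path — `FetchResult` and `handle_status` are hypothetical names for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FetchResult:
    url: str
    status: str             # "ok" | "missing" | "error"
    html: Optional[str] = None

def handle_status(url: str, status_code: int, body: str) -> FetchResult:
    """Map a final HTTP status to a clean record instead of a retry loop."""
    if status_code in (404, 410):
        return FetchResult(url, "missing")        # store the fact and move on
    if status_code == 200:
        return FetchResult(url, "ok", html=body)
    return FetchResult(url, "error")

result = handle_status("https://example.com/p/123", 404, "")
```

A "missing" row in your dataset is honest and queryable; a generic exception after 45 seconds is neither.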
Why 429 deserves special treatment
A 429 Too Many Requests is not the same as a random 500.
It means the target is telling you, clearly, that your request rate is the problem.
So the right response is:
- retry
- wait longer than normal
- reduce concurrency if the pattern persists
Here is a simple way to add a longer delay for 429s:
def retry_delay_for_status(status: int, attempt: int) -> float:
    if status == 429:
        return backoff_seconds(attempt, base=3.0, cap=60.0)
    return backoff_seconds(attempt)
Then plug it into your fetcher:
if status in RETRY_STATUSES:
    delay = retry_delay_for_status(status, attempt)
    print(f"retryable status={status} attempt={attempt} sleep={delay:.2f}s")
    time.sleep(delay)
    continue
That one change makes your scraper much less likely to spiral into self-inflicted throttling.
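One refinement worth considering: some 429 responses carry a `Retry-After` header telling you exactly how long to wait. It can be either a number of seconds or an HTTP-date; this sketch (the function name is mine) honors only the seconds form and falls back to your own backoff otherwise:

```python
from typing import Optional

def delay_from_retry_after(header_value: Optional[str], fallback: float) -> float:
    """Use a numeric Retry-After value if present, otherwise our own backoff.

    Never waits less than the fallback, so the server's hint can only
    slow us down, not speed us up.
    """
    if header_value is None:
        return fallback
    try:
        return max(float(header_value), fallback)
    except ValueError:
        # HTTP-date form (or garbage): ignore and use our backoff.
        return fallback

# In the fetcher, for a 429:
# delay = delay_from_retry_after(response.headers.get("Retry-After"),
#                                backoff_seconds(attempt))
```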
Soft blocks are the sneaky failure mode
HTTP status codes are only half the story.
A lot of block pages return 200 OK.
That means this can happen:
- request succeeds
- parser finds zero target elements
- exporter writes empty rows
- dashboard says the job passed
That is not success. That is silent corruption.
Your fetch layer should reject obviously bad HTML before the parser sees it.
A few common signals:
- tiny page size
- “enable javascript” wall
- “access denied” text
- “verify you are human” challenge page
- unexpected template missing your expected anchors
If your target normally has 30 product cards and suddenly there are zero, that should be suspicious by default.
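That suspicion can be encoded as a cheap check before the parser ever runs. A stdlib-only sketch (names and thresholds are mine; a real version might count CSS-selector matches instead of regex hits):

```python
import re

def looks_credible(html: str, marker_pattern: str, min_count: int) -> bool:
    """Reject pages missing the anchors we normally expect.

    Counts occurrences of a marker that should appear on real pages
    (e.g. a product-card class name) and rejects the page when the
    count is suspiciously low or the page is tiny.
    """
    if not html or len(html.strip()) < 500:
        return False
    found = len(re.findall(marker_pattern, html, flags=re.IGNORECASE))
    return found >= min_count

good_page = '<div class="product-card">A</div>' * 30
print(looks_credible(good_page, r'class="product-card"', 10))  # → True
```

Treat a failed credibility check the same way as a soft block: retry a limited number of times, then record the failure honestly.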
Adding ProxiesAPI to the same retry policy
The nice part about a good retry policy is that it does not care whether you are fetching directly or via a proxy API.
You only change the URL construction.
The ProxiesAPI format is:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
Here is the equivalent in Python:
from urllib.parse import quote_plus
def build_proxiesapi_url(target_url: str, api_key: str) -> str:
    encoded = quote_plus(target_url)
    return f"http://api.proxiesapi.com/?key={api_key}&url={encoded}"
target = "https://example.com/products"
proxy_url = build_proxiesapi_url(target, "API_KEY")
html = fetch_html(proxy_url)
print(html[:300])
That is exactly how a stable scraper should evolve.
First fix your retry behavior. Then swap the transport layer when direct requests are no longer reliable enough.
A complete practical example
Let’s say you are scraping a category page and extracting article links.
from bs4 import BeautifulSoup
def parse_links(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if href and href.startswith("http"):
            links.append(href)
    return links
if __name__ == "__main__":
    url = "https://example.com/blog"
    html = fetch_html(url)
    links = parse_links(html)
    print(f"found {len(links)} links")
    print(links[:5])
Example output:
found 42 links
['https://example.com/post-1', 'https://example.com/post-2', 'https://example.com/post-3']
The important thing is not the parser.
The important thing is that your parser only runs after the network layer has decided the response is credible.
Recommended defaults for most scrapers
If you need a starting point, use this:
- max attempts: 5
- connect timeout: 10s
- read timeout: 30s
- retry statuses: 408, 429, 500, 502, 503, 504
- fail-fast statuses: 400, 401, 403, 404, 410, 422
- backoff: exponential with jitter
- log every retry
- treat tiny or challenge pages as soft blocks
These defaults will not solve every site.
But they will eliminate the most common reliability mistakes.
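Those defaults fit naturally into a single config object you can pass around, log, and override per site. A sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryConfig:
    """The recommended defaults above, as one immutable object."""
    max_attempts: int = 5
    connect_timeout: float = 10.0
    read_timeout: float = 30.0
    retry_statuses: frozenset = frozenset({408, 429, 500, 502, 503, 504})
    fail_fast_statuses: frozenset = frozenset({400, 401, 403, 404, 410, 422})
    backoff_base: float = 1.0
    backoff_cap: float = 30.0

config = RetryConfig()
```

Freezing the dataclass means a per-site override is a deliberate `dataclasses.replace(...)`, not a silent mutation somewhere in the crawl loop.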
The real principle
A retry policy is not there to hide failures.
It is there to separate:
- brief turbulence you should absorb
- from real failures you should record honestly
That distinction is what makes the difference between a scraper that looks busy and a scraper you can trust.
If you get that right, everything else gets easier:
- cleaner metrics
- fewer false alarms
- better datasets
- faster debugging
And if direct requests stop being predictable, you can keep the same policy and point it at a ProxiesAPI URL instead of rebuilding your whole stack.
That is the kind of engineering choice that compounds.