Retry Policies for Web Scrapers: What to Retry vs Fail Fast
When a scraper fails, the instinct is usually wrong.
People either:
- retry everything forever
- fail on the first timeout
- or silently accept empty HTML as success
That is how you end up with bad datasets, fake ranking drops, and jobs that look green in the dashboard while your outputs are garbage.
A good retry policy is not about “trying harder.” It is about being selective.
In this guide, we’ll build a retry policy for Python scrapers that answers the only question that matters:
What should you retry, and what should fail fast?
Once your retry logic is sane, the next bottleneck is network consistency. ProxiesAPI gives you a simple fetch endpoint you can plug into the same retry wrapper without rebuilding your whole scraper.
The core idea
Not every error means the same thing.
A scraper failure usually falls into one of four buckets:
- Transient network failure — DNS error, connection reset, read timeout
- Temporary upstream failure — 502, 503, 504, occasional 429
- Permanent response — 404, 410, malformed URL, bad auth
- Soft block / fake success — HTTP 200 but the HTML is useless
Your policy should treat each bucket differently.
If you retry permanent failures, you waste time and hammer the target.
If you do not retry transient failures, you create false negatives.
If you accept soft-blocked HTML as success, you poison your own data.
What to retry vs fail fast
Here is the practical matrix I recommend for most scrapers.
| Condition | Default action | Why |
|---|---|---|
| Connection error / timeout | Retry | Often transient |
| HTTP 408 | Retry | Request timeout usually recovers |
| HTTP 429 | Retry with longer delay | You were rate limited |
| HTTP 500 / 502 / 503 / 504 | Retry | Upstream instability |
| HTTP 404 | Fail fast | Usually permanent |
| HTTP 410 | Fail fast | Explicitly gone |
| HTTP 401 / 403 | Usually fail fast | Often auth or block issue |
| HTTP 200 with block page | Retry a limited number of times | It is not real content |
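The matrix above can be encoded as one small function. This is a sketch, not a library API — the names `RetryAction` and `action_for_status` are mine:

```python
from enum import Enum

class RetryAction(Enum):
    RETRY = "retry"
    RETRY_SLOW = "retry_slow"   # retry, but with a longer delay (429)
    FAIL_FAST = "fail_fast"

def action_for_status(status: int) -> RetryAction:
    """Map a final HTTP status to the action in the matrix above."""
    if status == 429:
        return RetryAction.RETRY_SLOW
    if status in {408, 500, 502, 503, 504}:
        return RetryAction.RETRY
    if status == 403:
        # Default: fail fast. Retry only with evidence it recovers.
        return RetryAction.FAIL_FAST
    if status in {400, 401, 404, 410, 422}:
        return RetryAction.FAIL_FAST
    # Unknown statuses: treat 5xx as upstream trouble, everything else as final.
    return RetryAction.RETRY if status >= 500 else RetryAction.FAIL_FAST
```

Centralizing the decision like this means the fetch loop never grows ad-hoc `if status == ...` branches.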
That “usually” on 403 matters.
Sometimes a 403 is transient on a site sitting behind a flaky edge rule. But you should only retry it if you have evidence that it occasionally succeeds on the same workflow. Otherwise, repeated 403 retries are just noise.
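If you do want evidence-based 403 retries, one way to collect that evidence is a short per-host history of outcomes. A sketch — `Http403Gate` is a hypothetical name, not a real library:

```python
from collections import defaultdict, deque

class Http403Gate:
    """Retry a 403 only if the same host has succeeded recently.

    Keeps a short per-host window of outcomes; a 403 is treated as
    transient only when that window contains at least one success.
    """

    def __init__(self, window: int = 20):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, host: str, ok: bool) -> None:
        self.history[host].append(ok)

    def should_retry_403(self, host: str) -> bool:
        return any(self.history[host])

gate = Http403Gate()
gate.record("example.com", True)   # a recent success on this host
gate.record("example.com", False)  # then a 403
```

With no recorded successes for a host, the gate says fail fast — which is exactly the "otherwise it's just noise" default.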
Start with explicit timeouts
A retry policy without timeouts is fake reliability.
If you do this:
requests.get(url)
that request can hang forever.
Use explicit connect and read timeouts instead:
TIMEOUT = (10, 30) # connect timeout, read timeout
That means:
- if the connection cannot start within 10 seconds, bail
- if the server goes silent for 30 seconds between bytes, bail
Those are sane defaults for most scraping jobs.
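Wired into an actual request, that looks like this. A minimal sketch — the helper name `fetch_with_timeout` is mine, and returning `None` on timeout is just one possible convention:

```python
from typing import Optional

import requests
from requests.exceptions import ConnectionError, Timeout

TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

def fetch_with_timeout(url: str) -> Optional[str]:
    """Return the response body, or None if the request could not complete in time."""
    try:
        response = requests.get(url, timeout=TIMEOUT)
        response.raise_for_status()
        return response.text
    except (Timeout, ConnectionError):
        # Transient by default; the caller decides whether to retry.
        return None
```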
A reusable retry helper in Python
This example uses requests, exponential backoff, and a soft-block detector. It is designed to be dropped into a normal scraper without extra dependencies.
import random
import re
import time
import requests
from requests import Response
from requests.exceptions import RequestException, Timeout, ConnectionError
TIMEOUT = (10, 30)
RETRY_STATUSES = {408, 429, 500, 502, 503, 504}
FAIL_FAST_STATUSES = {400, 401, 403, 404, 410, 422}
SOFT_BLOCK_PATTERNS = [
    r"enable javascript",
    r"access denied",
    r"verify you are human",
    r"unusual traffic",
    r"temporarily unavailable",
]
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0"
})
def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    exp = min(cap, base * (2 ** (attempt - 1)))
    jitter = random.uniform(0, exp * 0.25)
    return exp + jitter
def looks_soft_blocked(html: str) -> bool:
    if not html or len(html.strip()) < 500:
        return True
    lowered = html.lower()
    for pattern in SOFT_BLOCK_PATTERNS:
        if re.search(pattern, lowered):
            return True
    return False
def fetch_html(url: str, max_attempts: int = 5) -> str:
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response: Response = session.get(url, timeout=TIMEOUT)
            status = response.status_code
            if status in FAIL_FAST_STATUSES:
                raise RuntimeError(f"fail-fast status {status} for {url}")
            if status in RETRY_STATUSES:
                last_error = RuntimeError(f"retryable status {status}")
                delay = backoff_seconds(attempt)
                print(f"retryable status={status} attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue
            response.raise_for_status()
            html = response.text
            if looks_soft_blocked(html):
                last_error = RuntimeError("soft block suspected")
                delay = backoff_seconds(attempt)
                print(f"soft-block suspected attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue
            return html
        except (Timeout, ConnectionError) as exc:
            last_error = exc
            delay = backoff_seconds(attempt)
            print(f"network error attempt={attempt} sleep={delay:.2f}s err={exc}")
            time.sleep(delay)
        except RequestException as exc:
            last_error = exc
            delay = backoff_seconds(attempt)
            print(f"request error attempt={attempt} sleep={delay:.2f}s err={exc}")
            time.sleep(delay)
    raise RuntimeError(f"failed after {max_attempts} attempts: {last_error}")
Example terminal output
Here is the kind of output you actually want during a flaky crawl:
retryable status=429 attempt=1 sleep=1.13s
retryable status=502 attempt=2 sleep=2.32s
soft-block suspected attempt=3 sleep=4.79s
That output is useful because it tells you:
- what failed
- which attempt you are on
- how long the scraper is pausing
A silent retry loop is dangerous. If you do not log retries, you cannot distinguish “slow but healthy” from “quietly broken.”
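In production, those `print()` calls are better replaced with the standard `logging` module so retries can be filtered, timestamped, and shipped with the rest of your logs. A minimal sketch:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("scraper.retry")

def log_retry(reason: str, attempt: int, delay: float) -> None:
    # Same information as the print() calls, now filterable by logger name.
    log.warning("retry reason=%s attempt=%d sleep=%.2fs", reason, attempt, delay)

log_retry("status=429", 1, 1.13)
```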
Why 404 should usually fail fast
This is one of the most common mistakes in scraper codebases.
People write broad retry wrappers that treat every non-200 as retryable.
That is wrong.
If a page is genuinely gone, retrying five times does not improve reliability. It increases latency and hides the real issue.
For example, if you are scraping product detail pages from a catalog and a product is deleted, your correct outcome is:
- mark the URL as missing
- store that result cleanly
- move on
Not:
- retry for 45 seconds
- then throw a generic error
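A sketch of the "record it cleanly" path — `FetchResult` and `handle_status` are hypothetical names for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FetchResult:
    url: str
    status: str             # "ok" | "missing" | "error"
    html: Optional[str] = None

def handle_status(url: str, status_code: int, body: str) -> FetchResult:
    """Map a final HTTP status to a clean record instead of a retry loop."""
    if status_code in (404, 410):
        return FetchResult(url, "missing")        # store the fact and move on
    if status_code == 200:
        return FetchResult(url, "ok", html=body)
    return FetchResult(url, "error")

result = handle_status("https://example.com/p/123", 404, "")
```

A "missing" row in your dataset is honest and queryable; a generic exception after 45 seconds is neither.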
Why 429 deserves special treatment
A 429 Too Many Requests is not the same as a random 500.
It means the target is telling you, clearly, that your request rate is the problem.
So the right response is:
- retry
- wait longer than normal
- reduce concurrency if the pattern persists
Here is a simple way to add a longer delay for 429s:
def retry_delay_for_status(status: int, attempt: int) -> float:
    if status == 429:
        return backoff_seconds(attempt, base=3.0, cap=60.0)
    return backoff_seconds(attempt)
Then plug it into your fetcher:
if status in RETRY_STATUSES:
    delay = retry_delay_for_status(status, attempt)
    print(f"retryable status={status} attempt={attempt} sleep={delay:.2f}s")
    time.sleep(delay)
    continue
That one change makes your scraper much less likely to spiral into self-inflicted throttling.
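One refinement worth considering: some 429 responses carry a `Retry-After` header telling you exactly how long to wait. It can be either a number of seconds or an HTTP-date; this sketch (the function name is mine) honors only the seconds form and falls back to your own backoff otherwise:

```python
from typing import Optional

def delay_from_retry_after(header_value: Optional[str], fallback: float) -> float:
    """Use a numeric Retry-After value if present, otherwise our own backoff.

    Never waits less than the fallback, so the server's hint can only
    slow us down, not speed us up.
    """
    if header_value is None:
        return fallback
    try:
        return max(float(header_value), fallback)
    except ValueError:
        # HTTP-date form (or garbage): ignore and use our backoff.
        return fallback

# In the fetcher, for a 429:
# delay = delay_from_retry_after(response.headers.get("Retry-After"),
#                                backoff_seconds(attempt))
```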
Soft blocks are the sneaky failure mode
HTTP status codes are only half the story.
A lot of block pages return 200 OK.
That means this can happen:
- request succeeds
- parser finds zero target elements
- exporter writes empty rows
- dashboard says the job passed
That is not success. That is silent corruption.
Your fetch layer should reject obviously bad HTML before the parser sees it.
A few common signals:
- tiny page size
- “enable javascript” wall
- “access denied” text
- “verify you are human” challenge page
- unexpected template missing your expected anchors
If your target normally has 30 product cards and suddenly there are zero, that should be suspicious by default.
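That suspicion can be encoded as a cheap check before the parser ever runs. A stdlib-only sketch (names and thresholds are mine; a real version might count CSS-selector matches instead of regex hits):

```python
import re

def looks_credible(html: str, marker_pattern: str, min_count: int) -> bool:
    """Reject pages missing the anchors we normally expect.

    Counts occurrences of a marker that should appear on real pages
    (e.g. a product-card class name) and rejects the page when the
    count is suspiciously low or the page is tiny.
    """
    if not html or len(html.strip()) < 500:
        return False
    found = len(re.findall(marker_pattern, html, flags=re.IGNORECASE))
    return found >= min_count

good_page = '<div class="product-card">A</div>' * 30
print(looks_credible(good_page, r'class="product-card"', 10))  # → True
```

Treat a failed credibility check the same way as a soft block: retry a limited number of times, then record the failure honestly.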
Adding ProxiesAPI to the same retry policy
The nice part about a good retry policy is that it does not care whether you are fetching directly or via a proxy API.
You only change the URL construction.
The ProxiesAPI format is:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
Here is the equivalent in Python:
from urllib.parse import quote_plus
def build_proxiesapi_url(target_url: str, api_key: str) -> str:
    encoded = quote_plus(target_url)
    return f"http://api.proxiesapi.com/?key={api_key}&url={encoded}"
target = "https://example.com/products"
proxy_url = build_proxiesapi_url(target, "API_KEY")
html = fetch_html(proxy_url)
print(html[:300])
That is exactly how a stable scraper should evolve.
First fix your retry behavior. Then swap the transport layer when direct requests are no longer reliable enough.
A complete practical example
Let’s say you are scraping a category page and extracting article links.
from bs4 import BeautifulSoup
def parse_links(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if href and href.startswith("http"):
            links.append(href)
    return links
if __name__ == "__main__":
    url = "https://example.com/blog"
    html = fetch_html(url)
    links = parse_links(html)
    print(f"found {len(links)} links")
    print(links[:5])
Example output:
found 42 links
['https://example.com/post-1', 'https://example.com/post-2', 'https://example.com/post-3']
The important thing is not the parser.
The important thing is that your parser only runs after the network layer has decided the response is credible.
Recommended defaults for most scrapers
If you need a starting point, use this:
- max attempts: 5
- connect timeout: 10s
- read timeout: 30s
- retry statuses: 408, 429, 500, 502, 503, 504
- fail-fast statuses: 400, 401, 403, 404, 410, 422
- backoff: exponential with jitter
- log every retry
- treat tiny or challenge pages as soft blocks
These defaults will not solve every site.
But they will eliminate the most common reliability mistakes.
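Those defaults fit naturally into a single config object you can pass around, log, and override per site. A sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryConfig:
    """The recommended defaults above, as one immutable object."""
    max_attempts: int = 5
    connect_timeout: float = 10.0
    read_timeout: float = 30.0
    retry_statuses: frozenset = frozenset({408, 429, 500, 502, 503, 504})
    fail_fast_statuses: frozenset = frozenset({400, 401, 403, 404, 410, 422})
    backoff_base: float = 1.0
    backoff_cap: float = 30.0

config = RetryConfig()
```

Freezing the dataclass means a per-site override is a deliberate `dataclasses.replace(...)`, not a silent mutation somewhere in the crawl loop.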
The real principle
A retry policy is not there to hide failures.
It is there to separate:
- brief turbulence you should absorb
- from real failures you should record honestly
That distinction is what makes the difference between a scraper that looks busy and a scraper you can trust.
If you get that right, everything else gets easier:
- cleaner metrics
- fewer false alarms
- better datasets
- faster debugging
And if direct requests stop being predictable, you can keep the same policy and point it at a ProxiesAPI URL instead of rebuilding your whole stack.
That is the kind of engineering choice that compounds.