Retry Policies for Web Scrapers: What to Retry vs Fail Fast

When a scraper fails, the instinct is usually wrong.

People either:

  • retry everything forever
  • fail on the first timeout
  • or silently accept empty HTML as success

That is how you end up with bad datasets, fake ranking drops, and jobs that look green in the dashboard while your outputs are garbage.

A good retry policy is not about “trying harder.” It is about being selective.

In this guide, we’ll build a retry policy for Python scrapers that answers the only question that matters:

What should you retry, and what should fail fast?

Make scraper retries boring with a stable fetch layer

Once your retry logic is sane, the next bottleneck is network consistency. ProxiesAPI gives you a simple fetch endpoint you can plug into the same retry wrapper without rebuilding your whole scraper.


The core idea

Not every error means the same thing.

A scraper failure usually falls into one of four buckets:

  1. Transient network failure — DNS error, connection reset, read timeout
  2. Temporary upstream failure — 502, 503, 504, occasional 429
  3. Permanent response — 404, 410, malformed URL, bad auth
  4. Soft block / fake success — HTTP 200 but the HTML is useless

Your policy should treat each bucket differently.

If you retry permanent failures, you waste time and hammer the target.

If you do not retry transient failures, you create false negatives.

If you accept soft-blocked HTML as success, you poison your own data.


What to retry vs fail fast

Here is the practical matrix I recommend for most scrapers.

Condition | Default action | Why
Connection error / timeout | Retry | Often transient
HTTP 408 | Retry | Request timeout usually recovers
HTTP 429 | Retry with longer delay | You were rate limited
HTTP 500 / 502 / 503 / 504 | Retry | Upstream instability
HTTP 404 | Fail fast | Usually permanent
HTTP 410 | Fail fast | Explicitly gone
HTTP 401 / 403 | Usually fail fast | Often auth or block issue
HTTP 200 with block page | Retry a limited number of times | It is not real content

That “usually” on 403 matters.

Sometimes a 403 is transient on a site behind a flaky edge rule. But you should only retry it if you have evidence that the same request occasionally succeeds. Otherwise, repeated 403 retries are just noise.


Start with explicit timeouts

A retry policy without timeouts is fake reliability.

If you do this:

requests.get(url)

that request can hang forever.

Use explicit connect and read timeouts instead:

TIMEOUT = (10, 30)  # connect timeout, read timeout

That means:

  • if the connection cannot be established within 10 seconds, bail
  • if the server goes more than 30 seconds without sending any data mid-response, bail (the read timeout applies to gaps between bytes, not to the total download time)

Those are sane defaults for most scraping jobs.


A reusable retry helper in Python

This example uses requests, exponential backoff, and a soft-block detector. It is designed to be dropped into a normal scraper without extra dependencies.

import random
import re
import time
import requests
from requests import Response
from requests.exceptions import RequestException, Timeout, ConnectionError

TIMEOUT = (10, 30)
RETRY_STATUSES = {408, 429, 500, 502, 503, 504}
FAIL_FAST_STATUSES = {400, 401, 403, 404, 410, 422}
SOFT_BLOCK_PATTERNS = [
    r"enable javascript",
    r"access denied",
    r"verify you are human",
    r"unusual traffic",
    r"temporarily unavailable",
]

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0"
})


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    exp = min(cap, base * (2 ** (attempt - 1)))
    jitter = random.uniform(0, exp * 0.25)
    return exp + jitter


def looks_soft_blocked(html: str) -> bool:
    # Very small responses are almost never real content pages.
    if not html or len(html.strip()) < 500:
        return True

    lowered = html.lower()
    for pattern in SOFT_BLOCK_PATTERNS:
        if re.search(pattern, lowered):
            return True

    return False


def fetch_html(url: str, max_attempts: int = 5) -> str:
    last_error = None

    for attempt in range(1, max_attempts + 1):
        try:
            response: Response = session.get(url, timeout=TIMEOUT)
            status = response.status_code

            if status in FAIL_FAST_STATUSES:
                raise RuntimeError(f"fail-fast status {status} for {url}")

            if status in RETRY_STATUSES:
                last_error = RuntimeError(f"retryable status {status} for {url}")
                delay = backoff_seconds(attempt)
                print(f"retryable status={status} attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            response.raise_for_status()
            html = response.text

            if looks_soft_blocked(html):
                last_error = RuntimeError(f"soft block suspected for {url}")
                delay = backoff_seconds(attempt)
                print(f"soft-block suspected attempt={attempt} sleep={delay:.2f}s")
                time.sleep(delay)
                continue

            return html

        except (Timeout, ConnectionError) as exc:
            last_error = exc
            delay = backoff_seconds(attempt)
            print(f"network error attempt={attempt} sleep={delay:.2f}s err={exc}")
            time.sleep(delay)

        except RequestException as exc:
            last_error = exc
            delay = backoff_seconds(attempt)
            print(f"request error attempt={attempt} sleep={delay:.2f}s err={exc}")
            time.sleep(delay)

    raise RuntimeError(f"failed after {max_attempts} attempts: {last_error}")

Example terminal output

Here is the kind of output you actually want during a flaky crawl:

retryable status=429 attempt=1 sleep=1.13s
retryable status=502 attempt=2 sleep=2.32s
soft-block suspected attempt=3 sleep=4.79s

That output is useful because it tells you:

  • what failed
  • which attempt you are on
  • how long the scraper is pausing

A silent retry loop is dangerous. If you do not log retries, you cannot distinguish “slow but healthy” from “quietly broken.”
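In production you would typically route those lines through the logging module instead of print, so retries land in the same place as the rest of your logs. A minimal sketch; the logger name is an arbitrary choice:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("scraper.retry")


def log_retry(reason: str, attempt: int, delay: float) -> None:
    # One line per retry: what failed, which attempt, how long we pause.
    log.warning("%s attempt=%d sleep=%.2fs", reason, attempt, delay)


log_retry("retryable status=429", 1, 1.13)
```

Using WARNING for retries keeps them visible even when the rest of the scraper logs at INFO.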


Why 404 should usually fail fast

This is one of the most common mistakes in scraper codebases.

People write broad retry wrappers that treat every non-200 as retryable.

That is wrong.

If a page is genuinely gone, retrying five times does not improve reliability. It increases latency and hides the real issue.

For example, if you are scraping product detail pages from a catalog and a product is deleted, your correct outcome is:

  • mark the URL as missing
  • store that result cleanly
  • move on

Not:

  • retry for 45 seconds
  • then throw a generic error

Why 429 deserves special treatment

A 429 Too Many Requests is not the same as a random 500.

It means the target is telling you, clearly, that your request rate is the problem.

So the right response is:

  • retry
  • wait longer than normal
  • reduce concurrency if the pattern persists

Here is a simple way to add a longer delay for 429s:

def retry_delay_for_status(status: int, attempt: int) -> float:
    if status == 429:
        return backoff_seconds(attempt, base=3.0, cap=60.0)
    return backoff_seconds(attempt)

Then plug it into your fetcher:

if status in RETRY_STATUSES:
    delay = retry_delay_for_status(status, attempt)
    print(f"retryable status={status} attempt={attempt} sleep={delay:.2f}s")
    time.sleep(delay)
    continue

That one change makes your scraper much less likely to spiral into self-inflicted throttling.
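Many servers also send a Retry-After header with a 429, and honoring it when present is usually politer than guessing with your own backoff. A sketch, assuming the header carries a plain seconds value (it can also be an HTTP date, which this deliberately ignores):

```python
def delay_from_retry_after(response, fallback: float, cap: float = 120.0) -> float:
    """Prefer the server's Retry-After hint over our computed backoff."""
    header = response.headers.get("Retry-After")
    if header and header.isdigit():
        return min(float(header), cap)  # trust the server's own pacing hint
    return fallback
```

You would call it with your computed backoff as the fallback, e.g. `delay_from_retry_after(response, backoff_seconds(attempt))`.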


Soft blocks are the sneaky failure mode

HTTP status codes are only half the story.

A lot of block pages return 200 OK.

That means this can happen:

  • request succeeds
  • parser finds zero target elements
  • exporter writes empty rows
  • dashboard says the job passed

That is not success. That is silent corruption.

Your fetch layer should reject obviously bad HTML before the parser sees it.

A few common signals:

  • tiny page size
  • “enable javascript” wall
  • “access denied” text
  • “verify you are human” challenge page
  • unexpected template missing your expected anchors

If your target normally has 30 product cards and suddenly there are zero, that should be suspicious by default.
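That structural check can live right next to the text-pattern check. A sketch; the anchor pattern and threshold are assumptions about your particular target:

```python
import re


def looks_structurally_empty(html: str, anchor_pattern: str, minimum: int = 1) -> bool:
    """Flag pages missing the elements we normally expect.

    anchor_pattern is a regex for a marker that appears once per target
    element, e.g. a product-card class name.
    """
    found = len(re.findall(anchor_pattern, html))
    return found < minimum
```

A page that fails this check gets the same treatment as a text-level soft block: a limited number of retries, then an honest failure.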


Adding ProxiesAPI to the same retry policy

The nice part about a good retry policy is that it does not care whether you are fetching directly or via a proxy API.

You only change the URL construction.

The ProxiesAPI format is:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

Here is the equivalent in Python:

from urllib.parse import quote_plus


def build_proxiesapi_url(target_url: str, api_key: str) -> str:
    encoded = quote_plus(target_url)
    return f"http://api.proxiesapi.com/?key={api_key}&url={encoded}"


target = "https://example.com/products"
proxy_url = build_proxiesapi_url(target, "API_KEY")
html = fetch_html(proxy_url)
print(html[:300])

That is exactly how a stable scraper should evolve.

First fix your retry behavior. Then swap the transport layer when direct requests are no longer reliable enough.


A complete practical example

Let’s say you are scraping a category page and extracting article links.

from bs4 import BeautifulSoup


def parse_links(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if href and href.startswith("http"):
            links.append(href)
    return links


if __name__ == "__main__":
    url = "https://example.com/blog"
    html = fetch_html(url)
    links = parse_links(html)
    print(f"found {len(links)} links")
    print(links[:5])

Example output:

found 42 links
['https://example.com/post-1', 'https://example.com/post-2', 'https://example.com/post-3', 'https://example.com/post-4', 'https://example.com/post-5']

The important thing is not the parser.

The important thing is that your parser only runs after the network layer has decided the response is credible.


If you need a starting point, use this:

  • max attempts: 5
  • connect timeout: 10s
  • read timeout: 30s
  • retry statuses: 408, 429, 500, 502, 503, 504
  • fail-fast statuses: 400, 401, 403, 404, 410, 422
  • backoff: exponential with jitter
  • log every retry
  • treat tiny or challenge pages as soft blocks

These defaults will not solve every site.

But they will eliminate the most common reliability mistakes.
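Those defaults also fit neatly into one config object you can pass around instead of scattering constants. The names here are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 5
    connect_timeout: float = 10.0
    read_timeout: float = 30.0
    retry_statuses: frozenset = frozenset({408, 429, 500, 502, 503, 504})
    fail_fast_statuses: frozenset = frozenset({400, 401, 403, 404, 410, 422})

    @property
    def timeout(self) -> tuple:
        # The (connect, read) tuple that requests expects.
        return (self.connect_timeout, self.read_timeout)


DEFAULT_POLICY = RetryPolicy()
```

A frozen dataclass makes the policy explicit, testable, and easy to override per site without editing the fetcher.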


The real principle

A retry policy is not there to hide failures.

It is there to separate:

  • brief turbulence you should absorb
  • from real failures you should record honestly

That distinction is what makes the difference between a scraper that looks busy and a scraper you can trust.

If you get that right, everything else gets easier:

  • cleaner metrics
  • fewer false alarms
  • better datasets
  • faster debugging

And if direct requests stop being predictable, you can keep the same policy and point it at a ProxiesAPI URL instead of rebuilding your whole stack.

That is the kind of engineering choice that compounds.

