How to Scrape Data Without Getting Blocked (Practical Playbook)

If you’ve ever built a scraper that works perfectly for 50 requests… and collapses at 5,000, you already know the truth:

Getting blocked is the default outcome of naive scraping.

This playbook is the practical version — the stuff that actually reduces blocks in production:

  • rate limiting (the #1 fix)
  • realistic headers + sessions
  • retries + backoff
  • caching and incremental updates
  • proxy rotation (when you truly need it)
  • browser fallback (when HTTP scraping isn’t enough)
  • monitoring so you know when things break

Stabilize long scrapes with ProxiesAPI

When blocks start (403/429) and reliability matters, ProxiesAPI helps by providing a more stable fetch layer while you keep your parsing logic unchanged.


The blocking spectrum (know what you’re fighting)

Sites block scrapers in a few common ways:

  1. Soft throttling: slower responses, timeouts
  2. HTTP rate limits: 429 Too Many Requests
  3. Hard blocks: 403 Forbidden, captcha pages
  4. Content poisoning: you get “valid HTML” but it’s a block page
  5. Behavioral detection: fingerprinting, JS challenges

Your goal isn’t “never get blocked.” Your goal is:

  • detect blocks quickly
  • recover automatically
  • keep your dataset quality high

Rule 1: Slow down (most people don’t)

If you do only one thing, do this:

  • add delays
  • lower concurrency
  • spread work over time

The fastest way to get blocked is “no delay + high parallelism + instant retries with no backoff”.

A simple rate limiter

import random
import time


def sleep_jitter(min_s=1.0, max_s=2.5):
    time.sleep(random.uniform(min_s, max_s))

Use it between requests.

For bigger crawls, consider token-bucket rate limiting or per-host limits.
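
For reference, here is a minimal sketch of a per-host token bucket using only the standard library. The class names (`TokenBucket`, `PerHostLimiter`) and the default rate are illustrative assumptions, not part of any library, and the sketch is single-threaded; add a lock if several threads share a limiter.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity            # start full so the first request is immediate
        self.updated = time.monotonic()

    def acquire(self) -> float:
        """Block until one token is available; return the seconds slept."""
        slept = 0.0
        while True:
            now = time.monotonic()
            # Refill in proportion to the elapsed time, never above capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return slept
            wait = (1 - self.tokens) / self.rate
            time.sleep(wait)
            slept += wait


class PerHostLimiter:
    """Keep one bucket per hostname so a slow host does not throttle the others."""

    def __init__(self, rate: float = 0.5, capacity: float = 2):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def wait(self, url: str) -> float:
        host = urlsplit(url).netloc
        if host not in self.buckets:
            self.buckets[host] = TokenBucket(self.rate, self.capacity)
        return self.buckets[host].acquire()
```

Call `limiter.wait(url)` right before each request. With the defaults above, each host gets roughly one request every two seconds after an initial burst of two.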


Rule 2: Use sessions + consistent headers

Common bot smells:

  • a new connection for every request
  • no Accept-Language header
  • a missing or default-library User-Agent (for example python-requests/x.y)

Use a requests.Session() and a sane header set.

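import requests

# One Session reuses connections (keep-alive) and carries cookies across requests.
session = requests.Session()

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# Set the headers once on the session so every later request sends the same set.
session.headers.update(HEADERS)

# timeout=(connect, read) in seconds
r = session.get("https://example.com", timeout=(10, 30))
r.raise_for_status()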

Don’t randomize everything. Consistency often looks more human.


Rule 3: Timeouts + retries (with backoff)

Blocks often show up as transient failures. Your scraper should be able to recover.

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class FetchError(Exception):
    pass

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(FetchError),
)
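def fetch(session: requests.Session, url: str) -> str:
    try:
        r = session.get(url, timeout=(10, 30))  # (connect, read) timeouts in seconds
    except (requests.Timeout, requests.ConnectionError) as exc:
        # Network failures are transient: convert them so tenacity retries them too.
        raise FetchError(f"network error: {exc}") from exc

    # Blocks and rate limits: retry after backoff.
    if r.status_code in (403, 429):
        raise FetchError(f"blocked: {r.status_code}")

    # Server-side errors are usually transient as well.
    if r.status_code >= 500:
        raise FetchError(f"server error: {r.status_code}")

    # Remaining 4xx responses (e.g. 404) raise HTTPError, which is not retried.
    r.raise_for_status()
    return r.text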

Key idea: don’t retry 404s; do retry 403/429/5xx/timeouts.
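
Servers that answer with 429 (or 503) sometimes include a Retry-After header, given either as a number of seconds or as an HTTP date. Honoring it is usually better than guessing. Below is a minimal sketch of a parser for that header; `retry_after_seconds`, its default, and its cap are illustrative assumptions, and wiring the result into your retry loop is left to you.

```python
import email.utils
import time
from datetime import timezone


def retry_after_seconds(header_value, default=5.0, cap=300.0, now=None):
    """Turn a Retry-After header value into a delay in seconds.

    Falls back to `default` when the header is missing or unparseable,
    and clamps the result to the range [0, cap].
    """
    if not header_value:
        return default

    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in whole seconds, e.g. "120".
        delay = float(value)
    else:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = email.utils.parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = time.time() if now is None else now
        delay = when.timestamp() - now

    return max(0.0, min(delay, cap))
```

With requests, the header would be read as `r.headers.get("Retry-After")` before sleeping and retrying.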


Rule 4: Detect block pages (content poisoning)

Sometimes you get status 200 but the content is:

  • a captcha
  • “Access denied”
  • an interstitial

Add a lightweight detector:

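# Markers are compared against the lowercased HTML. Broad ones such as "blocked"
# can also appear on legitimate pages, so tune this list per site.
BLOCK_MARKERS = (
    "captcha",
    "access denied",
    "unusual traffic",
    "verify you are a human",
    "blocked",
)


def looks_blocked(html: str) -> bool:
    h = html.lower()
    return any(marker in h for marker in BLOCK_MARKERS)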

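If looks_blocked(html) is true, treat it like a 403: raise the same FetchError so the retry and backoff logic handles it, and never store the page as real data.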


Rule 5: Cache aggressively (scrape less)

If your job re-fetches the same pages repeatedly, you’ll:

  • waste money
  • increase blocks
  • slow down pipelines

Caching options (ascending complexity):

  • save raw HTML files to disk
  • store responses in SQLite
  • store “last-seen” hashes per URL and only re-parse on change

Even a simple file cache helps:

from pathlib import Path
import hashlib

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)


def cache_path(url: str) -> Path:
    h = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{h}.html"
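
To show how `cache_path` might be used, here is a minimal sketch of a read-through file cache with a maximum age. `cached_fetch`, `max_age_s`, and the `cache_dir` parameter are illustrative assumptions; `fetch` is whatever callable you already use to download a page.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional

CACHE_DIR = Path(".cache")


def cache_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    h = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{h}.html"


def cached_fetch(
    url: str,
    fetch: Callable[[str], str],
    max_age_s: Optional[float] = 86400,
    cache_dir: Path = CACHE_DIR,
) -> str:
    """Return cached HTML if it is fresh enough; otherwise fetch and store it."""
    path = cache_path(url, cache_dir)
    if path.exists():
        age = time.time() - path.stat().st_mtime
        # max_age_s=None means "cache forever".
        if max_age_s is None or age <= max_age_s:
            return path.read_text(encoding="utf-8")

    html = fetch(url)  # only hit the network on a miss or a stale entry
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Run block-page detection before writing to the cache, otherwise a captcha page can be served from disk for the whole `max_age_s` window.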

When to use proxies (and when not to)

Proxies are not a magic “unblock button”. Use them when:

  • you’re doing high-volume crawling
  • the site rate-limits per IP
  • you need geographic access

Don’t use them to compensate for bad behavior:

  • zero delays
  • aggressive concurrency
  • no caching

How ProxiesAPI fits

ProxiesAPI can help when you start seeing persistent 403/429 and timeouts.

The clean integration approach is:

  • keep your parsing and scheduling code unchanged
  • route the fetch step through ProxiesAPI

import os
import requests

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")


def fetch_via_proxiesapi(session: requests.Session, url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY")

    r = session.get(
        "https://api.proxiesapi.com",
        params={"auth_key": PROXIESAPI_KEY, "url": url},
        timeout=(10, 30),
    )
    r.raise_for_status()
    return r.text

(Adjust params to match your ProxiesAPI docs/plan.)


Browser fallback: Playwright for JS-heavy sites

If a site is JS-heavy, HTTP scraping may never work reliably.

Playwright pattern:

from playwright.sync_api import sync_playwright


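def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle; it can be slow or
            # time out on pages that poll, so "load" plus an explicit wait is an option.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            # Close the browser even if navigation fails.
            browser.close()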

Use browsers selectively — they are slower and more detectable at scale.
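
One way to keep browser use selective is to try plain HTTP first and render only when that fails or returns an unusable page. The following is a minimal sketch under that assumption; `fetch_with_fallback`, `http_fetch`, `browser_fetch`, and `needs_browser` are hypothetical names that you would bind to your own functions.

```python
from typing import Callable, Tuple


def fetch_with_fallback(
    url: str,
    http_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
    needs_browser: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (html, source), where source is "http" or "browser"."""
    try:
        html = http_fetch(url)
    except Exception:
        # Broad catch for brevity; narrow it to your fetcher's error types.
        html = None

    if html is not None and not needs_browser(html):
        return html, "http"

    # HTTP failed or returned an unusable page: pay the cost of a real browser.
    return browser_fetch(url), "browser"
```

A reasonable `needs_browser` check is "the element I want to parse is missing", which catches both block pages and empty JavaScript shells. Logging the returned source shows how often the expensive path is taken.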


Monitoring: know when you’re getting blocked

Add basic metrics:

  • response status counts
  • average latency
  • block-page detector hits
  • rows extracted per run

Even logging to a CSV can be enough to catch regressions.
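
As an illustration, the metrics above can be collected in a small in-memory object and appended to a CSV once per run. This is a sketch only; `RunStats`, its field names, and the CSV columns are assumptions to adapt, not a standard format.

```python
import csv
import time
from collections import Counter
from pathlib import Path

FIELDS = ["ts", "requests", "ok", "blocked_status", "server_errors",
          "block_page_hits", "avg_latency_s", "rows"]


class RunStats:
    def __init__(self):
        self.status_counts = Counter()
        self.latencies = []
        self.block_page_hits = 0
        self.rows = 0

    def record(self, status: int, latency_s: float,
               block_page: bool = False, rows: int = 0) -> None:
        """Call once per response."""
        self.status_counts[status] += 1
        self.latencies.append(latency_s)
        self.block_page_hits += int(block_page)
        self.rows += rows

    def summary(self) -> dict:
        c = self.status_counts
        n = sum(c.values())
        return {
            "ts": int(time.time()),
            "requests": n,
            "ok": sum(v for s, v in c.items() if 200 <= s < 300),
            "blocked_status": c[403] + c[429],
            "server_errors": sum(v for s, v in c.items() if s >= 500),
            "block_page_hits": self.block_page_hits,
            "avg_latency_s": round(sum(self.latencies) / n, 3) if n else 0.0,
            "rows": self.rows,
        }

    def append_csv(self, path) -> None:
        """Append one summary row, writing the header if the file is new."""
        path = Path(path)
        is_new = not path.exists()
        with path.open("a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if is_new:
                writer.writeheader()
            writer.writerow(self.summary())
```

Comparing `rows` and `blocked_status` between runs is often enough to spot a regression: a run with normal status codes but far fewer rows usually means block pages or a layout change.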


Practical checklist (copy/paste)

Before you run a scraper in production:

  • connect/read timeouts set
  • retries with exponential backoff
  • session + realistic headers
  • per-host rate limiting
  • cache or incremental crawl strategy
  • block page detection
  • proxy layer ready (if needed)
  • browser fallback plan (if JS-heavy)
  • alerts on sudden failure spikes

Bottom line

You avoid blocks by being boring:

  • steady rate
  • consistent requests
  • fewer unnecessary fetches

And when you need more reliability at scale:

  • add a proxy layer like ProxiesAPI
  • keep parsing deterministic
  • monitor and adapt