How to Scrape Data Without Getting Blocked (Practical Playbook)

May 03, 2026 · guide · #web-scraping, #anti-bot, #proxies, #python, #playwright, #rate-limiting, #retries

If you’ve ever built a scraper that works perfectly for 50 requests… and collapses at 5,000, you already know the truth:

Getting blocked is the default outcome of naive scraping.

This playbook is the practical version, covering what actually reduces blocks in production:

rate limiting (the #1 fix)
realistic headers + sessions
retries + backoff
caching and incremental updates
proxy rotation (when you truly need it)
browser fallback (when HTTP scraping isn’t enough)
monitoring so you know when things break

Stabilize long scrapes with ProxiesAPI

When blocks start (403/429) and reliability matters, ProxiesAPI helps by providing a more stable fetch layer while you keep your parsing logic unchanged.


The blocking spectrum (know what you’re fighting)

Sites block scrapers in a few common ways:

Soft throttling: slower responses, timeouts
HTTP rate limits: 429 Too Many Requests
Hard blocks: 403 Forbidden, captcha pages
Content poisoning: you get “valid HTML” but it’s a block page
Behavioral detection: fingerprinting, JS challenges

Your goal isn’t “never get blocked.” Your goal is:

detect blocks quickly
recover automatically
keep your dataset quality high

Rule 1: Slow down (most people don’t)

If you do only one thing, do this:

add delays
lower concurrency
spread work over time

The fastest way to get blocked is “no delay + high parallelism + no retries”.

A simple rate limiter

import random
import time


def sleep_jitter(min_s=1.0, max_s=2.5):
    time.sleep(random.uniform(min_s, max_s))

Call it between consecutive requests so they are neither back-to-back nor perfectly evenly spaced.

For bigger crawls, consider token-bucket rate limiting or per-host limits.
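A token bucket with per-host limits could look roughly like the sketch below. The class names (TokenBucket, HostLimiter) and the default rate are illustrative assumptions, not part of any library; the clock and sleep functions are injectable only so the limiter can be exercised without real waiting.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so the first request is not delayed
        self.clock = clock
        self.sleep = sleep
        self.updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last update, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def acquire(self) -> float:
        """Block until one token is available; return the seconds waited."""
        self._refill()
        waited = 0.0
        if self.tokens < 1:
            waited = (1 - self.tokens) / self.rate
            self.sleep(waited)
            self._refill()
        self.tokens -= 1
        return waited


class HostLimiter:
    """Keep one bucket per hostname so a slow site does not throttle the others."""

    def __init__(self, rate: float = 0.5, capacity: float = 1.0):
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def wait(self, url: str) -> float:
        host = urlparse(url).netloc
        if host not in self.buckets:
            self.buckets[host] = TokenBucket(self.rate, self.capacity)
        return self.buckets[host].acquire()
```

Calling limiter.wait(url) before each request holds every host to its own budget. The default of 0.5 requests per second is only a placeholder; a sensible value depends on the site.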

Rule 2: Use sessions + consistent headers

Common bot smells:

a new connection for every request
a missing Accept-Language header
a missing or implausible User-Agent (for example, the python-requests default)

Use a requests.Session() and a sane header set.

import requests

session = requests.Session()

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# Set the headers once on the session so every later request reuses them.
session.headers.update(HEADERS)

r = session.get("https://example.com", timeout=(10, 30))
r.raise_for_status()

Don’t randomize everything. A User-Agent that changes on every request from the same IP and cookie jar is a signal in itself; consistency often looks more human.

Rule 3: Timeouts + retries (with backoff)

Blocks often show up as transient failures. Your scraper should be able to recover.

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class FetchError(Exception):
    pass

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
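    # Retry on block/server errors and on network-level timeouts or dropped connections.
    retry=retry_if_exception_type((FetchError, requests.Timeout, requests.ConnectionError)),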
)
def fetch(session: requests.Session, url: str) -> str:
    r = session.get(url, timeout=(10, 30))

    if r.status_code in (403, 429):
        raise FetchError(f"blocked: {r.status_code}")

    if r.status_code >= 500:
        raise FetchError(f"server error: {r.status_code}")

    r.raise_for_status()
    return r.text

Key idea: don’t retry 404s; do retry 403/429/5xx/timeouts.
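Some servers attach a Retry-After header to 429 (and occasionally 503) responses, either as a number of seconds or as an HTTP date. When it is present, waiting that long is usually a better choice than a generic backoff. A minimal sketch of parsing it follows; the function name and the 120-second cap are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str],
    now: Optional[datetime] = None,
    cap: float = 120.0,
) -> Optional[float]:
    """Return how long to wait based on a Retry-After value, or None if unusable."""
    if not header_value:
        return None

    value = header_value.strip()

    # Form 1: delay in whole seconds, e.g. "30".
    if value.isdigit():
        return min(float(value), cap)

    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None

    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)

    # Never return a negative wait, and never wait longer than the cap.
    return min(max((when - now).total_seconds(), 0.0), cap)
```

A caller might pass r.headers.get("Retry-After") and sleep for the returned value before raising the retryable error, falling back to exponential backoff when the result is None.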

Rule 4: Detect block pages (content poisoning)

Sometimes you get status 200 but the content is:

a captcha
“Access denied”
an interstitial

Add a lightweight detector:

def looks_blocked(html: str) -> bool:
    h = html.lower()
    return any(x in h for x in [
        "captcha",
        "access denied",
        "unusual traffic",
        "verify you are a human",
        "blocked",
    ])

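If looks_blocked(html) is true, treat it like a 403: raise the same retryable error so the backoff logic applies. Broad markers such as "blocked" can also match legitimate pages, so tune the list for each site.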
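Phrase matching alone tends to produce false positives, so one option is to also check for content that a real page must contain. The sketch below assumes you know a marker string for each target (for example, a CSS class that always appears on product pages). The function name, the phrase list, and the 500-character threshold are illustrative assumptions.

```python
from typing import Iterable

BLOCK_PHRASES = (
    "captcha",
    "access denied",
    "unusual traffic",
    "verify you are a human",
)


def classify_page(
    html: str,
    required_markers: Iterable[str] = (),
    min_length: int = 500,
) -> str:
    """Return "ok", "blocked", or "suspicious" for a fetched page."""
    h = html.lower()
    required = [m.lower() for m in required_markers]

    # Positive check first: if everything we expect is present, trust the page
    # even when a block phrase happens to appear somewhere in it.
    if required and all(m in h for m in required):
        return "ok"

    if any(phrase in h for phrase in BLOCK_PHRASES):
        return "blocked"

    # Expected content is missing, or the page is too short to be real.
    if required or len(html) < min_length:
        return "suspicious"

    return "ok"
```

Pages classified as "suspicious" are worth logging and sampling by hand before deciding whether to retry them.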

Rule 5: Cache aggressively (scrape less)

If your job re-fetches the same pages repeatedly, you’ll:

waste money
increase blocks
slow down pipelines

Caching options (ascending complexity):

save raw HTML files to disk
store responses in SQLite
store “last-seen” hashes per URL and only re-parse on change

Even a simple file cache helps:

from pathlib import Path
import hashlib

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)


def cache_path(url: str) -> Path:
    h = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{h}.html"
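To make the cache usable, the path helper needs a read/write wrapper around the fetch step. A minimal sketch follows. cached_fetch, its max_age_s parameter, and the extra cache_dir argument on cache_path are assumptions added for illustration, and fetch_fn stands for whatever fetch function you already have.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional

CACHE_DIR = Path(".cache")


def cache_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    h = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{h}.html"


def cached_fetch(
    url: str,
    fetch_fn: Callable[[str], str],
    max_age_s: Optional[float] = 24 * 3600,
    cache_dir: Path = CACHE_DIR,
) -> str:
    """Return cached HTML if it is fresh enough; otherwise fetch and store it."""
    path = cache_path(url, cache_dir)

    if path.exists():
        age = time.time() - path.stat().st_mtime
        # max_age_s=None means "never expire".
        if max_age_s is None or age <= max_age_s:
            return path.read_text(encoding="utf-8")

    html = fetch_fn(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

It is usually safer to run block-page detection inside fetch_fn, so that a captcha page raises before it can be written to the cache.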

When to use proxies (and when not to)

Proxies are not a magic “unblock button”. Use them when:

you’re doing high-volume crawling
the site rate-limits per IP
you need geographic access

Don’t use them to compensate for bad behavior:

zero delays
aggressive concurrency
no caching
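When proxies are warranted and you manage your own pool, rotation can be as simple as cycling through a list and passing each entry to the proxies argument that requests accepts. The sketch below assumes a pool of proxy URLs; the addresses shown in the usage note are placeholders, not real endpoints.

```python
import itertools
from typing import Dict, Iterator, Sequence


def proxy_cycle(proxy_urls: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Yield requests-style proxy mappings in round-robin order, forever."""
    if not proxy_urls:
        raise ValueError("proxy_urls must not be empty")
    for proxy in itertools.cycle(proxy_urls):
        # requests expects one entry per URL scheme.
        yield {"http": proxy, "https": proxy}
```

A caller might build proxies = proxy_cycle(["http://user:pass@proxy-1.example:8080", "http://user:pass@proxy-2.example:8080"]) and then call session.get(url, proxies=next(proxies), timeout=(10, 30)). Rate limits still apply per proxy, so rotation complements throttling rather than replacing it.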

How ProxiesAPI fits

ProxiesAPI can help when you start seeing persistent 403/429 and timeouts.

The clean integration approach is:

keep your parsing and scheduling code unchanged
route the fetch step through ProxiesAPI

import os
import requests

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")


def fetch_via_proxiesapi(session: requests.Session, url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY")

    r = session.get(
        "https://api.proxiesapi.com",
        params={"auth_key": PROXIESAPI_KEY, "url": url},
        timeout=(10, 30),
    )
    r.raise_for_status()
    return r.text

(Adjust params to match your ProxiesAPI docs/plan.)

Browser fallback: Playwright for JS-heavy sites

If a site is JS-heavy, HTTP scraping may never work reliably.

Playwright pattern:

from playwright.sync_api import sync_playwright


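def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle; pages that poll constantly may need "load" instead.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            # Close the browser even if navigation fails or times out.
            browser.close()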

Use browsers selectively: they are slower and more detectable at scale.
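One way to keep browser use selective is to try plain HTTP first and launch a browser only when that fails or returns a block page. The sketch below takes the three pieces as arguments so it stays independent of any particular fetcher; fetch_with_fallback is a hypothetical name, and the callables stand for functions like the HTTP fetch, the rendered fetch, and the block detector described in this guide.

```python
from typing import Callable, Tuple


def fetch_with_fallback(
    url: str,
    http_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
    is_blocked: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (html, source), where source is "http" or "browser"."""
    try:
        html = http_fetch(url)
        if not is_blocked(html):
            return html, "http"
    except Exception:
        # Deliberately broad for the sketch: any HTTP-path failure triggers the fallback.
        pass

    # Slow path: only reached when the cheap path failed or looked blocked.
    return browser_fetch(url), "browser"
```

Recording the returned source per URL shows how often the slow path is taken, which is a useful signal that a site has tightened its defenses.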

Monitoring: know when you’re getting blocked

Add basic metrics:

response status counts
average latency
block-page detector hits
rows extracted per run

Even logging to a CSV can be enough to catch regressions.
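The metrics above can be collected with a small in-memory counter that appends one summary row per run to a CSV file. This is a sketch only; RunStats and its field names are assumptions made for illustration.

```python
import csv
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class RunStats:
    statuses: Counter = field(default_factory=Counter)
    latencies: List[float] = field(default_factory=list)
    block_page_hits: int = 0
    rows: int = 0

    def record(self, status: int, latency_s: float, blocked: bool = False, rows: int = 0) -> None:
        """Record one response: status code, latency, detector result, rows parsed."""
        self.statuses[status] += 1
        self.latencies.append(latency_s)
        self.block_page_hits += int(blocked)
        self.rows += rows

    def summary(self) -> Dict[str, float]:
        n = len(self.latencies)
        return {
            "requests": n,
            "http_403": self.statuses[403],
            "http_429": self.statuses[429],
            "block_page_hits": self.block_page_hits,
            "avg_latency_s": round(sum(self.latencies) / n, 3) if n else 0.0,
            "rows": self.rows,
        }

    def append_csv(self, path: Path) -> None:
        """Append this run's summary as one row, writing the header for a new file."""
        summary = self.summary()
        is_new = not path.exists()
        with path.open("a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(summary))
            if is_new:
                writer.writeheader()
            writer.writerow(summary)
```

Comparing the latest row against the previous few runs is often enough to spot a jump in 403/429 counts or a sudden drop in extracted rows.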

Practical checklist (copy/paste)

Before you run a scraper in production:

connect/read timeouts set
retries with exponential backoff
session + realistic headers
per-host rate limiting
cache or incremental crawl strategy
block page detection
proxy layer ready (if needed)
browser fallback plan (if JS-heavy)
alerts on sudden failure spikes

Bottom line

You avoid blocks by being boring:

steady rate
consistent requests
fewer unnecessary fetches

And when you need more reliability at scale:

add a proxy layer like ProxiesAPI
keep parsing deterministic
monitor and adapt

