How to Scrape Data Without Getting Blocked (Practical Playbook)
Getting blocked is the default state of web scraping at scale.
Not because you’re doing something “wrong” — but because modern sites actively protect:
- infrastructure cost (your bot traffic is expensive)
- user experience (bots can degrade performance)
- fraud surfaces (inventory hoarding, price scraping, credential stuffing)
- business models (data is valuable)
This guide is a practical playbook you can apply to most scraping systems.
It’s opinionated, boring, and effective.
Most anti-block wins come from good engineering (timeouts, pacing, retries). When you still need higher success rates at scale, ProxiesAPI gives you a managed proxy layer and more consistent runs.
First principles: why you get blocked
Most blocks happen for one of these reasons:
- Too many requests too quickly (429)
- Bad fingerprints (headers, TLS, inconsistent UA)
- Predictable patterns (no jitter, sequential IDs)
- IP reputation (datacenter IPs, burned ranges)
- JS challenges (bot-check pages that require browser execution)
- Behavior anomalies (never loading assets, no cookies)
Your job is to reduce “bot-like” signals and make your crawler behave like a careful, boring client.
The anti-block stack (in order of ROI)
1) Timeouts + retries (non-negotiable)
If you don’t have timeouts, you don’t have a scraper — you have a process that can hang forever.
Use:
- a connect timeout (e.g., 10s)
- a read timeout (e.g., 30–60s)
- exponential backoff with capped retries
import random

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

TIMEOUT = (10, 40)  # (connect, read) timeouts in seconds

s = requests.Session()
s.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]

def is_retryable(status: int) -> bool:
    # rate limits, transient server errors, and soft blocks worth retrying
    return status in (403, 408, 409, 425, 429, 500, 502, 503, 504)

@retry(wait=wait_exponential(multiplier=1, min=2, max=20), stop=stop_after_attempt(6))
def fetch(url: str, *, proxies: dict | None = None) -> str:
    s.headers["User-Agent"] = random.choice(USER_AGENTS)
    r = s.get(url, timeout=TIMEOUT, proxies=proxies)
    if is_retryable(r.status_code):
        # raising here makes tenacity back off and retry
        raise requests.HTTPError(f"HTTP {r.status_code} for {url}")
    r.raise_for_status()  # note: tenacity retries these too unless you filter exceptions
    return r.text
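As a usage sketch (the URL is a placeholder), the decorated call backs off exponentially between 2 and 20 seconds and gives up after six attempts in total:

html = fetch("https://example.com/products?page=1")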
2) Pacing + jitter (stop hammering)
Your traffic should be:
- consistent
- slow enough
- not perfectly periodic
import time
import random

BASE_SLEEP = 1.0  # base delay in seconds

for url in urls:
    html = fetch(url)
    # parse...
    time.sleep(BASE_SLEEP + random.random() * 0.8)  # jitter breaks perfect periodicity
For many sites, 1–3 seconds between requests per domain is a good starting point.
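Because that budget is per domain, a crawler that touches several domains should track the last hit per domain rather than sleeping globally. A minimal single-threaded sketch (MIN_GAP is an illustrative value; fetch() is the retrying helper from above):

import random
import time
from urllib.parse import urlparse

MIN_GAP = 1.5  # illustrative: minimum seconds between hits to one domain
_last_hit: dict[str, float] = {}

def polite_fetch(url: str) -> str:
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - _last_hit.get(domain, 0.0)
    gap = MIN_GAP + random.random() * 0.8  # jittered per-domain gap
    if elapsed < gap:
        time.sleep(gap - elapsed)
    _last_hit[domain] = time.monotonic()
    return fetch(url)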
3) Don’t scrape what you don’t need
Most scrapers waste requests.
High-leverage cuts:
- don’t refetch unchanged pages (cache; see the conditional-request sketch after this list)
- don’t follow links you can derive from IDs
- stop early when results are empty
- only collect fields you actually use
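The first cut is often free when the server sends cache validators. A hedged sketch using ETags (assumes the site returns them; the in-memory dict stands in for a real cache, and reuses s and TIMEOUT from above):

etag_cache: dict[str, tuple[str, str]] = {}  # url -> (etag, body)

def fetch_cached(url: str) -> str:
    headers = {}
    cached = etag_cache.get(url)
    if cached:
        headers["If-None-Match"] = cached[0]
    r = s.get(url, headers=headers, timeout=TIMEOUT)
    if r.status_code == 304 and cached:
        return cached[1]  # unchanged: reuse the stored body, no re-download
    r.raise_for_status()
    etag = r.headers.get("ETag")
    if etag:
        etag_cache[url] = (etag, r.text)
    return r.text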
Header hygiene (small changes, big wins)
Common mistakes:
- missing Accept-Language
- weird UAs (or always the same UA)
- no Referer on internal navigation
A realistic baseline:
s.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
})
If you’re crawling a site deeply, also set a referer when moving from list → detail:
r = s.get(detail_url, headers={"Referer": list_url}, timeout=TIMEOUT)
Handle 429 correctly (respect Retry-After)
Many teams treat 429 as “retry faster”. That’s backwards.
If you see Retry-After, honor it.
import time

r = s.get(url, timeout=TIMEOUT)
if r.status_code == 429:
    ra = r.headers.get("Retry-After")
    # Retry-After may be seconds or an HTTP-date; fall back to a conservative default
    sleep_s = int(ra) if ra and ra.isdigit() else 30
    time.sleep(sleep_s)
    # then retry
Proxies: when you need them (and when you don’t)
Proxies help when:
- your IP reputation is the limiting factor
- you need geographic distribution
- you’re doing high-volume requests
Proxies do not fix:
- broken parsing
- scraping faster than the site can tolerate
- JS challenges that require a browser
Using ProxiesAPI with requests
PROXIES = {
    "http": "http://YOUR_PROXIESAPI_PROXY",
    "https": "http://YOUR_PROXIESAPI_PROXY",
}
html = fetch("https://example.com", proxies=PROXIES)
Keep concurrency sane. Rotation is not a license to DDoS.
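What "sane" looks like depends on the site, but the mechanism is simple: a hard cap on in-flight requests. A sketch with a small thread pool (MAX_WORKERS = 4 is illustrative, not a recommendation; note that the shared Session inside fetch() is not strictly thread-safe, so a real crawler should give each worker its own Session):

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # illustrative cap; lower it if 429s appear

def fetch_all(urls: list[str]) -> list[str]:
    # the pool size is the concurrency ceiling, no matter how many URLs queue up
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda u: fetch(u, proxies=PROXIES), urls))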
Browser fallback (when HTML isn’t enough)
If the raw HTML is empty (or a placeholder), you need a browser-based fetch.
Playwright makes this straightforward:
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        # wait until the network is quiet so JS-rendered content is present
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html
A hybrid strategy is often best:
- use requests for list pages
- use Playwright for a small fraction of "hard" detail pages, routed as in the sketch below
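A hedged sketch of that routing (the placeholder heuristic is an assumption; tune it to what the site's empty shells actually look like):

def smart_fetch(url: str) -> str:
    html = fetch(url)
    # heuristic: a tiny document or a bare JS mount point means the real
    # content is rendered client-side, so fall back to the browser
    if len(html) < 2048 or '<div id="root"></div>' in html:
        html = fetch_rendered(url)
    return html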
Monitoring: blocks should be visible
If you can’t see blocks, you can’t fix them.
Track:
- success rate per domain
- status code distribution
- mean/p95 latency
- retries per request
A simple log line per request is enough to start.
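A hedged sketch of such a log line (field names are illustrative):

import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def logged_fetch(url: str) -> str:
    start = time.monotonic()
    status = "ok"
    try:
        return fetch(url)
    except Exception as exc:
        status = f"error:{type(exc).__name__}"
        raise
    finally:
        # one line per request is enough to chart success rate and latency per domain
        logging.info("fetch url=%s status=%s latency_ms=%.0f",
                     url, status, (time.monotonic() - start) * 1000)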
Practical troubleshooting checklist
When a site starts blocking you:
- Slow down (halve your rate)
- Add jitter
- Improve headers + UA rotation
- Add retries for 403/429/5xx
- Cache aggressively
- Add proxy layer (ProxiesAPI)
- If HTML is useless: browser fallback
Common anti-patterns (avoid these)
- “Let’s use 500 threads”
- no timeouts
- retry loops without caps
- scraping every page every day even if unchanged
- parsing with brittle deep selectors without tests
Final word
The fastest path to not getting blocked is not “secret tricks”.
It’s:
- good engineering fundamentals
- respectful traffic patterns
- observability
- and, when necessary, a managed proxy layer like ProxiesAPI to stabilize the network.