How to Scrape Data Without Getting Blocked (Practical Playbook)

Getting blocked is the default state of web scraping at scale.

Not because you’re doing something “wrong” — but because modern sites actively protect:

  • infrastructure cost (your bot traffic is expensive to serve)
  • user experience (bots can degrade performance)
  • fraud surfaces (inventory hoarding, price scraping, credential stuffing)
  • business models (data is valuable)

This guide is a practical playbook you can apply to most scraping systems.

It’s opinionated, boring, and effective.

When blocks become your bottleneck, add ProxiesAPI

Most anti-block wins come from good engineering (timeouts, pacing, retries). When you still need higher success rates at scale, ProxiesAPI gives you a managed proxy layer and more consistent runs.


First principles: why you get blocked

Most blocks happen for one of these reasons:

  1. Too many requests too quickly (429)
  2. Bad fingerprints (headers, TLS, inconsistent UA)
  3. Predictable patterns (no jitter, sequential IDs)
  4. IP reputation (datacenter IPs, burned ranges)
  5. JS challenges (bot pages that require browser execution)
  6. Behavior anomalies (never loading assets, no cookies)

Your job is to reduce “bot-like” signals and make your crawler behave like a careful, boring client.


The anti-block stack (in order of ROI)

1) Timeouts + retries (non-negotiable)

If you don’t have timeouts, you don’t have a scraper — you have a process that can hang forever.

Use:

  • a connect timeout (e.g., 10s)
  • a read timeout (e.g., 30–60s)
  • exponential backoff with capped retries

A minimal baseline that puts these together:

import random
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

TIMEOUT = (10, 40)  # (connect, read) timeouts in seconds

s = requests.Session()
s.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]


def is_retryable(status: int) -> bool:
    # Transient or rate-limit statuses; 403/429 often clear after backoff or an IP change.
    return status in (403, 408, 409, 425, 429, 500, 502, 503, 504)


@retry(wait=wait_exponential(multiplier=1, min=2, max=20), stop=stop_after_attempt(6))
def fetch(url: str, *, proxies: dict | None = None) -> str:
    s.headers["User-Agent"] = random.choice(USER_AGENTS)
    r = s.get(url, timeout=TIMEOUT, proxies=proxies)

    if is_retryable(r.status_code):
        raise requests.HTTPError(f"HTTP {r.status_code} for {url}")

    r.raise_for_status()
    return r.text

2) Pacing + jitter (stop hammering)

Your traffic should look like:

  • consistent
  • slow enough
  • not perfectly periodic

A simple per-request delay with jitter:

import time
import random

BASE_SLEEP = 1.0

for url in urls:
    html = fetch(url)
    # parse...
    time.sleep(BASE_SLEEP + random.random() * 0.8)

For many sites, 1–3 seconds between requests per domain is a good starting point.
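
If you crawl more than one domain from the same process, track the last request time per domain so the gap applies per site rather than globally. A minimal sketch; MIN_GAP and the polite_sleep helper are illustrative names, not from any library:

import random
import time
from urllib.parse import urlparse

MIN_GAP = 1.5  # seconds between requests to the same domain; tune per site
_last_hit: dict[str, float] = {}


def polite_sleep(url: str) -> None:
    # Sleep until at least MIN_GAP (plus jitter) has passed since the last hit to this domain.
    domain = urlparse(url).netloc
    now = time.monotonic()
    wait = _last_hit.get(domain, 0.0) + MIN_GAP + random.random() * 0.8 - now
    if wait > 0:
        time.sleep(wait)
    _last_hit[domain] = time.monotonic()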

3) Don’t scrape what you don’t need

Most scrapers waste requests.

High-leverage cuts:

  • don’t refetch unchanged pages (cache; see the conditional-request sketch after this list)
  • don’t follow links you can derive from IDs
  • stop early when results are empty
  • only collect fields you actually use
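
For the first cut, conditional requests are the cheapest cache: store the ETag from earlier responses and send it back as If-None-Match, and a 304 response means the page hasn’t changed and you can skip parsing. A sketch reusing the s session and TIMEOUT from above, with a plain dict standing in for a real cache:

etags: dict[str, str] = {}  # url -> last seen ETag; use a persistent store in practice


def fetch_if_changed(url: str) -> str | None:
    headers = {"If-None-Match": etags[url]} if url in etags else {}
    r = s.get(url, headers=headers, timeout=TIMEOUT)
    if r.status_code == 304:
        return None  # unchanged since last fetch
    r.raise_for_status()
    if "ETag" in r.headers:
        etags[url] = r.headers["ETag"]
    return r.text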

Header hygiene (small changes, big wins)

Common mistakes:

  • missing Accept-Language
  • weird UAs (or always the same UA)
  • no referer on internal navigation

A realistic baseline:

s.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
})

If you’re crawling a site deeply, also set a referer when moving from list → detail:

r = s.get(detail_url, headers={"Referer": list_url}, timeout=TIMEOUT)

Handle 429 correctly (respect Retry-After)

Many teams treat 429 as “retry faster”. That’s backwards.

If you see Retry-After, honor it.

import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

r = s.get(url, timeout=TIMEOUT)
if r.status_code == 429:
    ra = r.headers.get("Retry-After")
    if ra and ra.isdigit():
        sleep_s = int(ra)
    elif ra:
        # Retry-After can also be an HTTP-date
        sleep_s = max(0.0, (parsedate_to_datetime(ra) - datetime.now(timezone.utc)).total_seconds())
    else:
        sleep_s = 30
    time.sleep(sleep_s)
    # then retry

Proxies: when you need them (and when you don’t)

Proxies help when:

  • your IP reputation is the limiting factor
  • you need geographic distribution
  • you’re doing high-volume requests

Proxies do not fix:

  • broken parsing
  • scraping faster than the site can tolerate
  • JS challenges that require a browser

Using ProxiesAPI with requests

PROXIES = {
    "http": "http://YOUR_PROXIESAPI_PROXY",
    "https": "http://YOUR_PROXIESAPI_PROXY",
}

html = fetch("https://example.com", proxies=PROXIES)

Keep concurrency sane. Rotation is not a license to DDoS.
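
If you add concurrency on top of rotation, keep it bounded. A sketch using a small thread pool around the fetch helper from earlier; the worker count is an arbitrary starting point, and sharing one requests.Session across threads is a simplification:

from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4  # hard cap on in-flight requests; raise only while success rates stay healthy


def fetch_many(urls: list[str], proxies: dict | None = None) -> dict[str, str]:
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(fetch, url, proxies=proxies): url for url in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                print(f"failed {url}: {exc}")  # log and move on; backoff already happened inside fetch
    return results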


Browser fallback (when HTML isn’t enough)

If the raw HTML is empty (or a placeholder), you need a browser-based fetch.

Playwright makes this straightforward:

from playwright.sync_api import sync_playwright


def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

A hybrid strategy is often best (a small dispatcher sketch follows the list):

  • use requests for list pages
  • use Playwright for a small fraction of “hard” detail pages
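
A sketch of that dispatch; looks_like_placeholder is a hypothetical, site-specific check you would tune for the pages you actually scrape:

def looks_like_placeholder(html: str) -> bool:
    # Heuristic only: adjust the length threshold and marker text per site.
    return len(html) < 2000 or "enable JavaScript" in html


def fetch_smart(url: str) -> str:
    html = fetch(url)  # cheap path: plain HTTP
    if looks_like_placeholder(html):
        html = fetch_rendered(url)  # expensive path: headless browser
    return html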

Monitoring: blocks should be visible

If you can’t see blocks, you can’t fix them.

Track:

  • success rate per domain
  • status code distribution
  • mean and p95 latency
  • retries per request

A simple log line per request is enough to start.
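
A sketch of that log line, reusing the session and timeout from above; the field names and format are just one reasonable starting point:

import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("scraper")


def fetch_logged(url: str) -> str:
    start = time.monotonic()
    status = "ERR"
    try:
        r = s.get(url, timeout=TIMEOUT)
        status = r.status_code
        return r.text
    finally:
        elapsed = time.monotonic() - start
        log.info("fetch url=%s status=%s elapsed=%.2fs", url, status, elapsed)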


Practical troubleshooting checklist

When a site starts blocking you:

  1. Slow down (halve your rate)
  2. Add jitter
  3. Improve headers + UA rotation
  4. Add retries for 403/429/5xx
  5. Cache aggressively
  6. Add proxy layer (ProxiesAPI)
  7. If HTML is useless: browser fallback

Common anti-patterns (avoid these)

  • “Let’s use 500 threads”
  • no timeouts
  • retry loops without caps
  • scraping every page every day even if unchanged
  • parsing with brittle deep selectors without tests

Final word

The fastest path to not getting blocked is not “secret tricks”.

It’s:

  • good engineering fundamentals
  • respectful traffic patterns
  • observability
  • and, when necessary, a managed proxy layer like ProxiesAPI to stabilize the network.
