How to Scrape Data Without Getting Blocked (A Practical Playbook)

Getting blocked is the default outcome of “naive scraping.”

Most scrapers fail because they:

  • send too many requests too quickly
  • look like a bot (fingerprints, missing headers, no cookies)
  • retry aggressively (turning transient failures into permanent blocks)
  • hammer the same IP until it’s burned

This guide is a practical playbook you can apply to almost any target site.

We’ll cover:

  • how sites detect scrapers
  • the highest-leverage fixes (rate, sessions, headers)
  • retries and backoff that don’t self-sabotage
  • proxy rotation and when it helps
  • browser automation (Playwright) as a fallback

Reduce blocks at scale with ProxiesAPI

Blocks are usually a systems problem, not a single bad request. ProxiesAPI helps by giving you a stable proxy layer (rotation + reputation) so your retry/backoff strategy actually works in production.


1) Understand the detection layers

Most blocking is a combination of these layers:

  1. Network reputation: IP address history, ASN, geo, datacenter vs residential
  2. Request fingerprint: headers, TLS fingerprint, HTTP/2 behavior
  3. Behavior: request rate, burst patterns, navigation flow
  4. State: cookies, sessions, CSRF tokens
  5. Rendering: JS challenges, bot detection scripts

The key idea: you don’t “fix blocks” with one trick. You build a system that looks less suspicious and that degrades gracefully when it gets challenged.


2) The biggest win: slow down (with jitter)

If you take only one action, do this.

Bad pattern

  • 50 parallel requests
  • no delay
  • retry instantly

Better pattern

  • 2–10 concurrent requests (depending on the site)
  • random delay (jitter)
  • exponential backoff

Python example:

import random
import time


def jitter_sleep(min_s=0.4, max_s=1.4):
    time.sleep(random.uniform(min_s, max_s))

Even tiny jitter breaks the “perfectly periodic bot” pattern.
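
The concurrency limit and the jitter can be combined in one place. The sketch below is one way to do that with a small thread pool; polite_map and its parameters are hypothetical names for illustration, and fetch stands for whatever function you use to download a single URL.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def polite_map(fetch, urls, max_workers=4, min_s=0.4, max_s=1.4):
    """Run fetch(url) for every URL with a small worker pool and jittered pauses."""

    def task(url):
        # Each worker pauses a random amount before its request, so
        # requests do not arrive in a fixed rhythm.
        time.sleep(random.uniform(min_s, max_s))
        return fetch(url)

    # max_workers caps how many requests can be in flight at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(task, urls))
```

Start with a low max_workers and raise it only while the site keeps answering normally.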


3) Use sessions (cookies matter)

Many sites expect continuity between requests: cookies set on one response should come back on the next.

Use requests.Session() so cookies persist:

import requests

session = requests.Session()

r = session.get("https://example.com")
# subsequent requests reuse cookies
r2 = session.get("https://example.com/page")

If the site sets anti-bot cookies, a stateless scraper will look abnormal.


4) Use real timeouts (don’t hang)

Hanging connections often cause:

  • queues backing up
  • retries stacking
  • bursts when they recover

requests has no default timeout, so set one explicitly on every call:

TIMEOUT = (10, 30)  # connect, read
r = session.get(url, timeout=TIMEOUT)

5) Retries: fewer, smarter, slower

Naive retries are how you turn a temporary 503 into an IP ban.

A sane retry policy:

  • only retry on transient errors (429/5xx/timeouts)
  • exponential backoff
  • cap retries (e.g. 3–5)
  • if you see captcha/interstitial, stop and cool down

Example using tenacity:

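import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry only network errors and the transient statuses raised below, at most 5 attempts.
@retry(
    retry=retry_if_exception_type(requests.RequestException),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
)
def fetch(url):
    r = session.get(url, timeout=(10, 30))  # session from section 3
    if r.status_code in (429, 500, 502, 503, 504):
        r.raise_for_status()  # raises requests.HTTPError, which triggers a retry
    return r.text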
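
An exponential schedule without randomness is deterministic, so parallel workers that fail together also retry together. A common variation is to randomize each delay below the exponential ceiling ("full jitter"). The helper below is a hypothetical sketch of that idea in plain Python; backoff_delay, base, and cap are illustrative names, not part of tenacity.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (1-based), with full jitter."""
    # The ceiling doubles on each attempt (base, 2*base, 4*base, ...) up to cap.
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    # Picking uniformly below the ceiling keeps workers from retrying in lockstep.
    return random.uniform(0, ceiling)
```

Call it as time.sleep(backoff_delay(attempt)) between attempts in a hand-written retry loop.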

6) Headers: don’t cosplay, just be normal

You don’t need a 200-line header set.

But you should have:

  • a modern User-Agent
  • Accept
  • Accept-Language

Example:

HEADERS = {
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.9",
}

r = session.get(url, headers=HEADERS, timeout=(10, 30))

Avoid claiming to be Safari on iOS if you’re not: a User-Agent that contradicts the rest of your request fingerprint is itself a signal.


7) Respect 429s (and Retry-After)

If the site says “slow down”, slow down.

If you ignore 429s, you’re training the system to escalate.

Pattern:

  • on 429, check Retry-After
  • sleep
  • reduce concurrency
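
The Retry-After header comes in two forms: a number of seconds, or an HTTP date. The sketch below handles both. retry_after_seconds, default, and cap are hypothetical names, and the fallback values are assumptions to tune for your target.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=30.0, cap=300.0):
    """Turn a Retry-After header value into a number of seconds to wait."""
    if not header_value:
        return default  # header missing: fall back to a fixed cool-down
    value = header_value.strip()
    if value.isdigit():
        # Form 1: a delay in seconds, e.g. "120".
        delay = float(value)
    else:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default  # unparseable value: fall back as well
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        delay = (when - datetime.now(timezone.utc)).total_seconds()
    # Never return a negative wait, and never wait longer than cap.
    return max(0.0, min(delay, cap))
```

With requests, a caller would do something like time.sleep(retry_after_seconds(r.headers.get("Retry-After"))) when r.status_code is 429, then lower its worker count before resuming.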

8) Proxies: what they do (and don’t) fix

Proxies help when the limiting factor is IP reputation or IP-based rate limits.

They do not fix:

  • broken parsing
  • unrealistic behavior
  • JS challenges that require a real browser

Where ProxiesAPI fits

ProxiesAPI is most useful when:

  • you’re scraping many URLs
  • you need consistent throughput
  • you want automatic rotation without managing proxy pools

Minimal requests integration:

PROXIES = {
  "http": "http://YOUR_PROXIESAPI_PROXY",
  "https": "http://YOUR_PROXIESAPI_PROXY",
}

r = session.get(url, proxies=PROXIES, timeout=(10, 30))

Operationally, proxies work best when you also:

  • slow down
  • keep sessions stable (sticky sessions where needed)
  • back off on challenges

9) Use a browser only when you must

If content is rendered by JavaScript, a plain HTTP fetch returns only the page shell: the HTML arrives, but the data you want is missing.

Use Playwright to fetch a rendered snapshot:

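from playwright.sync_api import sync_playwright


def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so JS-inserted content is in the DOM.
            page.goto(url, wait_until="networkidle", timeout=60000)  # timeout in ms
            return page.content()
        finally:
            browser.close()  # always release the browser, even if navigation fails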

Browsers are heavier and often more detectable. Use them as a fallback.


10) Cache and dedupe (stop re-scraping)

Caching is an anti-block technique.

If you fetch the same URL 10 times during development, you look suspicious and you waste bandwidth.

Start with something simple:

  • save HTML to disk
  • reuse it for parsing iterations
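
A disk cache along these lines takes only a few lines. This is a minimal sketch: cached_fetch and the html_cache folder are hypothetical names, and fetch is whatever function you already use to download one URL.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="html_cache"):
    """Return HTML for url, calling fetch(url) only on a cache miss."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length filename.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = folder / name
    if path.exists():
        # Cache hit: no request leaves your machine.
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Because the cache key is the URL, the same helper also dedupes repeated URLs within a single run.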

11) Monitor block signals and fail safely

You should detect these signals:

  • captcha keywords ("unusual traffic", "verify you are human")
  • redirect loops
  • HTML unexpectedly tiny
  • 403/429 spikes

When detected:

  • stop
  • cool down
  • rotate IP/session
  • lower rate
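
One way to turn these signals into a decision is to track the share of blocked responses over a sliding window. The class below is a hypothetical sketch: BlockMonitor, its thresholds, and the keyword list are assumptions to adapt per site, and it covers only the status-code and keyword signals.

```python
from collections import deque


class BlockMonitor:
    """Track recent outcomes and report when too many look like blocks."""

    def __init__(self, window=50, max_block_rate=0.2, min_samples=10):
        # Only the most recent `window` outcomes are kept (True = blocked).
        self.outcomes = deque(maxlen=window)
        self.max_block_rate = max_block_rate
        self.min_samples = min_samples

    def record(self, status_code, html):
        """Store one response outcome and return whether it looked blocked."""
        text = (html or "").lower()
        blocked = (
            status_code in (403, 429)
            or "unusual traffic" in text
            or "verify you are human" in text
        )
        self.outcomes.append(blocked)
        return blocked

    def should_pause(self):
        """True when the recent block rate exceeds the configured threshold."""
        if len(self.outcomes) < self.min_samples:
            return False  # not enough data to judge yet
        return sum(self.outcomes) / len(self.outcomes) > self.max_block_rate
```

Call record() after every response and check should_pause() before scheduling the next request; when it returns True, stop, cool down, and resume at a lower rate.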

A minimal “anti-block” template (Python)

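import random
import time

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

TIMEOUT = (10, 30)  # connect, read (seconds)
MIN_HTML_LEN = 200  # tune per target: shorter responses are treated as soft blocks
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()


class BlockedError(RuntimeError):
    """Raised when a response looks like a captcha or interstitial."""


def jitter(min_s=0.5, max_s=1.5):
    time.sleep(random.uniform(min_s, max_s))


def looks_blocked(html: str) -> bool:
    t = (html or "").lower()
    return any(x in t for x in ["captcha", "unusual traffic", "verify you are human"]) or len(t) < MIN_HTML_LEN


# Retry transient failures only. BlockedError is not retried, so a challenge stops the run.
@retry(
    retry=retry_if_exception_type(requests.RequestException),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    reraise=True,
)
def get(url: str, proxies=None) -> str:
    r = session.get(url, headers=HEADERS, timeout=TIMEOUT, proxies=proxies)
    if r.status_code in (429, 500, 502, 503, 504):
        r.raise_for_status()  # raises requests.HTTPError, which triggers a retry
    html = r.text
    if looks_blocked(html):
        raise BlockedError(f"Blocked or challenged: {url}")
    return html


def main(urls: list[str]):
    for u in urls:
        try:
            html = get(u)
        except BlockedError as exc:
            # Stop instead of hammering a site that is already challenging us.
            print("blocked, stopping to cool down:", exc)
            break
        print("ok", u, len(html))
        jitter()


if __name__ == "__main__":
    main(["https://example.com"])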

Final checklist (pin this)

  • timeouts on every request
  • session cookies enabled
  • concurrency limited
  • jitter between requests
  • exponential backoff retries
  • cache + dedupe
  • proxy rotation (ProxiesAPI) when scaling
  • browser fallback only when needed

If you implement just those, you’ll stop getting blocked “mysteriously” and start running scrapers like a real system.
