How to Scrape Data Without Getting Blocked (A Practical Playbook)

May 06, 2026 · guide · #web-scraping, #anti-bot, #rate-limiting, #retries, #proxies, #playwright, #python

Getting blocked is the default outcome of “naive scraping.”

Most scrapers fail because they:

send too many requests too quickly
look like a bot (fingerprints, missing headers, no cookies)
retry aggressively (turning transient failures into permanent blocks)
hammer the same IP until it’s burned

This guide is a practical playbook you can apply to almost any target site.

We’ll cover:

how sites detect scrapers
the highest-leverage fixes (rate, sessions, headers)
retries and backoff that don’t self-sabotage
proxy rotation and when it helps
browser automation (Playwright) as a fallback

Reduce blocks at scale with ProxiesAPI

Blocks are usually a systems problem, not a single bad request. ProxiesAPI helps by giving you a stable proxy layer (rotation + reputation) so your retry/backoff strategy actually works in production.


1) Understand the detection layers

Most blocking is a combination of these layers:

Network reputation: IP address history, ASN, geo, datacenter vs residential
Request fingerprint: headers, TLS fingerprint, HTTP/2 behavior
Behavior: request rate, burst patterns, navigation flow
State: cookies, sessions, CSRF tokens
Rendering: JS challenges, bot detection scripts

The key idea: you don’t “fix blocks” with one trick. You build a system that looks less suspicious and that degrades gracefully when it gets challenged.

2) The biggest win: slow down (with jitter)

If you take only one action, do this.

Bad pattern

50 parallel requests
no delay
retry instantly

Better pattern

2–10 concurrency (depending on site)
random delay (jitter)
exponential backoff

Python example:

import random
import time


def jitter_sleep(min_s=0.4, max_s=1.4):
    time.sleep(random.uniform(min_s, max_s))

Even tiny jitter breaks the “perfectly periodic bot” pattern.
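The "better pattern" above can be sketched with a small thread pool. This is a minimal illustration under stated assumptions, not a definitive implementation: `polite_map`, its parameters, and the `fetch` callable you pass in are hypothetical names introduced here, and the right concurrency and delay values depend on the site.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def polite_map(fetch, urls, max_workers=4, min_s=0.4, max_s=1.4, sleep=time.sleep):
    """Run fetch(url) for each URL with bounded concurrency and per-request jitter.

    `fetch` is any callable you supply (hypothetical here). `sleep` is injectable
    so the pacing can be exercised without real waiting.
    """
    def worker(url):
        # Pause a random amount before each request so requests never fire
        # in a perfectly regular rhythm.
        sleep(random.uniform(min_s, max_s))
        return url, fetch(url)

    # max_workers caps how many requests can be in flight at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, urls))
```

Start at the low end of the concurrency range and raise `max_workers` only while block signals stay quiet.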

3) Use sessions (cookies matter)

A lot of sites expect continuity.

Use requests.Session() so cookies persist:

import requests

session = requests.Session()

r = session.get("https://example.com")
# subsequent requests reuse cookies
r2 = session.get("https://example.com/page")

If the site sets anti-bot cookies, a stateless scraper will look abnormal.

4) Use real timeouts (don’t hang)

Hanging connections often cause:

queues backing up
retries stacking
bursts when they recover

Use timeouts:

TIMEOUT = (10, 30)  # connect, read
r = session.get(url, timeout=TIMEOUT)

5) Retries: fewer, smarter, slower

Naive retries are how you turn a temporary 503 into an IP ban.

A sane retry policy:

only retry on transient errors (429/5xx/timeouts)
exponential backoff
cap retries (e.g. 3–5)
if you see captcha/interstitial, stop and cool down

Example using tenacity:

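import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry only network errors and the transient statuses raised below,
# with exponential backoff capped at 30 seconds and at most 5 attempts.
@retry(
    retry=retry_if_exception_type(requests.RequestException),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    reraise=True,
)
def fetch(url):
    r = session.get(url, timeout=(10, 30))
    if r.status_code in (429, 500, 502, 503, 504):
        r.raise_for_status()  # raises requests.HTTPError, which triggers a retry
    return r.text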
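The retry policy also says to stop on a captcha or interstitial, which is a different outcome from "retry". One way to keep those outcomes separate is a small classifier that runs before any retry decision. This is a sketch under assumptions: `classify` and its return labels are hypothetical names, and the marker strings are only the examples this guide mentions, not a complete list.

```python
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}
CHALLENGE_MARKERS = ("captcha", "unusual traffic", "verify you are human")


def classify(status_code, body):
    """Return "stop", "retry", "fail", or "ok" for a response.

    "stop"  -> challenge page: cool down instead of retrying
    "retry" -> transient status: retry with backoff
    "fail"  -> other 4xx/5xx: retrying will not help
    "ok"    -> usable response
    """
    text = (body or "").lower()
    # Check for challenges first: they can arrive with any status code.
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "stop"
    if status_code in TRANSIENT_STATUSES:
        return "retry"
    if status_code >= 400:
        return "fail"
    return "ok"
```

Only the "retry" outcome should feed the backoff loop. "stop" should pause the whole crawl, not just the one URL.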

6) Headers: don’t cosplay, just be normal

You don’t need a 200-line header set.

But you should have:

a modern User-Agent
Accept
Accept-Language

Example:

HEADERS = {
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.9",
}

r = session.get(url, headers=HEADERS, timeout=(10, 30))

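Avoid claiming to be Safari on iOS if you’re not: a User-Agent that contradicts the rest of your fingerprint (TLS, HTTP/2 behavior) is itself a signal.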

7) Respect 429s (and Retry-After)

If the site says “slow down”, slow down.

If you ignore 429s, you’re training the system to escalate.

Pattern:

on 429, check Retry-After
sleep
reduce concurrency
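The first two steps of that pattern can be sketched as a helper that turns a Retry-After header into a sleep duration. The header may carry either a number of seconds or an HTTP date, so both are handled. The function name, the `default` and `cap` values, and the `now` parameter are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=30.0, cap=300.0, now=None):
    """Convert a Retry-After header value into seconds to sleep.

    Falls back to `default` when the header is missing or unparseable,
    and never returns more than `cap`.
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in whole seconds, e.g. "120".
        delay = float(value)
    else:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    return max(0.0, min(delay, cap))
```

With requests, the input would typically be `r.headers.get("Retry-After")`. After sleeping, lowering concurrency (for example, halving the worker count) covers the third step.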

8) Proxies: what they do (and don’t) fix

Proxies help when the limiting factor is IP reputation or IP-based rate limits.

They do not fix:

broken parsing
unrealistic behavior
JS challenges that require a real browser

Where ProxiesAPI fits

ProxiesAPI is most useful when:

you’re scraping many URLs
you need consistent throughput
you want automatic rotation without managing proxy pools

Minimal requests integration:

PROXIES = {
  "http": "http://YOUR_PROXIESAPI_PROXY",
  "https": "http://YOUR_PROXIESAPI_PROXY",
}

r = session.get(url, proxies=PROXIES, timeout=(10, 30))

Operationally, proxies work best when you also:

slow down
keep sessions stable (sticky sessions where needed)
back off on challenges
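If you manage your own list of proxy endpoints rather than relying on a provider's rotation, sticky sessions can be approximated by mapping each logical session to the same proxy every time. This is a rough sketch: `sticky_proxy`, the session key, and the proxy URLs in the usage note are all hypothetical, and a managed proxy service may expose its own mechanism for this.

```python
import hashlib


def sticky_proxy(session_key, proxy_urls):
    """Pick the same proxy for the same session key on every call.

    Returns a dict in the shape requests expects for its `proxies` argument.
    """
    if not proxy_urls:
        raise ValueError("proxy_urls must not be empty")
    # Hash the key so the choice is stable across runs and processes
    # (unlike Python's built-in hash(), which is salted per process).
    digest = hashlib.sha256(session_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(proxy_urls)
    url = proxy_urls[index]
    return {"http": url, "https": url}
```

For example, `sticky_proxy("account-42", pool)` keeps one logged-in flow on one exit IP, while different keys spread across the pool.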

9) Use a browser only when you must

If content is JS-rendered, a plain HTTP fetch returns HTML without the data you need, often a near-empty shell.

Use Playwright to fetch a rendered snapshot:

from playwright.sync_api import sync_playwright


def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so JS-rendered content is present.
            page.goto(url, wait_until="networkidle", timeout=60000)
            return page.content()
        finally:
            # Close the browser even if navigation fails or times out.
            browser.close()

Browsers are heavier and often more detectable. Use them as a fallback.
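"Fallback" can be made explicit: try the cheap HTTP fetch first and launch the browser only when the HTML lacks the content you expect. The sketch below assumes you can name a `required_marker`, a string that appears only when the real data is present (for instance a CSS class on the results container). The function and both fetcher callables are hypothetical names.

```python
def fetch_with_fallback(url, fetch_http, fetch_rendered, required_marker):
    """Return (html, source), using the browser only when plain HTTP is not enough.

    fetch_http and fetch_rendered are callables taking a URL and returning HTML.
    """
    html = fetch_http(url)
    if required_marker in html:
        # The plain response already contains the data: skip the browser.
        return html, "http"
    # Marker missing: the page is probably rendered client-side.
    return fetch_rendered(url), "browser"
```

Logging the `source` value over a crawl shows how often the expensive path is actually needed.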

10) Cache and dedupe (stop re-scraping)

Caching is an anti-block technique.

If you fetch the same URL 10 times during development, you look suspicious and you waste bandwidth.

Start with something simple:

save HTML to disk
reuse it for parsing iterations
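Those two steps can be sketched as a small disk cache keyed by a hash of the URL. The function name, the `fetch` callable, and the `html_cache` directory are assumptions for illustration. There is no expiry logic here, which suits development iterations but not long-running production crawls.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="html_cache"):
    """Return HTML for url, calling fetch(url) only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length filename.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache / name
    if path.exists():
        # Cache hit: no network request at all.
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

While you iterate on a parser, every rerun after the first reads from disk, so the target site sees one request per URL instead of one per attempt.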

11) Monitor block signals and fail safely

You should detect these signals:

captcha keywords ("unusual traffic", "verify you are human")
redirect loops
HTML unexpectedly tiny
403/429 spikes

When detected:

stop
cool down
rotate IP/session
lower rate
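A spike is easier to detect over a window of recent responses than from any single one. The sketch below tracks the last N outcomes and reports when the blocked share crosses a threshold. `BlockMonitor`, its default values, and the marker list are illustrative assumptions and would need tuning per target.

```python
from collections import deque

BLOCK_MARKERS = ("captcha", "unusual traffic", "verify you are human")


class BlockMonitor:
    """Track recent responses and signal when too many look blocked."""

    def __init__(self, window=50, max_block_ratio=0.2, min_samples=10):
        self.recent = deque(maxlen=window)  # True = blocked, False = fine
        self.max_block_ratio = max_block_ratio
        self.min_samples = min_samples

    def record(self, status_code, html=""):
        """Record one response; return True if it looked blocked."""
        text = (html or "").lower()
        blocked = status_code in (403, 429) or any(m in text for m in BLOCK_MARKERS)
        self.recent.append(blocked)
        return blocked

    def should_pause(self):
        """True once the blocked share of the window exceeds the threshold."""
        if len(self.recent) < self.min_samples:
            return False  # not enough data to judge yet
        return sum(self.recent) / len(self.recent) > self.max_block_ratio
```

When `should_pause()` returns True, the crawl loop can stop, cool down, rotate IP or session, and resume at a lower rate, in that order.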

A minimal “anti-block” template (Python)

import random
import time
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

TIMEOUT = (10, 30)  # (connect, read) seconds
MIN_HTML_CHARS = 500  # tune per site: shorter pages are treated as suspicious
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()


class Blocked(Exception):
    """Raised when a response looks like a block page or challenge."""


def jitter(min_s=0.5, max_s=1.5):
    time.sleep(random.uniform(min_s, max_s))


def looks_blocked(html: str) -> bool:
    t = (html or "").lower()
    markers = ["captcha", "unusual traffic", "verify you are human"]
    return any(x in t for x in markers) or len(t) < MIN_HTML_CHARS


# Retry only transient failures (network errors, 429/5xx). Blocked is not
# retried: hammering a challenge page is how temporary blocks become bans.
@retry(
    retry=retry_if_exception_type(requests.RequestException),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    reraise=True,
)
def get(url: str, proxies=None) -> str:
    r = session.get(url, headers=HEADERS, timeout=TIMEOUT, proxies=proxies)
    if r.status_code in (429, 500, 502, 503, 504):
        r.raise_for_status()
    html = r.text
    if looks_blocked(html):
        raise Blocked(url)
    return html


def main(urls: list[str]):
    for u in urls:
        try:
            html = get(u)
        except Blocked:
            # Challenged: stop the run so you can cool down and lower the rate.
            print("blocked", u)
            break
        print("ok", u, len(html))
        jitter()


if __name__ == "__main__":
    main(["https://example.com"])

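Final checklist (pin this)

slow down: low concurrency, jittered delays
reuse a session so cookies persist
set connect and read timeouts
retry only transient errors, with capped exponential backoff
send a small, consistent set of browser-like headers
honor 429s and Retry-After
add proxies when IP reputation or IP-based limits are the bottleneck
use a browser only for JS-rendered pages
cache HTML during development
monitor block signals and stop when challenged

If you implement just those, you’ll stop getting blocked “mysteriously” and start running scrapers like a real system.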
