How to Scrape Data Without Getting Blocked: A Practical Playbook

If you scrape long enough, you will get blocked.

The real skill isn’t “never get blocked.” It’s:

  • minimize block rate
  • detect blocks early
  • recover gracefully
  • keep your data pipeline moving

This guide is a practical playbook for how to scrape data without getting blocked in 2026.

We’ll cover:

  • the 80/20 fundamentals (pacing + retries)
  • request fingerprints (headers, TLS, browser vs HTTP)
  • cookies and sessions
  • block detection (429/403 + HTML signals)
  • when headless browsers help (and when they hurt)
  • where proxies help (and what they don’t solve)

Reduce block rates as you scale with ProxiesAPI

Blocks usually come from aggressive behavior, not a single bad header. ProxiesAPI helps by spreading traffic across IPs and stabilizing connectivity so your backoff + pacing strategy can work at scale.


1) Start with the #1 lever: pacing

Most scrapers get blocked because they’re too fast, too bursty, or too consistent.

What “pacing” actually means

  • limit requests per second
  • add jitter (randomness)
  • cap concurrency
  • back off when errors spike

Bad:

  • 50 concurrent requests
  • no delay
  • infinite retry loops

Good:

  • concurrency 2–5 for most sites
  • delay 0.5–2.0s between requests
  • exponential backoff on 429/503

Example (Python): jitter + backoff

import random
import time


def sleep_jitter(base: float = 0.8, jitter: float = 0.8):
    # pause for `base` seconds plus up to `jitter` seconds of randomness
    time.sleep(base + random.random() * jitter)


def backoff(attempt: int):
    # exponential backoff capped at 60s, plus jitter so retries don't align
    time.sleep(min(60, (2 ** attempt)) + random.random())
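
Putting pacing and backoff together, a retry loop might look like the sketch below. `fetch` is a stand-in for whatever function actually performs the request (e.g. a `requests.get` wrapper returning `(status, body)`), and `base_delay` is configurable so the backoff can be tuned; both names are illustrative, not from a library.

```python
import random
import time

# status codes worth retrying; anything else is returned immediately
TRANSIENT = {429, 500, 502, 503, 504}


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0):
    """Call fetch(url) -> (status, body), retrying transient errors
    with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in TRANSIENT:
            return status, body
        # exponential backoff capped at 60s, plus jitter
        time.sleep(min(60.0, base_delay * (2 ** attempt)) + random.random() * base_delay)
    return status, body
```

Note that non-transient errors (404, 403) return immediately: retrying those just burns requests.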

2) Use timeouts everywhere

Timeouts prevent your scraper from hanging and causing self-inflicted congestion.

Recommended defaults:

  • connect timeout: 5–15s
  • read timeout: 20–60s

Python requests:

import requests

TIMEOUT = (10, 30)  # (connect, read) seconds
r = requests.get(url, timeout=TIMEOUT)

Node axios:

await axios.get(url, { timeout: 30000 });

3) Don’t fight the site’s structure

If the site has:

  • predictable pagination
  • stable category pages
  • public JSON embedded in the HTML

…use that.

If you brute-force internal endpoints or click every UI element, you increase risk.

Practical rule:

  • prefer list pages + detail pages
  • avoid scraping logged-in flows unless you truly need it
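
When the data ships as JSON embedded in the HTML (the third case above), a regex plus `json.loads` is often all you need. A minimal sketch, assuming the page assigns the blob to a JS variable like `__INITIAL_STATE__` (the variable name varies per site, and the naive regex assumes no stray `};` inside the blob's strings):

```python
import json
import re


def extract_embedded_json(html: str, var_name: str = "__INITIAL_STATE__"):
    """Pull a JSON object assigned to a JS variable inside the page source."""
    m = re.search(re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;", html, re.S)
    return json.loads(m.group(1)) if m else None
```

This is dramatically cheaper than rendering the page in a browser just to read data the server already sent you.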

4) Headers: be realistic, not weird

A common anti-pattern is “header salad” (copy-pasting 30 headers from DevTools).

Most of the time you want:

  • User-Agent
  • Accept-Language
  • Accept

Example:

HEADERS = {
  "User-Agent": "Mozilla/5.0 ... Chrome/124.0.0.0 Safari/537.36",
  "Accept-Language": "en-US,en;q=0.9",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

Avoid:

  • fake mobile UAs while sending desktop headers
  • inconsistent locale/timezone patterns across requests

5) Cookies and sessions matter

Some sites tolerate anonymous traffic poorly.

If a site sets cookies and expects them back:

  • reuse a session
  • persist cookies between runs

Python:

import requests

s = requests.Session()
html = s.get(url, headers=HEADERS, timeout=(10, 30)).text
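
To persist cookies between runs, you can round-trip the session's cookie jar through a JSON file. A sketch using requests' own conversion helpers; `cookies.json` is an arbitrary path:

```python
import json
import pathlib

import requests

COOKIE_FILE = pathlib.Path("cookies.json")  # any writable path


def save_cookies(session: requests.Session) -> None:
    """Dump the session's cookies to disk as JSON."""
    jar = requests.utils.dict_from_cookiejar(session.cookies)
    COOKIE_FILE.write_text(json.dumps(jar))


def load_cookies(session: requests.Session) -> None:
    """Restore previously saved cookies into a fresh session."""
    if COOKIE_FILE.exists():
        jar = json.loads(COOKIE_FILE.read_text())
        session.cookies = requests.utils.cookiejar_from_dict(jar)
```

Call `save_cookies(s)` at the end of a run and `load_cookies(s)` at the start of the next one, and the site sees a returning visitor instead of a fresh anonymous client every time.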

Browser scrapers:

  • keep a single context
  • reuse it for multiple pages

6) Detect blocks early (don’t keep hammering)

You should treat block detection as a first-class part of your scraper.

Signal 1: Status codes

  • 429 = rate limit (slow down)
  • 403 = forbidden (could be IP, fingerprint, or behavior)
  • 503 = overloaded / bot defense

Signal 2: HTML content patterns

Common block phrases:

  • “unusual traffic”
  • “verify you are a human”
  • “captcha”
  • “access denied”

Example detector:

import re

BLOCK_PATTERNS = [
    re.compile(r"captcha", re.I),
    re.compile(r"unusual traffic", re.I),
    re.compile(r"verify you are (a )?human", re.I),
    re.compile(r"access denied", re.I),
]


def looks_blocked(html: str) -> bool:
    if not html:
        return False
    return any(p.search(html) for p in BLOCK_PATTERNS)

If you detect blocks:

  • stop the crawl
  • cool down (minutes, not seconds)
  • rotate IP (if available)
  • reduce concurrency
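
The two signals can be combined into a single decision function that your crawl loop checks after every response. A sketch (the detector is reproduced from above so the snippet runs standalone; the action names are arbitrary):

```python
import re

BLOCK_PATTERNS = [
    re.compile(r"captcha", re.I),
    re.compile(r"unusual traffic", re.I),
    re.compile(r"access denied", re.I),
]


def looks_blocked(html: str) -> bool:
    return bool(html) and any(p.search(html) for p in BLOCK_PATTERNS)


def classify_response(status: int, html: str) -> str:
    """Map a response to an action: 'ok', 'slow_down', or 'blocked'."""
    if status == 429:
        return "slow_down"   # rate limited: back off, keep crawling
    if status in (403, 503) or looks_blocked(html):
        return "blocked"     # stop, cool down, rotate IP if available
    return "ok"
```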

7) When to use a headless browser (Playwright)

Use Playwright when:

  • data is not in HTML
  • content requires JS hydration
  • pagination is behind “Load more”

Don’t use Playwright just because it feels more powerful.

Headless browsers:

  • are slower
  • cost more
  • are more fingerprintable

Practical hybrid approach:

  • use Playwright to get cookies / tokens
  • then do bulk crawling via HTTP where possible
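
One piece of glue the hybrid approach needs is converting the browser's cookies into a form requests understands. Playwright's `context.cookies()` returns a list of dicts with `name`/`value` keys (among others); a minimal sketch of the conversion:

```python
def playwright_cookies_to_requests(cookies: list) -> dict:
    """Flatten Playwright's cookie list into the simple mapping
    accepted by requests' `cookies=` argument."""
    return {c["name"]: c["value"] for c in cookies}
```

Then the bulk phase is just `requests.get(url, cookies=playwright_cookies_to_requests(ctx_cookies))`, which is far cheaper per page than keeping the browser alive.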

8) Proxies: what they help with (and what they don’t)

Proxies are not a cheat code.

They help with:

  • IP-based rate limits
  • geo restrictions (in some cases)
  • spreading load across IPs for large crawls

They do not automatically solve:

  • bot fingerprinting
  • bad pacing
  • broken selectors
  • login walls

Where ProxiesAPI fits

ProxiesAPI is most useful when you already have:

  • a polite crawler
  • stable parsing
  • good backoff

… and you want to scale volume without turning failures into a fire drill.

Python requests with a proxy

PROXY = "http://USER:PASS@PROXY_HOST:PORT"  # from ProxiesAPI

r = requests.get(
    url,
    headers=HEADERS,
    proxies={"http": PROXY, "https": PROXY},
    timeout=(10, 30),
)

Playwright with a proxy

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://PROXY_HOST:PORT",
            "username": "USER",
            "password": "PASS",
        },
    )

9) Use a “circuit breaker” in your crawler

A circuit breaker prevents runaway crawls when the site starts blocking.

Example policy:

  • if block rate > 20% over last 50 requests → stop and alert
  • if 5 blocks in a row → stop and alert

Pseudo:

if consecutive_blocks >= 5: stop
if blocks / recent_requests > 0.2: stop

This single feature saves you from burning hours and getting IPs flagged.
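
The pseudo-policy above can be packaged as a small class. A sketch; the 20% rate / 50-request window / 5-consecutive thresholds mirror the example policy, and `min_samples` (an added assumption) avoids tripping on a tiny sample:

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, window: int = 50, max_rate: float = 0.2,
                 max_consecutive: int = 5, min_samples: int = 20):
        self.recent = deque(maxlen=window)   # rolling window of block flags
        self.consecutive = 0
        self.max_rate = max_rate
        self.max_consecutive = max_consecutive
        self.min_samples = min_samples       # don't judge the rate too early

    def record(self, blocked: bool) -> None:
        self.recent.append(blocked)
        self.consecutive = self.consecutive + 1 if blocked else 0

    def tripped(self) -> bool:
        if self.consecutive >= self.max_consecutive:
            return True
        if len(self.recent) < self.min_samples:
            return False
        return sum(self.recent) / len(self.recent) > self.max_rate
```

Call `cb.record(is_blocked)` after every response and bail out of the crawl loop as soon as `cb.tripped()` returns True.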


10) Practical comparison table: anti-block levers

| Lever | Cost | Impact | Notes |
| --- | --- | --- | --- |
| Lower concurrency | Low | High | Biggest win for most scrapers |
| Add jitter | Low | High | Reduces “robotic” cadence |
| Add retries + backoff | Low | High | Only retry the right errors |
| Session cookies | Low | Medium | Helps on stateful sites |
| Rotate IPs (ProxiesAPI) | $$ | Medium–High | Useful once volume increases |
| Headless browser | $$ | Medium | Use only when needed |
| Fingerprint spoofing | $$$ | Variable | Often brittle |

A minimal “don’t get blocked” checklist

  • concurrency capped (2–5)
  • jittered delays
  • timeouts everywhere
  • retry only on transient errors
  • block detection + cool down
  • store failures for debugging
  • add proxies only when scaling volume

Closing

If you want to scrape data without getting blocked, don’t start by shopping for proxy providers.

Start by making your scraper boring:

  • predictable load
  • conservative concurrency
  • clean error handling

Once that’s solid, ProxiesAPI can help you scale without your failure rate exploding.


Related guides

Web Scraping with JavaScript and Node.js: A Full 2026 Tutorial
A practical Node.js guide (fetch/axios + Cheerio, plus Playwright when needed) with proxy + anti-block patterns.

Google Trends Scraping: API Options and DIY Methods (2026)
Compare official and unofficial ways to fetch Google Trends data, plus a DIY approach with throttling, retries, and proxy rotation for stability.

How to Scrape Google Search Results with Python (Without Getting Blocked)
A practical SERP scraping workflow in Python: handle consent/interstitials, parse organic results defensively, rotate IPs, backoff on blocks, and export clean results. Includes a ProxiesAPI-backed fetch layer.

Scrape Flight Prices from Google Flights (Python + ProxiesAPI)
Pull routes + dates, parse price cards reliably, and export a clean dataset with retries + proxy rotation.