How to Scrape Data Without Getting Blocked: A Practical Playbook
If you scrape long enough, you will get blocked.
The real skill isn’t “never get blocked.” It’s:
- minimize block rate
- detect blocks early
- recover gracefully
- keep your data pipeline moving
This guide is a practical playbook for how to scrape data without getting blocked in 2026.
We’ll cover:
- the 80/20 fundamentals (pacing + retries)
- request fingerprints (headers, TLS, browser vs HTTP)
- cookies and sessions
- block detection (429/403 + HTML signals)
- when headless browsers help (and when they hurt)
- where proxies help (and what they don’t solve)
Blocks usually come from aggressive behavior, not a single bad header. ProxiesAPI helps by spreading traffic across IPs and stabilizing connectivity so your backoff + pacing strategy can work at scale.
1) Start with the #1 lever: pacing
Most scrapers get blocked because they’re too fast, too bursty, or too consistent.
What “pacing” actually means
- limit requests per second
- add jitter (randomness)
- cap concurrency
- back off when errors spike
Bad:
- 50 concurrent requests
- no delay
- infinite retry loops
Good:
- concurrency 2–5 for most sites
- delay 0.5–2.0s between requests
- exponential backoff on 429/503
Example (Python): jitter + backoff
import random
import time
def sleep_jitter(base: float = 0.8, jitter: float = 0.8):
    # Base delay plus up to `jitter` seconds of randomness between requests
    time.sleep(base + random.random() * jitter)

def backoff(attempt: int):
    # Exponential backoff capped at 60s, with jitter so retries don't sync up
    time.sleep(min(60, 2 ** attempt) + random.random())
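These helpers slot naturally into a fetch loop. A minimal sketch of that loop, where `fetch` and `RateLimited` are hypothetical stand-ins for your own request call and whatever signal tells you the server rate-limited you:

```python
import random
import time

class RateLimited(Exception):
    """Raised by fetch() when the server answers 429/503 (hypothetical signal)."""

def fetch_with_backoff(fetch, max_attempts: int = 5):
    """Call fetch(); on a rate-limit signal, wait exponentially longer and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            # 1s, 2s, 4s ... capped at 60s, plus jitter so retries don't sync up
            time.sleep(min(60, 2 ** attempt) + random.random())
    raise RuntimeError("still rate-limited after %d attempts" % max_attempts)
```

The key property: the loop gives up after a bounded number of attempts instead of retrying forever, which is exactly the infinite-retry anti-pattern listed above.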
2) Use timeouts everywhere
Timeouts prevent your scraper from hanging and causing self-inflicted congestion.
Recommended defaults:
- connect timeout: 5–15s
- read timeout: 20–60s
Python requests:
import requests

TIMEOUT = (10, 30)  # (connect, read) seconds
r = requests.get(url, timeout=TIMEOUT)
Node axios:
await axios.get(url, { timeout: 30000 });
3) Don’t fight the site’s structure
If the site has:
- predictable pagination
- stable category pages
- public JSON embedded in the HTML
…use that.
If you brute-force internal endpoints or click every UI element, you increase risk.
Practical rule:
- prefer list pages + detail pages
- avoid scraping logged-in flows unless you truly need it
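"Public JSON embedded in the HTML" is worth a concrete example. Many sites ship their data as a script-tag assignment like `window.__INITIAL_STATE__ = {...};`, which is far more stable than scraping the rendered DOM. A rough sketch (the variable name is just an example, and the regex won't survive a literal `};` inside a JSON string):

```python
import json
import re

def extract_embedded_json(html: str, var_name: str = "__INITIAL_STATE__"):
    """Pull a JSON blob that the page embeds in a <script> assignment."""
    # Non-greedy match from the first `{` after `=` to the first `};`
    m = re.search(rf"{re.escape(var_name)}\s*=\s*(\{{.*?\}});", html, re.S)
    return json.loads(m.group(1)) if m else None
```

Grab the list page once, parse the blob, and you often get every field the UI shows without a single extra request.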
4) Headers: be realistic, not weird
A common anti-pattern is “header salad” (copy-pasting 30 headers from DevTools).
Most of the time you want:
- User-Agent
- Accept-Language
- Accept
Example:
HEADERS = {
    "User-Agent": "Mozilla/5.0 ... Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
Avoid:
- fake mobile UAs while sending desktop headers
- inconsistent locale/timezone patterns across requests
5) Cookies and sessions matter
Some sites tolerate anonymous traffic poorly.
If a site sets cookies and expects them back:
- reuse a session
- persist cookies between runs
Python:
import requests

s = requests.Session()
html = s.get(url, headers=HEADERS, timeout=(10, 30)).text
Browser scrapers:
- keep a single context
- reuse it for multiple pages
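"Persist cookies between runs" can be as simple as pickling the session's jar. A sketch (the filename is an arbitrary choice; this works with `requests` because its cookie jar is picklable and `cookies.update()` accepts one):

```python
import pickle
from pathlib import Path

COOKIE_FILE = Path("cookies.pkl")  # arbitrary location

def save_cookies(session) -> None:
    """Persist a requests.Session's cookie jar to disk at the end of a run."""
    COOKIE_FILE.write_bytes(pickle.dumps(session.cookies))

def load_cookies(session) -> None:
    """Restore previously saved cookies into a fresh session, if any exist."""
    if COOKIE_FILE.exists():
        session.cookies.update(pickle.loads(COOKIE_FILE.read_bytes()))
```

Call `load_cookies(s)` right after creating the session and `save_cookies(s)` before exiting, and your scraper looks like a returning visitor instead of a brand-new client every run.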
6) Detect blocks early (don’t keep hammering)
You should treat block detection as a first-class part of your scraper.
Signal 1: Status codes
- 429 = rate limit (slow down)
- 403 = forbidden (could be IP, fingerprint, or behavior)
- 503 = overloaded / bot defense
Signal 2: HTML content patterns
Common block phrases:
- “unusual traffic”
- “verify you are a human”
- “captcha”
- “access denied”
Example detector:
import re
BLOCK_PATTERNS = [
    re.compile(r"captcha", re.I),
    re.compile(r"unusual traffic", re.I),
    re.compile(r"verify you are human", re.I),
    re.compile(r"access denied", re.I),
]

def looks_blocked(html: str) -> bool:
    if not html:
        return False
    return any(p.search(html) for p in BLOCK_PATTERNS)
If you detect blocks:
- stop the crawl
- cool down (minutes, not seconds)
- rotate IP (if available)
- reduce concurrency
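The two signals combine into a single per-response decision. A sketch (the action labels and the 403 handling are illustrative choices; a minimal `looks_blocked` is repeated so the snippet stands alone):

```python
import re

BLOCK_PATTERNS = [re.compile(p, re.I) for p in
                  ("captcha", "unusual traffic", "verify you are human", "access denied")]

def looks_blocked(html: str) -> bool:
    return bool(html) and any(p.search(html) for p in BLOCK_PATTERNS)

def classify_response(status: int, html: str) -> str:
    """Map a response to a crawl action."""
    if status in (429, 503):
        return "slow_down"   # back off, then retry
    if status == 403 or looks_blocked(html):
        return "cool_down"   # stop for minutes, rotate IP if available
    return "ok"
```

Routing every response through one classifier keeps the "slow down" and "cool down" reactions in one place instead of scattered across the crawl loop.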
7) When to use a headless browser (Playwright)
Use Playwright when:
- data is not in HTML
- content requires JS hydration
- pagination is behind “Load more”
Don’t use Playwright just because it feels more powerful.
Headless browsers:
- are slower
- cost more
- are more fingerprintable
Practical hybrid approach:
- use Playwright to get cookies / tokens
- then do bulk crawling via HTTP where possible
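The handoff between the two is mostly cookie plumbing. Playwright's `context.cookies()` returns a list of dicts (`name`, `value`, `domain`, ...) which needs converting before a `requests.Session` can use it; a sketch, with the Playwright/requests wiring shown only as comments since it needs a live browser:

```python
def playwright_cookies_to_dict(cookies) -> dict:
    """Flatten Playwright's context.cookies() list into a name -> value dict
    that requests.Session.cookies.update() accepts."""
    return {c["name"]: c["value"] for c in cookies}

# Usage sketch (requires playwright + requests installed):
#   with sync_playwright() as p:
#       ctx = p.chromium.launch(headless=True).new_context()
#       page = ctx.new_page()
#       page.goto(start_url)          # JS runs, anti-bot cookies get set
#       jar = playwright_cookies_to_dict(ctx.cookies())
#   session = requests.Session()
#   session.cookies.update(jar)       # bulk crawl over plain HTTP from here
```

One browser session to bootstrap, then hundreds of cheap HTTP requests: that's the hybrid in practice.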
8) Proxies: what they help with (and what they don’t)
Proxies are not a cheat code.
They help with:
- IP-based rate limits
- geo restrictions (in some cases)
- spreading load across IPs for large crawls
They do not automatically solve:
- bot fingerprinting
- bad pacing
- broken selectors
- login walls
Where ProxiesAPI fits
ProxiesAPI is most useful when you already have:
- a polite crawler
- stable parsing
- good backoff
… and you want to scale volume without turning failures into a fire drill.
Python requests with a proxy
import requests

PROXY = "http://USER:PASS@PROXY_HOST:PORT"  # from ProxiesAPI
r = requests.get(
    url,
    headers=HEADERS,
    proxies={"http": PROXY, "https": PROXY},
    timeout=(10, 30),
)
Playwright with a proxy
browser = p.chromium.launch(
    headless=True,
    proxy={
        "server": "http://PROXY_HOST:PORT",
        "username": "USER",
        "password": "PASS",
    },
)
9) Use a “circuit breaker” in your crawler
A circuit breaker prevents runaway crawls when the site starts blocking.
Example policy:
- if block rate > 20% over last 50 requests → stop and alert
- if 5 blocks in a row → stop and alert
Pseudo:
if consecutive_blocks >= 5: stop
if blocks / recent_requests > 0.2: stop
This single feature saves you from burning hours and getting IPs flagged.
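A minimal sketch of that policy as a class; the window size and thresholds mirror the example numbers above, and the 20-sample minimum before judging the rate is an added assumption to avoid tripping on tiny samples:

```python
from collections import deque

class CircuitBreaker:
    """Trip when blocks cluster: N in a row, or too many in the recent window."""

    def __init__(self, window: int = 50, max_rate: float = 0.2, max_streak: int = 5):
        self.recent = deque(maxlen=window)  # rolling record of blocked/ok
        self.max_rate = max_rate
        self.max_streak = max_streak
        self.streak = 0

    def record(self, blocked: bool) -> None:
        self.recent.append(blocked)
        self.streak = self.streak + 1 if blocked else 0

    def tripped(self) -> bool:
        if self.streak >= self.max_streak:
            return True
        # Only judge the block rate once the sample is meaningful
        return len(self.recent) >= 20 and sum(self.recent) / len(self.recent) > self.max_rate
```

Call `record()` after every response and check `tripped()` before issuing the next request; when it fires, stop and alert rather than pressing on.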
10) Practical comparison table: anti-block levers
| Lever | Cost | Impact | Notes |
|---|---|---|---|
| Lower concurrency | Low | High | Biggest win for most scrapers |
| Add jitter | Low | High | Reduces “robotic” cadence |
| Add retries + backoff | Low | High | Only retry the right errors |
| Session cookies | Low | Medium | Helps on stateful sites |
| Rotate IPs (ProxiesAPI) | $$ | Medium–High | Useful once volume increases |
| Headless browser | $$ | Medium | Use only when needed |
| Fingerprint spoofing | $$$ | Variable | Often brittle |
A minimal “don’t get blocked” checklist
- concurrency capped (2–5)
- jittered delays
- timeouts everywhere
- retry only on transient errors
- block detection + cool down
- store failures for debugging
- add proxies only when scaling volume
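The "retry only on transient errors" item deserves a concrete rule. A sketch; the status sets are a common convention, not a standard:

```python
from typing import Optional

TRANSIENT = {429, 500, 502, 503, 504}  # server hiccups: worth retrying with backoff

def should_retry(status: Optional[int], exc: Optional[Exception] = None) -> bool:
    """Retry network failures and transient server errors only.

    4xx client errors (403, 404, ...) will fail identically on retry;
    403 in particular belongs in block handling, not a retry loop.
    """
    if exc is not None:  # timeout, connection reset, DNS failure
        return True
    return status in TRANSIENT
```

Gating every retry through one predicate like this is what keeps "add retries" from quietly becoming "hammer the site with requests that can never succeed."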
Closing
If you want to scrape data without getting blocked, don’t start by shopping for proxy providers.
Start by making your scraper boring:
- predictable load
- conservative concurrency
- clean error handling
Once that’s solid, ProxiesAPI can help you scale without your failure rate exploding.