How to Scrape Data Without Getting Blocked (A Practical Playbook)
Getting blocked is the default outcome of “naive scraping.”
Most scrapers fail because they:
- send too many requests too quickly
- look like a bot (fingerprints, missing headers, no cookies)
- retry aggressively (turning transient failures into permanent blocks)
- hammer the same IP until it’s burned
This guide is a practical playbook you can apply to almost any target site.
We’ll cover:
- how sites detect scrapers
- the highest-leverage fixes (rate, sessions, headers)
- retries and backoff that don’t self-sabotage
- proxy rotation and when it helps
- browser automation (Playwright) as a fallback
Blocks are usually a systems problem, not a single bad request. ProxiesAPI helps by giving you a stable proxy layer (rotation + reputation) so your retry/backoff strategy actually works in production.
1) Understand the detection layers
Most blocking is a combination of these layers:
- Network reputation: IP address history, ASN, geo, datacenter vs residential
- Request fingerprint: headers, TLS fingerprint, HTTP/2 behavior
- Behavior: request rate, burst patterns, navigation flow
- State: cookies, sessions, CSRF tokens
- Rendering: JS challenges, bot detection scripts
The key idea: you don’t “fix blocks” with one trick. You build a system that looks less suspicious and that degrades gracefully when it gets challenged.
2) The biggest win: slow down (with jitter)
If you take only one action, do this.
Bad pattern
- 50 parallel requests
- no delay
- retry instantly
Better pattern
- 2–10 concurrency (depending on site)
- random delay (jitter)
- exponential backoff
Python example:
import random
import time

def jitter_sleep(min_s=0.4, max_s=1.4):
    # Sleep for a random interval so requests are not perfectly periodic.
    time.sleep(random.uniform(min_s, max_s))
Even tiny jitter breaks the “perfectly periodic bot” pattern.
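The "better pattern" above (limited concurrency plus jitter) can be sketched with a small thread pool. This is a minimal sketch under assumptions, not a definitive implementation: `polite_call`, `fetch_all`, and the `fetch` callable you pass in are hypothetical names, and the worker count and delay range are placeholder values to tune per site.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def polite_call(fetch: Callable[[str], str], url: str,
                min_s: float = 0.4, max_s: float = 1.4) -> str:
    """Sleep a random interval, then call the supplied fetch function."""
    time.sleep(random.uniform(min_s, max_s))
    return fetch(url)


def fetch_all(fetch: Callable[[str], str], urls: Iterable[str],
              max_workers: int = 4,
              min_s: float = 0.4, max_s: float = 1.4) -> List[str]:
    """Fetch urls with at most max_workers requests in flight.

    Results are returned in the same order as the input urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: polite_call(fetch, u, min_s, max_s), urls))
```

Because `fetch` is injected, the same wrapper works with a `requests.Session` call in production and with a stub function while you test.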
3) Use sessions (cookies matter)
A lot of sites expect continuity.
Use requests.Session() so cookies persist:
import requests
session = requests.Session()
r = session.get("https://example.com")
# subsequent requests reuse cookies
r2 = session.get("https://example.com/page")
If the site sets anti-bot cookies, a stateless scraper will look abnormal.
4) Use real timeouts (don’t hang)
Hanging connections often cause:
- queues backing up
- retries stacking
- bursts when they recover
Use timeouts:
TIMEOUT = (10, 30) # connect, read
r = session.get(url, timeout=TIMEOUT)
5) Retries: fewer, smarter, slower
Naive retries are how you turn a temporary 503 into an IP ban.
A sane retry policy:
- only retry on transient errors (429/5xx/timeouts)
- exponential backoff
- cap retries (e.g. 3–5)
- if you see captcha/interstitial, stop and cool down
Example using tenacity:
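from tenacity import retry, stop_after_attempt, wait_exponential

# At most 5 attempts, with an exponentially growing wait clamped between 1s and 30s.
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=30))
def fetch(url):
    r = session.get(url, timeout=(10, 30))
    # Raise (and therefore retry) only on transient statuses.
    # Timeouts and connection errors raise on their own and are retried too.
    if r.status_code in (429, 500, 502, 503, 504):
        r.raise_for_status()
    return r.text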
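To make the policy itself visible without a library, here is a minimal sketch of the same idea in plain Python, using exponential backoff with full jitter. The names `backoff_delay`, `fetch_with_retries`, and `do_request` are hypothetical, and `do_request` is assumed to return a `(status_code, body)` tuple. Exception handling for timeouts is left out for brevity.

```python
import random
import time
from typing import Callable, Tuple

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def fetch_with_retries(do_request: Callable[[], Tuple[int, str]],
                       max_attempts: int = 4,
                       sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call do_request until it returns a non-transient status or attempts run out."""
    status = None
    for attempt in range(max_attempts):
        status, body = do_request()
        if status not in TRANSIENT_STATUSES:
            # Success or a non-retryable status: let the caller decide what to do.
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")
```

Passing `sleep` as a parameter is a design choice that keeps the retry loop testable: a test can record the delays instead of actually waiting.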
6) Headers: don’t cosplay, just be normal
You don’t need a 200-line header set.
But you should have:
- a modern User-Agent
- Accept
- Accept-Language
Example:
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
r = session.get(url, headers=HEADERS, timeout=(10, 30))
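Avoid claiming to be Safari on iOS if you’re not: a User-Agent that contradicts the rest of your request fingerprint (TLS, HTTP/2 behavior) is itself a signal.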
7) Respect 429s (and Retry-After)
If the site says “slow down”, slow down.
If you ignore 429s, you’re training the system to escalate.
Pattern:
- on 429, check Retry-After
- sleep
- reduce concurrency
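The Retry-After header carries either a number of seconds or an HTTP-date, so reading it takes a little care. Below is a minimal sketch of a parser. `retry_after_seconds` is a hypothetical helper, and the `default` and `cap` values are assumptions to tune for your target.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], default: float = 5.0,
                        cap: float = 300.0) -> float:
    """Turn a Retry-After header value into a number of seconds to sleep."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds.
        return min(float(value), cap)
    try:
        # Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    delta = (when - datetime.now(timezone.utc)).total_seconds()
    # Dates in the past mean "no need to wait"; never wait longer than cap.
    return min(max(delta, 0.0), cap)
```

With requests, a typical call would be `time.sleep(retry_after_seconds(r.headers.get("Retry-After")))` before the next attempt.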
8) Proxies: what they do (and don’t) fix
Proxies help when the limiting factor is IP reputation or IP-based rate limits.
They do not fix:
- broken parsing
- unrealistic behavior
- JS challenges that require a real browser
Where ProxiesAPI fits
ProxiesAPI is most useful when:
- you’re scraping many URLs
- you need consistent throughput
- you want automatic rotation without managing proxy pools
Minimal requests integration:
PROXIES = {
    "http": "http://YOUR_PROXIESAPI_PROXY",
    "https": "http://YOUR_PROXIESAPI_PROXY",
}
r = session.get(url, proxies=PROXIES, timeout=(10, 30))
Operationally, proxies work best when you also:
- slow down
- keep sessions stable (sticky sessions where needed)
- back off on challenges
9) Use a browser only when you must
If content is JS-rendered, plain HTML scraping often returns an empty shell: the markup arrives, but the data you want is not in it.
Use Playwright to fetch a rendered snapshot:
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so JS-rendered content is present.
        page.goto(url, wait_until="networkidle", timeout=60000)
        html = page.content()
        browser.close()
        return html
Browsers are heavier and often more detectable. Use them as a fallback.
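One way to keep the browser as a true fallback is to try the cheap HTTP fetch first and render only when the content you expect is missing. This is a minimal sketch. `fetch_with_fallback`, `fetch_plain`, and `expected_marker` are hypothetical names, and the marker is assumed to be a string (for example a CSS class) that appears only in fully populated pages.

```python
from typing import Callable


def fetch_with_fallback(url: str,
                        fetch_plain: Callable[[str], str],
                        fetch_rendered: Callable[[str], str],
                        expected_marker: str) -> str:
    """Try the plain HTTP fetch first; render only if the marker is missing."""
    html = fetch_plain(url)
    if expected_marker in html:
        return html
    # The plain response lacks the content we need, so pay for a browser render.
    return fetch_rendered(url)
```

Here `fetch_rendered` can be the Playwright function shown above, and `fetch_plain` any function that returns the response body from your session.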
10) Cache and dedupe (stop re-scraping)
Caching is an anti-block technique.
If you fetch the same URL 10 times during development, you look suspicious and you waste bandwidth.
Start with something simple:
- save HTML to disk
- reuse it for parsing iterations
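The two steps above can be sketched as a small disk cache keyed by a hash of the URL. This is a minimal sketch under assumptions: `cache_path` and `cached_fetch` are hypothetical helpers, `.html_cache` is an arbitrary directory name, and there is no expiry, which suits development iterations better than long-running production jobs.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name using its SHA-256 digest."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def cached_fetch(url: str, fetch: Callable[[str], str],
                 cache_dir: Path = Path(".html_cache")) -> str:
    """Return cached HTML if present; otherwise call fetch and save the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

While you iterate on a parser, repeated calls for the same URL read from disk and never touch the target site.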
11) Monitor block signals and fail safely
You should detect these signals:
- captcha keywords ("unusual traffic", "verify you are human")
- redirect loops
- HTML unexpectedly tiny
- 403/429 spikes
When detected:
- stop
- cool down
- rotate IP/session
- lower rate
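Spikes are easier to act on when you track them over a sliding window rather than reacting to a single response. Below is a minimal sketch. `BlockMonitor` is a hypothetical class, and the window size and ratio are assumed starting values, not recommendations for any particular site.

```python
from collections import deque


class BlockMonitor:
    """Track recent responses and report when block signals spike."""

    def __init__(self, window: int = 20, max_block_ratio: float = 0.3):
        self.recent = deque(maxlen=window)  # each entry: True if the response looked blocked
        self.max_block_ratio = max_block_ratio

    def record(self, status_code: int, html: str = "") -> None:
        """Record one response, flagging 403/429 and captcha-style pages."""
        text = html.lower()
        blocked = (
            status_code in (403, 429)
            or "unusual traffic" in text
            or "verify you are human" in text
        )
        self.recent.append(blocked)

    def should_cool_down(self) -> bool:
        """True once the window is full and too many entries look blocked."""
        if len(self.recent) < self.recent.maxlen:
            return False
        return sum(self.recent) / len(self.recent) > self.max_block_ratio
```

A scraper loop would call `record()` after every response and, when `should_cool_down()` returns true, stop, sleep, and resume at a lower rate with a fresh IP or session.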
A minimal “anti-block” template (Python)
import random
import time
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

TIMEOUT = (10, 30)  # (connect, read) in seconds
MIN_HTML_LEN = 5000  # shorter pages are treated as suspicious; tune per site
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
session = requests.Session()

def jitter(min_s=0.5, max_s=1.5):
    time.sleep(random.uniform(min_s, max_s))

def looks_blocked(html: str) -> bool:
    # Normalize first so a None or empty body does not crash the length check.
    t = (html or "").lower()
    return any(x in t for x in ["captcha", "unusual traffic", "verify you are human"]) or len(t) < MIN_HTML_LEN

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=30))
def get(url: str, proxies=None) -> str:
    r = session.get(url, headers=HEADERS, timeout=TIMEOUT, proxies=proxies)
    # Retry only on transient statuses.
    if r.status_code in (429, 500, 502, 503, 504):
        r.raise_for_status()
    html = r.text
    if looks_blocked(html):
        raise RuntimeError("Blocked or challenged")
    return html

def main(urls: list[str]):
    for u in urls:
        html = get(u)
        print("ok", u, len(html))
        jitter()

if __name__ == "__main__":
    # Replace with your target URLs. Very small pages (example.com is one)
    # fall under MIN_HTML_LEN and will be reported as blocked.
    main(["https://example.com"])
Final checklist (pin this)
- timeouts on every request
- session cookies enabled
- concurrency limited
- jitter between requests
- exponential backoff retries
- cache + dedupe
- proxy rotation (ProxiesAPI) when scaling
- browser fallback only when needed
If you implement just those, you’ll stop getting blocked “mysteriously” and start running scrapers like a real system.