How to Scrape Data Without Getting Blocked (Practical Playbook)
Getting blocked is the default state of web scraping at scale.
Not because you’re doing something “wrong” — but because modern sites actively protect:
- infrastructure cost (your bot traffic is expensive)
- user experience (bots can degrade performance)
- fraud surfaces (inventory hoarding, price scraping, credential stuffing)
- business models (data is valuable)
This guide is a practical playbook you can apply to most scraping systems.
It’s opinionated, boring, and effective.
Most anti-block wins come from good engineering (timeouts, pacing, retries). When you still need higher success rates at scale, ProxiesAPI gives you a managed proxy layer and more consistent runs.
First principles: why you get blocked
Most blocks happen for one of these reasons:
- Too many requests too quickly (429)
- Bad fingerprints (headers, TLS, inconsistent UA)
- Predictable patterns (no jitter, sequential IDs)
- IP reputation (datacenter IPs, burned ranges)
- JS challenges (bot-check pages that require browser execution)
- Behavior anomalies (never loading assets, no cookies)
Your job is to reduce “bot-like” signals and make your crawler behave like a careful, boring client.
The anti-block stack (in order of ROI)
1) Timeouts + retries (non-negotiable)
If you don’t have timeouts, you don’t have a scraper — you have a process that can hang forever.
Use:
- a connect timeout (e.g., 10s)
- a read timeout (e.g., 30–60s)
- exponential backoff with capped retries
import random

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

TIMEOUT = (10, 40)  # (connect, read) timeouts in seconds

s = requests.Session()
s.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]

def is_retryable(status: int) -> bool:
    # rate limits, transient server errors, and soft blocks worth retrying
    return status in (403, 408, 409, 425, 429, 500, 502, 503, 504)

@retry(wait=wait_exponential(multiplier=1, min=2, max=20), stop=stop_after_attempt(6))
def fetch(url: str, *, proxies: dict | None = None) -> str:
    s.headers["User-Agent"] = random.choice(USER_AGENTS)
    r = s.get(url, timeout=TIMEOUT, proxies=proxies)
    if is_retryable(r.status_code):
        # raising here makes tenacity back off and retry
        raise requests.HTTPError(f"HTTP {r.status_code} for {url}")
    r.raise_for_status()  # note: tenacity retries these too unless you filter exceptions
    return r.text
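As a usage sketch (the URL is a placeholder), the decorated call backs off exponentially between 2 and 20 seconds and gives up after six attempts in total:

html = fetch("https://example.com/products?page=1")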
2) Pacing + jitter (stop hammering)
Your traffic should be:
- consistent
- slow enough
- not perfectly periodic
import time
import random

BASE_SLEEP = 1.0  # base delay in seconds

for url in urls:
    html = fetch(url)
    # parse...
    time.sleep(BASE_SLEEP + random.random() * 0.8)  # jitter breaks perfect periodicity
For many sites, 1–3 seconds between requests per domain is a good starting point.
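Because that budget is per domain, a crawler that touches several domains should track the last hit per domain rather than sleeping globally. A minimal single-threaded sketch (MIN_GAP is an illustrative value; fetch() is the retrying helper from above):

import random
import time
from urllib.parse import urlparse

MIN_GAP = 1.5  # illustrative: minimum seconds between hits to one domain
_last_hit: dict[str, float] = {}

def polite_fetch(url: str) -> str:
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - _last_hit.get(domain, 0.0)
    gap = MIN_GAP + random.random() * 0.8  # jittered per-domain gap
    if elapsed < gap:
        time.sleep(gap - elapsed)
    _last_hit[domain] = time.monotonic()
    return fetch(url)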
3) Don’t scrape what you don’t need
Most scrapers waste requests.
High-leverage cuts:
- don’t refetch unchanged pages (cache; see the conditional-request sketch after this list)
- don’t follow links you can derive from IDs
- stop early when results are empty
- only collect fields you actually use
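The first cut is often free when the server sends cache validators. A hedged sketch using ETags (assumes the site returns them; the in-memory dict stands in for a real cache, and reuses s and TIMEOUT from above):

etag_cache: dict[str, tuple[str, str]] = {}  # url -> (etag, body)

def fetch_cached(url: str) -> str:
    headers = {}
    cached = etag_cache.get(url)
    if cached:
        headers["If-None-Match"] = cached[0]
    r = s.get(url, headers=headers, timeout=TIMEOUT)
    if r.status_code == 304 and cached:
        return cached[1]  # unchanged: reuse the stored body, no re-download
    r.raise_for_status()
    etag = r.headers.get("ETag")
    if etag:
        etag_cache[url] = (etag, r.text)
    return r.text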
Header hygiene (small changes, big wins)
Common mistakes:
- missing Accept-Language
- weird UAs (or always the same UA)
- no Referer on internal navigation
A realistic baseline:
s.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
})
If you’re crawling a site deeply, also set a referer when moving from list → detail:
r = s.get(detail_url, headers={"Referer": list_url}, timeout=TIMEOUT)
Handle 429 correctly (respect Retry-After)
Many teams treat 429 as “retry faster”. That’s backwards.
If you see Retry-After, honor it.
import time

r = s.get(url, timeout=TIMEOUT)
if r.status_code == 429:
    ra = r.headers.get("Retry-After")
    # Retry-After may be seconds or an HTTP-date; fall back to a conservative default
    sleep_s = int(ra) if ra and ra.isdigit() else 30
    time.sleep(sleep_s)
    # then retry
Proxies: when you need them (and when you don’t)
Proxies help when:
- your IP reputation is the limiting factor
- you need geographic distribution
- you’re doing high-volume requests
Proxies do not fix:
- broken parsing
- scraping faster than the site can tolerate
- JS challenges that require a browser
Using ProxiesAPI with requests
PROXIES = {
    "http": "http://YOUR_PROXIESAPI_PROXY",
    "https": "http://YOUR_PROXIESAPI_PROXY",
}
html = fetch("https://example.com", proxies=PROXIES)
Keep concurrency sane. Rotation is not a license to DDoS.
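What "sane" looks like depends on the site, but the mechanism is simple: a hard cap on in-flight requests. A sketch with a small thread pool (MAX_WORKERS = 4 is illustrative, not a recommendation; note that the shared Session inside fetch() is not strictly thread-safe, so a real crawler should give each worker its own Session):

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # illustrative cap; lower it if 429s appear

def fetch_all(urls: list[str]) -> list[str]:
    # the pool size is the concurrency ceiling, no matter how many URLs queue up
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda u: fetch(u, proxies=PROXIES), urls))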
Browser fallback (when HTML isn’t enough)
If the raw HTML is empty (or a placeholder), you need a browser-based fetch.
Playwright makes this straightforward:
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        # wait until the network is quiet so JS-rendered content is present
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html
A hybrid strategy is often best:
- use requests for list pages
- use Playwright for a small fraction of "hard" detail pages, routed as in the sketch below
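A hedged sketch of that routing (the placeholder heuristic is an assumption; tune it to what the site's empty shells actually look like):

def smart_fetch(url: str) -> str:
    html = fetch(url)
    # heuristic: a tiny document or a bare JS mount point means the real
    # content is rendered client-side, so fall back to the browser
    if len(html) < 2048 or '<div id="root"></div>' in html:
        html = fetch_rendered(url)
    return html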
Monitoring: blocks should be visible
If you can’t see blocks, you can’t fix them.
Track:
- success rate per domain
- status code distribution
- mean/p95 latency
- retries per request
A simple log line per request is enough to start.
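A hedged sketch of such a log line (field names are illustrative):

import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def logged_fetch(url: str) -> str:
    start = time.monotonic()
    status = "ok"
    try:
        return fetch(url)
    except Exception as exc:
        status = f"error:{type(exc).__name__}"
        raise
    finally:
        # one line per request is enough to chart success rate and latency per domain
        logging.info("fetch url=%s status=%s latency_ms=%.0f",
                     url, status, (time.monotonic() - start) * 1000)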
Practical troubleshooting checklist
When a site starts blocking you:
- Slow down (halve your rate)
- Add jitter
- Improve headers + UA rotation
- Add retries for 403/429/5xx
- Cache aggressively
- Add proxy layer (ProxiesAPI)
- If HTML is useless: browser fallback
Common anti-patterns (avoid these)
- “Let’s use 500 threads”
- no timeouts
- retry loops without caps
- scraping every page every day even if unchanged
- parsing with brittle deep selectors without tests
Final word
The fastest path to not getting blocked is not “secret tricks”.
It’s:
- good engineering fundamentals
- respectful traffic patterns
- observability
- and, when necessary, a managed proxy layer like ProxiesAPI to stabilize the network.