How to Scrape Data Without Getting Blocked: A Practical Playbook
If you’ve ever built a scraper that worked for 30 minutes and then started returning:
- 403 Forbidden
- 429 Too Many Requests
- weird HTML (captcha pages, “unusual traffic”)
- infinite redirects
…you’ve seen the real problem: scraping isn’t just parsing HTML. It’s traffic engineering.
This post is a practical playbook for scraping data without getting blocked.
We’ll cover:
- what “getting blocked” actually means (and how to detect it)
- the most common blocking signals
- the fixes that work (in order)
- when proxies are the simplest lever (and what ProxiesAPI does)
Most scrapers don’t die from parsing bugs — they die from throttling and IP blocks. ProxiesAPI gives you proxy rotation so your retry/backoff strategy actually has room to work as you scale.
Step 1: Know what “blocked” looks like (don’t guess)
Before you fight blocks, instrument your crawler. Log:
- URL
- status code
- final URL after redirects
- response length
- a short hash of the body
- a snippet of the `<title>` text
Quick triage table
| Symptom | Likely cause | What to log/check |
|---|---|---|
| 429 | rate limiting | Retry-After, request rate, concurrency |
| 403 | bot policy or WAF | HTML title, cookies, headers |
| 200 but wrong HTML | captcha/interstitial | title text, known phrases |
| 5xx spikes | server instability | retry with backoff, change schedule |
| content differs by IP | geo/routing | compare results from 2 IPs |
Minimal Python “block detector”
```python
import hashlib
import re

BLOCK_PHRASES = [
    "unusual traffic",
    "verify you are human",
    "captcha",
    "access denied",
]

def detect_block(status_code: int, html: str) -> dict:
    title = ""
    m = re.search(r"<title>(.*?)</title>", html, flags=re.I | re.S)
    if m:
        title = re.sub(r"\s+", " ", m.group(1)).strip().lower()
    body_lower = html[:20000].lower()
    blocked = status_code in (403, 429)
    if any(p in title for p in BLOCK_PHRASES):
        blocked = True
    if any(p in body_lower for p in BLOCK_PHRASES):
        blocked = True
    return {
        "blocked": blocked,
        "title": title,
        "len": len(html),
        "sha1": hashlib.sha1(html.encode("utf-8", errors="ignore")).hexdigest(),
    }
```
Step 2: Fix the easy stuff first (it’s usually your crawler)
2.1 Add timeouts
No timeouts = hung jobs, stacked retries, and a “thundering herd” of re-requests.
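A minimal sketch with `requests` (the timeout tuple splits the connect and read limits; the values here are reasonable defaults, not prescriptions):

```python
import requests

# Separate connect and read timeouts: a dead host fails fast at 5 s,
# and a stalled read can't hold a worker for longer than 15 s.
CONNECT_TIMEOUT = 5.0
READ_TIMEOUT = 15.0

def fetch(url: str):
    try:
        return requests.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
    except requests.Timeout:
        # Transient: return None and let the retry layer decide.
        return None
```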
2.2 Reduce concurrency
Most targets tolerate low single-digit concurrency per IP better than bursts.
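One simple way to cap concurrency is a thread pool with a small, fixed worker count. A sketch, where `fetch` is a stand-in for your real request function:

```python
from concurrent.futures import ThreadPoolExecutor

# Low single-digit concurrency per IP: 3 workers is a sane starting point.
MAX_WORKERS = 3

def fetch(url: str) -> str:
    # Placeholder for a real HTTP fetch (hypothetical).
    return url

def crawl(urls: list[str]) -> list[str]:
    # The pool never runs more than MAX_WORKERS requests at once,
    # and map() preserves input order in the results.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(fetch, urls))
```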
2.3 Add jitter + pacing
If you request every 1000ms like a metronome, you’re easy to detect.
```python
import random
import time

# Between requests
time.sleep(random.uniform(1.0, 3.0))
```
2.4 Cache results
If you re-fetch the same URL repeatedly, you’re creating your own block.
- cache successful responses
- use conditional requests when possible
- crawl incrementally
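The simplest version is an in-memory cache in front of your fetch function, sketched below; a real crawler would persist it to disk and layer conditional requests (`If-None-Match` / `If-Modified-Since`) on top:

```python
# In-memory cache keyed by URL. Persisting this (sqlite, files) and
# storing ETag/Last-Modified values for conditional GETs are the
# obvious next steps.
_cache: dict[str, str] = {}

def cached_fetch(url: str, fetch) -> str:
    """Return the cached body if present; otherwise fetch once and store."""
    if url in _cache:
        return _cache[url]  # no network hit at all
    body = fetch(url)
    _cache[url] = body
    return body
```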
Step 3: Use retries correctly (backoff or die)
Bad retries make blocks worse.
Good retries:
- only retry on transient errors (timeouts, 5xx, some 429)
- exponential backoff
- stop after a small number of attempts
```python
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

# Retry only transient failures (timeouts, connection errors),
# with exponential backoff + jitter, capped at 4 attempts.
@retry(retry=retry_if_exception_type((requests.Timeout, requests.ConnectionError)),
       stop=stop_after_attempt(4), wait=wait_exponential_jitter(initial=1, max=20))
def fetch(url: str):
    ...
```
If you get repeated 403s, stop retrying. That’s not transient.
Step 4: Header and session hygiene (look like a browser, but don’t cosplay)
You don’t need 40 headers. You need:
- a realistic `User-Agent`
- `Accept` and `Accept-Language`
- consistent cookies (a session)

A simple `requests.Session()` often improves stability.
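A sketch of that baseline; the `User-Agent` string is only an example, so keep yours current and consistent with the rest of your client:

```python
import requests

session = requests.Session()  # reuses connections and carries cookies
session.headers.update({
    # Example UA string -- pick a real, current one.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
})
```

Every request made through `session` now sends these headers and shares the cookie jar, so the site sees one consistent client rather than a series of anonymous one-off requests.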
Step 5: Understand fingerprinting (what’s actually being measured)
Common signals:
- IP reputation (datacenter vs residential)
- request rate patterns
- TLS fingerprint (JA3) / HTTP2 behavior
- missing browser APIs (headless detection)
- cookie/consent flows
If you’re using plain HTTP requests, you’re largely limited to:
- IP rotation
- pacing
- header realism
- avoiding suspicious patterns
If you need browser-level fingerprints, use Playwright (and accept the cost).
Step 6: Proxies — when they help and when they don’t
Proxies are not magic. They’re a lever.
Proxies help when:
- the site throttles by IP (common)
- you need to distribute requests across many IPs
- you see 429s after predictable volume
Proxies don’t help when:
- your selector/parsing is wrong
- the site requires JS rendering (you’re fetching an empty shell)
- you’re getting blocked by account/auth rules
Step 7: ProxiesAPI (practical integration)
ProxiesAPI typically provides a proxy endpoint you route traffic through.
In Python `requests`, you pass a `proxies` dict:

```python
import os
import requests

proxy = os.getenv("PROXIESAPI_PROXY_URL")
proxies = {"http": proxy, "https": proxy} if proxy else None
r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(r.text)
```
What ProxiesAPI gives you:
- a stable way to change egress IPs
- a consistent configuration you can use across scrapers
What it doesn’t guarantee:
- bypassing all bot systems
- solving JS rendering
Decision table: Which fix to try next?
| Your situation | Best next move |
|---|---|
| occasional 5xx/timeouts | retries + backoff |
| frequent 429 | slow down + add caching; then proxies |
| frequent 403 | stop, inspect HTML; likely WAF; consider Playwright + proxies |
| HTML has no data | switch to Playwright (JS rendering) |
| blocks after N requests | rotate IPs (ProxiesAPI) + spread scheduling |
Practical checklist (copy/paste into your scraper README)
- `timeout=(connect, read)` set
- `Session()` used
- retries only for transient errors
- jittery sleep between requests
- concurrency limited
- caching enabled
- block detection (title/phrases) and circuit breaker
- proxy integration (ProxiesAPI) for scale
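The "circuit breaker" item above can be sketched as follows: after a run of consecutive block signals, stop hitting the site for a cooldown window instead of retrying into a wall. Names and thresholds here are illustrative:

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive block signals; retry after cooldown."""

    def __init__(self, max_failures: int = 5, cooldown: float = 300.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        # Below the threshold: keep crawling. Above it: only allow
        # a probe request once the cooldown has elapsed.
        if self.failures < self.max_failures:
            return True
        return time.monotonic() - self.opened_at > self.cooldown

    def record(self, blocked: bool) -> None:
        if blocked:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
        else:
            self.failures = 0  # any clean response resets the breaker
```

Feed it the `blocked` flag from your block detector after every response, and check `allow()` before each request.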
Final word
If your scraper is getting blocked, don’t “try random tricks.”
Treat it like an engineering system:
- measure
- slow down
- retry correctly
- cache
- rotate IPs when volume demands it
That’s how you scrape data without getting blocked — consistently.