How to Scrape Data Without Getting Blocked (Practical Playbook)
If you’ve ever built a scraper that works perfectly for 50 requests… and collapses at 5,000, you already know the truth:
Getting blocked is the default outcome of naive scraping.
This playbook is the practical version — the stuff that actually reduces blocks in production:
- rate limiting (the #1 fix)
- realistic headers + sessions
- retries + backoff
- caching and incremental updates
- proxy rotation (when you truly need it)
- browser fallback (when HTTP scraping isn’t enough)
- monitoring so you know when things break
When blocks start (403/429) and reliability matters, ProxiesAPI helps by providing a more stable fetch layer while you keep your parsing logic unchanged.
The blocking spectrum (know what you’re fighting)
Sites block scrapers in a few common ways:
- Soft throttling: slower responses, timeouts
- HTTP rate limits: 429 Too Many Requests
- Hard blocks: 403 Forbidden, captcha pages
- Content poisoning: you get “valid HTML” but it’s a block page
- Behavioral detection: fingerprinting, JS challenges
Your goal isn’t “never get blocked.” Your goal is:
- detect blocks quickly
- recover automatically
- keep your dataset quality high
Rule 1: Slow down (most people don’t)
If you do only one thing, do this:
- add delays
- lower concurrency
- spread work over time
The fastest way to get blocked is “no delay + high parallelism + no backoff”.
A simple rate limiter
import random
import time

def sleep_jitter(min_s=1.0, max_s=2.5):
    time.sleep(random.uniform(min_s, max_s))
Use it between requests.
For bigger crawls, consider token-bucket rate limiting or per-host limits.
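For the per-host case, a token bucket is one common way to do it. The sketch below is a minimal, single-threaded version; the class names `TokenBucket` and `PerHostLimiter` and the default rate are assumptions for illustration, not part of any library.

```python
import time
from urllib.parse import urlparse


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last update, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def wait_time(self) -> float:
        """Seconds until one token is available (0.0 if one is available now)."""
        self._refill()
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.rate

    def acquire(self, sleep=time.sleep) -> None:
        """Block until a token is available, then consume it."""
        delay = self.wait_time()
        if delay > 0:
            sleep(delay)
            self._refill()
        self.tokens -= 1


class PerHostLimiter:
    """Keep one bucket per hostname so a slow site does not throttle the others."""

    def __init__(self, rate: float = 0.5, capacity: float = 2):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def acquire(self, url: str) -> None:
        host = urlparse(url).netloc
        if host not in self.buckets:
            self.buckets[host] = TokenBucket(self.rate, self.capacity)
        self.buckets[host].acquire()
```

Call `limiter.acquire(url)` right before each request. With the assumed defaults, each host gets a burst of two requests and then roughly one request every two seconds. A multi-threaded crawler would need a lock around the bucket state.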
Rule 2: Use sessions + consistent headers
A common bot smell is:
- new connection per request
- missing Accept-Language
- weird User-Agent
Use a requests.Session() and a sane header set.
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
# Set the headers once on the session so every request sends the same set.
session.headers.update(HEADERS)

r = session.get("https://example.com", timeout=(10, 30))
r.raise_for_status()
Don’t randomize everything. Consistency often looks more human.
Rule 3: Timeouts + retries (with backoff)
Blocks often show up as transient failures. Your scraper should be able to recover.
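import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class FetchError(Exception):
    """A failure worth retrying: block, server error, or network timeout."""

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(FetchError),
)
def fetch(session: requests.Session, url: str) -> str:
    try:
        r = session.get(url, timeout=(10, 30))
    except (requests.Timeout, requests.ConnectionError) as exc:
        # Network failures never reach the status checks, so convert them here.
        raise FetchError(f"network error: {exc}") from exc
    if r.status_code in (403, 429):
        raise FetchError(f"blocked: {r.status_code}")
    if r.status_code >= 500:
        raise FetchError(f"server error: {r.status_code}")
    # Other 4xx responses (such as 404) raise HTTPError, which is not retried.
    r.raise_for_status()
    return r.text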
Key idea: don’t retry 404s; do retry 403/429/5xx/timeouts.
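Some servers send a Retry-After header with a 429, either as a number of seconds or as an HTTP date. When it is present, it is usually a better guide than a computed delay. The sketch below shows one way to combine the two; `retry_after_seconds` and `backoff_delay` are hypothetical helper names, and the base and cap values are assumptions.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After value (delay in seconds or an HTTP date) into seconds."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: let the caller fall back to computed backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based), honoring the server hint if any."""
    hinted = retry_after_seconds(retry_after)
    if hinted is not None:
        return hinted  # the caller may choose to give up if this is very long
    # Exponential backoff with full jitter: a random delay up to the capped ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

In a hand-rolled retry loop you would call `time.sleep(backoff_delay(attempt, r.headers.get("Retry-After")))` between attempts. The jitter keeps parallel workers from retrying in lockstep.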
Rule 4: Detect block pages (content poisoning)
Sometimes you get status 200 but the content is:
- a captcha
- “Access denied”
- an interstitial
Add a lightweight detector:
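def looks_blocked(html: str) -> bool:
    h = html.lower()
    # Generic markers; tune them per site, since a word like "blocked"
    # can also appear in legitimate page content.
    return any(x in h for x in [
        "captcha",
        "access denied",
        "unusual traffic",
        "verify you are a human",
        "blocked",
    ])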
If looks_blocked(html) is true, treat it like a 403.
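One way to wire that in is a single validation step that every response passes through, so status-code blocks and content blocks raise the same exception. This is a sketch: `BlockedError`, `check_response`, and the optional `required_marker` argument are hypothetical names, and the detector is repeated here only to keep the example self-contained.

```python
from typing import Optional


class BlockedError(Exception):
    """Raised when a response looks like a block, whatever its status code."""


BLOCK_MARKERS = (
    "captcha",
    "access denied",
    "unusual traffic",
    "verify you are a human",
)


def looks_blocked(html: str) -> bool:
    h = html.lower()
    return any(marker in h for marker in BLOCK_MARKERS)


def check_response(status_code: int, html: str,
                   required_marker: Optional[str] = None) -> str:
    """Return html if it looks like real content, otherwise raise BlockedError."""
    if status_code in (403, 429):
        raise BlockedError(f"blocked with status {status_code}")
    if looks_blocked(html):
        raise BlockedError(f"block page served with status {status_code}")
    # Positive check (assumption): a real page should contain a string we expect,
    # for example a class name used by the element we plan to parse.
    if required_marker is not None and required_marker not in html:
        raise BlockedError(f"expected marker {required_marker!r} not found")
    return html
```

If the retrying fetch function raises its retryable error whenever `BlockedError` appears, poisoned pages get the same backoff treatment as a 403 and never reach the parser.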
Rule 5: Cache aggressively (scrape less)
If your job re-fetches the same pages repeatedly, you’ll:
- waste money
- increase blocks
- slow down pipelines
Caching options (ascending complexity):
- save raw HTML files to disk
- store responses in SQLite
- store “last-seen” hashes per URL and only re-parse on change
Even a simple file cache helps:
from pathlib import Path
import hashlib

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_path(url: str) -> Path:
    h = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{h}.html"
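The path helper only names the cache file; the read/write logic and the "last-seen hash" idea from the list could look roughly like this. `PageCache` is a hypothetical class for illustration, it takes the fetch function as an argument so it works with any fetcher, and it has no expiry, which a real crawl would likely want.

```python
import hashlib
from pathlib import Path
from typing import Callable


class PageCache:
    """File cache keyed by URL hash, plus per-URL content hashes for change detection."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str, suffix: str) -> Path:
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.root / f"{key}{suffix}"

    def get_or_fetch(self, url: str, fetch: Callable[[str], str]) -> str:
        """Return cached HTML if present; otherwise fetch, store, and return it."""
        path = self._path(url, ".html")
        if path.exists():
            return path.read_text(encoding="utf-8")
        html = fetch(url)
        path.write_text(html, encoding="utf-8")
        return html

    def changed(self, url: str, html: str) -> bool:
        """True if html differs from the last content seen for url. Records the new hash."""
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        path = self._path(url, ".sha256")
        previous = path.read_text(encoding="utf-8") if path.exists() else None
        path.write_text(digest, encoding="utf-8")
        return digest != previous
```

A typical loop calls `get_or_fetch` during development, so parser changes never hit the site, and `changed` in scheduled runs, so unchanged pages skip the parse and write steps.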
When to use proxies (and when not to)
Proxies are not a magic “unblock button”. Use them when:
- you’re doing high-volume crawling
- the site rate-limits per IP
- you need geographic access
Don’t use them to compensate for bad behavior:
- zero delays
- aggressive concurrency
- no caching
How ProxiesAPI fits
ProxiesAPI can help when you start seeing persistent 403/429 and timeouts.
The clean integration approach is:
- keep your parsing and scheduling code unchanged
- route the fetch step through ProxiesAPI
import os
import requests

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

def fetch_via_proxiesapi(session: requests.Session, url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY")
    r = session.get(
        "https://api.proxiesapi.com",
        params={"auth_key": PROXIESAPI_KEY, "url": url},
        timeout=(10, 30),
    )
    r.raise_for_status()
    return r.text
(Adjust params to match your ProxiesAPI docs/plan.)
Browser fallback: Playwright for JS-heavy sites
If a site is JS-heavy, HTTP scraping may never work reliably.
Playwright pattern:
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle; pages that poll may time out.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            # Close the browser even if navigation fails.
            browser.close()
Use browsers selectively — they are slower and more detectable at scale.
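"Selectively" can be made concrete with a fallback chain: try the cheapest fetcher first and escalate only when it fails. The sketch below assumes each tier is a plain function that takes a URL and returns HTML or raises; `fetch_with_fallback` and `AllFetchersFailed` are hypothetical names.

```python
from typing import Callable, Iterable, Tuple

Fetcher = Callable[[str], str]


class AllFetchersFailed(Exception):
    """Raised when every tier in the chain has failed for a URL."""


def fetch_with_fallback(url: str,
                        fetchers: Iterable[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each (name, fetcher) in order; return (name, html) from the first success."""
    errors = []
    for name, fetcher in fetchers:
        try:
            return name, fetcher(url)
        except Exception as exc:  # broad on purpose: any tier failure triggers escalation
            errors.append(f"{name}: {exc!r}")
    raise AllFetchersFailed(f"{url}: " + "; ".join(errors))
```

With the functions from this playbook, the tiers might be `[("http", lambda u: fetch(session, u)), ("proxy", lambda u: fetch_via_proxiesapi(session, u)), ("browser", fetch_rendered)]`. Returning the tier name makes it easy to count how often each one is needed, which tells you whether the expensive tiers are earning their cost.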
Monitoring: know when you’re getting blocked
Add basic metrics:
- response status counts
- average latency
- block-page detector hits
- rows extracted per run
Even logging to a CSV can be enough to catch regressions.
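A minimal version of that CSV logging, assuming one summary row per run, might look like this. `RunMetrics` and its field names are hypothetical; counting any status of 400 or above as a failure is an assumption you may want to adjust.

```python
import csv
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RunMetrics:
    statuses: Counter = field(default_factory=Counter)
    latencies: list = field(default_factory=list)
    block_hits: int = 0
    rows: int = 0

    def record(self, status: int, latency_s: float,
               blocked: bool = False, rows: int = 0) -> None:
        """Record one request: status code, latency, detector hit, rows extracted."""
        self.statuses[status] += 1
        self.latencies.append(latency_s)
        self.block_hits += int(blocked)
        self.rows += rows

    def summary(self) -> dict:
        total = sum(self.statuses.values())
        failures = sum(n for code, n in self.statuses.items() if code >= 400)
        avg = sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
        return {
            "requests": total,
            "failure_rate": round(failures / total, 3) if total else 0.0,
            "avg_latency_s": round(avg, 3),
            "block_hits": self.block_hits,
            "rows": self.rows,
        }

    def append_csv(self, path: Path) -> None:
        """Append this run's summary as one row, writing the header for a new file."""
        path = Path(path)
        row = self.summary()
        new_file = not path.exists()
        with path.open("a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if new_file:
                writer.writeheader()
            writer.writerow(row)
```

Comparing the latest row with the previous few is often enough: a jump in `failure_rate` or `block_hits`, or a drop in `rows` with normal status codes, usually means the site changed or started blocking.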
Practical checklist (copy/paste)
Before you run a scraper in production:
- connect/read timeouts set
- retries with exponential backoff
- session + realistic headers
- per-host rate limiting
- cache or incremental crawl strategy
- block page detection
- proxy layer ready (if needed)
- browser fallback plan (if JS-heavy)
- alerts on sudden failure spikes
Bottom line
You avoid blocks by being boring:
- steady rate
- consistent requests
- fewer unnecessary fetches
And when you need more reliability at scale:
- add a proxy layer like ProxiesAPI
- keep parsing deterministic
- monitor and adapt