How to Scrape Data Without Getting Blocked (2026 Playbook)
Getting blocked is the default outcome of careless scraping.
If you’re searching for how to scrape data without getting blocked, you don’t need vague advice like “use proxies”. You need a system.
This is the 2026 playbook I use to keep crawlers alive in production:
- detect blocks early (don’t parse CAPTCHA pages as “success”)
- design request pacing and retries as first-class features
- use sessions/cookies intentionally
- escalate to a browser only when it’s truly required
- add proxies (ProxiesAPI) after you’ve fixed the basics
Blocks aren’t a single problem — they’re a stack of failure modes. ProxiesAPI helps with the IP/routing layer so your retries and pacing have a chance to work.
1) Understand what “blocked” actually means
“Blocked” isn’t only a 403.
In practice you’ll see:
- Hard blocks: 403 Forbidden, 429 Too Many Requests
- Soft blocks: 200 OK but the HTML is a challenge page
- Degradation: partial content, missing key elements, truncated HTML
- Shadow throttling: responses get slower until timeouts hit
- IP reputation issues: responses look normal, but your parse-success rate is near zero
The key insight:
If you don’t implement block detection, you won’t even know you’re blocked.
2) The minimum viable anti-block stack
Here’s the “boring but effective” foundation:
- timeouts everywhere
- retries with exponential backoff
- per-domain concurrency limits
- caching (so you don’t re-fetch the same page)
- block detection for both status codes and HTML signatures
If you implement only these, your scraper will already beat 80% of scripts in the wild.
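A minimal sketch of that foundation in Python with requests (the retry count, timeout values, and the dict-based cache are illustrative starting points, not a library API):

```python
import random
import time

import requests

CACHE: dict[str, str] = {}  # simplest possible cache: URL -> HTML

def fetch(session: requests.Session, url: str, max_retries: int = 4) -> str | None:
    if url in CACHE:
        return CACHE[url]  # never re-fetch a page you already have
    for attempt in range(max_retries):
        try:
            # (connect timeout, read timeout) so a stalled socket can't hang the crawler
            resp = session.get(url, timeout=(10, 30))
        except requests.RequestException:
            resp = None  # network errors are retryable
        if resp is not None and resp.status_code < 400:
            CACHE[url] = resp.text
            return resp.text
        # 4xx/5xx or a failed connection: exponential backoff with jitter
        time.sleep(2 ** attempt + random.uniform(0, 1))
    return None
```

Per-domain concurrency limits live one level up, in however you schedule these calls (a semaphore per domain in a threaded crawler, for example).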
3) Rate limiting: the #1 cause of blocks
Most people try to scrape a site like:
- 20 concurrent requests
- no delay
- no caching
That’s not scraping — it’s a mini DDoS.
A sane baseline
Start with:
- concurrency: 1–3 per domain
- delay: 0.5–2.0s jittered between requests
- backoff: exponential on 429/503
Then scale gradually while watching block signals.
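As a sketch of that pacing (the DomainPacer class is illustrative, not from any library; tune the bounds per site):

```python
import random
import time

class DomainPacer:
    """Enforce a jittered minimum gap between requests to the same domain."""

    def __init__(self, min_delay: float = 0.5, max_delay: float = 2.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last: dict[str, float] = {}

    def wait(self, domain: str) -> None:
        # Pick a fresh jittered delay for every request, not a fixed interval
        gap = random.uniform(self.min_delay, self.max_delay)
        last = self._last.get(domain)
        if last is not None:
            remaining = gap - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self._last[domain] = time.monotonic()
```

Call pacer.wait(domain) before every request to that domain. The jitter matters: a perfectly fixed interval is itself a bot signature.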
4) Implement real block detection (HTTP + HTML)
Python example (requests)
```python
import re

# HTML signatures that indicate a challenge page rather than real content
SOFT_BLOCK_PATTERNS = [
    re.compile(r"captcha", re.I),
    re.compile(r"unusual traffic", re.I),
    re.compile(r"verify you are a human", re.I),
    re.compile(r"access denied", re.I),
]

def looks_like_soft_block(html: str) -> bool:
    # Empty or suspiciously short responses count as blocks too
    if not html or len(html) < 800:
        return True
    return any(p.search(html) for p in SOFT_BLOCK_PATTERNS)
```
Then treat it as a failure and retry/backoff — don’t parse it.
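Wiring the detector into the fetch loop might look like this (a sketch: it reuses looks_like_soft_block from above and treats hard blocks, soft blocks, and network errors identically):

```python
import random
import time

import requests

def fetch_html(session: requests.Session, url: str, max_retries: int = 4) -> str | None:
    for attempt in range(max_retries):
        try:
            resp = session.get(url, timeout=(10, 30))
        except requests.RequestException:
            resp = None
        if (
            resp is not None
            and resp.status_code not in (403, 429)
            and not looks_like_soft_block(resp.text)
        ):
            return resp.text  # real content: safe to parse
        # hard block, soft block, or network error: back off before retrying
        time.sleep(2 ** attempt + random.uniform(0, 1))
    return None  # still blocked after retries: log it and move on
```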
Node example (Axios)
```js
export function looksLikeSoftBlock(html) {
  if (!html || html.length < 800) return true;
  const lower = html.toLowerCase();
  return (
    lower.includes("captcha") ||
    lower.includes("unusual traffic") ||
    lower.includes("verify you are a human") ||
    lower.includes("access denied")
  );
}
```
5) Fingerprints: don’t cosplay a browser (just be consistent)
A common myth is that you need to set 50 headers.
You don’t.
But you do want:
- a modern User-Agent
- Accept and Accept-Language headers
- consistent behavior across requests (session + cookies)
Python session pattern
```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})
```
6) Proxies: what they solve (and what they don’t)
Proxies help with:
- IP-based rate limits
- reputation problems on a single IP
- distributing traffic across pools
They do not automatically solve:
- broken selectors
- JS-only pages
- bot challenges based on browser fingerprint
- aggressive per-session throttles
ProxiesAPI integration pattern
Don’t hardcode secrets. Read from env and pass to your HTTP client.
```python
import os

# e.g. http://USER:PASS@host:port
proxy = os.getenv("PROXIESAPI_PROXY_URL")
proxies = {"http": proxy, "https": proxy} if proxy else None

# `session` is the requests.Session configured in section 5
r = session.get(url, proxies=proxies, timeout=(10, 30))
```
If your block rate drops but your parse success rate is still low, you likely need a browser fallback.
7) Browser fallback: use Playwright like a scalpel
Use a browser when:
- content is JS-rendered
- you need to execute interactions
- you need stable DOM after hydration
Playwright example:
```js
import { chromium } from "playwright";

export async function fetchRenderedHtml(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 45_000 });
    return await page.content();
  } finally {
    await browser.close(); // close even when goto throws, or browsers leak
  }
}
```
Then run the same parsing logic.
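The escalation itself can be a few lines (sketch only: fetch_html is the fetcher from section 4's example, and fetch_rendered_html is assumed here to be a Python port of the Playwright helper above):

```python
def get_page(session, url: str) -> str | None:
    """HTTP-first: only pay the browser cost when the cheap path fails."""
    html = fetch_html(session, url)    # plain requests + block detection
    if html is not None:
        return html
    return fetch_rendered_html(url)    # browser fallback for JS-only pages
```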
8) The production checklist (print this)
Crawl design
- per-domain concurrency limit
- jittered delays
- caching layer
- resume support (store progress)
Network hygiene
- connect + read timeouts
- retries with exponential backoff
- circuit breaker when block rate spikes (see the sketch after this checklist)
Block detection
- treat 403/429 as failures (don’t parse)
- detect soft-block HTML signatures
- alert on sudden parse-success drops
Rendering strategy
- HTTP-first
- Playwright/Puppeteer only when needed
Proxy strategy
- use ProxiesAPI (or equivalent) to distribute traffic
- rotate IPs only when you see throttles
- keep sessions consistent when the site expects it
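For the circuit-breaker item, a minimal sketch (the window size and threshold are arbitrary starting points, not recommendations):

```python
from collections import deque

class BlockRateBreaker:
    """Trip when too many of the last N requests looked blocked."""

    def __init__(self, window: int = 50, max_block_rate: float = 0.3):
        self.results: deque[bool] = deque(maxlen=window)
        self.max_block_rate = max_block_rate

    def record(self, blocked: bool) -> None:
        self.results.append(blocked)

    def tripped(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough signal yet
        return sum(self.results) / len(self.results) > self.max_block_rate
```

When it trips, pause the domain, drop concurrency, or rotate exit IPs instead of burning the rest of your queue.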
Practical advice: what I’d do tomorrow morning
If you’re currently getting blocked:
1. Add timeouts + retries + backoff
2. Add soft-block detection
3. Lower concurrency to 1–2 per domain
4. Add caching
5. Only then add proxies (ProxiesAPI)
6. Add a browser fallback for the URLs that still fail
That ordering matters.