Error Code 520 When Scraping: What It Means and a Practical Fix Checklist
You’re scraping along, everything is fine… and then you hit:
520 Web Server Returned an Unknown Error
If you search it, you get a lot of “try again later” advice.
That’s not helpful when you’re trying to build a crawler that runs nightly.
This guide covers:
- what a 520 error actually means (in Cloudflare terms)
- the most common scraping-specific causes
- a debugging checklist that finds the root cause fast
- code patterns for retries and backoff that don’t make blocks worse
When you’re scraping at scale, reliability comes from engineering: timeouts, retries, backoff, and a consistent proxy layer. ProxiesAPI helps by making the network path less volatile across many targets.
What is error code 520?
520 is a Cloudflare catch-all error.
It means Cloudflare reached the origin server but got back an empty, unknown, or otherwise unexpected response. It is a Cloudflare-specific code, not part of the HTTP standard, and it is used when the failure doesn’t map neatly to a more specific status like 502/503/504.
In practice, for scrapers, a 520 typically comes from one of these buckets:
- your request got blocked (WAF / bot protection) and the origin didn’t respond cleanly
- the origin is flaky or overloaded
- the connection between Cloudflare and the origin was reset or closed before a complete response arrived
- your client is causing strange behavior (timeouts, premature disconnects, weird headers)
The key: 520 is a symptom, not a diagnosis.
The 80/20 causes when scraping
1) You’re sending “bot-shaped” traffic
Common triggers:
- no User-Agent / default UA
- missing Accept / Accept-Language headers
- suspicious header ordering or fingerprint mismatch
- high request rate from one IP
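A quick way to catch the first two triggers is to lint your outgoing headers before you send anything. The sketch below is only an illustration: `header_problems`, `EXPECTED_HEADERS`, and `LIBRARY_UA_MARKERS` are hypothetical names, and the marker list is an assumption, not an exhaustive catalogue of what bot protection looks for.

```python
from typing import Mapping

# Headers named in the trigger list above; real browsers send all three.
EXPECTED_HEADERS = ("User-Agent", "Accept", "Accept-Language")

# Substrings that give away common HTTP-library default User-Agents.
# Illustrative only, not a complete list.
LIBRARY_UA_MARKERS = ("python-requests", "python-urllib", "curl/", "go-http-client")


def header_problems(headers: Mapping[str, str]) -> list[str]:
    """Return human-readable reasons these headers look bot-shaped."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    lowered = {name.lower(): value for name, value in headers.items()}
    problems = [
        f"missing {name}" for name in EXPECTED_HEADERS if not lowered.get(name.lower())
    ]
    user_agent = lowered.get("user-agent", "").lower()
    if any(marker in user_agent for marker in LIBRARY_UA_MARKERS):
        problems.append("library default User-Agent")
    return problems
```

If you use requests, you can pass `session.headers` straight in. An empty list does not mean you will pass fingerprinting; it only means you are not failing the cheapest checks.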
2) You’re being challenged and your client can’t complete it
Some sites return a JS / Turnstile / captcha flow.
If your client is a plain HTTP library such as Python’s requests, it can’t run JavaScript, so it never completes the challenge and may keep re-fetching the same interstitial page.
3) The origin is unstable (not your fault)
If the site is down or overloaded, you’ll see intermittent 520s even from a browser.
Your fix here is:
- retry with backoff
- reduce concurrency
- cache responses
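Caching is the easiest of the three to bolt on. Here is a minimal on-disk cache sketch, assuming you already have some function that fetches a URL and returns its body as text. `cached_fetch` and its parameters are hypothetical names, and there is no expiry logic, so treat it as a starting point rather than a finished cache.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetcher: Callable[[str], str], cache_dir: Path) -> str:
    """Return the cached body for `url`, calling `fetcher` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{digest}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetcher(url)  # may raise; failed fetches are deliberately never cached
    path.write_text(body, encoding="utf-8")
    return body
```

Because failures are not written to disk, a re-run after an outage only re-requests the URLs that actually failed.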
4) Your retries are making the block worse
This is a classic failure mode:
- request fails
- code retries immediately (no backoff)
- you amplify the “bad traffic” signal
- blocks escalate → more failures
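The way out of that loop is to wait longer after each failure and to randomise the wait. A small sketch of one common scheme (capped exponential backoff with jitter) follows; `backoff_delay` and its default values are assumptions chosen for illustration, not tuned numbers.

```python
import random


def backoff_delay(attempt: int, *, base: float = 2.0, cap: float = 60.0, rng=random) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    # Exponential growth, capped so a long outage does not produce hour-long sleeps.
    ceiling = min(cap, base ** attempt)
    # Jitter spreads retries out so parallel workers do not all wake at once.
    return rng.uniform(ceiling / 2, ceiling)
```

With the defaults, the first retry waits between 1 and 2 seconds, the second between 2 and 4, and so on until the 60-second cap. Pass a seeded `random.Random` as `rng` if you want reproducible delays in tests.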
Debugging flow (fast, deterministic)
Do this in order. Don’t jump to “buy more proxies” before you know what’s happening.
Step 1: Confirm it’s Cloudflare
Look for response headers like:
server: cloudflare
cf-ray: ...
If you can’t see headers (because you’re using a proxy API), fetch one failing URL directly with curl -I from your machine to confirm.
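If you would rather check this in code than by eye, the test is a two-liner. This is a sketch; `served_by_cloudflare` is a hypothetical helper that accepts any mapping of header names to values, such as `r.headers` from requests.

```python
from typing import Mapping


def served_by_cloudflare(headers: Mapping[str, str]) -> bool:
    """True if the response headers carry either Cloudflare marker described above."""
    # Header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    return "cf-ray" in lowered or "cloudflare" in lowered.get("server", "")
```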
Step 2: Capture the first failing response body
Don’t throw it away.
Save:
- status code
- headers
- first ~2KB of body
If the body is HTML and includes “Attention Required”, “Just a moment…”, or a captcha, you’re blocked/challenged — not “randomly failing”.
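One way to make this a habit is to snapshot every failing response in a fixed shape. The sketch below is an assumption about how you might structure it: `snapshot` and `CHALLENGE_MARKERS` are hypothetical names, and the marker phrases are simply the ones quoted above.

```python
from typing import Mapping

# Phrases quoted in the text above; matched case-insensitively.
CHALLENGE_MARKERS = ("attention required", "just a moment", "captcha")


def snapshot(status: int, headers: Mapping[str, str], body: bytes, limit: int = 2048) -> dict:
    """Keep just enough of a failing response to diagnose it later."""
    # Decode only the first `limit` bytes; undecodable bytes are replaced, not fatal.
    head = body[:limit].decode("utf-8", errors="replace")
    return {
        "status": status,
        "headers": dict(headers),
        "body_head": head,
        "looks_challenged": any(marker in head.lower() for marker in CHALLENGE_MARKERS),
    }
```

With requests, that would be called as `snapshot(r.status_code, r.headers, r.content)`, and the resulting dict can go straight into `json.dumps` for your logs.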
Step 3: Reproduce with a browser
Open the same URL in a normal browser:
- If the browser fails too → origin is likely down/unhealthy.
- If the browser works instantly but your scraper fails → your traffic shape is the issue.
Step 4: Reduce the problem (one URL, one request)
Make a minimal script that does exactly one request.
If a single request fails, you don’t have a “scaling” problem. You have an access/fingerprint problem.
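A minimal script really can be this small. The sketch below uses only the standard library so nothing else in your stack can interfere. `fetch_once` is a hypothetical name, and the `opener` parameter exists only so the function can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def fetch_once(url: str, headers=None, timeout: float = 30.0, opener=urllib.request.urlopen):
    """Make exactly one GET request: no retries, no session, no concurrency.

    Returns a (status, headers, body) tuple even for 4xx/5xx responses.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items()), response.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the exception still carries the response.
        return err.code, dict(err.headers.items()), err.read()
```

Run it against one failing URL with the same headers your scraper sends. If that single request comes back as a 520 or a challenge page, the problem is access, not load.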
A resilient request pattern (Python)
This is the baseline request pattern you should use for any scraping fetch layer:
- timeouts (connect + read)
- retries with exponential backoff + jitter
- sanity checks (HTML too small, wrong content-type, etc.)
```python
import os
import random
import time
import urllib.parse

import requests

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 60)

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")


def via_proxiesapi(url: str) -> str:
    """Wrap the target URL in a ProxiesAPI request URL."""
    return "https://api.proxiesapi.com/?" + urllib.parse.urlencode({
        "key": PROXIESAPI_KEY,
        "url": url,
    })


def fetch(url: str, *, use_proxiesapi: bool = False, retries: int = 4) -> str:
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            # Falls back to a direct request when no API key is configured.
            final = via_proxiesapi(url) if (use_proxiesapi and PROXIESAPI_KEY) else url
            r = session.get(final, timeout=TIMEOUT)

            # Some 520s show up as HTML challenges or tiny "error" pages.
            if r.status_code >= 400 and r.text and r.text.lstrip().startswith("<"):
                head = r.text[:400].lower()
                if "cloudflare" in head or "just a moment" in head or "attention required" in head:
                    raise RuntimeError(f"Challenge/block page (status={r.status_code})")

            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            if attempt == retries:
                break  # no point sleeping after the final attempt
            # Exponential backoff (2s, 4s, 8s, ...) plus up to 40% jitter.
            base = 2.0 ** attempt
            jitter = random.uniform(0.0, 0.4 * base)
            time.sleep(base + jitter)

    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_err
```
“Fix” checklist (what to change first)
When you hit 520s consistently, change one variable at a time:
- Add sane headers (User-Agent, Accept, Accept-Language)
- Enforce timeouts
- Add backoff + jitter (no immediate retry loops)
- Lower concurrency
- Add caching so re-runs don’t re-hit the same URLs
- Use a proxy layer (rotate IPs, avoid “one IP hammering”)
If your target uses heavy bot mitigation, you may also need:
- a browser-based fetch (Playwright)
- an unblocker service
But the first 5 steps are still mandatory. They make every approach better.
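Of the items on the checklist, "lower concurrency" is the one people most often skip because it feels like it needs a framework. It doesn't. Here is a minimal single-threaded throttle sketch; `Throttle` is a hypothetical class, it is not thread-safe, and the `clock` and `sleep` parameters are only there to make it testable.

```python
import time


class Throttle:
    """Allow at most one request every `min_interval` seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest clock reading at which a request may start

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the following slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Create one `Throttle` per target site and call `wait()` immediately before each request. Start with a generous interval and shorten it only while the error rate stays flat.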
Quick decision: is it you, or the site?
Use this table:
| Symptom | Likely Cause | Next Move |
|---|---|---|
| Browser fails too | Origin/server issue | Retry with backoff; wait |
| Browser works; scraper fails instantly | Bot protection / fingerprint | Adjust headers; proxy/unblock |
| Works for a while; fails after N requests | Rate limiting/IP reputation | Lower rate; rotate IPs |
| Only fails on some pages | Edge cases/redirects | Log bodies; handle redirects |
Bottom line
Treat 520 like a smoke alarm:
- don’t ignore it
- don’t panic
- isolate the cause and fix the fetch layer first
Once your fetch layer is stable, the rest of your scraper becomes boring — and that’s exactly what you want.