403 Forbidden When Scraping: Why It Happens and 7 Fixes That Work
If you scrape long enough, you will run into this:
403 Client Error: Forbidden for url: https://target-site.example/...
The mistake most people make is assuming every 403 means the same thing.
It doesn’t.
A 403 Forbidden can mean:
- the site does not like your IP
- your request headers look wrong
- you skipped cookies or session setup
- you hit a geo restriction
- you are actually seeing a soft block disguised as success somewhere else
That is why the right response is not “retry harder.” It is to diagnose the failure class first.
This guide breaks the problem into 7 fixes that actually work, in the order I would test them.
Most scraping teams waste money by responding to every 403 with more retries. The better move is to identify what triggered the block, then fix the fetch pattern, session handling, or IP layer in that order.
First: separate 403 from the other common block patterns
A 403 is not the same as a 429, and it is not the same as a “200 OK but useless HTML” page.
| Symptom | What it usually means | First move |
|---|---|---|
| 403 Forbidden | explicit denial: IP, headers, session, geo, WAF rule | inspect request identity and session flow |
| 429 Too Many Requests | you are over the rate limit | slow down, back off, reduce concurrency |
| 200 OK with challenge / captcha HTML | soft block or JS challenge | detect bad content, then escalate transport |
| repeated redirect to login / home | missing auth or missing cookies | persist cookies and follow normal navigation |
Why this matters: the fastest way to waste proxies is treating 403, 429, and challenge pages like one problem.
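To make the table concrete, here is a minimal sketch of a classifier that sorts a response into one of those four patterns. The function name, the class labels, the marker strings, and the "/login" URL check are all my assumptions for illustration; tune them per target.

```python
from __future__ import annotations

# Hypothetical marker strings; adjust them for the site you are scraping.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "enable javascript")


def classify_response(status_code: int, body: str, final_url: str = "") -> str:
    """Map a response to one of the block patterns from the table."""
    if status_code == 403:
        return "explicit_denial"
    if status_code == 429:
        return "rate_limited"
    if status_code == 200:
        text = body.lower()
        if any(marker in text for marker in CHALLENGE_MARKERS):
            return "soft_block"
        # Ending up on a login page after redirects suggests missing auth or cookies.
        if "/login" in final_url:
            return "auth_redirect"
        return "ok"
    return "other"
```

With requests, you would call it as `classify_response(r.status_code, r.text, r.url)` and branch on the label instead of on the raw status code.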
Why 403 happens in scraping
In practice, most 403s come from one of five buckets:
- IP reputation
- Header / fingerprint mismatch
- Missing cookies or broken sessions
- Geo restrictions
- Aggressive request patterns
The site is not really saying “you are forbidden forever.” It is usually saying “this request does not look acceptable.”
That is good news, because acceptable requests can often be rebuilt.
Fix 1: Make your headers coherent
Many beginner scrapers either:
- send the default python-requests/... user agent
- randomize headers into nonsense combinations
Both are bad.
The goal is not random. The goal is coherent.
Use a believable browser profile and keep related headers aligned:
import requests

# One session, so the same headers (and later cookies) go out on every request.
session = requests.Session()
session.headers.update(
    {
        # Chrome reports a four-part version number; "Chrome/125.0" alone is itself a tell.
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/125.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
    }
)
This will not solve every 403, but it removes the easiest reason to block you.
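If you want a quick guard against the two mistakes above, a small lint function can run before any request goes out. This is only a sketch: `header_warnings` and the specific rules in it are my assumptions, not an exhaustive fingerprint check.

```python
from __future__ import annotations


def header_warnings(headers: dict[str, str]) -> list[str]:
    """Return human-readable warnings for obviously incoherent header sets."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    warnings = []

    if not user_agent or user_agent.startswith("python-requests"):
        warnings.append("User-Agent is missing or is the library default")
    if "Mozilla/5.0" in user_agent:
        # A browser-like User-Agent without these headers looks inconsistent.
        for required in ("accept", "accept-language"):
            if required not in lowered:
                warnings.append(f"browser User-Agent but no {required} header")
    return warnings
```

Calling `header_warnings(dict(session.headers))` on a freshly created session should flag the default user agent, which is a useful reminder that the profile was never applied.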
Fix 2: Reuse sessions instead of going in cold every time
Lots of sites expect a request journey that looks something like:
- landing page
- cookie set
- next page
- detail page
Bad scraper behavior looks like:
- new TCP/TLS connection
- no cookies
- direct hit on the hardest endpoint
- repeated fast requests
Better pattern:
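import requests


def warm_session(session: requests.Session, homepage: str) -> None:
    # Visit the landing page first so the session picks up any cookies it sets.
    r = session.get(homepage, timeout=(10, 30))
    r.raise_for_status()


def fetch_detail(session: requests.Session, url: str) -> requests.Response:
    # Reuse the warmed session so cookies and the connection carry over.
    return session.get(url, timeout=(10, 30))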
That tiny “warm-up” step is often enough to stop avoidable 403s on mid-friction sites.
Fix 3: Lower concurrency and add real backoff
If you send 100 requests in 2 seconds from one IP, the target may answer with 403 even if the real underlying issue is rate abuse.
Here is a practical retry pattern:
from __future__ import annotations

import random
import time

import requests

# Statuses worth retrying after a pause; anything else fails fast.
RETRYABLE = {403, 429, 500, 502, 503, 504}


def fetch_with_backoff(session: requests.Session, url: str, tries: int = 5) -> requests.Response:
    last = None
    for attempt in range(1, tries + 1):
        response = session.get(url, timeout=(10, 30))
        last = response
        if response.status_code == 200:
            return response
        if response.status_code in RETRYABLE:
            # Wait longer on each attempt, with jitter so workers do not sync up.
            sleep_for = min(45, attempt * 4) + random.uniform(0.5, 1.8)
            time.sleep(sleep_for)
            continue
        # Non-retryable status: raise on 4xx/5xx, otherwise hand the response back.
        response.raise_for_status()
        return response
    # Compare with None explicitly: a Response with an error status is falsy.
    raise RuntimeError(f"failed after retries, last status={last.status_code if last is not None else 'unknown'}")
Important detail: retries should get calmer, not louder.
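To see what "calmer" means in numbers, here is a small sketch that isolates the delay calculation from the retry loop. `linear_delay` mirrors the `min(45, attempt * 4)` rule used there; `exponential_delay` is an alternative I am adding for comparison, not something the loop uses, and its base of 2 seconds is an assumption.

```python
from __future__ import annotations

import random


def linear_delay(attempt: int, step: float = 4.0, cap: float = 45.0) -> float:
    """Base delay from the retry loop: grows by `step` seconds per attempt."""
    return min(cap, attempt * step)


def exponential_delay(attempt: int, base: float = 2.0, cap: float = 45.0) -> float:
    """Alternative schedule (an assumption): double the delay on each attempt."""
    return min(cap, base * 2 ** (attempt - 1))


def with_jitter(delay: float, rng: random.Random) -> float:
    """Add 0.5 to 1.8 seconds of random jitter, as the retry loop does."""
    return delay + rng.uniform(0.5, 1.8)
```

For attempts 1 through 5 the linear schedule waits 4, 8, 12, 16, and 20 seconds before jitter, while the exponential one waits 2, 4, 8, 16, and 32. Either way, each retry is slower than the one before it, which is the point.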
Fix 4: Detect soft blocks explicitly
A lot of teams only look at status codes. That misses every failure that arrives with a success status.
Sometimes you get:
- 200 OK
- but the body is a challenge page
- or a “please enable JavaScript” page
- or a captcha shell
Build a content-level check:
def looks_blocked(html: str) -> bool:
    text = html.lower()
    indicators = [
        "captcha",
        "access denied",
        "forbidden",
        "enable javascript",
        "cf-chl",
        "verify you are human",
    ]
    return any(token in text for token in indicators)


response = fetch_with_backoff(session, url)
if looks_blocked(response.text):
    raise RuntimeError("soft block detected despite non-403 response")
This is one of the highest-leverage improvements you can make, because it keeps bad HTML out of your parser and prevents silent data corruption.
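A keyword blocklist can misfire on real pages that happen to mention "captcha" or "forbidden", so I sometimes pair it with the opposite check: confirm that content you expect is actually there. The sketch below assumes you know at least one marker that every real page contains (a CSS class, a heading, a JSON key). The function names and the 500-character minimum are hypothetical.

```python
from __future__ import annotations


def missing_markers(html: str, required_markers: list[str]) -> list[str]:
    """Return the required markers that do not appear in the HTML."""
    text = html.lower()
    return [m for m in required_markers if m.lower() not in text]


def is_real_page(html: str, required_markers: list[str], min_length: int = 500) -> bool:
    """Treat a page as real only if it is long enough and has every marker."""
    return len(html) >= min_length and not missing_markers(html, required_markers)
```

Challenge pages tend to be short and generic, so a page that is tiny or lacks the marker is worth quarantining rather than parsing.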
Fix 5: Rotate IPs only when the evidence points to IP-based blocking
People jump to proxy rotation too early.
That is expensive, and it also hides upstream mistakes.
Use proxies when you see signs like:
- fresh session still gets immediate 403
- same code works from one network but not another
- region-specific pages differ by country
- bans follow the IP more than the cookie jar
This is where ProxiesAPI fits well: as a cleaner network layer after you already fixed obvious request problems.
A common pattern is:
import os
proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
PROXIES = {"http": proxy_url, "https": proxy_url} if proxy_url else None
response = session.get(url, timeout=(10, 30), proxies=PROXIES)
That keeps the parser untouched while improving transport quality.
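Before paying for rotation, it helps to check whether the evidence really points at the IP. Here is a rough sketch that takes a handful of test attempts and asks whether 403s follow specific networks. The tuple layout `(ip_label, fresh_session, status_code)` and both function names are assumptions for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def blocked_ips(attempts: list[tuple[str, bool, int]]) -> set[str]:
    """Return IP labels whose fresh-session requests were all answered with 403.

    Each attempt is (ip_label, fresh_session, status_code).
    """
    fresh_statuses = defaultdict(list)
    for ip_label, fresh_session, status_code in attempts:
        if fresh_session:
            fresh_statuses[ip_label].append(status_code)
    return {ip for ip, codes in fresh_statuses.items() if all(c == 403 for c in codes)}


def ip_block_likely(attempts: list[tuple[str, bool, int]]) -> bool:
    """True when some IPs are blocked on fresh sessions while others are not."""
    all_ips = {ip for ip, fresh, _ in attempts if fresh}
    blocked = blocked_ips(attempts)
    return bool(blocked) and blocked != all_ips
```

If every network gets a 403, the function returns False on purpose: a block that follows you everywhere is more likely about headers or session flow than about any one IP.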
Fix 6: Respect geography
Some 403s are not anti-bot blocks at all. They are geo rules.
Examples:
- site available only in US / EU
- product page hidden outside a region
- country-specific consent or policy wall
Quick clues:
- same URL works in browser on VPN but not from your server
- currency / catalog differs by country
- 403 appears only on a subset of routes
If the target is geo-sensitive, use the right exit geography from the start instead of brute-forcing more retries from the wrong country.
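One cheap way to confirm a geo rule is to fetch the same URL through a few exit countries and compare the statuses. The sketch below only does the comparison; how you obtain the per-country statuses depends on your proxy setup, and the function names are hypothetical.

```python
from __future__ import annotations


def countries_allowed(status_by_country: dict[str, int]) -> list[str]:
    """Return the exit countries that received a 200, sorted by code."""
    return sorted(c for c, status in status_by_country.items() if status == 200)


def geo_block_likely(status_by_country: dict[str, int]) -> bool:
    """True when the same URL succeeds from some countries and gets 403 from others."""
    statuses = set(status_by_country.values())
    return 200 in statuses and 403 in statuses
```

A result like `countries_allowed({"US": 200, "DE": 403})` returning only `["US"]` tells you which exit geography to pin for that target.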
Fix 7: Escalate to a browser only when plain HTTP stops making sense
If the site depends on:
- JavaScript-rendered state
- challenge resolution in-browser
- dynamic session setup through navigation
then a headless browser may be the right next step.
But it should be the last escalation, not the first.
Escalation order I recommend:
- coherent headers
- session reuse
- lower concurrency + backoff
- soft-block detection
- better IP / proxy layer
- browser automation
That sequence is cheaper and easier to debug than jumping straight into Playwright for every target.
A practical diagnosis workflow
When I hit 403 on a new target, I run this checklist:
| Check | Question |
|---|---|
| raw status | is it 403, 429, or 200 with bad HTML? |
| headers | am I sending a coherent browser-like request? |
| session | did I warm the homepage and keep cookies? |
| rate | am I bursting too hard from one IP? |
| geo | am I in the right country? |
| IP evidence | does changing IP actually change the outcome? |
| browser need | is the page challenge-driven or JS-critical? |
That workflow solves the root problem faster than randomly swapping libraries.
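The checklist can also be written down as code so the order is enforced rather than remembered. This is a sketch under my own assumptions: the `Observations` fields are filled in by hand or by your own probes, and the returned strings are just labels.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Observations:
    status_code: int
    body_looks_blocked: bool
    headers_coherent: bool
    session_warmed: bool
    bursting: bool
    right_country: bool
    other_ip_succeeds: bool
    needs_javascript: bool


def next_fix(obs: Observations) -> str:
    """Walk the checklist top to bottom and return the first fix to try."""
    if obs.status_code == 429:
        return "slow down and back off"
    if obs.status_code == 200 and obs.body_looks_blocked:
        return "treat as soft block"
    if not obs.headers_coherent:
        return "fix headers"
    if not obs.session_warmed:
        return "warm the session and keep cookies"
    if obs.bursting:
        return "lower concurrency"
    if not obs.right_country:
        return "use the right exit geography"
    if obs.other_ip_succeeds:
        return "change the IP layer"
    if obs.needs_javascript:
        return "escalate to a browser"
    return "no obvious cause; capture the full response for inspection"
```

Because the function returns on the first failed check, it never suggests proxies or a browser while a cheaper cause is still open.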
Example: a safer baseline scraper
from __future__ import annotations

import os
import random
import time

import requests


def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update(
        {
            # Chrome reports a four-part version number in its User-Agent.
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        }
    )
    return s


def fetch_html(session: requests.Session, url: str) -> str:
    # The proxy is optional: with no env var set, requests go out directly.
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None

    # Small random pause so requests do not arrive in a burst.
    time.sleep(random.uniform(1.0, 2.4))
    response = session.get(url, timeout=(10, 30), proxies=proxies)

    if response.status_code in (403, 429):
        raise RuntimeError(f"blocked: {response.status_code}")
    response.raise_for_status()

    # looks_blocked() is the content-level check defined under Fix 4.
    if looks_blocked(response.text):
        raise RuntimeError("soft block detected")
    return response.text
This is not “magic 403 bypass” code. It is a calmer, more diagnosable baseline.
What not to do
Here are the common bad responses to 403:
- retrying instantly 20 times
- randomizing every header on every request
- rotating proxies before testing session reuse
- parsing challenge HTML as if it were real page content
- assuming every 403 means “need browser”
Those moves burn time, proxies, and sometimes entire IP ranges.
Final thoughts
The useful mental model is:
403 is not a single error. It is a feedback signal.
Usually the site is telling you one of four things:
- your identity looks wrong
- your session looks broken
- your traffic looks too aggressive
- your location is not acceptable
If you diagnose those in order, most 403 problems become manageable.
And when you do need better IP infrastructure, ProxiesAPI makes the transport upgrade clean because it lets you improve the fetch layer without rewriting the parser.