403 Forbidden When Scraping: Why It Happens and 7 Fixes That Work

Jun 22, 2026 · guides · #403 forbidden web scraping, #web-scraping, #anti-bot, #proxies, #python, #rate-limits

If you scrape long enough, you will run into this:

403 Client Error: Forbidden for url: https://target-site.example/...

The mistake most people make is assuming every 403 means the same thing.

It doesn’t.

A 403 Forbidden can mean:

the site does not like your IP
your request headers look wrong
you skipped cookies or session setup
you hit a geo restriction
the block is only partly visible: other requests come back as 200 OK with challenge HTML (a soft block)

That is why the right response is not “retry harder.” It is to diagnose the failure class first.

This guide breaks the problem into 7 fixes that actually work, in the order I would test them.

Treat 403 as a diagnosis problem, not a brute-force problem

Most scraping teams waste money by responding to every 403 with more retries. The better move is to identify what triggered the block, then fix the fetch pattern, session handling, or IP layer in that order.


First: separate 403 from the other common block patterns

A 403 is not the same as a 429, and it is not the same as a “200 OK but useless HTML” page.

Symptom	What it usually means	First move
`403 Forbidden`	explicit denial: IP, headers, session, geo, WAF rule	inspect request identity and session flow
`429 Too Many Requests`	you are over the rate limit	slow down, back off, reduce concurrency
`200 OK` with challenge / captcha HTML	soft block or JS challenge	detect bad content, then escalate transport
repeated redirect to login / home	missing auth or missing cookies	persist cookies and follow normal navigation

Why this matters: the fastest way to waste proxies is treating 403, 429, and challenge pages like one problem.
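As a rough illustration of the table above, the sketch below sorts a response into one of the four patterns before any retry logic runs. The function name, the marker list, and the `/login` URL check are assumptions made for this example; real targets will need their own markers.

```python
# Markers that often appear in challenge or captcha pages (illustrative, not exhaustive).
CHALLENGE_MARKERS = ("captcha", "verify you are human", "enable javascript")


def classify_response(status: int, body: str, final_url: str = "") -> str:
    """Map a response to one of the block patterns from the table."""
    if status == 403:
        return "explicit-denial"
    if status == 429:
        return "rate-limited"
    if status == 200:
        lowered = body.lower()
        if any(marker in lowered for marker in CHALLENGE_MARKERS):
            return "soft-block"
        # Assumption: the target sends unauthenticated visitors to a /login URL.
        if "/login" in final_url:
            return "auth-redirect"
        return "ok"
    return "other"
```

With `requests`, the arguments would typically be `response.status_code`, `response.text`, and `response.url`. Logging the returned label per request makes it obvious which first move from the table applies.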

Why 403 happens in scraping

In practice, most 403s come from one of five buckets:

IP reputation
Header / fingerprint mismatch
Missing cookies or broken sessions
Geo restrictions
Aggressive request patterns

The site is not really saying “you are forbidden forever.” It is usually saying “this request does not look acceptable.”

That is good news, because acceptable requests can often be rebuilt.

Fix 1: Make your headers coherent

Many beginner scrapers either:

send the default python-requests/... user agent
randomize headers into nonsense combinations

Both are bad.

The goal is not random. The goal is coherent.

Use a believable browser profile and keep related headers aligned:

import requests

session = requests.Session()
session.headers.update(
    {
        # Current Chrome builds report a four-part version with zeros in the minor fields.
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/125.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
    }
)

This will not solve every 403, but it removes the easiest reason to block you.
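One way to make "coherent" checkable is a small lint pass over the header set before it ships. This is a minimal sketch: `header_problems` is a hypothetical helper, and its three rules are illustrative assumptions rather than a description of how any particular site scores requests.

```python
def header_problems(headers: dict) -> list:
    """Return reasons a header set looks incoherent (empty list means no findings)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "").lower()
    problems = []

    if not user_agent:
        problems.append("missing User-Agent")
    elif "python-requests" in user_agent:
        problems.append("default python-requests User-Agent")

    # Browsers send these on ordinary page navigations.
    for required in ("accept", "accept-language"):
        if required not in lowered:
            problems.append(f"missing {required} header")

    # sec-ch-ua client hints are a Chromium feature; pairing them with a
    # Firefox User-Agent is the kind of mismatch random header mixing produces.
    has_client_hints = any(name.startswith("sec-ch-ua") for name in lowered)
    if "firefox" in user_agent and has_client_hints:
        problems.append("Firefox User-Agent combined with Chromium client hints")

    return problems
```

Calling `header_problems(dict(session.headers))` on a fresh `requests.Session()` should flag the default user agent, which is a quick sanity check that the profile above was actually applied.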

Fix 2: Reuse sessions instead of going in cold every time

Lots of sites expect a request journey that looks something like:

landing page
cookie set
next page
detail page

Bad scraper behavior looks like:

new TCP/TLS connection
no cookies
direct hit on the hardest endpoint
repeated fast requests

Better pattern:

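import requests


def warm_session(session: requests.Session, homepage: str) -> None:
    # Load the landing page first so the session picks up any cookies it sets.
    r = session.get(homepage, timeout=(10, 30))
    r.raise_for_status()


def fetch_detail(session: requests.Session, url: str) -> requests.Response:
    # Same session: cookies and the pooled connection carry over from the warm-up.
    return session.get(url, timeout=(10, 30))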

That tiny “warm-up” step is often enough to stop avoidable 403s on mid-friction sites.

Fix 3: Lower concurrency and add real backoff

If you send 100 requests in 2 seconds from one IP, the target may answer with 403 even if the real underlying issue is rate abuse.

Here is a practical retry pattern:

from __future__ import annotations

import random
import time
import requests

RETRYABLE = {403, 429, 500, 502, 503, 504}


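def fetch_with_backoff(session: requests.Session, url: str, tries: int = 5) -> requests.Response:
    last = None

    for attempt in range(1, tries + 1):
        response = session.get(url, timeout=(10, 30))
        last = response

        if response.status_code == 200:
            return response

        # Anything outside the retryable set is not worth repeating: surface it now.
        if response.status_code not in RETRYABLE:
            response.raise_for_status()
            return response  # non-error status such as 204: nothing to retry

        # Linear backoff capped at 45 s, plus jitter; no sleep after the final attempt.
        if attempt < tries:
            sleep_for = min(45, attempt * 4) + random.uniform(0.5, 1.8)
            time.sleep(sleep_for)

    # Compare with None: a Response carrying a 4xx/5xx status is falsy.
    status = last.status_code if last is not None else "unknown"
    raise RuntimeError(f"failed after retries, last status={status}")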

Important detail: retries should get calmer, not louder.
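Backoff handles the retries, but the "lower concurrency" half needs something that paces first attempts too. A minimal sketch, assuming one shared pacing object per target host: `Throttle` is a hypothetical helper, and the injectable `clock` and `sleep` parameters exist only to make it easy to test.

```python
import threading
import time


class Throttle:
    """Enforce a minimum gap between requests that share this object."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Reserve the next free slot, sleep until it arrives, return the delay."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Book the slot while holding the lock so threads cannot share one.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            self._sleep(delay)
        return delay
```

Calling `throttle.wait()` immediately before each `session.get(...)` inside `fetch_with_backoff` keeps the overall request rate bounded even when several worker threads share the same target.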

Fix 4: Detect soft blocks explicitly

A lot of teams only look at status codes. That misses every block that arrives with a success status.

Sometimes you get:

200 OK
but the body is a challenge page
or a “please enable JavaScript” page
or a captcha shell

Build a content-level check:

def looks_blocked(html: str) -> bool:
    text = html.lower()
    indicators = [
        "captcha",
        "access denied",
        "forbidden",
        "enable javascript",
        "cf-chl",
        "verify you are human",
    ]
    return any(token in text for token in indicators)


response = fetch_with_backoff(session, url)
if looks_blocked(response.text):
    raise RuntimeError("soft block detected despite non-403 response")

This is one of the highest-leverage improvements you can make, because it keeps bad HTML out of your parser and prevents silent data corruption.
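Keyword matching like `looks_blocked` can misfire on legitimate pages that happen to mention "captcha" or "forbidden", so it may help to pair it with a positive check that the content you expect is actually present. The sketch below is one way to do that; `page_is_usable`, the marker strings, and the 500-character floor are assumptions for illustration.

```python
def page_is_usable(html: str, required_markers: list, min_length: int = 500) -> bool:
    """Positive check: the body is long enough and contains every expected marker."""
    # Challenge shells tend to be short; the threshold is a guess to tune per site.
    if len(html) < min_length:
        return False
    return all(marker in html for marker in required_markers)
```

For a product page, `required_markers` might be a class name or element id that every real page carries. A response is then accepted only when `looks_blocked` is false and `page_is_usable` is true.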

Fix 5: Rotate IPs only when the evidence points to IP-based blocking

People jump to proxy rotation too early.

That is expensive, and it also hides upstream mistakes.

Use proxies when you see signs like:

fresh session still gets immediate 403
same code works from one network but not another
region-specific pages differ by country
bans follow the IP more than the cookie jar
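Those signs can be gathered deliberately instead of by accident. The sketch below sends the same request over several network routes and reports whether the outcome depends on the route. Both function names and the `fetch_status(url, proxies)` callable are hypothetical; the caller supplies the actual fetch, ideally with a fresh session per route so cookies do not muddy the comparison.

```python
def probe_routes(fetch_status, url: str, routes: dict) -> dict:
    """Call fetch_status(url, proxies) once per named route and collect status codes."""
    return {name: fetch_status(url, proxies) for name, proxies in routes.items()}


def ip_block_likely(statuses_by_route: dict) -> bool:
    """True when an identical request succeeds on one route and gets 403 on another."""
    codes = set(statuses_by_route.values())
    return 200 in codes and 403 in codes
```

If every route returns 403, the IP is probably not the deciding factor, and headers or session flow deserve another look before any money goes into rotation.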

This is where ProxiesAPI fits well: as a cleaner network layer after you already fixed obvious request problems.

A common pattern is:

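import os

# PROXIESAPI_PROXY_URL holds the full proxy URL; leave it unset to skip the explicit proxy.
proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
PROXIES = {"http": proxy_url, "https": proxy_url} if proxy_url else None

# Only the transport changes; headers, cookies, and parsing stay the same.
response = session.get(url, timeout=(10, 30), proxies=PROXIES)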

That keeps the parser untouched while improving transport quality.

Fix 6: Respect geography

Some 403s are not anti-bot blocks at all. They are geo rules.

Examples:

site available only in US / EU
product page hidden outside a region
country-specific consent or policy wall

Quick clues:

same URL works in browser on VPN but not from your server
currency / catalog differs by country
403 appears only on a subset of routes

If the target is geo-sensitive, use the right exit geography from the start instead of brute-forcing more retries from the wrong country.

Fix 7: Escalate to a browser only when plain HTTP stops making sense

If the site depends on:

JavaScript-rendered state
challenge resolution in-browser
dynamic session setup through navigation

then a headless browser may be the right next step.

But it should be the last escalation, not the first.

Escalation order I recommend:

coherent headers
session reuse
lower concurrency + backoff
soft-block detection
better IP / proxy layer
correct exit geography
browser automation

That sequence is cheaper and easier to debug than jumping straight into Playwright for every target.
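The ordering can be encoded directly, so the cheap options always run first. This is a sketch under stated assumptions: `fetch_with_escalation` is a hypothetical helper, each strategy is a caller-supplied `(name, fetch)` pair, and `is_good` is whatever content check you trust.

```python
def fetch_with_escalation(url: str, strategies: list, is_good) -> tuple:
    """Try each (name, fetch) pair in order; return (name, result) for the first that passes."""
    failures = []
    for name, fetch in strategies:
        try:
            result = fetch(url)
        except Exception as exc:  # broad on purpose in a sketch; narrow it in real code
            failures.append(f"{name}: {exc}")
            continue
        if is_good(result):
            return name, result
        failures.append(f"{name}: rejected by content check")
    raise RuntimeError("all strategies failed: " + "; ".join(failures))
```

Recording which strategy name succeeded for each target shows how often the expensive tiers are really needed, which is the evidence to use when deciding what to pay for.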

A practical diagnosis workflow

When I hit 403 on a new target, I run this checklist:

Check	Question
raw status	is it 403, 429, or 200 with bad HTML?
headers	am I sending a coherent browser-like request?
session	did I warm the homepage and keep cookies?
rate	am I bursting too hard from one IP?
geo	am I in the right country?
IP evidence	does changing IP actually change the outcome?
browser need	is the page challenge-driven or JS-critical?

That workflow solves the root problem faster than randomly swapping libraries.

Example: a safer baseline scraper

from __future__ import annotations

import os
import random
import time
import requests


def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update(
        {
            # Current Chrome builds report a four-part version with zeros in the minor fields.
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        }
    )
    return s


def fetch_html(session: requests.Session, url: str) -> str:
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None

    time.sleep(random.uniform(1.0, 2.4))
    response = session.get(url, timeout=(10, 30), proxies=proxies)

    if response.status_code in (403, 429):
        raise RuntimeError(f"blocked: {response.status_code}")

    response.raise_for_status()

    # looks_blocked() is the content check defined under Fix 4.
    if looks_blocked(response.text):
        raise RuntimeError("soft block detected")

    return response.text

This is not “magic 403 bypass” code. It is a calmer, more diagnosable baseline.

What not to do

Here are the common bad responses to 403:

retrying instantly 20 times
randomizing every header on every request
rotating proxies before testing session reuse
parsing challenge HTML as if it were real page content
assuming every 403 means “need browser”

Those moves burn time, proxies, and sometimes entire IP ranges.

Final thoughts

The useful mental model is:

403 is not a single error. It is a feedback signal.

Usually the site is telling you one of four things:

your identity looks wrong
your session looks broken
your traffic looks too aggressive
your location is not acceptable

If you diagnose those in order, most 403 problems become manageable.

And when you do need better IP infrastructure, ProxiesAPI makes the transport upgrade clean because it lets you improve the fetch layer without rewriting the parser.
