403 Forbidden When Scraping: Why It Happens and 7 Fixes That Work

If you scrape long enough, you will run into this:

403 Client Error: Forbidden for url: https://target-site.example/...

The mistake most people make is assuming every 403 means the same thing.

It doesn’t.

A 403 Forbidden can mean:

  • the site does not like your IP
  • your request headers look wrong
  • you skipped cookies or session setup
  • you hit a geo restriction
  • the 403 is one symptom of a wider block that shows up elsewhere as a soft block (a 200 response carrying a challenge page)

That is why the right response is not “retry harder.” It is to diagnose the failure class first.

This guide breaks the problem into 7 fixes that actually work, in the order I would test them.

Treat 403 as a diagnosis problem, not a brute-force problem

Most scraping teams waste money by responding to every 403 with more retries. The better move is to identify what triggered the block, then fix the fetch pattern, session handling, or IP layer in that order.


First: separate 403 from the other common block patterns

A 403 is not the same as a 429, and it is not the same as a “200 OK but useless HTML” page.

| Symptom | What it usually means | First move |
|---|---|---|
| 403 Forbidden | explicit denial: IP, headers, session, geo, WAF rule | inspect request identity and session flow |
| 429 Too Many Requests | you are over the rate limit | slow down, back off, reduce concurrency |
| 200 OK with challenge / captcha HTML | soft block or JS challenge | detect bad content, then escalate transport |
| repeated redirect to login / home | missing auth or missing cookies | persist cookies and follow normal navigation |

Why this matters: the fastest way to waste proxies is treating 403, 429, and challenge pages like one problem.
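One way to keep these cases apart in code is to classify every response before deciding what to do next. The sketch below is illustrative only: classify_response, the marker list, and the "/login" check are assumptions for this example, not rules that hold for every site.

```python
from __future__ import annotations

# Hypothetical marker list; tune it per target.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "enable javascript")


def classify_response(status: int, body: str, final_url: str = "") -> str:
    """Map one response to a failure class so each class gets its own handling."""
    if status == 403:
        return "forbidden"
    if status == 429:
        return "rate_limited"
    lowered = body.lower()
    if status == 200 and any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "soft_block"
    # Assumption: a login redirect ends on a URL containing "/login".
    if "/login" in final_url.lower():
        return "auth_redirect"
    return "ok" if status == 200 else "other"
```

With requests, the call would look like classify_response(r.status_code, r.text, r.url), and the scraper can branch on the returned label instead of retrying everything the same way.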


Why 403 happens in scraping

In practice, most 403s come from one of five buckets:

  1. IP reputation
  2. Header / fingerprint mismatch
  3. Missing cookies or broken sessions
  4. Geo restrictions
  5. Aggressive request patterns

The site is not really saying “you are forbidden forever.” It is usually saying “this request does not look acceptable.”

That is good news, because acceptable requests can often be rebuilt.


Fix 1: Make your headers coherent

Many beginner scrapers either:

  • send the default python-requests/... user agent
  • randomize headers into nonsense combinations

Both are bad.

The goal is not random. The goal is coherent.

Use a believable browser profile and keep related headers aligned:

import requests

session = requests.Session()
session.headers.update(
    {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            # Real Chrome reports a four-part version, so keep that shape.
            "Chrome/125.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
    }
)

This will not solve every 403, but it removes the easiest reason to block you.
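A quick self-check can catch the most obvious incoherence before a request ever leaves your machine. The sketch below is a rough lint, not a fingerprinting model: header_warnings and the UA_PLATFORM_HINTS mapping are hypothetical names, and the rules cover only a default user agent, missing Accept headers, and a User-Agent that disagrees with a Sec-CH-UA-Platform header if you send one.

```python
from __future__ import annotations

# User-Agent substring -> the Sec-CH-UA-Platform value that should accompany it
UA_PLATFORM_HINTS = {"Macintosh": "macOS", "Windows NT": "Windows", "X11; Linux": "Linux"}


def header_warnings(headers: dict[str, str]) -> list[str]:
    """Return human-readable warnings for an incoherent header set."""
    h = {name.lower(): value for name, value in headers.items()}
    warnings = []

    ua = h.get("user-agent", "")
    if not ua or ua.lower().startswith("python-requests"):
        warnings.append("User-Agent is missing or is the library default")

    for required in ("accept", "accept-language"):
        if required not in h:
            warnings.append(f"{required} header is missing")

    # Client-hint values arrive quoted, e.g. "macOS" including the quotes.
    platform = h.get("sec-ch-ua-platform", "").strip('"')
    if platform:
        for ua_token, expected in UA_PLATFORM_HINTS.items():
            if ua_token in ua and platform != expected:
                warnings.append(f"User-Agent says {expected} but Sec-CH-UA-Platform says {platform}")
    return warnings
```

Running header_warnings(dict(session.headers)) once at startup is cheap, and an empty list means only that the easy mistakes are absent, not that the profile is indistinguishable from a browser.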


Fix 2: Reuse sessions instead of going in cold every time

Lots of sites expect a request journey that looks something like:

  1. landing page
  2. cookie set
  3. next page
  4. detail page

Bad scraper behavior looks like:

  • new TCP/TLS connection
  • no cookies
  • direct hit on the hardest endpoint
  • repeated fast requests

Better pattern:

def warm_session(session: requests.Session, homepage: str) -> None:
    r = session.get(homepage, timeout=(10, 30))
    r.raise_for_status()


def fetch_detail(session: requests.Session, url: str) -> requests.Response:
    return session.get(url, timeout=(10, 30))

That tiny “warm-up” step is often enough to stop avoidable 403s on mid-friction sites.
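If the scraper restarts often, the warm-up is lost each time. One possible approach, sketched here under the assumption that simple name/value cookies are enough for the target, is to persist them between runs. save_cookies and load_cookies are hypothetical helpers, and the JSON file drops cookie domain, path, and expiry.

```python
from __future__ import annotations

import json
from pathlib import Path


def save_cookies(cookies: dict[str, str], path: Path) -> None:
    """Write name/value cookie pairs to disk as JSON."""
    path.write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(path: Path) -> dict[str, str]:
    """Read cookies saved earlier; return an empty dict on the first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

With requests, the dict to save can come from requests.utils.dict_from_cookiejar(session.cookies), and a loaded dict can be applied with session.cookies.update(loaded). If the stored cookies have expired, the site will simply set new ones on the next warm-up request.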


Fix 3: Lower concurrency and add real backoff

If you send 100 requests in 2 seconds from one IP, the target may answer with 403 even if the real underlying issue is rate abuse.

Here is a practical retry pattern:

from __future__ import annotations

import random
import time
import requests

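# 403 is retried here only because some sites answer rate abuse with 403.
RETRYABLE = {403, 429, 500, 502, 503, 504}


def fetch_with_backoff(session: requests.Session, url: str, tries: int = 5) -> requests.Response:
    last = None

    for attempt in range(1, tries + 1):
        response = session.get(url, timeout=(10, 30))
        last = response

        if response.status_code == 200:
            return response

        if response.status_code not in RETRYABLE:
            # Non-retryable errors (404, 410, ...) fail immediately.
            response.raise_for_status()

        if attempt < tries:
            # Linear backoff capped at 45s, plus jitter; no sleep after the final attempt.
            sleep_for = min(45, attempt * 4) + random.uniform(0.5, 1.8)
            time.sleep(sleep_for)

    # Compare against None: a Response with a 4xx/5xx status is falsy.
    status = last.status_code if last is not None else "unknown"
    raise RuntimeError(f"failed after retries, last status={status}")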

Important detail: retries should get calmer, not louder.
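Backoff covers what happens after a failure; the other half of this fix is pacing requests so fewer failures occur in the first place. A minimal sketch of that pacing, assuming a single worker per target host, is below. Throttle is a hypothetical helper, and the clock and sleep parameters exist only so the behavior can be tested without real waiting.

```python
from __future__ import annotations

import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous request was released

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling throttle.wait() immediately before each session.get(...) keeps one worker at or below one request per min_interval seconds. With several workers, each would need to share the same Throttle behind a lock, which this sketch does not include.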


Fix 4: Detect soft blocks explicitly

A lot of teams only look at status codes. That misses every block that arrives with a success status.

Sometimes you get:

  • 200 OK
  • but the body is a challenge page
  • or a “please enable JavaScript” page
  • or a captcha shell

Build a content-level check:

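def looks_blocked(html: str) -> bool:
    """Heuristic check for challenge pages; tune the list per target."""
    text = html.lower()
    # Broad tokens such as "captcha" or "forbidden" can also appear on
    # legitimate pages, so expect some false positives.
    indicators = [
        "captcha",
        "access denied",
        "forbidden",
        "enable javascript",
        "cf-chl",
        "verify you are human",
    ]
    return any(token in text for token in indicators)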


response = fetch_with_backoff(session, url)
if looks_blocked(response.text):
    raise RuntimeError("soft block detected despite non-403 response")

This is one of the highest-leverage improvements you can make, because it keeps bad HTML out of your parser and prevents silent data corruption.
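A negative check like this can be paired with a positive one: confirm that the page contains something every real page on the target should have. The sketch below assumes you know at least one such marker; missing_markers and the example marker strings are hypothetical.

```python
from __future__ import annotations


def missing_markers(html: str, required: list[str]) -> list[str]:
    """Return the required markers that are absent from the page."""
    text = html.lower()
    return [marker for marker in required if marker.lower() not in text]


# Example: a product page is only trusted if both markers are present.
page = '<html><h1 class="product-title">Widget</h1><span class="price">$9</span></html>'
absent = missing_markers(page, ['class="product-title"', 'class="price"'])
if absent:
    raise RuntimeError(f"page rejected, missing markers: {absent}")
```

The positive check catches block pages whose wording you have not seen yet, at the cost of needing a marker per page type.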


Fix 5: Rotate IPs only when the evidence points to IP-based blocking

People jump to proxy rotation too early.

That is expensive, and it also hides upstream mistakes.

Use proxies when you see signs like:

  • fresh session still gets immediate 403
  • same code works from one network but not another
  • region-specific pages differ by country
  • bans follow the IP more than the cookie jar

This is where ProxiesAPI fits well: as a cleaner network layer after you already fixed obvious request problems.

A common pattern is:

import os

proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
PROXIES = {"http": proxy_url, "https": proxy_url} if proxy_url else None

response = session.get(url, timeout=(10, 30), proxies=PROXIES)

That keeps the parser untouched while improving transport quality.
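Before paying for rotation, it can help to test whether the IP is really the variable that matters. The sketch below compares the same URL over a direct and a proxied route. ip_changes_outcome is a hypothetical helper, fetch_status is a callable you supply, and the "all 403 versus all 200" rule is a deliberately strict assumption.

```python
from __future__ import annotations

from typing import Callable


def ip_changes_outcome(
    fetch_status: Callable[[str, bool], int],
    url: str,
    rounds: int = 3,
) -> bool:
    """Return True when the direct route is always blocked but the proxied route never is.

    fetch_status(url, use_proxy) must return the HTTP status code for one
    fresh-session request, sent through the proxy when use_proxy is True.
    """
    direct = [fetch_status(url, False) for _ in range(rounds)]
    proxied = [fetch_status(url, True) for _ in range(rounds)]
    direct_blocked = all(code == 403 for code in direct)
    proxied_ok = all(code == 200 for code in proxied)
    return direct_blocked and proxied_ok
```

In practice fetch_status would build a new requests.Session each time and pass proxies=PROXIES only when use_proxy is True. A False result suggests the block follows headers, cookies, or request rate rather than the IP.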


Fix 6: Respect geography

Some 403s are not anti-bot blocks at all. They are geo rules.

Examples:

  • site available only in US / EU
  • product page hidden outside a region
  • country-specific consent or policy wall

Quick clues:

  • same URL works in browser on VPN but not from your server
  • currency / catalog differs by country
  • 403 appears only on a subset of routes

If the target is geo-sensitive, use the right exit geography from the start instead of brute-forcing more retries from the wrong country.
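One simple way to make exit geography an explicit setting is to key the proxy configuration by country. This is a sketch under assumptions: proxies_for_country is a hypothetical helper, and the PROXY_URL_<COUNTRY> environment variable naming is invented for the example, not something any provider defines.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def proxies_for_country(
    country: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[dict[str, str]]:
    """Build a requests-style proxies dict for one exit country, or None if unset."""
    # Hypothetical naming scheme: PROXY_URL_US, PROXY_URL_DE, ...
    proxy_url = env.get(f"PROXY_URL_{country.upper()}")
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}
```

The result can be passed straight to session.get(url, proxies=...). Returning None when the variable is missing makes a misconfigured country fall back to a direct request, so you may prefer to raise instead if a wrong-country fetch would pollute your data.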


Fix 7: Escalate to a browser only when plain HTTP stops making sense

If the site depends on:

  • JavaScript-rendered state
  • challenge resolution in-browser
  • dynamic session setup through navigation

then a headless browser may be the right next step.

But it should be the last escalation, not the first.

Escalation order I recommend:

  1. coherent headers
  2. session reuse
  3. lower concurrency + backoff
  4. soft-block detection
  5. better IP / proxy layer
  6. correct exit geography for geo-sensitive targets
  7. browser automation

That sequence is cheaper and easier to debug than jumping straight into Playwright for every target.
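The escalation order can be made explicit in code as a ladder of strategies tried cheapest-first. The sketch below is one possible shape, not a prescribed design: fetch_with_escalation and the Strategy type are hypothetical, and each strategy is assumed to return HTML on success or None when it was blocked.

```python
from __future__ import annotations

from typing import Callable, Optional

# Each strategy takes a URL and returns HTML, or None when it was blocked.
Strategy = Callable[[str], Optional[str]]


def fetch_with_escalation(url: str, ladder: list[tuple[str, Strategy]]) -> tuple[str, str]:
    """Try strategies in order; return (strategy_name, html) for the first success."""
    for name, strategy in ladder:
        html = strategy(url)
        if html is not None:
            return name, html
    raise RuntimeError(f"all {len(ladder)} strategies were blocked for {url}")
```

A ladder might be [("plain", plain_fetch), ("proxied", proxied_fetch), ("browser", browser_fetch)], where those three functions are yours to write. Logging the returned strategy name per target shows how often the expensive rungs are actually needed.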


A practical diagnosis workflow

When I hit 403 on a new target, I run this checklist:

| Check | Question |
|---|---|
| raw status | is it 403, 429, or 200 with bad HTML? |
| headers | am I sending a coherent browser-like request? |
| session | did I warm the homepage and keep cookies? |
| rate | am I bursting too hard from one IP? |
| geo | am I in the right country? |
| IP evidence | does changing IP actually change the outcome? |
| browser need | is the page challenge-driven or JS-critical? |

That workflow solves the root problem faster than randomly swapping libraries.


Example: a safer baseline scraper

from __future__ import annotations

import os
import random
import time
import requests


def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update(
        {
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                # Real Chrome reports a four-part version, so keep that shape.
                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        }
    )
    return s


def fetch_html(session: requests.Session, url: str) -> str:
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None

    time.sleep(random.uniform(1.0, 2.4))
    response = session.get(url, timeout=(10, 30), proxies=proxies)

    if response.status_code in (403, 429):
        raise RuntimeError(f"blocked: {response.status_code}")

    response.raise_for_status()

    # looks_blocked() is the content-level check defined under Fix 4.
    if looks_blocked(response.text):
        raise RuntimeError("soft block detected")

    return response.text

This is not “magic 403 bypass” code. It is a calmer, more diagnosable baseline.


What not to do

Here are the common bad responses to 403:

  • retrying instantly 20 times
  • randomizing every header on every request
  • rotating proxies before testing session reuse
  • parsing challenge HTML as if it were real page content
  • assuming every 403 means “need browser”

Those moves burn time, proxies, and sometimes entire IP ranges.


Final thoughts

The useful mental model is:

403 is not a single error. It is a feedback signal.

Usually the site is telling you one of four things:

  • your identity looks wrong
  • your session looks broken
  • your traffic looks too aggressive
  • your location is not acceptable

If you diagnose those in order, most 403 problems become manageable.

And when you do need better IP infrastructure, ProxiesAPI makes the transport upgrade clean because it lets you improve the fetch layer without rewriting the parser.
