HTTP Headers for Web Scraping: User-Agent, Accept-Language, and Beyond

People often treat request headers like a superstition.

They copy a giant blob from Chrome DevTools, paste it into requests, and hope it magically stops blocks.

Sometimes that works. Usually it is unnecessary.

For most scrapers, only a small set of headers meaningfully changes outcomes:

  • User-Agent
  • Accept-Language
  • Accept
  • Referer in a few flows
  • occasionally Cookie when you are continuing a real session

Everything else is situational.

This guide focuses on the headers that actually matter, how to set sane defaults, and when header tuning is worth your time.


The short version

Here is the practical ranking.

Header                   | Matters often?      | Why
User-Agent               | Yes                 | Tells the server what client you claim to be
Accept-Language          | Yes                 | Helps align locale with browser identity
Accept                   | Yes                 | Signals expected content type
Referer                  | Sometimes           | Some flows expect navigation context
Cookie                   | Sometimes           | Required when continuing an existing session
Accept-Encoding          | Rarely by hand      | requests handles this well already
Cache-Control / Pragma   | Rarely              | Usually not the reason you get blocked
Sec-Fetch-* / sec-ch-ua* | Mostly browser-only | Hard to fake consistently with plain requests

The big mistake is assuming headers can compensate for everything else.

They cannot.

If your IP is burned or your request rate is absurd, perfect headers will not rescue you.


1. User-Agent: still the first header to fix

The default python-requests/x.y.z user agent is an immediate tell.

Use a modern browser UA unless you have a reason not to.

UA_CHROME_WINDOWS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)

Why this matters:

  • some sites explicitly downgrade or block obvious script clients
  • many anti-bot systems score default library identities as suspicious
  • consistent browser-like traffic is easier to blend with

What not to do:

  • rotate to a random UA every request for no reason
  • claim a mobile Safari UA while behaving like a desktop scraper
  • use ancient browser versions that no normal user would run

Session consistency beats chaos.
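
One way to get that consistency with requests is to set the User-Agent once on a Session, so every request made through it carries the same identity. This is a minimal sketch: make_session is a hypothetical helper name, the example.com URLs are placeholders, and prepare_request is used only to inspect the outgoing headers without sending anything.

```python
import requests

UA_CHROME_WINDOWS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)


def make_session(user_agent: str = UA_CHROME_WINDOWS) -> requests.Session:
    """Return a Session that sends the same User-Agent on every request.

    make_session is a hypothetical helper, not part of requests.
    """
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    return session


session = make_session()

# Build (but do not send) two requests to inspect the outgoing headers.
first = session.prepare_request(requests.Request("GET", "https://example.com/a"))
second = session.prepare_request(requests.Request("GET", "https://example.com/b"))

# The library default that a bare requests call would send instead.
print(requests.utils.default_user_agent())
print(first.headers["User-Agent"] == second.headers["User-Agent"])
```

Setting the header on the session rather than per call means there is one place to change it, and no code path can forget it.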


2. Accept-Language: small header, real signal

This header is underrated.

It tells the server what languages you prefer, and it often affects:

  • page language
  • geolocation assumptions
  • whether your request feels browser-like

A safe default:

"Accept-Language": "en-US,en;q=0.9"

This matters most when it matches the rest of your identity:

  • US-style UA
  • US-ish locale choices
  • US-targeted content collection

If you scrape French or German sites, use a locale that matches the workflow instead of blindly sending en-US.
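
If you target several locales, a small helper can build the header from an ordered list of language tags. This is a sketch under assumptions: build_accept_language is a hypothetical name, and the 0.1 step between q-values is only a common browser-like pattern, not a requirement.

```python
def build_accept_language(tags: list[str]) -> str:
    """Join language tags in preference order, lowering q by 0.1 per step.

    Hypothetical helper: the step size is an assumption, chosen to
    resemble what browsers commonly send.
    """
    if not tags:
        raise ValueError("at least one language tag is required")
    parts = []
    for position, tag in enumerate(tags):
        if position == 0:
            parts.append(tag)  # the first choice carries an implicit q=1
        else:
            q = max(0.1, 1.0 - 0.1 * position)
            parts.append(f"{tag};q={q:.1f}")
    return ",".join(parts)


print(build_accept_language(["en-US", "en"]))
print(build_accept_language(["fr-FR", "fr", "en"]))
print(build_accept_language(["de-DE", "de", "en"]))
```

The first call reproduces the safe default above. The other two show what a French or German workflow might send instead.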


3. Accept: keep it normal

For HTML scraping, a realistic Accept header helps.

"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"

Why?

  • it reflects a browser asking for HTML first
  • it avoids the bare Accept: */* that requests sends by default, which tells the server you will take anything

This is not usually the difference between success and failure by itself, but it is part of a believable request profile.
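
To see what a server can read from that header, here is a rough sketch that splits an Accept value into media types and q-values. parse_accept is a hypothetical helper and deliberately simplified: it ignores parameters other than q and does no validation.

```python
BROWSER_ACCEPT = (
    "text/html,application/xhtml+xml,application/xml;q=0.9,"
    "image/avif,image/webp,*/*;q=0.8"
)


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs sorted from most to least preferred."""
    entries = []
    for item in header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # q defaults to 1 when it is omitted
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((media_type, q))
    # The sort is stable, so entries with equal q keep their original order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


for media_type, q in parse_accept(BROWSER_ACCEPT):
    print(f"{q:.1f}  {media_type}")

# The bare default asks for anything at full preference.
print(parse_accept("*/*"))
```

The browser-style value ranks text/html first and leaves the wildcard as a fallback. The bare default expresses no preference at all.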


4. Referer: useful when flows have context

Many simple scrapers do not need Referer.

But it can help when:

  • you move from a search page to a detail page
  • the site expects internal navigation
  • the site behaves differently for deep links

Example:

headers["Referer"] = "https://example.com/search?q=laptops"

Do not invent nonsense referers. Use the page that a human would realistically come from.
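
For a search-to-detail flow, that usually means passing the search URL you just fetched as the Referer for the detail page. A minimal sketch follows, assuming a hypothetical helper prepare_detail_request and a placeholder detail URL. It only builds the request so the headers can be inspected offline.

```python
import requests


def prepare_detail_request(
    session: requests.Session, search_url: str, detail_url: str
) -> requests.PreparedRequest:
    """Build a GET for detail_url that names search_url as the previous page.

    Hypothetical helper: nothing is sent over the network here.
    """
    request = requests.Request("GET", detail_url, headers={"Referer": search_url})
    return session.prepare_request(request)


session = requests.Session()
search_url = "https://example.com/search?q=laptops"
detail_url = "https://example.com/products/123"  # placeholder detail page

prepared = prepare_detail_request(session, search_url, detail_url)
print(prepared.url)
print(prepared.headers["Referer"])

# To actually send it:
# response = session.send(prepared, timeout=(10, 30))
```

In everyday code the shorter form session.get(detail_url, headers={"Referer": search_url}, timeout=...) does the same thing in one call.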


5. Cookies: only when you mean it

Cookies are powerful because they represent real session state.

They also create headaches if you do not manage them carefully.

Use them when:

  • you are continuing an existing browsing session
  • the site sets pagination or locale state in cookies
  • you already proved the target needs them

Avoid copying stale cookies into every request forever. That often creates brittle scrapers that break mysteriously later.

With requests, a session object handles most cookie persistence for you.
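
The sketch below shows that persistence without touching the network. In a real crawl the server sets cookies through Set-Cookie on a response. Here one is placed in the jar by hand, with a made-up name and value, so the effect on later requests is visible.

```python
import requests

session = requests.Session()

# Stand-in for a cookie a real response would have set (made-up values).
session.cookies.set("locale", "fr-FR", domain="example.com", path="/")

# Any later request to the same host through this session carries it.
same_host = session.prepare_request(
    requests.Request("GET", "https://example.com/catalog?page=2")
)
print(same_host.headers.get("Cookie"))

# A request to a different host does not.
other_host = session.prepare_request(
    requests.Request("GET", "https://example.org/")
)
print(other_host.headers.get("Cookie"))
```

The jar scopes cookies by domain and path, which is one reason to prefer it over pasting a raw Cookie header into every request.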


The headers people obsess over too much

Accept-Encoding

Usually not worth setting manually. requests negotiates compressed responses fine.
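
You can check what requests already sends before deciding to override anything. A short sketch that prints the library defaults; the exact Accept-Encoding value depends on which optional compression packages are installed.

```python
import requests

# The headers requests attaches when you set nothing yourself.
defaults = requests.utils.default_headers()
for name, value in defaults.items():
    print(f"{name}: {value}")

# A new Session starts from the same defaults.
print(requests.Session().headers["Accept-Encoding"])
```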

Sec-Fetch-*

These are real browser headers, but plain requests is not a browser. Sending a hand-crafted Sec-Fetch-Site without the rest of the browser stack can create more inconsistency than it solves.

sec-ch-ua*

Same story. These client hints make more sense in browser automation than in plain HTTP scraping.

If you are using requests, do not try to impersonate full Chromium internals one header at a time.
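
If you do start from a header set copied out of a browser, one option is to filter out the browser-only fields before handing the rest to requests. A minimal sketch: drop_browser_only and the prefix list are assumptions for illustration, and the copied values are made up.

```python
# Prefixes treated as browser-only in this sketch (an assumption, not a standard list).
BROWSER_ONLY_PREFIXES = ("sec-fetch-", "sec-ch-ua")


def drop_browser_only(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers without browser-only and HTTP/2 pseudo-headers."""
    kept = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered.startswith(":"):
            continue  # HTTP/2 pseudo-headers such as :authority, shown by DevTools
        if lowered.startswith(BROWSER_ONLY_PREFIXES):
            continue  # Sec-Fetch-* and sec-ch-ua* client hints
        kept[name] = value
    return kept


copied_from_devtools = {
    ":authority": "example.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept-Language": "en-US,en;q=0.9",
    "sec-ch-ua": '"Chromium";v="125"',
    "sec-ch-ua-mobile": "?0",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "navigate",
    "Referer": "https://example.com/search?q=laptops",
}

print(list(drop_browser_only(copied_from_devtools)))
```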


Safe defaults for Python requests

This is a good baseline for many HTML targets.

import os
import random
import requests
from urllib.parse import urlencode

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 30)

USER_AGENTS = [
    (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
]

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "").strip()


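# Pick one User-Agent per process so the identity stays stable across
# requests instead of changing on every call.
DEFAULT_USER_AGENT = random.choice(USER_AGENTS)


def build_headers(referer: str | None = None) -> dict:
    headers = {
        "User-Agent": DEFAULT_USER_AGENT,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
    # Only send a Referer when the caller supplies a realistic one.
    if referer:
        headers["Referer"] = referer
    return headers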


def maybe_proxy(url: str) -> str:
    if not PROXIESAPI_KEY:
        return url
    return "https://api.proxiesapi.com/?" + urlencode({
        "auth_key": PROXIESAPI_KEY,
        "url": url,
    })


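def fetch_html(url: str, referer: str | None = None, session: requests.Session | None = None) -> str:
    # Pass a shared Session to reuse cookies and connections across calls;
    # without one, each call starts from a fresh session with no cookies.
    s = session or requests.Session()
    r = s.get(
        maybe_proxy(url),
        headers=build_headers(referer=referer),
        timeout=TIMEOUT,
    )
    # Raise on 4xx/5xx so blocks surface as errors instead of empty pages.
    r.raise_for_status()
    return r.text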

This is intentionally boring.

That is the point.


When headers are enough, and when they are not

Use this decision table.

Situation                                         | Headers alone enough? | What to do
Public HTML site, low request volume              | Usually yes           | Good UA + locale + timeouts
Getting blocked only because of python-requests UA | Often                 | Fix UA and keep sessions
Multi-step session with cookies                   | Sometimes             | Use requests.Session() and real referers
JavaScript-rendered site with bot checks          | Rarely                | Use a browser stack
Failing after many requests from one IP           | No                    | Improve rate limits and proxy layer

Headers are identity hints, not a complete disguise.

The more your target behaves like a browser application rather than a plain website, the less plain header spoofing can do on its own.


Common header mistakes

Mistake 1: copying every header from DevTools

That blob often includes browser-specific fields that do not make sense for requests.

Mistake 2: rotating everything on every request

If your UA, language, and referer change constantly, you stop looking like a person and start looking like a broken traffic generator.
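
If you do want some variety, one approach is to pick an identity once per session and keep it for that session's lifetime. The sketch below assumes a hypothetical IdentityPool class and two example profiles. Rotation happens between sessions, never between requests.

```python
import random

# Example profiles; each one is internally consistent.
PROFILES = [
    {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/125.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/125.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    },
]


class IdentityPool:
    """Assign one header profile per session key and keep returning it.

    Hypothetical helper for illustration only.
    """

    def __init__(self, profiles, rng=None):
        self._profiles = list(profiles)
        self._rng = rng if rng is not None else random.Random()
        self._assigned = {}

    def headers_for(self, session_key: str) -> dict:
        # Choose on first use, then reuse the same profile for this key.
        if session_key not in self._assigned:
            self._assigned[session_key] = self._rng.choice(self._profiles)
        # Return a copy so callers cannot mutate the stored profile.
        return dict(self._assigned[session_key])


pool = IdentityPool(PROFILES)
first = pool.headers_for("crawl-a")
again = pool.headers_for("crawl-a")
print(first == again)
```

The session key can be whatever defines a "visitor" in your crawl, for example one key per target site or per proxy exit.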

Mistake 3: ignoring consistency

If you send:

  • Japanese Accept-Language
  • Windows Chrome UA
  • EU proxy IP
  • US-only product URLs

...that can be fine, but it is worth noticing the identity mismatch.
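
A quick sanity check can surface that kind of mismatch before a crawl starts. This is only a sketch: find_identity_mismatches is a hypothetical helper, the country codes are supplied by you, and the rules are rough heuristics, not how any particular anti-bot system scores traffic.

```python
def find_identity_mismatches(
    accept_language: str, proxy_country: str, target_country: str
) -> list[str]:
    """Return human-readable warnings for obvious locale mismatches."""
    warnings = []
    target = target_country.upper()

    # Take the most preferred tag, e.g. "ja-JP" from "ja-JP,ja;q=0.9".
    primary_tag = accept_language.split(",")[0].split(";")[0].strip()
    # "en-US" has region "US"; a bare "en" has no region to compare.
    region = primary_tag.split("-")[1].upper() if "-" in primary_tag else None

    if region and region != target:
        warnings.append(
            f"Accept-Language region {region} differs from target {target}"
        )
    if proxy_country.upper() != target:
        warnings.append(
            f"proxy country {proxy_country.upper()} differs from target {target}"
        )
    return warnings


for warning in find_identity_mismatches("ja-JP,ja;q=0.9", "DE", "US"):
    print("warning:", warning)

print(find_identity_mismatches("en-US,en;q=0.9", "US", "US"))
```

A warning is a prompt to look, not proof of a problem. Plenty of real users browse foreign sites from abroad.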

Mistake 4: blaming headers for rate-limit problems

Many block issues are volume problems wearing a header-shaped disguise.
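
When volume is the real cause, pacing is the first thing to try. Below is a minimal sketch of a fixed-interval throttle. Throttle is a hypothetical class, the interval is an arbitrary example, and real crawlers often add jitter or per-host limits on top.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None    # time the previous request was allowed through

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(min_interval=0.05)  # tiny interval so the demo runs quickly
for page in range(3):
    slept = throttle.wait()
    # A real crawler would make its request here.
    print(f"page {page}: waited {slept:.3f}s")
```

Call wait() immediately before each request. If blocks stop once requests are spaced out, headers were not the problem.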


Scraper type                  | Recommended header strategy
Simple article / docs scraper | Stable desktop UA + Accept-Language + normal Accept
Search-to-detail crawler      | Same as above, plus realistic Referer
Session-based workflow        | requests.Session() with persistent cookies
Browser automation            | Let the browser send most headers natively

The more browser-like your tool is, the less you should manually fake browser-only headers.


Final takeaway

If you remember only one thing, make it this:

A small set of consistent headers beats a giant copied header blob.

For most scrapers, the winning setup is:

  • a modern User-Agent
  • a matching Accept-Language
  • a realistic Accept
  • Referer only when it makes sense
  • persistent cookies only when the workflow needs them

That gives you clean, maintainable requests.

Then, if your crawl still struggles, fix the next layer:

  • pacing
  • session state
  • browser rendering
  • IP quality

Headers matter. They just matter most when they are part of a sane overall scraper, not a cargo-cult paste from DevTools.
