HTTP Headers for Web Scraping: User-Agent, Accept-Language, and Beyond
People often treat request headers like a superstition.
They copy a giant blob from Chrome DevTools, paste it into requests, and hope it magically stops blocks.
Sometimes that works. Usually it is unnecessary.
For most scrapers, only a small set of headers meaningfully changes outcomes:
- User-Agent
- Accept-Language
- Accept
- Referer in a few flows
- occasionally, Cookie when you are continuing a real session
Everything else is situational.
This guide focuses on the headers that actually matter, how to set sane defaults, and when header tuning is worth your time.
The short version
Here is the practical ranking.
| Header | Matters often? | Why |
|---|---|---|
| User-Agent | Yes | Tells the server what client you claim to be |
| Accept-Language | Yes | Helps align locale with browser identity |
| Accept | Yes | Signals expected content type |
| Referer | Sometimes | Some flows expect navigation context |
| Cookie | Sometimes | Required when continuing an existing session |
| Accept-Encoding | Rarely by hand | requests handles this well already |
| Cache-Control / Pragma | Rarely | Usually not the reason you get blocked |
| Sec-Fetch-* / sec-ch-ua* | Mostly browser-only | Hard to fake consistently with plain requests |
The big mistake is assuming headers can compensate for everything else.
They cannot.
If your IP is burned or your request rate is absurd, perfect headers will not rescue you.
1. User-Agent: still the first header to fix
The default python-requests/x.y.z user agent is an immediate tell.
Use a modern browser UA unless you have a reason not to.
UA_CHROME_WINDOWS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)
Why this matters:
- some sites explicitly downgrade or block obvious script clients
- many anti-bot systems score default library identities as suspicious
- consistent, browser-like requests blend in more easily with normal traffic
What not to do:
- rotate to a random UA every request for no reason
- claim a mobile Safari UA while behaving like a desktop scraper
- use ancient browser versions that no normal user would run
Session consistency beats chaos.
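One way to keep the identity stable is to set the User-Agent once on a session instead of on every request. This is a minimal sketch that assumes the requests library; make_session is a hypothetical helper name, not something requests provides.

```python
import requests

# Example Chrome-on-Windows UA string; update the version as browsers move on.
UA_CHROME_WINDOWS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)


def make_session(user_agent: str = UA_CHROME_WINDOWS) -> requests.Session:
    """Hypothetical helper: a session that sends one fixed User-Agent."""
    session = requests.Session()
    # Session-level headers are merged into every request made through it.
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()
print(requests.utils.default_user_agent())  # the library default, python-requests/<version>
print(session.headers["User-Agent"])        # the browser-like UA set above
```

Every request made through that session then carries the same UA, which is the consistency this section argues for.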
2. Accept-Language: small header, real signal
This header is underrated.
It tells the server what languages you prefer, and it often affects:
- page language
- geolocation assumptions
- whether your request feels browser-like
A safe default:
"Accept-Language": "en-US,en;q=0.9"
This matters most when it matches the rest of your identity:
- US-style UA
- US-ish locale choices
- US-targeted content collection
If you scrape French or German sites, use a locale that matches the workflow instead of blindly sending en-US.
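If you switch locales per target, a small helper can build the header value from a list of language tags. The function name and the 0.1 step between q-values are assumptions chosen for this sketch; they mimic the common browser pattern rather than any required format.

```python
def build_accept_language(tags: list[str]) -> str:
    """Join language tags into an Accept-Language value with descending q-values."""
    if not tags:
        raise ValueError("at least one language tag is required")
    parts = [tags[0]]  # the first tag carries the implicit q=1.0
    for position, tag in enumerate(tags[1:], start=1):
        tenths = max(10 - position, 1)  # 0.9, 0.8, ... floored at 0.1
        parts.append(f"{tag};q=0.{tenths}")
    return ",".join(parts)


print(build_accept_language(["en-US", "en"]))        # en-US,en;q=0.9
print(build_accept_language(["fr-FR", "fr", "en"]))  # fr-FR,fr;q=0.9,en;q=0.8
```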
3. Accept: keep it normal
For HTML scraping, a realistic Accept header helps.
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
Why?
- it reflects a browser asking for HTML first
- it avoids the "I will take anything" signal of the bare */* that requests sends by default
This is not usually the difference between success and failure by itself, but it is part of a believable request profile.
4. Referer: useful when flows have context
Many simple scrapers do not need Referer.
But it can help when:
- you move from a search page to a detail page
- the site expects internal navigation
- the site behaves differently for deep links
Example:
headers["Referer"] = "https://example.com/search?q=laptops"
Do not invent nonsense referers. Use the page that a human would realistically come from.
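For a search-to-detail flow, one simple approach is to let each request carry the previously visited URL as its Referer. This is a sketch under that assumption; plan_requests is a hypothetical helper and the product URL is made up.

```python
def plan_requests(urls, base_headers):
    """Yield (url, headers) pairs where each Referer is the previously visited URL."""
    previous = None
    for url in urls:
        headers = dict(base_headers)  # copy so the shared base is never mutated
        if previous is not None:
            headers["Referer"] = previous
        yield url, headers
        previous = url


path = [
    "https://example.com/search?q=laptops",
    "https://example.com/products/123",  # hypothetical detail page
]
for url, headers in plan_requests(path, {"Accept-Language": "en-US,en;q=0.9"}):
    print(url, headers.get("Referer"))
```

The first request goes out without a Referer, as it would for a user who typed the URL or used a bookmark.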
5. Cookies: only when you mean it
Cookies are powerful because they represent real session state.
They also create headaches if you do not manage them carefully.
Use them when:
- you are continuing an existing browsing session
- the site sets pagination or locale state in cookies
- you already proved the target needs them
Avoid copying stale cookies into every request forever. That often creates brittle scrapers that break mysteriously later.
With requests, a session object handles most cookie persistence for you.
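As a rough illustration of that persistence, the sketch below puts a cookie in a session's jar by hand and then prepares (without sending) a follow-up request. In a real crawl the site would set the cookie through a Set-Cookie response header; the cookie name, value, and URL here are made up.

```python
import requests

session = requests.Session()

# Stand-in for a cookie an earlier response would normally have set.
session.cookies.set("locale", "en-US", domain="example.com", path="/")

# Prepare, but do not send, a follow-up request to see what the session attaches.
request = requests.Request("GET", "https://example.com/products?page=2")
prepared = session.prepare_request(request)
print(prepared.headers.get("Cookie"))  # expected to include locale=en-US
```

The point is that you reuse the session object and let it carry state, instead of pasting a Cookie header into every call.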
The headers people obsess over too much
Accept-Encoding
Usually not worth setting manually. requests negotiates compressed responses fine.
Sec-Fetch-*
These are real browser headers, but plain requests is not a browser. Sending a hand-crafted Sec-Fetch-Site without the rest of the browser stack can create more inconsistency than it solves.
sec-ch-ua*
Same story. These client hints make more sense in browser automation than in plain HTTP scraping.
If you are using requests, do not try to impersonate full Chromium internals one header at a time.
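If you do start from a DevTools copy, a small filter can drop the fields a real browser generates for itself. The prefix list below is a judgment call made for this sketch, not an official classification, and drop_browser_only is a hypothetical helper.

```python
# Header name prefixes that only make sense when a real browser sends them.
# ":" covers the HTTP/2 pseudo-headers (":authority", ":path") DevTools shows.
BROWSER_ONLY_PREFIXES = ("sec-fetch-", "sec-ch-ua", ":")


def drop_browser_only(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers without browser-generated fields."""
    return {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(BROWSER_ONLY_PREFIXES)
    }


copied = {
    ":authority": "example.com",
    "User-Agent": "Mozilla/5.0 (example)",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Site": "same-origin",
    "sec-ch-ua-platform": '"Windows"',
}
print(sorted(drop_browser_only(copied)))  # ['Accept-Language', 'User-Agent']
```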
Safe defaults for Python requests
This is a good baseline for many HTML targets.
import os
import random
import requests
from urllib.parse import urlencode

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 30)

USER_AGENTS = [
    (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
]

# Pick one UA per process so the identity stays stable across requests
# instead of changing on every call.
USER_AGENT = random.choice(USER_AGENTS)

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "").strip()


def build_headers(referer: str | None = None) -> dict:
    """Return a small, browser-like header set for HTML requests."""
    headers = {
        "User-Agent": USER_AGENT,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
    if referer:
        headers["Referer"] = referer
    return headers


def maybe_proxy(url: str) -> str:
    """Route the URL through ProxiesAPI when a key is configured."""
    if not PROXIESAPI_KEY:
        return url
    return "https://api.proxiesapi.com/?" + urlencode({
        "auth_key": PROXIESAPI_KEY,
        "url": url,
    })


def fetch_html(url: str, referer: str | None = None, session: requests.Session | None = None) -> str:
    """Fetch a page and return its HTML, raising on 4xx/5xx responses."""
    # Pass a shared session to keep cookies and connections across requests.
    s = session or requests.Session()
    r = s.get(
        maybe_proxy(url),
        headers=build_headers(referer=referer),
        timeout=TIMEOUT,
    )
    r.raise_for_status()
    return r.text
This is intentionally boring.
That is the point.
When headers are enough, and when they are not
Use this decision table.
| Situation | Headers alone enough? | What to do |
|---|---|---|
| Public HTML site, low request volume | Usually yes | Good UA + locale + timeouts |
| Getting blocked only because of python-requests UA | Often | Fix UA and keep sessions |
| Multi-step session with cookies | Sometimes | Use requests.Session() and real referers |
| JavaScript-rendered site with bot checks | Rarely | Use a browser stack |
| Failing after many requests from one IP | No | Improve rate limits and proxy layer |
Headers are identity hints, not a complete disguise.
The more your target behaves like a browser application rather than a plain website, the less plain header spoofing can do on its own.
Common header mistakes
Mistake 1: copying every header from DevTools
That blob often includes browser-specific fields that do not make sense for requests.
Mistake 2: rotating everything on every request
If your UA, language, and referer change constantly, you stop looking like a person and start looking like a broken traffic generator.
Mistake 3: ignoring consistency
If you send:
- Japanese Accept-Language
- Windows Chrome UA
- EU proxy IP
- US-only product URLs
...that can be fine, but it is worth noticing the identity mismatch.
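A lightweight check can surface that kind of mismatch before a crawl starts. This is only a sketch: identity_warnings is a hypothetical helper, and the two rules it applies are assumptions picked for illustration, not a complete consistency model.

```python
def identity_warnings(headers: dict[str, str], target_language: str) -> list[str]:
    """Return human-readable notes about mismatches in a header profile."""
    warnings = []

    user_agent = headers.get("User-Agent", "")
    if not user_agent or user_agent.startswith("python-requests/"):
        warnings.append("User-Agent is missing or the library default")

    # The first tag is the most preferred one, e.g. "ja" from "ja-JP,ja;q=0.9".
    first_tag = headers.get("Accept-Language", "").split(",")[0].split(";")[0]
    primary = first_tag.split("-")[0].strip().lower()
    if primary != target_language.lower():
        warnings.append(
            f"Accept-Language prefers '{primary}', target content is '{target_language}'"
        )
    return warnings


profile = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) (example)",
    "Accept-Language": "ja-JP,ja;q=0.9",
}
for note in identity_warnings(profile, target_language="en"):
    print(note)
```

A warning is a prompt to look, not an error; as noted above, a mismatch can be intentional.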
Mistake 4: blaming headers for rate-limit problems
Many block issues are volume problems wearing a header-shaped disguise.
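If volume is the real problem, pacing is the fix, not headers. A minimal sketch of a per-crawl throttle follows; the Throttle class and the 2-second interval are assumptions for illustration, and the right interval depends entirely on the target.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the logic can be tested without real delays
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the last call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(min_interval=2.0)
# Call throttle.wait() right before each request inside the crawl loop.
```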
Recommended defaults by scraper type
| Scraper type | Recommended header strategy |
|---|---|
| Simple article / docs scraper | Stable desktop UA + Accept-Language + normal Accept |
| Search-to-detail crawler | Same as above, plus realistic Referer |
| Session-based workflow | requests.Session() with persistent cookies |
| Browser automation | Let the browser send most headers natively |
The more browser-like your tool is, the less you should manually fake browser-only headers.
Final takeaway
If you remember only one thing, make it this:
A small set of consistent headers beats a giant copied header blob.
For most scrapers, the winning setup is:
- a modern User-Agent
- a matching Accept-Language
- a realistic Accept
- Referer only when it makes sense
- persistent cookies only when the workflow needs them
That gives you clean, maintainable requests.
Then, if your crawl still struggles, fix the next layer:
- pacing
- session state
- browser rendering
- IP quality
Headers matter. They just matter most when they are part of a sane overall scraper, not a cargo-cult paste from DevTools.
Good headers make your requests less suspicious. ProxiesAPI helps with the network side when clean headers alone are not enough to keep large crawls stable.