How to Scrape MDN Docs Pages with Python
MDN docs pages are a great target when you want to build:
- local documentation indexes
- search overlays
- topic maps
- heading datasets for developer tooling
In this guide, we’ll scrape an MDN documentation page, extract its heading structure and table of contents, and turn it into structured output you can reuse.

Docs pages are often scrape-friendly — until you scale the crawl. ProxiesAPI helps keep the request layer clean and reliable.
Target URL
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers
We’ll use the MDN page for HTTP headers, but the same pattern works for many MDN docs pages.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
Step 1: Fetch the HTML
import requests
URL = "https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers"
TIMEOUT = (10, 30)
UA = "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"
resp = requests.get(URL, timeout=TIMEOUT, headers={"User-Agent": UA})
resp.raise_for_status()
html = resp.text
print(len(html))
print(html[:300])
Always inspect first. Documentation pages are often cleaner than consumer sites, but don’t assume the structure.
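A quick sanity check before parsing saves debugging time: confirm the response succeeded and actually looks like HTML rather than an error page. A minimal sketch (the helper name `looks_like_html` is just for illustration):

```python
def looks_like_html(status_code: int, headers: dict) -> bool:
    """Basic pre-parse check: success status and an HTML content type."""
    content_type = headers.get("Content-Type", "")
    return status_code == 200 and "text/html" in content_type.lower()

# Example with a fake response:
print(looks_like_html(200, {"Content-Type": "text/html; charset=utf-8"}))  # True
print(looks_like_html(200, {"Content-Type": "application/json"}))          # False
```

You can call this with `resp.status_code` and `resp.headers` before handing the body to BeautifulSoup.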
Step 2: Understand the page structure
For a docs page like this, the useful elements are usually:
- page title
- heading hierarchy (h2, h3, h4)
- links in the table of contents
- optional side navigation or metadata
Your main goal is to separate:
- actual content headings
- navigation chrome
- sidebar noise
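To see why that separation matters, here is a small self-contained sketch using a made-up HTML fragment (not MDN's real markup) that contrasts a global `h2` selector with one scoped to `main`:

```python
from bs4 import BeautifulSoup

# Tiny made-up fragment: one heading in the nav chrome, one in the content.
html = """
<nav><h2>Site navigation</h2></nav>
<main><h2 id="syntax">Syntax</h2></main>
"""
# The stdlib parser is enough for this sketch; the rest of the guide uses lxml.
soup = BeautifulSoup(html, "html.parser")

all_h2 = [h.get_text(strip=True) for h in soup.select("h2")]
content_h2 = [h.get_text(strip=True) for h in soup.select("main h2")]

print(all_h2)      # ['Site navigation', 'Syntax']
print(content_h2)  # ['Syntax']
```

The global selector picks up chrome headings; the scoped one returns only the real content.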
Step 3: Extract the page title
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
title = soup.select_one("main h1")
page_title = title.get_text(" ", strip=True) if title else None
print(page_title)
Using main h1 is usually safer than a global h1 on a docs site.
Step 4: Extract the heading hierarchy
The heading structure is the most reusable part of a docs page.
headings = []
for tag in soup.select("main h2, main h3, main h4"):
    headings.append({
        "level": tag.name,
        "text": tag.get_text(" ", strip=True),
        "id": tag.get("id"),
    })
print(headings[:10])
This gives you a clean ordered outline of the page.
Example output:
[
  {"level": "h2", "text": "In this article", "id": "in_this_article"},
  {"level": "h2", "text": "See also", "id": "see_also"}
]
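If you want a tree instead of a flat list, you can nest the headings by level. A sketch under the assumptions above (the helper name `nest_headings` is illustrative; it expects the `h2`/`h3`/`h4` dicts produced by the extraction loop):

```python
def nest_headings(headings):
    """Turn a flat list of {level, text, id} dicts into a nested outline."""
    root = {"level": "h1", "children": []}
    stack = [root]
    for h in headings:
        node = {**h, "children": []}
        depth = int(h["level"][1])  # "h3" -> 3
        # Pop until the top of the stack is a shallower heading.
        while len(stack) > 1 and int(stack[-1]["level"][1]) >= depth:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

flat = [
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h3", "text": "WWW-Authenticate", "id": "www-authenticate"},
    {"level": "h2", "text": "Caching", "id": "caching"},
]
outline = nest_headings(flat)
print(outline[0]["children"][0]["text"])  # WWW-Authenticate
```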
Step 5: Extract the table of contents
Many docs pages expose a table of contents that links to in-page anchors.
toc_links = []
for a in soup.select('a[href^="#"]'):
    text = a.get_text(" ", strip=True)
    href = a.get("href")
    if text and href:
        toc_links.append({"text": text, "href": href})
print(toc_links[:20])
You may want to filter this later so you only keep links that belong to the main content TOC, not every anchor link on the page.
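One way to do that filtering is to cross-reference the anchors against the heading ids you already extracted, dropping any link that doesn't point at a content heading. A sketch with made-up data (`filter_toc_links` is an illustrative name):

```python
def filter_toc_links(toc_links, headings):
    """Keep only anchors that point at a known content heading id."""
    heading_ids = {h["id"] for h in headings if h["id"]}
    return [link for link in toc_links if link["href"].lstrip("#") in heading_ids]

headings = [{"level": "h2", "text": "Caching", "id": "caching"}]
toc_links = [
    {"text": "Skip to content", "href": "#content"},  # chrome anchor, dropped
    {"text": "Caching", "href": "#caching"},           # real TOC link, kept
]
kept = filter_toc_links(toc_links, headings)
print(kept)  # [{'text': 'Caching', 'href': '#caching'}]
```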
Step 6: Put it all together
import requests
from bs4 import BeautifulSoup
def scrape_mdn_page(url: str) -> dict:
    resp = requests.get(
        url,
        timeout=(10, 30),
        headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"},
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    title = soup.select_one("main h1")
    page_title = title.get_text(" ", strip=True) if title else None

    headings = []
    for tag in soup.select("main h2, main h3, main h4"):
        headings.append({
            "level": tag.name,
            "text": tag.get_text(" ", strip=True),
            "id": tag.get("id"),
        })

    toc_links = []
    for a in soup.select('a[href^="#"]'):
        text = a.get_text(" ", strip=True)
        href = a.get("href")
        if text and href:
            toc_links.append({"text": text, "href": href})

    return {
        "url": url,
        "title": page_title,
        "headings": headings,
        "toc_links": toc_links,
    }

if __name__ == "__main__":
    data = scrape_mdn_page("https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers")
    print(data)
Why this is useful
Once you can extract docs structure cleanly, you can build:
- an internal docs search tool
- heading-based summarizers
- local doc graphs
- "what changed" diff monitors for docs pages
The heading hierarchy is often more useful than the raw text dump.
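For example, the heading list renders naturally as an indented markdown outline, which diffs far more cleanly than raw HTML for "what changed" monitoring (the helper name is illustrative):

```python
def headings_to_markdown(headings):
    """Render {level, text, id} dicts as an indented markdown outline."""
    indent = {"h2": "", "h3": "  ", "h4": "    "}
    return "\n".join(f"{indent[h['level']]}- {h['text']}" for h in headings)

outline_md = headings_to_markdown([
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h3", "text": "WWW-Authenticate", "id": "www-authenticate"},
])
print(outline_md)
# - Authentication
#   - WWW-Authenticate
```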
Common gotchas
1. Don’t mix navigation with content
Docs sites have a lot of UI chrome. Restricting selectors to main helps a lot.
2. TOC extraction may be noisy
Global anchor extraction is fine for exploration, but you may later want a more specific TOC container selector.
3. IDs matter
Heading IDs are what turn a flat heading list into a reusable local index.
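For instance, combining the page URL with each heading's id gives you deep links straight into sections (a sketch; `build_anchor_index` is an illustrative name):

```python
def build_anchor_index(url, headings):
    """Map heading text to a deep link, using the id attribute as the fragment.

    Headings without an id are skipped: they cannot be linked to directly.
    """
    return {h["text"]: f"{url}#{h['id']}" for h in headings if h["id"]}

index = build_anchor_index(
    "https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers",
    [{"level": "h2", "text": "Caching", "id": "caching"}],
)
print(index["Caching"])
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers#caching
```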
Export to JSON
import json
with open("mdn_page.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
Selector summary
| Field | Selector |
|---|---|
| Page title | main h1 |
| Headings | main h2, main h3, main h4 |
| TOC-style anchor links | a[href^="#"] |
Scaling note
For one or two docs pages, direct requests are enough.
For large documentation crawls, you still need a stable fetch layer.
from urllib.parse import quote

def fetch_with_proxy(url: str) -> str:
    # URL-encode the target so its query string survives, and keep a timeout.
    proxy_url = f"http://api.proxiesapi.com/?key=YOUR_API_KEY&url={quote(url, safe='')}"
    resp = requests.get(proxy_url, timeout=(10, 60))
    resp.raise_for_status()
    return resp.text
If you're building a scraping project that needs to scale beyond a few hundred pages, check out Proxies API — we handle proxy rotation, browser fingerprinting, CAPTCHAs, and automatic retries so you can focus on the data extraction logic. Start with 1,000 free API calls.