How to Scrape MDN Docs Pages with Python

MDN docs pages are a great target when you want to build:

  • local documentation indexes
  • search overlays
  • topic maps
  • heading datasets for developer tooling

In this guide, we’ll scrape an MDN documentation page, extract its heading structure and table of contents, and turn it into structured output you can reuse.


Turn documentation scraping into a stable indexing job

Docs pages are often scrape-friendly — until you scale the crawl. ProxiesAPI helps keep the request layer clean and reliable.


Target URL

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

We’ll use the MDN page for HTTP headers, but the same pattern works for many MDN docs pages.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch the HTML

import requests

URL = "https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers"
TIMEOUT = (10, 30)
UA = "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"

resp = requests.get(URL, timeout=TIMEOUT, headers={"User-Agent": UA})
resp.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
html = resp.text
print(len(html))
print(html[:300])

Always inspect first. Documentation pages are often cleaner than consumer sites, but don’t assume the structure.


Step 2: Understand the page structure

For a docs page like this, the useful elements are usually:

  • page title
  • heading hierarchy (h2, h3, h4)
  • links in the table of contents
  • optional side navigation or metadata

Your main goal is to separate:

  1. actual content headings
  2. navigation chrome
  3. sidebar noise
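One way to sketch that split is to strip chrome elements before extracting anything. The tag names below (nav, aside, header, footer) are common docs-site landmarks, not guaranteed MDN markup, and the HTML is a toy example:

```python
from bs4 import BeautifulSoup

# Toy page with all three kinds of material mixed together
html = """
<html><body>
  <nav><a href="/docs">Docs home</a></nav>
  <aside><a href="#promo">Sidebar promo</a></aside>
  <main><h1>HTTP headers</h1><h2 id="intro">Intro</h2></main>
  <footer>Footer links</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop navigation chrome and sidebar noise before extracting anything
for el in soup.select("nav, aside, header, footer"):
    el.decompose()

print(soup.select_one("h1").get_text(strip=True))
```

After the decompose() pass, any selector you run only sees content, which keeps the later steps simple.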

Step 3: Extract the page title

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

title = soup.select_one("main h1")
page_title = title.get_text(" ", strip=True) if title else None
print(page_title)

Using main h1 is usually safer than a global h1 on a docs site.


Step 4: Extract the heading hierarchy

The heading structure is the most reusable part of a docs page.

headings = []
for tag in soup.select("main h2, main h3, main h4"):
    headings.append({
        "level": tag.name,
        "text": tag.get_text(" ", strip=True),
        "id": tag.get("id"),
    })

print(headings[:10])

This gives you a clean ordered outline of the page.

Example output (truncated):

[
  {"level": "h2", "text": "In this article", "id": "in_this_article"},
  {"level": "h2", "text": "See also", "id": "see_also"}
]
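The flat list can also be folded into a nested outline, since the levels encode parent-child structure. Here nest_headings is an illustrative helper and the headings list is sample data in the same shape Step 4 produces:

```python
# Sample data in the shape produced by the heading-extraction step
headings = [
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h3", "text": "Authorization", "id": "authorization"},
    {"level": "h2", "text": "Caching", "id": "caching"},
]

def nest_headings(flat):
    """Fold a flat h2/h3/h4 list into a tree using a stack of open sections."""
    root = {"level": "h1", "children": []}
    stack = [root]
    for h in flat:
        node = {**h, "children": []}
        depth = int(h["level"][1])  # "h2" -> 2, "h3" -> 3, ...
        # Close any open section that is at the same depth or deeper
        while len(stack) > 1 and int(stack[-1]["level"][1]) >= depth:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

outline = nest_headings(headings)
print(outline[0]["text"], "->", [c["text"] for c in outline[0]["children"]])
```

The stack-based fold is the standard trick for turning heading levels into a tree in one pass.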

Step 5: Extract the table of contents

Many docs pages expose a table of contents that links to in-page anchors.

toc_links = []
for a in soup.select('a[href^="#"]'):
    text = a.get_text(" ", strip=True)
    href = a.get("href")
    if text and href:
        toc_links.append({"text": text, "href": href})

print(toc_links[:20])

You may want to filter this later so you only keep links that belong to the main content TOC, not every anchor link on the page.
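One cheap filter: keep only anchors whose fragment matches a real heading ID from Step 4. The lists below are sample data in the shape Steps 4 and 5 produce:

```python
# Sample data in the shape produced by Steps 4 and 5
headings = [
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h2", "text": "Caching", "id": "caching"},
]
toc_links = [
    {"text": "Authentication", "href": "#authentication"},
    {"text": "Skip to content", "href": "#content"},  # page chrome, no heading
]

# Keep only anchors that point at an actual content heading
heading_ids = {h["id"] for h in headings if h["id"]}
content_toc = [l for l in toc_links if l["href"].lstrip("#") in heading_ids]
print(content_toc)
```

This drops "skip to content" links and other chrome anchors without needing a site-specific TOC container selector.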


Step 6: Put it all together

import requests
from bs4 import BeautifulSoup


def scrape_mdn_page(url: str) -> dict:
    resp = requests.get(
        url,
        timeout=(10, 30),
        headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"},
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page

    soup = BeautifulSoup(resp.text, "lxml")

    title = soup.select_one("main h1")
    page_title = title.get_text(" ", strip=True) if title else None

    headings = []
    for tag in soup.select("main h2, main h3, main h4"):
        headings.append({
            "level": tag.name,
            "text": tag.get_text(" ", strip=True),
            "id": tag.get("id"),
        })

    toc_links = []
    for a in soup.select('a[href^="#"]'):
        text = a.get_text(" ", strip=True)
        href = a.get("href")
        if text and href:
            toc_links.append({"text": text, "href": href})

    return {
        "url": url,
        "title": page_title,
        "headings": headings,
        "toc_links": toc_links,
    }


if __name__ == "__main__":
    data = scrape_mdn_page("https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers")
    print(data)

Why this is useful

Once you can extract docs structure cleanly, you can build:

  • an internal docs search tool
  • heading-based summarizers
  • local doc graphs
  • "what changed" diff monitors for docs pages

The heading hierarchy is often more useful than the raw text dump.
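The "what changed" idea, for instance, needs nothing more than set arithmetic over heading IDs from two snapshots. The IDs below are sample data, not real MDN snapshots:

```python
# Heading IDs from two scrapes of the same page (sample data)
old_ids = {"authentication", "caching", "cookies"}
new_ids = {"authentication", "caching", "compression"}

# Sections that appeared or disappeared between snapshots
added = sorted(new_ids - old_ids)
removed = sorted(old_ids - new_ids)
print("added:", added)
print("removed:", removed)
```

Comparing heading IDs rather than raw text makes the monitor robust to wording tweaks inside sections.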


Common gotchas

1. Don’t mix navigation with content

Docs sites have a lot of UI chrome. Restricting selectors to main helps a lot.

2. TOC extraction may be noisy

Global anchor extraction is fine for exploration, but you may later want a more specific TOC container selector.

3. IDs matter

Heading IDs are what turn a flat heading list into a reusable local index.
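Concretely, each ID becomes a deep link you can store in a local index. The headings list is sample data in the shape Step 4 produces:

```python
BASE = "https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers"

# Sample data in the shape produced by the heading-extraction step
headings = [
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h2", "text": "Caching", "id": "caching"},
]

# Map each heading's text to a URL that jumps straight to that section
index = {h["text"]: f"{BASE}#{h['id']}" for h in headings if h["id"]}
print(index["Caching"])
```

A search tool built on this can send users to the exact section, not just the page.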


Export to JSON

import json

with open("mdn_page.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

Here data is the dict returned by scrape_mdn_page above.

Selector summary

Field                    Selector
Page title               main h1
Headings                 main h2, main h3, main h4
TOC-style anchor links   a[href^="#"]

Scaling note

For one or two docs pages, direct requests are enough.

For large documentation crawls, you still need a stable fetch layer.

import requests
from urllib.parse import quote

def fetch_with_proxy(url: str) -> str:
    # URL-encode the target so its own query string survives, and keep a timeout
    proxy_url = f"http://api.proxiesapi.com/?key=YOUR_API_KEY&url={quote(url, safe='')}"
    return requests.get(proxy_url, timeout=(10, 30)).text
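At crawl scale you also want retries with backoff around whatever fetch function you use. fetch_with_retries below is an illustrative stdlib-only wrapper, not part of any library API:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, backoff=1.0):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` is any callable returning the page HTML, e.g. the
    proxy wrapper above or a plain requests-based fetcher.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # in real code, catch requests.RequestException
            last_exc = exc
            time.sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_exc
```

Because it takes the fetcher as an argument, the same wrapper works for direct requests and for the proxy path.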

If you're building a scraping project that needs to scale beyond a few hundred pages, check out Proxies API — we handle proxy rotation, browser fingerprinting, CAPTCHAs, and automatic retries so you can focus on the data extraction logic. Start with 1,000 free API calls.

