How to Scrape MDN Docs Pages with Python

MDN docs pages are a great target when you want to build:

  • local documentation indexes
  • search overlays
  • topic maps
  • heading datasets for developer tooling

In this guide, we’ll scrape an MDN documentation page, extract its heading structure and table of contents, and turn it into structured output you can reuse.


Turn documentation scraping into a stable indexing job

Docs pages are often scrape-friendly — until you scale the crawl. ProxiesAPI helps keep the request layer clean and reliable.


Target URL

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

We’ll use the MDN page for HTTP headers, but the same pattern works for many MDN docs pages.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch the HTML

import requests

URL = "https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers"
TIMEOUT = (10, 30)
UA = "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"

resp = requests.get(URL, timeout=TIMEOUT, headers={"User-Agent": UA})
resp.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
html = resp.text
print(len(html))
print(html[:300])

Always inspect first. Documentation pages are often cleaner than consumer sites, but don’t assume the structure.


Step 2: Understand the page structure

For a docs page like this, the useful elements are usually:

  • page title
  • heading hierarchy (h2, h3, h4)
  • links in the table of contents
  • optional side navigation or metadata

Your main goal is to separate:

  1. actual content headings
  2. navigation chrome
  3. sidebar noise
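One way to sketch that split is to strip chrome elements before extracting anything. The tag names below (nav, aside, header, footer) are common docs-site landmarks, not guaranteed MDN markup, and the HTML is a toy example:

```python
from bs4 import BeautifulSoup

# Toy page with all three kinds of material mixed together
html = """
<html><body>
  <nav><a href="/docs">Docs home</a></nav>
  <aside><a href="#promo">Sidebar promo</a></aside>
  <main><h1>HTTP headers</h1><h2 id="intro">Intro</h2></main>
  <footer>Footer links</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop navigation chrome and sidebar noise before extracting anything
for el in soup.select("nav, aside, header, footer"):
    el.decompose()

print(soup.select_one("h1").get_text(strip=True))
```

After the decompose() pass, any selector you run only sees content, which keeps the later steps simple.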

Step 3: Extract the page title

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

title = soup.select_one("main h1")
page_title = title.get_text(" ", strip=True) if title else None
print(page_title)

Using main h1 is usually safer than a global h1 on a docs site.


Step 4: Extract the heading hierarchy

The heading structure is the most reusable part of a docs page.

headings = []
for tag in soup.select("main h2, main h3, main h4"):
    headings.append({
        "level": tag.name,
        "text": tag.get_text(" ", strip=True),
        "id": tag.get("id"),
    })

print(headings[:10])

This gives you a clean ordered outline of the page.

Example output (truncated):

[
  {"level": "h2", "text": "In this article", "id": "in_this_article"},
  {"level": "h2", "text": "See also", "id": "see_also"}
]
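The flat list can also be folded into a nested outline, since the levels encode parent-child structure. Here nest_headings is an illustrative helper and the headings list is sample data in the same shape Step 4 produces:

```python
# Sample data in the shape produced by the heading-extraction step
headings = [
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h3", "text": "Authorization", "id": "authorization"},
    {"level": "h2", "text": "Caching", "id": "caching"},
]

def nest_headings(flat):
    """Fold a flat h2/h3/h4 list into a tree using a stack of open sections."""
    root = {"level": "h1", "children": []}
    stack = [root]
    for h in flat:
        node = {**h, "children": []}
        depth = int(h["level"][1])  # "h2" -> 2, "h3" -> 3, ...
        # Close any open section that is at the same depth or deeper
        while len(stack) > 1 and int(stack[-1]["level"][1]) >= depth:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

outline = nest_headings(headings)
print(outline[0]["text"], "->", [c["text"] for c in outline[0]["children"]])
```

The stack-based fold is the standard trick for turning heading levels into a tree in one pass.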

Step 5: Extract the table of contents

Many docs pages expose a table of contents that links to in-page anchors.

toc_links = []
for a in soup.select('a[href^="#"]'):
    text = a.get_text(" ", strip=True)
    href = a.get("href")
    if text and href:
        toc_links.append({"text": text, "href": href})

print(toc_links[:20])

You may want to filter this later so you only keep links that belong to the main content TOC, not every anchor link on the page.
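One cheap filter: keep only anchors whose fragment matches a real heading ID from Step 4. The lists below are sample data in the shape Steps 4 and 5 produce:

```python
# Sample data in the shape produced by Steps 4 and 5
headings = [
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h2", "text": "Caching", "id": "caching"},
]
toc_links = [
    {"text": "Authentication", "href": "#authentication"},
    {"text": "Skip to content", "href": "#content"},  # page chrome, no heading
]

# Keep only anchors that point at an actual content heading
heading_ids = {h["id"] for h in headings if h["id"]}
content_toc = [l for l in toc_links if l["href"].lstrip("#") in heading_ids]
print(content_toc)
```

This drops "skip to content" links and other chrome anchors without needing a site-specific TOC container selector.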


Step 6: Put it all together

import requests
from bs4 import BeautifulSoup


def scrape_mdn_page(url: str) -> dict:
    resp = requests.get(
        url,
        timeout=(10, 30),
        headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"},
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page

    soup = BeautifulSoup(resp.text, "lxml")

    title = soup.select_one("main h1")
    page_title = title.get_text(" ", strip=True) if title else None

    headings = []
    for tag in soup.select("main h2, main h3, main h4"):
        headings.append({
            "level": tag.name,
            "text": tag.get_text(" ", strip=True),
            "id": tag.get("id"),
        })

    toc_links = []
    for a in soup.select('a[href^="#"]'):
        text = a.get_text(" ", strip=True)
        href = a.get("href")
        if text and href:
            toc_links.append({"text": text, "href": href})

    return {
        "url": url,
        "title": page_title,
        "headings": headings,
        "toc_links": toc_links,
    }


if __name__ == "__main__":
    data = scrape_mdn_page("https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers")
    print(data)

Why this is useful

Once you can extract docs structure cleanly, you can build:

  • an internal docs search tool
  • heading-based summarizers
  • local doc graphs
  • "what changed" diff monitors for docs pages

The heading hierarchy is often more useful than the raw text dump.
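The "what changed" idea, for instance, needs nothing more than set arithmetic over heading IDs from two snapshots. The IDs below are sample data, not real MDN snapshots:

```python
# Heading IDs from two scrapes of the same page (sample data)
old_ids = {"authentication", "caching", "cookies"}
new_ids = {"authentication", "caching", "compression"}

# Sections that appeared or disappeared between snapshots
added = sorted(new_ids - old_ids)
removed = sorted(old_ids - new_ids)
print("added:", added)
print("removed:", removed)
```

Comparing heading IDs rather than raw text makes the monitor robust to wording tweaks inside sections.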


Common gotchas

1. Don’t mix navigation with content

Docs sites have a lot of UI chrome. Restricting selectors to main helps a lot.

2. TOC extraction may be noisy

Global anchor extraction is fine for exploration, but you may later want a more specific TOC container selector.

3. IDs matter

Heading IDs are what turn a flat heading list into a reusable local index.
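Concretely, each ID becomes a deep link you can store in a local index. The headings list is sample data in the shape Step 4 produces:

```python
BASE = "https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers"

# Sample data in the shape produced by the heading-extraction step
headings = [
    {"level": "h2", "text": "Authentication", "id": "authentication"},
    {"level": "h2", "text": "Caching", "id": "caching"},
]

# Map each heading's text to a URL that jumps straight to that section
index = {h["text"]: f"{BASE}#{h['id']}" for h in headings if h["id"]}
print(index["Caching"])
```

A search tool built on this can send users to the exact section, not just the page.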


Export to JSON

import json

with open("mdn_page.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

Here data is the dict returned by scrape_mdn_page above.

Selector summary

Field                    Selector
Page title               main h1
Headings                 main h2, main h3, main h4
TOC-style anchor links   a[href^="#"]

Scaling note

For one or two docs pages, direct requests are enough.

For large documentation crawls, you still need a stable fetch layer.

import requests
from urllib.parse import quote

def fetch_with_proxy(url: str) -> str:
    # URL-encode the target so its own query string survives, and keep a timeout
    proxy_url = f"http://api.proxiesapi.com/?key=YOUR_API_KEY&url={quote(url, safe='')}"
    return requests.get(proxy_url, timeout=(10, 30)).text
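At crawl scale you also want retries with backoff around whatever fetch function you use. fetch_with_retries below is an illustrative stdlib-only wrapper, not part of any library API:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, backoff=1.0):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` is any callable returning the page HTML, e.g. the
    proxy wrapper above or a plain requests-based fetcher.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # in real code, catch requests.RequestException
            last_exc = exc
            time.sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_exc
```

Because it takes the fetcher as an argument, the same wrapper works for direct requests and for the proxy path.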

If you're building a scraping project that needs to scale beyond a few hundred pages, check out Proxies API — we handle proxy rotation, browser fingerprinting, CAPTCHAs, and automatic retries so you can focus on the data extraction logic. Start with 1,000 free API calls.

