Scrape Wikipedia list pages with Python
Wikipedia list pages are one of the best places to start scraping because the markup is relatively structured, the content is public, and many pages follow repeatable table patterns.
In this guide, we’ll scrape a Wikipedia list page with Python, extract rows from a sortable table, follow links to detail pages, and export the result to CSV and JSON.
We’ll use:
- requests for HTTP requests
- BeautifulSoup for parsing HTML
- csv and json for export
We’ll also show how to route requests through ProxiesAPI when you want a proxy-backed fetch flow with minimal setup.
What we’re scraping
For a concrete example, we’ll use the Wikipedia page:
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
This page is a good tutorial target because it contains multiple content sections with a predictable heading + list structure. We'll also demonstrate another common pattern on list pages: extracting linked entries and enriching them by visiting each linked page.
Install dependencies
pip install requests beautifulsoup4
Basic request + parse flow
Start by downloading the page and parsing it.
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; tutorial-bot/1.0; +https://example.com/bot)"
}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True))
Example output:
List of HTTP status codes - Wikipedia
Inspecting the structure
Wikipedia content usually lives inside div.mw-parser-output.
content = soup.select_one("div.mw-parser-output")
print(content is not None)
For many Wikipedia list pages, your first step is usually one of these:
- find a table.wikitable
- find lists under headings
- collect a[href] links from a content section
Let’s collect the status code items from the page. The page structure includes headings and bullet lists, so we’ll extract codes from list items.
content = soup.select_one("div.mw-parser-output")
items = content.select("ul > li")
for item in items[:10]:
    text = item.get_text(" ", strip=True)
    print(text[:120])
That gives you raw list content, but for a cleaner scraper you usually want structured fields.
Extract structured records from a Wikipedia list page
The following script extracts HTTP code entries by finding list items that begin with a 3-digit status code.
import re
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; wikipedia-scraper/1.0; +https://example.com/bot)"
}

def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text

def parse_status_codes(html: str):
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one("div.mw-parser-output")
    records = []
    for li in content.select("ul > li"):
        text = li.get_text(" ", strip=True)
        match = re.match(r"^(\d{3})\s+(.*)", text)
        if not match:
            continue
        code = match.group(1)
        description = match.group(2)
        first_link = li.select_one("a[href^='/wiki/']")
        detail_url = None
        if first_link and first_link.get("href"):
            detail_url = "https://en.wikipedia.org" + first_link["href"]
        records.append({
            "code": code,
            "description": description,
            "detail_url": detail_url,
        })
    return records

if __name__ == "__main__":
    html = fetch_html(URL)
    records = parse_status_codes(html)
    print(f"Extracted {len(records)} records")
    for row in records[:5]:
        print(row)
Example output:
Extracted 64 records
{'code': '100', 'description': 'Continue', 'detail_url': 'https://en.wikipedia.org/wiki/List_of_HTTP_status_codes'}
{'code': '101', 'description': 'Switching Protocols', 'detail_url': 'https://en.wikipedia.org/wiki/List_of_HTTP_status_codes'}
{'code': '102', 'description': 'Processing', 'detail_url': 'https://en.wikipedia.org/wiki/WebDAV'}
{'code': '103', 'description': 'Early Hints', 'detail_url': 'https://en.wikipedia.org/wiki/HTTP_103'}
{'code': '200', 'description': 'OK', 'detail_url': 'https://en.wikipedia.org/wiki/List_of_HTTP_status_codes'}
Scraping a real Wikipedia table
A lot of Wikipedia list pages use table.wikitable. Here’s a reusable function that works well on those pages.
import requests
from bs4 import BeautifulSoup
def scrape_wikitable(url: str):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.select_one("table.wikitable")
    if not table:
        raise ValueError("No wikitable found on page")
    # Take headers from the first row only, so row-header <th> cells
    # inside the table body don't get mixed into the column names.
    first_row = table.select_one("tr")
    headers = [th.get_text(" ", strip=True) for th in first_row.select("th")] if first_row else []
    rows = []
    for tr in table.select("tr"):
        cells = tr.select("td")
        if not cells:
            continue
        values = [td.get_text(" ", strip=True) for td in cells]
        rows.append(values)
    return headers, rows
This is the core pattern you'll reuse across "list of X" pages on Wikipedia, though tables with merged cells (rowspan/colspan) may need extra handling.
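Before pointing that function at a live page, the same selector logic can be sanity-checked offline against a tiny inline table. The HTML fragment below is made-up sample data, not real Wikipedia markup:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a Wikipedia "wikitable" (hypothetical data),
# useful for verifying the selector logic without a network request.
html = """
<table class="wikitable">
  <tr><th>Name</th><th>Year</th></tr>
  <tr><td>Alpha</td><td>2001</td></tr>
  <tr><td>Beta</td><td>2004</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.wikitable")

# Same extraction logic as scrape_wikitable, applied to the fragment.
headers = [th.get_text(" ", strip=True) for th in table.select("tr th")]
rows = [
    [td.get_text(" ", strip=True) for td in tr.select("td")]
    for tr in table.select("tr")
    if tr.select("td")
]

print(headers)  # ['Name', 'Year']
print(rows)     # [['Alpha', '2001'], ['Beta', '2004']]
```

Once the offline check passes, swapping the fragment for a fetched page is a one-line change.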
Follow linked detail pages
Once you have the list page, the next step is often enrichment.
For example, you may want to collect:
- page title
- first paragraph summary
- infobox fields
- categories
Here’s a helper that grabs the first paragraph from a linked Wikipedia detail page.
import requests
from bs4 import BeautifulSoup
def scrape_wikipedia_summary(url: str):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.select_one("h1#firstHeading")
    content = soup.select_one("div.mw-parser-output")
    summary = ""
    if content:
        for p in content.select("p"):
            text = p.get_text(" ", strip=True)
            if text:
                summary = text
                break
    return {
        "title": title.get_text(strip=True) if title else None,
        "summary": summary,
    }
Now combine list-page extraction with detail-page enrichment.
import csv
import json
import time
import requests
from bs4 import BeautifulSoup
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; wikipedia-scraper/1.0; +https://example.com/bot)"
}
BASE_URL = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"

def get(url):
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text

def parse_list_page(html):
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one("div.mw-parser-output")
    results = []
    for li in content.select("ul > li"):
        text = li.get_text(" ", strip=True)
        if len(text) < 4 or not text[:3].isdigit():
            continue
        link = li.select_one("a[href^='/wiki/']")
        detail_url = None
        if link:
            detail_url = "https://en.wikipedia.org" + link.get("href", "")
        results.append({
            "label": text,
            "detail_url": detail_url,
        })
    return results

def parse_detail_page(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1#firstHeading")
    content = soup.select_one("div.mw-parser-output")
    summary = ""
    if content:
        for p in content.select("p"):
            text = p.get_text(" ", strip=True)
            if text:
                summary = text
                break
    return {
        "detail_title": title.get_text(strip=True) if title else "",
        "summary": summary,
    }

list_html = get(BASE_URL)
records = parse_list_page(list_html)

enriched = []
for record in records[:10]:
    detail = {"detail_title": "", "summary": ""}
    if record["detail_url"]:
        try:
            detail_html = get(record["detail_url"])
            detail = parse_detail_page(detail_html)
            time.sleep(1)
        except requests.RequestException:
            pass
    enriched.append({**record, **detail})

with open("wikipedia_status_codes.json", "w", encoding="utf-8") as f:
    json.dump(enriched, f, ensure_ascii=False, indent=2)

with open("wikipedia_status_codes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["label", "detail_url", "detail_title", "summary"])
    writer.writeheader()
    writer.writerows(enriched)

print(f"Saved {len(enriched)} rows")
Example output:
Saved 10 rows
Using ProxiesAPI for the fetch step
If you want to fetch the same Wikipedia URL through ProxiesAPI, the request shape is simple:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"
You can do the same thing in Python.
import requests
from urllib.parse import quote_plus
def fetch_via_proxiesapi(target_url: str, api_key: str) -> str:
    proxy_url = (
        "http://api.proxiesapi.com/?key="
        f"{api_key}&url={quote_plus(target_url)}"
    )
    response = requests.get(proxy_url, timeout=60)
    response.raise_for_status()
    return response.text

html = fetch_via_proxiesapi(
    "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes",
    api_key="API_KEY",
)
print(html[:200])
That pattern is useful when you want to keep your scraper code mostly unchanged while moving URL fetching into a proxy API layer.
Practical scraping tips for Wikipedia
Wikipedia is friendlier than many commercial sites, but you should still scrape responsibly.
1. Set a user agent
Always identify your script with a descriptive user agent.
2. Add delays between detail-page requests
If you’re following hundreds of links, sleep between requests.
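A minimal sketch of that idea combines a jittered delay with exponential-backoff retries. The delay values and retry count below are illustrative defaults, not requirements:

```python
import random
import time

import requests

def backoff_delays(retries: int, base_delay: float):
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_delay * (2 ** attempt) for attempt in range(retries)]

def polite_get(url, headers=None, retries=3, base_delay=1.0):
    """Fetch a URL with a small randomized pause after success and
    an exponential backoff before each retry after a failure."""
    last_error = None
    for delay in backoff_delays(retries, base_delay):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            # Jittered pause so bursts of detail-page requests spread out.
            time.sleep(base_delay + random.uniform(0, 0.5))
            return response.text
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```

Dropping `polite_get` in place of a bare `requests.get` call is usually enough to keep a small crawl well-behaved.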
3. Expect page structure differences
Some pages use tables, some use lists, some use infoboxes. Build small parser functions for each pattern.
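One way to handle that is a small dispatcher that sniffs the page before choosing a parser. This is a sketch assuming the common Wikipedia class names; the inline fragments below are made-up test data:

```python
from bs4 import BeautifulSoup

def pick_parser(html: str) -> str:
    """Decide which extraction pattern fits a page: 'table', 'list',
    or 'unknown'. Checks the common Wikipedia class names in order."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.select_one("table.wikitable"):
        return "table"
    if soup.select_one("div.mw-parser-output ul > li"):
        return "list"
    return "unknown"

# Exercised on tiny inline fragments rather than live pages:
print(pick_parser('<table class="wikitable"><tr><th>A</th></tr></table>'))        # table
print(pick_parser('<div class="mw-parser-output"><ul><li>item</li></ul></div>'))  # list
```

The dispatcher keeps each parser small and lets you add new patterns (infoboxes, definition lists) without touching existing ones.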
4. Normalize text early
Wikipedia text often includes citations, superscripts, and formatting noise. Strip and normalize before export.
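For example, a small normalizer can drop citation markers like [1] or [note 2] and collapse whitespace before export. The regex here is an assumption tuned to common citation formats, not an exhaustive rule:

```python
import re
import unicodedata

def clean_wiki_text(text: str) -> str:
    """Normalize unicode (e.g. non-breaking spaces), strip citation
    markers like [1] or [note 2], and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\[\s*(?:note\s*)?\d+\s*\]", "", text)  # drop [1], [note 2]
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_wiki_text("200 OK [1] [note 2]  standard\u00a0response"))
# → 200 OK standard response
```

Running every extracted string through one function like this keeps the CSV and JSON outputs consistent.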
5. Save raw HTML during development
When a selector stops working, raw HTML snapshots help you debug quickly.
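A minimal snapshot helper might look like this; the `snapshots` directory name and hash-based filename scheme are arbitrary choices:

```python
import hashlib
from pathlib import Path

def save_snapshot(url: str, html: str, directory: str = "snapshots") -> Path:
    """Write raw HTML to disk keyed by a hash of the URL, so a failing
    selector can be debugged against the exact bytes that were fetched."""
    Path(directory).mkdir(exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = Path(directory) / name
    path.write_text(html, encoding="utf-8")
    return path

path = save_snapshot("https://en.wikipedia.org/wiki/Example", "<html>demo</html>")
print(path)
```

Calling `save_snapshot` right after each fetch costs almost nothing and makes selector regressions reproducible.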
Common selector patterns on Wikipedia
These selectors are useful across many Wikipedia pages:
- Main content: div.mw-parser-output
- Page title: h1#firstHeading
- Table rows: table.wikitable tr
- Infobox: table.infobox
- Internal links: a[href^='/wiki/']
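The infobox selector, for instance, pairs naturally with label/value extraction. Here's a sketch that assumes the common th-label / td-value row layout; the inline fragment is made-up sample data:

```python
from bs4 import BeautifulSoup

def parse_infobox(html: str) -> dict:
    """Turn an infobox's label/value rows into a dict.

    Wikipedia infoboxes usually put the label in a <th> and the value
    in the sibling <td>; rows missing either cell are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    box = soup.select_one("table.infobox")
    fields = {}
    if box:
        for tr in box.select("tr"):
            th, td = tr.select_one("th"), tr.select_one("td")
            if th and td:
                fields[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)
    return fields

# Tiny inline fragment (hypothetical data) standing in for a real infobox:
sample = """
<table class="infobox">
  <tr><th>Developer</th><td>IETF</td></tr>
  <tr><th>Introduced</th><td>1997</td></tr>
</table>
"""
print(parse_infobox(sample))  # {'Developer': 'IETF', 'Introduced': '1997'}
```

Real infoboxes often mix in image rows and nested tables, so expect to filter a few extra keys when you run this against live pages.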
When to use list-page scraping
This workflow is ideal when you need to build datasets from:
- list pages of tools, companies, protocols, places, or people
- category-like pages with lots of linked entries
- reference pages that combine summary + navigation
It’s especially useful for internal research, content enrichment, and structured exports.
Final thoughts
Wikipedia list pages are one of the easiest ways to build a reliable scraper because the patterns are visible and usually repeatable.
The workflow is straightforward:
- fetch the list page
- extract rows or linked entries
- visit selected detail pages
- normalize and export the data
Start with one page, validate your selectors, and only then scale out to more pages and more detail-page requests.
If you want to simplify the request-routing side of your scraper, a fetch layer like ProxiesAPI can keep the networking piece minimal while you focus on parsing and data quality.
ProxiesAPI lets you fetch target URLs through a simple API endpoint, which is handy when you want cleaner request routing and fewer moving parts in your scraping stack.