Scrape Wikipedia list pages with Python
Wikipedia list pages are one of the best places to start scraping because the markup is relatively structured, the content is public, and many pages follow repeatable table patterns.
In this guide, we’ll scrape a Wikipedia list page with Python, extract rows from a sortable table, follow links to detail pages, and export the result to CSV and JSON.
We’ll use:
- requests for HTTP requests
- BeautifulSoup for parsing HTML
- csv and json for export
We’ll also show how to route requests through ProxiesAPI when you want a proxy-backed fetch flow with minimal setup.
What we’re scraping
For a concrete example, we’ll use the Wikipedia page:
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
This page is a good tutorial target because it contains multiple content sections with a predictable heading + list structure. We'll also demonstrate another common pattern on list pages: extracting linked entries and enriching them by visiting each linked page.
Install dependencies
pip install requests beautifulsoup4
Basic request + parse flow
Start by downloading the page and parsing it.
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; tutorial-bot/1.0; +https://example.com/bot)"
}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True))
Example output:
List of HTTP status codes - Wikipedia
Inspecting the structure
Wikipedia content usually lives inside div.mw-parser-output.
content = soup.select_one("div.mw-parser-output")
print(content is not None)
For many Wikipedia list pages, your first step is usually one of these:
- find a table.wikitable
- find lists under headings
- collect a[href] links from a content section
Let’s collect the status code items from the page. The page structure includes headings and bullet lists, so we’ll extract codes from list items.
content = soup.select_one("div.mw-parser-output")
items = content.select("ul > li")
for item in items[:10]:
    text = item.get_text(" ", strip=True)
    print(text[:120])
That gives you raw list content, but for a cleaner scraper you usually want structured fields.
Extract structured records from a Wikipedia list page
The following script extracts HTTP code entries by finding list items that begin with a 3-digit status code.
import re
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; wikipedia-scraper/1.0; +https://example.com/bot)"
}

def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text

def parse_status_codes(html: str):
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one("div.mw-parser-output")
    records = []
    for li in content.select("ul > li"):
        text = li.get_text(" ", strip=True)
        match = re.match(r"^(\d{3})\s+(.*)", text)
        if not match:
            continue
        code = match.group(1)
        description = match.group(2)
        first_link = li.select_one("a[href^='/wiki/']")
        detail_url = None
        if first_link and first_link.get("href"):
            detail_url = "https://en.wikipedia.org" + first_link["href"]
        records.append({
            "code": code,
            "description": description,
            "detail_url": detail_url,
        })
    return records

if __name__ == "__main__":
    html = fetch_html(URL)
    records = parse_status_codes(html)
    print(f"Extracted {len(records)} records")
    for row in records[:5]:
        print(row)
Example output:
Extracted 64 records
{'code': '100', 'description': 'Continue', 'detail_url': 'https://en.wikipedia.org/wiki/List_of_HTTP_status_codes'}
{'code': '101', 'description': 'Switching Protocols', 'detail_url': 'https://en.wikipedia.org/wiki/List_of_HTTP_status_codes'}
{'code': '102', 'description': 'Processing', 'detail_url': 'https://en.wikipedia.org/wiki/WebDAV'}
{'code': '103', 'description': 'Early Hints', 'detail_url': 'https://en.wikipedia.org/wiki/HTTP_103'}
{'code': '200', 'description': 'OK', 'detail_url': 'https://en.wikipedia.org/wiki/List_of_HTTP_status_codes'}
Scraping a real Wikipedia table
A lot of Wikipedia list pages use table.wikitable. Here’s a reusable function that works well on those pages.
import requests
from bs4 import BeautifulSoup
def scrape_wikitable(url: str):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.select_one("table.wikitable")
    if not table:
        raise ValueError("No wikitable found on page")
    # Take headers from the first row only, so row-header <th> cells
    # inside the table body don't get mixed into the column names.
    first_row = table.select_one("tr")
    headers = [th.get_text(" ", strip=True) for th in first_row.select("th")] if first_row else []
    rows = []
    for tr in table.select("tr"):
        cells = tr.select("td")
        if not cells:
            continue
        values = [td.get_text(" ", strip=True) for td in cells]
        rows.append(values)
    return headers, rows
This is the core pattern you'll reuse across "list of X" pages on Wikipedia, though tables with merged cells (rowspan/colspan) may need extra handling.
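Before pointing that function at a live page, the same selector logic can be sanity-checked offline against a tiny inline table. The HTML fragment below is made-up sample data, not real Wikipedia markup:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a Wikipedia "wikitable" (hypothetical data),
# useful for verifying the selector logic without a network request.
html = """
<table class="wikitable">
  <tr><th>Name</th><th>Year</th></tr>
  <tr><td>Alpha</td><td>2001</td></tr>
  <tr><td>Beta</td><td>2004</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.wikitable")

# Same extraction logic as scrape_wikitable, applied to the fragment.
headers = [th.get_text(" ", strip=True) for th in table.select("tr th")]
rows = [
    [td.get_text(" ", strip=True) for td in tr.select("td")]
    for tr in table.select("tr")
    if tr.select("td")
]

print(headers)  # ['Name', 'Year']
print(rows)     # [['Alpha', '2001'], ['Beta', '2004']]
```

Once the offline check passes, swapping the fragment for a fetched page is a one-line change.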
Follow linked detail pages
Once you have the list page, the next step is often enrichment.
For example, you may want to collect:
- page title
- first paragraph summary
- infobox fields
- categories
Here’s a helper that grabs the first paragraph from a linked Wikipedia detail page.
import requests
from bs4 import BeautifulSoup
def scrape_wikipedia_summary(url: str):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.select_one("h1#firstHeading")
    content = soup.select_one("div.mw-parser-output")
    summary = ""
    if content:
        for p in content.select("p"):
            text = p.get_text(" ", strip=True)
            if text:
                summary = text
                break
    return {
        "title": title.get_text(strip=True) if title else None,
        "summary": summary,
    }
Now combine list-page extraction with detail-page enrichment.
import csv
import json
import time
import requests
from bs4 import BeautifulSoup
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; wikipedia-scraper/1.0; +https://example.com/bot)"
}
BASE_URL = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"

def get(url):
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text

def parse_list_page(html):
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one("div.mw-parser-output")
    results = []
    for li in content.select("ul > li"):
        text = li.get_text(" ", strip=True)
        if len(text) < 4 or not text[:3].isdigit():
            continue
        link = li.select_one("a[href^='/wiki/']")
        detail_url = None
        if link:
            detail_url = "https://en.wikipedia.org" + link.get("href", "")
        results.append({
            "label": text,
            "detail_url": detail_url,
        })
    return results

def parse_detail_page(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1#firstHeading")
    content = soup.select_one("div.mw-parser-output")
    summary = ""
    if content:
        for p in content.select("p"):
            text = p.get_text(" ", strip=True)
            if text:
                summary = text
                break
    return {
        "detail_title": title.get_text(strip=True) if title else "",
        "summary": summary,
    }

list_html = get(BASE_URL)
records = parse_list_page(list_html)

enriched = []
for record in records[:10]:
    detail = {"detail_title": "", "summary": ""}
    if record["detail_url"]:
        try:
            detail_html = get(record["detail_url"])
            detail = parse_detail_page(detail_html)
            time.sleep(1)
        except requests.RequestException:
            pass
    enriched.append({**record, **detail})

with open("wikipedia_status_codes.json", "w", encoding="utf-8") as f:
    json.dump(enriched, f, ensure_ascii=False, indent=2)

with open("wikipedia_status_codes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["label", "detail_url", "detail_title", "summary"])
    writer.writeheader()
    writer.writerows(enriched)

print(f"Saved {len(enriched)} rows")
Example output:
Saved 10 rows
Using ProxiesAPI for the fetch step
If you want to fetch the same Wikipedia URL through ProxiesAPI, the request shape is simple:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"
You can do the same thing in Python.
import requests
from urllib.parse import quote_plus
def fetch_via_proxiesapi(target_url: str, api_key: str) -> str:
    proxy_url = (
        "http://api.proxiesapi.com/?key="
        f"{api_key}&url={quote_plus(target_url)}"
    )
    response = requests.get(proxy_url, timeout=60)
    response.raise_for_status()
    return response.text

html = fetch_via_proxiesapi(
    "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes",
    api_key="API_KEY",
)
print(html[:200])
That pattern is useful when you want to keep your scraper code mostly unchanged while moving URL fetching into a proxy API layer.
Practical scraping tips for Wikipedia
Wikipedia is friendlier than many commercial sites, but you should still scrape responsibly.
1. Set a user agent
Always identify your script with a descriptive user agent.
2. Add delays between detail-page requests
If you’re following hundreds of links, sleep between requests.
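A minimal sketch of that idea combines a jittered delay with exponential-backoff retries. The delay values and retry count below are illustrative defaults, not requirements:

```python
import random
import time

import requests

def backoff_delays(retries: int, base_delay: float):
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_delay * (2 ** attempt) for attempt in range(retries)]

def polite_get(url, headers=None, retries=3, base_delay=1.0):
    """Fetch a URL with a small randomized pause after success and
    an exponential backoff before each retry after a failure."""
    last_error = None
    for delay in backoff_delays(retries, base_delay):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            # Jittered pause so bursts of detail-page requests spread out.
            time.sleep(base_delay + random.uniform(0, 0.5))
            return response.text
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```

Dropping `polite_get` in place of a bare `requests.get` call is usually enough to keep a small crawl well-behaved.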
3. Expect page structure differences
Some pages use tables, some use lists, some use infoboxes. Build small parser functions for each pattern.
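One way to handle that is a small dispatcher that sniffs the page before choosing a parser. This is a sketch assuming the common Wikipedia class names; the inline fragments below are made-up test data:

```python
from bs4 import BeautifulSoup

def pick_parser(html: str) -> str:
    """Decide which extraction pattern fits a page: 'table', 'list',
    or 'unknown'. Checks the common Wikipedia class names in order."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.select_one("table.wikitable"):
        return "table"
    if soup.select_one("div.mw-parser-output ul > li"):
        return "list"
    return "unknown"

# Exercised on tiny inline fragments rather than live pages:
print(pick_parser('<table class="wikitable"><tr><th>A</th></tr></table>'))        # table
print(pick_parser('<div class="mw-parser-output"><ul><li>item</li></ul></div>'))  # list
```

The dispatcher keeps each parser small and lets you add new patterns (infoboxes, definition lists) without touching existing ones.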
4. Normalize text early
Wikipedia text often includes citations, superscripts, and formatting noise. Strip and normalize before export.
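For example, a small normalizer can drop citation markers like [1] or [note 2] and collapse whitespace before export. The regex here is an assumption tuned to common citation formats, not an exhaustive rule:

```python
import re
import unicodedata

def clean_wiki_text(text: str) -> str:
    """Normalize unicode (e.g. non-breaking spaces), strip citation
    markers like [1] or [note 2], and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\[\s*(?:note\s*)?\d+\s*\]", "", text)  # drop [1], [note 2]
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_wiki_text("200 OK [1] [note 2]  standard\u00a0response"))
# → 200 OK standard response
```

Running every extracted string through one function like this keeps the CSV and JSON outputs consistent.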
5. Save raw HTML during development
When a selector stops working, raw HTML snapshots help you debug quickly.
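A minimal snapshot helper might look like this; the `snapshots` directory name and hash-based filename scheme are arbitrary choices:

```python
import hashlib
from pathlib import Path

def save_snapshot(url: str, html: str, directory: str = "snapshots") -> Path:
    """Write raw HTML to disk keyed by a hash of the URL, so a failing
    selector can be debugged against the exact bytes that were fetched."""
    Path(directory).mkdir(exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = Path(directory) / name
    path.write_text(html, encoding="utf-8")
    return path

path = save_snapshot("https://en.wikipedia.org/wiki/Example", "<html>demo</html>")
print(path)
```

Calling `save_snapshot` right after each fetch costs almost nothing and makes selector regressions reproducible.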
Common selector patterns on Wikipedia
These selectors are useful across many Wikipedia pages:
- Main content: div.mw-parser-output
- Page title: h1#firstHeading
- Table rows: table.wikitable tr
- Infobox: table.infobox
- Internal links: a[href^='/wiki/']
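The infobox selector, for instance, pairs naturally with label/value extraction. Here's a sketch that assumes the common th-label / td-value row layout; the inline fragment is made-up sample data:

```python
from bs4 import BeautifulSoup

def parse_infobox(html: str) -> dict:
    """Turn an infobox's label/value rows into a dict.

    Wikipedia infoboxes usually put the label in a <th> and the value
    in the sibling <td>; rows missing either cell are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    box = soup.select_one("table.infobox")
    fields = {}
    if box:
        for tr in box.select("tr"):
            th, td = tr.select_one("th"), tr.select_one("td")
            if th and td:
                fields[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)
    return fields

# Tiny inline fragment (hypothetical data) standing in for a real infobox:
sample = """
<table class="infobox">
  <tr><th>Developer</th><td>IETF</td></tr>
  <tr><th>Introduced</th><td>1997</td></tr>
</table>
"""
print(parse_infobox(sample))  # {'Developer': 'IETF', 'Introduced': '1997'}
```

Real infoboxes often mix in image rows and nested tables, so expect to filter a few extra keys when you run this against live pages.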
When to use list-page scraping
This workflow is ideal when you need to build datasets from:
- list pages of tools, companies, protocols, places, or people
- category-like pages with lots of linked entries
- reference pages that combine summary + navigation
It’s especially useful for internal research, content enrichment, and structured exports.
Final thoughts
Wikipedia list pages are one of the easiest ways to build a reliable scraper because the patterns are visible and usually repeatable.
The workflow is straightforward:
- fetch the list page
- extract rows or linked entries
- visit selected detail pages
- normalize and export the data
Start with one page, validate your selectors, and only then scale out to more pages and more detail-page requests.
If you want to simplify the request-routing side of your scraper, a fetch layer like ProxiesAPI can keep the networking piece minimal while you focus on parsing and data quality.
ProxiesAPI lets you fetch target URLs through a simple API endpoint, which is handy when you want cleaner request routing and fewer moving parts in your scraping stack.