Scrape Wikipedia Category Pages into CSV

Jun 21, 2026 · tutorial · #python, #wikipedia, #web-scraping, #csv, #beautifulsoup, #proxiesapi

Wikipedia category pages are one of the most useful public structures on the web.

Instead of scraping one article at a time, you can start from a category page, collect all member pages, follow subcategories, and turn the whole tree into a research dataset.

That is useful for:

building topic maps
creating seed lists for later enrichment
collecting public entities by subject, geography, or industry

In this tutorial we will crawl a real category page, walk through subcategories, follow member pagination, and export everything to CSV.

Use ProxiesAPI when the same crawler expands beyond Wikipedia

Wikipedia is friendly, but most production crawls do not stay that way. ProxiesAPI gives you a consistent fetch layer once your category crawler graduates to many public sites.

The page structure we care about

We will use:

https://en.wikipedia.org/wiki/Category:Artificial_intelligence

Wikipedia category pages have two main sections:

#mw-subcategories for child categories
#mw-pages for article members

Inside both sections, links are grouped under:

.mw-category-group a

When a category has more than 200 page members, Wikipedia adds pagination links such as:

#mw-pages a[href*="pagefrom="]

That makes category crawling pleasantly boring, which is exactly what you want in production.
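To make the pagination link concrete, here is a small sketch that resolves a relative "next page" href and reads its pagefrom value. The sample href is illustrative only (an assumption about the link shape, not copied from a live page), and `describe_next_link` is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

BASE = "https://en.wikipedia.org"

# Illustrative href only; the real value comes from the page you fetched.
sample_href = (
    "/w/index.php?title=Category:Artificial_intelligence"
    "&pagefrom=Example+title#mw-pages"
)


def describe_next_link(href: str) -> dict:
    """Resolve a relative pagination href and pull out its pagefrom value."""
    absolute = urljoin(BASE, href)
    query = parse_qs(urlsplit(absolute).query)
    return {
        "url": absolute,
        "pagefrom": query.get("pagefrom", [None])[0],
    }


info = describe_next_link(sample_href)
print(info["url"])
print(info["pagefrom"])
```

The crawler itself only needs the absolute URL, but logging the pagefrom value is a cheap way to see how far through a large category you are.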

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 pandas lxml

Wikipedia usually does not need proxying for light workloads, but I will wire in optional ProxiesAPI support so the same fetch layer works later on less friendly targets.

export PROXIESAPI_KEY="YOUR_KEY"   # optional

Step 1: Fetch a category page

import os
import requests
from urllib.parse import urlencode

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "").strip()


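def maybe_proxy(url: str) -> str:
    """Wrap the URL in a ProxiesAPI call when a key is set; otherwise return it unchanged."""
    if not PROXIESAPI_KEY:
        return url
    # urlencode percent-encodes the target URL so its own query string survives intact.
    return "https://api.proxiesapi.com/?" + urlencode({
        "auth_key": PROXIESAPI_KEY,
        "url": url,
    })


def fetch_html(url: str, session: requests.Session | None = None) -> str:
    """Fetch a page and return its HTML, raising on any 4xx/5xx response."""
    s = session or requests.Session()
    r = s.get(maybe_proxy(url), headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text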


html = fetch_html("https://en.wikipedia.org/wiki/Category:Artificial_intelligence")
print(len(html))
print(html[:150])

You should see normal Wikipedia HTML, not a bot challenge page.
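If you want to check the proxy wrapping without spending an API call, the sketch below builds the wrapped URL and decodes it again. `wrap_with_proxy` is a hypothetical variant of maybe_proxy that takes the key as an argument so it can be tested in isolation, and "DEMO_KEY" is a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def wrap_with_proxy(url: str, key: str) -> str:
    """Same wrapping as maybe_proxy, but with the key passed in explicitly."""
    if not key:
        return url
    return "https://api.proxiesapi.com/?" + urlencode({"auth_key": key, "url": url})


# A target URL with its own query string, like a paginated category page.
target = (
    "https://en.wikipedia.org/w/index.php"
    "?title=Category:Artificial_intelligence&pagefrom=A"
)
wrapped = wrap_with_proxy(target, "DEMO_KEY")  # placeholder key

# Decode the outer query string and confirm the inner URL round-trips.
recovered = parse_qs(urlsplit(wrapped).query)["url"][0]

print(wrapped)
print(recovered == target)
```

The point of the round trip is that the inner `&pagefrom=` stays part of the target URL instead of leaking into the outer query string.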

Step 2: Parse subcategories and page members

from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org"


def parse_category_page(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    subcategories = []
    for a in soup.select("#mw-subcategories .mw-category-group a"):
        title = a.get_text(" ", strip=True)
        href = a.get("href", "").strip()
        if not href.startswith("/wiki/Category:"):
            continue
        subcategories.append({
            "title": title,
            "url": urljoin(BASE, href),
        })

    pages = []
    for a in soup.select("#mw-pages .mw-category-group a"):
        title = a.get_text(" ", strip=True)
        href = a.get("href", "").strip()
        if not href.startswith("/wiki/") or href.startswith("/wiki/Category:"):
            continue
        pages.append({
            "title": title,
            "url": urljoin(BASE, href),
        })

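    # The "next page" link carries a pagefrom= parameter; the "previous page"
    # link uses pageuntil= instead, so this selector only matches forward links.
    next_page = None
    next_link = soup.select_one('#mw-pages a[href*="pagefrom="]')
    if next_link:
        next_page = urljoin(BASE, next_link.get("href"))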

    return {
        "subcategories": subcategories,
        "pages": pages,
        "next_page": next_page,
    }

This mirrors the actual page structure closely:

child categories come from #mw-subcategories
member articles come from #mw-pages
overflow member pages are handled via the pagefrom link
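One thing the page filter above does not do is drop other namespaces: category member lists can include entries such as portals or templates alongside articles. If you only want articles, a stricter check might look like the sketch below. The prefix list is my assumption for illustration, and `is_article_href` is a hypothetical helper, so adjust both to your crawl.

```python
# Namespace prefixes that usually indicate a non-article page on English Wikipedia.
# This list is an assumption for illustration; extend or trim it for your crawl.
NON_ARTICLE_PREFIXES = (
    "Category:", "File:", "Template:", "Portal:", "Wikipedia:",
    "Help:", "Draft:", "User:", "Talk:", "Special:",
)


def is_article_href(href: str) -> bool:
    """Return True for /wiki/ links that do not start with a known namespace prefix."""
    if not href.startswith("/wiki/"):
        return False
    title = href[len("/wiki/"):]
    # str.startswith accepts a tuple and matches if any prefix fits.
    return not title.startswith(NON_ARTICLE_PREFIXES)


for href in [
    "/wiki/Artificial_intelligence",
    "/wiki/Portal:Technology",
    "/wiki/Star_Trek:_Voyager",  # a colon in a title is not a namespace
    "/w/index.php?title=Category:Artificial_intelligence",
]:
    print(href, is_article_href(href))
```

Matching against a fixed prefix list, rather than rejecting every title with a colon, keeps article titles that happen to contain one.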

Step 3: Handle member pagination

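Wikipedia category pages often show all subcategories on one page but split article members into chunks of 200. Very large categories paginate their subcategory list as well, which the helper below does not follow.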

Here is a helper that keeps collecting member pages until there is no next page link left.

def collect_all_members(category_url: str, session: requests.Session | None = None) -> tuple[list[dict], list[dict]]:
    s = session or requests.Session()

    html = fetch_html(category_url, session=s)
    parsed = parse_category_page(html)

    subcategories = parsed["subcategories"]
    pages = list(parsed["pages"])
    next_page = parsed["next_page"]

    while next_page:
        html = fetch_html(next_page, session=s)
        parsed = parse_category_page(html)
        pages.extend(parsed["pages"])
        next_page = parsed["next_page"]

    return subcategories, pages

For the artificial intelligence category, this matters because the category has more members than fit on the first page.
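A loop that follows links until none are left is only as safe as the pages it reads, so I like to add two guards: stop if a URL repeats, and stop after a fixed number of pages. Here is a minimal sketch of that idea. `collect_paginated`, `load_page`, and `max_pages` are hypothetical names, and the fake pages stand in for fetch_html plus parse_category_page so the example runs offline.

```python
from typing import Callable, Optional


def collect_paginated(
    first_url: str,
    load_page: Callable[[str], dict],
    max_pages: int = 50,
) -> list[dict]:
    """Follow next_page links, stopping on a repeat URL or after max_pages."""
    items: list[dict] = []
    seen_urls: set[str] = set()
    url: Optional[str] = first_url

    while url and url not in seen_urls and len(seen_urls) < max_pages:
        seen_urls.add(url)
        parsed = load_page(url)  # expected keys: "pages" and "next_page"
        items.extend(parsed["pages"])
        url = parsed["next_page"]

    return items


# Fake pages standing in for fetch_html + parse_category_page.
FAKE_PAGES = {
    "page-1": {"pages": [{"title": "A"}, {"title": "B"}], "next_page": "page-2"},
    "page-2": {"pages": [{"title": "C"}], "next_page": "page-1"},  # loops back on purpose
}

members = collect_paginated("page-1", FAKE_PAGES.__getitem__)
print([m["title"] for m in members])
```

In the real crawler, `load_page` could be a small lambda that calls fetch_html and then parse_category_page, and it is also a natural place to add a delay between paginated requests.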

Step 4: Crawl the category tree

Now we can build a small breadth-first crawler.

from collections import deque
import time


def crawl_category_tree(root_url: str, max_categories: int = 20, delay_seconds: float = 1.0) -> list[dict]:
    session = requests.Session()
    queue = deque([(root_url, None)])
    seen_categories = set()
    rows = []

    while queue and len(seen_categories) < max_categories:
        category_url, parent_url = queue.popleft()
        if category_url in seen_categories:
            continue

        seen_categories.add(category_url)
        print(f"visiting category {len(seen_categories)}: {category_url}")

        subcategories, pages = collect_all_members(category_url, session=session)

        for sub in subcategories:
            rows.append({
                "kind": "subcategory",
                "parent_category_url": category_url,
                "parent_category_parent_url": parent_url,
                "title": sub["title"],
                "url": sub["url"],
            })
            if sub["url"] not in seen_categories:
                queue.append((sub["url"], category_url))

        for page in pages:
            rows.append({
                "kind": "page",
                "parent_category_url": category_url,
                "parent_category_parent_url": parent_url,
                "title": page["title"],
                "url": page["url"],
            })

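        # Pause between categories. Paginated member fetches inside
        # collect_all_members are not throttled by this delay.
        time.sleep(delay_seconds)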

    return rows

This produces two useful record types:

subcategory rows so you preserve the tree structure
page rows so you can enrich article members later
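If you also want to know how far each category sits from the root, the queue can carry a depth value alongside the URL. The sketch below shows the idea on a hypothetical in-memory tree (`FAKE_TREE`) instead of live requests; `walk_with_depth` and `max_depth` are illustrative names, not part of the crawler above.

```python
from collections import deque

# Hypothetical category graph used in place of live Wikipedia requests.
FAKE_TREE = {
    "root": ["child-a", "child-b"],
    "child-a": ["grandchild"],
    "child-b": ["grandchild"],  # shared subcategory, reachable by two paths
    "grandchild": [],
}


def walk_with_depth(root: str, max_depth: int = 2) -> list[dict]:
    """Breadth-first walk that records how far each category is from the root."""
    queue = deque([(root, 0)])
    seen = set()
    rows = []

    while queue:
        name, depth = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        rows.append({"category": name, "depth": depth})

        # Only expand children while we are above the depth limit.
        if depth < max_depth:
            for child in FAKE_TREE.get(name, []):
                queue.append((child, depth + 1))

    return rows


for row in walk_with_depth("root"):
    print(row)
```

Because the walk is breadth-first, the first time a category is seen is also its shortest distance from the root, which is why the shared subcategory is recorded once.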

Step 5: Save a clean CSV

import pandas as pd

ROOT = "https://en.wikipedia.org/wiki/Category:Artificial_intelligence"

rows = crawl_category_tree(ROOT, max_categories=12, delay_seconds=1.0)
df = pd.DataFrame(rows)

df = df.drop_duplicates(subset=["kind", "parent_category_url", "url"])
df.to_csv("wikipedia_category_tree.csv", index=False)

print("rows:", len(df))
print(df.head(10).to_dict(orient="records"))

Example output shape:

[
  {
    'kind': 'subcategory',
    'parent_category_url': 'https://en.wikipedia.org/wiki/Category:Artificial_intelligence',
    'parent_category_parent_url': None,
    'title': 'Affective computing',
    'url': 'https://en.wikipedia.org/wiki/Category:Affective_computing'
  },
  {
    'kind': 'page',
    'parent_category_url': 'https://en.wikipedia.org/wiki/Category:Artificial_intelligence',
    'parent_category_parent_url': None,
    'title': 'Artificial intelligence',
    'url': 'https://en.wikipedia.org/wiki/Artificial_intelligence'
  }
]
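If you would rather not pull in pandas just for the export, the standard library's csv module can write the same rows. This is a sketch of an alternative, not a replacement for the step above; `rows_to_csv` and `FIELDS` are hypothetical names, and the sample row is made up to match the record shape.

```python
import csv
import io

FIELDS = ["kind", "parent_category_url", "parent_category_parent_url", "title", "url"]

# One made-up row in the same shape the crawler produces.
sample_rows = [
    {
        "kind": "page",
        "parent_category_url": "https://en.wikipedia.org/wiki/Category:Artificial_intelligence",
        "parent_category_parent_url": None,  # root category has no parent
        "title": "Artificial intelligence",
        "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    },
]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize crawler rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        restval="",             # fill columns missing from a row
        extrasaction="ignore",  # skip keys that are not in FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)  # the csv module writes None as an empty cell
    return buffer.getvalue()


csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

To write a file instead of a string, open it with `newline=""` and pass the file object to `csv.DictWriter` in place of the buffer.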

A few improvements worth adding

Once the basic CSV works, the next upgrades are straightforward:

Upgrade	Why it helps	Effort
Add `depth` column	Lets you analyze the tree by distance from the root	Low
Save category titles too	Easier downstream joins and debugging	Low
Export JSON as well as CSV	Better for nested workflows	Low
Enrich page summaries later	Turns a link list into a topic dataset	Medium
Cache HTML responses	Speeds up reruns and reduces requests	Medium

One practical pattern is to keep this crawl narrow:

first crawl category membership
then enrich only the pages that matter

That keeps your first pass fast and reliable.
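The caching upgrade from the table is worth sketching, since it is what makes reruns cheap while you iterate on the parser. This is one possible shape, assuming a simple one-file-per-URL layout keyed by a SHA-256 hash; `cached_fetch` is a hypothetical wrapper, and the fake fetcher stands in for fetch_html so the demo runs offline.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return cached HTML for url if present; otherwise call fetch and store the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"

    if path.exists():
        return path.read_text(encoding="utf-8")

    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html


# Demo with a fake fetcher that records how often it is called.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.org/a", fake_fetch, Path(tmp))
    second = cached_fetch("https://example.org/a", fake_fetch, Path(tmp))

print(first == second, len(calls))
```

This version never expires entries, which is fine for a one-off research crawl but would need a freshness check if you rerun it over weeks.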

Practical notes for Wikipedia category crawls

Wikipedia is forgiving, but a few habits still matter:

respect pagination instead of guessing counts
dedupe page URLs because related categories overlap
keep delays modest and consistent
store the parent category so you do not lose the tree context
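The dedupe and parent-category habits pull in opposite directions: dropping duplicate page URLs loses the fact that a page sits in several categories. One way to keep both, sketched below, is to collapse page rows by URL while collecting every parent that listed the page. `merge_page_rows` is a hypothetical helper, and the sample rows are made up in the crawler's row shape.

```python
def merge_page_rows(rows: list[dict]) -> list[dict]:
    """Collapse page rows by URL, keeping every parent category that listed the page."""
    merged: dict[str, dict] = {}
    for row in rows:
        if row["kind"] != "page":
            continue  # subcategory rows are left out of this view
        entry = merged.setdefault(
            row["url"],
            {"title": row["title"], "url": row["url"], "parent_category_urls": []},
        )
        parent = row["parent_category_url"]
        if parent not in entry["parent_category_urls"]:
            entry["parent_category_urls"].append(parent)
    return list(merged.values())


# Made-up rows: the same page listed under two different categories.
sample_rows = [
    {"kind": "page", "parent_category_url": "cat-a",
     "title": "Machine learning", "url": "https://en.wikipedia.org/wiki/Machine_learning"},
    {"kind": "page", "parent_category_url": "cat-b",
     "title": "Machine learning", "url": "https://en.wikipedia.org/wiki/Machine_learning"},
    {"kind": "subcategory", "parent_category_url": "cat-a",
     "title": "Affective computing", "url": "https://en.wikipedia.org/wiki/Category:Affective_computing"},
]

for page in merge_page_rows(sample_rows):
    print(page["title"], page["parent_category_urls"])
```

The merged view suits enrichment, where you want one fetch per page, while the original rows remain the record of the tree.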

Also note that many categories contain both broad topic pages and very specific edge cases. That is normal. The job of the scraper is to collect the graph cleanly. Filtering comes later.

If your long-term goal is taxonomy building, seed generation, or entity collection, category pages are one of the highest-leverage entry points on Wikipedia.

They are predictable, link-rich, and easy to turn into a CSV you can actually use.
