Scrape Wikipedia Category Pages into CSV

Wikipedia category pages are a convenient public index: each one lists the articles and subcategories that editors have filed under a topic.

Instead of scraping one article at a time, you can start from a category page, collect all member pages, follow subcategories, and turn the whole tree into a research dataset.

That is useful for:

  • building topic maps
  • creating seed lists for later enrichment
  • collecting public entities by subject, geography, or industry

In this tutorial we will crawl a real category page, walk through subcategories, follow member pagination, and export everything to CSV.

Use ProxiesAPI when the same crawler expands beyond Wikipedia

Wikipedia is friendly, but most production crawls do not stay that way. ProxiesAPI gives you a consistent fetch layer once your category crawler graduates to many public sites.


The page structure we care about

We will use:

https://en.wikipedia.org/wiki/Category:Artificial_intelligence

Wikipedia category pages have two main sections:

  • #mw-subcategories for child categories
  • #mw-pages for article members

Inside both sections, links are grouped under:

  • .mw-category-group a

When a category has more than 200 page members, Wikipedia adds pagination links such as:

  • #mw-pages a[href*="pagefrom="]

That makes category crawling pleasantly boring, which is exactly what you want in production.
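To make the pagination link concrete, here is a minimal sketch that resolves a next-page href and reads its query parameters with the standard library. The href is a hypothetical value shaped like the pagefrom pattern above, not one copied from a live page.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE = "https://en.wikipedia.org"

# Hypothetical href, modeled on the pagefrom links described above.
example_href = (
    "/w/index.php?title=Category:Artificial_intelligence"
    "&pagefrom=Machine+learning#mw-pages"
)


def describe_next_link(href):
    """Resolve a relative pagination href and expose its query parameters."""
    absolute = urljoin(BASE, href)
    query = parse_qs(urlsplit(absolute).query)
    return {
        "url": absolute,
        "title": query.get("title", [None])[0],
        "pagefrom": query.get("pagefrom", [None])[0],
    }


info = describe_next_link(example_href)
print(info["title"], "continues from:", info["pagefrom"])
```

The pagefrom value is the sort key where the next chunk of members starts, which is why following the link is more reliable than trying to build page URLs by hand.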


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 pandas lxml

Wikipedia usually does not need proxying for light workloads, but we will wire in optional ProxiesAPI support so the same fetch layer works later on less friendly targets.

export PROXIESAPI_KEY="YOUR_KEY"   # optional

Step 1: Fetch a category page

import os
import requests
from urllib.parse import urlencode

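TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
# Wikimedia asks automated clients to send a descriptive User-Agent with contact
# details rather than imitate a browser. The name and address are placeholders.
HEADERS = {
    "User-Agent": "category-tree-crawler/0.1 (contact: you@example.com)",
    "Accept-Language": "en-US,en;q=0.9",
}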

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "").strip()


def maybe_proxy(url: str) -> str:
    if not PROXIESAPI_KEY:
        return url
    return "https://api.proxiesapi.com/?" + urlencode({
        "auth_key": PROXIESAPI_KEY,
        "url": url,
    })


def fetch_html(url: str, session: requests.Session | None = None) -> str:
    s = session or requests.Session()
    r = s.get(maybe_proxy(url), headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


html = fetch_html("https://en.wikipedia.org/wiki/Category:Artificial_intelligence")
print(len(html))
print(html[:150])

You should see normal Wikipedia HTML, not a bot challenge page.
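Transient network errors are common on longer crawls, so it can help to wrap the fetch in a small retry loop. The sketch below is one possible approach, not part of the original crawler: fetch_with_retries and its parameters are hypothetical names, and it takes the fetch function as an argument so it stays independent of requests.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # in real use, narrow this to the errors you expect
            last_error = exc
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, then 4x, ... between attempts.
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

With the function from Step 1, a call would look like fetch_with_retries(fetch_html, url). The sleep parameter exists only so the delay can be replaced in tests.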


Step 2: Parse subcategories and page members

from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org"


def parse_category_page(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    subcategories = []
    for a in soup.select("#mw-subcategories .mw-category-group a"):
        title = a.get_text(" ", strip=True)
        href = a.get("href", "").strip()
        if not href.startswith("/wiki/Category:"):
            continue
        subcategories.append({
            "title": title,
            "url": urljoin(BASE, href),
        })

    pages = []
    for a in soup.select("#mw-pages .mw-category-group a"):
        title = a.get_text(" ", strip=True)
        href = a.get("href", "").strip()
        if not href.startswith("/wiki/") or href.startswith("/wiki/Category:"):
            continue
        pages.append({
            "title": title,
            "url": urljoin(BASE, href),
        })

    next_page = None
    next_link = soup.select_one('#mw-pages a[href*="pagefrom="]')
    if next_link:
        next_page = urljoin(BASE, next_link.get("href"))

    return {
        "subcategories": subcategories,
        "pages": pages,
        "next_page": next_page,
    }

This mirrors the actual page structure closely:

  • child categories come from #mw-subcategories
  • member articles come from #mw-pages
  • overflow member pages are handled via the pagefrom link
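The href checks in the parser can also be read as a small classification rule. The sketch below restates that rule as a standalone helper so it can be tested without fetching anything; classify_href is a hypothetical name, not something the parser above defines.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org"


def classify_href(href):
    """Return (kind, absolute_url) for a category-page link, or None to skip it."""
    href = (href or "").strip()
    if href.startswith("/wiki/Category:"):
        return "subcategory", urljoin(BASE, href)
    if href.startswith("/wiki/"):
        return "page", urljoin(BASE, href)
    # Anything else (pagination links, external links) is not a member.
    return None


print(classify_href("/wiki/Category:Machine_learning"))
print(classify_href("/wiki/Artificial_intelligence"))
print(classify_href("/w/index.php?title=Category:X&pagefrom=Y"))
```

Note that the parser is stricter than this helper in one respect: it only accepts category links inside #mw-subcategories and only article links inside #mw-pages, so the section still decides which kind is kept.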

Step 3: Handle member pagination

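Wikipedia category pages often show all subcategories on one page but split article members into chunks of up to 200. Very large categories can paginate their subcategory list as well; the helper in this step follows member pagination only.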

Here is a helper that keeps collecting member pages until there is no next page link left.

def collect_all_members(category_url: str, session: requests.Session | None = None) -> tuple[list[dict], list[dict]]:
    s = session or requests.Session()

    html = fetch_html(category_url, session=s)
    parsed = parse_category_page(html)

    subcategories = parsed["subcategories"]
    pages = list(parsed["pages"])
    next_page = parsed["next_page"]

    while next_page:
        html = fetch_html(next_page, session=s)
        parsed = parse_category_page(html)
        pages.extend(parsed["pages"])
        next_page = parsed["next_page"]

    return subcategories, pages

For a large category such as Artificial intelligence, this matters whenever the member list runs past the first page of 200 entries: without the loop, later pages would be dropped silently.
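A while loop that follows links is worth guarding, since a parsing mistake could make it revisit the same page forever. The following is a minimal sketch of the same pagination idea with a page cap and a repeat check. follow_pagination and get_page are hypothetical names, and get_page is assumed to return a pair of (items, next_url).

```python
def follow_pagination(first_url, get_page, max_pages=50):
    """Collect items across linked pages, stopping on a repeated URL or the cap."""
    items = []
    seen = set()
    url = first_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, url = get_page(url)  # url becomes the next link, or None
        items.extend(page_items)
    return items


# Tiny in-memory stand-in for two category pages that link to each other.
fake_pages = {
    "page-1": (["Alpha", "Beta"], "page-2"),
    "page-2": (["Gamma"], "page-1"),  # deliberately loops back
}
print(follow_pagination("page-1", fake_pages.__getitem__))
```

To use it with the functions above, get_page could fetch a URL with fetch_html, run parse_category_page on the result, and return the "pages" and "next_page" values.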


Step 4: Crawl the category tree

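Now we can build a small breadth-first crawler. The seen_categories set keeps it from revisiting a category that is reachable through more than one parent, and max_categories caps how far the crawl expands.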

from collections import deque
import time


def crawl_category_tree(root_url: str, max_categories: int = 20, delay_seconds: float = 1.0) -> list[dict]:
    session = requests.Session()
    queue = deque([(root_url, None)])
    seen_categories = set()
    rows = []

    while queue and len(seen_categories) < max_categories:
        category_url, parent_url = queue.popleft()
        if category_url in seen_categories:
            continue

        seen_categories.add(category_url)
        print(f"visiting category {len(seen_categories)}: {category_url}")

        subcategories, pages = collect_all_members(category_url, session=session)

        for sub in subcategories:
            rows.append({
                "kind": "subcategory",
                "parent_category_url": category_url,
                "parent_category_parent_url": parent_url,
                "title": sub["title"],
                "url": sub["url"],
            })
            if sub["url"] not in seen_categories:
                queue.append((sub["url"], category_url))

        for page in pages:
            rows.append({
                "kind": "page",
                "parent_category_url": category_url,
                "parent_category_parent_url": parent_url,
                "title": page["title"],
                "url": page["url"],
            })

        time.sleep(delay_seconds)

    return rows

The crawler returns two kinds of rows:

  • subcategory rows so you preserve the tree structure
  • page rows so you can enrich article members later

Step 5: Save a clean CSV

import pandas as pd

ROOT = "https://en.wikipedia.org/wiki/Category:Artificial_intelligence"

rows = crawl_category_tree(ROOT, max_categories=12, delay_seconds=1.0)
df = pd.DataFrame(rows)

df = df.drop_duplicates(subset=["kind", "parent_category_url", "url"])
df.to_csv("wikipedia_category_tree.csv", index=False)

print("rows:", len(df))
print(df.head(10).to_dict(orient="records"))

Example output shape (rows found directly under the root have no grandparent, so parent_category_parent_url is None):

[
  {
    'kind': 'subcategory',
    'parent_category_url': 'https://en.wikipedia.org/wiki/Category:Artificial_intelligence',
    'parent_category_parent_url': None,
    'title': 'Affective computing',
    'url': 'https://en.wikipedia.org/wiki/Category:Affective_computing'
  },
  {
    'kind': 'page',
    'parent_category_url': 'https://en.wikipedia.org/wiki/Category:Artificial_intelligence',
    'parent_category_parent_url': None,
    'title': 'Artificial intelligence',
    'url': 'https://en.wikipedia.org/wiki/Artificial_intelligence'
  }
]
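If pandas is not available, or a JSON copy is wanted alongside the CSV, the same row dictionaries can be written with the standard library. This is a rough sketch under that assumption; write_rows is a hypothetical helper, and FIELDS simply lists the keys that crawl_category_tree emits.

```python
import csv
import json

FIELDS = [
    "kind",
    "parent_category_url",
    "parent_category_parent_url",
    "title",
    "url",
]


def write_rows(rows, csv_path, json_path):
    """Write the same row dicts to a CSV file and a JSON file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops any keys that are not listed in FIELDS.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)  # None values are written as empty cells
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```

A call such as write_rows(rows, "wikipedia_category_tree.csv", "wikipedia_category_tree.json") would produce both files from one crawl. Unlike the pandas version, this sketch does not drop duplicate rows.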

A few improvements worth adding

Once the basic CSV works, the next upgrades are straightforward:

Upgrade | Why it helps | Effort
Add depth column | Lets you analyze the tree by distance from the root | Low
Save category titles too | Easier downstream joins and debugging | Low
Export JSON as well as CSV | Better for nested workflows | Low
Enrich page summaries later | Turns a link list into a topic dataset | Medium
Cache HTML responses | Speeds up reruns and reduces requests | Medium

One practical pattern is to keep this crawl narrow:

  • first crawl category membership
  • then enrich only the pages that matter

That keeps your first pass fast and reliable.
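Caching fetched HTML is one of the cheaper upgrades to try. The sketch below stores each response on disk under a hash of its URL, so a rerun reads the file instead of requesting the page again. cached_fetch and the html_cache directory are hypothetical names, and the cache never expires, which is an assumption that suits one-off research crawls better than long-running ones.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="html_cache"):
    """Return cached HTML for url if present; otherwise call fetch(url) and store it."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the filename is safe regardless of query strings.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = folder / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

With the Step 1 function this would be called as cached_fetch(url, fetch_html). Deleting the cache directory forces a fresh crawl.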


Practical notes for Wikipedia category crawls

Wikipedia is forgiving, but a few habits still matter:

  • respect pagination instead of guessing counts
  • dedupe page URLs because related categories overlap
  • keep delays modest and consistent
  • store the parent category so you do not lose the tree context
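As an illustration of the dedupe habit, the sketch below collapses page rows to one entry per URL while keeping every category the page was found under. unique_pages is a hypothetical helper, and it assumes rows shaped like the ones crawl_category_tree returns.

```python
def unique_pages(rows):
    """Collapse page rows to one entry per URL, keeping every parent category."""
    pages = {}
    for row in rows:
        if row["kind"] != "page":
            continue  # subcategory rows describe the tree, not members
        entry = pages.setdefault(
            row["url"],
            {"title": row["title"], "url": row["url"], "categories": []},
        )
        if row["parent_category_url"] not in entry["categories"]:
            entry["categories"].append(row["parent_category_url"])
    return list(pages.values())


# Made-up rows: the same article appears under two categories.
sample_rows = [
    {"kind": "page", "parent_category_url": "cat-A", "title": "Turing test", "url": "u1"},
    {"kind": "subcategory", "parent_category_url": "cat-A", "title": "Cat B", "url": "cat-B"},
    {"kind": "page", "parent_category_url": "cat-B", "title": "Turing test", "url": "u1"},
]
print(unique_pages(sample_rows))
```

Keeping the category list on each page preserves the overlap information instead of discarding it, which tends to be useful when filtering later.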

Also note that many categories contain both broad topic pages and very specific edge cases. That is normal. The job of the scraper is to collect the graph cleanly. Filtering comes later.

If your long-term goal is taxonomy building, seed generation, or entity collection, category pages are one of the highest-leverage entry points on Wikipedia.

They are predictable, link-rich, and easy to turn into a CSV you can actually use.

