Scrape Wikipedia Article Data at Scale (Tables + Infobox + Links)

Wikipedia is one of the best “practice arenas” for web scraping: pages are server-rendered, HTML is consistent, and a lot of valuable data is already structured (infoboxes, tables, categories, and internal links).


In this tutorial, you’ll build a scraper that can:

  • fetch many Wikipedia article pages reliably
  • extract infobox fields (key/value pairs)
  • extract tables (like wikitable)
  • extract internal links (for crawling)
  • save results to JSON and CSV

We’ll use Python with requests + BeautifulSoup, and we’ll show exactly where ProxiesAPI fits in.

Make high-volume Wikipedia fetches stable with ProxiesAPI

When you move from 10 pages to 10,000, the network layer becomes the bottleneck. ProxiesAPI gives you a simple, consistent fetch interface so your scraper code stays clean while your crawl scales.


What we’re scraping (Wikipedia page structure)

Most Wikipedia articles share a few structural patterns:

  • The main content lives under div#mw-content-text
  • The infobox is usually a <table> with a class containing infobox
  • Many structured tables use the wikitable class
  • Internal links are simple <a href="/wiki/..."> anchors

A simplified infobox looks like:

<table class="infobox ...">
  <tr>
    <th scope="row">Born</th>
    <td>...</td>
  </tr>
</table>

And a typical wikitable:

<table class="wikitable">
  <tr><th>Col 1</th><th>Col 2</th></tr>
  <tr><td>...</td><td>...</td></tr>
</table>

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup with the lxml parser, which is faster and more fault-tolerant than the built-in html.parser
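Before parsing real pages, it can help to confirm which parsers your environment actually has. This small, self-contained check (nothing Wikipedia-specific, just sample HTML) degrades gracefully if lxml isn't installed:

```python
from bs4 import BeautifulSoup

# bs4 raises FeatureNotFound if you request a parser that isn't installed,
# so probe each candidate before committing to it.
html = "<p>Hello <b>world</b></p>"
for parser in ("lxml", "html.parser"):
    try:
        soup = BeautifulSoup(html, parser)
        print(parser, "->", soup.get_text())
    except Exception as exc:
        print(parser, "unavailable:", exc)
```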

Step 1: Fetch HTML (direct vs ProxiesAPI)

Option A — direct fetch (good for small runs)

import requests

TIMEOUT = (10, 30)
session = requests.Session()


def fetch_direct(url: str) -> str:
    r = session.get(
        url,
        timeout=TIMEOUT,
        headers={
            "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"
        },
    )
    r.raise_for_status()
    return r.text

Option B — fetch through ProxiesAPI (good for large runs)

ProxiesAPI gives you a single, consistent HTTP interface:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://en.wikipedia.org/wiki/Web_scraping" | head

In Python:

import urllib.parse
import requests

PROXIESAPI_KEY = "API_KEY"  # <- set your real key
TIMEOUT = (10, 60)


def fetch_via_proxiesapi(url: str) -> str:
    api = "http://api.proxiesapi.com/"
    params = {
        "key": PROXIESAPI_KEY,
        "url": url,
    }
    req_url = api + "?" + urllib.parse.urlencode(params)

    r = requests.get(
        req_url,
        timeout=TIMEOUT,
        headers={
            "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"
        },
    )
    r.raise_for_status()
    return r.text

In the rest of this tutorial, we’ll write our scraper to accept a fetch(url) function so you can switch between direct and ProxiesAPI easily.
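The shape of that design is easy to sketch. Here a stub fetcher stands in for fetch_direct or fetch_via_proxiesapi, and the scraper body is a placeholder for the parsing we build below:

```python
from typing import Callable

Fetcher = Callable[[str], str]


def make_scraper(fetch: Fetcher) -> Callable[[str], int]:
    # The scraper only depends on "URL in, HTML out", so the transport
    # (direct vs. ProxiesAPI) is swappable at exactly one seam.
    def scrape(url: str) -> int:
        html = fetch(url)
        return len(html)  # placeholder for real parsing
    return scrape


# Stub fetcher for illustration; pass fetch_direct or fetch_via_proxiesapi in real runs.
def fake_fetch(url: str) -> str:
    return "<html>stub</html>"


scrape = make_scraper(fake_fetch)
print(scrape("https://en.wikipedia.org/wiki/Web_scraping"))  # 17
```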


Step 2: Parse an infobox into a dictionary

Here’s a robust approach:

  • locate the first table whose class contains infobox
  • iterate tr rows
  • use th as the key and td as the value
  • normalize whitespace

from bs4 import BeautifulSoup


def clean_text(el) -> str:
    if not el:
        return ""
    # collapse whitespace runs; footnote markers like [1] remain as plain text
    return " ".join(el.get_text(" ", strip=True).split())


def parse_infobox(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    box = soup.select_one("table.infobox")
    if not box:
        # Some pages use different infobox variants; try a contains match.
        box = soup.select_one("table[class*='infobox']")

    if not box:
        return {}

    data = {}
    for row in box.select("tr"):
        key = row.select_one("th")
        val = row.select_one("td")
        k = clean_text(key)
        v = clean_text(val)
        if k and v:
            data[k] = v

    return data
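You can exercise the parser offline against the simplified infobox HTML shown earlier. The helpers are redefined compactly here so the snippet is self-contained, and html.parser is used so it runs even without lxml:

```python
from bs4 import BeautifulSoup

SAMPLE = """
<table class="infobox vcard">
  <tr><th scope="row">Born</th><td>1990</td></tr>
  <tr><th scope="row">Occupation</th><td>Engineer</td></tr>
</table>
"""


def clean_text(el) -> str:
    # collapse whitespace runs into single spaces
    return " ".join(el.get_text(" ", strip=True).split()) if el else ""


def parse_infobox(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    box = soup.select_one("table[class*='infobox']")
    if not box:
        return {}
    data = {}
    for row in box.select("tr"):
        k = clean_text(row.select_one("th"))
        v = clean_text(row.select_one("td"))
        if k and v:
            data[k] = v
    return data


print(parse_infobox(SAMPLE))  # {'Born': '1990', 'Occupation': 'Engineer'}
```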

Quick sanity check

url = "https://en.wikipedia.org/wiki/Web_scraping"
html = fetch_via_proxiesapi(url)
infobox = parse_infobox(html)
print("infobox keys:", len(infobox))
print(list(infobox)[:8])

Typical output (varies by page):

infobox keys: 5
['Paradigm', 'Type', 'Developer(s)', 'Initial release', 'License']

Step 3: Extract all wikitable tables as rows

Wikipedia pages can contain many tables; we’ll focus on table.wikitable.

We’ll return tables as a list of dictionaries:

  • caption
  • headers
  • rows (each row is a list of cell texts)

def parse_wikitables(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    tables = []

    for t in soup.select("table.wikitable"):
        caption_el = t.select_one("caption")
        caption = clean_text(caption_el)

        # headers: use the first row that actually contains <th> cells
        headers = []
        for tr in t.select("tr"):
            ths = tr.select("th")
            if ths:
                headers = [clean_text(th) for th in ths]
                break

        rows = []
        for tr in t.select("tr"):
            tds = tr.select("td")
            if not tds:
                continue
            rows.append([clean_text(td) for td in tds])

        tables.append({
            "caption": caption,
            "headers": headers,
            "rows": rows,
        })

    return tables
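A quick offline check against the wikitable shape from earlier confirms the caption/headers/rows split. This is a compact redefinition (html.parser, no lxml needed), not a replacement for the function above:

```python
from bs4 import BeautifulSoup

SAMPLE = """
<table class="wikitable">
  <caption>Releases</caption>
  <tr><th>Version</th><th>Year</th></tr>
  <tr><td>1.0</td><td>2020</td></tr>
  <tr><td>2.0</td><td>2022</td></tr>
</table>
"""


def clean_text(el) -> str:
    return " ".join(el.get_text(" ", strip=True).split()) if el else ""


def parse_wikitables(html: str) -> list:
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for t in soup.select("table.wikitable"):
        # headers come from the first row that has <th> cells
        headers = []
        for tr in t.select("tr"):
            ths = tr.select("th")
            if ths:
                headers = [clean_text(th) for th in ths]
                break
        # data rows are any <tr> that contains <td> cells
        rows = [[clean_text(td) for td in tr.select("td")]
                for tr in t.select("tr") if tr.select("td")]
        tables.append({
            "caption": clean_text(t.select_one("caption")),
            "headers": headers,
            "rows": rows,
        })
    return tables


print(parse_wikitables(SAMPLE))
```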

Step 4: Extract internal links (for crawling)

To crawl Wikipedia, you usually want to keep it scoped to:

  • /wiki/... links
  • skip special pages like Help: or Special:

import re


def extract_internal_links(html: str, limit: int = 200) -> list[str]:
    soup = BeautifulSoup(html, "lxml")

    links = []
    seen = set()

    for a in soup.select("div#mw-content-text a[href]"):
        href = a.get("href")
        if not href:
            continue

        if not href.startswith("/wiki/"):
            continue

        # Skip special namespaces
        if re.search(r"^/wiki/(Special|Help|Talk|File|Category|Template):", href):
            continue

        if href in seen:
            continue

        seen.add(href)
        links.append("https://en.wikipedia.org" + href)

        if len(links) >= limit:
            break

    return links

Step 5: Put it together for one page

import json


def parse_article(url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title_el = soup.select_one("h1#firstHeading")
    title = clean_text(title_el)

    return {
        "url": url,
        "title": title,
        "infobox": parse_infobox(html),
        "wikitables": parse_wikitables(html),
        "internal_links": extract_internal_links(html, limit=200),
    }


url = "https://en.wikipedia.org/wiki/Web_scraping"
html = fetch_via_proxiesapi(url)
article = parse_article(url, html)

print(article["title"], "infobox:", len(article["infobox"]), "tables:", len(article["wikitables"]))

with open("wikipedia_article.json", "w", encoding="utf-8") as f:
    json.dump(article, f, ensure_ascii=False, indent=2)

print("wrote wikipedia_article.json")

Example run:

Web scraping infobox: 5 tables: 0
wrote wikipedia_article.json

(Your table count depends on the specific page you scrape.)


Step 6: Scale to many pages (batch + retries)

When you scrape at scale, two things matter most:

  1. you will hit transient failures (timeouts, occasional 429s, temporary network errors)
  2. you need a way to resume without losing progress

This simple pipeline:

  • reads a list of URLs
  • fetches each page with retries
  • writes one JSON per URL (easy to resume)
  • also writes a compact CSV summary

import csv
import time
import random
from pathlib import Path


def fetch_with_retries(fetch_fn, url: str, attempts: int = 4) -> str:
    last = None
    for i in range(1, attempts + 1):
        try:
            return fetch_fn(url)
        except Exception as e:
            last = e
            if i == attempts:
                break  # no point sleeping after the final attempt
            sleep = min(30, (2 ** i) + random.random())
            print(f"fetch failed (attempt {i}/{attempts}) {url}: {e}; sleeping {sleep:.1f}s")
            time.sleep(sleep)
    raise last


def run_batch(urls: list[str], out_dir: str = "out_wikipedia"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    rows = []
    for idx, url in enumerate(urls, start=1):
        # some titles (e.g. AC/DC) contain slashes, which would break the file path
        slug = url.split("/wiki/")[-1].replace("/", "_")
        out_path = out / f"{slug}.json"
        if out_path.exists():
            print("skip", url)
            continue

        html = fetch_with_retries(fetch_via_proxiesapi, url)
        article = parse_article(url, html)

        out_path.write_text(json.dumps(article, ensure_ascii=False, indent=2), encoding="utf-8")
        print(f"[{idx}/{len(urls)}] wrote", out_path)

        rows.append({
            "url": url,
            "title": article["title"],
            "infobox_keys": len(article["infobox"]),
            "tables": len(article["wikitables"]),
            "links": len(article["internal_links"]),
        })

    # summary CSV
    with open(out / "summary.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["url", "title", "infobox_keys", "tables", "links"])
        w.writeheader()
        w.writerows(rows)

    print("wrote", out / "summary.csv")

Try it with a small seed set:

seed = [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)",
    "https://en.wikipedia.org/wiki/Requests_(software)",
]
run_batch(seed)

Practical notes (don’t skip these)

1) Be gentle with request rates

Even if a site is permissive, high burst traffic is rarely appreciated. Add pacing if you’re doing large crawls.
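One lightweight way to add pacing is to wrap whichever fetcher you use in a jittered delay. This is a sketch; the delay bounds are arbitrary and worth tuning to your crawl:

```python
import random
import time


def paced(fetch_fn, min_delay: float = 1.0, max_delay: float = 2.5):
    # Wrap any fetcher so each call waits a small, jittered delay first;
    # jitter avoids the lockstep bursts a fixed sleep would produce.
    def wrapper(url: str) -> str:
        time.sleep(random.uniform(min_delay, max_delay))
        return fetch_fn(url)
    return wrapper


# Usage in run_batch: fetch_with_retries(paced(fetch_via_proxiesapi), url)
```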

2) Prefer “write one file per URL” for resumability

Single huge JSON files are annoying to resume. One-file-per-URL makes retries and partial progress easy.

3) Keep your parsing defensive

Wikipedia templates vary. Your parse_infobox() returning {} is not a failure — it’s expected for pages without an infobox.


Where ProxiesAPI fits (honestly)

Wikipedia is relatively friendly. You can scrape it directly.

But the moment your workflow becomes:

  • many URLs
  • multiple retries
  • multiple runs per day

…then the fetch layer becomes “the thing” you spend time debugging.

ProxiesAPI keeps the fetching interface simple so you can focus on parsing and data quality.


Checklist

  • Fetch works with a timeout
  • Infobox extraction returns sane key/value pairs
  • Tables (if present) parse into headers + rows
  • Link extractor stays within /wiki/ scope
  • Batch runner can resume by skipping existing files
