Scrape OpenStreetMap Wiki pages with Python

The OpenStreetMap Wiki is a useful scraping target when you need structured documentation, category listings, and linked reference pages for mapping workflows, community research, or content monitoring.

In this guide, we’ll scrape OpenStreetMap Wiki pages with Python by:

  • fetching a category page
  • extracting the linked wiki entries
  • visiting each linked page
  • pulling titles and summary text
  • exporting the result to JSON and CSV

We’ll use requests and BeautifulSoup, and we’ll show how to swap the fetch step to ProxiesAPI when you want a proxy-backed request flow.

Keep the request layer simple

Once you move from a handful of wiki pages to a larger crawl, ProxiesAPI gives you a simple fetch endpoint so your parser code can stay focused on extraction and exports.

Target pages

We’ll use a real category page:

  • Main page: https://wiki.openstreetmap.org/wiki/Main_Page
  • Category page: https://wiki.openstreetmap.org/wiki/Category:Beginners%27_guide

That category page is a good example because it contains a list of wiki pages inside a familiar MediaWiki layout.

Install dependencies

pip install requests beautifulsoup4

Step 1: Fetch the category page

Start with a simple request.

import requests
from bs4 import BeautifulSoup

URL = "https://wiki.openstreetmap.org/wiki/Category:Beginners%27_guide"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; osm-wiki-scraper/1.0; +https://example.com/bot)"
}

response = requests.get(URL, headers=HEADERS, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True))

Example output:

Category:Beginners' guide - OpenStreetMap Wiki

Step 2: Inspect the page structure

OpenStreetMap Wiki runs on MediaWiki, so many pages share common patterns.

Useful selectors include:

  • page title: h1.firstHeading
  • main content: div#mw-content-text
  • category members: div#mw-pages
  • content links: a[href^='/wiki/']

Let’s extract the category members.

members = soup.select("div#mw-pages li a[href^='/wiki/']")
for link in members[:10]:
    print(link.get_text(strip=True), link.get("href"))

On the category page, the member links are typically listed under the “Pages in category …” section.

Step 3: Build a reusable category scraper

Here’s a production-friendly starting point.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://wiki.openstreetmap.org"
CATEGORY_URL = "https://wiki.openstreetmap.org/wiki/Category:Beginners%27_guide"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; osm-wiki-scraper/1.0; +https://example.com/bot)"
}


def get_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text


def parse_category_page(html: str):
    soup = BeautifulSoup(html, "html.parser")
    links = []

    for a in soup.select("div#mw-pages li a[href^='/wiki/']"):
        href = a.get("href")
        title = a.get_text(strip=True)
        if not href or not title:
            continue
        links.append({
            "title": title,
            "url": urljoin(BASE, href),
        })

    return links


html = get_html(CATEGORY_URL)
records = parse_category_page(html)
print(f"Found {len(records)} category pages")
print(records[:5])

Example output:

Found 48 category pages
[{'title': "Beginners' guide", 'url': 'https://wiki.openstreetmap.org/wiki/Beginners%27_guide'}, {'title': "Beginners' guide 1.3", 'url': 'https://wiki.openstreetmap.org/wiki/Beginners%27_guide_1.3'}, ...]

Step 4: Visit each linked wiki page

A category page is only the index. Usually you also want fields from the linked pages.

For documentation-style wiki pages, a practical enrichment set is:

  • page title
  • first paragraph summary
  • URL
  • number of outgoing wiki links

Here’s a parser for individual pages.

from bs4 import BeautifulSoup


def parse_wiki_page(html: str):
    soup = BeautifulSoup(html, "html.parser")

    title_el = soup.select_one("h1.firstHeading")
    title = title_el.get_text(strip=True) if title_el else ""

    content = soup.select_one("div#mw-content-text")
    summary = ""
    internal_link_count = 0

    if content:
        for p in content.select("p"):
            text = p.get_text(" ", strip=True)
            if text:
                summary = text
                break
        internal_link_count = len(content.select("a[href^='/wiki/']"))

    return {
        "page_title": title,
        "summary": summary,
        "internal_link_count": internal_link_count,
    }

Now let’s combine category extraction with page enrichment.

import csv
import json
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://wiki.openstreetmap.org"
CATEGORY_URL = "https://wiki.openstreetmap.org/wiki/Category:Beginners%27_guide"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; osm-wiki-scraper/1.0; +https://example.com/bot)"
}


def get_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text


def parse_category_page(html: str):
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for a in soup.select("div#mw-pages li a[href^='/wiki/']"):
        href = a.get("href")
        title = a.get_text(strip=True)
        if href and title:
            out.append({
                "title": title,
                "url": urljoin(BASE, href),
            })
    return out


def parse_wiki_page(html: str):
    soup = BeautifulSoup(html, "html.parser")
    title_el = soup.select_one("h1.firstHeading")
    content = soup.select_one("div#mw-content-text")

    summary = ""
    if content:
        for p in content.select("p"):
            text = p.get_text(" ", strip=True)
            if text:
                summary = text
                break

    return {
        "page_title": title_el.get_text(strip=True) if title_el else "",
        "summary": summary,
        "internal_link_count": len(content.select("a[href^='/wiki/']")) if content else 0,
    }


category_html = get_html(CATEGORY_URL)
category_pages = parse_category_page(category_html)
enriched = []

for page in category_pages[:15]:
    try:
        html = get_html(page["url"])
        details = parse_wiki_page(html)
        enriched.append({**page, **details})
        print("scraped:", page["url"])
        time.sleep(1)
    except requests.RequestException as exc:
        print("failed:", page["url"], exc)

with open("osm_wiki_beginners_guide.json", "w", encoding="utf-8") as f:
    json.dump(enriched, f, ensure_ascii=False, indent=2)

with open("osm_wiki_beginners_guide.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["title", "url", "page_title", "summary", "internal_link_count"],
    )
    writer.writeheader()
    writer.writerows(enriched)

print(f"Saved {len(enriched)} enriched rows")

Example output:

scraped: https://wiki.openstreetmap.org/wiki/Beginners%27_guide
scraped: https://wiki.openstreetmap.org/wiki/Beginners%27_guide_1.3
scraped: https://wiki.openstreetmap.org/wiki/Beginners%27_guide_2.1
Saved 15 enriched rows

Why these selectors work

For MediaWiki-based sites, these selectors are stable enough to be useful:

  • h1.firstHeading → page title
  • div#mw-content-text → main body area
  • div#mw-pages li a → category listing entries

That means you can adapt the same scraper shape to other wiki categories with very few changes.
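In practice, pointing the scraper at another category only means changing the URL. A small helper (a name introduced here, not part of the scripts above) can build that URL from a human-readable category name:

```python
from urllib.parse import quote

BASE = "https://wiki.openstreetmap.org"


def category_url(name: str) -> str:
    # MediaWiki replaces spaces with underscores in page titles;
    # quote() percent-encodes the rest (e.g. the apostrophe as %27).
    return f"{BASE}/wiki/Category:{quote(name.replace(' ', '_'))}"
```

For example, `category_url("Beginners' guide")` reproduces the category URL used throughout this guide.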

Handling pagination on large categories

Large categories are split across multiple listing pages. In that case, look for the “next page” link in the category navigation area and keep crawling until there is no next page.

A simplified helper looks like this:


def find_next_category_page(soup):
    for a in soup.select("div#mw-pages a"):
        text = a.get_text(" ", strip=True).lower()
        if "next page" in text and a.get("href"):
            return urljoin(BASE, a["href"])
    return None
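With that helper, the crawl itself is a loop that follows next-page links until they run out. This sketch repeats the helper so the block is self-contained, and takes the fetch and parse functions as parameters (e.g. get_html and parse_category_page from Step 3), so it works with either plain requests or a proxy-backed fetcher; crawl_all_category_pages is a name introduced here:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://wiki.openstreetmap.org"


def find_next_category_page(soup):
    for a in soup.select("div#mw-pages a"):
        text = a.get_text(" ", strip=True).lower()
        if "next page" in text and a.get("href"):
            return urljoin(BASE, a["href"])
    return None


def crawl_all_category_pages(start_url, fetch_html, parse_members):
    # fetch_html: url -> html string (e.g. get_html from Step 3)
    # parse_members: html string -> list of records (e.g. parse_category_page)
    records = []
    url = start_url
    seen = set()
    while url and url not in seen:  # seen-set guards against link loops
        seen.add(url)
        html = fetch_html(url)
        records.extend(parse_members(html))
        url = find_next_category_page(BeautifulSoup(html, "html.parser"))
    return records
```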

If you’re scraping larger OpenStreetMap Wiki categories, this is the first upgrade to add.

Use ProxiesAPI for the request layer

If you want to fetch OpenStreetMap Wiki pages through ProxiesAPI, the curl pattern is:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://wiki.openstreetmap.org/wiki/Category:Beginners%27_guide"

And in Python:

import requests
from urllib.parse import quote_plus


def fetch_via_proxiesapi(target_url: str, api_key: str) -> str:
    proxy_url = (
        "http://api.proxiesapi.com/?key="
        f"{api_key}&url={quote_plus(target_url)}"
    )
    response = requests.get(proxy_url, timeout=60)
    response.raise_for_status()
    return response.text

html = fetch_via_proxiesapi(
    "https://wiki.openstreetmap.org/wiki/Category:Beginners%27_guide",
    api_key="API_KEY",
)
print(html[:200])

That lets you keep the same parsing code while changing how pages are fetched.

Practical tips for OSM Wiki scraping

1. Expect wiki-specific noise

Navigation, edit links, language controls, and templates can add clutter. Extract from the content area rather than parsing the whole document blindly.
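For example, MediaWiki wraps section edit links in span.mw-editsection, and tables of contents and navigation boxes carry their own markup. A small cleanup pass before text extraction can strip these; the class names below are standard MediaWiki conventions, but verify them against your target pages:

```python
from bs4 import BeautifulSoup


def clean_content(soup):
    content = soup.select_one("div#mw-content-text")
    if content is None:
        return None
    # Remove edit-section links, the table of contents, and navigation
    # boxes so get_text() returns only article prose.
    for el in content.select("span.mw-editsection, div#toc, table.navbox"):
        el.decompose()
    return content
```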

2. Use small parser functions

Keep one function for category pages and another for detail pages. Wiki layouts are consistent, but not identical.

3. Save intermediate results

If you’re visiting dozens or hundreds of linked pages, write progress to disk instead of waiting until the very end.

4. Be polite with request pacing

A one-second delay between detail-page fetches is a simple default for documentation sites.
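Beyond a fixed delay, a small retry wrapper with exponential backoff keeps transient failures from ending a long crawl. This sketch takes the fetch function as a parameter (e.g. get_html from Step 3); fetch_with_backoff is a name introduced here:

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    # Retry transient failures, doubling the wait each time: 1s, 2s, 4s, ...
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```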

5. Validate a few pages manually

Before scaling up, spot-check 5 to 10 records to make sure your summary extraction and URLs are correct.

When this pattern is useful

Scraping OpenStreetMap Wiki pages is useful when you need to:

  • build a local research index of documentation pages
  • monitor new or changed wiki content in a category
  • collect examples for internal tooling or training data
  • structure wiki references for a mapping workflow

Final thoughts

The easiest way to scrape OpenStreetMap Wiki pages is to think of the task in two layers:

  1. category pages give you the list of targets
  2. detail pages give you the actual content you want to store

Once you separate those two layers, the scraper becomes simple to extend.

Start with one category, confirm the selectors, export the data, and then add pagination or deeper field extraction as needed. And if you want a cleaner network layer while keeping your parser code unchanged, ProxiesAPI gives you a straightforward fetch pattern to plug in.
