Scrape Numbeo Cost of Living Data with Python (cities, indices, and tables)

Numbeo is one of the most referenced sources for cost-of-living comparisons. For scraping, it’s a nice target because the key data is typically in server-rendered HTML tables.

In this guide we’ll build a real Python scraper that:

  • pulls a city cost-of-living page
  • extracts the headline indices + key tables
  • normalizes rows into structured records
  • exports JSON + CSV
  • comes with a reference screenshot of the page so you can verify the structure visually

Screenshot: Numbeo cost of living page (tables + indices we’ll parse)

When you scrape many cities, ProxiesAPI keeps requests steady

Numbeo pages are HTML-table heavy, which makes them fast to parse. ProxiesAPI helps you pull dozens or hundreds of city pages without tripping rate limits.


What we’re scraping (Numbeo URL patterns)

Numbeo’s city pages often look like:

  • https://www.numbeo.com/cost-of-living/in/<City>

Example:

  • https://www.numbeo.com/cost-of-living/in/Amsterdam

There are also “country” pages and comparison pages, but city pages are the most directly useful if you want a dataset.

Terminal sanity check

curl -s "https://www.numbeo.com/cost-of-living/in/Amsterdam" | head -n 10

If you see HTML with tables, you’re good.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for parsing
  • pandas (optional) for analyzing the exported CSV; the export code below uses the standard csv module
  • tenacity for retries

ProxiesAPI integration

As your crawl expands (many cities), you’ll want a stable network layer.

The code below supports a ProxiesAPI proxy via an environment variable.

import os
import random
import time
from typing import Optional

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 30)

# Example: http://USER:PASS@gateway.proxiesapi.com:PORT
PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")

session = requests.Session()

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def _proxy_dict() -> Optional[dict]:
    if not PROXIESAPI_PROXY_URL:
        return None
    return {"http": PROXIESAPI_PROXY_URL, "https": PROXIESAPI_PROXY_URL}


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch(url: str) -> str:
    # small random delay (jitter) so requests are not evenly spaced
    time.sleep(random.uniform(0.3, 1.0))

    r = session.get(url, headers=HEADERS, timeout=TIMEOUT, proxies=_proxy_dict())
    r.raise_for_status()
    return r.text
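
One thing to keep in mind: requests.HTTPError is a subclass of requests.RequestException, so the decorator above also retries a 404 five times. The sketch below shows one way to tell transient failures from permanent ones. The names RETRYABLE_STATUSES and should_retry are hypothetical, and the status list is an assumption you should tune for your crawl.

```python
# Statuses that are usually worth retrying (an assumption; adjust as needed).
RETRYABLE_STATUSES = {408, 425, 429, 500, 502, 503, 504}


def should_retry(exc: BaseException) -> bool:
    """Decide whether a failed request is worth retrying.

    Assumes it is only called with requests.RequestException instances,
    which carry a `response` attribute (None for connection errors/timeouts).
    """
    response = getattr(exc, "response", None)
    # Compare with None explicitly: a requests.Response is falsy for 4xx/5xx.
    if response is None:
        return True  # no HTTP response at all: likely a network problem
    return response.status_code in RETRYABLE_STATUSES
```

tenacity ships a retry_if_exception(predicate) helper, so one option is to pass should_retry to it in place of retry_if_exception_type, after first checking that the exception is a requests.RequestException.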

Step 1: Extract headline indices + tables

Numbeo pages typically include:

  • a “headline” area with key indices (Cost of Living Index, Rent Index, etc.)
  • one or more tables (restaurants, markets, transport, utilities, etc.)

We’ll parse:

  • h1 for city label
  • any “indices” table (two columns like Index + Value)
  • price tables with item + range columns

import re
from bs4 import BeautifulSoup


# Note: the "str | None" annotations below require Python 3.10+.
def clean(s: str | None) -> str | None:
    if not s:
        return None
    t = re.sub(r"\s+", " ", s).strip()
    return t or None


def parse_float(s: str | None) -> float | None:
    if not s:
        return None
    # remove commas and non-numeric tokens
    t = re.sub(r"[^0-9\.,-]", "", s)
    t = t.replace(",", "")
    try:
        return float(t)
    except ValueError:
        return None


def parse_city_page(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    h1 = soup.select_one("h1")
    title = clean(h1.get_text(" ", strip=True)) if h1 else None

    # Many Numbeo pages have tables with class 'data_wide_table' or similar.
    tables = soup.select("table")

    indices = []
    price_tables = []

    for tbl in tables:
        # header cells
        headers = [clean(th.get_text(" ", strip=True)) for th in tbl.select("th")]
        rows = tbl.select("tr")

        # heuristic: an index table often has 2 columns and rows like "Cost of Living Index" + "74.2"
        # We detect it by checking whether either of the two header cells contains "Index".
        if headers and len(headers) == 2 and ("Index" in (headers[0] or "") or "Index" in (headers[1] or "")):
            out = []
            for tr in rows[1:]:
                tds = tr.select("td")
                if len(tds) != 2:
                    continue
                k = clean(tds[0].get_text(" ", strip=True))
                v_text = clean(tds[1].get_text(" ", strip=True))
                v = parse_float(v_text)
                if k:
                    out.append({"name": k, "value": v, "raw": v_text})
            if out:
                indices.extend(out)
            continue

        # heuristic: price tables often have columns like "Item", "Price", "Range" or similar
        if headers and headers[0] and "Item" in headers[0]:
            # capture rows
            out_rows = []
            for tr in rows[1:]:
                tds = tr.select("td")
                if len(tds) < 2:
                    continue
                item = clean(tds[0].get_text(" ", strip=True))
                price = clean(tds[1].get_text(" ", strip=True))
                range_text = clean(tds[2].get_text(" ", strip=True)) if len(tds) >= 3 else None
                if item:
                    out_rows.append({"item": item, "price": price, "range": range_text})
            if out_rows:
                price_tables.append({"headers": headers, "rows": out_rows})

    return {
        "title": title,
        "indices": indices,
        "tables": price_tables,
    }
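
The parser above keeps price and range as raw strings. If you want numeric columns, the sketch below shows one way to normalize a row. It assumes prices use a dot as the decimal separator and commas for thousands, and that a range holds two numbers such as "12.00-25.00". The helpers to_number, parse_range, and normalize_row are hypothetical names, and the sample row in the usage note is made up.

```python
import re
from typing import Optional, Tuple

# A number with optional thousands commas and an optional decimal part.
_NUM = re.compile(r"\d[\d,]*(?:\.\d+)?")


def to_number(text: Optional[str]) -> Optional[float]:
    """Return the first number found in text, or None."""
    if not text:
        return None
    m = _NUM.search(text)
    return float(m.group(0).replace(",", "")) if m else None


def parse_range(text: Optional[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (low, high) from a range string, or (None, None) if unclear."""
    if not text:
        return (None, None)
    nums = [float(n.replace(",", "")) for n in _NUM.findall(text)]
    if len(nums) < 2:
        return (None, None)
    return (nums[0], nums[1])


def normalize_row(row: dict) -> dict:
    """Turn one {"item", "price", "range"} row into numeric fields."""
    low, high = parse_range(row.get("range"))
    return {
        "item": row.get("item"),
        "price": to_number(row.get("price")),
        "low": low,
        "high": high,
    }
```

For example, normalize_row({"item": "Meal", "price": "15.00 €", "range": "12.00-25.00"}) yields price 15.0, low 12.0, and high 25.0. Keep the raw strings alongside the numbers so you can audit rows that fail to parse.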

Step 2: Turn city names into URLs

Numbeo city slugs are usually just the capitalized city name, but multi-word names (cities with spaces) need extra care.

A safe approach is:

  • take a known URL list (seed from your own list or a country page)
  • or encode city names and let Numbeo redirect

For a simple tutorial, we’ll start with a list of city slugs you control.

from urllib.parse import quote

BASE = "https://www.numbeo.com"


def city_url(city: str) -> str:
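    # quote() percent-encodes spaces as %20. This relies on Numbeo accepting or
    # redirecting that form, so prefer slugs copied from real Numbeo URLs.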
    return f"{BASE}/cost-of-living/in/{quote(city)}"
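
If a percent-encoded name does not resolve for a multi-word city, a hyphenated slug is another form worth trying. The sketch below assumes (unverified here, so check a real URL first) that Numbeo accepts hyphens in place of spaces, as in New-York. city_slug and city_url_hyphenated are hypothetical helpers.

```python
from urllib.parse import quote

BASE = "https://www.numbeo.com"


def city_slug(city: str) -> str:
    """Collapse whitespace, join words with hyphens, percent-encode the rest."""
    words = city.split()  # split() with no args also trims and collapses spaces
    return quote("-".join(words), safe="")


def city_url_hyphenated(city: str) -> str:
    # Assumption: hyphenated slugs are accepted; verify against the live site.
    return f"{BASE}/cost-of-living/in/{city_slug(city)}"
```

Whichever form you use, it is safest to seed the crawl from URLs you have seen resolve rather than from generated slugs alone.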

Step 3: Export JSON + CSV

We’ll write:

  • one JSON file per city
  • one flattened CSV for the indices table

import json
import csv


def export_city_json(city: str, data: dict, out_dir: str = "out") -> str:
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"numbeo_{city.replace(' ', '_').lower()}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return path


def export_indices_csv(city: str, indices: list[dict], path: str) -> None:
    # append mode so you can build a dataset across many cities
    exists = os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["city", "name", "value", "raw"])
        if not exists:
            w.writeheader()
        for row in indices:
            w.writerow({"city": city, **row})


if __name__ == "__main__":
    cities = ["Amsterdam", "Berlin", "Paris"]
    for city in cities:
        url = city_url(city)
        html = fetch(url)
        parsed = parse_city_page(html)

        json_path = export_city_json(city, {"city": city, "url": url, **parsed})
        export_indices_csv(city, parsed.get("indices", []), "out/numbeo_indices.csv")

        print("city", city, "json", json_path, "indices", len(parsed.get("indices", [])))
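
The script above writes only the indices to CSV; the price tables stay inside the per-city JSON. If you also want them in a flat file, one possible approach is sketched below. It assumes the {"headers", "rows"} table shape returned by parse_city_page(). flatten_price_rows, export_prices_csv, and the column names are hypothetical.

```python
import csv
import os

PRICE_FIELDS = ["city", "table", "item", "price", "range"]


def flatten_price_rows(city: str, tables: list) -> list:
    """Turn the nested price tables into one flat list of row dicts."""
    flat = []
    for table_no, tbl in enumerate(tables):
        for row in tbl.get("rows", []):
            flat.append({
                "city": city,
                "table": table_no,  # position of the table on the page
                "item": row.get("item"),
                "price": row.get("price"),
                "range": row.get("range"),
            })
    return flat


def export_prices_csv(city: str, tables: list, path: str) -> None:
    """Append one city's price rows to a shared CSV, writing the header once."""
    exists = os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=PRICE_FIELDS)
        if not exists:
            w.writeheader()
        w.writerows(flatten_price_rows(city, tables))
```

Inside the main loop this could be called as export_prices_csv(city, parsed["tables"], "out/numbeo_prices.csv"), after export_city_json() has created the out directory.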

Scaling it up (many cities)

If you want to scrape 200+ cities:

  • add caching (don’t refetch unchanged pages)
  • add a crawl delay (0.5–2.0s) + random jitter
  • use a proxy layer (ProxiesAPI) when you hit rate limits
  • store results in SQLite so you can resume
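
The caching and resume points above can be combined in one small SQLite table. This is a minimal sketch, not a full crawl store: the file name, the pages schema, the one-week freshness window, and the helper names are all assumptions. Pass in the fetch() function from earlier as fetcher.

```python
import sqlite3
import time
from typing import Callable, Optional

WEEK_S = 7 * 24 * 3600  # assumed freshness window


def open_store(path: str = "numbeo_cache.sqlite") -> sqlite3.Connection:
    """Open (or create) the cache database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def get_cached(conn: sqlite3.Connection, url: str, max_age_s: float = WEEK_S) -> Optional[str]:
    """Return cached HTML if present and fresh enough, else None."""
    row = conn.execute(
        "SELECT html, fetched_at FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    html, fetched_at = row
    if time.time() - fetched_at > max_age_s:
        return None
    return html


def fetch_cached(conn: sqlite3.Connection, url: str, fetcher: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise fetch and store the page."""
    cached = get_cached(conn, url)
    if cached is not None:
        return cached
    html = fetcher(url)
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html, fetched_at) VALUES (?, ?, ?)",
        (url, html, time.time()),
    )
    conn.commit()
    return html
```

Because every page is committed as soon as it is fetched, rerunning the script after a crash skips the cities already stored and continues with the rest.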

Troubleshooting

Getting 403/429

  • slow down
  • rotate IPs (ProxiesAPI)
  • keep a session cookie jar
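
“Slow down” can be made concrete with a delay function. The sketch below honors a numeric Retry-After header when the server sends one and otherwise falls back to exponential backoff with full jitter. backoff_delay and its default base and cap are hypothetical choices; whether Numbeo sends Retry-After at all is not something this guide confirms.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 2.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After may be a number of seconds...
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # ...or an HTTP date, which this sketch does not parse
    # Exponential backoff with full jitter: pick uniformly in [0, ceiling].
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

After a 429 you could call time.sleep(backoff_delay(attempt, r.headers.get("Retry-After"))) before the next try.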

Tables missing

  • view page source and confirm the table HTML exists
  • if the table is populated by JS (rare on Numbeo, but possible), consider Playwright
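
To check the raw response without opening a browser, you can count table tags in the fetched HTML. This sketch uses only the standard library; TableCounter and table_stats are hypothetical names.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> and <tr> start tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.tables = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag == "tr":
            self.rows += 1


def table_stats(html: str) -> dict:
    """Return how many tables and rows the raw HTML contains."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return {"tables": parser.tables, "rows": parser.rows}
```

Zero tables in the fetched HTML, when the browser shows them, suggests either a block page or client-side rendering. Several tables but empty parser output points at the header heuristics in parse_city_page() instead.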

QA checklist

  • City URL fetches HTML
  • parse_city_page() returns at least one table
  • Indices rows parse to floats
  • JSON exports valid UTF-8
  • CSV appends across cities
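
Part of this checklist can be automated. The sketch below checks the dict returned by parse_city_page() and reports problems as plain strings; check_parsed is a hypothetical helper, and the messages are only suggestions.

```python
def check_parsed(parsed: dict) -> list:
    """Return a list of problems found in one parsed city page (empty if OK)."""
    problems = []
    if not parsed.get("title"):
        problems.append("missing title")
    if not parsed.get("tables"):
        problems.append("no price tables parsed")
    # Every index row should carry a float in "value".
    indices = parsed.get("indices") or []
    bad = [row.get("name") for row in indices if not isinstance(row.get("value"), float)]
    if bad:
        problems.append(f"non-numeric index values: {bad}")
    return problems
```

Running it inside the main loop and logging any non-empty result makes layout changes visible early, before they silently empty your dataset.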

Where ProxiesAPI fits (honestly)

A single Numbeo page is easy.

But a real dataset involves many pages — and any time you scale a crawl, you’ll see intermittent failures.

ProxiesAPI gives you a predictable way to rotate IPs and keep your scraper running when you move from “3 cities for a demo” to “300 cities for a product.”
