Scrape IMDb Top 250 Movies into a Dataset (Python + ProxiesAPI)
IMDb’s Top 250 Movies list is a classic scraping target: it’s a single page with a well-defined table, but it still forces you to solve the real-world problems that make scrapers flaky:
- HTML changes (classes move, wrappers change)
- number parsing (ratings, vote counts)
- transient failures (timeouts, 429s)
- exporting to a dataset format you can actually use
In this guide, we’ll build a production-grade Python scraper that outputs:
- imdb_top_250.json
- imdb_top_250.csv
…and uses ProxiesAPI as the network layer so you can keep the crawl stable if you run it from servers, CI, or at higher frequency.

When you scrape at scale, failures come from the network layer (timeouts, throttling, transient blocks). ProxiesAPI gives you a stable HTTP surface so your parser code can stay simple.
What we’re scraping
Target page:
https://www.imdb.com/chart/top/
At the time of writing, IMDb renders a table-like layout where each row contains:
- rank (1..250)
- title
- release year
- rating (e.g. 9.2)
- vote count (e.g. 2.9M)
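Each row becomes one flat record in the exported dataset. Using the example numbers above, a record has this shape (the values shown are illustrative):
{
  "rank": 1,
  "title": "The Shawshank Redemption",
  "year": 1994,
  "rating": 9.2,
  "votes": 2900000
}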
Important: IMDb’s HTML can change. Instead of hard-coding brittle selectors, we’ll:
- prefer semantic attributes when available
- keep selectors minimal
- validate we got 250 rows and fail loudly if not
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- tenacity for retries with exponential backoff
Step 1: Fetch HTML via ProxiesAPI (with timeouts + retries)
A reliable scraper starts with a reliable fetch.
Below is a simple ProxiesAPI pattern:
- build a target URL
- call ProxiesAPI with that URL
- set a real timeout
- retry transient failures
Put your ProxiesAPI key in an environment variable:
export PROXIESAPI_KEY="YOUR_API_KEY"
Python fetcher:
import os
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY")
TIMEOUT = (10, 40)  # (connect, read) timeouts in seconds
SESSION = requests.Session()

class FetchError(RuntimeError):
    pass

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch_html(url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")

    # ProxiesAPI common pattern: pass the target URL as a parameter.
    # If your account uses a slightly different endpoint/param name,
    # keep this function as the only place you change it.
    api_url = "https://api.proxiesapi.com"
    params = {
        "api_key": PROXIESAPI_KEY,
        "url": url,
    }
    headers = {
        # a realistic UA reduces pointless bot suspicion
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }

    r = SESSION.get(api_url, params=params, headers=headers, timeout=TIMEOUT)

    # Treat 429/5xx as retryable
    if r.status_code in (429, 500, 502, 503, 504):
        raise FetchError(f"Retryable status: {r.status_code}")

    r.raise_for_status()
    return r.text

html = fetch_html("https://www.imdb.com/chart/top/")
print("bytes:", len(html))
print(html[:200])
Notes:
- The retry policy is conservative (5 attempts, exponential backoff).
- We fail fast if PROXIESAPI_KEY is missing.
- We keep all ProxiesAPI integration inside fetch_html() so the rest of the code is pure parsing.
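If you also want visibility into when retries actually fire, tenacity ships a before_sleep_log helper that logs before each backoff sleep. A minimal sketch: the decorator is the same as in Step 1 with one extra argument, the logger name is arbitrary, and the function body is unchanged:
import logging
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, before_sleep_log

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("imdb_scraper")  # arbitrary logger name

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),  # log each retry before sleeping
)
def fetch_html(url: str) -> str:
    ...  # same body as fetch_html() above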
Step 2: Parse the Top 250 rows
IMDb markup changes over time, so we’ll write a parser that:
- finds rows that look like “Top 250 items”
- extracts title + year + rating + votes from within the row
- validates we got a sensible number of items
Implementation:
import re
from bs4 import BeautifulSoup

def parse_year(text: str) -> int | None:
    m = re.search(r"(19\d{2}|20\d{2})", text or "")
    return int(m.group(1)) if m else None

def parse_votes(text: str) -> int | None:
    """Parse vote strings like '2.9M' or '945K' or '123,456'."""
    if not text:
        return None
    t = text.strip().upper().replace(",", "")
    m = re.match(r"^(\d+(?:\.\d+)?)([KM])?$", t)
    if not m:
        # Sometimes IMDb includes parentheses or extra words.
        m2 = re.search(r"(\d+(?:\.\d+)?)([KM])?", t)
        if not m2:
            return None
        num, suf = m2.group(1), m2.group(2)
    else:
        num, suf = m.group(1), m.group(2)
    val = float(num)
    if suf == "K":
        val *= 1_000
    elif suf == "M":
        val *= 1_000_000
    return int(val)

def parse_top_250(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items = []

    # Strategy:
    # - IMDb has used 'li' based layouts and table based layouts in different eras.
    # - We try a couple of broad selectors and unify extraction.

    # 1) Try list-style items
    candidates = soup.select("li.ipc-metadata-list-summary-item")

    # Fallback: table rows (older markup)
    if not candidates:
        candidates = soup.select("tr")

    for c in candidates:
        text = c.get_text(" ", strip=True)
        if not text:
            continue

        # Title: prefer visible link text
        title = None
        title_a = c.select_one("a")
        if title_a:
            title = title_a.get_text(" ", strip=True)

        # Year: look for a 4-digit year anywhere in the row
        year = parse_year(text)

        # Rating: common pattern is a decimal 0-10, but avoid years
        rating = None
        m = re.search(r"\b(\d\.\d)\b", text)
        if m:
            rating = float(m.group(1))

        # Votes: look for 'K'/'M' strings near 'votes' word if present
        votes = None
        mv = re.search(r"(\d+(?:\.\d+)?[KM])\s+votes", text, flags=re.IGNORECASE)
        if mv:
            votes = parse_votes(mv.group(1))
        else:
            # Sometimes votes appear without 'votes' label; try a weaker heuristic
            mv2 = re.search(r"\b(\d+(?:\.\d+)?[KM])\b", text)
            if mv2:
                votes = parse_votes(mv2.group(1))

        # Rank: look for leading '1.' or '1' near start
        rank = None
        mr = re.match(r"^(\d{1,3})\D", text)
        if mr:
            rank = int(mr.group(1))

        # Keep only plausible movie rows
        if title and year and rating and 1900 <= year <= 2100 and 0 < rating <= 10:
            items.append({
                "rank": rank,
                "title": title,
                "year": year,
                "rating": rating,
                "votes": votes,
            })

    # De-dupe by title+year and keep best rank
    dedup = {}
    for it in items:
        k = (it["title"], it["year"])
        if k not in dedup:
            dedup[k] = it
        else:
            # prefer a non-null rank
            if dedup[k].get("rank") is None and it.get("rank") is not None:
                dedup[k] = it
    out = list(dedup.values())

    # If we got way too few, something changed.
    if len(out) < 200:
        raise RuntimeError(f"Parser returned too few items: {len(out)}")

    # Sort by rank when present; otherwise by rating desc
    out.sort(key=lambda x: (x["rank"] is None, x["rank"] or 999, -x["rating"]))
    return out

movies = parse_top_250(html)
print("movies:", len(movies))
print(movies[0])
Why we validate count: a broken scraper that silently exports 17 rows is worse than one that errors.
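The number parsers are also worth a few spot checks of their own, since vote formats are where silent bugs usually hide. The expected values below follow directly from the parsing rules above:
assert parse_votes("2.9M") == 2_900_000
assert parse_votes("945K") == 945_000
assert parse_votes("123,456") == 123_456
assert parse_year("The Godfather 1972") == 1972
assert parse_year("no year here") is None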
Step 3: Export to JSON + CSV
import json
import csv

movies = parse_top_250(fetch_html("https://www.imdb.com/chart/top/"))

with open("imdb_top_250.json", "w", encoding="utf-8") as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)

with open("imdb_top_250.csv", "w", encoding="utf-8", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["rank", "title", "year", "rating", "votes"])
    w.writeheader()
    for m in movies:
        w.writerow(m)

print("wrote imdb_top_250.json and imdb_top_250.csv", len(movies))
Practical hardening (the stuff that matters)
1) Use caching when iterating on parsers
While developing, don’t hammer the site.
from pathlib import Path

cache = Path(".cache_imdb_top.html")
if cache.exists():
    html = cache.read_text(encoding="utf-8")
else:
    html = fetch_html("https://www.imdb.com/chart/top/")
    cache.write_text(html, encoding="utf-8")
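If you iterate over several days, you may want the cache to expire instead of going stale forever. A small sketch, using the file's modification time (the 24-hour window is arbitrary):
import time
from pathlib import Path

cache = Path(".cache_imdb_top.html")
max_age_seconds = 24 * 60 * 60  # arbitrary: refetch after a day

if cache.exists() and (time.time() - cache.stat().st_mtime) < max_age_seconds:
    html = cache.read_text(encoding="utf-8")
else:
    html = fetch_html("https://www.imdb.com/chart/top/")
    cache.write_text(html, encoding="utf-8")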
2) Don’t scrape detail pages unless you need them
The Top 250 page already gives you a great dataset. Detail pages multiply request counts by 250.
3) Validate schema before you ship the dataset
Add a quick sanity check:
assert all(m["title"] and m["year"] and m["rating"] for m in movies)
assert movies[0]["rating"] >= movies[-1]["rating"] - 2 # rough check
Where ProxiesAPI fits (honestly)
For a single run from your laptop, you might be fine without proxies.
ProxiesAPI becomes valuable when:
- you run the scraper repeatedly (cron jobs)
- you run from cloud IPs that get throttled faster
- you expand beyond Top 250 → search pages → detail pages
- you need consistent latency and fewer transient failures
The goal isn’t “scrape anything without consequences.” The goal is: reduce flaky network failures so your data pipeline is predictable.
QA checklist
- Parser returns ~250 items
- rank is populated for most rows
- years look sane (no 0 / 2099)
- vote counts parse into integers
- exports open cleanly in Excel / Pandas
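Most of this checklist can be automated in a few lines before you publish the dataset. A sketch (the thresholds are illustrative):
def qa_check(movies: list[dict]) -> None:
    assert 240 <= len(movies) <= 250, f"unexpected count: {len(movies)}"
    ranked = sum(1 for m in movies if m["rank"] is not None)
    assert ranked >= 0.9 * len(movies), "too many rows without a rank"
    assert all(1900 <= m["year"] <= 2100 for m in movies), "implausible year"
    assert all(m["votes"] is None or isinstance(m["votes"], int) for m in movies), "votes not parsed to int"

qa_check(movies)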
Next upgrades
- Enrich each movie with genres + runtime by crawling detail pages (careful: +250 requests)
- Add incremental updates (only re-scrape weekly)
- Store into SQLite and build a small analytics notebook
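If you go the SQLite route, the standard library's sqlite3 module is enough; a minimal sketch (the database file and table names are illustrative):
import sqlite3

conn = sqlite3.connect("imdb.db")  # illustrative file name
conn.execute("""
    CREATE TABLE IF NOT EXISTS top_250 (
        rank INTEGER,
        title TEXT,
        year INTEGER,
        rating REAL,
        votes INTEGER
    )
""")
conn.executemany(
    "INSERT INTO top_250 (rank, title, year, rating, votes) VALUES (?, ?, ?, ?, ?)",
    [(m["rank"], m["title"], m["year"], m["rating"], m["votes"]) for m in movies],
)
conn.commit()
conn.close()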