Scrape IMDb Top 250 Movies into a Dataset (Python + ProxiesAPI)

IMDb’s Top 250 Movies list is a classic scraping target: it’s a single page with a well-defined table, but it still forces you to solve the real-world problems that make scrapers flaky:

  • HTML changes (classes move, wrappers change)
  • number parsing (ratings, vote counts)
  • transient failures (timeouts, 429s)
  • exporting to a dataset format you can actually use

In this guide, we’ll build a production-grade Python scraper that outputs:

  • imdb_top_250.json
  • imdb_top_250.csv

…and uses ProxiesAPI as the network layer so you can keep the crawl stable if you run it from servers, CI, or at higher frequency.

IMDb Top 250 page (we’ll scrape rank, title, year, rating, votes)

Make your IMDb crawl more reliable with ProxiesAPI

When you scrape at scale, failures come from the network layer (timeouts, throttling, transient blocks). ProxiesAPI gives you a stable HTTP surface so your parser code can stay simple.


What we’re scraping

Target page:

  • https://www.imdb.com/chart/top/

At the time of writing, IMDb renders a table-like layout where each row contains:

  • rank (1..250)
  • title
  • release year
  • rating (e.g. 9.2)
  • vote count (e.g. 2.9M)

Important: IMDb’s HTML can change. Instead of hard-coding brittle selectors, we’ll:

  • prefer semantic attributes when available
  • keep selectors minimal
  • validate we got 250 rows and fail loudly if not

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for HTML parsing
  • tenacity for retries with exponential backoff

Step 1: Fetch HTML via ProxiesAPI (with timeouts + retries)

A reliable scraper starts with a reliable fetch.

Below is a simple ProxiesAPI pattern:

  • build a target URL
  • call ProxiesAPI with that URL
  • set a real timeout
  • retry transient failures

Put your ProxiesAPI key in an environment variable:

export PROXIESAPI_KEY="YOUR_API_KEY"

Python fetcher:

import os
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY")

TIMEOUT = (10, 40)  # connect, read
SESSION = requests.Session()

class FetchError(RuntimeError):
    pass

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch_html(url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")

    # ProxiesAPI common pattern: pass the target URL as a parameter.
    # If your account uses a slightly different endpoint/param name,
    # keep this function as the only place you change it.
    api_url = "https://api.proxiesapi.com"

    params = {
        "api_key": PROXIESAPI_KEY,
        "url": url,
    }

    headers = {
        # a realistic UA reduces pointless bot suspicion
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }

    r = SESSION.get(api_url, params=params, headers=headers, timeout=TIMEOUT)

    # Treat 429/5xx as retryable
    if r.status_code in (429, 500, 502, 503, 504):
        raise FetchError(f"Retryable status: {r.status_code}")

    r.raise_for_status()
    return r.text

html = fetch_html("https://www.imdb.com/chart/top/")
print("bytes:", len(html))
print(html[:200])

Notes:

  • The retry policy is conservative (5 attempts, exponential backoff).
  • We fail fast if PROXIESAPI_KEY is missing.
  • We keep all ProxiesAPI integration inside fetch_html() so the rest of the code is pure parsing.

Step 2: Parse the Top 250 rows

IMDb markup changes over time, so we’ll write a parser that:

  1. finds rows that look like “Top 250 items”
  2. extracts title + year + rating + votes from within the row
  3. validates we got a sensible number of items

Implementation:

import re
from bs4 import BeautifulSoup

def parse_year(text: str) -> int | None:
    m = re.search(r"(19\d{2}|20\d{2})", text or "")
    return int(m.group(1)) if m else None


def parse_votes(text: str) -> int | None:
    """Parse vote strings like '2.9M' or '945K' or '123,456'."""
    if not text:
        return None

    t = text.strip().upper().replace(",", "")
    m = re.match(r"^(\d+(?:\.\d+)?)([KM])?$", t)
    if not m:
        # Sometimes IMDb includes parentheses or extra words.
        m2 = re.search(r"(\d+(?:\.\d+)?)([KM])?", t)
        if not m2:
            return None
        num, suf = m2.group(1), m2.group(2)
    else:
        num, suf = m.group(1), m.group(2)

    val = float(num)
    if suf == "K":
        val *= 1_000
    elif suf == "M":
        val *= 1_000_000

    return int(val)


def parse_top_250(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    items = []

    # Strategy:
    # - IMDb has used 'li' based layouts and table based layouts in different eras.
    # - We try a couple of broad selectors and unify extraction.

    # 1) Try list-style items
    candidates = soup.select("li.ipc-metadata-list-summary-item")

    # Fallback: table rows (older markup)
    if not candidates:
        candidates = soup.select("tr")

    for c in candidates:
        text = c.get_text(" ", strip=True)
        if not text:
            continue

        # Title: prefer visible link text
        title = None
        title_a = c.select_one("a")
        if title_a:
            title = title_a.get_text(" ", strip=True)

        # Year: look for a 4-digit year anywhere in the row
        year = parse_year(text)

        # Rating: common pattern is a decimal 0-10, but avoid years
        rating = None
        m = re.search(r"\b(\d\.\d)\b", text)
        if m:
            rating = float(m.group(1))

        # Votes: look for 'K'/'M' strings near 'votes' word if present
        votes = None
        mv = re.search(r"(\d+(?:\.\d+)?[KM])\s+votes", text, flags=re.IGNORECASE)
        if mv:
            votes = parse_votes(mv.group(1))
        else:
            # Sometimes votes appear without 'votes' label; try a weaker heuristic
            mv2 = re.search(r"\b(\d+(?:\.\d+)?[KM])\b", text)
            if mv2:
                votes = parse_votes(mv2.group(1))

        # Rank: look for leading '1.' or '1' near start
        rank = None
        mr = re.match(r"^(\d{1,3})\D", text)
        if mr:
            rank = int(mr.group(1))

        # Keep only plausible movie rows
        if title and year and rating and 1900 <= year <= 2100 and 0 < rating <= 10:
            items.append({
                "rank": rank,
                "title": title,
                "year": year,
                "rating": rating,
                "votes": votes,
            })

    # De-dupe by title+year and keep best rank
    dedup = {}
    for it in items:
        k = (it["title"], it["year"])
        if k not in dedup:
            dedup[k] = it
        else:
            # prefer a non-null rank
            if dedup[k].get("rank") is None and it.get("rank") is not None:
                dedup[k] = it

    out = list(dedup.values())

    # If we got way too few, something changed.
    if len(out) < 200:
        raise RuntimeError(f"Parser returned too few items: {len(out)}")

    # Sort by rank when present; otherwise by rating desc
    out.sort(key=lambda x: (x["rank"] is None, x["rank"] or 999, -x["rating"]))

    return out

movies = parse_top_250(html)
print("movies:", len(movies))
print(movies[0])

Why we validate count: a broken scraper that silently exports 17 rows is worse than one that errors.


Step 3: Export to JSON + CSV

import json
import csv

movies = parse_top_250(fetch_html("https://www.imdb.com/chart/top/"))

with open("imdb_top_250.json", "w", encoding="utf-8") as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)

with open("imdb_top_250.csv", "w", encoding="utf-8", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["rank", "title", "year", "rating", "votes"])
    w.writeheader()
    for m in movies:
        w.writerow(m)

print("wrote imdb_top_250.json and imdb_top_250.csv", len(movies))

Practical hardening (the stuff that matters)

1) Use caching when iterating on parsers

While developing, don’t hammer the site.

from pathlib import Path

cache = Path(".cache_imdb_top.html")
if cache.exists():
    html = cache.read_text(encoding="utf-8")
else:
    html = fetch_html("https://www.imdb.com/chart/top/")
    cache.write_text(html, encoding="utf-8")

2) Don’t scrape detail pages unless you need them

The Top 250 page already gives you a great dataset. Detail pages multiply request counts by 250.
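
If you do decide to enrich from detail pages later, cap and throttle the extra requests while you iterate. Here's a minimal sketch reusing fetch_html(); detail_urls is a hypothetical list you'd collect from each row's link href, which the parser above doesn't currently extract:

import time

def fetch_details(detail_urls: list[str], limit: int = 10, delay: float = 1.0) -> list[str]:
    """Fetch a small, throttled batch of detail pages via the same fetch_html()."""
    pages = []
    for url in detail_urls[:limit]:  # cap the batch while you're still iterating
        pages.append(fetch_html(url))
        time.sleep(delay)            # spread requests out instead of bursting
    return pages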

3) Validate schema before you ship the dataset

Add a quick sanity check:

assert all(m["title"] and m["year"] and m["rating"] for m in movies)
assert movies[0]["rating"] >= movies[-1]["rating"] - 2  # rough check

Where ProxiesAPI fits (honestly)

For a single run from your laptop, you might be fine without proxies.

ProxiesAPI becomes valuable when:

  • you run the scraper repeatedly (cron jobs)
  • you run from cloud IPs that get throttled faster
  • you expand beyond Top 250 → search pages → detail pages
  • you need consistent latency and fewer transient failures

The goal isn’t “scrape anything without consequences.” The goal is: reduce flaky network failures so your data pipeline is predictable.


QA checklist

  • Parser returns ~250 items
  • rank is populated for most rows
  • years look sane (no 0 / 2099)
  • vote counts parse into integers
  • exports open cleanly in Excel / Pandas
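
In code, those checks look roughly like this (the thresholds are judgment calls, not guarantees from IMDb):

def qa_check(movies: list[dict]) -> None:
    """Rough dataset QA mirroring the checklist above."""
    assert len(movies) >= 240, f"too few rows: {len(movies)}"
    ranked = sum(m["rank"] is not None for m in movies)
    assert ranked >= 0.9 * len(movies), f"too many missing ranks: {ranked}/{len(movies)}"
    assert all(1900 <= m["year"] <= 2100 for m in movies), "implausible year"
    assert all(m["votes"] is None or isinstance(m["votes"], int) for m in movies), "votes not integers"

qa_check(movies)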

Next upgrades

  • Enrich each movie with genres + runtime by crawling detail pages (careful: +250 requests)
  • Add incremental updates (only re-scrape weekly)
  • Store into SQLite and build a small analytics notebook
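
For the SQLite idea, a minimal sketch with the standard library (the table name and schema here are just one reasonable choice, not a fixed convention):

import sqlite3

conn = sqlite3.connect("imdb_top_250.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS top_250 (
        rank INTEGER,
        title TEXT,
        year INTEGER,
        rating REAL,
        votes INTEGER
    )
""")
conn.execute("DELETE FROM top_250")  # simple full refresh on each run
conn.executemany(
    "INSERT INTO top_250 (rank, title, year, rating, votes) VALUES (?, ?, ?, ?, ?)",
    [(m["rank"], m["title"], m["year"], m["rating"], m["votes"]) for m in movies],
)
conn.commit()
conn.close()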
