Scrape IMDb Top 250 into a Weekly Tracker (Rank Changes, Ratings, Votes)

Jun 03, 2026 · tutorial · #python, #imdb, #selenium, #beautifulsoup, #csv, #data-tracking

IMDb's Top 250 is perfect tracking data.

It changes slowly, but not randomly. Rankings drift. Vote counts climb. Ratings compress at the top. And if you take a clean weekly snapshot, you can answer questions like:

which films moved up or down this week?
which titles gained the most votes?
where are ratings stable even when rank changes?

Plain requests.get() is not a reliable way to fetch the chart. In the environment used for this tutorial, a raw IMDb fetch came back blocked, while a browser-rendered page loaded correctly, so the browser route is the one to build on.

Use ProxiesAPI when IMDb starts blocking the boring part

For IMDb, the hard part is often just getting a stable page load. ProxiesAPI can sit underneath your browser automation as the proxy layer while your extraction code stays unchanged.


The scraping strategy

IMDb changes frontend classes often, so I would not anchor this scraper to one deep CSS chain.

The safer pattern is:

render the page in a browser
grab the page source
parse the structured data first
use DOM selectors only for fields the structured blob does not give you cleanly

The two selectors worth remembering are:

title links: a[href*="/title/tt"]
structured data: script[type="application/ld+json"]

Install

python -m venv .venv
source .venv/bin/activate
pip install selenium beautifulsoup4 lxml pandas

Step 1: Load the rendered page

from __future__ import annotations

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def build_driver() -> webdriver.Chrome:
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1440,2400")
    options.add_argument(
        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36"
    )

    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    if proxy_url:
        options.add_argument(f"--proxy-server={proxy_url}")

    return webdriver.Chrome(options=options)


def get_rendered_html() -> str:
    driver = build_driver()
    try:
        driver.get("https://www.imdb.com/chart/top/")
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href*="/title/tt"]'))
        )
        return driver.page_source
    finally:
        driver.quit()

That wait condition matters. If you do not see title links, you do not have the chart.
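
The same idea can be applied to the returned page source as a cheap sanity check before parsing. The sketch below is a minimal example using only the standard library; count_title_ids and looks_like_chart are hypothetical helper names, the 200-title threshold is an assumption, and the regex assumes double-quoted href attributes, which is how Selenium normally serializes page_source.

```python
import re

# Matches IMDb title paths such as /title/tt0000001/ inside a double-quoted href.
TITLE_HREF_RE = re.compile(r'href="[^"]*/title/(tt\d+)')


def count_title_ids(html: str) -> int:
    """Count distinct IMDb title IDs linked from the page source."""
    return len(set(TITLE_HREF_RE.findall(html)))


def looks_like_chart(html: str, min_titles: int = 200) -> bool:
    """Heuristic: the chart page should link to roughly 250 distinct titles."""
    return count_title_ids(html) >= min_titles


# Placeholder markup with made-up IDs, just to exercise the helpers.
sample = '<a href="/title/tt0000001/">A</a><a href="/title/tt0000002/?ref_=x">B</a>'
print(count_title_ids(sample))
print(looks_like_chart(sample))
```

A block page or a consent wall usually links to very few titles, so a low count is a good reason to stop before the parser produces a misleading error.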

Step 2: Parse the JSON-LD item list

IMDb usually ships structured metadata in application/ld+json. That is more stable than scraping presentation classes.

import json
from bs4 import BeautifulSoup


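def extract_itemlist(soup: BeautifulSoup) -> dict:
    for node in soup.select('script[type="application/ld+json"]'):
        text = node.string or node.get_text()
        if not text:
            continue
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            # Skip malformed blobs instead of aborting the whole parse.
            continue
        # A JSON-LD script may hold a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for candidate in candidates:
            if isinstance(candidate, dict) and candidate.get("@type") == "ItemList":
                return candidate
    raise ValueError("Could not find IMDb ItemList JSON-LD")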

From there, each movie is typically in itemListElement with a position and nested item payload.
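
To make that shape concrete, here is a trimmed, hypothetical payload and a loop over it. The titles, IDs, and numbers are placeholders rather than real IMDb data, and the real blob may carry more fields or omit some of these.

```python
import json

# Hypothetical, trimmed-down JSON-LD; every value below is a placeholder.
SAMPLE_JSON_LD = """
{
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1,
     "item": {"@type": "Movie", "name": "Example Film A",
              "url": "https://www.imdb.com/title/tt0000001/",
              "aggregateRating": {"ratingValue": 9.3, "ratingCount": 1000}}},
    {"@type": "ListItem", "position": 2,
     "item": {"@type": "Movie", "name": "Example Film B",
              "url": "https://www.imdb.com/title/tt0000002/",
              "aggregateRating": {"ratingValue": 9.2, "ratingCount": 900}}}
  ]
}
"""

data = json.loads(SAMPLE_JSON_LD)

# Each entry pairs a list position with a nested "item" describing the movie.
summary = [
    (entry["position"], entry["item"]["name"], entry["item"]["aggregateRating"]["ratingValue"])
    for entry in data["itemListElement"]
]

for position, name, rating in summary:
    print(position, name, rating)
```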

Step 3: Turn the item list into rows

import re


def imdb_id_from_url(url: str | None) -> str | None:
    # Match the ID whether or not the URL ends with a trailing slash.
    m = re.search(r"/title/(tt\d+)", url or "")
    return m.group(1) if m else None


def parse_top250(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    itemlist = extract_itemlist(soup)
    rows = []

    for idx, entry in enumerate(itemlist.get("itemListElement", []), start=1):
        movie = entry.get("item", {})
        agg = movie.get("aggregateRating", {}) or {}
        rows.append(
            {
                # Fall back to list order if an entry carries no explicit position.
                "rank": int(entry.get("position") or idx),
                "title": movie.get("name"),
                "url": movie.get("url"),
                "title_id": imdb_id_from_url(movie.get("url")),
                "rating": float(agg["ratingValue"]) if agg.get("ratingValue") else None,
                "votes": int(str(agg["ratingCount"]).replace(",", "")) if agg.get("ratingCount") else None,
            }
        )

    return rows

This is the cleanest path because it avoids depending on IMDb's current card layout.

Step 4: Save a weekly snapshot

from pathlib import Path
import pandas as pd


def snapshot_top250(snapshot_date: str) -> pd.DataFrame:
    html = get_rendered_html()
    rows = parse_top250(html)
    df = pd.DataFrame(rows).sort_values("rank").reset_index(drop=True)
    df["snapshot_date"] = snapshot_date
    return df


def save_snapshot(df: pd.DataFrame, output_dir: str = "data") -> Path:
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"imdb_top250_{df['snapshot_date'].iloc[0]}.csv"
    df.to_csv(path, index=False)
    return path
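
Because the filenames embed an ISO date, the previous week's file can be located by name rather than hard-coded. The following is a rough sketch under that assumption; find_previous_snapshot is a hypothetical helper, and it relies on the imdb_top250_<date>.csv naming used by save_snapshot above.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional

PREFIX = "imdb_top250_"


def find_previous_snapshot(output_dir: str, current_date: str) -> Optional[Path]:
    """Return the newest snapshot file dated strictly before current_date."""
    candidates = []
    for path in Path(output_dir).glob(f"{PREFIX}*.csv"):
        stamp = path.stem[len(PREFIX):]
        try:
            # Skips files such as imdb_top250_changes_<date>.csv.
            date.fromisoformat(stamp)
        except ValueError:
            continue
        # ISO dates sort correctly as plain strings.
        if stamp < current_date:
            candidates.append((stamp, path))
    return max(candidates)[1] if candidates else None


# Demo against empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "imdb_top250_2026-05-20.csv",
        "imdb_top250_2026-05-27.csv",
        "imdb_top250_2026-06-03.csv",
        "imdb_top250_changes_2026-06-03.csv",
    ):
        (Path(tmp) / name).touch()
    found = find_previous_snapshot(tmp, "2026-06-03")
    found_name = found.name if found else None
    first_run = find_previous_snapshot(tmp, "2026-05-20")

print(found_name)
print(first_run)
```

Returning None on the first run lets the caller skip the comparison instead of crashing on a missing file.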

Step 5: Compare this week vs last week

This is where the tracker becomes more interesting than a one-off scrape.

def compare_snapshots(current: pd.DataFrame, previous: pd.DataFrame) -> pd.DataFrame:
    merged = current.merge(
        previous[["title_id", "rank", "rating", "votes"]],
        on="title_id",
        how="left",
        suffixes=("", "_prev"),
    )

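    # Positive rank_change means the film moved up the chart.
    # Titles that are new this week have no *_prev values, so their changes are NaN.
    merged["rank_change"] = merged["rank_prev"] - merged["rank"]
    merged["rating_change"] = merged["rating"] - merged["rating_prev"]
    merged["votes_change"] = merged["votes"] - merged["votes_prev"]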

    return merged[
        [
            "title_id",
            "title",
            "rank",
            "rank_prev",
            "rank_change",
            "rating",
            "rating_prev",
            "rating_change",
            "votes",
            "votes_prev",
            "votes_change",
        ]
    ].sort_values(["rank"])

If a film moves from rank 8 to rank 5, rank_change will be +3.
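
The sign convention, and two edge cases a left merge handles quietly, can be checked with plain dictionaries. This is a small illustrative sketch, not part of the tracker itself: rank_changes and dropped_titles are hypothetical helpers, and the IDs and ranks are made up. New entrants have no previous rank, and titles that fell off the chart do not appear in a left merge on the current week at all.

```python
from typing import Dict, List, Optional


def rank_changes(current: Dict[str, int], previous: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Map title_id to previous rank minus current rank (None for new entries)."""
    return {
        title_id: (previous[title_id] - rank) if title_id in previous else None
        for title_id, rank in current.items()
    }


def dropped_titles(current: Dict[str, int], previous: Dict[str, int]) -> List[str]:
    """Title IDs that were ranked last week but are missing this week."""
    return sorted(set(previous) - set(current))


# Placeholder data: title_id -> rank.
previous = {"tt0000001": 8, "tt0000002": 5, "tt0000003": 250}
current = {"tt0000001": 5, "tt0000002": 6, "tt0000004": 250}

changes = rank_changes(current, previous)
dropped = dropped_titles(current, previous)

print(changes)
print(dropped)
```

Here tt0000001 climbs from 8 to 5 and gets +3, tt0000002 slips one place and gets -1, tt0000004 is new, and tt0000003 dropped out.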

End-to-end run

if __name__ == "__main__":
    current = snapshot_top250("2026-06-03")
    current_path = save_snapshot(current)
    print("saved snapshot:", current_path)

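    previous_path = Path("data/imdb_top250_2026-05-27.csv")
    if not previous_path.exists():
        # First run: there is nothing to compare against yet.
        raise SystemExit("no previous snapshot found; run again next week")
    previous = pd.read_csv(previous_path)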
    changes = compare_snapshots(current, previous)
    changes.to_csv("data/imdb_top250_changes_2026-06-03.csv", index=False)

    movers = changes.loc[changes["rank_change"].fillna(0) != 0, ["title", "rank", "rank_prev", "rank_change"]]
    print(movers.head(15).to_string(index=False))

Example output:

                title  rank  rank_prev  rank_change
         12 Angry Men     5          6            1
The Lord of the Rings     7          5           -2

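To route the browser through ProxiesAPI, set the proxy URL in your environment:

export PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.proxiesapi.com:PORT"

build_driver() already reads that variable, so the only line that changes how Chrome connects is:

options.add_argument(f"--proxy-server={proxy_url}")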

That is the right boundary:

Selenium renders
BeautifulSoup parses
ProxiesAPI helps the requests arrive more consistently
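
One caveat is worth checking at that boundary. As far as I am aware, Chrome's --proxy-server flag accepts a scheme, host, and port but does not use a username and password embedded in the URL; treat that as an assumption to verify against your Chrome version and your provider's documentation. The sketch below builds the flag from the host and port only and reports whether the URL carried credentials that would need another mechanism, such as IP allowlisting. chrome_proxy_arg and has_inline_credentials are hypothetical helper names, and proxy.example.com is a placeholder host.

```python
from urllib.parse import urlsplit


def chrome_proxy_arg(proxy_url: str) -> str:
    """Build a --proxy-server argument from scheme, host, and port only."""
    parts = urlsplit(proxy_url)
    if not parts.hostname or parts.port is None:
        raise ValueError("Proxy URL must include a host and a numeric port")
    return f"--proxy-server={parts.scheme}://{parts.hostname}:{parts.port}"


def has_inline_credentials(proxy_url: str) -> bool:
    """True if the URL embeds a username, which the flag is assumed to ignore."""
    return urlsplit(proxy_url).username is not None


# Placeholder URL; substitute the value your provider gives you.
example_url = "http://user:secret@proxy.example.com:8080"
arg = chrome_proxy_arg(example_url)
needs_other_auth = has_inline_credentials(example_url)

print(arg)
print(needs_other_auth)
```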

Final notes

Two practical safeguards will save you time:

Fail fast if you extract fewer than 200 rows.
Keep snapshots keyed by title_id, not title text alone.
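
Both safeguards fit in one small check that runs before a snapshot is saved. This is a minimal sketch assuming rows shaped like the output of parse_top250; validate_rows is a hypothetical helper, and the 200-row threshold simply mirrors the rule of thumb above.

```python
from collections import Counter
from typing import Dict, List


def validate_rows(rows: List[Dict], min_rows: int = 200) -> None:
    """Raise ValueError if the scrape looks truncated or badly keyed."""
    if len(rows) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(rows)}")

    missing = [row for row in rows if not row.get("title_id")]
    if missing:
        raise ValueError(f"{len(missing)} rows have no title_id")

    counts = Counter(row["title_id"] for row in rows)
    dupes = [title_id for title_id, n in counts.items() if n > 1]
    if dupes:
        raise ValueError(f"Duplicate title_id values: {dupes[:5]}")


# Placeholder rows with synthetic IDs, just to exercise the checks.
good_rows = [{"title_id": f"tt{i:07d}", "rank": i} for i in range(1, 251)]
validate_rows(good_rows)  # passes silently

try:
    validate_rows(good_rows[:10])
except ValueError as exc:
    message = str(exc)
    print("rejected:", message)
```

Calling it between parse_top250 and save_snapshot keeps a partial page load from being written as if it were a full week of data.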

With that, you have a weekly IMDb Top 250 tracker that can chart:

rank movement
rating drift
vote growth

And that is much more interesting than a static CSV.
