Scrape IMDb Top 250 into a Weekly Tracker (Rank Changes, Ratings, Votes)

IMDb's Top 250 is perfect tracking data.

It changes slowly, but not randomly. Rankings drift. Vote counts climb. Ratings compress at the top. And if you take a clean weekly snapshot, you can answer questions like:

  • which films moved up or down this week?
  • which titles gained the most votes?
  • where are ratings stable even when rank changes?

Plain requests.get() is not a reliable starting point here. In this environment, a raw fetch of the IMDb chart returned a blocked response, while a browser-rendered page loaded correctly, so the scraper builds on the browser path.


Use ProxiesAPI when IMDb starts blocking the boring part

For IMDb, the hard part is often just getting a stable page load. ProxiesAPI can sit underneath your browser automation as the proxy layer while your extraction code stays unchanged.


The scraping strategy

IMDb changes frontend classes often, so I would not anchor this scraper to one deep CSS chain.

The safer pattern is:

  1. render the page in a browser
  2. grab the page source
  3. parse the structured data first
  4. use DOM selectors only for fields the structured blob does not give you cleanly

The two selectors worth remembering are:

  • title links: a[href*="/title/tt"]
  • structured data: script[type="application/ld+json"]

Install

python -m venv .venv
source .venv/bin/activate
pip install selenium beautifulsoup4 lxml pandas

Step 1: Load the rendered page

from __future__ import annotations

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def build_driver() -> webdriver.Chrome:
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1440,2400")
    options.add_argument(
        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36"
    )

    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    if proxy_url:
        options.add_argument(f"--proxy-server={proxy_url}")

    return webdriver.Chrome(options=options)


def get_rendered_html() -> str:
    driver = build_driver()
    try:
        driver.get("https://www.imdb.com/chart/top/")
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href*="/title/tt"]'))
        )
        return driver.page_source
    finally:
        driver.quit()

That wait condition matters. If you do not see title links, you do not have the chart.
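The wait returns as soon as a single title link is present, so it can help to check the returned page source as well. The sketch below is one way to do that with the standard library; the helper names and the 200-title threshold are illustrative assumptions, not part of the original scraper.

```python
import re

# Assumption: chart links contain /title/tt<digits>, the same pattern
# the wait condition's CSS selector relies on.
TITLE_ID_RE = re.compile(r"/title/(tt\d+)")


def count_title_ids(html: str) -> int:
    """Return the number of distinct IMDb title IDs linked from the page source."""
    return len(set(TITLE_ID_RE.findall(html)))


def looks_like_chart(html: str, min_titles: int = 200) -> bool:
    """Heuristic: a real Top 250 page should link to a couple hundred distinct titles."""
    return count_title_ids(html) >= min_titles
```

Calling looks_like_chart(driver.page_source) before returning from get_rendered_html() would turn a challenge page or a half-loaded chart into an early, obvious failure.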


Step 2: Parse the JSON-LD item list

IMDb usually ships structured metadata in application/ld+json. That is more stable than scraping presentation classes.

import json
from bs4 import BeautifulSoup


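def extract_itemlist(soup: BeautifulSoup) -> dict:
    for node in soup.select('script[type="application/ld+json"]'):
        text = node.string or node.get_text()
        if not text:
            continue
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip a malformed block instead of aborting the whole parse
        # A JSON-LD script may hold a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for candidate in candidates:
            if isinstance(candidate, dict) and candidate.get("@type") == "ItemList":
                return candidate
    raise ValueError("Could not find IMDb ItemList JSON-LD")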

From there, each movie is typically in itemListElement with a position and nested item payload.
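To make that shape concrete, here is a small hand-written payload and how the fields are reached. The field names follow schema.org's ItemList and ListItem vocabulary; the exact fields IMDb ships are an assumption, and the film, URL, and numbers are invented for illustration.

```python
import json

# Illustrative payload only: not copied from IMDb. Field names follow
# schema.org conventions; all values are made up.
SAMPLE_JSON_LD = """
{
  "@type": "ItemList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "item": {
        "@type": "Movie",
        "name": "Example Film",
        "url": "https://www.imdb.com/title/tt0000001/",
        "aggregateRating": {"ratingValue": 9.1, "ratingCount": 1234567}
      }
    }
  ]
}
"""

data = json.loads(SAMPLE_JSON_LD)
first = data["itemListElement"][0]   # one chart entry
movie = first["item"]                # nested movie payload
summary = (first["position"], movie["name"], movie["aggregateRating"]["ratingValue"])
print(summary)  # (1, 'Example Film', 9.1)
```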


Step 3: Turn the item list into rows

import re


def imdb_id_from_url(url: str) -> str | None:
    # No trailing slash in the pattern, so URLs like ".../title/tt0111161" also match.
    m = re.search(r"/title/(tt\d+)", url or "")
    return m.group(1) if m else None


def parse_top250(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    itemlist = extract_itemlist(soup)
    rows = []

    for idx, entry in enumerate(itemlist.get("itemListElement", []), start=1):
        movie = entry.get("item", {})
        agg = movie.get("aggregateRating", {}) or {}
        rows.append(
            {
                # Fall back to list order if an entry carries no explicit position.
                "rank": int(entry.get("position") or idx),
                "title": movie.get("name"),
                "url": movie.get("url"),
                "title_id": imdb_id_from_url(movie.get("url")),
                "rating": float(agg["ratingValue"]) if agg.get("ratingValue") else None,
                "votes": int(str(agg["ratingCount"]).replace(",", "")) if agg.get("ratingCount") else None,
            }
        )

    return rows

This is the cleanest path because it avoids depending on IMDb's current card layout.


Step 4: Save a weekly snapshot

from pathlib import Path
import pandas as pd


def snapshot_top250(snapshot_date: str) -> pd.DataFrame:
    html = get_rendered_html()
    rows = parse_top250(html)
    df = pd.DataFrame(rows).sort_values("rank").reset_index(drop=True)
    df["snapshot_date"] = snapshot_date
    return df


def save_snapshot(df: pd.DataFrame, output_dir: str = "data") -> Path:
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"imdb_top250_{df['snapshot_date'].iloc[0]}.csv"
    df.to_csv(path, index=False)
    return path

Step 5: Compare this week vs last week

This is where the tracker becomes more interesting than a one-off scrape.

def compare_snapshots(current: pd.DataFrame, previous: pd.DataFrame) -> pd.DataFrame:
    merged = current.merge(
        previous[["title_id", "rank", "rating", "votes"]],
        on="title_id",
        how="left",
        suffixes=("", "_prev"),
    )

    merged["rank_change"] = merged["rank_prev"] - merged["rank"]
    merged["rating_change"] = merged["rating"] - merged["rating_prev"]
    merged["votes_change"] = merged["votes"] - merged["votes_prev"]

    return merged[
        [
            "title_id",
            "title",
            "rank",
            "rank_prev",
            "rank_change",
            "rating",
            "rating_prev",
            "rating_change",
            "votes",
            "votes_prev",
            "votes_change",
        ]
    ].sort_values(["rank"])

If a film moves from rank 8 to rank 5, rank_change will be +3.
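Because the merge is a left join from the current snapshot, a title that is new to the chart has no previous row (its rank_change is NaN), and a title that dropped off the chart does not appear at all. A minimal sketch for surfacing both cases from the title_id columns follows; the function name chart_turnover is hypothetical.

```python
def chart_turnover(current_ids: list[str], previous_ids: list[str]) -> dict[str, list[str]]:
    """Split title IDs into chart entrants and dropouts between two snapshots."""
    current, previous = set(current_ids), set(previous_ids)
    return {
        "entered": sorted(current - previous),  # only in this week's chart
        "dropped": sorted(previous - current),  # only in last week's chart
    }
```

With the DataFrames above, this could be called as chart_turnover(current["title_id"].tolist(), previous["title_id"].tolist()).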


End-to-end run

if __name__ == "__main__":
    current = snapshot_top250("2026-06-03")
    current_path = save_snapshot(current)
    print("saved snapshot:", current_path)

    previous = pd.read_csv("data/imdb_top250_2026-05-27.csv")
    changes = compare_snapshots(current, previous)
    changes.to_csv("data/imdb_top250_changes_2026-06-03.csv", index=False)

    movers = changes.loc[changes["rank_change"].fillna(0) != 0, ["title", "rank", "rank_prev", "rank_change"]]
    print(movers.head(15).to_string(index=False))

Example output:

                title  rank  rank_prev  rank_change
         12 Angry Men     5          6            1
The Lord of the Rings     7          5           -2
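The run above hard-codes last week's file name. One way to avoid that is to look up the newest earlier snapshot on disk. The sketch below assumes the imdb_top250_YYYY-MM-DD.csv naming used by save_snapshot; the helper name latest_snapshot_before is hypothetical.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

# Matches snapshot files such as imdb_top250_2026-06-03.csv and deliberately
# excludes the imdb_top250_changes_*.csv diff files.
SNAPSHOT_RE = re.compile(r"^imdb_top250_(\d{4}-\d{2}-\d{2})\.csv$")


def latest_snapshot_before(snapshot_date: str, output_dir: str = "data") -> Optional[Path]:
    """Return the newest snapshot file dated strictly before snapshot_date, if any."""
    cutoff = date.fromisoformat(snapshot_date)
    candidates = []
    for path in Path(output_dir).glob("imdb_top250_*.csv"):
        match = SNAPSHOT_RE.match(path.name)
        if match and date.fromisoformat(match.group(1)) < cutoff:
            candidates.append((match.group(1), path))
    if not candidates:
        return None
    # ISO dates sort chronologically as plain strings.
    return max(candidates, key=lambda pair: pair[0])[1]
```

On the very first run this returns None, which is a natural signal to save the snapshot and skip the comparison.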

Why this approach is more robust

1. JSON-LD is usually steadier than UI classes

IMDb can rename layout classes. It is less likely to stop publishing a machine-readable item list for a flagship page.

2. Browser rendering solves the blocked raw request problem

If direct requests start returning partial HTML, challenge pages, or empty content, the browser route gives you the post-render DOM.

3. The weekly diff is the product

Most people stop at "I scraped the list." The useful version is "I can see what changed."


Where ProxiesAPI fits

If your browser automation becomes flaky because of IP reputation, keep the Selenium logic and add a proxy layer:

export PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.proxiesapi.com:PORT"

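build_driver() already reads that variable, so the only proxy-specific line in the code is:

options.add_argument(f"--proxy-server={proxy_url}")

One caveat: Chrome typically does not honor a USER:PASS prefix passed through --proxy-server, so an authenticated proxy may need IP allowlisting or a separate credential mechanism on top of this flag.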

That is the right boundary:

  • Selenium renders
  • BeautifulSoup parses
  • ProxiesAPI helps the requests arrive more consistently

Final notes

Two practical safeguards will save you time:

  1. Fail fast if you extract fewer than 200 rows.
  2. Keep snapshots keyed by title_id, not title text alone.
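Both safeguards can be expressed as a small check that runs before a snapshot is saved. This is a minimal sketch: the function name validate_rows is hypothetical, and it assumes rows shaped like the dicts returned by parse_top250.

```python
def validate_rows(rows: list[dict], min_rows: int = 200) -> None:
    """Raise ValueError if the scrape looks truncated or is not keyed cleanly."""
    if len(rows) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(rows)}")
    ids = [row.get("title_id") for row in rows]
    if any(title_id is None for title_id in ids):
        raise ValueError("some rows are missing title_id")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate title_id values in snapshot")
```

Calling validate_rows(rows) inside snapshot_top250, right after parse_top250, would keep a partial scrape from being written to disk and later compared against a full one.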

With that, you have a weekly IMDb Top 250 tracker that can chart:

  • rank movement
  • rating drift
  • vote growth

And that is much more interesting than a static CSV.
