Scrape IMDb Top 250 Movies into a Dataset

IMDb’s Top 250 is a great scraping target because it looks simple and teaches the exact lesson most scraping tutorials skip:

the HTML you get from a raw request is not always the same as the page your browser sees.

That matters a lot.

If you guess selectors from an old blog post, you will end up with a scraper that looks polished and fails in the real world.

So in this guide, we’ll do it properly:

  • inspect the live IMDb Top 250 page
  • verify the real fields exposed in the rendered page
  • build a Python scraper with requests and BeautifulSoup
  • export the data to CSV and JSON
  • show how to swap in ProxiesAPI when the fetch layer becomes unreliable

After checking the live page, we can verify that IMDb exposes rows with fields like:

  • rank (#1, #2, #3)
  • title (The Shawshank Redemption, The Godfather)
  • year (1994, 1972)
  • runtime (2h 22m, 2h 55m)
  • rating (9.3, 9.2)
  • vote count (3.2M, 2.2M)
Need a simpler fetch layer when direct requests get flaky?

Once your parser is solid, the fragile part becomes the network. ProxiesAPI lets you keep the same parsing code while swapping in a cleaner fetch URL for large or unreliable crawls.


What we are scraping

The target page is:

  • https://www.imdb.com/chart/top/

In the browser-rendered page, the first few entries show the exact structure we need:

#1  The Shawshank Redemption  1994  2h 22m  IMDb rating: 9.3 (3.2M)
#2  The Godfather             1972  2h 55m  IMDb rating: 9.2 (2.2M)
#3  The Dark Knight           2008  2h 32m  IMDb rating: 9.1 (3.1M)

That is enough to define a reliable output schema.
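One way to pin that schema down before writing any parsing code is a typed row definition. This is a sketch (the final script uses plain dicts), but it documents which fields may be missing:

```python
from typing import Optional, TypedDict


class Movie(TypedDict):
    """One Top 250 row; optional fields may be None if parsing misses them."""
    rank: Optional[int]
    title: str
    year: Optional[int]
    runtime: Optional[str]   # display string, e.g. "2h 22m"
    rating: Optional[float]  # e.g. 9.3
    votes: Optional[str]     # display string, e.g. "3.2M"
    url: str


example: Movie = {
    "rank": 1,
    "title": "The Shawshank Redemption",
    "year": 1994,
    "runtime": "2h 22m",
    "rating": 9.3,
    "votes": "3.2M",
    "url": "https://www.imdb.com/title/tt0111161/",
}
```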


Install dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We will use:

  • requests for HTTP
  • BeautifulSoup for HTML parsing
  • csv and json from the standard library for export

Step 1: Start with a fetch function

Use an explicit user agent and real timeouts.

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0"
}
TIMEOUT = (10, 30)


def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()
    return response.text

Test it:

html = fetch_html("https://www.imdb.com/chart/top/")
print("html length:", len(html))
print(html[:200])

Why this matters

In our testing, a direct request to https://www.imdb.com/chart/top/ sometimes returned an empty body with an HTTP 202 status, while the browser-rendered page exposed the real chart content. That is exactly why you should separate:

  • fetching
  • parsing
  • validation

A scraper should never assume that “HTTP worked” means “the page content is usable.”
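A cheap guard is to check the body for an expected content marker before handing it to the parser. This is a hypothetical helper, not part of the final script, and the marker choice (`/title/tt`) is just the pattern we rely on later:

```python
def looks_like_chart(html: str) -> bool:
    """Heuristic check that the body actually contains chart content,
    rather than an empty 202 response or an interstitial page."""
    return bool(html) and "/title/tt" in html


# Usage sketch:
# html = fetch_html("https://www.imdb.com/chart/top/")
# if not looks_like_chart(html):
#     raise RuntimeError("fetched page does not look like the Top 250 chart")
```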


Step 2: Extract the title links

On the live rendered page, title links use IMDb title URLs such as /title/tt0111161/.

That gives us a reliable first anchor:

from bs4 import BeautifulSoup


def extract_title_links(html: str):
    soup = BeautifulSoup(html, "lxml")
    return soup.select('a[href^="/title/tt"]')


links = extract_title_links(html)
print("title links found:", len(links))
for a in links[:5]:
    print(a.get_text(" ", strip=True), a.get("href"))

Example output:

title links found: 250
The Shawshank Redemption /title/tt0111161/
The Godfather /title/tt0068646/
The Dark Knight /title/tt0468569/
The Godfather Part II /title/tt0071562/
12 Angry Men /title/tt0050083/

That confirms the parser is pointed at the right entities.


Step 3: Parse nearby metadata

IMDb’s exact markup can evolve, so the safest strategy is:

  1. find each movie title link
  2. move up to a nearby card/container
  3. extract nearby text for rank, year, runtime, rating, and votes

Here is a practical parser that works from repeated title anchors and nearby text.

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.imdb.com"

RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")


def clean_texts(container):
    return [t.strip() for t in container.stripped_strings if t.strip()]


def parse_movie_rows(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    movies = []
    seen = set()

    for link in soup.select('a[href^="/title/tt"]'):
        href = link.get("href")
        title = link.get_text(" ", strip=True)

        if not href or not title or href in seen:
            continue
        seen.add(href)

        container = link
        for _ in range(6):
            if len(list(container.stripped_strings)) >= 6:
                break
            if container.parent is None:
                break
            container = container.parent

        if not container:
            continue

        texts = clean_texts(container)
        joined = " | ".join(texts)

        rank_match = RANK_RE.search(joined)
        year_match = YEAR_RE.search(joined)
        runtime_match = RUNTIME_RE.search(joined)
        rating_match = RATING_RE.search(joined)
        votes_match = VOTES_RE.search(joined)

        movies.append({
            "rank": int(rank_match.group(1)) if rank_match else None,
            "title": title,
            "year": int(year_match.group(1)) if year_match else None,
            "runtime": runtime_match.group(0) if runtime_match else None,
            "rating": float(rating_match.group(1)) if rating_match else None,
            "votes": votes_match.group(1) if votes_match else None,
            "url": urljoin(BASE_URL, href),
        })

    movies.sort(key=lambda row: (row["rank"] is None, row["rank"] or 9999))
    return movies

Step 4: Run the parser

html = fetch_html("https://www.imdb.com/chart/top/")
movies = parse_movie_rows(html)

print("rows:", len(movies))
for movie in movies[:5]:
    print(movie)

Example terminal output:

rows: 250
{'rank': 1, 'title': 'The Shawshank Redemption', 'year': 1994, 'runtime': '2h 22m', 'rating': 9.3, 'votes': '3.2M', 'url': 'https://www.imdb.com/title/tt0111161/'}
{'rank': 2, 'title': 'The Godfather', 'year': 1972, 'runtime': '2h 55m', 'rating': 9.2, 'votes': '2.2M', 'url': 'https://www.imdb.com/title/tt0068646/'}
{'rank': 3, 'title': 'The Dark Knight', 'year': 2008, 'runtime': '2h 32m', 'rating': 9.1, 'votes': '3.1M', 'url': 'https://www.imdb.com/title/tt0468569/'}
{'rank': 4, 'title': 'The Godfather Part II', 'year': 1974, 'runtime': '3h 22m', 'rating': 9.0, 'votes': '1.5M', 'url': 'https://www.imdb.com/title/tt0071562/'}
{'rank': 5, 'title': '12 Angry Men', 'year': 1957, 'runtime': '1h 36m', 'rating': 9.0, 'votes': '924K', 'url': 'https://www.imdb.com/title/tt0050083/'}

That gives you a clean dataset shape that works for analysis, ranking snapshots, or enrichment jobs.
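As a taste of what that shape enables, here is a quick aggregate over the parsed rows, with a two-row sample standing in for the full `movies` list:

```python
from collections import defaultdict

# Sample rows in the shape produced by parse_movie_rows()
movies = [
    {"rank": 1, "title": "The Shawshank Redemption", "year": 1994, "rating": 9.3},
    {"rank": 2, "title": "The Godfather", "year": 1972, "rating": 9.2},
]

# Group ratings by decade, skipping rows where parsing missed a field
by_decade = defaultdict(list)
for row in movies:
    if row["year"] and row["rating"]:
        by_decade[row["year"] // 10 * 10].append(row["rating"])

for decade in sorted(by_decade):
    ratings = by_decade[decade]
    print(f"{decade}s: {sum(ratings) / len(ratings):.2f} avg over {len(ratings)} films")
```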


Step 5: Export to CSV

import csv


def write_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["rank", "title", "year", "runtime", "rating", "votes", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


write_csv(movies, "imdb_top_250.csv")
print("wrote imdb_top_250.csv")

The beginning of the CSV will look like this:

rank,title,year,runtime,rating,votes,url
1,The Shawshank Redemption,1994,2h 22m,9.3,3.2M,https://www.imdb.com/title/tt0111161/
2,The Godfather,1972,2h 55m,9.2,2.2M,https://www.imdb.com/title/tt0068646/
3,The Dark Knight,2008,2h 32m,9.1,3.1M,https://www.imdb.com/title/tt0468569/

Step 6: Export to JSON

import json


def write_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


write_json(movies, "imdb_top_250.json")
print("wrote imdb_top_250.json")

This is useful when you want to feed the dataset into notebooks, dashboards, or downstream APIs.
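Reading the file back is one call, which makes round-tripping into a notebook trivial. This sketch writes a one-row sample so it is self-contained:

```python
import json

# Round-trip sketch: write a tiny sample, then read it back
sample = [{"rank": 1, "title": "The Shawshank Redemption", "rating": 9.3}]
with open("imdb_top_250.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)

with open("imdb_top_250.json", encoding="utf-8") as f:
    movies = json.load(f)

print(movies[0]["title"], movies[0]["rating"])  # The Shawshank Redemption 9.3
```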


Full script

Here is the complete scraper in one file.

import csv
import json
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.imdb.com"
CHART_URL = f"{BASE_URL}/chart/top/"
HEADERS = {"User-Agent": "Mozilla/5.0"}
TIMEOUT = (10, 30)

RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")


def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()
    return response.text


def clean_texts(container):
    return [t.strip() for t in container.stripped_strings if t.strip()]


def parse_movie_rows(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    movies = []
    seen = set()

    for link in soup.select('a[href^="/title/tt"]'):
        href = link.get("href")
        title = link.get_text(" ", strip=True)

        if not href or not title or href in seen:
            continue
        seen.add(href)

        container = link
        for _ in range(6):
            if len(list(container.stripped_strings)) >= 6:
                break
            if container.parent is None:
                break
            container = container.parent

        if not container:
            continue

        texts = clean_texts(container)
        joined = " | ".join(texts)

        rank_match = RANK_RE.search(joined)
        year_match = YEAR_RE.search(joined)
        runtime_match = RUNTIME_RE.search(joined)
        rating_match = RATING_RE.search(joined)
        votes_match = VOTES_RE.search(joined)

        movies.append({
            "rank": int(rank_match.group(1)) if rank_match else None,
            "title": title,
            "year": int(year_match.group(1)) if year_match else None,
            "runtime": runtime_match.group(0) if runtime_match else None,
            "rating": float(rating_match.group(1)) if rating_match else None,
            "votes": votes_match.group(1) if votes_match else None,
            "url": urljoin(BASE_URL, href),
        })

    movies.sort(key=lambda row: (row["rank"] is None, row["rank"] or 9999))
    return movies


def write_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["rank", "title", "year", "runtime", "rating", "votes", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def write_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    html = fetch_html(CHART_URL)
    movies = parse_movie_rows(html)
    print(f"parsed {len(movies)} movies")
    print(movies[:3])
    write_csv(movies, "imdb_top_250.csv")
    write_json(movies, "imdb_top_250.json")
    print("done")

How to use ProxiesAPI for the fetch step

If direct requests become inconsistent, keep the parser exactly the same and only change the fetch URL.

The ProxiesAPI format is:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.imdb.com/chart/top/"

And in Python:

from urllib.parse import quote_plus
import requests


def fetch_via_proxiesapi(target_url: str, api_key: str) -> str:
    api_url = f"http://api.proxiesapi.com/?key={api_key}&url={quote_plus(target_url)}"
    response = requests.get(api_url, timeout=(10, 30))
    response.raise_for_status()
    return response.text


html = fetch_via_proxiesapi("https://www.imdb.com/chart/top/", "API_KEY")
movies = parse_movie_rows(html)
print(movies[:3])

That is the clean separation you want in any production scraper:

  • one function for fetching
  • one parser for extracting
  • one exporter for saving
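With that separation, a fallback fetcher is only a few lines. This is a sketch: it takes the two fetch functions as parameters (you would pass in `fetch_html` and a wrapper around `fetch_via_proxiesapi` from earlier), tries the direct path first, and falls back when the request fails or the body is missing the expected content marker:

```python
import requests


def fetch_with_fallback(url: str, direct, proxied, marker: str = "/title/tt") -> str:
    """Try `direct(url)` first; fall back to `proxied(url)` when the direct
    fetch raises or returns a body without the expected content marker."""
    try:
        html = direct(url)
        if marker in html:
            return html
    except requests.RequestException:
        pass
    return proxied(url)


# Usage sketch (functions from earlier in the guide):
# html = fetch_with_fallback(CHART_URL, fetch_html,
#                            lambda u: fetch_via_proxiesapi(u, "API_KEY"))
```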

Practical QA checks

Before you trust the dataset, validate it.

Check that:

  • row count is close to 250
  • top rows include The Shawshank Redemption, The Godfather, and The Dark Knight
  • rank values are integers from 1 upward
  • ratings are populated for top rows
  • URLs point to /title/tt.../

A quick validation helper:

assert len(movies) >= 200, "too few rows parsed"
assert movies[0]["title"] == "The Shawshank Redemption"
assert movies[0]["rank"] == 1
assert movies[0]["rating"] >= 9.0

These assertions catch parser drift early.
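If you want those checks in reusable form, a small validator (a hypothetical helper, not part of the script above) can collect every problem instead of stopping at the first failed assert:

```python
def validate_movies(movies: list[dict]) -> list[str]:
    """Return a list of human-readable QA problems; empty means the dataset passed."""
    problems = []
    if len(movies) < 200:
        problems.append(f"too few rows: {len(movies)}")
    ranks = [m["rank"] for m in movies if m.get("rank") is not None]
    if ranks and ranks[0] != 1:
        problems.append(f"first rank is {ranks[0]}, expected 1")
    missing_ratings = sum(1 for m in movies[:10] if m.get("rating") is None)
    if missing_ratings:
        problems.append(f"{missing_ratings} of the top 10 rows have no rating")
    bad_urls = [m for m in movies if "/title/tt" not in m.get("url", "")]
    if bad_urls:
        problems.append(f"{len(bad_urls)} rows have non-title URLs")
    return problems
```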


Final takeaway

The interesting part of this scraper is not the CSV export.

It is the workflow discipline:

  • verify the real page first
  • avoid invented selectors
  • keep fetch and parse logic separate
  • validate the output before you trust it

That is what turns a one-off tutorial into code you can actually build on.

And when direct requests stop being predictable, you do not need to rewrite the scraper. You just swap the fetch layer to a ProxiesAPI URL and keep the parser intact.


Related guides

How to Scrape Trustpilot Reviews for Any Company
Pull ratings, dates, reviewer names, and review text into a clean CSV for reputation monitoring.
How to Scrape Wikipedia Tables into CSV with Python
Turn messy HTML tables into structured datasets you can analyze with pandas in minutes.
How to Scrape GitHub Releases with Python (Versions + Notes + Diffs)
Scrape a GitHub Releases page, extract versions and release notes, and store structured data so you can alert on changes.
How to Scrape GitHub Trending with Python (and Export to CSV/JSON)
A practical GitHub Trending scraper: fetch the Trending page, extract repo names + language + stars, and export a clean dataset.