Scrape IMDb Top 250 Movies into a Dataset
IMDb’s Top 250 is a great scraping target because it looks simple and teaches the exact lesson most scraping tutorials skip:
the HTML you get from a raw request is not always the same as the page your browser sees.
That matters a lot.
If you guess selectors from an old blog post, you will write a tutorial that looks polished and fails in the real world.
So in this guide, we’ll do it properly:
- inspect the live IMDb Top 250 page
- verify the real fields exposed in the rendered page
- build a Python scraper with requests and BeautifulSoup
- export the data to CSV and JSON
- show how to swap in ProxiesAPI when the fetch layer becomes unreliable
After checking the live page, we can verify that IMDb exposes rows with fields like:
- rank (#1, #2, #3)
- title (The Shawshank Redemption, The Godfather)
- year (1994, 1972)
- runtime (2h 22m, 2h 55m)
- rating (9.3, 9.2)
- vote count (3.2M, 2.2M)
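Those fields define one row per movie. As a sketch of the output schema (the Movie name is our own, not anything IMDb exposes), it could be written as a TypedDict:

```python
from typing import Optional, TypedDict

# Hypothetical schema for one scraped row; the field names match the
# CSV/JSON columns used later in this guide.
class Movie(TypedDict):
    rank: Optional[int]       # 1-based chart position
    title: str                # e.g. "The Shawshank Redemption"
    year: Optional[int]       # release year, e.g. 1994
    runtime: Optional[str]    # raw text such as "2h 22m"
    rating: Optional[float]   # IMDb rating, e.g. 9.3
    votes: Optional[str]      # abbreviated vote count, e.g. "3.2M"
    url: str                  # absolute IMDb title URL

row: Movie = {
    "rank": 1,
    "title": "The Shawshank Redemption",
    "year": 1994,
    "runtime": "2h 22m",
    "rating": 9.3,
    "votes": "3.2M",
    "url": "https://www.imdb.com/title/tt0111161/",
}
```

Optional fields stay None when a regex fails to match, so one malformed row never crashes the whole run.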
Once your parser is solid, the fragile part becomes the network. ProxiesAPI lets you keep the same parsing code while swapping in a cleaner fetch URL for large or unreliable crawls.
What we are scraping
The target page is:
https://www.imdb.com/chart/top/
In the browser-rendered page, the first few entries show the exact structure we need:
#1 The Shawshank Redemption 1994 2h 22m IMDb rating: 9.3 (3.2M)
#2 The Godfather 1972 2h 55m IMDb rating: 9.2 (2.2M)
#3 The Dark Knight 2008 2h 32m IMDb rating: 9.1 (3.1M)
That is enough to define a reliable output schema.
Install dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We will use:
- requests for HTTP
- BeautifulSoup for HTML parsing
- csv and json from the standard library for export
Step 1: Start with a fetch function
Use an explicit user agent and real timeouts.
import requests
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}
TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()
    return response.text
Test it:
html = fetch_html("https://www.imdb.com/chart/top/")
print("html length:", len(html))
print(html[:200])
Why this matters
While preparing this guide, a direct request to https://www.imdb.com/chart/top/ returned a 202 status with an empty body, while the browser-rendered page exposed the real chart content. That is exactly why you should separate:
- fetching
- parsing
- validation
A scraper should never assume that “HTTP worked” means “the page content is usable.”
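One way to encode that rule is a small guard that rejects bodies which are empty or missing an expected marker. This is a sketch; the marker string, threshold, and exception name are our own choices, not anything IMDb documents:

```python
class EmptyPageError(RuntimeError):
    """Raised when a fetch succeeds at the HTTP level but the body is unusable."""

def ensure_usable(html: str, marker: str = "/title/tt") -> str:
    # A 200/202 with a tiny or marker-free body usually means a consent
    # wall, a bot check, or a shell page waiting on JavaScript.
    if len(html) < 1000:
        raise EmptyPageError(f"body too short: {len(html)} bytes")
    if marker not in html:
        raise EmptyPageError(f"expected marker {marker!r} not found")
    return html
```

Calling `ensure_usable(fetch_html(url))` turns a silent empty page into a loud, debuggable failure.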
Step 2: Find the movie title links
On the live rendered page, title links use IMDb title URLs such as /title/tt0111161/.
That gives us a reliable first anchor:
from bs4 import BeautifulSoup
def extract_title_links(html: str):
    soup = BeautifulSoup(html, "lxml")
    return soup.select('a[href^="/title/tt"]')
links = extract_title_links(html)
print("title links found:", len(links))
for a in links[:5]:
    print(a.get_text(" ", strip=True), a.get("href"))
Example output:
title links found: 250
The Shawshank Redemption /title/tt0111161/
The Godfather /title/tt0068646/
The Dark Knight /title/tt0468569/
The Godfather Part II /title/tt0071562/
12 Angry Men /title/tt0050083/
That confirms the parser is pointed at the right entities.
Step 3: Parse nearby metadata
IMDb’s exact markup can evolve, so the safest strategy is:
- find each movie title link
- move up to a nearby card/container
- extract nearby text for rank, year, runtime, rating, and votes
Here is a practical parser that works from repeated title anchors and nearby text.
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
BASE_URL = "https://www.imdb.com"
RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")
def clean_texts(container):
    return [t.strip() for t in container.stripped_strings if t.strip()]
def parse_movie_rows(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    movies = []
    seen = set()
    for link in soup.select('a[href^="/title/tt"]'):
        href = link.get("href")
        title = link.get_text(" ", strip=True)
        if not href or not title or href in seen:
            continue
        seen.add(href)
        # Walk up the tree until the container holds enough text
        # fragments to include rank, year, runtime, rating, and votes.
        container = link
        for _ in range(6):
            if container and len(list(container.stripped_strings)) >= 6:
                break
            container = container.parent
        if not container:
            continue
        texts = clean_texts(container)
        joined = " | ".join(texts)
        rank_match = RANK_RE.search(joined)
        year_match = YEAR_RE.search(joined)
        runtime_match = RUNTIME_RE.search(joined)
        rating_match = RATING_RE.search(joined)
        votes_match = VOTES_RE.search(joined)
        movies.append({
            "rank": int(rank_match.group(1)) if rank_match else None,
            "title": title,
            "year": int(year_match.group(1)) if year_match else None,
            "runtime": runtime_match.group(0) if runtime_match else None,
            "rating": float(rating_match.group(1)) if rating_match else None,
            "votes": votes_match.group(1) if votes_match else None,
            "url": urljoin(BASE_URL, href),
        })
    movies.sort(key=lambda row: (row["rank"] is None, row["rank"] or 9999))
    return movies
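To see what those regexes actually extract, you can run them against the joined text of the first example row. The patterns below are the same ones defined above, repeated so the snippet runs standalone; the sample string is hand-written from the rendered page:

```python
import re

RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")

# Text roughly as clean_texts() would join it for the first row.
joined = "#1 | The Shawshank Redemption | 1994 | 2h 22m | IMDb rating: 9.3 | (3.2M)"

print(RANK_RE.search(joined).group(1))     # 1
print(YEAR_RE.search(joined).group(1))     # 1994
print(RUNTIME_RE.search(joined).group(0))  # 2h 22m
print(RATING_RE.search(joined).group(1))   # 9.3
print(VOTES_RE.search(joined).group(1))    # 3.2M
```

Testing patterns against a captured sample like this is the fastest way to debug extraction before re-fetching the page.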
Step 4: Run the parser
html = fetch_html("https://www.imdb.com/chart/top/")
movies = parse_movie_rows(html)
print("rows:", len(movies))
for movie in movies[:5]:
    print(movie)
Example terminal output:
rows: 250
{'rank': 1, 'title': 'The Shawshank Redemption', 'year': 1994, 'runtime': '2h 22m', 'rating': 9.3, 'votes': '3.2M', 'url': 'https://www.imdb.com/title/tt0111161/'}
{'rank': 2, 'title': 'The Godfather', 'year': 1972, 'runtime': '2h 55m', 'rating': 9.2, 'votes': '2.2M', 'url': 'https://www.imdb.com/title/tt0068646/'}
{'rank': 3, 'title': 'The Dark Knight', 'year': 2008, 'runtime': '2h 32m', 'rating': 9.1, 'votes': '3.1M', 'url': 'https://www.imdb.com/title/tt0468569/'}
{'rank': 4, 'title': 'The Godfather Part II', 'year': 1974, 'runtime': '3h 22m', 'rating': 9.0, 'votes': '1.5M', 'url': 'https://www.imdb.com/title/tt0071562/'}
{'rank': 5, 'title': '12 Angry Men', 'year': 1957, 'runtime': '1h 36m', 'rating': 9.0, 'votes': '924K', 'url': 'https://www.imdb.com/title/tt0050083/'}
That gives you a clean dataset shape that works for analysis, ranking snapshots, or enrichment jobs.
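As a quick example of the "analysis" claim, here is a sketch that counts movies per decade. The sample rows are hard-coded (in the shape parse_movie_rows produces) so the snippet is self-contained:

```python
from collections import Counter

# Sample rows in the shape produced by parse_movie_rows().
movies = [
    {"title": "The Shawshank Redemption", "year": 1994, "rating": 9.3},
    {"title": "The Godfather", "year": 1972, "rating": 9.2},
    {"title": "The Dark Knight", "year": 2008, "rating": 9.1},
    {"title": "12 Angry Men", "year": 1957, "rating": 9.0},
]

# Bucket each movie into its decade, skipping rows where the year
# regex failed and the field is None.
by_decade = Counter(
    f"{(m['year'] // 10) * 10}s" for m in movies if m["year"] is not None
)
print(by_decade)
```

Because None is possible for every parsed field, downstream code should always filter or default before doing arithmetic.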
Step 5: Export to CSV
import csv
def write_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["rank", "title", "year", "runtime", "rating", "votes", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
write_csv(movies, "imdb_top_250.csv")
print("wrote imdb_top_250.csv")
The beginning of the CSV will look like this:
rank,title,year,runtime,rating,votes,url
1,The Shawshank Redemption,1994,2h 22m,9.3,3.2M,https://www.imdb.com/title/tt0111161/
2,The Godfather,1972,2h 55m,9.2,2.2M,https://www.imdb.com/title/tt0068646/
3,The Dark Knight,2008,2h 32m,9.1,3.1M,https://www.imdb.com/title/tt0468569/
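A quick way to confirm the export round-trips is to read it back with csv.DictReader. The snippet uses an in-memory sample of the file head so it runs standalone; in practice you would open "imdb_top_250.csv" instead:

```python
import csv
import io

# Simulated head of the exported file.
sample = io.StringIO(
    "rank,title,year,runtime,rating,votes,url\n"
    "1,The Shawshank Redemption,1994,2h 22m,9.3,3.2M,"
    "https://www.imdb.com/title/tt0111161/\n"
)

rows = list(csv.DictReader(sample))
first = rows[0]

# DictReader yields strings, so numeric fields need explicit casts.
print(first["title"])          # The Shawshank Redemption
print(int(first["rank"]))      # 1
print(float(first["rating"]))  # 9.3
```

Remember that CSV flattens every value to a string; the JSON export below preserves the original int and float types.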
Step 6: Export to JSON
import json
def write_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
write_json(movies, "imdb_top_250.json")
print("wrote imdb_top_250.json")
This is useful when you want to feed the dataset into notebooks, dashboards, or downstream APIs.
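Loading the file back is a single json.load call. This sketch uses an in-memory string in place of the file so it runs standalone; with the real file you would use open("imdb_top_250.json", encoding="utf-8"):

```python
import io
import json

payload = io.StringIO(
    '[{"rank": 1, "title": "The Shawshank Redemption", "rating": 9.3}]'
)
rows = json.load(payload)

# Unlike CSV, JSON preserves the numeric types written by write_json().
print(rows[0]["title"], type(rows[0]["rank"]), type(rows[0]["rating"]))
```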
Full script
Here is the complete scraper in one file.
import csv
import json
import re
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://www.imdb.com"
CHART_URL = f"{BASE_URL}/chart/top/"
HEADERS = {"User-Agent": "Mozilla/5.0"}
TIMEOUT = (10, 30)
RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")
def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()
    return response.text

def clean_texts(container):
    return [t.strip() for t in container.stripped_strings if t.strip()]
def parse_movie_rows(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    movies = []
    seen = set()
    for link in soup.select('a[href^="/title/tt"]'):
        href = link.get("href")
        title = link.get_text(" ", strip=True)
        if not href or not title or href in seen:
            continue
        seen.add(href)
        # Walk up the tree until the container holds enough text
        # fragments to include rank, year, runtime, rating, and votes.
        container = link
        for _ in range(6):
            if container and len(list(container.stripped_strings)) >= 6:
                break
            container = container.parent
        if not container:
            continue
        texts = clean_texts(container)
        joined = " | ".join(texts)
        rank_match = RANK_RE.search(joined)
        year_match = YEAR_RE.search(joined)
        runtime_match = RUNTIME_RE.search(joined)
        rating_match = RATING_RE.search(joined)
        votes_match = VOTES_RE.search(joined)
        movies.append({
            "rank": int(rank_match.group(1)) if rank_match else None,
            "title": title,
            "year": int(year_match.group(1)) if year_match else None,
            "runtime": runtime_match.group(0) if runtime_match else None,
            "rating": float(rating_match.group(1)) if rating_match else None,
            "votes": votes_match.group(1) if votes_match else None,
            "url": urljoin(BASE_URL, href),
        })
    movies.sort(key=lambda row: (row["rank"] is None, row["rank"] or 9999))
    return movies
def write_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["rank", "title", "year", "runtime", "rating", "votes", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def write_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
if __name__ == "__main__":
    html = fetch_html(CHART_URL)
    movies = parse_movie_rows(html)
    print(f"parsed {len(movies)} movies")
    print(movies[:3])
    write_csv(movies, "imdb_top_250.csv")
    write_json(movies, "imdb_top_250.json")
    print("done")
How to use ProxiesAPI for the fetch step
If direct requests become inconsistent, keep the parser exactly the same and only change the fetch URL.
The ProxiesAPI format is:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.imdb.com/chart/top/"
And in Python:
from urllib.parse import quote_plus
import requests
def fetch_via_proxiesapi(target_url: str, api_key: str) -> str:
    api_url = f"http://api.proxiesapi.com/?key={api_key}&url={quote_plus(target_url)}"
    response = requests.get(api_url, timeout=(10, 30))
    response.raise_for_status()
    return response.text
html = fetch_via_proxiesapi("https://www.imdb.com/chart/top/", "API_KEY")
movies = parse_movie_rows(html)
print(movies[:3])
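Because both fetchers return plain HTML, you can wrap them in one dispatcher that tries a direct request first and falls back only on failure. This is a sketch of that pattern; the fallback policy and function names are our own, not part of the ProxiesAPI docs:

```python
from typing import Callable

import requests

def fetch_with_fallback(target_url: str, fallback: Callable[[str], str]) -> str:
    """Try a direct request first; use the fallback fetcher on failure."""
    try:
        response = requests.get(
            target_url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=(10, 30),
        )
        response.raise_for_status()
        if response.text.strip():  # empty 202 bodies also count as failure
            return response.text
    except requests.RequestException:
        pass  # network or HTTP error: fall through to the fallback fetcher
    return fallback(target_url)

# Usage sketch:
# html = fetch_with_fallback(CHART_URL, lambda u: fetch_via_proxiesapi(u, "API_KEY"))
```

Passing the fallback as a callable keeps the dispatcher free of any proxy-specific code, so the parser and exporters never need to know which path produced the HTML.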
That is the clean separation you want in any production scraper:
- one function for fetching
- one parser for extracting
- one exporter for saving
Practical QA checks
Before you trust the dataset, validate it.
Check that:
- row count is close to 250
- top rows include The Shawshank Redemption, The Godfather, and The Dark Knight
- rank values are integers from 1 upward
- ratings are populated for top rows
- URLs point to /title/tt.../
A quick validation helper:
assert len(movies) >= 200, "too few rows parsed"
assert movies[0]["title"] == "The Shawshank Redemption"
assert movies[0]["rank"] == 1
assert movies[0]["rating"] >= 9.0
These assertions catch parser drift early.
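If you prefer a report over a hard crash, the same checks can collect problems instead of asserting. A sketch, with our own function name and thresholds:

```python
def validate_rows(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the dataset looks sane."""
    problems = []
    if len(rows) < 200:
        problems.append(f"too few rows parsed: {len(rows)}")
    # Ranks should already be ascending after the sort in parse_movie_rows().
    ranks = [r["rank"] for r in rows if r.get("rank") is not None]
    if ranks and ranks != sorted(ranks):
        problems.append("ranks are not in ascending order")
    missing_rating = sum(1 for r in rows[:10] if r.get("rating") is None)
    if missing_rating:
        problems.append(f"{missing_rating} of the top 10 rows have no rating")
    bad_urls = [r["url"] for r in rows if "/title/tt" not in r.get("url", "")]
    if bad_urls:
        problems.append(f"{len(bad_urls)} rows have non-title URLs")
    return problems
```

Returning a list of problems makes the check easy to run in CI: log everything, then fail the job only if the list is non-empty.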
Final takeaway
The interesting part of this scraper is not the CSV export.
It is the workflow discipline:
- verify the real page first
- avoid invented selectors
- keep fetch and parse logic separate
- validate the output before you trust it
That is what turns a one-off tutorial into code you can actually build on.
And when direct requests stop being predictable, you do not need to rewrite the scraper. You just swap the fetch layer to a ProxiesAPI URL and keep the parser intact.