How to Scrape IMDb Top 250 with Python (Without Guessing Selectors)

IMDb looks simple at first glance: open the Top 250 page, grab a few selectors, export a CSV, done.

In practice, it is more nuanced. A plain requests.get() from this environment returned HTTP 202 with an empty body, while the browser-rendered page exposed the actual movie list. That makes IMDb a great tutorial target because it shows an important scraping lesson: the first request that works in your terminal is not always the same thing your browser sees.

In this guide, I’ll show you how to inspect the real page, identify the correct selectors from the rendered DOM, extract rank, title, year, runtime, and rating, and export the results cleanly.

IMDb Top 250 page

Prerequisites

  • Python 3.8+
  • requests
  • beautifulsoup4

Install them with:

pip install requests beautifulsoup4

Step 1: Fetch the page and see what actually happens

Let’s start with the most basic possible request.

import requests

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
print(resp.status_code)
print(len(resp.text))
print(resp.text[:200])

What happened in my test

Actual output from this environment:

status: 202
html_len: 0
title_links_found: 0

That tells us something critical:

  • the request did not fail with 403 or 429
  • but it also did not give us the useful page HTML
  • so any selector we “invent” from a guessed DOM would be fake

That is exactly why the playbook says: inspect the real page before writing the tutorial.
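Before parsing anything, it is worth encoding that observation as an explicit guard. Here is a minimal sketch (the function name is mine, not from any library) that treats a 2xx-with-empty-body response as a failed fetch rather than a page with zero movies:

```python
def looks_like_blocked_response(status_code: int, body: str) -> bool:
    """Return True when the fetch should not be trusted:
    anything other than a 200, or a 200 whose body is blank."""
    return status_code != 200 or not body.strip()

# The exact situation from the test above: HTTP 202, empty body
print(looks_like_blocked_response(202, ""))
```

With a check like this in place, your scraper fails loudly at the fetch step instead of silently producing an empty CSV.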

Step 2: Inspect the rendered IMDb page in the browser

Using the browser-rendered page, we can see the actual content structure for each movie row.

From the rendered DOM and accessibility tree, the first entries expose these real fields:

  • #1 · The Shawshank Redemption · 1994 · 2h 22m · IMDb rating: 9.3
  • #2 · The Godfather · 1972 · 2h 55m · IMDb rating: 9.2
  • #3 · The Dark Knight · 2008 · 2h 32m · IMDb rating: 9.1

That means the real repeating data pattern on the rendered page is not the old legacy td.titleColumn table layout many older blog posts still mention.

Instead, IMDb now renders a denser card/list structure where each movie item contains:

  • a rank like #1
  • a title link like The Shawshank Redemption
  • adjacent metadata like 1994, 2h 22m, and content rating
  • a rating block like IMDb rating: 9.3

Step 3: The selectors we can actually verify

From the rendered page data, these are the stable content anchors worth targeting:

  • title links: a[href^="/title/tt"]
  • rating containers: nodes containing text like IMDb rating: 9.3
  • rank text: nearby text nodes beginning with #

A practical scraping strategy is:

  1. find each movie title link
  2. move to its nearest container/card
  3. extract nearby text for rank, year, runtime, and rating

This is more robust than guessing one giant brittle selector.
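That three-step strategy can be demonstrated end to end on a simplified HTML snippet shaped like the rendered list. To be clear, this markup is illustrative, not IMDb's real markup; the point is the link-then-container walk:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the rendered IMDb list structure
html = """
<ul>
  <li class="item"><span>#1</span>
      <a href="/title/tt0111161/">The Shawshank Redemption</a>
      <span>1994</span><span>2h 22m</span>
      <span>IMDb rating: 9.3</span></li>
  <li class="item"><span>#2</span>
      <a href="/title/tt0068646/">The Godfather</a>
      <span>1972</span><span>2h 55m</span>
      <span>IMDb rating: 9.2</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for link in soup.select('a[href^="/title/tt"]'):   # step 1: title link
    card = link.find_parent("li")                  # step 2: nearest container
    text = card.get_text(" ", strip=True)          # step 3: nearby text
    rows.append({"title": link.get_text(strip=True), "card_text": text})

print(rows[0])
```

The same pattern carries over to the real page: anchor on the title link, then read the surrounding card text.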

Step 4: Extract the movie title links

Let’s build the scraper progressively.

import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

links = soup.select('a[href^="/title/tt"]')
for a in links[:10]:
    print(a.get_text(" ", strip=True), a.get("href"))

Why this step matters

We are isolating the most important field first: the movie title link.

On many sites, if you cannot reliably extract the primary entity first, everything else becomes messy. For IMDb, the title link is the anchor that lets you locate the rest of the metadata.
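One practical wrinkle: the same /title/tt… ID often sits behind more than one link (a poster link and a text link, for example), so it pays to de-duplicate by title ID. A small sketch, using a helper of my own:

```python
import re

def dedupe_title_links(pairs):
    """Keep the first (link_text, href) pair per IMDb title ID."""
    seen = set()
    out = []
    for text, href in pairs:
        match = re.search(r"/title/(tt\d+)", href)
        if not match or match.group(1) in seen:
            continue
        seen.add(match.group(1))
        out.append((text, href))
    return out
```

Keying on the tt… ID rather than the raw href also collapses variants that differ only in tracking query strings like ?ref_=.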

Step 5: Add rank, year, runtime, and rating

Because IMDb’s response can differ between raw requests and browser-rendered HTML, the exact extraction approach may need a browser tool for production reliability. But the data model we verified from the rendered page is:

movie = {
    "rank": "#1",
    "title": "The Shawshank Redemption",
    "year": "1994",
    "runtime": "2h 22m",
    "rating": "9.3",
}

Sample verified records from the live rendered page:

movies = [
    {"rank": "#1", "title": "The Shawshank Redemption", "year": "1994", "runtime": "2h 22m", "rating": "9.3"},
    {"rank": "#2", "title": "The Godfather", "year": "1972", "runtime": "2h 55m", "rating": "9.2"},
    {"rank": "#3", "title": "The Dark Knight", "year": "2008", "runtime": "2h 32m", "rating": "9.1"},
    {"rank": "#4", "title": "The Godfather Part II", "year": "1974", "runtime": "3h 22m", "rating": "9.0"},
    {"rank": "#5", "title": "12 Angry Men", "year": "1957", "runtime": "1h 36m", "rating": "9.0"},
]
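Turning the flattened card text into those fields can be sketched with a few regular expressions. This helper is my own (parse_card_text is not an IMDb or library API), based only on the text patterns verified above:

```python
import re

def parse_card_text(text: str) -> dict:
    """Parse a flattened card string like
    '#1 The Shawshank Redemption 1994 2h 22m IMDb rating: 9.3'
    into structured fields. Missing fields come back as None."""
    rank = re.search(r"#\d+", text)
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    runtime = re.search(r"\b\d+h(?:\s*\d+m)?\b|\b\d+m\b", text)
    rating = re.search(r"IMDb rating:\s*([\d.]+)", text)
    return {
        "rank": rank.group(0) if rank else None,
        "year": year.group(0) if year else None,
        "runtime": runtime.group(0) if runtime else None,
        "rating": rating.group(1) if rating else None,
    }
```

Because each field has its own pattern, a layout tweak that breaks one field (say, runtime) still leaves the others intact.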

Step 6: Export to CSV

Once you have structured rows, exporting is straightforward.

import csv

with open("imdb_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["rank", "title", "year", "runtime", "rating"])
    writer.writeheader()
    writer.writerows(movies)

print("Saved imdb_top250.csv")
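As a quick sanity check, you can round-trip the file with csv.DictReader. This self-contained sketch writes two of the verified rows and reads them straight back:

```python
import csv

movies = [
    {"rank": "#1", "title": "The Shawshank Redemption", "year": "1994",
     "runtime": "2h 22m", "rating": "9.3"},
    {"rank": "#2", "title": "The Godfather", "year": "1972",
     "runtime": "2h 55m", "rating": "9.2"},
]
fields = ["rank", "title", "year", "runtime", "rating"]

with open("imdb_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(movies)

# Read it back and confirm the rows survived the round trip
with open("imdb_top250.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows))
```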

Step 7: Full script pattern

This script shows the overall workflow, while keeping the extraction step explicit.

import csv
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers, timeout=30)
print("status:", resp.status_code)
print("html_len:", len(resp.text))

soup = BeautifulSoup(resp.text, "html.parser")
links = soup.select('a[href^="/title/tt"]')

movies = []
seen = set()
for a in links:
    title = a.get_text(" ", strip=True)
    href = a.get("href")
    if not title or not href:
        continue

    # The same title ID can appear behind several links
    # (poster link, text link), so dedupe by the tt... ID
    match = re.search(r"/title/(tt\d+)", href)
    if not match or match.group(1) in seen:
        continue
    seen.add(match.group(1))

    movies.append({
        "title": title,
        "href": "https://www.imdb.com" + href,
    })

with open("imdb_titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "href"])
    writer.writeheader()
    writer.writerows(movies)

Handling the real gotcha: browser-rendered content

This is the most important lesson from this article.

A lot of scraping tutorials online quietly pretend every page behaves like a static HTML page. IMDb did not behave that way in this test environment:

  • requests.get() returned a non-useful response
  • the browser page clearly showed the full Top 250 list
  • the rendered DOM exposed all the movie data we needed

That means if you want this scraper to be reliable in production, you should be ready to:

  • use a browser-based fetch when needed
  • wait for rendered content
  • extract from the post-render DOM
  • validate that the rows you expect are actually present before you continue

A good validation check is:

if len(movies) < 100:
    raise ValueError("IMDb data did not load correctly — too few rows extracted")
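The fallback decision itself can live in one small dispatcher. In this sketch the two fetchers are injected as plain callables (the names plain_fetch and browser_fetch are mine), so any browser tool can slot in later without changing the logic:

```python
def fetch_html(url, plain_fetch, browser_fetch, min_len=10_000):
    """Try a plain HTTP fetch first; fall back to a browser-based
    fetch when the response looks empty or blocked. A real Top 250
    page is far larger than min_len, so a tiny body means trouble."""
    html = plain_fetch(url)
    if html and len(html) >= min_len:
        return html
    return browser_fetch(url)
```

In production, plain_fetch would wrap requests.get and browser_fetch would wrap whatever rendering tool you choose (Playwright, Selenium, or a rendering API).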

Selector summary

Data point → verified anchor:

  • Movie title: a[href^="/title/tt"]
  • Rank: nearby text like #1
  • Year: nearby metadata text like 1994
  • Runtime: nearby metadata text like 2h 22m
  • Rating: nearby text like IMDb rating: 9.3

What you can build with this data

Once you can extract IMDb Top 250 cleanly, you can build:

  • a movie ranking tracker
  • a weekly rating-delta monitor
  • a “top films by decade” dataset
  • a watchlist enrichment pipeline
  • a content recommendation app

Scaling this with proxies and retries

If you only need to scrape a single page once in a while, simple requests are fine.

But if you’re scraping movie pages at scale, or need browser-rendered content across thousands of URLs, you’ll eventually need retry logic, IP rotation, and fallback strategies.

import requests
from urllib.parse import quote

def fetch_with_proxy(url):
    # URL-encode the target so its own query string survives the hop
    proxy_url = f"http://api.proxiesapi.com/?key=YOUR_API_KEY&url={quote(url, safe='')}"
    response = requests.get(proxy_url, timeout=60)
    response.raise_for_status()
    return response.text
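You can layer retry logic on top of whichever fetcher you use. Here is a sketch of an exponential-backoff loop; it has no jitter and no per-status handling, so treat it as a starting point rather than production code:

```python
import time
import requests

def fetch_with_retries(url, attempts=3, backoff=2.0):
    """Retry a GET with exponential backoff, treating empty
    bodies and non-200 statuses as failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code == 200 and resp.text.strip():
                return resp.text
            last_error = RuntimeError(f"bad response: {resp.status_code}")
        except requests.RequestException as exc:
            last_error = exc
        # back off before the next attempt: 2s, 4s, 8s, ...
        time.sleep(backoff * (2 ** attempt))
    raise last_error
```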

If you're building a scraping project that needs to scale beyond a few hundred pages, check out Proxies API — we handle proxy rotation, browser fingerprinting, CAPTCHAs, and automatic retries so you can focus on the data extraction logic. Start with 1,000 free API calls.

