How to Scrape IMDb Top 250 with Python (Without Guessing Selectors)
IMDb looks simple at first glance: open the Top 250 page, grab a few selectors, export a CSV, done.
In practice, it is more nuanced. A plain `requests.get()` from this environment returned HTTP 202 with an empty body, while the browser-rendered page exposed the actual movie list. That makes IMDb a great tutorial target because it teaches an important scraping lesson: a request that "succeeds" in your terminal does not always return what your browser sees.
In this guide, I’ll show you how to inspect the real page, identify the correct selectors from the rendered DOM, extract rank, title, year, runtime, and rating, and export the results cleanly.
Prerequisites
- Python 3.8+
- requests
- beautifulsoup4
Install them with:
```
pip install requests beautifulsoup4
```
Step 1: Fetch the page and see what actually happens
Let’s start with the most basic possible request.
```python
import requests

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
print(resp.status_code)
print(resp.text[:200])
print(len(resp.text))
What happened in my test
Actual output from this environment:
```
status: 202
html_len: 0
title_links_found: 0
```
That tells us something critical:
- the request did not fail with 403 or 429
- but it also did not give us the useful page HTML
- so any selector we “invent” from a guessed DOM would be fake
That is exactly why the golden rule of scraping applies here: inspect the real rendered page before writing any selectors.
Step 2: Inspect the rendered IMDb page in the browser
Using the browser-rendered page, we can see the actual content structure for each movie row.
From the rendered DOM and accessibility tree, the first entries expose these real fields:
```
#1  The Shawshank Redemption  1994  2h 22m  IMDb rating: 9.3
#2  The Godfather             1972  2h 55m  IMDb rating: 9.2
#3  The Dark Knight           2008  2h 32m  IMDb rating: 9.1
```
That means the real repeating data pattern on the rendered page is not the old legacy td.titleColumn table layout many older blog posts still mention.
Instead, IMDb now renders a denser card/list structure where each movie item contains:
- a rank like `#1`
- a title link like `The Shawshank Redemption`
- adjacent metadata like `1994`, `2h 22m`, and a content rating
- a rating block like `IMDb rating: 9.3`
Step 3: The selectors we can actually verify
From the rendered page data, these are the stable content anchors worth targeting:
- title links: `a[href^="/title/tt"]`
- rating containers: nodes containing text like `IMDb rating: 9.3`
- rank text: nearby text nodes beginning with `#`
A practical scraping strategy is:
- find each movie title link
- move to its nearest container/card
- extract nearby text for rank, year, runtime, and rating
This is more robust than guessing one giant brittle selector.
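A minimal sketch of that strategy follows. The HTML snippet here is a simplified, hypothetical card layout for illustration — it is not IMDb's exact markup, so adapt the container lookup to what you see in your own DevTools.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical card markup for illustration only.
html = """
<li class="card">
  <a href="/title/tt0111161/">1. The Shawshank Redemption</a>
  <span>1994</span><span>2h 22m</span>
  <span>IMDb rating: 9.3</span>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the title link.
link = soup.select_one('a[href^="/title/tt"]')

# Step 2: move to its nearest container/card.
card = link.find_parent("li")

# Step 3: extract nearby text for the remaining metadata.
texts = [s.get_text(strip=True) for s in card.find_all("span")]
print(link.get_text(strip=True), texts)
```

The key idea is that the title anchor is the stable landmark; everything else is found relative to it instead of through a long, brittle selector chain.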
Step 4: Start with just the movie links
Let’s build the scraper progressively.
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

links = soup.select('a[href^="/title/tt"]')
for a in links[:10]:
    print(a.get_text(" ", strip=True), a.get("href"))
```
Why this step matters
We are isolating the most important field first: the movie title link.
On many sites, if you cannot reliably extract the primary entity first, everything else becomes messy. For IMDb, the title link is the anchor that lets you locate the rest of the metadata.
Step 5: Add rank, year, runtime, and rating
Because IMDb’s response can differ between raw requests and browser-rendered HTML, the exact extraction approach may need a browser tool for production reliability. But the data model we verified from the rendered page is:
```python
movie = {
    "rank": "#1",
    "title": "The Shawshank Redemption",
    "year": "1994",
    "runtime": "2h 22m",
    "rating": "9.3",
}
```
Sample verified records from the live rendered page:
```python
movies = [
    {"rank": "#1", "title": "The Shawshank Redemption", "year": "1994", "runtime": "2h 22m", "rating": "9.3"},
    {"rank": "#2", "title": "The Godfather", "year": "1972", "runtime": "2h 55m", "rating": "9.2"},
    {"rank": "#3", "title": "The Dark Knight", "year": "2008", "runtime": "2h 32m", "rating": "9.1"},
    {"rank": "#4", "title": "The Godfather Part II", "year": "1974", "runtime": "3h 22m", "rating": "9.0"},
    {"rank": "#5", "title": "12 Angry Men", "year": "1957", "runtime": "1h 36m", "rating": "9.0"},
]
```
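The raw strings still need light cleanup before they match that model — the rank carries a `#` prefix and the rating is embedded in a label. A small stdlib-only sketch of that normalization (the helper names are my own, not from any library):

```python
import re

def parse_rating(text):
    """Pull the numeric score out of a string like 'IMDb rating: 9.3'."""
    m = re.search(r"(\d+(?:\.\d+)?)", text)
    return m.group(1) if m else None

def parse_rank(text):
    """Strip the leading '#' from a rank string like '#1'."""
    return text.lstrip("#")

print(parse_rating("IMDb rating: 9.3"))  # -> 9.3
print(parse_rank("#1"))                  # -> 1
```

Keeping these as tiny pure functions makes them trivial to unit-test when IMDb inevitably tweaks its label text.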
Step 6: Export to CSV
Once you have structured rows, exporting is straightforward.
```python
import csv

with open("imdb_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["rank", "title", "year", "runtime", "rating"])
    writer.writeheader()
    writer.writerows(movies)

print("Saved imdb_top250.csv")
```
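It is worth round-tripping the file once to confirm the header and encoding survived. This self-contained check writes a single sample row and reads it back with `csv.DictReader`:

```python
import csv

movies = [
    {"rank": "#1", "title": "The Shawshank Redemption", "year": "1994",
     "runtime": "2h 22m", "rating": "9.3"},
]
fields = ["rank", "title", "year", "runtime", "rating"]

with open("imdb_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(movies)

# Read the file back to confirm the rows survived intact.
with open("imdb_top250.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]["title"])  # -> 1 The Shawshank Redemption
```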
Step 7: Full script pattern
This script shows the overall workflow, while keeping the extraction step explicit.
```python
import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
print("status:", resp.status_code)
print("html_len:", len(resp.text))

soup = BeautifulSoup(resp.text, "html.parser")
links = soup.select('a[href^="/title/tt"]')

movies = []
for a in links:
    title = a.get_text(" ", strip=True)
    href = a.get("href")
    if not title or not href:
        continue
    movies.append({
        "title": title,
        "href": "https://www.imdb.com" + href,
    })

with open("imdb_titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "href"])
    writer.writeheader()
    writer.writerows(movies)
```
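One wrinkle worth handling: chart pages often carry more than one anchor per film (a poster link plus a title link), so the same href can appear twice. A stdlib-only sketch (the helper name is my own) that dedupes by the tt-ID:

```python
import re

def dedupe_by_title_id(hrefs):
    """Keep only the first occurrence of each IMDb title ID (tt0111161, ...)."""
    seen, unique = set(), []
    for href in hrefs:
        m = re.search(r"/title/(tt\d+)", href)
        if not m or m.group(1) in seen:
            continue
        seen.add(m.group(1))
        unique.append(href)
    return unique

hrefs = ["/title/tt0111161/", "/title/tt0111161/?ref_=chart", "/title/tt0068646/"]
print(dedupe_by_title_id(hrefs))
# -> ['/title/tt0111161/', '/title/tt0068646/']
```

Deduping on the ID rather than the full href matters because IMDb appends varying tracking query strings to otherwise identical links.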
Handling the real gotcha: browser-rendered content
This is the most important lesson from this article.
A lot of scraping tutorials online quietly assume every page is static HTML. IMDb did not behave that way in this test environment:

- `requests.get()` returned a non-useful response
- the browser page clearly showed the full Top 250 list
- the rendered DOM exposed all the movie data we needed
That means if you want this scraper to be reliable in production, you should be ready to:
- use a browser-based fetch when needed
- wait for rendered content
- extract from the post-render DOM
- validate that the rows you expect are actually present before you continue
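A simple heuristic for deciding when to fall back to a browser is to treat any non-200 status or implausibly small body as a failed plain fetch. The 1,000-byte threshold below is an assumption; tune it for your target page:

```python
def needs_browser(status_code, html, min_length=1000):
    """Flag responses that look like the empty HTTP 202 we saw earlier:
    anything non-200, or a body far too short to hold a chart page."""
    return status_code != 200 or len(html) < min_length

print(needs_browser(202, ""))          # -> True
print(needs_browser(200, "x" * 5000))  # -> False
```

Keeping the decision in a pure function like this lets you swap in a headless-browser fetch behind a single `if` without restructuring the scraper.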
A good validation check is:
```python
if len(movies) < 100:
    raise ValueError("IMDb data did not load correctly: too few rows extracted")
```
Selector summary
| Data point | Verified anchor |
|---|---|
| Movie title | `a[href^="/title/tt"]` |
| Rank | nearby text like `#1` |
| Year | nearby metadata text like `1994` |
| Runtime | nearby metadata text like `2h 22m` |
| Rating | nearby text like `IMDb rating: 9.3` |
What you can build with this data
Once you can extract IMDb Top 250 cleanly, you can build:
- a movie ranking tracker
- a weekly rating-delta monitor
- a “top films by decade” dataset
- a watchlist enrichment pipeline
- a content recommendation app
Scaling this with proxies and retries
If you only need to scrape a single page once in a while, simple requests are fine.
But if you’re scraping movie pages at scale, or need browser-rendered content across thousands of URLs, you’ll eventually need retry logic, IP rotation, and fallback strategies.
```python
import requests

def fetch_with_proxy(url):
    proxy_url = f"http://api.proxiesapi.com/?key=YOUR_API_KEY&url={url}"
    response = requests.get(proxy_url)
    return response.text
```
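Retry logic is worth sketching too. This wrapper retries with exponential backoff; the flaky fetcher below is a stand-in for a real request so the behavior is easy to see without hitting the network:

```python
import time

def fetch_with_retries(fetch, retries=3, base_delay=1.0):
    """Call fetch() up to `retries` times, doubling the delay after each
    failure, and re-raise the last error if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Stand-in fetcher that fails once, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

print(fetch_with_retries(flaky_fetch, base_delay=0.01))  # -> <html>ok</html>
```

In production you would pass a closure over your real `requests.get()` or proxy call and narrow the `except` clause to the errors you actually expect (timeouts, connection resets, 429s).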
If you're building a scraping project that needs to scale beyond a few hundred pages, check out Proxies API — we handle proxy rotation, browser fingerprinting, CAPTCHAs, and automatic retries so you can focus on the data extraction logic. Start with 1,000 free API calls.