How to Scrape IMDb Top 250 with Python (Without Guessing Selectors)
IMDb looks simple at first glance: open the Top 250 page, grab a few selectors, export a CSV, done.
In practice, it is more nuanced. A plain `requests.get()` from this environment returned HTTP 202 with an empty body, while the browser-rendered page exposed the actual movie list. That makes IMDb a great tutorial target because it teaches an important scraping lesson: a request that "succeeds" in your terminal does not always return what your browser sees.
In this guide, I’ll show you how to inspect the real page, identify the correct selectors from the rendered DOM, extract rank, title, year, runtime, and rating, and export the results cleanly.
Prerequisites
- Python 3.8+
- requests
- beautifulsoup4
Install them with:
```
pip install requests beautifulsoup4
```
Step 1: Fetch the page and see what actually happens
Let’s start with the most basic possible request.
```python
import requests

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
print(resp.status_code)
print(resp.text[:200])
print(len(resp.text))
What happened in my test
Actual output from this environment:
```
status: 202
html_len: 0
title_links_found: 0
```
That tells us something critical:
- the request did not fail with 403 or 429
- but it also did not give us the useful page HTML
- so any selector we “invent” from a guessed DOM would be fake
That is exactly why the golden rule of scraping applies here: inspect the real rendered page before writing any selectors.
Step 2: Inspect the rendered IMDb page in the browser
Using the browser-rendered page, we can see the actual content structure for each movie row.
From the rendered DOM and accessibility tree, the first entries expose these real fields:
```
#1  The Shawshank Redemption  1994  2h 22m  IMDb rating: 9.3
#2  The Godfather             1972  2h 55m  IMDb rating: 9.2
#3  The Dark Knight           2008  2h 32m  IMDb rating: 9.1
```
That means the real repeating data pattern on the rendered page is not the old legacy td.titleColumn table layout many older blog posts still mention.
Instead, IMDb now renders a denser card/list structure where each movie item contains:
- a rank like `#1`
- a title link like `The Shawshank Redemption`
- adjacent metadata like `1994`, `2h 22m`, and a content rating
- a rating block like `IMDb rating: 9.3`
Step 3: The selectors we can actually verify
From the rendered page data, these are the stable content anchors worth targeting:
- title links: `a[href^="/title/tt"]`
- rating containers: nodes containing text like `IMDb rating: 9.3`
- rank text: nearby text nodes beginning with `#`
A practical scraping strategy is:
- find each movie title link
- move to its nearest container/card
- extract nearby text for rank, year, runtime, and rating
This is more robust than guessing one giant brittle selector.
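A minimal sketch of that strategy follows. The HTML snippet here is a simplified, hypothetical card layout for illustration — it is not IMDb's exact markup, so adapt the container lookup to what you see in your own DevTools.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical card markup for illustration only.
html = """
<li class="card">
  <a href="/title/tt0111161/">1. The Shawshank Redemption</a>
  <span>1994</span><span>2h 22m</span>
  <span>IMDb rating: 9.3</span>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the title link.
link = soup.select_one('a[href^="/title/tt"]')

# Step 2: move to its nearest container/card.
card = link.find_parent("li")

# Step 3: extract nearby text for the remaining metadata.
texts = [s.get_text(strip=True) for s in card.find_all("span")]
print(link.get_text(strip=True), texts)
```

The key idea is that the title anchor is the stable landmark; everything else is found relative to it instead of through a long, brittle selector chain.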
Step 4: Start with just the movie links
Let’s build the scraper progressively.
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

links = soup.select('a[href^="/title/tt"]')
for a in links[:10]:
    print(a.get_text(" ", strip=True), a.get("href"))
```
Why this step matters
We are isolating the most important field first: the movie title link.
On many sites, if you cannot reliably extract the primary entity first, everything else becomes messy. For IMDb, the title link is the anchor that lets you locate the rest of the metadata.
Step 5: Add rank, year, runtime, and rating
Because IMDb’s response can differ between raw requests and browser-rendered HTML, the exact extraction approach may need a browser tool for production reliability. But the data model we verified from the rendered page is:
```python
movie = {
    "rank": "#1",
    "title": "The Shawshank Redemption",
    "year": "1994",
    "runtime": "2h 22m",
    "rating": "9.3",
}
```
Sample verified records from the live rendered page:
```python
movies = [
    {"rank": "#1", "title": "The Shawshank Redemption", "year": "1994", "runtime": "2h 22m", "rating": "9.3"},
    {"rank": "#2", "title": "The Godfather", "year": "1972", "runtime": "2h 55m", "rating": "9.2"},
    {"rank": "#3", "title": "The Dark Knight", "year": "2008", "runtime": "2h 32m", "rating": "9.1"},
    {"rank": "#4", "title": "The Godfather Part II", "year": "1974", "runtime": "3h 22m", "rating": "9.0"},
    {"rank": "#5", "title": "12 Angry Men", "year": "1957", "runtime": "1h 36m", "rating": "9.0"},
]
```
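The raw strings still need light cleanup before they match that model — the rank carries a `#` prefix and the rating is embedded in a label. A small stdlib-only sketch of that normalization (the helper names are my own, not from any library):

```python
import re

def parse_rating(text):
    """Pull the numeric score out of a string like 'IMDb rating: 9.3'."""
    m = re.search(r"(\d+(?:\.\d+)?)", text)
    return m.group(1) if m else None

def parse_rank(text):
    """Strip the leading '#' from a rank string like '#1'."""
    return text.lstrip("#")

print(parse_rating("IMDb rating: 9.3"))  # -> 9.3
print(parse_rank("#1"))                  # -> 1
```

Keeping these as tiny pure functions makes them trivial to unit-test when IMDb inevitably tweaks its label text.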
Step 6: Export to CSV
Once you have structured rows, exporting is straightforward.
```python
import csv

with open("imdb_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["rank", "title", "year", "runtime", "rating"])
    writer.writeheader()
    writer.writerows(movies)

print("Saved imdb_top250.csv")
```
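It is worth round-tripping the file once to confirm the header and encoding survived. This self-contained check writes a single sample row and reads it back with `csv.DictReader`:

```python
import csv

movies = [
    {"rank": "#1", "title": "The Shawshank Redemption", "year": "1994",
     "runtime": "2h 22m", "rating": "9.3"},
]
fields = ["rank", "title", "year", "runtime", "rating"]

with open("imdb_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(movies)

# Read the file back to confirm the rows survived intact.
with open("imdb_top250.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]["title"])  # -> 1 The Shawshank Redemption
```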
Step 7: Full script pattern
This script shows the overall workflow, while keeping the extraction step explicit.
```python
import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
print("status:", resp.status_code)
print("html_len:", len(resp.text))

soup = BeautifulSoup(resp.text, "html.parser")
links = soup.select('a[href^="/title/tt"]')

movies = []
for a in links:
    title = a.get_text(" ", strip=True)
    href = a.get("href")
    if not title or not href:
        continue
    movies.append({
        "title": title,
        "href": "https://www.imdb.com" + href,
    })

with open("imdb_titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "href"])
    writer.writeheader()
    writer.writerows(movies)
```
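One wrinkle worth handling: chart pages often carry more than one anchor per film (a poster link plus a title link), so the same href can appear twice. A stdlib-only sketch (the helper name is my own) that dedupes by the tt-ID:

```python
import re

def dedupe_by_title_id(hrefs):
    """Keep only the first occurrence of each IMDb title ID (tt0111161, ...)."""
    seen, unique = set(), []
    for href in hrefs:
        m = re.search(r"/title/(tt\d+)", href)
        if not m or m.group(1) in seen:
            continue
        seen.add(m.group(1))
        unique.append(href)
    return unique

hrefs = ["/title/tt0111161/", "/title/tt0111161/?ref_=chart", "/title/tt0068646/"]
print(dedupe_by_title_id(hrefs))
# -> ['/title/tt0111161/', '/title/tt0068646/']
```

Deduping on the ID rather than the full href matters because IMDb appends varying tracking query strings to otherwise identical links.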
Handling the real gotcha: browser-rendered content
This is the most important lesson from this article.
A lot of scraping tutorials online quietly assume every page is static HTML. IMDb did not behave that way in this test environment:

- `requests.get()` returned a non-useful response
- the browser page clearly showed the full Top 250 list
- the rendered DOM exposed all the movie data we needed
That means if you want this scraper to be reliable in production, you should be ready to:
- use a browser-based fetch when needed
- wait for rendered content
- extract from the post-render DOM
- validate that the rows you expect are actually present before you continue
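A simple heuristic for deciding when to fall back to a browser is to treat any non-200 status or implausibly small body as a failed plain fetch. The 1,000-byte threshold below is an assumption; tune it for your target page:

```python
def needs_browser(status_code, html, min_length=1000):
    """Flag responses that look like the empty HTTP 202 we saw earlier:
    anything non-200, or a body far too short to hold a chart page."""
    return status_code != 200 or len(html) < min_length

print(needs_browser(202, ""))          # -> True
print(needs_browser(200, "x" * 5000))  # -> False
```

Keeping the decision in a pure function like this lets you swap in a headless-browser fetch behind a single `if` without restructuring the scraper.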
A good validation check is:
```python
if len(movies) < 100:
    raise ValueError("IMDb data did not load correctly: too few rows extracted")
```
Selector summary
| Data point | Verified anchor |
|---|---|
| Movie title | `a[href^="/title/tt"]` |
| Rank | nearby text like `#1` |
| Year | nearby metadata text like `1994` |
| Runtime | nearby metadata text like `2h 22m` |
| Rating | nearby text like `IMDb rating: 9.3` |
What you can build with this data
Once you can extract IMDb Top 250 cleanly, you can build:
- a movie ranking tracker
- a weekly rating-delta monitor
- a “top films by decade” dataset
- a watchlist enrichment pipeline
- a content recommendation app
Scaling this with proxies and retries
If you only need to scrape a single page once in a while, simple requests are fine.
But if you’re scraping movie pages at scale, or need browser-rendered content across thousands of URLs, you’ll eventually need retry logic, IP rotation, and fallback strategies.
```python
import requests

def fetch_with_proxy(url):
    proxy_url = f"http://api.proxiesapi.com/?key=YOUR_API_KEY&url={url}"
    response = requests.get(proxy_url)
    return response.text
```
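Retry logic is worth sketching too. This wrapper retries with exponential backoff; the flaky fetcher below is a stand-in for a real request so the behavior is easy to see without hitting the network:

```python
import time

def fetch_with_retries(fetch, retries=3, base_delay=1.0):
    """Call fetch() up to `retries` times, doubling the delay after each
    failure, and re-raise the last error if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Stand-in fetcher that fails once, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

print(fetch_with_retries(flaky_fetch, base_delay=0.01))  # -> <html>ok</html>
```

In production you would pass a closure over your real `requests.get()` or proxy call and narrow the `except` clause to the errors you actually expect (timeouts, connection resets, 429s).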
If you're building a scraping project that needs to scale beyond a few hundred pages, check out Proxies API — we handle proxy rotation, browser fingerprinting, CAPTCHAs, and automatic retries so you can focus on the data extraction logic. Start with 1,000 free API calls.