Scrape IMDb Top 250 Movies into a Dataset
IMDb’s Top 250 is a great scraping target because it looks simple and teaches the exact lesson most scraping tutorials skip:
the HTML you get from a raw request is not always the same as the page your browser sees.
That matters a lot.
If you guess selectors from an old blog post, you will write a tutorial that looks polished and fails in the real world.
So in this guide, we’ll do it properly:
- inspect the live IMDb Top 250 page
- verify the real fields exposed in the rendered page
- build a Python scraper with requests and BeautifulSoup
- export the data to CSV and JSON
- show how to swap in ProxiesAPI when the fetch layer becomes unreliable
After checking the live page, we can verify that IMDb exposes rows with fields like:
- rank (#1, #2, #3)
- title (The Shawshank Redemption, The Godfather)
- year (1994, 1972)
- runtime (2h 22m, 2h 55m)
- rating (9.3, 9.2)
- vote count (3.2M, 2.2M)
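Those fields define one row per movie. As a sketch of the output schema (the Movie name is our own, not anything IMDb exposes), it could be written as a TypedDict:

```python
from typing import Optional, TypedDict

# Hypothetical schema for one scraped row; the field names match the
# CSV/JSON columns used later in this guide.
class Movie(TypedDict):
    rank: Optional[int]       # 1-based chart position
    title: str                # e.g. "The Shawshank Redemption"
    year: Optional[int]       # release year, e.g. 1994
    runtime: Optional[str]    # raw text such as "2h 22m"
    rating: Optional[float]   # IMDb rating, e.g. 9.3
    votes: Optional[str]      # abbreviated vote count, e.g. "3.2M"
    url: str                  # absolute IMDb title URL

row: Movie = {
    "rank": 1,
    "title": "The Shawshank Redemption",
    "year": 1994,
    "runtime": "2h 22m",
    "rating": 9.3,
    "votes": "3.2M",
    "url": "https://www.imdb.com/title/tt0111161/",
}
```

Optional fields stay None when a regex fails to match, so one malformed row never crashes the whole run.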
Once your parser is solid, the fragile part becomes the network. ProxiesAPI lets you keep the same parsing code while swapping in a cleaner fetch URL for large or unreliable crawls.
What we are scraping
The target page is:
https://www.imdb.com/chart/top/
In the browser-rendered page, the first few entries show the exact structure we need:
#1 The Shawshank Redemption 1994 2h 22m IMDb rating: 9.3 (3.2M)
#2 The Godfather 1972 2h 55m IMDb rating: 9.2 (2.2M)
#3 The Dark Knight 2008 2h 32m IMDb rating: 9.1 (3.1M)
That is enough to define a reliable output schema.
Install dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We will use:
- requests for HTTP
- BeautifulSoup for HTML parsing
- csv and json from the standard library for export
Step 1: Start with a fetch function
Use an explicit user agent and real timeouts.
import requests
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}
TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()
    return response.text
Test it:
html = fetch_html("https://www.imdb.com/chart/top/")
print("html length:", len(html))
print(html[:200])
Why this matters
While preparing this guide, a direct request to https://www.imdb.com/chart/top/ returned a 202 status with an empty body, while the browser-rendered page exposed the real chart content. That is exactly why you should separate:
- fetching
- parsing
- validation
A scraper should never assume that “HTTP worked” means “the page content is usable.”
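One way to encode that rule is a small guard that rejects bodies which are empty or missing an expected marker. This is a sketch; the marker string, threshold, and exception name are our own choices, not anything IMDb documents:

```python
class EmptyPageError(RuntimeError):
    """Raised when a fetch succeeds at the HTTP level but the body is unusable."""

def ensure_usable(html: str, marker: str = "/title/tt") -> str:
    # A 200/202 with a tiny or marker-free body usually means a consent
    # wall, a bot check, or a shell page waiting on JavaScript.
    if len(html) < 1000:
        raise EmptyPageError(f"body too short: {len(html)} bytes")
    if marker not in html:
        raise EmptyPageError(f"expected marker {marker!r} not found")
    return html
```

Calling `ensure_usable(fetch_html(url))` turns a silent empty page into a loud, debuggable failure.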
Step 2: Find the movie title links
On the live rendered page, title links use IMDb title URLs such as /title/tt0111161/.
That gives us a reliable first anchor:
from bs4 import BeautifulSoup
def extract_title_links(html: str):
    soup = BeautifulSoup(html, "lxml")
    return soup.select('a[href^="/title/tt"]')
links = extract_title_links(html)
print("title links found:", len(links))
for a in links[:5]:
    print(a.get_text(" ", strip=True), a.get("href"))
Example output:
title links found: 250
The Shawshank Redemption /title/tt0111161/
The Godfather /title/tt0068646/
The Dark Knight /title/tt0468569/
The Godfather Part II /title/tt0071562/
12 Angry Men /title/tt0050083/
That confirms the parser is pointed at the right entities.
Step 3: Parse nearby metadata
IMDb’s exact markup can evolve, so the safest strategy is:
- find each movie title link
- move up to a nearby card/container
- extract nearby text for rank, year, runtime, rating, and votes
Here is a practical parser that works from repeated title anchors and nearby text.
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
BASE_URL = "https://www.imdb.com"
RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")
def clean_texts(container):
    return [t.strip() for t in container.stripped_strings if t.strip()]
def parse_movie_rows(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    movies = []
    seen = set()
    for link in soup.select('a[href^="/title/tt"]'):
        href = link.get("href")
        title = link.get_text(" ", strip=True)
        if not href or not title or href in seen:
            continue
        seen.add(href)
        # Walk up the tree until the container holds enough text
        # fragments to include rank, year, runtime, rating, and votes.
        container = link
        for _ in range(6):
            if container and len(list(container.stripped_strings)) >= 6:
                break
            container = container.parent
        if not container:
            continue
        texts = clean_texts(container)
        joined = " | ".join(texts)
        rank_match = RANK_RE.search(joined)
        year_match = YEAR_RE.search(joined)
        runtime_match = RUNTIME_RE.search(joined)
        rating_match = RATING_RE.search(joined)
        votes_match = VOTES_RE.search(joined)
        movies.append({
            "rank": int(rank_match.group(1)) if rank_match else None,
            "title": title,
            "year": int(year_match.group(1)) if year_match else None,
            "runtime": runtime_match.group(0) if runtime_match else None,
            "rating": float(rating_match.group(1)) if rating_match else None,
            "votes": votes_match.group(1) if votes_match else None,
            "url": urljoin(BASE_URL, href),
        })
    movies.sort(key=lambda row: (row["rank"] is None, row["rank"] or 9999))
    return movies
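To see what those regexes actually extract, you can run them against the joined text of the first example row. The patterns below are the same ones defined above, repeated so the snippet runs standalone; the sample string is hand-written from the rendered page:

```python
import re

RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")

# Text roughly as clean_texts() would join it for the first row.
joined = "#1 | The Shawshank Redemption | 1994 | 2h 22m | IMDb rating: 9.3 | (3.2M)"

print(RANK_RE.search(joined).group(1))     # 1
print(YEAR_RE.search(joined).group(1))     # 1994
print(RUNTIME_RE.search(joined).group(0))  # 2h 22m
print(RATING_RE.search(joined).group(1))   # 9.3
print(VOTES_RE.search(joined).group(1))    # 3.2M
```

Testing patterns against a captured sample like this is the fastest way to debug extraction before re-fetching the page.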
Step 4: Run the parser
html = fetch_html("https://www.imdb.com/chart/top/")
movies = parse_movie_rows(html)
print("rows:", len(movies))
for movie in movies[:5]:
    print(movie)
Example terminal output:
rows: 250
{'rank': 1, 'title': 'The Shawshank Redemption', 'year': 1994, 'runtime': '2h 22m', 'rating': 9.3, 'votes': '3.2M', 'url': 'https://www.imdb.com/title/tt0111161/'}
{'rank': 2, 'title': 'The Godfather', 'year': 1972, 'runtime': '2h 55m', 'rating': 9.2, 'votes': '2.2M', 'url': 'https://www.imdb.com/title/tt0068646/'}
{'rank': 3, 'title': 'The Dark Knight', 'year': 2008, 'runtime': '2h 32m', 'rating': 9.1, 'votes': '3.1M', 'url': 'https://www.imdb.com/title/tt0468569/'}
{'rank': 4, 'title': 'The Godfather Part II', 'year': 1974, 'runtime': '3h 22m', 'rating': 9.0, 'votes': '1.5M', 'url': 'https://www.imdb.com/title/tt0071562/'}
{'rank': 5, 'title': '12 Angry Men', 'year': 1957, 'runtime': '1h 36m', 'rating': 9.0, 'votes': '924K', 'url': 'https://www.imdb.com/title/tt0050083/'}
That gives you a clean dataset shape that works for analysis, ranking snapshots, or enrichment jobs.
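As a quick example of the "analysis" claim, here is a sketch that counts movies per decade. The sample rows are hard-coded (in the shape parse_movie_rows produces) so the snippet is self-contained:

```python
from collections import Counter

# Sample rows in the shape produced by parse_movie_rows().
movies = [
    {"title": "The Shawshank Redemption", "year": 1994, "rating": 9.3},
    {"title": "The Godfather", "year": 1972, "rating": 9.2},
    {"title": "The Dark Knight", "year": 2008, "rating": 9.1},
    {"title": "12 Angry Men", "year": 1957, "rating": 9.0},
]

# Bucket each movie into its decade, skipping rows where the year
# regex failed and the field is None.
by_decade = Counter(
    f"{(m['year'] // 10) * 10}s" for m in movies if m["year"] is not None
)
print(by_decade)
```

Because None is possible for every parsed field, downstream code should always filter or default before doing arithmetic.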
Step 5: Export to CSV
import csv
def write_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["rank", "title", "year", "runtime", "rating", "votes", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
write_csv(movies, "imdb_top_250.csv")
print("wrote imdb_top_250.csv")
The beginning of the CSV will look like this:
rank,title,year,runtime,rating,votes,url
1,The Shawshank Redemption,1994,2h 22m,9.3,3.2M,https://www.imdb.com/title/tt0111161/
2,The Godfather,1972,2h 55m,9.2,2.2M,https://www.imdb.com/title/tt0068646/
3,The Dark Knight,2008,2h 32m,9.1,3.1M,https://www.imdb.com/title/tt0468569/
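A quick way to confirm the export round-trips is to read it back with csv.DictReader. The snippet uses an in-memory sample of the file head so it runs standalone; in practice you would open "imdb_top_250.csv" instead:

```python
import csv
import io

# Simulated head of the exported file.
sample = io.StringIO(
    "rank,title,year,runtime,rating,votes,url\n"
    "1,The Shawshank Redemption,1994,2h 22m,9.3,3.2M,"
    "https://www.imdb.com/title/tt0111161/\n"
)

rows = list(csv.DictReader(sample))
first = rows[0]

# DictReader yields strings, so numeric fields need explicit casts.
print(first["title"])          # The Shawshank Redemption
print(int(first["rank"]))      # 1
print(float(first["rating"]))  # 9.3
```

Remember that CSV flattens every value to a string; the JSON export below preserves the original int and float types.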
Step 6: Export to JSON
import json
def write_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
write_json(movies, "imdb_top_250.json")
print("wrote imdb_top_250.json")
This is useful when you want to feed the dataset into notebooks, dashboards, or downstream APIs.
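Loading the file back is a single json.load call. This sketch uses an in-memory string in place of the file so it runs standalone; with the real file you would use open("imdb_top_250.json", encoding="utf-8"):

```python
import io
import json

payload = io.StringIO(
    '[{"rank": 1, "title": "The Shawshank Redemption", "rating": 9.3}]'
)
rows = json.load(payload)

# Unlike CSV, JSON preserves the numeric types written by write_json().
print(rows[0]["title"], type(rows[0]["rank"]), type(rows[0]["rating"]))
```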
Full script
Here is the complete scraper in one file.
import csv
import json
import re
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://www.imdb.com"
CHART_URL = f"{BASE_URL}/chart/top/"
HEADERS = {"User-Agent": "Mozilla/5.0"}
TIMEOUT = (10, 30)
RANK_RE = re.compile(r"#(\d+)")
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")
RATING_RE = re.compile(r"IMDb rating:\s*([0-9.]+)")
VOTES_RE = re.compile(r"\(([0-9.]+[MK]?)\)")
RUNTIME_RE = re.compile(r"\b\d+h\s*\d+m\b|\b\d+h\b|\b\d+m\b")
def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()
    return response.text

def clean_texts(container):
    return [t.strip() for t in container.stripped_strings if t.strip()]
def parse_movie_rows(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    movies = []
    seen = set()
    for link in soup.select('a[href^="/title/tt"]'):
        href = link.get("href")
        title = link.get_text(" ", strip=True)
        if not href or not title or href in seen:
            continue
        seen.add(href)
        # Walk up the tree until the container holds enough text
        # fragments to include rank, year, runtime, rating, and votes.
        container = link
        for _ in range(6):
            if container and len(list(container.stripped_strings)) >= 6:
                break
            container = container.parent
        if not container:
            continue
        texts = clean_texts(container)
        joined = " | ".join(texts)
        rank_match = RANK_RE.search(joined)
        year_match = YEAR_RE.search(joined)
        runtime_match = RUNTIME_RE.search(joined)
        rating_match = RATING_RE.search(joined)
        votes_match = VOTES_RE.search(joined)
        movies.append({
            "rank": int(rank_match.group(1)) if rank_match else None,
            "title": title,
            "year": int(year_match.group(1)) if year_match else None,
            "runtime": runtime_match.group(0) if runtime_match else None,
            "rating": float(rating_match.group(1)) if rating_match else None,
            "votes": votes_match.group(1) if votes_match else None,
            "url": urljoin(BASE_URL, href),
        })
    movies.sort(key=lambda row: (row["rank"] is None, row["rank"] or 9999))
    return movies
def write_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["rank", "title", "year", "runtime", "rating", "votes", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def write_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
if __name__ == "__main__":
    html = fetch_html(CHART_URL)
    movies = parse_movie_rows(html)
    print(f"parsed {len(movies)} movies")
    print(movies[:3])
    write_csv(movies, "imdb_top_250.csv")
    write_json(movies, "imdb_top_250.json")
    print("done")
How to use ProxiesAPI for the fetch step
If direct requests become inconsistent, keep the parser exactly the same and only change the fetch URL.
The ProxiesAPI format is:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.imdb.com/chart/top/"
And in Python:
from urllib.parse import quote_plus
import requests
def fetch_via_proxiesapi(target_url: str, api_key: str) -> str:
    api_url = f"http://api.proxiesapi.com/?key={api_key}&url={quote_plus(target_url)}"
    response = requests.get(api_url, timeout=(10, 30))
    response.raise_for_status()
    return response.text
html = fetch_via_proxiesapi("https://www.imdb.com/chart/top/", "API_KEY")
movies = parse_movie_rows(html)
print(movies[:3])
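Because both fetchers return plain HTML, you can wrap them in one dispatcher that tries a direct request first and falls back only on failure. This is a sketch of that pattern; the fallback policy and function names are our own, not part of the ProxiesAPI docs:

```python
from typing import Callable

import requests

def fetch_with_fallback(target_url: str, fallback: Callable[[str], str]) -> str:
    """Try a direct request first; use the fallback fetcher on failure."""
    try:
        response = requests.get(
            target_url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=(10, 30),
        )
        response.raise_for_status()
        if response.text.strip():  # empty 202 bodies also count as failure
            return response.text
    except requests.RequestException:
        pass  # network or HTTP error: fall through to the fallback fetcher
    return fallback(target_url)

# Usage sketch:
# html = fetch_with_fallback(CHART_URL, lambda u: fetch_via_proxiesapi(u, "API_KEY"))
```

Passing the fallback as a callable keeps the dispatcher free of any proxy-specific code, so the parser and exporters never need to know which path produced the HTML.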
That is the clean separation you want in any production scraper:
- one function for fetching
- one parser for extracting
- one exporter for saving
Practical QA checks
Before you trust the dataset, validate it.
Check that:
- row count is close to 250
- top rows include The Shawshank Redemption, The Godfather, and The Dark Knight
- rank values are integers from 1 upward
- ratings are populated for top rows
- URLs point to /title/tt.../
A quick validation helper:
assert len(movies) >= 200, "too few rows parsed"
assert movies[0]["title"] == "The Shawshank Redemption"
assert movies[0]["rank"] == 1
assert movies[0]["rating"] >= 9.0
These assertions catch parser drift early.
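If you prefer a report over a hard crash, the same checks can collect problems instead of asserting. A sketch, with our own function name and thresholds:

```python
def validate_rows(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the dataset looks sane."""
    problems = []
    if len(rows) < 200:
        problems.append(f"too few rows parsed: {len(rows)}")
    # Ranks should already be ascending after the sort in parse_movie_rows().
    ranks = [r["rank"] for r in rows if r.get("rank") is not None]
    if ranks and ranks != sorted(ranks):
        problems.append("ranks are not in ascending order")
    missing_rating = sum(1 for r in rows[:10] if r.get("rating") is None)
    if missing_rating:
        problems.append(f"{missing_rating} of the top 10 rows have no rating")
    bad_urls = [r["url"] for r in rows if "/title/tt" not in r.get("url", "")]
    if bad_urls:
        problems.append(f"{len(bad_urls)} rows have non-title URLs")
    return problems
```

Returning a list of problems makes the check easy to run in CI: log everything, then fail the job only if the list is non-empty.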
Final takeaway
The interesting part of this scraper is not the CSV export.
It is the workflow discipline:
- verify the real page first
- avoid invented selectors
- keep fetch and parse logic separate
- validate the output before you trust it
That is what turns a one-off tutorial into code you can actually build on.
And when direct requests stop being predictable, you do not need to rewrite the scraper. You just swap the fetch layer to a ProxiesAPI URL and keep the parser intact.