Scrape Netflix Catalogue Data with Python + ProxiesAPI (Titles, Genres, Availability)

Netflix is notoriously hard to scrape “like a normal website.”

  • much of the UI is app-like
  • markup varies by region and A/B experiments
  • content availability is country-dependent

So what can you reliably do?

You can build a repeatable catalogue snapshot by targeting stable, public-facing surfaces:

  • title “browse” / listing pages (when accessible)
  • title detail pages (when accessible)

…and writing your scraper to be defensive:

  • treat every field as optional
  • dedupe aggressively
  • keep the fetch layer stable (timeouts, retries, backoff)

In this guide we’ll implement an extractor that produces rows like:

{"title":"Stranger Things","url":"https://www.netflix.com/title/80057281","title_id":"80057281","maturity":"TV-14","genres":["Sci-Fi TV"],"availability_country":"US"}
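Since every field is treated as optional, it helps to normalize raw rows into that schema up front. A minimal sketch (`normalize_row` is a hypothetical helper; key names follow the example row above):

```python
def normalize_row(raw: dict) -> dict:
    """Fill every expected key with a safe default so downstream code never KeyErrors."""
    return {
        "title": raw.get("title"),
        "url": raw.get("url"),
        "title_id": raw.get("title_id"),
        "maturity": raw.get("maturity"),
        "genres": raw.get("genres") or [],
        "availability_country": raw.get("availability_country"),
    }

row = normalize_row({"title_id": "80057281", "title": "Stranger Things"})
# row["genres"] is [] and row["maturity"] is None, not a KeyError
```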

Netflix browse UI (we’ll extract title cards + title URLs)

Make catalogue crawls consistent with ProxiesAPI

Netflix pages are geo-sensitive and can throttle or vary by region/device. ProxiesAPI helps stabilize your fetches (location consistency, retries, rotation) so your catalogue extractor can run on a schedule.


A reality check (and what we’re not doing)

Netflix actively discourages scraping, and its pages may require login, JavaScript execution, and geo checks.

This tutorial does not promise:

  • full global catalogue coverage
  • perfect genre/maturity extraction for every title
  • bypassing paywalls or logged-in walls

Instead, it shows a pattern you can safely reuse:

  1. fetch a set of listing pages you can access
  2. extract title IDs + URLs (the stable identifiers)
  3. optionally enrich each title by visiting its detail page

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

ProxiesAPI-powered fetch() (single integration point)

import os
import time
import random
import requests

TIMEOUT = (15, 60)  # (connect, read) seconds
PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

session = requests.Session()


def fetch(url: str, *, country: str | None = None, max_retries: int = 5) -> str:
    """Fetch a URL with retries/backoff.

    ProxiesAPI is used as a network reliability layer.
    Replace parameter names with the exact ProxiesAPI interface you use.
    """
    last_err = None

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/122.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }

    # Fail fast on a missing key instead of retrying a guaranteed failure
    if not PROXIESAPI_KEY:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")

    for attempt in range(1, max_retries + 1):
        try:

            params = {
                "auth_key": PROXIESAPI_KEY,
                "url": url,
            }
            if country:
                params["country"] = country

            r = session.get(
                "https://api.proxiesapi.com",
                params=params,
                timeout=TIMEOUT,
                headers=headers,
            )
            r.raise_for_status()
            return r.text

        except Exception as e:
            last_err = e
            if attempt < max_retries:
                # Exponential backoff with jitter, capped at 45s; no sleep after the final attempt
                sleep_s = min(45, (2 ** (attempt - 1)) + random.random())
                time.sleep(sleep_s)

    raise RuntimeError(f"Failed to fetch after {max_retries} retries: {url}") from last_err
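The sleep computation above gives exponential backoff with jitter, capped at 45 seconds. A quick sanity check of the schedule (a standalone restatement of the same formula):

```python
import random

def backoff_delay(attempt: int, cap: float = 45.0) -> float:
    # Same formula as in fetch(): 2^(attempt-1) seconds plus up to 1s of jitter, capped
    return min(cap, (2 ** (attempt - 1)) + random.random())

# Attempts 1..5 wait roughly 1, 2, 4, 8, 16 seconds (plus jitter);
# by attempt 7 the cap kicks in and every delay is exactly 45.0
delays = [backoff_delay(a) for a in range(1, 8)]
```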

Pick target URLs (what to crawl)

Netflix URLs change and may redirect.

Common patterns you might see:

  • Browse entry: https://www.netflix.com/browse
  • Genre listing: https://www.netflix.com/browse/genre/<genre_id>
  • Title detail: https://www.netflix.com/title/<title_id>

For a catalogue snapshot, the most valuable output is:

  • title_id
  • title URL

Because you can always enrich later.
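Because `title_id` is the stable key, the detail URL can be reconstructed from it at any time (pattern taken from the list above):

```python
def title_url(title_id: str) -> str:
    # /title/<title_id> is the detail-page pattern shown above
    return f"https://www.netflix.com/title/{title_id}"

print(title_url("80057281"))  # → https://www.netflix.com/title/80057281
```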

We’ll crawl a list of “seed” pages and extract any /title/<id> links.


Step 1: Extract /title/ links from the HTML

Even when Netflix uses dynamic rendering, title links frequently appear as plain anchors somewhere in the HTML.

We’ll parse:

  • all a[href*="/title/"]
  • normalize to absolute URL
  • extract numeric ID

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.netflix.com"


def extract_title_id(href: str) -> str | None:
    m = re.search(r"/title/(\d+)", href)
    return m.group(1) if m else None


def parse_titles_from_html(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []
    for a in soup.select('a[href*="/title/"]'):
        href = a.get("href")
        if not href:
            continue

        title_id = extract_title_id(href)
        if not title_id:
            continue

        url = href if href.startswith("http") else urljoin(BASE, href)

        # The visible title text isn't always present, but try.
        text = a.get_text(" ", strip=True) or None

        out.append({
            "title_id": title_id,
            "url": url,
            "title": text,
        })

    # Dedupe by title_id
    seen = set()
    uniq = []
    for row in out:
        tid = row["title_id"]
        if tid in seen:
            continue
        seen.add(tid)
        uniq.append(row)

    return uniq

Step 2: Crawl multiple seed pages (defensive + dedupe)

You can build your seed list from:

  • a few genre pages you care about
  • curated lists (internal)
  • your own “watch” categories

def crawl_seeds(seed_urls: list[str], *, country: str = "US") -> list[dict]:
    all_titles = []
    seen = set()

    for i, url in enumerate(seed_urls, start=1):
        html = fetch(url, country=country)
        batch = parse_titles_from_html(html)

        added = 0
        for row in batch:
            tid = row["title_id"]
            if tid in seen:
                continue
            seen.add(tid)
            row["availability_country"] = country
            row["source_url"] = url
            all_titles.append(row)
            added += 1

        print(f"seed {i}/{len(seed_urls)} -> found {len(batch)} titles, added {added}, total {len(all_titles)}")

    return all_titles

Example seeds (replace with pages that are accessible for you):

SEEDS = [
    "https://www.netflix.com/browse",
    # "https://www.netflix.com/browse/genre/83",  # TV Shows (example)
    # "https://www.netflix.com/browse/genre/1365",  # Action & Adventure (example)
]

rows = crawl_seeds(SEEDS, country="US")
print("unique titles:", len(rows))

Step 3 (optional): Enrich a title detail page

If your fetches can access title pages, you can enrich each row.

We’ll extract a few fields when present:

  • maturity rating
  • genres
  • synopsis

Because Netflix uses dynamic scripts, these may not always be available in static HTML. The code below is best-effort and safe when fields are missing.


def enrich_title(row: dict, *, country: str = "US") -> dict:
    url = row["url"]
    html = fetch(url, country=country)
    soup = BeautifulSoup(html, "lxml")

    # These selectors may change; keep them optional.
    maturity = None
    synopsis = None
    genres = []

    # Meta description sometimes includes synopsis-like content
    meta_desc = soup.select_one('meta[name="description"]')
    if meta_desc and meta_desc.get("content"):
        synopsis = meta_desc.get("content").strip() or None

    # Some pages include maturity rating in aria-label or text
    rating_el = soup.find(attrs={"data-uia": re.compile(r"maturity-rating", re.I)})
    if rating_el:
        maturity = rating_el.get_text(" ", strip=True) or None

    # Genres: look for links containing /browse/genre/
    for a in soup.select('a[href*="/browse/genre/"]'):
        g = a.get_text(" ", strip=True)
        if g and g not in genres:
            genres.append(g)

    row = dict(row)
    row.update({
        "maturity": maturity,
        "synopsis": synopsis,
        "genres": genres,
    })
    return row

Export: JSON Lines (stream-friendly)

import json

def write_jsonl(path: str, rows: list[dict]):
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    seeds = ["https://www.netflix.com/browse"]
    base_rows = crawl_seeds(seeds, country="US")

    # Optional: enrich first N titles
    enriched = []
    for row in base_rows[:50]:
        try:
            enriched.append(enrich_title(row, country="US"))
        except Exception:
            enriched.append(row)

    write_jsonl("netflix_catalogue_us.jsonl", enriched)
    print("wrote", len(enriched))
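To sanity-check a snapshot file, a matching reader is handy (`read_jsonl` is a hypothetical helper mirroring `write_jsonl`):

```python
import json
import os
import tempfile

def read_jsonl(path: str) -> list[dict]:
    # Inverse of write_jsonl: one JSON object per non-empty line
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip a tiny snapshot through a temp file
rows = [{"title_id": "80057281", "title": "Stranger Things"}]
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    for r in rows:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

assert read_jsonl(path) == rows
os.unlink(path)
```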

QA checklist

  • You’re consistently using one country for a single dataset run
  • You’re deduping by title_id
  • Your crawler logs how many titles each seed produces
  • You’re handling missing fields (synopsis/genres/maturity) without crashing
  • You can re-run daily and diff results (new titles, removals)
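That last check, diffing daily runs, reduces to comparing title_id sets between two snapshots (a sketch; loading the snapshots is up to you):

```python
def diff_snapshots(old_ids: set[str], new_ids: set[str]) -> dict:
    # New titles appeared, removed titles disappeared since the last run
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
    }

yesterday = {"80057281", "81040344"}
today = {"80057281", "81435684"}
print(diff_snapshots(yesterday, today))
# → {'added': ['81435684'], 'removed': ['81040344']}
```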

Where ProxiesAPI fits (honestly)

Catalogue scraping is less about fancy parsing and more about reliability:

  • redirects and geo variance
  • occasional throttling
  • inconsistent responses across runs

ProxiesAPI helps by letting you keep a consistent location and improving success rates with retries/rotation, so your snapshots don’t randomly fail halfway through.

