Scrape Podcast Data from Apple Podcasts (Charts + Show/Episode Metadata) with Python + ProxiesAPI

Apple Podcasts is an underrated data source.

If you’re doing market research, competitive analysis, or building a discovery product, you often want:

  • what’s trending today (charts)
  • show-level metadata (title, publisher, categories)
  • a list of episodes (titles + publish dates)
  • stable identifiers so you can update your dataset incrementally

In this guide we’ll build a charts → shows → episodes pipeline in Python.

We’ll do it in a way that’s practical:

  • don’t assume one page template forever
  • use stable IDs (Apple uses numeric IDs like id123456789)
  • store “last checked” timestamps so updates are fast

And we’ll keep the network layer clean so you can optionally route requests through ProxiesAPI when you scale.

Apple Podcasts charts (we’ll scrape chart entries and follow show links)

Crawl charts and show pages reliably with ProxiesAPI

Charts pages and show pages can throttle when you scale. ProxiesAPI helps keep your crawl runs stable while you focus on parsing and data quality.


What we’re scraping

There are (at least) two useful surfaces:

  1. Charts pages (top shows / top episodes)

These pages contain a ranked list where each entry links to a show or episode.

  2. Show pages

These contain:

  • show title
  • publisher
  • description
  • categories
  • episode list (often paginated / infinite scroll)

We’ll build a pipeline that starts at a charts URL, collects show URLs + show IDs, then visits each show URL to extract metadata + episodes.


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML with timeouts + optional ProxiesAPI

Keep scraping code maintainable by having exactly one place where HTTP happens.

import os
import time
import requests

TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# Template: set this to your ProxiesAPI proxy URL if you want to route traffic.
# Example shape: http://USER:PASS@proxy.proxiesapi.com:PORT
PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")


def fetch_html(url: str, *, retries: int = 3, sleep_s: float = 1.0) -> str:
    proxies = None
    if PROXIESAPI_PROXY_URL:
        proxies = {"http": PROXIESAPI_PROXY_URL, "https": PROXIESAPI_PROXY_URL}

    last_err = None
    for attempt in range(1, retries + 1):
        try:
            r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT, proxies=proxies)
            r.raise_for_status()

            text = r.text
            lower = text.lower()
            # Heuristic soft-block check: some block pages come back as HTTP 200
            # with placeholder HTML. Retry those; on the final attempt, return as-is.
            if "access denied" in lower and attempt < retries:
                raise RuntimeError("soft-block html")

            return text
        except Exception as e:
            last_err = e
            if attempt < retries:
                time.sleep(sleep_s * attempt)
                continue
            raise

    raise last_err  # pragma: no cover
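
A quick smoke test for the fetcher (swap in a real show or chart URL; the one below only illustrates the shape of an Apple Podcasts link):

# Fetch one page and confirm we got real HTML back.
html = fetch_html("https://podcasts.apple.com/us/podcast/some-show/id123456789")
print(len(html), "characters of HTML")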

Step 2: Parse stable IDs from Apple Podcasts URLs

Apple show URLs often contain an id segment like:

  • .../id123456789

That numeric ID is gold because it’s stable for incremental updates.

import re


def extract_apple_id(url: str) -> str | None:
    m = re.search(r"\bid(\d+)\b", url or "")
    return m.group(1) if m else None
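
Two quick checks on the URL shapes you’ll encounter (illustrative inputs):

assert extract_apple_id("https://podcasts.apple.com/us/podcast/some-show/id123456789") == "123456789"
assert extract_apple_id("https://example.com/no-show-id-here") is None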

Step 3: Parse chart entries into ranked show links

Charts HTML changes by region and with Apple’s experiments.

So the robust approach is:

  • find links that look like Apple Podcasts show links
  • keep the list order as rank
  • de-dupe by show id

from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_chart_shows(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Many chart entries link to a show page; we look for anchors containing '/podcast/' or '/us/podcast/'
    candidates = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue

        href_abs = urljoin(base_url, href)

        if "/podcast/" not in href_abs:
            continue

        show_id = extract_apple_id(href_abs)
        text = a.get_text(" ", strip=True)
        if not show_id or not text:
            continue

        candidates.append({
            "show_id": show_id,
            "show_url": href_abs,
            "anchor_text": text,
        })

    # Preserve order (rank) and de-dupe by show_id
    seen = set()
    ranked = []
    for c in candidates:
        if c["show_id"] in seen:
            continue
        seen.add(c["show_id"])
        c["rank"] = len(ranked) + 1
        ranked.append(c)

    return ranked

Example charts URL

Apple has multiple chart surfaces. Apple’s charts support page is one place to discover them, but for scraping you’ll want the actual chart URL for the country and genre you care about.

Once you have it, run:

chart_url = "PASTE_YOUR_CHART_URL_HERE"
html = fetch_html(chart_url)
shows = parse_chart_shows(html, chart_url)
print("chart shows:", len(shows))
print(shows[:5])

Step 4: Scrape show metadata + episodes

Show pages can be heavy. We’ll extract:

  • title
  • publisher (when available)
  • description
  • episode list entries (title + date when present)

Show parser (best-effort)

from bs4 import BeautifulSoup


def parse_show_page(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    show_id = extract_apple_id(url)

    # Title often in h1
    h1 = soup.select_one("h1")
    title = h1.get_text(" ", strip=True) if h1 else None

    # Description often in meta tags or visible blocks
    meta_desc = soup.select_one('meta[name="description"]')
    description = meta_desc.get("content") if meta_desc else None

    # Publisher/author varies; try common patterns
    publisher = None
    for sel in ["[data-testid='publisher']", "[data-qa='publisher']", ".product-header__subtitle"]:
        el = soup.select_one(sel)
        if el:
            publisher = el.get_text(" ", strip=True)
            break

    # Episodes: look for list items with links and dates.
    episodes = []
    for item in soup.select("a[href]"):
        href = item.get("href")
        if not href:
            continue

        # Many episode links include '/podcast/' as well, but we only need text-rich entries.
        t = item.get_text(" ", strip=True)
        if not t or len(t) < 8:
            continue

        episodes.append({"title": t, "url": href})

    # De-dupe episode titles (page contains many nav links)
    seen_titles = set()
    clean_eps = []
    for ep in episodes:
        if ep["title"] in seen_titles:
            continue
        seen_titles.add(ep["title"])
        clean_eps.append(ep)

    return {
        "show_id": show_id,
        "show_url": url,
        "title": title,
        "publisher": publisher,
        "description": description,
        "episodes": clean_eps[:200],  # cap for safety; paginate if you need full history
    }
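
One gap in this best-effort parser: the episode entries carry no dates. If the pages you target render <time> elements with a datetime attribute (a common pattern, but verify against the HTML you actually receive), a sketch for pulling them looks like this:

def extract_episode_dates(soup: BeautifulSoup) -> list[str]:
    # Best-effort: collect machine-readable dates from <time datetime="..."> tags.
    # How these pair up with episode links depends on the page template, so
    # treat this as a starting point rather than a guaranteed mapping.
    dates = []
    for t in soup.select("time[datetime]"):
        dt = t.get("datetime")
        if dt:
            dates.append(dt)
    return dates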

Crawl shows from charts

import json


def crawl_chart(chart_url: str, max_shows: int = 50) -> list[dict]:
    chart_html = fetch_html(chart_url)
    shows = parse_chart_shows(chart_html, chart_url)

    out = []
    for s in shows[:max_shows]:
        show_html = fetch_html(s["show_url"])
        data = parse_show_page(show_html, s["show_url"])
        data["rank"] = s["rank"]
        out.append(data)
        print("fetched", data.get("show_id"), data.get("title"))

    return out


# Example:
# dataset = crawl_chart(chart_url, max_shows=25)
# with open('apple_podcasts_chart.json', 'w', encoding='utf-8') as f:
#     json.dump(dataset, f, ensure_ascii=False, indent=2)

Incremental updates (the part that saves you time)

Instead of re-scraping everything daily:

  • scrape charts daily (fast)
  • keep a store of show_id → last_checked
  • only refresh a show if:
    • it’s in the current chart, or
    • it hasn’t been checked in N days

A simple approach with a JSON “state” file:

import json
import time


def load_state(path="state.json"):
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"shows": {}}


def save_state(state, path="state.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False, indent=2)


def should_refresh(state, show_id: str, ttl_days: int = 7) -> bool:
    rec = state["shows"].get(show_id)
    if not rec:
        return True
    last = rec.get("last_checked", 0)
    return (time.time() - last) > (ttl_days * 86400)
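
Tying it together, here’s a sketch of a daily run that refreshes only stale shows (it reuses the functions defined above; the TTL and state shape are just the ones from this guide):

def incremental_run(chart_url: str, ttl_days: int = 7) -> None:
    state = load_state()

    chart_html = fetch_html(chart_url)
    shows = parse_chart_shows(chart_html, chart_url)

    for s in shows:
        if not should_refresh(state, s["show_id"], ttl_days=ttl_days):
            continue
        show_html = fetch_html(s["show_url"])
        data = parse_show_page(show_html, s["show_url"])
        # Record the refresh time so tomorrow's run can skip this show.
        state["shows"][s["show_id"]] = {
            "last_checked": time.time(),
            "title": data.get("title"),
        }
        save_state(state)  # save per show so an interrupted run can resume

Saving state after each show is deliberately conservative: if a run dies halfway, the next run picks up where it left off instead of re-fetching everything.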

Where ProxiesAPI fits (honestly)

Apple Podcasts pages can throttle when you scale across countries/genres or when you run daily crawls.

ProxiesAPI helps by giving you a proxy layer so:

  • repeated crawls are less likely to hit IP-based limits
  • retries are more effective
  • you can keep the rest of your code unchanged

It’s not magic: you still need good caching, conservative crawl rate, and robust parsing.
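
On crawl rate, the cheapest improvement is a randomized delay between show fetches. A minimal helper you could call inside crawl_chart’s loop (the intervals are arbitrary starting points, not tested thresholds):

import random
import time


def polite_sleep(base_s: float = 1.5, jitter_s: float = 1.0) -> None:
    # Sleep a randomized interval so requests don't land in a perfectly
    # regular rhythm; tune to your own tolerance for total crawl time.
    time.sleep(base_s + random.uniform(0.0, jitter_s))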


QA checklist

  • You can extract a list of show URLs from your chart URL
  • extract_apple_id() returns a numeric ID for shows
  • You can fetch show pages and get a non-empty title
  • Episode extraction returns at least some titles (then refine selectors for your target pages)
  • Your crawler logs progress and can resume

Related guides

  • Scrape Netflix Catalogue Data with Python + ProxiesAPI (Titles, Genres, Availability)
  • Scrape Pinterest Images and Pins (Search + Board URLs) with Python + ProxiesAPI
  • Scrape Stack Overflow Questions and Answers by Tag (Python + ProxiesAPI)
  • Scrape Patreon Creator Data with Python (Profiles, Tiers, Posts)