Scrape Podcast Charts & Episode Metadata from Apple Podcasts with Python (via ProxiesAPI)

Apple Podcasts is a great “real world” scraping target because it has:

  • public, crawlable chart pages
  • show pages with structured content
  • episode lists that can be paginated or truncated in the UI

In this guide we’ll build a Python scraper that:

  1. pulls a chart page (top podcasts)
  2. extracts show URLs + ids
  3. crawls each show page
  4. extracts episode metadata (title, publish date, duration, episode url)
  5. exports to JSON and CSV
  6. uses ProxiesAPI as the network layer when you scale the crawl

Apple Podcasts charts (we’ll scrape show rows and then episodes)

Scale podcast crawling with ProxiesAPI

Charts → shows → episodes is a classic crawl graph. ProxiesAPI helps keep those many small requests stable with a proxy-backed fetch URL and optional rendering when targets get picky.


What we’re scraping (URLs)

Apple Podcasts has multiple public surfaces. Two common ones:

  • Charts pages (by country / category)
  • Show pages (podcast details + episodes)

The exact chart URL format can evolve, so don’t hardcode country/category without checking in the browser first.

For scraping, the important pattern is:

  • chart page contains many show links
  • each show link leads to a show page that contains an episode list

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: A fetch layer with retries + ProxiesAPI

Same as any crawl: timeouts, retries, and a single switch to route requests through ProxiesAPI.

import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()


def fetch(url: str, *, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = f"http://api.proxiesapi.com/?key={quote(proxiesapi_key)}&url={quote(url, safe='')}"
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError(f"failed: {last}")
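For reference, the proxied URL the fetch layer builds looks like this (the key is a placeholder); the target URL must be fully percent-encoded so its own query string survives the wrapping:

```python
from urllib.parse import quote

key = "YOUR_KEY"  # placeholder, not a real key
url = "https://podcasts.apple.com/us/podcast/some-show/id123?i=456"

# safe='' encodes every character, including "/", "?", and "="
proxied = f"http://api.proxiesapi.com/?key={quote(key)}&url={quote(url, safe='')}"
print(proxied)
```

If the target URL were not encoded, its `?i=456` would be read as a parameter of the proxy endpoint instead of the target.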

Step 2: Scrape a chart page (extract show URLs)

Start by loading the chart page you care about in a browser, then “View Source” and look for repeated show links.

We’ll implement extraction as:

  • collect all <a href> links
  • keep only those that look like podcast show URLs
  • de-dupe

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

APPLE = "https://podcasts.apple.com"


def is_show_url(href: str) -> bool:
    # Common pattern: /<country>/podcast/<slug>/id<digits>
    return bool(re.search(r"/podcast/.*/id\d+", href))


def parse_chart(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    shows = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if href.startswith("/"):
            href = urljoin(APPLE, href)
        if not href.startswith("http"):
            continue
        if "podcasts.apple.com" not in href:
            continue
        if not is_show_url(href):
            continue

        title = a.get_text(" ", strip=True) or None
        shows.append({"show_url": href, "link_text": title, "chart_url": base_url})

    # De-dupe by show_url
    uniq = {}
    for s in shows:
        uniq[s["show_url"]] = s

    return list(uniq.values())
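Before pointing the parser at real HTML, a quick sanity check on the link filter (the slugs and ids here are made up, using the same heuristic as `is_show_url` above):

```python
import re


def is_show_url(href: str) -> bool:
    # Same pattern as above: /<country>/podcast/<slug>/id<digits>
    return bool(re.search(r"/podcast/.*/id\d+", href))


assert is_show_url("/us/podcast/the-daily/id1200361736")
assert not is_show_url("/us/charts")
assert not is_show_url("/us/genre/comedy/id1303")  # genre pages also carry ids
print("is_show_url checks passed")
```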

Tip: save a local HTML snapshot

chart_url = "https://podcasts.apple.com/us/charts"  # example; verify current structure
html = fetch(chart_url)
with open("apple_charts.html", "w", encoding="utf-8") as f:
    f.write(html)
print("saved apple_charts.html", len(html))

If /us/charts redirects somewhere else, keep the final URL you see in your browser and use that.


Step 3: Scrape a show page (podcast metadata)

On most Apple show pages you can extract:

  • show title
  • publisher
  • description
  • and then an episode list (often near the bottom)

def parse_show(html: str, show_url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Keep selectors simple and adaptable; Apple changes class names.
    title = None
    h1 = soup.select_one("h1")
    if h1:
        title = h1.get_text(" ", strip=True)

    # Publisher is often in a span/label-like element near the header.
    publisher = None
    pub = soup.find(string=re.compile(r"^by ", re.I))
    if pub:
        publisher = str(pub).strip()

    # Description: meta tag is a reliable fallback
    desc = None
    meta = soup.select_one('meta[name="description"]')
    if meta:
        desc = meta.get("content")

    return {
        "show_url": show_url,
        "title": title,
        "publisher_guess": publisher,
        "description": desc,
    }

Step 4: Extract episodes from a show page

Episode rows are often links that include something like /podcast/.../id... plus an episode slug, and they typically contain:

  • episode title
  • publish date
  • duration

Because markup changes, we’ll:

  • identify candidate links inside an episode list region
  • for each candidate, extract nearby text

def parse_episodes(html: str, show_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    episodes = []

    # Broad: find links that look like episode URLs.
    # Many episode links contain "/podcast/" and have an "?i=" query or an episode id.
    for a in soup.select('a[href]'):
        href = a.get("href")
        if not href:
            continue
        if href.startswith("/"):
            href = urljoin(APPLE, href)
        if "podcasts.apple.com" not in href:
            continue

        # Heuristic for episode links: the "?i=" query usually marks an episode.
        # Note: "/id" alone also matches plain show links, so expect some noise
        # here; tighten this against your saved HTML.
        if ("?i=" not in href) and ("/id" not in href):
            continue

        title = a.get_text(" ", strip=True) or None
        if not title or len(title) < 4:
            continue

        # Walk up to a container and grab nearby text (date/duration often live there)
        container = a
        for _ in range(5):
            if not container:
                break
            txt = container.get_text(" ", strip=True)
            if txt and len(txt) > 50:
                break
            container = container.parent
        c = container if container else a
        blob = c.get_text(" ", strip=True)

        # Very rough extraction; tighten once you inspect your saved HTML.
        date_guess = None
        m = re.search(r"\b(\w{3,9}\s+\d{1,2},\s+\d{4})\b", blob)
        if m:
            date_guess = m.group(1)

        duration_guess = None
        d = re.search(r"\b(\d+\s*(?:min|mins|minutes|hr|hrs|hours))\b", blob, re.I)
        if d:
            duration_guess = d.group(1)

        episodes.append({
            "show_url": show_url,
            "episode_url": href,
            "title": title,
            "publish_date_guess": date_guess,
            "duration_guess": duration_guess,
        })

    # De-dupe episode URLs
    uniq = {}
    for ep in episodes:
        uniq[ep["episode_url"]] = ep

    return list(uniq.values())

Important: pagination / “See All”

Some show pages display only a subset of episodes and require a “See All” flow.

Two practical options:

  1. find the “See All” URL and crawl it
  2. use a browser automation runner for this target

If you inspect the page source, you’ll often find a link that contains something like ?see-all= or a path to an episode list page.

When you find it, treat it like a second crawl step.


Step 5: Orchestrate the crawl

This crawl graph is:

  1. chart → 2. shows → 3. episodes

import json
import csv


def write_json(path: str, obj) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)


def write_csv(path: str, rows: list[dict], fields: list[str]) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


def crawl(chart_url: str, *, proxiesapi_key: str | None = None, limit_shows: int = 20) -> dict:
    chart_html = fetch(chart_url, proxiesapi_key=proxiesapi_key)
    shows = parse_chart(chart_html, base_url=chart_url)
    shows = shows[:limit_shows]

    show_records = []
    episode_records = []

    for i, s in enumerate(shows, start=1):
        show_url = s["show_url"]
        html = fetch(show_url, proxiesapi_key=proxiesapi_key)
        show_meta = parse_show(html, show_url)
        episodes = parse_episodes(html, show_url)

        show_records.append({**s, **show_meta, "episode_count_guess": len(episodes)})
        episode_records.extend(episodes)

        print(f"{i}/{len(shows)} shows: {show_meta.get('title')} episodes={len(episodes)}")
        time.sleep(0.8)

    return {"chart_url": chart_url, "shows": show_records, "episodes": episode_records}


if __name__ == "__main__":
    chart_url = "https://podcasts.apple.com/us/charts"  # verify in browser
    data = crawl(chart_url, proxiesapi_key=None, limit_shows=10)

    write_json("apple_podcasts.json", data)

    write_csv(
        "apple_podcasts_shows.csv",
        data["shows"],
        fields=["show_url", "title", "publisher_guess", "episode_count_guess"],
    )

    write_csv(
        "apple_podcasts_episodes.csv",
        data["episodes"],
        fields=["show_url", "episode_url", "title", "publish_date_guess", "duration_guess"],
    )

    print("wrote apple_podcasts.json + CSVs")

Where ProxiesAPI fits (honestly)

Apple Podcasts pages are usually accessible, but crawling can get flaky when you:

  • hit charts + dozens of shows + hundreds of episodes
  • run frequently
  • scrape multiple countries/categories

ProxiesAPI helps by giving you a consistent, proxy-backed fetch URL (same code, fewer networking surprises).


QA checklist

  • chart parser returns a non-empty list of shows
  • show parser extracts a sane title for 3–5 shows
  • episodes parser returns some episodes per show
  • exports load in pandas without errors
  • failures retry and don’t crash the whole run
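The export check can be done with the standard library alone (swap in pandas if you prefer); this sketch asserts that the headers written by `write_csv` round-trip and counts rows:

```python
import csv
import io


def check_csv(text: str, expected_fields: list[str]) -> int:
    # Returns the data row count; raises if headers don't match the export.
    reader = csv.DictReader(io.StringIO(text))
    assert reader.fieldnames == expected_fields, reader.fieldnames
    return sum(1 for _ in reader)


sample = "show_url,title\nhttps://example.com,Demo Show\n"
print(check_csv(sample, ["show_url", "title"]))  # → 1
```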

Next upgrades

  • discover and crawl the “See All Episodes” page for full history
  • parse durations into seconds and publish dates into ISO
  • store results in SQLite and do incremental updates
  • enrich with transcript availability / explicit flag (if present)
