Scrape Podcast Charts & Episode Metadata from Apple Podcasts with Python (via ProxiesAPI)
Apple Podcasts is a great “real world” scraping target because it has:
- public, crawlable chart pages
- show pages with structured content
- episode lists that can be paginated or truncated in the UI
In this guide we’ll build a Python scraper that:
- pulls a chart page (top podcasts)
- extracts show URLs + ids
- crawls each show page
- extracts episode metadata (title, publish date, duration, episode url)
- exports to JSON and CSV
- uses ProxiesAPI as the network layer when you scale the crawl

Charts → shows → episodes is a classic crawl graph. ProxiesAPI helps keep those many small requests stable with a proxy-backed fetch URL and optional rendering when targets get picky.
What we’re scraping (URLs)
Apple Podcasts has multiple public surfaces. Two common ones:
- Charts pages (by country / category)
- Show pages (podcast details + episodes)
The exact chart URL format can evolve, so don’t hardcode country/category without checking in the browser first.
For scraping, the important pattern is:
- chart page contains many show links
- each show link leads to a show page that contains an episode list
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
Step 1: A fetch layer with retries + ProxiesAPI
Same as any crawl: timeouts, retries, and a single switch to route requests through ProxiesAPI.
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()

def fetch(url: str, *, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = (
                    "http://api.proxiesapi.com/"
                    f"?key={quote(proxiesapi_key)}&url={quote(url, safe='')}"
                )
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            if attempt < retries:
                # Exponential backoff with jitter before the next attempt
                time.sleep((2 ** attempt) + random.random())
    raise RuntimeError(f"failed: {last}")
```
Step 2: Scrape a chart page (extract show URLs)
Start by loading the chart page you care about in a browser, then “View Source” and look for repeated show links.
We’ll implement extraction as:
- collect all `<a href>` links
- keep only those that look like podcast show URLs
- de-dupe by URL
```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

APPLE = "https://podcasts.apple.com"

def is_show_url(href: str) -> bool:
    # Common pattern: /<country>/podcast/<slug>/id<digits>
    return bool(re.search(r"/podcast/.*/id\d+", href))

def parse_chart(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    shows = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if href.startswith("/"):
            href = urljoin(APPLE, href)
        if not href.startswith("http"):
            continue
        if "podcasts.apple.com" not in href:
            continue
        if not is_show_url(href):
            continue
        title = a.get_text(" ", strip=True) or None
        shows.append({"show_url": href, "link_text": title, "chart_url": base_url})
    # De-dupe by show_url
    uniq = {}
    for s in shows:
        uniq[s["show_url"]] = s
    return list(uniq.values())
```
Tip: save a local HTML snapshot
```python
chart_url = "https://podcasts.apple.com/us/charts"  # example; verify current structure
html = fetch(chart_url)
with open("apple_charts.html", "w", encoding="utf-8") as f:
    f.write(html)
print("saved apple_charts.html", len(html))
```
If /us/charts redirects somewhere else, keep the final URL you see in your browser and use that.
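One way to capture that programmatically, sticking with requests: let it follow the redirect chain and report the URL it actually landed on. A minimal sketch (the charts URL is the same unverified example as above):

```python
import requests

def final_url(url: str, session=None) -> str:
    """Follow any redirects and return the URL requests actually landed on."""
    s = session or requests.Session()
    r = s.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
        allow_redirects=True,  # requests' default, spelled out for clarity
    )
    r.raise_for_status()
    return r.url  # final URL after the redirect chain

# Usage (hits the network):
#   final_url("https://podcasts.apple.com/us/charts")
# then pin whatever it returns as your chart_url.
```

`r.history` on the same response holds the intermediate redirect responses if you want to log the full chain.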
Step 3: Scrape a show page (podcast metadata)
On most Apple show pages you can extract:
- show title
- publisher
- description
- and then an episode list (often near the bottom)
```python
def parse_show(html: str, show_url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    # Keep selectors simple and adaptable; Apple changes class names.
    title = None
    h1 = soup.select_one("h1")
    if h1:
        title = h1.get_text(" ", strip=True)
    # Publisher is often in a span/label-like element near the header.
    publisher = None
    pub = soup.find(string=re.compile(r"^by ", re.I))
    if pub:
        publisher = str(pub).strip()
    # Description: the meta tag is a reliable fallback.
    desc = None
    meta = soup.select_one('meta[name="description"]')
    if meta:
        desc = meta.get("content")
    return {
        "show_url": show_url,
        "title": title,
        "publisher_guess": publisher,
        "description": desc,
    }
```
Step 4: Extract episodes from a show page
Episode rows are often links that include something like /podcast/.../id... plus an episode slug, and they typically contain:
- episode title
- publish date
- duration
Because markup changes, we’ll:
- identify candidate links inside an episode list region
- for each candidate, extract nearby text
```python
def parse_episodes(html: str, show_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    episodes = []
    # Broad: find links that look like episode URLs.
    # Many episode links contain "/podcast/" and have an "?i=" query or an episode id.
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if href.startswith("/"):
            href = urljoin(APPLE, href)
        if "podcasts.apple.com" not in href:
            continue
        # Heuristic for episode links:
        if ("?i=" not in href) and ("/id" not in href):
            continue
        title = a.get_text(" ", strip=True) or None
        if not title or len(title) < 4:
            continue
        # Walk up to a container and grab nearby text (date/duration often live there)
        container = a
        for _ in range(5):
            if not container:
                break
            txt = container.get_text(" ", strip=True)
            if txt and len(txt) > 50:
                break
            container = container.parent
        c = container if container else a
        blob = c.get_text(" ", strip=True)
        # Very rough extraction; tighten once you inspect your saved HTML.
        date_guess = None
        m = re.search(r"\b(\w{3,9}\s+\d{1,2},\s+\d{4})\b", blob)
        if m:
            date_guess = m.group(1)
        duration_guess = None
        d = re.search(r"\b(\d+\s*(?:min|mins|minutes|hr|hrs|hours))\b", blob, re.I)
        if d:
            duration_guess = d.group(1)
        episodes.append({
            "show_url": show_url,
            "episode_url": href,
            "title": title,
            "publish_date_guess": date_guess,
            "duration_guess": duration_guess,
        })
    # De-dupe episode URLs
    uniq = {}
    for ep in episodes:
        uniq[ep["episode_url"]] = ep
    return list(uniq.values())
```
Important: pagination / “See All”
Some show pages display only a subset of episodes and require a “See All” flow.
Two practical options:
- find the “See All” URL and crawl it
- use a browser automation runner for this target
If you inspect the page source, you’ll often find a link that contains something like ?see-all= or a path to an episode list page.
When you find it, treat it like a second crawl step.
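As a starting point for the first option, here's a sketch that scans a saved show page for candidate "See All" links. The `see-all` substring check is a guess based on inspecting page source; confirm it against your own snapshot before relying on it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

APPLE = "https://podcasts.apple.com"

def find_see_all_links(html: str) -> list[str]:
    """Collect candidate 'See All' / episode-list links from a show page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href", "")
        text = a.get_text(" ", strip=True).lower()
        # Match either the URL pattern or the visible label.
        if "see-all" in href or "see all" in text:
            links.append(urljoin(APPLE, href))
    return list(dict.fromkeys(links))  # keep order, drop duplicates
```

Feed each returned URL back through `fetch` + `parse_episodes` as a second crawl step.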
Step 5: Orchestrate the crawl
The crawl graph is: 1) chart → 2) shows → 3) episodes.
```python
import json
import csv

def write_json(path: str, obj) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)

def write_csv(path: str, rows: list[dict], fields: list[str]) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})

def crawl(chart_url: str, *, proxiesapi_key: str | None = None, limit_shows: int = 20) -> dict:
    chart_html = fetch(chart_url, proxiesapi_key=proxiesapi_key)
    shows = parse_chart(chart_html, base_url=chart_url)
    shows = shows[:limit_shows]
    show_records = []
    episode_records = []
    for i, s in enumerate(shows, start=1):
        show_url = s["show_url"]
        html = fetch(show_url, proxiesapi_key=proxiesapi_key)
        show_meta = parse_show(html, show_url)
        episodes = parse_episodes(html, show_url)
        show_records.append({**s, **show_meta, "episode_count_guess": len(episodes)})
        episode_records.extend(episodes)
        print(f"{i}/{len(shows)} shows: {show_meta.get('title')} episodes={len(episodes)}")
        time.sleep(0.8)  # be polite between show fetches
    return {"chart_url": chart_url, "shows": show_records, "episodes": episode_records}

if __name__ == "__main__":
    chart_url = "https://podcasts.apple.com/us/charts"  # verify in browser
    data = crawl(chart_url, proxiesapi_key=None, limit_shows=10)
    write_json("apple_podcasts.json", data)
    write_csv(
        "apple_podcasts_shows.csv",
        data["shows"],
        fields=["show_url", "title", "publisher_guess", "episode_count_guess"],
    )
    write_csv(
        "apple_podcasts_episodes.csv",
        data["episodes"],
        fields=["show_url", "episode_url", "title", "publish_date_guess", "duration_guess"],
    )
    print("wrote apple_podcasts.json + CSVs")
```
Where ProxiesAPI fits (honestly)
Apple Podcasts pages are usually accessible, but crawling can get flaky when you:
- hit charts + dozens of shows + hundreds of episodes
- run frequently
- scrape multiple countries/categories
ProxiesAPI helps by giving you a consistent, proxy-backed fetch URL (same code, fewer networking surprises).
QA checklist
- chart parser returns a non-empty list of shows
- show parser extracts a sane title for 3–5 shows
- episodes parser returns some episodes per show
- exports load in pandas without errors
- failures retry and don’t crash the whole run
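The checklist above can be scripted. A minimal sketch, assuming the CSV filenames and columns produced by the crawl script; the 0.5 title-coverage threshold is an arbitrary starting point.

```python
import pandas as pd

def qa_check(shows_csv: str, episodes_csv: str) -> dict:
    """Load the exports and run the basic QA assertions from the checklist."""
    shows = pd.read_csv(shows_csv)
    eps = pd.read_csv(episodes_csv)
    assert not shows.empty, "chart parser returned no shows"
    assert shows["title"].notna().mean() > 0.5, "too many missing show titles"
    assert not eps.empty, "no episodes extracted"
    assert eps["episode_url"].is_unique, "duplicate episode URLs"
    return {"shows": len(shows), "episodes": len(eps)}

# Usage: qa_check("apple_podcasts_shows.csv", "apple_podcasts_episodes.csv")
```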
Next upgrades
- discover and crawl the “See All Episodes” page for full history
- parse durations into seconds and publish dates into ISO
- store results in SQLite and do incremental updates
- enrich with transcript availability / explicit flag (if present)
And that’s the whole pipeline: charts feed shows, shows feed episodes, and ProxiesAPI keeps the many small requests behind it stable as the crawl grows.