Scrape Patreon Creator Data with Python (Profiles, Tiers, Posts)
Patreon creator pages look simple when you open them in a browser.
But once you try to collect data at scale (hundreds/thousands of creators), you run into the usual scraping realities:
- inconsistent HTML across regions/experiments
- occasional bot checks / transient 403s
- slow responses and timeouts
- pages that load extra content via embedded JSON
In this guide, we’ll build a practical Python scraper that:
- captures a screenshot-first “what are we scraping?” artifact
- fetches a creator page via ProxiesAPI (with retries + timeouts)
- extracts creator profile fields you can usually rely on
- discovers tiers (when present)
- pulls a small sample of recent public posts (best-effort)

Creator pages are a classic target for rate limits and geo-based variations. ProxiesAPI helps keep your fetch layer stable when you scale from 1 creator to 10,000.
A quick note on ethics + stability
Patreon content can be paid-gated and personal. Only scrape what you’re allowed to access, respect robots/ToS, and avoid collecting sensitive data.
Also: Patreon is a modern web app. Some data is server-rendered, some is hydrated via JSON. We’ll focus on a best-effort HTML + embedded JSON approach that works surprisingly often.
What we’re scraping
Given a creator URL like:
https://www.patreon.com/<creator>
We’ll try to extract:
- creator display name
- short description / tagline
- category tags (if visible)
- “about” text snippet
- tier list (name + price + description)
- recent public posts (title + url + published date if visible)
Because Patreon’s DOM can change, we’ll implement:
- explicit timeouts
- exponential backoff retries
- selector fallbacks
- a “save raw HTML” debug hook
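Of the items above, the "save raw HTML" debug hook is the easiest to put in place early. Here's a minimal sketch (the `debug_html/` directory and the file-naming scheme are our choices, not anything Patreon- or ProxiesAPI-specific):

```python
import re
from datetime import datetime, timezone
from pathlib import Path


def save_debug_html(creator_url: str, html: str, out_dir: str = "debug_html") -> Path:
    # Write the raw HTML under a timestamped name so a failed parse
    # can be replayed offline against the exact bytes we fetched.
    slug = re.sub(r"[^a-z0-9]+", "-", creator_url.lower()).strip("-")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / f"{slug}-{stamp}.html"
    target.write_text(html, encoding="utf-8")
    return target
```

Call it whenever a parser returns suspiciously empty results, then point your selectors at the saved file instead of re-fetching.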
Setup
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv
```
Create a .env file:
PROXIESAPI_KEY=your_api_key_here
Step 1: Screenshot-first workflow (manual but mandatory)
Before writing selectors, open a creator page in your browser and take a screenshot. This becomes your stable reference when the site inevitably changes.
We’ll store screenshots at:
public/images/posts/<slug>/patreon-creator-page.jpg
Step 2: ProxiesAPI-backed fetch with retries
A production scraper lives or dies on the network layer.
Below is a minimal fetch helper that:
- uses ProxiesAPI as the proxy gateway
- sets realistic connect/read timeouts
- retries transient failures (timeouts, 429/403/5xx)
```python
import os
import time
import random
from dataclasses import dataclass

import requests
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


@dataclass
class FetchConfig:
    proxiesapi_key: str
    timeout: tuple[int, int] = (10, 40)  # (connect, read) seconds


def proxiesapi_url(target_url: str, api_key: str) -> str:
    # ProxiesAPI simple gateway pattern.
    # If your ProxiesAPI plan uses a different endpoint style, adjust here.
    from urllib.parse import quote
    return f"https://api.proxiesapi.com/?auth_key={api_key}&url={quote(target_url, safe='')}"


class TransientHTTPError(RuntimeError):
    pass


@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, TransientHTTPError)),
)
def fetch_html(url: str, cfg: FetchConfig, session: requests.Session | None = None) -> str:
    s = session or requests.Session()
    # Small jitter helps when you're crawling lists.
    time.sleep(random.uniform(0.3, 1.0))
    gateway = proxiesapi_url(url, cfg.proxiesapi_key)
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    r = s.get(gateway, headers=headers, timeout=cfg.timeout)
    # Treat common transient statuses as retryable.
    if r.status_code in (403, 408, 429, 500, 502, 503, 504):
        raise TransientHTTPError(f"Transient status {r.status_code}")
    r.raise_for_status()
    return r.text


if __name__ == "__main__":
    load_dotenv()  # pick up PROXIESAPI_KEY from the .env file created above
    key = os.environ.get("PROXIESAPI_KEY")
    assert key, "Missing PROXIESAPI_KEY"
    cfg = FetchConfig(proxiesapi_key=key)
    html = fetch_html("https://www.patreon.com/patreon", cfg)
    print("bytes:", len(html))
    print(html[:200])
```
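The URL builder deserves a quick offline sanity check, because a mis-encoded target URL fails silently: query parameters on the Patreon URL get swallowed by the gateway instead of forwarded. With `safe=''`, every reserved character in the target URL should be percent-encoded:

```python
from urllib.parse import quote


def proxiesapi_url(target_url: str, api_key: str) -> str:
    # Same gateway pattern as above, repeated so this check runs standalone.
    return f"https://api.proxiesapi.com/?auth_key={api_key}&url={quote(target_url, safe='')}"


url = proxiesapi_url("https://www.patreon.com/patreon?u=1&x=2", "KEY")
print(url)
# https://api.proxiesapi.com/?auth_key=KEY&url=https%3A%2F%2Fwww.patreon.com%2Fpatreon%3F...
```

Note that `?`, `=`, and `&` in the target URL all become `%3F`, `%3D`, and `%26`, so the gateway sees one opaque `url` parameter.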
Step 3: Parse creator profile fields
Patreon pages often include embedded JSON hydration payloads.
We’ll attempt two strategies:
- HTML selectors (fast, simple)
- Embedded JSON scan (more resilient when classnames change)
```python
import json
import re

from bs4 import BeautifulSoup


def clean_text(x: str | None) -> str | None:
    if not x:
        return None
    x = re.sub(r"\s+", " ", x).strip()
    return x or None


def parse_profile_from_html(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Heuristic selectors: keep them conservative.
    # Patreon changes frequently; prefer semantic locations when possible.
    title = None
    og_title = soup.select_one('meta[property="og:title"]')
    if og_title:
        title = clean_text(og_title.get("content"))

    og_desc = soup.select_one('meta[property="og:description"]')
    description = clean_text(og_desc.get("content")) if og_desc else None

    og_url = soup.select_one('meta[property="og:url"]')
    canonical_url = clean_text(og_url.get("content")) if og_url else None

    return {
        "title": title,
        "description": description,
        "canonical_url": canonical_url,
    }


def extract_embedded_json_candidates(html: str) -> list[dict]:
    # Patreon may embed JSON in script tags.
    # We'll pull large JSON-looking blobs and try to decode them.
    out = []
    # Broad heuristic: only consider <script> bodies bigger than 2000 chars.
    for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        body = m.group(1).strip()
        if len(body) < 2000:
            continue
        # Try bodies that look like JSON object literals
        # (not perfect, but useful for debugging).
        if '{"' in body or '":"' in body:
            try:
                j = json.loads(body)
                if isinstance(j, dict):
                    out.append(j)
            except Exception:
                pass
    return out


def parse_creator(html: str) -> dict:
    data = {
        "profile": parse_profile_from_html(html),
        "tiers": [],
        "recent_posts": [],
        "debug": {},
    }
    # Store a tiny debug hint so you can inspect later.
    data["debug"]["html_bytes"] = len(html)
    candidates = extract_embedded_json_candidates(html)
    data["debug"]["json_candidates"] = len(candidates)
    return data
```
At this point, you already have stable metadata (via OpenGraph) that is relatively consistent across modern sites.
Step 4: Extract tiers (best-effort)
Tier extraction is the brittle part.
In practice, I recommend:
- first capture tiers from the page (if visible)
- if tiers are not present or the page is heavily dynamic, switch to a browser automation approach
Here’s a conservative HTML-based tier parser that looks for common tier price patterns.
```python
import re

from bs4 import BeautifulSoup

PRICE_RE = re.compile(r"(\$|₹|£|€)\s*([0-9][0-9,]*(?:\.[0-9]{1,2})?)")


def parse_tiers_best_effort(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    tiers = []

    # Broad heuristic: find repeated blocks that contain a price plus a short
    # heading. This will not be perfect, but it works often enough to bootstrap.
    # If the HTML is too JS-heavy, the visible text is mostly boilerplate, so we
    # scan for prices and capture nearby headings using DOM proximity.
    for el in soup.find_all(string=PRICE_RE):
        price_text = el.strip()
        m = PRICE_RE.search(price_text)
        if not m:
            continue

        # Climb a few levels to a parent container, then search it for a heading.
        container = el.parent
        for _ in range(4):
            if not container:
                break
            container = container.parent
        if not container:
            continue

        heading = None
        for h in container.select("h1,h2,h3")[:1]:
            heading = h.get_text(" ", strip=True)

        desc = None
        p = container.find("p")
        if p:
            desc = p.get_text(" ", strip=True)

        tiers.append({
            "name": heading,
            "price_text": price_text,
            "description": desc,
        })
        if len(tiers) >= 12:
            break

    # De-dupe by (name, price_text).
    seen = set()
    uniq = []
    for t in tiers:
        key = (t.get("name"), t.get("price_text"))
        if key in seen:
            continue
        seen.add(key)
        uniq.append(t)
    return uniq
```
If this returns zero tiers for your target creator, don’t panic. That’s a signal the page is dynamic or the creator hides tiers.
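Before blaming the page, you can verify the two mechanical pieces — the price regex and the de-dupe step — offline. The sample strings and tier dicts below are invented:

```python
import re

PRICE_RE = re.compile(r"(\$|₹|£|€)\s*([0-9][0-9,]*(?:\.[0-9]{1,2})?)")

# A few currency formats the regex should catch, plus one miss.
samples = ["$5 / month", "€12.50 per month", "Join for ₹1,000", "no price here"]
found = [(m.group(1), m.group(2)) for m in (PRICE_RE.search(s) for s in samples) if m]
print(found)  # [('$', '5'), ('€', '12.50'), ('₹', '1,000')]

# The same de-dupe step the parser uses, on duplicated tier dicts.
tiers = [
    {"name": "Fan", "price_text": "$5"},
    {"name": "Fan", "price_text": "$5"},  # the DOM walk often hits a block twice
    {"name": "Supporter", "price_text": "$10"},
]
seen, uniq = set(), []
for t in tiers:
    key = (t["name"], t["price_text"])
    if key not in seen:
        seen.add(key)
        uniq.append(t)
print(len(uniq))  # 2
```

If both of these behave and the real page still yields nothing, the prices simply aren't in the server-rendered HTML.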
Step 5: Extract recent public posts (best-effort)
Patreon public posts are also heavily dynamic, but you can often discover a few via:
- OpenGraph (og:) metadata on post pages
- links on the creator page that match a /posts/ pattern
```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_recent_post_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if "/posts/" not in href:
            continue
        links.append(urljoin(base_url, href))

    # De-dupe while preserving order.
    seen = set()
    out = []
    for u in links:
        if u in seen:
            continue
        seen.add(u)
        out.append(u)
    return out[:10]
```
You can then fetch each post URL and pull OpenGraph metadata:
```python
from bs4 import BeautifulSoup


def parse_og(url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = soup.select_one('meta[property="og:title"]')
    desc = soup.select_one('meta[property="og:description"]')
    return {
        "url": url,
        "title": title.get("content") if title else None,
        "description": desc.get("content") if desc else None,
    }
```
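Link discovery is easy to exercise offline too. This invented fragment mimics a creator page with relative, absolute, duplicate, and non-post links; the helper repeats `extract_recent_post_links` (with the stdlib `"html.parser"`) so the demo runs standalone:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented creator-page fragment.
SAMPLE = """<html><body>
<a href="/posts/hello-world-101">Hello world</a>
<a href="https://www.patreon.com/posts/update-102">Update</a>
<a href="/posts/hello-world-101">Hello world (again)</a>
<a href="/about">About</a>
</body></html>"""


def extract_recent_post_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if href and "/posts/" in href:
            links.append(urljoin(base_url, href))
    # De-dupe while preserving order.
    seen, out = set(), []
    for u in links:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out[:10]


links = extract_recent_post_links(SAMPLE, "https://www.patreon.com/example")
print(links)
```

Note how `urljoin` normalizes the relative `/posts/...` hrefs against the creator URL, so the duplicate resolves to the same absolute URL and gets dropped.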
Full runnable example: scrape one creator
```python
import os
import json

import requests

# Reuse FetchConfig, fetch_html, parse_profile_from_html,
# parse_tiers_best_effort, extract_recent_post_links, and parse_og from above.


def scrape_creator(creator_url: str) -> dict:
    key = os.environ.get("PROXIESAPI_KEY")
    assert key, "Missing PROXIESAPI_KEY"
    cfg = FetchConfig(proxiesapi_key=key)
    session = requests.Session()

    html = fetch_html(creator_url, cfg, session=session)
    profile = parse_profile_from_html(html)
    tiers = parse_tiers_best_effort(html)

    posts = []
    for post_url in extract_recent_post_links(html, creator_url):
        try:
            post_html = fetch_html(post_url, cfg, session=session)
            posts.append(parse_og(post_url, post_html))
        except Exception:
            continue

    return {
        "creator_url": creator_url,
        "profile": profile,
        "tiers": tiers,
        "recent_posts": posts,
    }


if __name__ == "__main__":
    data = scrape_creator("https://www.patreon.com/patreon")
    print(json.dumps(data, indent=2, ensure_ascii=False))
```
Pagination + scaling to many creators
In real usage you’ll have a list of creators (from a directory, search results, or your own input).
A reliable pattern is:
- store the list in SQLite
- crawl in batches
- record last-success timestamp + HTTP status
- re-try failures with backoff
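A minimal sketch of that crawl-state pattern with the stdlib `sqlite3` module (the table name, columns, and retry budget below are our choices, not a fixed schema):

```python
import sqlite3
import time


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS creators (
               url TEXT PRIMARY KEY,
               last_status INTEGER,
               last_success_at REAL,
               fail_count INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def record_result(conn: sqlite3.Connection, url: str, status: int, ok: bool) -> None:
    # Upsert keeps one row per creator across repeated crawl runs;
    # a success resets fail_count, a failure increments it.
    now = time.time() if ok else None
    conn.execute(
        """INSERT INTO creators (url, last_status, last_success_at, fail_count)
           VALUES (?1, ?2, ?3, CASE WHEN ?4 THEN 0 ELSE 1 END)
           ON CONFLICT(url) DO UPDATE SET
               last_status = ?2,
               last_success_at = COALESCE(?3, creators.last_success_at),
               fail_count = CASE WHEN ?4 THEN 0 ELSE creators.fail_count + 1 END""",
        (url, status, now, ok),
    )
    conn.commit()


def pending_retries(conn: sqlite3.Connection, max_fails: int = 5) -> list[str]:
    # Creators whose last fetch failed but that haven't exhausted the budget.
    rows = conn.execute(
        "SELECT url FROM creators WHERE last_status != 200 AND fail_count <= ?",
        (max_fails,),
    ).fetchall()
    return [r[0] for r in rows]
```

Each batch then calls `record_result` after every fetch and feeds `pending_retries` back into the next run.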
If you want a lightweight starting point, a newline-delimited file works too:
https://www.patreon.com/creator1
https://www.patreon.com/creator2
Then:
```python
import os
import json

os.makedirs("out", exist_ok=True)

with open("creators.txt", "r", encoding="utf-8") as f:
    creators = [line.strip() for line in f if line.strip()]

for url in creators:
    try:
        data = scrape_creator(url)
        # Write one JSON file per creator so runs are incremental.
        slug = url.rstrip("/").split("/")[-1]
        with open(f"out/{slug}.json", "w", encoding="utf-8") as out:
            json.dump(data, out, ensure_ascii=False, indent=2)
        print("ok", url)
    except Exception as e:
        print("fail", url, e)
```
QA checklist
- Screenshot saved for the creator page you tested
- fetch_html() uses timeouts + retries
- Profile fields populate (at least og:title, og:description)
- Tier extractor returns sensible values (or you decide to use browser automation)
- You never hammer Patreon (jitter + batching)
Where ProxiesAPI fits (honestly)
ProxiesAPI doesn’t magically “solve” dynamic pages.
What it does help with is the boring-but-critical part of scraping:
- more consistent request success rates
- fewer random 403/429 spikes during long crawls
- the ability to distribute load across IPs/regions if your project needs it
Combine that with screenshot-first debugging and conservative parsers, and you’ll ship scrapers that stay alive longer than a weekend.