Scrape Patreon Creator Data with Python (Profiles, Tiers, Posts)

Patreon creator pages look simple when you open them in a browser.

But once you try to collect data at scale (hundreds/thousands of creators), you run into the usual scraping realities:

  • inconsistent HTML across regions/experiments
  • occasional bot checks / transient 403s
  • slow responses and timeouts
  • pages that load extra content via embedded JSON

In this guide, we’ll build a practical Python scraper that:

  1. captures a screenshot-first “what are we scraping?” artifact
  2. fetches a creator page via ProxiesAPI (with retries + timeouts)
  3. extracts creator profile fields you can usually rely on
  4. discovers tiers (when present)
  5. pulls a small sample of recent public posts (best-effort)

Patreon creator page (example target for profile + tiers scraping)

Make Patreon scraping more reliable with ProxiesAPI

Creator pages are a classic target for rate limits and geo-based variations. ProxiesAPI helps keep your fetch layer stable when you scale from 1 creator to 10,000.


A quick note on ethics + stability

Patreon content can be paid-gated and personal. Only scrape what you’re allowed to access, respect robots/ToS, and avoid collecting sensitive data.

Also: Patreon is a modern web app. Some data is server-rendered, some is hydrated via JSON. We’ll focus on a best-effort HTML + embedded JSON approach that works surprisingly often.


What we’re scraping

Given a creator URL like:

  • https://www.patreon.com/<creator>

We’ll try to extract:

  • creator display name
  • short description / tagline
  • category tags (if visible)
  • “about” text snippet
  • tier list (name + price + description)
  • recent public posts (title + url + published date if visible)

Because Patreon’s DOM can change, we’ll implement:

  • explicit timeouts
  • exponential backoff retries
  • selector fallbacks
  • a “save raw HTML” debug hook
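The "save raw HTML" debug hook can be as small as a helper that writes every fetched page to disk. A minimal sketch (the function name save_debug_html and the debug_html directory are assumptions, not part of any library):

```python
import hashlib
import time
from pathlib import Path


def save_debug_html(html: str, label: str, debug_dir: str = "debug_html") -> Path:
    # Persist the raw page so broken selectors can be diagnosed offline,
    # against the exact bytes you parsed (not a fresh fetch).
    Path(debug_dir).mkdir(exist_ok=True)
    digest = hashlib.sha1(html.encode("utf-8")).hexdigest()[:8]
    path = Path(debug_dir) / f"{label}-{int(time.time())}-{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Call it whenever a parser returns suspiciously empty results; the content hash in the filename makes it easy to spot when two "failures" were actually the same response.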

Setup

Python 3.10+ is assumed (the examples use the X | None union syntax).

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv

Create a .env file:

PROXIESAPI_KEY=your_api_key_here
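With python-dotenv installed, calling load_dotenv() at the top of your script pulls PROXIESAPI_KEY into os.environ. If you'd rather avoid the dependency, a stdlib-only equivalent is a few lines (an illustrative sketch, not the library's implementation):

```python
import os
from pathlib import Path


def load_env(path: str = ".env") -> None:
    # Minimal KEY=VALUE loader; python-dotenv's load_dotenv() handles
    # more cases (quoting, interpolation), but this covers a simple .env.
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # setdefault: a real environment variable wins over the file
        os.environ.setdefault(key.strip(), value.strip())
```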

Step 1: Screenshot-first workflow (manual but mandatory)

Before writing selectors, open a creator page in your browser and take a screenshot. This becomes your stable reference when the site inevitably changes.

We’ll store screenshots at:

  • public/images/posts/<slug>/patreon-creator-page.jpg



Step 2: ProxiesAPI-backed fetch with retries

A production scraper lives or dies on the network layer.

Below is a minimal fetch helper that:

  • uses ProxiesAPI as the proxy gateway
  • sets realistic connect/read timeouts
  • retries transient failures (timeouts, 429/403/5xx)

import os
import time
import random
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


@dataclass
class FetchConfig:
    proxiesapi_key: str
    timeout: tuple[int, int] = (10, 40)  # connect, read


def proxiesapi_url(target_url: str, api_key: str) -> str:
    # ProxiesAPI simple gateway pattern
    # If your ProxiesAPI plan uses a different endpoint style, adjust here.
    from urllib.parse import quote
    return f"https://api.proxiesapi.com/?auth_key={api_key}&url={quote(target_url, safe='')}"


class TransientHTTPError(RuntimeError):
    pass


@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, TransientHTTPError)),
)
def fetch_html(url: str, cfg: FetchConfig, session: requests.Session | None = None) -> str:
    s = session or requests.Session()

    # small jitter helps when you’re crawling lists
    time.sleep(random.uniform(0.3, 1.0))

    gateway = proxiesapi_url(url, cfg.proxiesapi_key)
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    r = s.get(gateway, headers=headers, timeout=cfg.timeout)

    # Treat common transient statuses as retryable
    if r.status_code in (403, 408, 429, 500, 502, 503, 504):
        raise TransientHTTPError(f"Transient status {r.status_code}")

    r.raise_for_status()
    return r.text


if __name__ == "__main__":
    key = os.environ.get("PROXIESAPI_KEY")
    assert key, "Missing PROXIESAPI_KEY"

    cfg = FetchConfig(proxiesapi_key=key)
    html = fetch_html("https://www.patreon.com/patreon", cfg)
    print("bytes:", len(html))
    print(html[:200])

Step 3: Parse creator profile fields

Patreon pages often include embedded JSON hydration payloads.

We’ll attempt two strategies:

  1. HTML selectors (fast, simple)
  2. Embedded JSON scan (more resilient when classnames change)

import json
import re
from bs4 import BeautifulSoup


def clean_text(x: str | None) -> str | None:
    if not x:
        return None
    x = re.sub(r"\s+", " ", x).strip()
    return x or None


def parse_profile_from_html(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Heuristic selectors: keep them conservative.
    # Patreon changes frequently; prefer semantic locations when possible.
    title = None
    og_title = soup.select_one('meta[property="og:title"]')
    if og_title:
        title = clean_text(og_title.get("content"))

    og_desc = soup.select_one('meta[property="og:description"]')
    description = clean_text(og_desc.get("content")) if og_desc else None

    og_url = soup.select_one('meta[property="og:url"]')
    canonical_url = clean_text(og_url.get("content")) if og_url else None

    return {
        "title": title,
        "description": description,
        "canonical_url": canonical_url,
    }


def extract_embedded_json_candidates(html: str) -> list[dict]:
    # Patreon may embed JSON in script tags.
    # We’ll pull large JSON-looking blobs and try to decode.
    out = []

    # a broad heuristic: look for "{" ... "}" blocks bigger than N chars inside <script>
    for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        body = m.group(1).strip()
        if len(body) < 2000:
            continue

        # try to find JSON object literals
        # (not perfect, but useful for debugging)
        if "{\"" in body or "\":\"" in body:
            # sometimes it’s already JSON
            try:
                j = json.loads(body)
                if isinstance(j, dict):
                    out.append(j)
            except Exception:
                pass

    return out


def parse_creator(html: str) -> dict:
    data = {
        "profile": parse_profile_from_html(html),
        "tiers": [],
        "recent_posts": [],
        "debug": {},
    }

    # Store a tiny debug hint so you can inspect later
    data["debug"]["html_bytes"] = len(html)

    candidates = extract_embedded_json_candidates(html)
    data["debug"]["json_candidates"] = len(candidates)

    return data

At this point, you already have stable metadata (via OpenGraph) that is relatively consistent across modern sites.


Step 4: Extract tiers (best-effort)

Tier extraction is the brittle part.

In practice, I recommend:

  • first capture tiers from the page (if visible)
  • if tiers are not present or the page is heavily dynamic, switch to a browser automation approach

Here’s a conservative HTML-based tier parser that looks for common tier price patterns.

from bs4 import BeautifulSoup
import re

PRICE_RE = re.compile(r"(\$|₹|£|€)\s*([0-9][0-9,]*(?:\.[0-9]{1,2})?)")


def parse_tiers_best_effort(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    tiers = []

    # Broad heuristic: scan for price-looking strings and capture nearby
    # headings via DOM proximity. This will not be perfect, but it works
    # often enough to bootstrap. (If the HTML is too JS-heavy, nothing
    # will match and you'll get an empty list.)
    for el in soup.find_all(string=PRICE_RE):
        price_text = el.strip()
        if not PRICE_RE.search(price_text):
            continue

        # climb a few levels to a likely tier container
        container = el.parent
        for _ in range(4):
            if not container:
                break
            container = container.parent

        if not container:
            continue

        h = container.select_one("h1,h2,h3")
        heading = h.get_text(" ", strip=True) if h else None

        desc = None
        p = container.find("p")
        if p:
            desc = p.get_text(" ", strip=True)

        tiers.append({
            "name": heading,
            "price_text": price_text,
            "description": desc,
        })

        if len(tiers) >= 12:
            break

    # de-dupe by (name, price_text)
    seen = set()
    uniq = []
    for t in tiers:
        key = (t.get("name"), t.get("price_text"))
        if key in seen:
            continue
        seen.add(key)
        uniq.append(t)

    return uniq

If this returns zero tiers for your target creator, don’t panic. That’s a signal the page is dynamic or the creator hides tiers.
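One cheap way to make that call programmatically is to compare the visible text to the total payload size: a tiny ratio usually means the page is hydrated client-side. The function name and the 5% threshold below are assumptions you should tune, not a standard API:

```python
import re


def looks_js_heavy(html: str, threshold: float = 0.05) -> bool:
    # Strip script/style bodies, then all remaining tags, and compare
    # the leftover visible text to the raw payload size.
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) / max(len(html), 1) < threshold
```

If this returns True for your target, that's the point to reach for Playwright or a similar browser tool rather than keep tweaking selectors.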


Step 5: Extract recent public posts (best-effort)

Patreon public posts are also heavily dynamic, but you can often discover a few via:

  • og: metadata on post pages
  • links on the creator page that match a /posts/ pattern

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re


def extract_recent_post_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links = []

    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if "/posts/" not in href:
            continue
        url = urljoin(base_url, href)
        links.append(url)

    # de-dupe while preserving order
    seen = set()
    out = []
    for u in links:
        if u in seen:
            continue
        seen.add(u)
        out.append(u)

    return out[:10]

You can then fetch each post URL and pull OpenGraph metadata:

from bs4 import BeautifulSoup


def parse_og(url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = soup.select_one('meta[property="og:title"]')
    desc = soup.select_one('meta[property="og:description"]')
    return {
        "url": url,
        "title": title.get("content") if title else None,
        "description": desc.get("content") if desc else None,
    }

Full runnable example: scrape one creator

import os
import json
import requests

from bs4 import BeautifulSoup

# reuse fetch_html + helpers from above


def scrape_creator(creator_url: str) -> dict:
    key = os.environ.get("PROXIESAPI_KEY")
    assert key, "Missing PROXIESAPI_KEY"

    cfg = FetchConfig(proxiesapi_key=key)
    session = requests.Session()

    html = fetch_html(creator_url, cfg, session=session)

    profile = parse_profile_from_html(html)
    tiers = parse_tiers_best_effort(html)

    posts = []
    for post_url in extract_recent_post_links(html, creator_url):
        try:
            post_html = fetch_html(post_url, cfg, session=session)
            posts.append(parse_og(post_url, post_html))
        except Exception:
            continue

    return {
        "creator_url": creator_url,
        "profile": profile,
        "tiers": tiers,
        "recent_posts": posts,
    }


if __name__ == "__main__":
    data = scrape_creator("https://www.patreon.com/patreon")
    print(json.dumps(data, indent=2, ensure_ascii=False))

Pagination + scaling to many creators

In real usage you’ll have a list of creators (from a directory, search results, or your own input).

A reliable pattern is:

  • store the list in SQLite
  • crawl in batches
  • record last-success timestamp + HTTP status
  • retry failures with backoff
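The pattern above can be sketched with sqlite3 from the standard library; the creators table and its columns are one possible layout, not a prescribed schema:

```python
import sqlite3
import time


def init_db(path: str = "crawl.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS creators (
               url TEXT PRIMARY KEY,
               last_success REAL,
               last_status INTEGER,
               attempts INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def record_result(conn: sqlite3.Connection, url: str, status: int) -> None:
    # Upsert: bump the attempt counter, record the latest status,
    # and only refresh last_success on a 200.
    conn.execute(
        """INSERT INTO creators (url, last_success, last_status, attempts)
           VALUES (?, ?, ?, 1)
           ON CONFLICT(url) DO UPDATE SET
               last_status = excluded.last_status,
               attempts = attempts + 1,
               last_success = CASE WHEN excluded.last_status = 200
                                   THEN excluded.last_success
                                   ELSE last_success END""",
        (url, time.time() if status == 200 else None, status),
    )
    conn.commit()
```

Re-running a crawl then becomes selecting rows where last_status is not 200 and feeding them back through the fetch layer, with tenacity handling per-request backoff.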

If you want a lightweight starting point, a newline-delimited file works too:

https://www.patreon.com/creator1
https://www.patreon.com/creator2

Then:

import json
import os

os.makedirs("out", exist_ok=True)

with open("creators.txt", "r", encoding="utf-8") as f:
    creators = [line.strip() for line in f if line.strip()]

for url in creators:
    try:
        data = scrape_creator(url)
        # write one JSON per creator for incremental runs
        slug = url.rstrip("/").split("/")[-1]
        with open(f"out/{slug}.json", "w", encoding="utf-8") as out:
            json.dump(data, out, ensure_ascii=False, indent=2)
        print("ok", url)
    except Exception as e:
        print("fail", url, e)

QA checklist

  • Screenshot saved for the creator page you tested
  • fetch_html() uses timeouts + retries
  • Profile fields populate (at least og:title, og:description)
  • Tier extractor returns sensible values (or you decide to use browser automation)
  • You never hammer Patreon (jitter + batching)

Where ProxiesAPI fits (honestly)

ProxiesAPI doesn’t magically “solve” dynamic pages.

What it does help with is the boring-but-critical part of scraping:

  • more consistent request success rates
  • fewer random 403/429 spikes during long crawls
  • the ability to distribute load across IPs/regions if your project needs it

Combine that with screenshot-first debugging and conservative parsers, and you’ll ship scrapers that stay alive longer than a weekend.
