Scrape Stack Overflow User Profiles and Badges with Python

Public Stack Overflow profiles are rich, structured pages. A single user page can tell you:

  • reputation and overall stats
  • gold, silver, and bronze badge counts
  • top tags with post counts and tag-specific score
  • community participation across the Stack Exchange network

That makes user profiles useful for hiring research, community analysis, expert discovery, or building a lightweight “developer influence” dataset.

In this guide we’ll scrape a real public profile page, extract the fields above, and export them to JSON and CSV.

Stack Overflow user profile page (we’ll extract stats, badge counts, and top tags)

Keep large profile crawls stable with ProxiesAPI

A few user pages are easy. Thousands of profile fetches across hiring, research, or community analytics are where retry logic, proxy rotation, and IP reputation start to matter. ProxiesAPI plugs into the same `requests` code you already have.


What we’re scraping

We’ll use a public user page like:

https://stackoverflow.com/users/22656/jon-skeet

The page is still mostly server-rendered, so the core scrape works without a browser automation stack.

From the live page source, these areas are especially useful:

  • #stats contains the reputation, answers, questions, and reach figures
  • the badges section contains gold, silver, and bronze counts
  • #top-tags contains per-tag score, post counts, and post share

That gives us stable extraction targets with normal requests plus BeautifulSoup.


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use (on Python 3.9 or newer, which the script needs for str.removeprefix):

  • requests for HTTP
  • BeautifulSoup with lxml for parsing
  • tenacity for retry/backoff

Step 1: Build a fetch layer with retries

Create stack_overflow_profiles.py:

from __future__ import annotations

import json
import os
import random
import re
import time
from dataclasses import asdict, dataclass
from typing import Any

import requests
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_exponential

BASE = "https://stackoverflow.com"
TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)


def build_proxies() -> dict[str, str] | None:
    proxy = os.getenv("PROXIESAPI_PROXY")
    if not proxy:
        return None
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


PROXIES = build_proxies()


def sleep_jitter(low: float = 0.4, high: float = 1.2) -> None:
    time.sleep(random.uniform(low, high))


@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=16))
def fetch_html(url: str) -> str:
    response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)
    response.raise_for_status()

    html = response.text
    if "User " not in html or "Stack Overflow" not in html:
        raise RuntimeError("Unexpected response body; possible interstitial or block page")
    return html
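
As written, the decorator retries on any exception, including a 404 for a profile that does not exist. A minimal sketch of retrying only transient failures follows. `HttpStatusError`, `should_retry`, `call_with_retries`, and the chosen set of status codes are hypothetical names and choices for illustration, not part of the script above. With tenacity, a predicate like this could likely be wired in through `retry=retry_if_exception(should_retry)`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these status codes are treated as temporary and worth retrying.
TRANSIENT_STATUSES = {408, 425, 429, 500, 502, 503, 504}


class HttpStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def should_retry(exc: BaseException) -> bool:
    """Return True only for failures that a later attempt might fix."""
    if isinstance(exc, HttpStatusError):
        return exc.status in TRANSIENT_STATUSES
    return isinstance(exc, (ConnectionError, TimeoutError))


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 16.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, backing off exponentially between transient failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Permanent errors and the final attempt are re-raised unchanged.
            if attempt == attempts or not should_retry(exc):
                raise
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.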

Where ProxiesAPI fits

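The important part is that nothing in the parser changes when you scale up. The script already passes proxies=PROXIES on every request, so setting the PROXIESAPI_PROXY environment variable is equivalent to: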

PROXIES = {
    "http": "http://YOUR_PROXIESAPI_PROXY",
    "https": "http://YOUR_PROXIESAPI_PROXY",
}

response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)

That means you can start with direct requests for low volume, then route through ProxiesAPI later when your crawl gets larger.
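
`build_proxies()` prepends `http://` unconditionally, so a value that already carries a scheme would end up with two. A small sketch of a more forgiving variant follows. The name `proxies_from_env` and the assumption that the variable holds either `host:port` or a full proxy URL are illustrative, so check the value format your ProxiesAPI account actually provides.

```python
import os
from typing import Dict, Optional


def proxies_from_env(var_name: str = "PROXIESAPI_PROXY") -> Optional[Dict[str, str]]:
    """Build a requests-style proxies mapping from an environment variable."""
    value = os.getenv(var_name, "").strip()
    if not value:
        # No proxy configured: callers fall back to direct requests.
        return None
    # Only add a scheme when the value does not already include one.
    proxy_url = value if "://" in value else f"http://{value}"
    return {"http": proxy_url, "https": proxy_url}
```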


Step 2: Parse the profile stats card

On the live page, the stats block lives under #stats. Each stat item renders a numeric value plus a label like reputation, answers, or questions.

We’ll normalize values like 1,528,075 and 426.7m.

MULTIPLIERS = {
    "k": 1_000,
    "m": 1_000_000,
    "b": 1_000_000_000,
}


def parse_compact_number(text: str) -> float | int | None:
    if not text:
        return None

    cleaned = text.strip().lower().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmb])?", cleaned)
    if not match:
        return None

    number = float(match.group(1))
    suffix = match.group(2)
    if suffix:
        number *= MULTIPLIERS[suffix]

    return int(number) if number.is_integer() else number


def parse_stats(soup: BeautifulSoup) -> dict[str, Any]:
    stats_card = soup.select_one("#stats .s-card")
    stats: dict[str, Any] = {}
    if not stats_card:
        return stats

    for item in stats_card.select("div.flex--item"):
        value_el = item.select_one(".fs-body3")
        if not value_el:
            continue

        value_text = value_el.get_text(" ", strip=True)
        raw_text = item.get_text(" ", strip=True)
        label = raw_text.replace(value_text, "", 1).strip().lower()

        if not label:
            continue

        key = label.replace(" ", "_")
        stats[key] = parse_compact_number(value_text)

    return stats

Typical output for the example profile looks like:

{
  "reputation": 1528075,
  "reached": 426700000,
  "answers": 35802,
  "questions": 56
}
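
`parse_compact_number` multiplies a binary float, so an input with more decimal places (for example `1.005k`) can come out as a value just below the intended integer. A sketch of the same parsing with `decimal.Decimal` avoids that. `parse_compact_decimal` is a hypothetical name, and the accepted suffixes mirror the ones above.

```python
import re
from decimal import Decimal
from typing import Optional, Union

MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
COMPACT_RE = re.compile(r"(\d+(?:\.\d+)?)([kmb])?")


def parse_compact_decimal(text: str) -> Optional[Union[int, float]]:
    """Parse strings such as '1,528,075' or '426.7m' using exact decimal math."""
    if not text:
        return None
    cleaned = text.strip().lower().replace(",", "")
    match = COMPACT_RE.fullmatch(cleaned)
    if not match:
        return None
    # A missing suffix maps to a multiplier of 1.
    value = Decimal(match.group(1)) * MULTIPLIERS.get(match.group(2), 1)
    # Whole numbers become int; anything fractional is returned as float.
    return int(value) if value == value.to_integral_value() else float(value)
```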

Step 3: Parse badge counts

The public profile page includes separate cards for gold, silver, and bronze badges. The count is rendered in a bold numeric block followed by a caption like gold badges.

This is a case where a small regex over the raw HTML is simpler than forcing a brittle DOM traversal:

BADGE_RE = re.compile(
    r'<div class="fs-title fw-bold fc-black-600">\s*([\d,]+)\s*</div>\s*'
    r'<div class="fs-caption">\s*(gold|silver|bronze) badges',
    re.IGNORECASE,
)


def parse_badge_counts(html: str) -> dict[str, int]:
    counts = {"gold": 0, "silver": 0, "bronze": 0}
    for count_text, badge_type in BADGE_RE.findall(html):
        counts[badge_type.lower()] = int(count_text.replace(",", ""))
    return counts

This is a pragmatic trade-off. The regex tolerates changes in the nesting around the two elements, but it is still tied to the exact class strings and to the caption text, so it silently returns zeros if either changes. Treat an all-zero result on a profile you know has badges as a signal to recheck the pattern.
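
To make badge extraction depend on the visible text rather than on class names, one option is to strip tags first and match the number that precedes each caption. The sketch below is a variant under stated assumptions, not the parser above: `badge_counts_from_text` is a hypothetical name, and `SAMPLE_HTML` is a hand-written fragment with made-up numbers, not a copy of the live page.

```python
import re
from typing import Dict

TAG_RE = re.compile(r"<[^>]+>")
BADGE_TEXT_RE = re.compile(
    r"(\d[\d,]*)\s+(gold|silver|bronze)\s+badges?",
    re.IGNORECASE,
)


def badge_counts_from_text(html: str) -> Dict[str, int]:
    """Match '<number> <type> badges' in the page text, ignoring markup."""
    # Replace every tag with a space so adjacent text nodes stay separated.
    text = TAG_RE.sub(" ", html)
    counts = {"gold": 0, "silver": 0, "bronze": 0}
    for count_text, badge_type in BADGE_TEXT_RE.findall(text):
        counts[badge_type.lower()] = int(count_text.replace(",", ""))
    return counts


# Made-up markup: three different nestings, none using the original classes.
SAMPLE_HTML = """
<div class="any-class"><div>901</div><div>gold badges</div></div>
<section><b>9,876</b> <span>silver badges</span></section>
<p>9,543 bronze badges</p>
"""

print(badge_counts_from_text(SAMPLE_HTML))
```

The cost of this looser match is that the same phrase elsewhere on the page, such as in a user bio, would also be picked up, with the last occurrence winning.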


Step 4: Parse top tags

The top tags block is more structured. Each row in #top-tags gives us:

  • tag name
  • tag-specific score
  • post count
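  • post percentage

def to_int(text: str) -> int | None:
    # Drop thousands separators, plus a trailing "%" in case the share is rendered like "12%".
    text = text.replace(",", "").strip().rstrip("%").strip()
    return int(text) if text.isdigit() else None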


def parse_top_tags(soup: BeautifulSoup) -> list[dict[str, Any]]:
    rows = soup.select("#top-tags .p12")
    tags: list[dict[str, Any]] = []

    for row in rows:
        tag_link = row.select_one("a.s-tag")
        if not tag_link:
            continue

        metric_values = [el.get_text(" ", strip=True) for el in row.select(".fs-body3")]
        if len(metric_values) < 3:
            continue

        tags.append(
            {
                "tag": tag_link.get_text(" ", strip=True),
                "score": to_int(metric_values[0]),
                "posts": to_int(metric_values[1]),
                "posts_pct": to_int(metric_values[2]),
            }
        )

    return tags

Because Stack Overflow puts the tag table directly in the HTML, there is no need to run JavaScript to get these values.


Step 5: Put it together

@dataclass
class ProfileRecord:
    profile_url: str
    display_name: str | None
    reputation: int | float | None
    reached: int | float | None
    answers: int | float | None
    questions: int | float | None
    gold_badges: int
    silver_badges: int
    bronze_badges: int
    top_tags: list[dict[str, Any]]


def parse_profile(url: str) -> ProfileRecord:
    html = fetch_html(url)
    soup = BeautifulSoup(html, "lxml")

    title_el = soup.select_one("title")
    page_title = title_el.get_text(" ", strip=True) if title_el else ""
    display_name = page_title.removeprefix("User ").removesuffix(" - Stack Overflow").strip() or None

    stats = parse_stats(soup)
    badges = parse_badge_counts(html)
    top_tags = parse_top_tags(soup)

    return ProfileRecord(
        profile_url=url,
        display_name=display_name,
        reputation=stats.get("reputation"),
        reached=stats.get("reached"),
        answers=stats.get("answers"),
        questions=stats.get("questions"),
        gold_badges=badges["gold"],
        silver_badges=badges["silver"],
        bronze_badges=badges["bronze"],
        top_tags=top_tags,
    )

And run it:

if __name__ == "__main__":
    urls = [
        "https://stackoverflow.com/users/22656/jon-skeet",
        "https://stackoverflow.com/users/1144035/servy",
    ]

    records = []
    for url in urls:
        records.append(parse_profile(url))
        sleep_jitter()

    print(json.dumps([asdict(r) for r in records], indent=2))
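
The loop above stops at the first profile that raises. A sketch of a batch runner that records failures and keeps going follows. `collect_profiles` and its parameters are hypothetical, and the parser is passed in as an argument (it would be `parse_profile` in the script) so the example runs without network access.

```python
import random
import time
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def collect_profiles(
    urls: Iterable[str],
    parse: Callable[[str], T],
    pause: Tuple[float, float] = (0.4, 1.2),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[T], Dict[str, str]]:
    """Parse each URL, returning successful records and a url -> error map."""
    records: List[T] = []
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            records.append(parse(url))
        except Exception as exc:
            # Keep the reason so failed URLs can be inspected or retried later.
            failures[url] = f"{type(exc).__name__}: {exc}"
        # Jittered pause after every attempt, successful or not.
        sleep(random.uniform(*pause))
    return records, failures
```

Returning the failures alongside the records makes it easy to re-run only the URLs that did not parse.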

Step 6: Export CSV and JSON

The profile object has a nested top_tags list, so the cleanest pattern is to split the output three ways. Call export(records) at the end of the __main__ block to write:

  • one JSON file for the full profile record
  • one flat CSV for profile-level metrics
  • a second CSV for per-tag rows

# These imports belong with the others at the top of stack_overflow_profiles.py.
import csv
from pathlib import Path


def export(records: list[ProfileRecord], out_dir: str = "output") -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    with open(f"{out_dir}/profiles.json", "w", encoding="utf-8") as f:
        json.dump([asdict(r) for r in records], f, ensure_ascii=False, indent=2)

    with open(f"{out_dir}/profiles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=[
                "profile_url",
                "display_name",
                "reputation",
                "reached",
                "answers",
                "questions",
                "gold_badges",
                "silver_badges",
                "bronze_badges",
            ],
        )
        writer.writeheader()
        for r in records:
            writer.writerow(
                {
                    "profile_url": r.profile_url,
                    "display_name": r.display_name,
                    "reputation": r.reputation,
                    "reached": r.reached,
                    "answers": r.answers,
                    "questions": r.questions,
                    "gold_badges": r.gold_badges,
                    "silver_badges": r.silver_badges,
                    "bronze_badges": r.bronze_badges,
                }
            )

    with open(f"{out_dir}/profile_top_tags.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["profile_url", "display_name", "tag", "score", "posts", "posts_pct"],
        )
        writer.writeheader()
        for r in records:
            for tag in r.top_tags:
                writer.writerow(
                    {
                        "profile_url": r.profile_url,
                        "display_name": r.display_name,
                        **tag,
                    }
                )
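
To see what the per-tag file looks like without touching disk, the same flattening can be run against an in-memory buffer. The record below is made-up sample data, and `tag_rows` and `tags_to_csv` are hypothetical helpers that work on plain dicts (the shape `asdict` produces).

```python
import csv
import io
from typing import Any, Dict, Iterable, Iterator

TAG_FIELDS = ["profile_url", "display_name", "tag", "score", "posts", "posts_pct"]


def tag_rows(record: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per top tag, repeating the profile columns."""
    for tag in record["top_tags"]:
        yield {
            "profile_url": record["profile_url"],
            "display_name": record["display_name"],
            **tag,
        }


def tags_to_csv(records: Iterable[Dict[str, Any]]) -> str:
    """Render the per-tag rows of all records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=TAG_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerows(tag_rows(record))
    return buffer.getvalue()


# Made-up record for illustration only.
SAMPLE_RECORD = {
    "profile_url": "https://example.com/users/1/sample",
    "display_name": "Sample User",
    "top_tags": [
        {"tag": "python", "score": 120, "posts": 40, "posts_pct": 50},
        {"tag": "csv", "score": 30, "posts": 10, "posts_pct": 12},
    ],
}

print(tags_to_csv([SAMPLE_RECORD]))
```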

Common scraping issues on Stack Overflow profiles

1. Compact numbers

426.7m is not directly castable to int. Handle k, m, and b explicitly.

2. Layout-oriented classes

Some classes on Stack Overflow are utility classes, not semantic ones. Favor:

  • container IDs like #stats and #top-tags
  • text labels near values
  • small regexes for repeated visual patterns

3. Over-fetching

If you’re collecting thousands of profiles, slow down. Public pages are not a license to hammer a site.

Practical guardrails:

  • keep concurrency low
  • add jitter between requests
  • retry only on transient failures
  • cache successful fetches during development
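
For the last guardrail, a disk cache keyed by a hash of the URL is usually enough during development. This is a sketch under assumptions: `cached_fetch`, the `.cache` directory, and the `fetch` callable (which would be `fetch_html` in the script) are illustrative names.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: str = ".cache") -> str:
    """Return cached HTML for url if present; otherwise fetch and store it."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Hash the URL so the file name is filesystem-safe and fixed-length.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = directory / f"{key}.html"

    if path.exists():
        return path.read_text(encoding="utf-8")

    # Exceptions from fetch propagate, so failed requests are never cached.
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Delete the cache directory when you want fresh pages. There is no expiry logic in this sketch.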

When to add a browser

You do not need Selenium or Playwright for the core scrape here. Add a browser only when you specifically need:

  • authenticated views
  • screenshots for documentation
  • rendered-only data that does not appear in the initial HTML

For this tutorial, the data we need is already present in the server response, which is exactly why requests is the right default.
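
One way to decide is to check the raw response for the containers the parser relies on before reaching for a browser. The sketch below uses only the standard library. `missing_ids` is a hypothetical helper, and the `stats` and `top-tags` ids are the ones named earlier in this guide.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Set


class IdCollector(HTMLParser):
    """Collect every id attribute seen in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def missing_ids(html: str, required: Iterable[str]) -> List[str]:
    """Return the required ids that do not appear in the initial HTML."""
    collector = IdCollector()
    collector.feed(html)
    return [wanted for wanted in required if wanted not in collector.ids]


# Hand-written fragment for illustration; a real check would use the fetched page.
sample = '<div id="stats"></div><div id="top-tags"></div>'
print(missing_ids(sample, ["stats", "top-tags"]))
```

An empty result suggests plain HTTP fetching is sufficient for those sections. A non-empty one means the markup changed or the content is rendered client-side.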


Final takeaway

Stack Overflow user profiles are a strong example of “real-world but still manageable” scraping. You can get meaningful structured data from:

  • the stats card
  • the badge counts
  • the top tags table

without forcing the problem into browser automation.

Start with direct requests, keep your parser grounded in the real HTML structure, and add ProxiesAPI only when your crawl volume or failure rate makes it worthwhile.
