Scrape Stack Overflow User Profiles and Badges with Python

Jun 18, 2026 · tutorial · #python, #stack-overflow, #web-scraping, #beautifulsoup, #requests, #csv, #json, #proxies

Public Stack Overflow profiles are rich, structured pages. A single user page can tell you:

reputation and overall stats
gold, silver, and bronze badge counts
top tags with post counts and tag-specific score
community participation across the Stack Exchange network

That makes user profiles useful for hiring research, community analysis, expert discovery, or building a lightweight “developer influence” dataset.

In this guide we’ll scrape a real public profile page, extract the fields above, and export them to JSON and CSV.

Stack Overflow user profile page (we’ll extract stats, badge counts, and top tags)

Keep large profile crawls stable with ProxiesAPI

A few user pages are easy. Thousands of profile fetches across hiring, research, or community analytics are where retry logic, proxy rotation, and IP reputation start to matter. ProxiesAPI plugs into the same `requests` code you already have.

Get 1,000 free API calls View pricing

What we’re scraping

We’ll use a public user page like:

https://stackoverflow.com/users/22656/jon-skeet

The HTML is still mostly server-rendered, which is what makes this target practical without a browser automation stack for the core scrape.

From the live page source, these areas are especially useful:

#stats contains the reputation, answers, questions, and reach figures
the badges section contains gold, silver, and bronze counts
#top-tags contains per-tag score, post counts, and post share

That gives us stable extraction targets with normal requests plus BeautifulSoup.

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

requests for HTTP
BeautifulSoup with lxml for parsing
tenacity for retry/backoff

Step 1: Build a fetch layer with retries

Create stack_overflow_profiles.py:

from __future__ import annotations

import json
import os
import random
import re
import time
from dataclasses import asdict, dataclass
from typing import Any

import requests
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_exponential

BASE = "https://stackoverflow.com"
TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)


def build_proxies() -> dict[str, str] | None:
    proxy = os.getenv("PROXIESAPI_PROXY")
    if not proxy:
        return None
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


PROXIES = build_proxies()


def sleep_jitter(low: float = 0.4, high: float = 1.2) -> None:
    time.sleep(random.uniform(low, high))


@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=16))
def fetch_html(url: str) -> str:
    response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)
    response.raise_for_status()

    html = response.text
    if "User " not in html or "Stack Overflow" not in html:
        raise RuntimeError("Unexpected response body; possible interstitial or block page")
    return html

Where ProxiesAPI fits

The important part is that nothing in the parser changes when you scale up. You only add:

PROXIES = {
    "http": "http://YOUR_PROXIESAPI_PROXY",
    "https": "http://YOUR_PROXIESAPI_PROXY",
}

response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)

That means you can start with direct requests for low volume, then route through ProxiesAPI later when your crawl gets larger.

Step 2: Parse the profile stats card

On the live page, the stats block lives under #stats. Each stat item renders a numeric value plus a label like reputation, answers, or questions.

We’ll normalize values like 1,528,075 and 426.7m.

MULTIPLIERS = {
    "k": 1_000,
    "m": 1_000_000,
    "b": 1_000_000_000,
}


def parse_compact_number(text: str) -> float | int | None:
    if not text:
        return None

    cleaned = text.strip().lower().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmb])?", cleaned)
    if not match:
        return None

    number = float(match.group(1))
    suffix = match.group(2)
    if suffix:
        number *= MULTIPLIERS[suffix]

    return int(number) if number.is_integer() else number


def parse_stats(soup: BeautifulSoup) -> dict[str, Any]:
    stats_card = soup.select_one("#stats .s-card")
    stats: dict[str, Any] = {}
    if not stats_card:
        return stats

    for item in stats_card.select("div.flex--item"):
        value_el = item.select_one(".fs-body3")
        if not value_el:
            continue

        value_text = value_el.get_text(" ", strip=True)
        raw_text = item.get_text(" ", strip=True)
        label = raw_text.replace(value_text, "", 1).strip().lower()

        if not label:
            continue

        key = label.replace(" ", "_")
        stats[key] = parse_compact_number(value_text)

    return stats

Typical output for the example profile looks like:

{
  "reputation": 1528075,
  "reached": 426700000,
  "answers": 35802,
  "questions": 56
}

Step 3: Parse badge counts

The public profile page includes separate cards for gold, silver, and bronze badges. The count is rendered in a bold numeric block followed by a caption like gold badges.

This is a case where a small regex over the raw HTML is simpler than forcing a brittle DOM traversal:

BADGE_RE = re.compile(
    r'<div class="fs-title fw-bold fc-black-600">\s*([\d,]+)\s*</div>\s*'
    r'<div class="fs-caption">\s*(gold|silver|bronze) badges',
    re.IGNORECASE,
)


def parse_badge_counts(html: str) -> dict[str, int]:
    counts = {"gold": 0, "silver": 0, "bronze": 0}
    for count_text, badge_type in BADGE_RE.findall(html):
        counts[badge_type.lower()] = int(count_text.replace(",", ""))
    return counts

This is one of those pragmatic scraping choices that is worth making. If the exact visual nesting changes but the text pattern remains stable, the regex version can survive longer than a deeply coupled CSS selector chain.

Step 4: Parse top tags

The top tags block is more structured. Each row in #top-tags gives us:

tag name
tag-specific score
post count
post percentage

def to_int(text: str) -> int | None:
    text = text.replace(",", "").strip()
    return int(text) if text.isdigit() else None


def parse_top_tags(soup: BeautifulSoup) -> list[dict[str, Any]]:
    rows = soup.select("#top-tags .p12")
    tags: list[dict[str, Any]] = []

    for row in rows:
        tag_link = row.select_one("a.s-tag")
        if not tag_link:
            continue

        metric_values = [el.get_text(" ", strip=True) for el in row.select(".fs-body3")]
        if len(metric_values) < 3:
            continue

        tags.append(
            {
                "tag": tag_link.get_text(" ", strip=True),
                "score": to_int(metric_values[0]),
                "posts": to_int(metric_values[1]),
                "posts_pct": to_int(metric_values[2]),
            }
        )

    return tags

Because Stack Overflow puts the tag table directly in the HTML, there is no need to run JavaScript to get these values.

Step 5: Put it together

@dataclass
class ProfileRecord:
    profile_url: str
    display_name: str | None
    reputation: int | float | None
    reached: int | float | None
    answers: int | float | None
    questions: int | float | None
    gold_badges: int
    silver_badges: int
    bronze_badges: int
    top_tags: list[dict[str, Any]]


def parse_profile(url: str) -> ProfileRecord:
    html = fetch_html(url)
    soup = BeautifulSoup(html, "lxml")

    title_el = soup.select_one("title")
    page_title = title_el.get_text(" ", strip=True) if title_el else ""
    display_name = page_title.removeprefix("User ").removesuffix(" - Stack Overflow").strip() or None

    stats = parse_stats(soup)
    badges = parse_badge_counts(html)
    top_tags = parse_top_tags(soup)

    return ProfileRecord(
        profile_url=url,
        display_name=display_name,
        reputation=stats.get("reputation"),
        reached=stats.get("reached"),
        answers=stats.get("answers"),
        questions=stats.get("questions"),
        gold_badges=badges["gold"],
        silver_badges=badges["silver"],
        bronze_badges=badges["bronze"],
        top_tags=top_tags,
    )

And run it:

if __name__ == "__main__":
    urls = [
        "https://stackoverflow.com/users/22656/jon-skeet",
        "https://stackoverflow.com/users/1144035/servy",
    ]

    records = []
    for url in urls:
        records.append(parse_profile(url))
        sleep_jitter()

    print(json.dumps([asdict(r) for r in records], indent=2))

Step 6: Export CSV and JSON

The profile object has a nested top_tags list, so the cleanest pattern is:

one JSON file for the full profile record
one flat CSV for profile-level metrics
one second CSV for per-tag rows

import csv
from pathlib import Path


def export(records: list[ProfileRecord], out_dir: str = "output") -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    with open(f"{out_dir}/profiles.json", "w", encoding="utf-8") as f:
        json.dump([asdict(r) for r in records], f, ensure_ascii=False, indent=2)

    with open(f"{out_dir}/profiles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=[
                "profile_url",
                "display_name",
                "reputation",
                "reached",
                "answers",
                "questions",
                "gold_badges",
                "silver_badges",
                "bronze_badges",
            ],
        )
        writer.writeheader()
        for r in records:
            writer.writerow(
                {
                    "profile_url": r.profile_url,
                    "display_name": r.display_name,
                    "reputation": r.reputation,
                    "reached": r.reached,
                    "answers": r.answers,
                    "questions": r.questions,
                    "gold_badges": r.gold_badges,
                    "silver_badges": r.silver_badges,
                    "bronze_badges": r.bronze_badges,
                }
            )

    with open(f"{out_dir}/profile_top_tags.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["profile_url", "display_name", "tag", "score", "posts", "posts_pct"],
        )
        writer.writeheader()
        for r in records:
            for tag in r.top_tags:
                writer.writerow(
                    {
                        "profile_url": r.profile_url,
                        "display_name": r.display_name,
                        **tag,
                    }
                )

Common scraping issues on Stack Overflow profiles

1. Compact numbers

426.7m is not directly castable to int. Handle k, m, and b explicitly.

2. Layout-oriented classes

Some classes on Stack Overflow are utility classes, not semantic ones. Favor:

container IDs like #stats and #top-tags
text labels near values
small regexes for repeated visual patterns

3. Over-fetching

If you’re collecting thousands of profiles, slow down. Public pages are not a license to hammer a site.

Practical guardrails:

keep concurrency low
add jitter between requests
retry only on transient failures
cache successful fetches during development

When to add a browser

You do not need Selenium or Playwright for the core scrape here. Add a browser only when you specifically need:

authenticated views
screenshots for documentation
rendered-only data that does not appear in the initial HTML

For this tutorial, the data we need is already present in the server response, which is exactly why requests is the right default.

Final takeaway

Stack Overflow user profiles are a strong example of “real-world but still manageable” scraping. You can get meaningful structured data from:

the stats card
the badge counts
the top tags table

without forcing the problem into browser automation.

Start with direct requests, keep your parser grounded in the real HTML structure, and add ProxiesAPI only when your crawl volume or failure rate makes it worthwhile.

Keep large profile crawls stable with ProxiesAPI

Get 1,000 free API calls View pricing

Related guides

Scrape Book Data from Goodreads

Build a Goodreads dataset with book titles, authors, ratings, and review counts from a public list page using Python and an optional ProxiesAPI fetch layer.

tutorial#python#goodreads#books

Scrape GitHub Trending Repositories with Python

Build a daily GitHub Trending dataset with Python: collect repository names, languages, star counts, and URLs, then export clean CSV or JSON with an optional ProxiesAPI fetch layer.

tutorial#python#github#web-scraping

Scrape Stack Overflow Newest Questions into CSV with Python

Collect Stack Overflow's newest questions with Python: titles, tags, votes, answers, timestamps, and URLs exported into clean CSV files with an optional ProxiesAPI request layer.

tutorial#python#stack-overflow#web-scraping

Scrape Stack Overflow Questions and Answers

Extract Stack Overflow question listings, votes, tags, accepted answers, and code blocks with Python. This guide uses real selectors and a ProxiesAPI-ready request layer for larger crawls.

tutorial#python#stack-overflow#web-scraping