Scrape GitHub Repository Data

Jun 20, 2026 · tutorial · #python, #github, #web-scraping, #beautifulsoup, #research, #proxiesapi

GitHub already has a public API, so why scrape the HTML at all? Because for quick competitor research, the HTML is often the shortest path to exactly what people see on the page: repo name, stars, forks, topics, and latest-release metadata.

In this tutorial we will build a Python scraper that:

fetches repository pages directly
extracts repo identity, stars, forks, topics, and last release date
handles GitHub-style abbreviated counts like 1.3k
exports rows to CSV for further analysis
keeps ProxiesAPI isolated to the fetch layer

GitHub repository page screenshot

Use ProxiesAPI when GitHub scraping stops being lightweight

GitHub is often easy to parse at low volume, but research workflows can balloon into hundreds of pages and detail requests. ProxiesAPI gives you a cleaner way to add retries and rotation without rebuilding your scraper.

What to scrape on a repository page

For market and competitor research, the HTML repository page is usually enough. A good baseline record contains:

owner
repo
stars
forks
topics
latest_release_tag
latest_release_date

GitHub’s markup changes occasionally, so the safest strategy is to rely on stable patterns:

star link: a[href$="/stargazers"]
fork link: a[href$="/forks"]
topics: a.topic-tag
release tag: a[href*="/releases/tag/"]
release date: relative-time

Those selectors are much less brittle than trying to anchor to one giant layout container.
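
As a quick check of those patterns, the sketch below runs them against a tiny hand-written HTML fragment. The fragment is an assumption made up for illustration (the example-org/example-repo names and all values are placeholders), not a copy of GitHub's real markup, and it uses Python's built-in html.parser backend so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a repository page; real GitHub markup is larger
# and may differ in detail.
SAMPLE_HTML = """
<div>
  <a href="/example-org/example-repo/stargazers"><strong>1.3k</strong> stars</a>
  <a href="/example-org/example-repo/forks"><strong>915</strong> forks</a>
  <a class="topic-tag" href="/topics/python">python</a>
  <a class="topic-tag" href="/topics/http">http</a>
  <a href="/example-org/example-repo/releases/tag/v1.0.0">
    v1.0.0 <relative-time datetime="2024-01-01T00:00:00Z">Jan 1, 2024</relative-time>
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each lookup keys off a URL suffix, a class, or an element name rather than
# a position in the page layout.
stars = soup.select_one('a[href$="/stargazers"] strong').get_text(strip=True)
forks = soup.select_one('a[href$="/forks"] strong').get_text(strip=True)
topics = [tag.get_text(strip=True) for tag in soup.select("a.topic-tag")]
release_href = soup.select_one('a[href*="/releases/tag/"]')["href"]
release_date = soup.select_one("relative-time")["datetime"]

print(stars, forks, topics, release_href, release_date)
```

Because none of the lookups depend on a wrapper div or a layout class, a page redesign only breaks them if the link targets or the topic-tag class themselves change.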

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We will scrape repository pages such as:

https://github.com/psf/requests
https://github.com/pallets/flask
https://github.com/fastapi/fastapi

That is enough to build a useful comparison sheet for tools in the same category.

Step 1: Fetch HTML with retries and optional ProxiesAPI

import os
import random
import re
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 40)
UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0 Safari/537.36"

session = requests.Session()
session.headers.update({"User-Agent": UA, "Accept-Language": "en-US,en;q=0.9"})


def proxiesapi_url(target_url: str) -> str:
    key = os.environ.get("PROXIESAPI_KEY")
    if not key:
        return target_url
    return f"http://api.proxiesapi.com/?auth_key={quote(key)}&url={quote(target_url, safe='')}"


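def fetch_html(target_url: str, max_retries: int = 4) -> str:
    """Fetch a page, retrying failed attempts with capped exponential backoff."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            response = session.get(proxiesapi_url(target_url), timeout=TIMEOUT)
            response.raise_for_status()
            # A very small body usually means an error or interstitial page, not a repo page.
            if len(response.text) < 5000:
                raise ValueError("unexpectedly short HTML")
            return response.text
        except (requests.RequestException, ValueError) as exc:
            last_error = exc
            if attempt == max_retries:
                break
            # Waits about 1.7s, 2.9s, then 4.9s (capped at 10s), plus up to 0.5s of jitter.
            time.sleep(min(10, 1.7 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"failed to fetch {target_url}: {last_error}")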

If your workflow is just a few repositories, direct requests may be enough. If you are pulling dozens or hundreds of pages repeatedly, ProxiesAPI becomes useful because you can keep the scraping logic unchanged and only swap the transport.
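
To make the "swap the transport" idea concrete, here is a minimal sketch of what the wrapping does to a target URL. The function name wrap_for_proxy and the key "demo-key" are placeholders invented for this example; the endpoint format is the one used in the fetch code above.

```python
from urllib.parse import quote


def wrap_for_proxy(target_url: str, key: str = "") -> str:
    """Return the URL to request: direct when no key is set, wrapped otherwise."""
    if not key:
        return target_url
    # safe="" percent-encodes ":" and "/" as well, so the whole target URL
    # travels as a single query-string value.
    return f"http://api.proxiesapi.com/?auth_key={quote(key)}&url={quote(target_url, safe='')}"


direct = wrap_for_proxy("https://github.com/psf/requests")
wrapped = wrap_for_proxy("https://github.com/psf/requests", "demo-key")

print(direct)
print(wrapped)
```

The parser never sees which of the two URLs was requested, which is why the rest of the scraper does not need to change.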

Step 2: Count parser for GitHub-style abbreviations

GitHub often shows counts as 915, 1.3k, or 2.1m. Normalize those before exporting.

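def parse_count(text: str | None) -> int | None:
    """Convert display counts such as "915", "1.3k", or "2.1m" to integers."""
    if not text:
        return None

    value = text.lower().replace(",", "").strip()
    # The trailing \b keeps the first letter of a following word
    # (for example "3 more") from being read as a k/m suffix.
    match = re.search(r"([0-9]*\.?[0-9]+)\s*([km])?\b", value)
    if not match:
        return None

    number = float(match.group(1))
    suffix = match.group(2)

    # round() rather than int() so float error cannot truncate a count by one.
    if suffix == "k":
        return round(number * 1_000)
    if suffix == "m":
        return round(number * 1_000_000)
    return round(number)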

That one helper makes your CSV sortable again.
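
A small sketch of what "sortable" means here: the same display strings sorted as text and then sorted by their parsed values. The block carries a compact variant of parse_count so it runs on its own, and the input list is made up for illustration.

```python
import re


def parse_count(text):
    """Compact variant of the count parser, repeated so this block is standalone."""
    if not text:
        return None
    value = text.lower().replace(",", "").strip()
    match = re.search(r"([0-9]*\.?[0-9]+)\s*([km])?\b", value)
    if not match:
        return None
    scale = {"k": 1_000, "m": 1_000_000}.get(match.group(2), 1)
    return round(float(match.group(1)) * scale)


raw = ["915", "1.3k", "2.1m", "12,400"]

as_text = sorted(raw)                      # character-by-character ordering
as_numbers = sorted(raw, key=parse_count)  # ordering by parsed integer value

print(as_text)     # ['1.3k', '12,400', '2.1m', '915']
print(as_numbers)  # ['915', '1.3k', '12,400', '2.1m']
```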

Step 3: Parse one repository page

from urllib.parse import urlparse
from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for selector in selectors:
        node = soup.select_one(selector)
        if node:
            text = " ".join(node.get_text(" ", strip=True).split())
            if text:
                return text
    return None


def parse_repo_page(repo_url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    parts = urlparse(repo_url).path.strip("/").split("/")
    owner, repo = parts[:2]

    stars_text = first_text(soup, ['a[href$="/stargazers"] strong', 'a[href$="/stargazers"]'])
    forks_text = first_text(soup, ['a[href$="/forks"] strong', 'a[href$="/forks"]'])
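    # Read the tag from the link's href; the link text may also carry a badge or date.
    release_link = soup.select_one('a[href*="/releases/tag/"]')
    release_tag = release_link["href"].split("/releases/tag/", 1)[1] if release_link else None

    # Take the first relative-time at or after the release link, so an earlier
    # timestamp on the page (such as a commit date) is not mistaken for the release date.
    release_time = release_link.find_next("relative-time") if release_link else None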
    topics = [
        " ".join(tag.get_text(" ", strip=True).split())
        for tag in soup.select("a.topic-tag")
    ]

    return {
        "owner": owner,
        "repo": repo,
        "repo_url": repo_url,
        "stars": parse_count(stars_text),
        "forks": parse_count(forks_text),
        "topics": ", ".join(topics),
        "latest_release_tag": release_tag,
        "latest_release_date": release_time.get("datetime") if release_time else None,
    }

Why scrape per-repository pages instead of just the search page? Because the repository page gives you cleaner topic tags and release metadata, which are usually the first things you want in competitor research.

Step 4: Crawl a list of repositories and export CSV

import csv


def scrape_repositories(repo_urls: list[str]) -> list[dict]:
    rows = []
    for repo_url in repo_urls:
        html = fetch_html(repo_url)
        rows.append(parse_repo_page(repo_url, html))
    return rows


def write_csv(path: str, rows: list[dict]) -> None:
    fieldnames = [
        "owner",
        "repo",
        "repo_url",
        "stars",
        "forks",
        "topics",
        "latest_release_tag",
        "latest_release_date",
    ]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)


if __name__ == "__main__":
    targets = [
        "https://github.com/psf/requests",
        "https://github.com/pallets/flask",
        "https://github.com/fastapi/fastapi",
    ]
    rows = scrape_repositories(targets)
    write_csv("github_repos.csv", rows)
    print(f"wrote {len(rows)} rows to github_repos.csv")

Sample output (selected columns):

owner	repo	stars	forks	topics	latest_release_tag
psf	requests	53000	9500	python, http, requests	v2.32.5

That is already enough to compare maturity, adoption, and scope across competing repositories.
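
One way to start that comparison is to read the CSV back and rank by stars. The sketch below uses an in-memory stand-in for github_repos.csv: the first row reuses the sample above, and the two example-org rows are made-up placeholders, not real measurements.

```python
import csv
import io

# Stand-in for github_repos.csv (trimmed to four columns for brevity).
SAMPLE_CSV = """owner,repo,stars,forks
psf,requests,53000,9500
example-org,alpha,1200,300
example-org,beta,8700,150
"""


def load_rows(handle) -> list[dict]:
    """Read CSV rows and turn the count columns back into integers."""
    rows = []
    for row in csv.DictReader(handle):
        for column in ("stars", "forks"):
            # Counts that were None when written come back as empty cells.
            row[column] = int(row[column]) if row[column] else None
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
ranked = sorted(rows, key=lambda r: r["stars"] or 0, reverse=True)

for row in ranked:
    print(row["owner"], row["repo"], row["stars"], row["forks"])
```

Swapping io.StringIO(SAMPLE_CSV) for open("github_repos.csv", newline="", encoding="utf-8") would run the same ranking on the real export.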

When HTML scraping is the right call

Use repository-page scraping when:

you want exactly what users see
you only need lightweight research
you prefer one scraping stack across many sites

Use the GitHub API when:

you need authenticated access
you want issue, PR, or release pagination at scale
you need long-term schema stability

That tradeoff looks like this:

Approach	Best for	Main tradeoff
HTML scraping	fast competitor scans, visible metadata	selectors can drift
GitHub API	structured bulk extraction	more endpoint-specific code
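
For comparison, the API route for the same baseline fields could look roughly like the sketch below. It uses the public repository endpoint of GitHub's REST API with only the standard library; the function names, the User-Agent string, and the sample payload values are assumptions for illustration. Release data lives on a separate endpoint and is left out here, which is a small taste of the "endpoint-specific code" tradeoff.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com/repos"


def fetch_repo_json(owner: str, repo: str) -> dict:
    """Unauthenticated GET of the repository endpoint (subject to rate limits)."""
    request = urllib.request.Request(
        f"{API_ROOT}/{owner}/{repo}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "repo-research-sketch",  # placeholder client name
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def row_from_api(payload: dict) -> dict:
    """Map the API payload onto the same columns the HTML scraper produces."""
    return {
        "owner": payload["owner"]["login"],
        "repo": payload["name"],
        "repo_url": payload["html_url"],
        "stars": payload["stargazers_count"],
        "forks": payload["forks_count"],
        "topics": ", ".join(payload.get("topics", [])),
    }


# Trimmed, hand-written payload with placeholder values, so the mapping can
# be checked without a network call.
sample_payload = {
    "name": "example-repo",
    "owner": {"login": "example-org"},
    "html_url": "https://github.com/example-org/example-repo",
    "stargazers_count": 1300,
    "forks_count": 915,
    "topics": ["python", "http"],
}

print(row_from_api(sample_payload))
```

Calling row_from_api(fetch_repo_json("psf", "requests")) would produce a live row; the counts arrive as exact integers, so no abbreviation parsing is needed on this path.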

The important point is not to overcomplicate the first version. If your actual job is market research, a clean HTML-first scraper often gets you to answers faster.
