Scrape GitHub Repository Data

GitHub already has a public API, so why scrape the HTML at all? Because for fast competitor research, HTML is often the shortest path to the exact information humans see on the page: repo name, stars, forks, topics, release cadence, and last-updated metadata.

In this tutorial we will build a Python scraper that:

  • fetches repository pages directly
  • extracts repo identity, stars, forks, topics, and last release date
  • handles GitHub-style abbreviated counts like 1.3k
  • exports rows to CSV for further analysis
  • keeps ProxiesAPI isolated to the fetch layer

GitHub repository page screenshot

Use ProxiesAPI when GitHub scraping stops being lightweight

GitHub is often easy to parse at low volume, but research workflows can balloon into hundreds of pages and detail requests. ProxiesAPI gives you a cleaner way to add retries and rotation without rebuilding your scraper.


What to scrape on a repository page

For market and competitor research, the HTML repository page is usually enough. A good baseline record contains:

  • owner
  • repo
  • stars
  • forks
  • topics
  • latest_release_tag
  • latest_release_date

GitHub’s markup changes occasionally, so the safest strategy is to rely on stable patterns:

  • star link: a[href$="/stargazers"]
  • fork link: a[href$="/forks"]
  • topics: a.topic-tag
  • release tag: a[href*="/releases/tag/"]
  • release date: the datetime attribute of a relative-time element

Those selectors are much less brittle than trying to anchor to one giant layout container.
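
To see how these selectors behave, here is a minimal sketch run against a hand-written HTML fragment. The fragment, the example/demo repository, and its values are assumptions made for illustration; GitHub's real markup is far larger and may differ. Only the selectors come from the list above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; not copied from a real GitHub page.
SAMPLE_HTML = """
<div>
  <a href="/example/demo/stargazers"><strong>1.3k</strong> stars</a>
  <a href="/example/demo/forks"><strong>915</strong> forks</a>
  <a class="topic-tag" href="/topics/python">python</a>
  <a class="topic-tag" href="/topics/http">http</a>
  <a href="/example/demo/releases/tag/v1.0.0">
    v1.0.0 <relative-time datetime="2024-01-01T00:00:00Z">Jan 1, 2024</relative-time>
  </a>
</div>
"""

# The built-in parser is enough for this small fragment.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

stars = soup.select_one('a[href$="/stargazers"] strong').get_text(strip=True)
forks = soup.select_one('a[href$="/forks"] strong').get_text(strip=True)
topics = [tag.get_text(strip=True) for tag in soup.select("a.topic-tag")]
release_href = soup.select_one('a[href*="/releases/tag/"]')["href"]
release_date = soup.select_one("relative-time")["datetime"]

print(stars, forks, topics, release_href, release_date)
```

Each selector keys on a URL suffix, a class name, or an element name rather than on page layout, which is why a layout change is less likely to break it.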


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We will scrape repository pages such as:

  • https://github.com/psf/requests
  • https://github.com/pallets/flask
  • https://github.com/fastapi/fastapi

That is enough to build a useful comparison sheet for tools in the same category.


Step 1: Fetch HTML with retries and optional ProxiesAPI

import os
import random
import re
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 40)
UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0 Safari/537.36"

session = requests.Session()
session.headers.update({"User-Agent": UA, "Accept-Language": "en-US,en;q=0.9"})


def proxiesapi_url(target_url: str) -> str:
    key = os.environ.get("PROXIESAPI_KEY")
    if not key:
        return target_url
    return f"http://api.proxiesapi.com/?auth_key={quote(key)}&url={quote(target_url, safe='')}"


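def fetch_html(target_url: str, max_retries: int = 4) -> str:
    """Fetch a page, retrying with capped exponential backoff plus jitter."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            response = session.get(proxiesapi_url(target_url), timeout=TIMEOUT)
            response.raise_for_status()
            # A very short body is more likely a block or error page than a repo page.
            if len(response.text) < 5000:
                raise ValueError("unexpectedly short HTML")
            return response.text
        except (requests.RequestException, ValueError) as exc:
            # Retry only network/HTTP errors and the short-body check above.
            last_error = exc
            if attempt == max_retries:
                break
            # Delay grows as 1.7 ** attempt, capped at 10s, plus up to 0.5s of jitter.
            time.sleep(min(10, 1.7 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"failed to fetch {target_url}: {last_error}") from last_error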

If your workflow is just a few repositories, direct requests may be enough. If you are pulling dozens or hundreds of pages repeatedly, ProxiesAPI becomes useful because you can keep the scraping logic unchanged and only swap the transport.
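
To make the "swap the transport" idea concrete, the sketch below shows what the wrapped URL and the retry delays look like. The names wrap_url and backoff_delays and the key DEMO_KEY are hypothetical, introduced only for this illustration; the endpoint and the backoff formula are the ones used in the fetch code above.

```python
from typing import Optional
from urllib.parse import quote

PROXY_ENDPOINT = "http://api.proxiesapi.com/"  # same endpoint as in proxiesapi_url


def wrap_url(target_url: str, key: Optional[str]) -> str:
    """Hypothetical variant of proxiesapi_url that takes the key as an argument."""
    if not key:
        return target_url  # no key: request the page directly
    # safe='' percent-encodes ':' and '/' so the target survives as one parameter.
    return f"{PROXY_ENDPOINT}?auth_key={quote(key)}&url={quote(target_url, safe='')}"


def backoff_delays(max_retries: int = 4, base: float = 1.7, cap: float = 10.0) -> list:
    """Delays in seconds before jitter; no sleep follows the final attempt."""
    return [round(min(cap, base ** attempt), 2) for attempt in range(1, max_retries)]


direct = wrap_url("https://github.com/psf/requests", None)
proxied = wrap_url("https://github.com/psf/requests", "DEMO_KEY")  # placeholder key
delays = backoff_delays()

print(direct)
print(proxied)
print(delays)
```

The parser never sees which of the two URLs was requested, so the parsing code stays the same either way.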


Step 2: Count parser for GitHub-style abbreviations

GitHub often shows counts as 915, 1.3k, or 2.1m. Normalize those before exporting.

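def parse_count(text: str | None) -> int | None:
    """Convert strings like "915", "1.3k", or "2.1m" to an integer."""
    # Note: the `str | None` annotations need Python 3.10 or newer.
    if not text:
        return None

    value = text.lower().replace(",", "").strip()
    # The trailing \b keeps a following word (for example "more") from
    # being read as a k/m suffix.
    match = re.search(r"([0-9]*\.?[0-9]+)\s*([km])?\b", value)
    if not match:
        return None

    number = float(match.group(1))
    suffix = match.group(2)

    # round() rather than int(): a float product can land just below the
    # intended whole number, and int() would truncate it.
    if suffix == "k":
        return round(number * 1_000)
    if suffix == "m":
        return round(number * 1_000_000)
    return int(number)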

That one helper turns the abbreviated strings into plain integers, so the stars and forks columns sort numerically in the CSV.
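
If you would rather avoid floating-point arithmetic altogether, one option is to do the multiplication with decimal.Decimal. The sketch below is a hypothetical variant (parse_count_exact is not part of the scraper above) with a few spot checks; it assumes the same input shapes as the count parser.

```python
import re
from decimal import Decimal
from typing import Optional

MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_count_exact(text: Optional[str]) -> Optional[int]:
    """Hypothetical Decimal-based variant of the count parser."""
    if not text:
        return None
    match = re.search(r"(\d*\.?\d+)\s*([km])?\b", text.lower().replace(",", ""))
    if not match:
        return None
    # Decimal keeps "1.3" exact, so 1.3 * 1000 is exactly 1300.
    number = Decimal(match.group(1))
    return int(number * MULTIPLIERS.get(match.group(2), 1))


for sample in ["915", "1.3k", "2.1m", "12,345", "53.2k stars", None]:
    print(repr(sample), "->", parse_count_exact(sample))
```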


Step 3: Parse one repository page

from urllib.parse import urlparse
from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for selector in selectors:
        node = soup.select_one(selector)
        if node:
            text = " ".join(node.get_text(" ", strip=True).split())
            if text:
                return text
    return None


def parse_repo_page(repo_url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

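    # Expect a path shaped like /<owner>/<repo>.
    parts = urlparse(repo_url).path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {repo_url}")
    owner, repo = parts[:2]

    stars_text = first_text(soup, ['a[href$="/stargazers"] strong', 'a[href$="/stargazers"]'])
    forks_text = first_text(soup, ['a[href$="/forks"] strong', 'a[href$="/forks"]'])

    # Take the tag from the link's href, since the link text may carry
    # extra words (a label or a date) besides the tag itself.
    release_link = soup.select_one('a[href*="/releases/tag/"]')
    release_tag = (
        release_link["href"].rsplit("/releases/tag/", 1)[-1] if release_link else None
    )

    # Prefer a <relative-time> inside the release link; the first one on the
    # page may belong to something else, such as a commit timestamp.
    release_time = (
        release_link.select_one("relative-time") if release_link else None
    ) or soup.select_one("relative-time")
    topics = [
        " ".join(tag.get_text(" ", strip=True).split())
        for tag in soup.select("a.topic-tag")
    ]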

    return {
        "owner": owner,
        "repo": repo,
        "repo_url": repo_url,
        "stars": parse_count(stars_text),
        "forks": parse_count(forks_text),
        "topics": ", ".join(topics),
        "latest_release_tag": release_tag,
        "latest_release_date": release_time.get("datetime") if release_time else None,
    }

Why scrape per-repository pages instead of just the search page? Because the repository page gives you cleaner topic tags and release metadata, which are usually the first things you want in competitor research.


Step 4: Crawl a list of repositories and export CSV

import csv


def scrape_repositories(repo_urls: list[str]) -> list[dict]:
    rows = []
    for repo_url in repo_urls:
        html = fetch_html(repo_url)
        rows.append(parse_repo_page(repo_url, html))
    return rows


def write_csv(path: str, rows: list[dict]) -> None:
    fieldnames = [
        "owner",
        "repo",
        "repo_url",
        "stars",
        "forks",
        "topics",
        "latest_release_tag",
        "latest_release_date",
    ]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)


if __name__ == "__main__":
    targets = [
        "https://github.com/psf/requests",
        "https://github.com/pallets/flask",
        "https://github.com/fastapi/fastapi",
    ]
    rows = scrape_repositories(targets)
    write_csv("github_repos.csv", rows)
    print(f"wrote {len(rows)} rows to github_repos.csv")
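
The crawl loop above stops at the first repository that fails, because fetch_html raises once its retries are exhausted. For longer lists, a variant that records failures and keeps going may be more practical. The sketch below is one way to do that; scrape_all, the delay_seconds parameter, and the stand-in fake_fetch and fake_parse functions are all hypothetical names added for illustration.

```python
import time
from typing import Callable


def scrape_all(
    repo_urls: list,
    fetch: Callable[[str], str],
    parse: Callable[[str, str], dict],
    delay_seconds: float = 1.0,
) -> tuple:
    """Return (rows, failures) instead of stopping at the first error."""
    rows, failures = [], []
    for index, repo_url in enumerate(repo_urls):
        if index and delay_seconds:
            time.sleep(delay_seconds)  # pause between requests, not before the first
        try:
            rows.append(parse(repo_url, fetch(repo_url)))
        except Exception as exc:  # record the failure and move on
            failures.append({"repo_url": repo_url, "error": str(exc)})
    return rows, failures


# Stand-in fetch/parse functions so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    if url.endswith("/missing"):
        raise RuntimeError(f"failed to fetch {url}")
    return "<html></html>"


def fake_parse(url: str, html: str) -> dict:
    return {"repo_url": url}


rows, failures = scrape_all(
    ["https://github.com/psf/requests", "https://github.com/psf/missing"],
    fake_fetch,
    fake_parse,
    delay_seconds=0,
)
print(len(rows), "rows,", len(failures), "failures")
```

In the real script you would pass fetch_html and parse_repo_page in place of the stand-ins and write the failures list to a separate file for a later retry.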

Sample output columns:

owner  repo      stars  forks  topics                  latest_release_tag
psf    requests  53000  9500   python, http, requests  v2.32.5

That is already enough to compare maturity, adoption, and scope across competing repositories.
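
One detail worth checking before analysis: the topics column contains commas, and every value comes back from the CSV as a string. The short round trip below uses the sample row from above and an in-memory buffer in place of a file, purely for illustration.

```python
import csv
import io

row = {"owner": "psf", "repo": "requests", "stars": 53000, "topics": "python, http, requests"}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# The topics field contains commas, so the csv module wraps it in quotes.
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: split topics into a list and convert stars to an int.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
topics = [topic.strip() for topic in loaded["topics"].split(",")]
stars = int(loaded["stars"])
print(topics, stars)
```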


When HTML scraping is the right call

Use repository-page scraping when:

  • you want exactly what users see
  • you only need lightweight research
  • you prefer one scraping stack across many sites

Use the GitHub API when:

  • you need authenticated access
  • you want issue, PR, or release pagination at scale
  • you need long-term schema stability

That tradeoff looks like this:

Approach       Best for                                 Main tradeoff
HTML scraping  fast competitor scans, visible metadata  selectors can drift
GitHub API     structured bulk extraction               more endpoint-specific code

The important point is not to overcomplicate the first version. If your actual job is market research, a clean HTML-first scraper often gets you to answers faster.
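
For comparison, here is a rough sketch of the API route for the same columns, using only the standard library. It assumes the public REST endpoint for a repository and its stargazers_count, forks_count, and topics fields; the function names, the User-Agent string, and the sample payload values are made up for illustration, and the network call is defined but not executed here.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_repo_request(owner: str, repo: str) -> urllib.request.Request:
    """Build (but do not send) a request for the repository endpoint."""
    return urllib.request.Request(
        f"{API_ROOT}/repos/{owner}/{repo}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "repo-research-sketch",  # hypothetical client name
        },
    )


def api_row(payload: dict) -> dict:
    """Map the API's JSON payload onto the same columns as the HTML scraper."""
    return {
        "owner": payload["owner"]["login"],
        "repo": payload["name"],
        "repo_url": payload["html_url"],
        "stars": payload["stargazers_count"],
        "forks": payload["forks_count"],
        "topics": ", ".join(payload.get("topics", [])),
    }


def fetch_api_row(owner: str, repo: str) -> dict:
    # Network call; unauthenticated requests are rate limited.
    with urllib.request.urlopen(build_repo_request(owner, repo), timeout=30) as response:
        return api_row(json.load(response))


# Made-up payload with the same shape, so the mapping can be shown offline.
sample_payload = {
    "name": "demo",
    "owner": {"login": "example"},
    "html_url": "https://github.com/example/demo",
    "stargazers_count": 1300,
    "forks_count": 915,
    "topics": ["python", "http"],
}
sample_row = api_row(sample_payload)
print(sample_row)
```

The counts arrive as integers, so no abbreviation parsing is needed, but release data lives behind a separate endpoint, which is the "more endpoint-specific code" tradeoff from the table.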

