How to Scrape PyPI Project Pages with Python

Mar 11, 2026 · tutorial · #python, #pypi, #web-scraping, #requests, #beautifulsoup

PyPI project pages are a compact source of package metadata:

latest version
summary / description
project URLs
classifiers
release history links

This tutorial shows how to scrape a PyPI project page with Python and BeautifulSoup, extract the fields you actually care about, and export them into structured JSON.

PyPI project page

Turn PyPI scraping into a stable monitoring workflow

If you’re collecting package metadata across many projects on a schedule, the fetch layer becomes the fragile part. ProxiesAPI helps keep those HTTP requests predictable.

Get 1,000 free API calls View pricing

Target URL

https://pypi.org/project/requests/

We’ll use requests as the example package, but the same scraper pattern works for other PyPI project pages.

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch the HTML

import requests

URL = "https://pypi.org/project/requests/"
TIMEOUT = (10, 30)
UA = "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"

html = requests.get(URL, timeout=TIMEOUT, headers={"User-Agent": UA}).text
print(len(html))
print(html[:300])

This gives us raw HTML we can inspect before we guess selectors.

Step 2: Inspect the DOM

On a PyPI project page, the key metadata is usually near the top:

package name
current version
one-line summary
side-panel metadata
classifiers lower on the page

The exact markup can change over time, so the safe approach is:

extract the obvious high-signal fields first
keep selectors narrow and readable
use defensive fallbacks where possible

Step 3: Extract the package name and version

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

name = soup.select_one("h1.package-header__name").get_text(" ", strip=True)
print(name)

PyPI combines the package name and version in the header, so you’ll usually parse this into parts.

header_text = soup.select_one("h1.package-header__name").get_text(" ", strip=True)
parts = header_text.split()
package_name = parts[0]
version = parts[-1]

print(package_name)
print(version)

Example output:

requests
2.32.3

Step 4: Extract the summary

summary = soup.select_one("p.package-header__summary")
summary_text = summary.get_text(" ", strip=True) if summary else None
print(summary_text)

This gives you the one-line description shown under the package header.

Step 5: Extract project URLs and metadata

PyPI project pages often include a metadata sidebar with:

homepage
documentation link
source repo
author / maintainers
license

A practical strategy is to collect the visible key-value rows and normalize them later.

sidebar = soup.select("div.sidebar-section")
for block in sidebar[:5]:
    print(block.get_text(" ", strip=True)[:200])

For a more structured pull, you can extract labeled links:

project_links = {}
for a in soup.select("a.vertical-tabs__tab, a[href]"):
    text = a.get_text(" ", strip=True)
    href = a.get("href")
    if text and href and text.lower() in {"homepage", "documentation", "source", "release history"}:
        project_links[text] = href

print(project_links)

Step 6: Extract classifiers

Classifiers are one of the most useful fields on PyPI because they help categorize packages by:

supported Python versions
license
topic
development status

classifiers = []
for li in soup.select(".sidebar-section li"):
    text = li.get_text(" ", strip=True)
    if "::" in text:
        classifiers.append(text)

print(classifiers[:10])

You won’t always get a perfectly isolated classifier list with one selector, so it’s fine to filter by the :: pattern if the DOM is mixed.

Step 7: Put it all together

import requests
from bs4 import BeautifulSoup


def scrape_pypi_project(url: str) -> dict:
    html = requests.get(
        url,
        timeout=(10, 30),
        headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"},
    ).text

    soup = BeautifulSoup(html, "lxml")

    header = soup.select_one("h1.package-header__name")
    header_text = header.get_text(" ", strip=True) if header else ""
    parts = header_text.split()

    package_name = parts[0] if parts else None
    version = parts[-1] if len(parts) >= 2 else None

    summary = soup.select_one("p.package-header__summary")
    summary_text = summary.get_text(" ", strip=True) if summary else None

    classifiers = []
    for li in soup.select(".sidebar-section li"):
        text = li.get_text(" ", strip=True)
        if "::" in text:
            classifiers.append(text)

    return {
        "url": url,
        "package_name": package_name,
        "version": version,
        "summary": summary_text,
        "classifiers": classifiers,
    }


if __name__ == "__main__":
    data = scrape_pypi_project("https://pypi.org/project/requests/")
    print(data)

Example output

{
  "url": "https://pypi.org/project/requests/",
  "package_name": "requests",
  "version": "2.32.3",
  "summary": "Python HTTP for Humans.",
  "classifiers": [
    "Development Status :: 5 - Production/Stable",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: Apache Software License"
  ]
}

Common gotchas

1. Header parsing is combined

The package name and version often live in the same header string, so don’t assume separate selectors exist.

2. Classifiers may need filtering

Depending on the page structure, the easiest path is sometimes “collect candidate list items, then filter by ::”.

3. Don’t overfit to one package

Always test on a few packages with different amounts of metadata before you productionize the scraper.

Export to JSON

import json

with open("pypi_project.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

Selector summary

Field	Selector / rule
Package header	`h1.package-header__name`
Summary	`p.package-header__summary`
Classifiers	`.sidebar-section li` filtered by `::`

When to use a Proxy API

For one-off scraping, direct requests are fine.

For larger monitoring jobs — scraping hundreds or thousands of package pages on a schedule — the request layer becomes the fragile part.

def fetch_with_proxy(url: str) -> str:
    proxy_url = f"http://api.proxiesapi.com/?key=YOUR_API_KEY&url={url}"
    return requests.get(proxy_url).text

If you're building a scraping project that needs to scale beyond a few hundred pages, check out Proxies API — we handle proxy rotation, browser fingerprinting, CAPTCHAs, and automatic retries so you can focus on the data extraction logic. Start with 1,000 free API calls.

Turn PyPI scraping into a stable monitoring workflow

If you’re collecting package metadata across many projects on a schedule, the fetch layer becomes the fragile part. ProxiesAPI helps keep those HTTP requests predictable.

Get 1,000 free API calls View pricing

Related guides

Scrape GitHub Repository Data

Collect GitHub repository metadata, stars, forks, topics, and README-linked context from the public HTML with Python. Includes defensive selectors, CSV export, and a screenshot.

tutorial#python#github#web-scraping

Scrape Secondhand Fashion Listings from Vinted

Show how to extract Vinted search listings, prices, brands, and image URLs into a resale-market dataset with Python, screenshots, and a ProxiesAPI-ready fetch layer.

tutorial#python#vinted#web-scraping

Scrape Book Reviews and Ratings from Goodreads

Extract Goodreads review text, star ratings, review counts, and reviewer metadata for a clean book-sentiment dataset.

tutorial#python#goodreads#web-scraping

Scrape Secondhand Fashion Listings from Vinted

Capture Vinted search listings with title, price, brand, size, image, and listing URL into a reusable resale dataset.

tutorial#python#vinted#ecommerce