#web-scraping

14 guides

How to Scrape the Python Docs Module Index with Python
Build a searchable dataset from the Python docs module index using Python and BeautifulSoup.
How to Scrape MDN Docs Pages with Python
Extract headings and table-of-contents structure from MDN docs pages with Python and BeautifulSoup.
How to Scrape IMDb Top 250 with Python (Without Guessing Selectors)
A real-world IMDb scraping tutorial covering browser-rendered HTML, verified selectors, sample output, and why naive requests can fail.
ScrapingBee Pricing: Best Alternatives and When to Use Each
A practical guide to ScrapingBee pricing, alternatives, and when a simpler proxy API may be a better fit for your scraping workload.
How to Scrape PyPI Project Pages with Python
Fetch PyPI project pages and extract package metadata like version, description, and classifiers with Python and BeautifulSoup.
How to Scrape npm Package Pages with Python
Scrape npm package pages to extract version, description, and package metadata with Python and BeautifulSoup.
Soft-Block Detection for Web Scraping (Python): Catch ‘HTTP 200 but Wrong Page’
Most scrapers fail silently: the request succeeds, but the HTML is a block, consent, or login page. Here’s how to detect soft blocks before parsing.
How to Scrape GitHub Trending with Python (and Export to CSV/JSON)
A practical GitHub Trending scraper: fetch the Trending page, extract repo names + language + stars, and export a clean dataset.
How to Scrape GitHub Releases with Python (Versions + Notes + Diffs)
Scrape a GitHub Releases page, extract versions and release notes, and store structured data so you can alert on changes.
Free Proxy Lists vs a Proxy API: Why Free Breaks in Production
Free proxies look attractive until your scraper scales. Here’s what fails first, what a proxy API actually fixes, and how to choose the right setup.
Scrape a WordPress Site via sitemap_index.xml (Python): Crawl, Extract, Dedupe, Export
A production-grade, sitemap-first WordPress scraper in Python (no guessed selectors): crawl sitemaps, fetch posts, extract clean text + metadata, and export to CSV/JSON.
Scrape Stack Overflow Questions by Tag with Python (No API): Titles, Votes, Answers
A practical Stack Overflow scraper that collects questions from a tag page (e.g. web-scraping), follows pagination, extracts key fields, and exports to CSV/JSON.
Retries, Timeouts, and Backoff for Web Scraping (Python): Production Defaults That Work
Most scrapers fail because of networking, not parsing. Here are sane timeout defaults, a retry policy that won’t DDoS a site, and a drop-in requests/httpx implementation.
How to Scrape Hacker News (HN) with Python: Stories + Pagination + Comments
A production-grade Hacker News scraper: parse the real HTML, crawl multiple pages, extract stories and comment threads, and export clean JSON. Includes terminal-style runs and selector rationale.