Web Scraping with Python: The Complete 2026 Tutorial
If you searched for web scraping python, you probably want one thing: a scraper that works today and doesn’t collapse the moment you scale it.
This guide is a complete 2026-ready walkthrough covering:
- the core stack: requests + BeautifulSoup
- selector strategy (how to avoid “guessy” scrapers)
- pagination
- retries + exponential backoff
- parsing + validation
- exporting CSV/JSON
- a reusable “production template” you can adapt to any HTML site
- where ProxiesAPI fits (network reliability), without overclaiming
Once your scraper grows beyond a handful of URLs, failures often come from the network layer. ProxiesAPI gives you a simple proxy-backed fetch URL so your Python scraper fails less and retries recover more often.
1) Choose the right approach: HTML vs API
Before you scrape, check if the site already provides:
- a public API
- an RSS feed
- downloadable exports
Scraping HTML is fine when:
- the data is publicly visible in the browser
- the HTML structure is stable enough
- you can crawl politely (rate limits, limited pages)
If you do scrape HTML, treat it like integration work: it will break sometimes.
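Politeness starts before the first request: the standard library can parse a site's robots.txt and tell you whether a path is allowed. A minimal sketch (the rules below are invented for illustration, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice you'd point
# RobotFileParser at https://<site>/robots.txt via set_url() + read().
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /blog/",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyScraper/1.0", "https://example.com/blog/post-1"))  # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/admin/users"))  # False
```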
2) Setup (the boring part that prevents 80% of bugs)
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
Why lxml? It’s fast and generally more forgiving of messy real-world HTML than Python’s built-in parser.
3) A fetch layer that won’t betray you
Most “beginner” scrapers die because:
- no timeouts (script hangs forever)
- no retries (transient failures kill the run)
- no headers (you get alternate HTML)
Use a session + sane defaults.
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # connect, read
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()


def fetch(url: str, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = (
                    "http://api.proxiesapi.com/"
                    f"?key={quote(proxiesapi_key)}&url={quote(url, safe='')}"
                )
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            if attempt < retries:  # don't sleep after the final attempt
                sleep_s = (2 ** attempt) + random.random()
                print(f"attempt {attempt}/{retries} failed: {e}. sleeping {sleep_s:.1f}s")
                time.sleep(sleep_s)
    raise RuntimeError(f"failed after {retries} retries: {last}")
```
The ProxiesAPI request shape
```shell
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
```
This keeps your parsing code identical; only the fetch URL changes.
4) Parsing: stop guessing selectors
Your goal is to extract values using selectors that map to real HTML.
A practical workflow:
- Open the page in a browser
- Inspect the element you need
- Copy a stable selector pattern (ids, data-* attributes)
- Add a fallback selector (A/B tests happen)
Here’s a helper that makes fallbacks easy:
```python
from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None
```
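For instance, when the primary selector misses, the fallback still finds the text. A self-contained demo (it repeats the helper so it runs on its own; the class names `.author` and `.byline` are invented, and `beautifulsoup4` must be installed):

```python
from bs4 import BeautifulSoup


def first_text(soup, selectors):
    # Try each selector in order; return the first non-empty text match.
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None


html = '<div><span class="byline">Jane Doe</span></div>'
soup = BeautifulSoup(html, "html.parser")

# ".author" matches nothing, so the ".byline" fallback kicks in.
print(first_text(soup, [".author", ".byline"]))  # Jane Doe
```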
5) Example target: a simple blog index with pagination
Assume a site like:
- index page: https://example.com/blog
- page 2: https://example.com/blog?page=2
- each post card has: title link, author, date
Your parser should be:
- specific enough to avoid false positives
- flexible enough to survive small changes
```python
from urllib.parse import urljoin


def parse_index(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    for card in soup.select("article"):
        a = card.select_one("h2 a") or card.select_one("a")
        if not a:
            continue
        title = a.get_text(" ", strip=True)
        href = a.get("href")
        url = urljoin(base_url, href) if href else None
        if not title or not url:
            continue
        author = first_text(card, [".author", "[rel='author']"])  # example fallbacks
        date = first_text(card, ["time", "[data-testid='date']"])  # example fallbacks
        out.append({"title": title, "url": url, "author": author, "date": date})
    return out
```
6) Pagination: crawl N pages safely
Key rules:
- crawl a fixed max pages (don’t “while True” without a stop)
- dedupe by a stable key (URL, id)
- sleep a bit between pages
```python
import time


def crawl(base_url: str, pages: int = 5, proxiesapi_key: str | None = None) -> list[dict]:
    all_items = []
    seen = set()
    for p in range(1, pages + 1):
        url = base_url if p == 1 else f"{base_url}?page={p}"
        html = fetch(url, proxiesapi_key=proxiesapi_key)
        items = parse_index(html, base_url=base_url)
        for it in items:
            key = it.get("url")
            if not key or key in seen:
                continue
            seen.add(key)
            all_items.append(it)
        print(f"page {p}/{pages}: {len(items)} items (total {len(all_items)})")
        time.sleep(1.0)
    return all_items
```
7) Validate, then export
Validation is underrated. At minimum, check:
- required fields are present
- numeric fields parse
- URLs look like URLs
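Those checks can be a short predicate. A minimal sketch (the field names `title` and `url` match the items built above; the sample records are made up):

```python
from urllib.parse import urlparse


def is_valid(item: dict) -> bool:
    # Required fields must be present and non-empty.
    if not item.get("title") or not item.get("url"):
        return False
    # URLs should at least have an http(s) scheme and a host.
    parts = urlparse(item["url"])
    return parts.scheme in ("http", "https") and bool(parts.netloc)


items = [
    {"title": "Post A", "url": "https://example.com/a"},
    {"title": "", "url": "https://example.com/b"},
    {"title": "Post C", "url": "not-a-url"},
]
print([it["title"] for it in items if is_valid(it)])  # ['Post A']
```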
Export JSON + CSV:
```python
import csv
import json


def export(items: list[dict], name: str = "scrape"):
    with open(f"{name}.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    if items:
        with open(f"{name}.csv", "w", encoding="utf-8", newline="") as f:
            w = csv.DictWriter(f, fieldnames=list(items[0].keys()))
            w.writeheader()
            for it in items:
                w.writerow(it)
    print("wrote", f"{name}.json", "rows", len(items))
```
8) A reusable “production template”
This is the pattern you can reuse:
- fetch() (timeouts, headers, retries, optional ProxiesAPI)
- parse_*() functions per page type
- crawl() that orchestrates and dedupes
- export()
Put it together:
```python
def main():
    start_url = "https://example.com/blog"
    proxiesapi_key = None  # "YOUR_KEY"
    items = crawl(start_url, pages=3, proxiesapi_key=proxiesapi_key)
    # basic validation
    items = [it for it in items if it.get("title") and it.get("url")]
    export(items, name="blog_posts")


if __name__ == "__main__":
    main()
```
9) Common failure modes (and fixes)
- Empty fields: selector mismatch → inspect HTML and update selectors.
- Different HTML per request: missing headers/cookies → set headers, keep a session.
- Random 403/429: throttling → add backoff, reduce rate, consider proxy-backed fetch.
- Broken pagination: you assumed ?page= but it’s ?p= or start= → confirm by clicking “Next”.
10) Where ProxiesAPI fits
When you’re scraping at small scale (a few pages), you might not need any proxying.
When you scale up:
- more URLs
- more repeat runs
- more failures from IP-based throttling
…ProxiesAPI gives you a simple proxy-backed fetch URL while keeping your scraper code the same:
```shell
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
```
Combine that with timeouts + retries + polite pagination and your success rate typically improves.