Web Scraping with Python: The Complete 2026 Tutorial

If you searched for "web scraping python", you probably want one thing: a scraper that works today and doesn’t collapse the moment you scale it.

This guide is a complete 2026-ready walkthrough covering:

  • the core stack: requests + BeautifulSoup
  • selector strategy (how to avoid “guessy” scrapers)
  • pagination
  • retries + exponential backoff
  • parsing + validation
  • exporting CSV/JSON
  • a reusable “production template” you can adapt to any HTML site
  • where ProxiesAPI fits (network reliability), without overclaiming

Make your scrapers more reliable with ProxiesAPI

Once your scraper grows beyond a handful of URLs, failures often come from the network layer. ProxiesAPI gives you a simple proxy-backed fetch URL so your Python scraper fails less and retries recover more often.


1) Choose the right approach: HTML vs API

Before you scrape, check if the site already provides:

  • a public API
  • an RSS feed
  • downloadable exports

Scraping HTML is fine when:

  • the data is publicly visible in the browser
  • the HTML structure is stable enough
  • you can crawl politely (rate limits, limited pages)

If you do scrape HTML, treat it like integration work: it will break sometimes.


2) Setup (the boring part that prevents 80% of bugs)

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Why lxml? It’s fast and generally more forgiving of malformed markup than the stdlib html.parser.
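
For instance, you can ask BeautifulSoup for lxml and fall back to the stdlib parser if it isn’t installed (a minimal sketch):

```python
from bs4 import BeautifulSoup

markup = "<p>unclosed <b>bold"  # deliberately broken HTML

try:
    # lxml: fast C-based parser, tolerant of broken markup
    soup = BeautifulSoup(markup, "lxml")
except Exception:
    # stdlib fallback if lxml isn't available
    soup = BeautifulSoup(markup, "html.parser")

print(soup.get_text())
```

Both parsers recover the text here; the differences show up in speed and in how edge-case markup is repaired.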


3) A fetch layer that won’t betray you

Most “beginner” scrapers die for the same three reasons:

  • no timeouts (the script hangs forever)
  • no retries (a single transient failure kills the whole run)
  • no headers (many servers return different or stripped-down HTML to non-browser clients)

Use a session + sane defaults.

import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # connect, read

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()


def fetch(url: str, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = f"http://api.proxiesapi.com/?key={quote(proxiesapi_key)}&url={quote(url, safe='')}"
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)

            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            if attempt == retries:
                break  # final attempt failed; don't sleep before raising
            sleep_s = (2 ** attempt) + random.random()  # exponential backoff + jitter
            print(f"attempt {attempt}/{retries} failed: {e}. sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)

    raise RuntimeError(f"failed after {retries} attempts: {last}")

The ProxiesAPI request shape

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

This keeps your parsing code identical; only the fetch URL changes.


4) Parsing: stop guessing selectors

Your goal is to extract values using selectors that map to real HTML.

A practical workflow:

  1. Open the page in a browser
  2. Inspect the element you need
  3. Copy a stable selector pattern (ids, data-* attributes)
  4. Add a fallback selector (A/B tests happen)

Here’s a helper that makes fallbacks easy:

from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None

5) Example target: a simple blog index with pagination

Assume a site like:

  • index page: https://example.com/blog
  • page 2: https://example.com/blog?page=2
  • each post card has: title link, author, date

Your parser should be:

  • specific enough to avoid false positives
  • flexible enough to survive small changes

from urllib.parse import urljoin


def parse_index(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []
    for card in soup.select("article"):
        a = card.select_one("h2 a") or card.select_one("a")
        if not a:
            continue

        title = a.get_text(" ", strip=True)
        href = a.get("href")
        url = urljoin(base_url, href) if href else None

        if not title or not url:
            continue

        author = first_text(card, [".author", "[rel='author']"])  # example fallbacks
        date = first_text(card, ["time", "[data-testid='date']"])  # example fallbacks

        out.append({"title": title, "url": url, "author": author, "date": date})

    return out
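
One detail worth knowing here: urljoin resolves relative hrefs differently depending on whether the base URL ends with a slash, which is why parse_index always joins against the page URL rather than concatenating strings:

```python
from urllib.parse import urljoin

# absolute-path hrefs replace the whole path
print(urljoin("https://example.com/blog", "/posts/hello"))
# relative hrefs resolve against the base's parent "directory"
print(urljoin("https://example.com/blog", "posts/hello"))
# a trailing slash on the base changes the result
print(urljoin("https://example.com/blog/", "posts/hello"))
```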

6) Pagination: crawl N pages safely

Key rules:

  • crawl a fixed max pages (don’t “while True” without a stop)
  • dedupe by a stable key (URL, id)
  • sleep a bit between pages

import time


def crawl(base_url: str, pages: int = 5, proxiesapi_key: str | None = None) -> list[dict]:
    all_items = []
    seen = set()

    for p in range(1, pages + 1):
        url = base_url if p == 1 else f"{base_url}?page={p}"
        html = fetch(url, proxiesapi_key=proxiesapi_key)
        items = parse_index(html, base_url=base_url)

        for it in items:
            key = it.get("url")
            if not key or key in seen:
                continue
            seen.add(key)
            all_items.append(it)

        print(f"page {p}/{pages}: {len(items)} items (total {len(all_items)})")
        time.sleep(1.0)

    return all_items

7) Validate, then export

Validation is underrated. At minimum, check:

  • required fields are present
  • numeric fields parse
  • URLs look like URLs
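
A minimal validation sketch along those lines (`is_valid` and `REQUIRED` are illustrative names, not part of the template above):

```python
from urllib.parse import urlparse

REQUIRED = ("title", "url")


def is_valid(item: dict) -> bool:
    # required fields present and non-empty
    if any(not item.get(k) for k in REQUIRED):
        return False
    # URLs look like URLs: an http(s) scheme and a host
    parts = urlparse(item["url"])
    return parts.scheme in ("http", "https") and bool(parts.netloc)


items = [
    {"title": "Hello", "url": "https://example.com/hello"},
    {"title": "", "url": "https://example.com/empty"},   # missing title
    {"title": "Bad", "url": "not-a-url"},                # malformed URL
]
print([it["title"] for it in items if is_valid(it)])
```

Drop invalid rows before exporting so a selector regression shows up as a shrinking row count instead of a file full of blanks.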

Export JSON + CSV:

import csv
import json


def export(items: list[dict], name: str = "scrape") -> None:
    with open(f"{name}.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)

    if items:
        with open(f"{name}.csv", "w", encoding="utf-8", newline="") as f:
            # assumes every row shares the keys of the first item
            w = csv.DictWriter(f, fieldnames=list(items[0].keys()))
            w.writeheader()
            w.writerows(items)

    print(f"wrote {name}.json ({len(items)} rows)")

8) A reusable “production template”

This is the pattern you can reuse:

  • fetch() (timeouts, headers, retries, optional ProxiesAPI)
  • parse_*() functions per page type
  • crawl() that orchestrates and dedupes
  • export()

Put it together:

def main():
    start_url = "https://example.com/blog"
    proxiesapi_key = None  # "YOUR_KEY"

    items = crawl(start_url, pages=3, proxiesapi_key=proxiesapi_key)

    # basic validation
    items = [it for it in items if it.get("title") and it.get("url")]

    export(items, name="blog_posts")


if __name__ == "__main__":
    main()

9) Common failure modes (and fixes)

  • Empty fields: selector mismatch → inspect HTML and update selectors.
  • Different HTML per request: missing headers/cookies → set headers, keep a session.
  • Random 403/429: throttling → add backoff, reduce rate, consider proxy-backed fetch.
  • Broken pagination: you assumed ?page= but it’s ?p= or start= → confirm by clicking “Next”.
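
For the 403/429 case specifically, many servers send a numeric Retry-After header with 429 responses, and honoring it recovers faster than blind backoff. A small sketch (backoff_delay is a hypothetical helper; a plain dict stands in for response.headers here):

```python
def backoff_delay(headers: dict, default: float) -> float:
    """Seconds to wait before retrying, honoring a numeric Retry-After."""
    retry_after = headers.get("Retry-After")
    if retry_after is None:
        return default
    try:
        return float(retry_after)
    except ValueError:
        # Retry-After may also be an HTTP date; fall back to the default
        return default


print(backoff_delay({"Retry-After": "7"}, default=4.0))
print(backoff_delay({}, default=4.0))
```

With requests, `response.headers` is case-insensitive, so you can pass it straight in before sleeping.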

10) Where ProxiesAPI fits

When you’re scraping at small scale (a few pages), you might not need any proxying.

When you scale up:

  • more URLs
  • more repeat runs
  • more failures from IP-based throttling

…ProxiesAPI gives you a simple proxy-backed fetch URL while keeping your scraper code the same:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

Combine that with timeouts + retries + polite pagination and your success rate typically improves.

Related guides

Scrape Product Data from Amazon (with Python + ProxiesAPI)
Extract Amazon product title, price, rating, and availability from a product page using requests + BeautifulSoup, with retries and proxy-backed fetching via ProxiesAPI.
Build a Job Board with Data from Indeed (Python scraper tutorial)
Scrape Indeed job listings (title, company, location, salary, summary) with Python (requests + BeautifulSoup), then save a clean dataset you can render as a simple job board. Includes pagination + ProxiesAPI fetch.
Scrape OpenStreetMap Wiki pages with Python
Collect category pages and linked wiki entries into a structured index for research or monitoring.
How to Scrape MDN Docs Pages with Python
Extract headings and table-of-contents structure from MDN docs pages with Python and BeautifulSoup.