Web Scraping with Python: The Complete 2026 Tutorial
If you searched for web scraping python, you probably want one thing: a scraper that works today and doesn’t collapse the moment you scale it.
This guide is a complete 2026-ready walkthrough covering:
- the core stack: requests + BeautifulSoup
- selector strategy (how to avoid “guessy” scrapers)
- pagination
- retries + exponential backoff
- parsing + validation
- exporting CSV/JSON
- a reusable “production template” you can adapt to any HTML site
- where ProxiesAPI fits (network reliability), without overclaiming
Once your scraper grows beyond a handful of URLs, failures often come from the network layer. ProxiesAPI gives you a simple proxy-backed fetch URL so your Python scraper fails less and retries recover more often.
1) Choose the right approach: HTML vs API
Before you scrape, check if the site already provides:
- a public API
- an RSS feed
- downloadable exports
Scraping HTML is fine when:
- the data is publicly visible in the browser
- the HTML structure is stable enough
- you can crawl politely (rate limits, limited pages)
If you do scrape HTML, treat it like integration work: it will break sometimes.
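Politeness starts before the first request: the standard library can parse a site's robots.txt and tell you whether a path is allowed. A minimal sketch (the rules below are invented for illustration, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice you'd point
# RobotFileParser at https://<site>/robots.txt via set_url() + read().
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /blog/",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyScraper/1.0", "https://example.com/blog/post-1"))  # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/admin/users"))  # False
```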
2) Setup (the boring part that prevents 80% of bugs)
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
Why lxml? It’s fast and generally more forgiving of messy real-world HTML than Python’s built-in parser.
3) A fetch layer that won’t betray you
Most “beginner” scrapers die because:
- no timeouts (script hangs forever)
- no retries (transient failures kill the run)
- no headers (you get alternate HTML)
Use a session + sane defaults.
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # connect, read
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()


def fetch(url: str, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = (
                    "http://api.proxiesapi.com/"
                    f"?key={quote(proxiesapi_key)}&url={quote(url, safe='')}"
                )
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            if attempt < retries:  # don't sleep after the final attempt
                sleep_s = (2 ** attempt) + random.random()
                print(f"attempt {attempt}/{retries} failed: {e}. sleeping {sleep_s:.1f}s")
                time.sleep(sleep_s)
    raise RuntimeError(f"failed after {retries} retries: {last}")
```
The ProxiesAPI request shape
```shell
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
```
This keeps your parsing code identical; only the fetch URL changes.
4) Parsing: stop guessing selectors
Your goal is to extract values using selectors that map to real HTML.
A practical workflow:
- Open the page in a browser
- Inspect the element you need
- Copy a stable selector pattern (ids, data-* attributes)
- Add a fallback selector (A/B tests happen)
Here’s a helper that makes fallbacks easy:
```python
from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None
```
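For instance, when the primary selector misses, the fallback still finds the text. A self-contained demo (it repeats the helper so it runs on its own; the class names `.author` and `.byline` are invented, and `beautifulsoup4` must be installed):

```python
from bs4 import BeautifulSoup


def first_text(soup, selectors):
    # Try each selector in order; return the first non-empty text match.
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None


html = '<div><span class="byline">Jane Doe</span></div>'
soup = BeautifulSoup(html, "html.parser")

# ".author" matches nothing, so the ".byline" fallback kicks in.
print(first_text(soup, [".author", ".byline"]))  # Jane Doe
```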
5) Example target: a simple blog index with pagination
Assume a site like:
- index page: https://example.com/blog
- page 2: https://example.com/blog?page=2
- each post card has: title link, author, date
Your parser should be:
- specific enough to avoid false positives
- flexible enough to survive small changes
```python
from urllib.parse import urljoin


def parse_index(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    for card in soup.select("article"):
        a = card.select_one("h2 a") or card.select_one("a")
        if not a:
            continue
        title = a.get_text(" ", strip=True)
        href = a.get("href")
        url = urljoin(base_url, href) if href else None
        if not title or not url:
            continue
        author = first_text(card, [".author", "[rel='author']"])  # example fallbacks
        date = first_text(card, ["time", "[data-testid='date']"])  # example fallbacks
        out.append({"title": title, "url": url, "author": author, "date": date})
    return out
```
6) Pagination: crawl N pages safely
Key rules:
- crawl a fixed max pages (don’t “while True” without a stop)
- dedupe by a stable key (URL, id)
- sleep a bit between pages
```python
import time


def crawl(base_url: str, pages: int = 5, proxiesapi_key: str | None = None) -> list[dict]:
    all_items = []
    seen = set()
    for p in range(1, pages + 1):
        url = base_url if p == 1 else f"{base_url}?page={p}"
        html = fetch(url, proxiesapi_key=proxiesapi_key)
        items = parse_index(html, base_url=base_url)
        for it in items:
            key = it.get("url")
            if not key or key in seen:
                continue
            seen.add(key)
            all_items.append(it)
        print(f"page {p}/{pages}: {len(items)} items (total {len(all_items)})")
        time.sleep(1.0)
    return all_items
```
7) Validate, then export
Validation is underrated. At minimum, check:
- required fields are present
- numeric fields parse
- URLs look like URLs
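Those checks can be a short predicate. A minimal sketch (the field names `title` and `url` match the items built above; the sample records are made up):

```python
from urllib.parse import urlparse


def is_valid(item: dict) -> bool:
    # Required fields must be present and non-empty.
    if not item.get("title") or not item.get("url"):
        return False
    # URLs should at least have an http(s) scheme and a host.
    parts = urlparse(item["url"])
    return parts.scheme in ("http", "https") and bool(parts.netloc)


items = [
    {"title": "Post A", "url": "https://example.com/a"},
    {"title": "", "url": "https://example.com/b"},
    {"title": "Post C", "url": "not-a-url"},
]
print([it["title"] for it in items if is_valid(it)])  # ['Post A']
```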
Export JSON + CSV:
```python
import csv
import json


def export(items: list[dict], name: str = "scrape"):
    with open(f"{name}.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    if items:
        with open(f"{name}.csv", "w", encoding="utf-8", newline="") as f:
            w = csv.DictWriter(f, fieldnames=list(items[0].keys()))
            w.writeheader()
            for it in items:
                w.writerow(it)
    print("wrote", f"{name}.json", "rows", len(items))
```
8) A reusable “production template”
This is the pattern you can reuse:
- fetch() (timeouts, headers, retries, optional ProxiesAPI)
- parse_*() functions per page type
- crawl() that orchestrates and dedupes
- export()
Put it together:
```python
def main():
    start_url = "https://example.com/blog"
    proxiesapi_key = None  # "YOUR_KEY"
    items = crawl(start_url, pages=3, proxiesapi_key=proxiesapi_key)
    # basic validation
    items = [it for it in items if it.get("title") and it.get("url")]
    export(items, name="blog_posts")


if __name__ == "__main__":
    main()
```
9) Common failure modes (and fixes)
- Empty fields: selector mismatch → inspect HTML and update selectors.
- Different HTML per request: missing headers/cookies → set headers, keep a session.
- Random 403/429: throttling → add backoff, reduce rate, consider proxy-backed fetch.
- Broken pagination: you assumed ?page= but it’s ?p= or start= → confirm by clicking “Next”.
10) Where ProxiesAPI fits
When you’re scraping at small scale (a few pages), you might not need any proxying.
When you scale up:
- more URLs
- more repeat runs
- more failures from IP-based throttling
…ProxiesAPI gives you a simple proxy-backed fetch URL while keeping your scraper code the same:
```shell
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
```
Combine that with timeouts + retries + polite pagination and your success rate typically improves.