Python BeautifulSoup Tutorial: Scraping Your First Website (2026)

BeautifulSoup is one of the quickest ways to go from:

“I need data from a website”

to:

“I have clean rows in a CSV.”

This tutorial is designed for beginners, but it’s written the way you’d build a scraper you can grow:

  • real timeouts (no hanging forever)
  • a Session (connection reuse)
  • predictable selectors
  • pagination loops
  • export to CSV

We’ll scrape a simple target: the Hacker News front page, because it’s server-rendered HTML and has clean pagination.

When your first scraper grows up, ProxiesAPI helps

Your first BeautifulSoup scraper usually works… until it doesn’t. As you crawl more pages, stability becomes the problem (timeouts, blocks, flaky HTML). ProxiesAPI belongs in your fetch layer so your parsing code stays simple.


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML with requests (the right way)

import requests

BASE = "https://news.ycombinator.com"
TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

session = requests.Session()
session.headers.update(
    {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def fetch(path: str) -> str:
    url = path if path.startswith("http") else f"{BASE}{path}"
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text

Why this matters:

  • timeouts prevent a single stuck request from freezing your script
  • Session() reuses TCP connections (faster + friendlier)
  • an explicit User-Agent (instead of the default python-requests one) reduces “bot-ish” responses on many sites
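
A timeout turns a stuck request into an exception, and the usual next step is to retry a few times before giving up. Here is a minimal sketch of that idea. The with_retries helper and its parameters are hypothetical names introduced for illustration, not part of requests:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be >= 1")
```

With the fetch function above, a call could look like with_retries(lambda: fetch("/"), retry_on=(requests.Timeout, requests.ConnectionError)). The sleep parameter exists only so the delay can be swapped out in tests.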

Step 2: Parse one page with BeautifulSoup selectors

HN story rows are:

  • tr.athing (title row)
  • followed by the next tr containing td.subtext (metadata)

import re
from bs4 import BeautifulSoup


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def parse_front_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    stories = []

    for row in soup.select("tr.athing"):
        story_id = row.get("id")

        title_a = row.select_one("span.titleline > a")
        title = title_a.get_text(strip=True) if title_a else None
        href = title_a.get("href") if title_a else None

        subtext_row = row.find_next_sibling("tr")
        subtext = subtext_row.select_one("td.subtext") if subtext_row else None

        points = None
        author = None
        age = None
        comments = None

        if subtext:
            score = subtext.select_one("span.score")
            points = parse_int(score.get_text(" ", strip=True) if score else "")

            user = subtext.select_one("a.hnuser")
            author = user.get_text(strip=True) if user else None

            age_a = subtext.select_one("span.age a")
            age = age_a.get_text(strip=True) if age_a else None

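            links = subtext.select("a")
            # The comments link comes last. Rows without one (e.g. job posts)
            # end with the age link, so only parse text that mentions comments.
            if links and "comment" in links[-1].get_text():
                comments = parse_int(links[-1].get_text(" ", strip=True))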

        stories.append(
            {
                "id": story_id,
                "title": title,
                "url": href,
                "points": points,
                "author": author,
                "age": age,
                "comments": comments,
            }
        )

    return stories

Sanity check:

stories = parse_front_page(fetch("/"))
print("stories:", len(stories))
print(stories[0])
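
You can also check the selectors without touching the network. The sketch below runs the same title-row and sibling-row lookups against a small hand-written fragment. The markup is a hypothetical imitation of the HN layout, not a copy of the live page, and it uses the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written fragment that mimics the HN row layout.
SAMPLE = """
<table>
  <tr class="athing" id="101">
    <td><span class="titleline"><a href="https://example.com/a">Example story</a></span></td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score">42 points</span> by
      <a class="hnuser" href="user?id=alice">alice</a>
      <span class="age"><a href="item?id=101">2 hours ago</a></span> |
      <a href="item?id=101">17 comments</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Title row, then the metadata cell in the row that follows it.
row = soup.select_one("tr.athing")
title_a = row.select_one("span.titleline > a")
meta = row.find_next_sibling("tr").select_one("td.subtext")

sample_story = {
    "id": row.get("id"),
    "title": title_a.get_text(strip=True),
    "url": title_a.get("href"),
    "score_text": meta.select_one("span.score").get_text(strip=True),
    "author": meta.select_one("a.hnuser").get_text(strip=True),
}
print(sample_story)
```

If the live site changes its markup, a fixture like this keeps passing while the real scrape returns empty fields, which tells you the selectors are stale rather than the parsing logic.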

Step 3: Pagination (crawl N pages)

HN pagination is explicit:

  • page 1: /
  • page N: /?p=N

def crawl_front_pages(pages: int = 3) -> list[dict]:
    all_stories = []
    seen = set()

    for p in range(1, pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        html = fetch(path)
        batch = parse_front_page(html)

        for s in batch:
            if s["id"] in seen:
                continue
            seen.add(s["id"])
            s["page"] = p
            all_stories.append(s)

    return all_stories


all_stories = crawl_front_pages(pages=3)
print("total:", len(all_stories))
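
A variation worth considering once you crawl more than a few pages: pause between requests and stop as soon as a page adds nothing new. This is a sketch under assumptions. crawl_until_empty is a hypothetical name, and the fetch and parse functions are passed in as arguments so the loop can be exercised without the network:

```python
import time
from typing import Callable


def crawl_until_empty(
    fetch_page: Callable[[str], str],
    parse_page: Callable[[str], list[dict]],
    max_pages: int = 3,
    delay: float = 1.0,
) -> list[dict]:
    """Crawl / and /?p=N, stopping early when a page adds no new ids."""
    all_rows: list[dict] = []
    seen: set = set()

    for p in range(1, max_pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        added = 0
        for row in parse_page(fetch_page(path)):
            if row["id"] in seen:
                continue  # already collected on an earlier page
            seen.add(row["id"])
            all_rows.append({**row, "page": p})
            added += 1
        if added == 0:
            break  # empty or fully duplicated page: assume we ran out
        if p < max_pages:
            time.sleep(delay)  # pause between requests
    return all_rows
```

With the functions from this tutorial, the call would be crawl_until_empty(fetch, parse_front_page, max_pages=10).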

Step 4: Export to CSV

import csv


def write_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        raise ValueError("no rows to write")
    fieldnames = list(rows[0].keys())

    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)


write_csv("hn_stories.csv", all_stories)
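
To confirm the export, you can read the file back with csv.DictReader. The read_csv helper below is a hypothetical name added for illustration:

```python
import csv


def read_csv(path: str) -> list[dict]:
    """Load a CSV written by csv.DictWriter back into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Keep in mind that CSV has no types: every value comes back as a string, and a None written by DictWriter comes back as an empty string, so convert points and comments back to int if you need numbers.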

Common BeautifulSoup mistakes (and how to avoid them)

1) Parsing without understanding the HTML

Don’t guess selectors. Use DevTools first, then implement the selectors in code.

2) Regex-parsing HTML

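HTML is not a regular language, so regexes break on nesting and reordered attributes. Use BeautifulSoup (or lxml/XPath) to find elements, and keep regex for the text you pull out of them, as parse_int does.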

3) Ignoring encoding issues

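If you see broken characters, the page was probably decoded with the wrong charset. Pass the raw bytes (r.content) to BeautifulSoup so it can detect the encoding from the document itself, and read/write your own files as UTF-8.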
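
A small offline illustration of the encoding problem, assuming a hypothetical page that is served as ISO-8859-1 and says so in a meta tag:

```python
from bs4 import BeautifulSoup

# Hypothetical page encoded as ISO-8859-1 and declaring that in <meta>.
raw = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>Café</p></body></html>"
).encode("iso-8859-1")

# Decoding the bytes yourself with the wrong charset mangles the text.
mangled = raw.decode("utf-8", errors="replace")

# Handing BeautifulSoup the bytes lets it use the declared charset.
soup = BeautifulSoup(raw, "html.parser")
text = soup.p.get_text()

print("\ufffd" in mangled)  # replacement character present in the bad decode
print(text)
```

With requests, the equivalent is BeautifulSoup(r.content, "lxml") instead of BeautifulSoup(r.text, "lxml"). If the page declares nothing and detection guesses wrong, BeautifulSoup also accepts a from_encoding argument.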

4) No timeouts

requests has no default timeout, so a stalled connection can block forever. This is a common cause of the “my scraper hangs sometimes” issue.


Where ProxiesAPI fits (and why it’s not magical)

When you scrape a “friendly” target, direct requests can be fine.

When you scale, the hard problems show up:

  • timeouts
  • connection resets
  • intermittent blocks
  • inconsistent HTML due to bot checks

ProxiesAPI works as a wrapper URL: you pass the target URL as a query parameter.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://news.ycombinator.com/" | head

In Python, you wrap the URL before fetching:

from urllib.parse import urlencode


def proxiesapi_wrap(target_url: str, api_key: str) -> str:
    base = "http://api.proxiesapi.com/"
    return base + "?" + urlencode({"key": api_key, "url": target_url})


API_KEY = "API_KEY"
wrapped = proxiesapi_wrap("https://news.ycombinator.com/", API_KEY)
html = fetch(wrapped)
stories = parse_front_page(html)

The honest benefit: your parsing code doesn’t change. You’re simply making the network layer more resilient.
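
One detail that matters as soon as you paginate: the target URL can carry its own query string, so it has to be percent-encoded inside the wrapper URL. The sketch below shows what urlencode does in that case. The endpoint and the key/url parameter names are taken from the curl example above and are not independently verified here:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def proxiesapi_wrap(target_url: str, api_key: str) -> str:
    # Endpoint and parameter names as shown in the curl example above.
    base = "http://api.proxiesapi.com/"
    return base + "?" + urlencode({"key": api_key, "url": target_url})


# A target URL with its own query string (page 2 of the front page).
wrapped = proxiesapi_wrap("https://news.ycombinator.com/?p=2", "API_KEY")
print(wrapped)

# Round trip: the wrapper's query still holds the full target URL.
params = parse_qs(urlsplit(wrapped).query)
print(params["url"][0])
```

Without the encoding, the target's ?p=2 would run into the wrapper's own query string, and anything after a second & would be read as a parameter of the wrapper rather than of the target.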

