Python BeautifulSoup Tutorial: Scraping Your First Website (2026)

BeautifulSoup is the fastest way to go from:

“I need data from a website”

to:

“I have clean rows in a CSV.”

This tutorial is designed for beginners, but it builds the scraper the way you would if you expected it to grow:

  • real timeouts (no hanging forever)
  • a Session (connection reuse)
  • predictable selectors
  • pagination loops
  • export to CSV

We’ll scrape a simple target: the Hacker News front page, because it’s server-rendered HTML and has clean pagination.

When your first scraper grows up, ProxiesAPI helps

Your first BeautifulSoup scraper usually works… until it doesn’t. As you crawl more pages, stability becomes the problem (timeouts, blocks, flaky HTML). ProxiesAPI belongs in your fetch layer so your parsing code stays simple.


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML with requests (the right way)

import requests

BASE = "https://news.ycombinator.com"
TIMEOUT = (10, 30)

session = requests.Session()
session.headers.update(
    {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def fetch(path: str) -> str:
    url = path if path.startswith("http") else f"{BASE}{path}"
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text

Why this matters:

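  • timeouts prevent a single stuck request from freezing your script; the tuple (10, 30) allows 10 seconds to connect and 30 seconds of waiting on the server while it sends the response
  • Session() reuses TCP connections (faster + friendlier)
  • an explicit User-Agent (instead of the default python-requests one) reduces “bot-ish” responses on many sites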

Step 2: Parse one page with BeautifulSoup selectors

Each HN story spans two table rows:

  • tr.athing (title row)
  • followed by the next tr containing td.subtext (metadata)

import re
from bs4 import BeautifulSoup


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def parse_front_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    stories = []

    for row in soup.select("tr.athing"):
        story_id = row.get("id")

        title_a = row.select_one("span.titleline > a")
        title = title_a.get_text(strip=True) if title_a else None
        href = title_a.get("href") if title_a else None

        subtext_row = row.find_next_sibling("tr")
        subtext = subtext_row.select_one("td.subtext") if subtext_row else None

        points = None
        author = None
        age = None
        comments = None

        if subtext:
            score = subtext.select_one("span.score")
            points = parse_int(score.get_text(" ", strip=True) if score else "")

            user = subtext.select_one("a.hnuser")
            author = user.get_text(strip=True) if user else None

            age_a = subtext.select_one("span.age a")
            age = age_a.get_text(strip=True) if age_a else None

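            # The comments link is the last <a>; its text is "N comments",
            # or "discuss" when there are none yet.
            links = subtext.select("a")
            if links:
                last_text = links[-1].get_text(" ", strip=True)
                if "comment" in last_text:
                    comments = parse_int(last_text)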

        stories.append(
            {
                "id": story_id,
                "title": title,
                "url": href,
                "points": points,
                "author": author,
                "age": age,
                "comments": comments,
            }
        )

    return stories

Sanity check:

stories = parse_front_page(fetch("/"))
print("stories:", len(stories))
print(stories[0])
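
The check above needs the network. For a repeatable test, the same row-pairing idea can be run against a small inline fixture. This is only a sketch under assumptions: the HTML is a hand-written imitation of HN's markup (not a captured response), pair_rows is a hypothetical helper, and the built-in html.parser is used so it runs without lxml. It also resolves relative links with urljoin, assuming that some stories (such as self-posts) use hrefs like item?id=….

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://news.ycombinator.com/"

# Hand-written fixture imitating the HN row structure (an assumption,
# not a real response): a title row followed by a metadata row.
FIXTURE = """
<table>
  <tr class="athing" id="101">
    <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td>
  </tr>
  <tr><td class="subtext"><span class="score">42 points</span></td></tr>
  <tr class="athing" id="102">
    <td><span class="titleline"><a href="item?id=102">Second story</a></span></td>
  </tr>
  <tr><td class="subtext"></td></tr>
</table>
"""


def pair_rows(html: str) -> list[tuple]:
    """Return (id, title, absolute_url, score_text) per story."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.select("tr.athing"):
        title_a = row.select_one("span.titleline > a")
        # The metadata lives in the *next* <tr>, not inside the title row.
        meta_row = row.find_next_sibling("tr")
        score = meta_row.select_one("span.score") if meta_row else None
        pairs.append(
            (
                row.get("id"),
                title_a.get_text(strip=True) if title_a else "",
                # urljoin leaves absolute URLs alone and resolves relative ones.
                urljoin(BASE, title_a.get("href", "")) if title_a else None,
                score.get_text(strip=True) if score else None,
            )
        )
    return pairs


pairs = pair_rows(FIXTURE)
print(pairs)
```

A fixture like this catches selector regressions without hitting the site. It cannot tell you when the real markup changes, so it complements the live check rather than replacing it.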

Step 3: Pagination (crawl N pages)

HN pagination is explicit:

  • page 1: /
  • page N: /?p=N

def crawl_front_pages(pages: int = 3) -> list[dict]:
    all_stories = []
    seen = set()

    for p in range(1, pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        html = fetch(path)
        batch = parse_front_page(html)

        for s in batch:
            if s["id"] in seen:
                continue
            seen.add(s["id"])
            s["page"] = p
            all_stories.append(s)

    return all_stories


all_stories = crawl_front_pages(pages=3)
print("total:", len(all_stories))
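
The loop above fetches pages back to back and always requests all N pages. One possible variation, sketched below, pauses between requests and stops early when a page adds nothing new. The names crawl_pages, fetch_page, parse_page and delay_s are hypothetical, and the fetch and parse steps are passed in as callables so the sketch can run offline with fake data.

```python
import time
from typing import Callable


def crawl_pages(
    fetch_page: Callable[[int], str],
    parse_page: Callable[[str], list],
    pages: int = 3,
    delay_s: float = 1.0,
) -> list:
    """Crawl up to `pages` pages, pausing between requests.

    Stops early when a page contributes no new ids.
    """
    rows = []
    seen = set()

    for p in range(1, pages + 1):
        batch = parse_page(fetch_page(p))
        added = 0
        for r in batch:
            if r["id"] in seen:
                continue  # skip stories already collected from an earlier page
            seen.add(r["id"])
            rows.append({**r, "page": p})
            added += 1
        if added == 0:
            break  # empty or fully duplicated page: nothing more to collect
        if p < pages:
            time.sleep(delay_s)  # be polite between page fetches

    return rows


# Fake fetch/parse pair so the sketch runs without the network.
FAKE_PAGES = {1: "a,b", 2: "b,c", 3: ""}
demo = crawl_pages(
    fetch_page=lambda p: FAKE_PAGES.get(p, ""),
    parse_page=lambda html: [{"id": x} for x in html.split(",") if x],
    pages=5,
    delay_s=0,
)
print(demo)
```

With the tutorial's own functions, the call would look like crawl_pages(lambda p: fetch("/" if p == 1 else f"/?p={p}"), parse_front_page). The one-second default delay is an arbitrary starting point, not a documented limit.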

Step 4: Export to CSV

import csv


def write_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        raise ValueError("no rows to write")
    fieldnames = list(rows[0].keys())

    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)


write_csv("hn_stories.csv", all_stories)
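
One thing to keep in mind when you load the file again: CSV has no types. The sketch below writes two made-up rows to an in-memory buffer (instead of a file) and reads them back with csv.DictReader to show that numbers return as strings and None returns as an empty string. The to_int helper is a hypothetical name for illustration.

```python
import csv
import io

rows = [
    {"id": "101", "title": "First story", "points": 42},
    {"id": "102", "title": "Job post", "points": None},
]

# Write to an in-memory buffer so the sketch has no side effects on disk.
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a string, and None became "".
buf.seek(0)
loaded = list(csv.DictReader(buf))


def to_int(value: str):
    """Convert a CSV cell back to int, treating empty cells as missing."""
    return int(value) if value else None


points = [to_int(r["points"]) for r in loaded]
print(loaded[1]["points"] == "", points)
```

If downstream code needs real integers or missing values, convert explicitly after reading, as to_int does here.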

Common BeautifulSoup mistakes (and how to avoid them)

1) Parsing without understanding the HTML

Don’t guess selectors. Use DevTools first, then implement the selectors in code.

2) Regex-parsing HTML

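HTML is not a regular language, so regexes break on nesting and attribute variations. Use BeautifulSoup (or lxml/XPath) to find elements, and keep regex for the text inside them, as parse_int does above.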

3) Ignoring encoding issues

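If you see broken characters, the bytes were probably decoded with the wrong charset. requests takes the encoding from the Content-Type header and falls back to ISO-8859-1 for text responses that don’t declare one. Passing r.content (bytes) to BeautifulSoup lets it detect the charset from the document instead. On the output side, write UTF-8, as write_csv does.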
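
To recognize the symptom, it helps to reproduce it once. This small sketch uses only string methods, no HTTP: it encodes a word as UTF-8 and decodes it with the wrong charset, which is the kind of mismatch that produces garbled accents in scraped text.

```python
# UTF-8 bytes for a word with one accented character.
raw = "café".encode("utf-8")

# Decoding with the wrong charset turns the two-byte "é" into two characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the right charset recovers the original text.
right = raw.decode("utf-8")

# When only the charset label was wrong, the damage is reversible:
# re-encode with the wrong charset, then decode with the correct one.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(repr(wrong), repr(right), repr(repaired))
```

If your output contains pairs like “Ã©” where an accented letter should be, suspect this mismatch before suspecting the parser.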

4) No timeouts

This is the #1 “my scraper hangs sometimes” issue. requests has no default timeout, so a call without one can wait indefinitely on a stalled server.
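
A timeout turns a hang into an exception, and an exception is something you can retry. The sketch below shows one way to do that with exponential backoff. The helper with_retries and its parameters are hypothetical, not part of requests, and the demo uses a fake fetch function that fails twice so it runs without the network.

```python
import time


def with_retries(func, attempts: int = 3, base_delay_s: float = 0.5,
                 retry_on: tuple = (Exception,)):
    """Call func(); on a listed exception, wait and try again.

    The delay doubles after each failure (0.5 s, 1 s, 2 s, ...).
    The last failure is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))


# Offline demo: a fake fetch that times out twice, then succeeds.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


result = with_retries(
    flaky_fetch, attempts=3, base_delay_s=0, retry_on=(TimeoutError,)
)
print(result, calls["n"])
```

With the tutorial's fetch, the call could look like with_retries(lambda: fetch("/"), retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError)). Retrying only on those two avoids hammering a site that is answering with a deliberate 4xx.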


Where ProxiesAPI fits (and why it’s not magical)

When you scrape a “friendly” target, direct requests can be fine.

When you scale, the hard problems show up:

  • timeouts
  • connection resets
  • intermittent blocks
  • inconsistent HTML due to bot checks

ProxiesAPI works as a wrapper URL: you pass your API key and the target URL as query parameters.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://news.ycombinator.com/" | head

In Python, you wrap the URL before fetching:

from urllib.parse import urlencode


def proxiesapi_wrap(target_url: str, api_key: str) -> str:
    base = "http://api.proxiesapi.com/"
    return base + "?" + urlencode({"key": api_key, "url": target_url})


API_KEY = "API_KEY"  # placeholder: substitute your own key
wrapped = proxiesapi_wrap("https://news.ycombinator.com/", API_KEY)
html = fetch(wrapped)
stories = parse_front_page(html)

The honest benefit: your parsing code doesn’t change. You’re simply making the network layer more resilient.
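
One detail worth seeing in action is why proxiesapi_wrap uses urlencode rather than plain string concatenation. The sketch below wraps a target URL that has its own query string (the x=1 parameter is made up for illustration) and compares both approaches. It only demonstrates URL mechanics and assumes nothing about how the service handles requests.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def proxiesapi_wrap(target_url: str, api_key: str) -> str:
    base = "http://api.proxiesapi.com/"
    return base + "?" + urlencode({"key": api_key, "url": target_url})


# A target URL that carries its own query string.
target = "https://news.ycombinator.com/?p=2&x=1"

# Encoded: the whole target travels inside the single "url" parameter.
wrapped = proxiesapi_wrap(target, "API_KEY")
encoded_params = parse_qs(urlsplit(wrapped).query)

# Naive concatenation: the target's "&x=1" leaks out as a separate
# top-level parameter, so the "url" value is silently truncated.
naive = "http://api.proxiesapi.com/?key=API_KEY&url=" + target
naive_params = parse_qs(urlsplit(naive).query)

print(wrapped)
print(encoded_params["url"], naive_params["url"])
```

The curl one-liner above gets away without encoding because its target URL has no query string of its own. As soon as you paginate with ?p=N and add further parameters, encoding matters.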
