Python BeautifulSoup Tutorial: Scraping Your First Website (2026)

May 20, 2026 · tutorial · #python beautifulsoup tutorial, #python, #beautifulsoup, #requests, #web-scraping, #pagination, #csv, #proxies

BeautifulSoup is the fastest way to go from:

“I need data from a website”

to:

“I have clean rows in a CSV.”

This tutorial is aimed at beginners, but it builds the scraper the way you would build one that has to grow:

real timeouts (no hanging forever)
a Session (connection reuse)
predictable selectors
pagination loops
export to CSV

We’ll scrape a simple target: the Hacker News front page, because it’s server-rendered HTML and has clean pagination.

When your first scraper grows up, ProxiesAPI helps

Your first BeautifulSoup scraper usually works… until it doesn’t. As you crawl more pages, stability becomes the problem (timeouts, blocks, flaky HTML). ProxiesAPI belongs in your fetch layer so your parsing code stays simple.


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML with requests (the right way)

import requests

BASE = "https://news.ycombinator.com"
TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

session = requests.Session()
session.headers.update(
    {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def fetch(path: str) -> str:
    url = path if path.startswith("http") else f"{BASE}{path}"
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text

Why this matters:

timeouts prevent a single stuck request from freezing your script
Session() reuses TCP connections (faster + friendlier)
an explicit User-Agent (instead of the default python-requests one) reduces “bot-ish” responses on many sites

Step 2: Parse one page with BeautifulSoup selectors

HN story rows are:

tr.athing (title row)
followed by the next tr containing td.subtext (metadata)
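
To see the two-row layout in isolation, here is a minimal sketch run against a hand-written fragment. The HTML below is a simplified stand-in for HN's markup, not a copy of the live page, and it uses Python's built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for one HN story (not the real page source).
SAMPLE = """
<table>
  <tr class="athing" id="101">
    <td><span class="titleline"><a href="https://example.com/a">Example story</a></span></td>
  </tr>
  <tr>
    <td class="subtext"><span class="score">42 points</span> by <a class="hnuser">alice</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

title_row = soup.select_one("tr.athing")
# The metadata lives in the *next* <tr>, not inside the title row itself.
meta_row = title_row.find_next_sibling("tr")

story_id = title_row.get("id")
title = title_row.select_one("span.titleline > a").get_text(strip=True)
score_text = meta_row.select_one("td.subtext span.score").get_text(strip=True)

print(story_id, title, score_text)
```

The point of the sketch is the hop from the title row to its sibling: selecting td.subtext inside tr.athing would find nothing.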

import re
from bs4 import BeautifulSoup


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def parse_front_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    stories = []

    for row in soup.select("tr.athing"):
        story_id = row.get("id")

        title_a = row.select_one("span.titleline > a")
        title = title_a.get_text(strip=True) if title_a else None
        href = title_a.get("href") if title_a else None

        subtext_row = row.find_next_sibling("tr")
        subtext = subtext_row.select_one("td.subtext") if subtext_row else None

        points = None
        author = None
        age = None
        comments = None

        if subtext:
            score = subtext.select_one("span.score")
            points = parse_int(score.get_text(" ", strip=True) if score else "")

            user = subtext.select_one("a.hnuser")
            author = user.get_text(strip=True) if user else None

            age_a = subtext.select_one("span.age a")
            age = age_a.get_text(strip=True) if age_a else None

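            # The comments link is normally the last <a>. Only parse it when it
            # mentions comments; job posts end with the age link ("3 hours ago").
            links = subtext.select("a")
            last_text = links[-1].get_text(" ", strip=True) if links else ""
            if "comment" in last_text:
                comments = parse_int(last_text)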

        stories.append(
            {
                "id": story_id,
                "title": title,
                "url": href,
                "points": points,
                "author": author,
                "age": age,
                "comments": comments,
            }
        )

    return stories

Sanity check:

stories = parse_front_page(fetch("/"))
print("stories:", len(stories))
print(stories[0])
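
The numeric fields depend on the small digit-extracting helper, so it is worth checking it offline too. The sketch below is a standalone copy fed with made-up strings shaped like HN's labels. Note that inside a raw string the pattern needs a single backslash (`\d+`); a doubled backslash would look for a literal backslash followed by `d` and never match.

```python
import re
from typing import Optional


def parse_int(text: Optional[str]) -> Optional[int]:
    # Return the first run of digits as an int, or None when there is none.
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


print(parse_int("128 points"))      # 128
print(parse_int("57\xa0comments"))  # 57 (a non-breaking space is fine)
print(parse_int("discuss"))         # None
print(parse_int(None))              # None
```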

Step 3: Pagination (crawl N pages)

HN pagination is explicit:

page 1: /
page N: /?p=N

def crawl_front_pages(pages: int = 3) -> list[dict]:
    all_stories = []
    seen = set()

    for p in range(1, pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        html = fetch(path)
        batch = parse_front_page(html)

        for s in batch:
            if s["id"] in seen:
                continue
            seen.add(s["id"])
            s["page"] = p
            all_stories.append(s)

    return all_stories


all_stories = crawl_front_pages(pages=3)
print("total:", len(all_stories))
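
If you do not know the page count in advance, one option is to stop as soon as a page adds nothing new. The sketch below takes the fetch and parse functions as parameters so it can be tried offline. `crawl_until_empty`, `max_pages`, and `delay` are hypothetical names, and the stop condition is an assumption about how the target behaves past its last page.

```python
import time
from typing import Callable


def crawl_until_empty(
    fetch_page: Callable[[str], str],
    parse_page: Callable[[str], list],
    max_pages: int = 10,
    delay: float = 1.0,
) -> list:
    """Crawl /, /?p=2, ... until a page adds no new stories or max_pages is hit."""
    all_stories = []
    seen = set()

    for p in range(1, max_pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        batch = parse_page(fetch_page(path))

        added = 0
        for s in batch:
            if s["id"] in seen:
                continue
            seen.add(s["id"])
            all_stories.append({**s, "page": p})
            added += 1

        if added == 0:
            break  # empty page or only repeats: assume we ran out of content

        time.sleep(delay)  # small pause between requests to stay polite

    return all_stories


# Offline demo with fake pages: the "HTML" is just the path string.
FAKE = {
    "/": [{"id": "1"}, {"id": "2"}],
    "/?p=2": [{"id": "2"}, {"id": "3"}],
    "/?p=3": [],
}
rows = crawl_until_empty(lambda path: path, lambda html: FAKE.get(html, []), delay=0)
print([(r["id"], r["page"]) for r in rows])
```

In a real run you would pass fetch and parse_front_page in place of the two lambdas and keep a non-zero delay.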

Step 4: Export to CSV

import csv


def write_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        raise ValueError("no rows to write")
    fieldnames = list(rows[0].keys())

    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)


write_csv("hn_stories.csv", all_stories)
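
write_csv takes its header from the first row, which works here because every story dict has the same keys. If rows can differ (say, a field that only appears on some pages), csv.DictWriter raises ValueError on keys it was not told about. A minimal sketch of one way around that, using a hypothetical helper named `collect_fieldnames` and made-up rows:

```python
import csv
import io


def collect_fieldnames(rows):
    """Union of keys across all rows, in first-seen order."""
    names = {}
    for row in rows:
        for key in row:
            names.setdefault(key, None)  # dicts preserve insertion order
    return list(names)


rows = [
    {"id": "1", "title": "First"},
    {"id": "2", "title": "Second", "page": 2},  # extra key on a later row
]

buf = io.StringIO()  # in-memory file so the example leaves nothing on disk
writer = csv.DictWriter(buf, fieldnames=collect_fieldnames(rows), restval="")
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())
```

Rows that lack a column get the restval placeholder (an empty string here) instead of failing.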

Common BeautifulSoup mistakes (and how to avoid them)

1) Parsing without understanding the HTML

Don’t guess selectors. Use DevTools first, then implement the selectors in code.

2) Regex-parsing HTML

HTML is not a regular language. Use BeautifulSoup (or lxml/XPath) for structure.

3) Ignoring encoding issues

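If you see broken characters, the response was probably decoded with the wrong charset. Check r.encoding, or pass the raw bytes (r.content) to BeautifulSoup so it can detect the encoding from the document itself, and always read and write your output files as UTF-8.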
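
A small illustration of what a wrong codec does to text, using only the standard library (the sample string is made up):

```python
# UTF-8 bytes for a string containing a non-ASCII character.
raw = "café".encode("utf-8")

wrong = raw.decode("latin-1")  # wrong codec: each UTF-8 byte becomes its own character
right = raw.decode("utf-8")    # correct codec restores the original text

print(wrong)  # cafÃ©
print(right)  # café
```

If your output looks like the first line, the bytes are fine and only the decoding step needs fixing.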

4) No timeouts

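requests applies no timeout by default, so a stalled server can block your script indefinitely. This is the most common cause of “my scraper hangs sometimes.”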

Where ProxiesAPI fits (and why it’s not magical)

When you scrape a “friendly” target, direct requests can be fine.

When you scale, the hard problems show up:

timeouts
connection resets
intermittent blocks
inconsistent HTML due to bot checks
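
Independently of any proxy layer, transient failures like these are commonly handled by retrying with an increasing delay. The sketch below is one generic way to do that, not part of the scraper above. `with_retries`, `attempts`, `base_delay`, and `retry_on` are hypothetical names, and which exceptions count as retryable is an assumption you would adapt (with requests, requests.exceptions.Timeout and requests.exceptions.ConnectionError are typical candidates).

```python
import time


def with_retries(func, *args, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call func(*args); on a retryable error wait base_delay * 2**n and try again."""
    for n in range(attempts):
        try:
            return func(*args)
        except retry_on:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2**n)


# Offline demo: a fake fetch that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(path):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return f"<html>{path}</html>"


html = with_retries(flaky_fetch, "/", base_delay=0, retry_on=(TimeoutError,))
print(calls["count"], html)
```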

ProxiesAPI is a wrapper URL:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://news.ycombinator.com/" | head

In Python, you wrap the URL before fetching:

from urllib.parse import urlencode


def proxiesapi_wrap(target_url: str, api_key: str) -> str:
    base = "http://api.proxiesapi.com/"
    return base + "?" + urlencode({"key": api_key, "url": target_url})


API_KEY = "API_KEY"
wrapped = proxiesapi_wrap("https://news.ycombinator.com/", API_KEY)
html = fetch(wrapped)
stories = parse_front_page(html)

The honest benefit: your parsing code doesn’t change. You’re simply making the network layer more resilient.
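
One detail worth seeing explicitly: urlencode percent-encodes the target URL, which matters once the target carries its own query string (otherwise its `?` and `=` would be read as part of the wrapper URL). A sketch using the same wrapper shape as above, with a placeholder key:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

target = "https://news.ycombinator.com/?p=2"
wrapped = "http://api.proxiesapi.com/?" + urlencode({"key": "API_KEY", "url": target})
print(wrapped)
# http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fnews.ycombinator.com%2F%3Fp%3D2

# Decoding the query string recovers the original target URL intact.
params = parse_qs(urlsplit(wrapped).query)
print(params["url"][0] == target)  # True
```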
