Python BeautifulSoup Tutorial: Scraping Your First Website (2026)
BeautifulSoup is one of the quickest ways to go from:
“I need data from a website”
to:
“I have clean rows in a CSV.”
This tutorial is designed for beginners, but it’s written the way you’d build a scraper you can grow:
- real timeouts (no hanging forever)
- a Session (connection reuse)
- predictable selectors
- pagination loops
- export to CSV
We’ll scrape a simple target: the Hacker News front page, because it’s server-rendered HTML and has clean pagination.
Your first BeautifulSoup scraper usually works… until it doesn’t. As you crawl more pages, stability becomes the problem (timeouts, blocks, flaky HTML). ProxiesAPI belongs in your fetch layer so your parsing code stays simple.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
Step 1: Fetch HTML with requests (the right way)
import requests

BASE = "https://news.ycombinator.com"
TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

session = requests.Session()
session.headers.update(
    {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def fetch(path: str) -> str:
    # Accept either an absolute URL or a path relative to BASE.
    url = path if path.startswith("http") else f"{BASE}{path}"
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return r.text
Why this matters:
- timeouts prevent a single stuck request from freezing your script
- Session() reuses TCP connections (faster + friendlier)
- a real User-Agent reduces “bot-ish” responses on many sites
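A timeout only caps how long one attempt can take; it does not try again. One common way to handle a transient failure is a small retry wrapper with exponential backoff. The sketch below is an assumption added for illustration, not part of the scraper above: `with_retries` and its parameters are hypothetical names, and it accepts any zero-argument callable so it can be tried without touching the network.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func() up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x ... base_delay
    raise AssertionError("unreachable")
```

With the fetch() from this step, a call might look like with_retries(lambda: fetch("/"), retry_on=(requests.RequestException,)), so that only network-level errors are retried.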
Step 2: Parse one page with BeautifulSoup selectors
HN story rows are:
- tr.athing (title row)
- followed by the next tr containing td.subtext (metadata)
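To see that two-row pattern in isolation, here is a minimal sketch that runs against a hand-written HTML snippet. The snippet is an assumption that only mimics the structure described above (it is not copied from the live site), and it uses Python's built-in "html.parser" so it runs even without lxml installed.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the row structure described above.
# It is an illustration, not a copy of the live HN markup.
SAMPLE = """
<table>
  <tr class="athing" id="101">
    <td><span class="titleline"><a href="https://example.com/post">Example post</a></span></td>
  </tr>
  <tr>
    <td class="subtext"><span class="score">42 points</span> by <a class="hnuser">alice</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

row = soup.select_one("tr.athing")             # the title row
link = row.select_one("span.titleline > a")    # the story link inside it
meta_row = row.find_next_sibling("tr")         # the next <tr>, skipping whitespace
score = meta_row.select_one("td.subtext span.score")

print(row["id"], link.get_text(strip=True), link["href"])
print(score.get_text(strip=True))
```

The key move is find_next_sibling("tr"): the metadata is not nested inside the title row, so you step sideways to the following row rather than searching downward.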
import re
from bs4 import BeautifulSoup


def parse_int(text: str) -> int | None:
    # First run of digits in the text, e.g. "123 points" -> 123.
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def parse_front_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    stories = []
    for row in soup.select("tr.athing"):
        story_id = row.get("id")
        title_a = row.select_one("span.titleline > a")
        title = title_a.get_text(strip=True) if title_a else None
        href = title_a.get("href") if title_a else None
        # Metadata lives in the row right after the title row.
        subtext_row = row.find_next_sibling("tr")
        subtext = subtext_row.select_one("td.subtext") if subtext_row else None
        points = None
        author = None
        age = None
        comments = None
        if subtext:
            score = subtext.select_one("span.score")
            points = parse_int(score.get_text(" ", strip=True) if score else "")
            user = subtext.select_one("a.hnuser")
            author = user.get_text(strip=True) if user else None
            age_a = subtext.select_one("span.age a")
            age = age_a.get_text(strip=True) if age_a else None
            links = subtext.select("a")
            if links:
                last = links[-1].get_text(" ", strip=True)
                # Only count it when the last link is the comments link;
                # otherwise an age link like "3 hours ago" would parse as 3.
                if "comment" in last:
                    comments = parse_int(last)
        stories.append(
            {
                "id": story_id,
                "title": title,
                "url": href,
                "points": points,
                "author": author,
                "age": age,
                "comments": comments,
            }
        )
    return stories
Sanity check:
stories = parse_front_page(fetch("/"))
print("stories:", len(stories))
print(stories[0])
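Printing the first row catches an obvious break, but a selector can also fail quietly on only some rows. A rough way to spot that is to count missing values per field across the whole batch. The helper below is a hypothetical sketch (`missing_field_counts` is not part of the scraper above), and the sample rows are made up for illustration.

```python
from collections import Counter


def missing_field_counts(rows: list[dict], fields: list[str]) -> Counter:
    """Count, per field, how many rows have None or an empty string."""
    counts: Counter = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) in (None, ""):
                counts[field] += 1
    return counts


# Made-up rows shaped like the parser's output.
sample_rows = [
    {"id": "1", "title": "A", "url": "https://example.com/a", "points": 10},
    {"id": "2", "title": None, "url": "https://example.com/b", "points": None},
    {"id": "3", "title": "C", "url": "https://example.com/c", "points": 0},
]

counts = missing_field_counts(sample_rows, ["id", "title", "url", "points"])
print(dict(counts))
```

Some fields may be legitimately empty on some rows, so treat the counts as a prompt to inspect the HTML. If a field is missing on every row, the selector has most likely drifted from the markup.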
Step 3: Pagination (crawl N pages)
HN pagination is explicit:
- page 1: /
- page N: /?p=N
def crawl_front_pages(pages: int = 3) -> list[dict]:
    all_stories = []
    seen = set()
    for p in range(1, pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        html = fetch(path)
        batch = parse_front_page(html)
        for s in batch:
            if s["id"] in seen:
                continue
            seen.add(s["id"])
            s["page"] = p
            all_stories.append(s)
    return all_stories


all_stories = crawl_front_pages(pages=3)
print("total:", len(all_stories))
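A fixed page count is fine for a first run. When you do not know how many pages exist, one option is to stop at the first page that yields no rows and to pause between requests. The sketch below is an assumption layered on the same idea: `crawl_until_empty`, `fetch_page`, and `parse_page` are hypothetical names, and the fetch and parse steps are passed in as functions so the loop can be exercised without network access.

```python
import time
from typing import Callable


def crawl_until_empty(
    fetch_page: Callable[[int], str],
    parse_page: Callable[[str], list[dict]],
    max_pages: int = 10,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Collect rows page by page, stopping at the first page with no rows."""
    rows: list[dict] = []
    for page in range(1, max_pages + 1):
        batch = parse_page(fetch_page(page))
        if not batch:
            break  # ran past the last page (or the markup changed)
        rows.extend(batch)
        if page < max_pages:
            sleep(delay)  # pause between requests to stay polite
    return rows


# Stand-ins for real fetching and parsing, for illustration only.
fake_pages = {1: "a,b", 2: "c"}
result = crawl_until_empty(
    fetch_page=lambda p: fake_pages.get(p, ""),
    parse_page=lambda html: [{"id": x} for x in html.split(",") if x],
    max_pages=5,
    delay=0.0,
)
print([r["id"] for r in result])
```

With the functions from this tutorial, the call might look like crawl_until_empty(lambda p: fetch("/" if p == 1 else f"/?p={p}"), parse_front_page). The max_pages cap stays in place as a safety limit.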
Step 4: Export to CSV
import csv


def write_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        raise ValueError("no rows to write")
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)


write_csv("hn_stories.csv", all_stories)
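Taking the header from the first row works as long as every row has the same keys. By default, csv.DictWriter raises ValueError when a later row contains a key that is not in fieldnames. If your rows can vary, one approach is to build the header from the union of keys and fill gaps with an empty string. The sketch below is an illustrative assumption (`rows_to_csv_text` is a hypothetical helper) and writes to an in-memory buffer so you can inspect the result without creating a file.

```python
import csv
import io


def rows_to_csv_text(rows: list[dict]) -> str:
    """Serialize rows to CSV text, using the union of keys as the header."""
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up rows: the second one has an extra "page" key and a comma in its title.
text = rows_to_csv_text(
    [
        {"id": "1", "title": "A"},
        {"id": "2", "title": "B, with comma", "page": 2},
    ]
)

# Read it back to confirm quoting and missing values survive the round trip.
back = list(csv.DictReader(io.StringIO(text, newline="")))
print(back)
```

Reading the output back with csv.DictReader is a cheap check that commas and quotes inside titles were escaped correctly.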
Common BeautifulSoup mistakes (and how to avoid them)
1) Parsing without understanding the HTML
Don’t guess selectors. Use DevTools first, then implement the selectors in code.
2) Regex-parsing HTML
HTML is not a regular language. Use BeautifulSoup (or lxml/XPath) for structure.
3) Ignoring encoding issues
If you see broken characters, make sure you read and write UTF-8 and use the lxml parser. If requests guessed the wrong charset for a page, set r.encoding before reading r.text.
4) No timeouts
requests has no default timeout, so a call without one can wait indefinitely. That is a very common cause of the “my scraper hangs sometimes” issue.
Where ProxiesAPI fits (and why it’s not magical)
When you scrape a “friendly” target, direct requests can be fine.
When you scale, the hard problems show up:
- timeouts
- connection resets
- intermittent blocks
- inconsistent HTML due to bot checks
ProxiesAPI works as a wrapper URL: you pass your API key and the target URL as query parameters.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://news.ycombinator.com/" | head
In Python, you wrap the URL before fetching:
from urllib.parse import urlencode


def proxiesapi_wrap(target_url: str, api_key: str) -> str:
    base = "http://api.proxiesapi.com/"
    return base + "?" + urlencode({"key": api_key, "url": target_url})


API_KEY = "API_KEY"
wrapped = proxiesapi_wrap("https://news.ycombinator.com/", API_KEY)
html = fetch(wrapped)
stories = parse_front_page(html)
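One detail worth seeing in isolation is why the wrapper goes through urlencode rather than plain string concatenation. A target URL with its own query string (such as a paginated page) has to be percent-encoded, or its ? and = would be read as part of the outer URL. The sketch below reuses the endpoint and the API_KEY placeholder from the text above and only builds and decodes strings, so it makes no network request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# A target that carries its own query string.
target = "https://news.ycombinator.com/?p=2"

# Same construction as proxiesapi_wrap above; "API_KEY" is a placeholder.
wrapped_url = "http://api.proxiesapi.com/?" + urlencode(
    {"key": "API_KEY", "url": target}
)
print(wrapped_url)

# Decoding the outer query string recovers the target intact, ?p=2 included.
params = parse_qs(urlsplit(wrapped_url).query)
print(params["url"][0])
```

Because the inner URL is encoded as a single parameter value, page 2 of the target stays page 2 after wrapping.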
The honest benefit: your parsing code doesn’t change. You’re simply making the network layer more resilient.