Scrape Hacker News: Top Stories + Comments (Python + ProxiesAPI)
Hacker News (HN) is one of the best “learn by doing” scraping targets because it’s mostly server-rendered HTML and the structure is consistent.
But we’re not going to write a toy script.
In this tutorial we’ll build a production-grade scraper that extracts:
- top stories (id, title, url, points, author, age, comment count)
- pagination across ?p=N
- full comment threads per story (flat list + indentation so you can rebuild the tree)
- exports to JSON/JSONL
And we’ll wire the fetch layer through ProxiesAPI so you can reuse the same architecture on sites that aren’t friendly.

HN is friendly — but your next target won’t be. ProxiesAPI helps keep crawls stable when request volume grows and failures start coming from the network layer.
What we’re scraping (HN URL map)
- Front page: https://news.ycombinator.com/
- Pagination: https://news.ycombinator.com/?p=2
- Item page (story + comments): https://news.ycombinator.com/item?id=ITEM_ID
Quick sanity check:
curl -s https://news.ycombinator.com/ | head -n 5
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
Step 1: Fetch pages via ProxiesAPI (timeouts + retries)
Set your key:
export PROXIESAPI_KEY="YOUR_API_KEY"
Fetcher:
import os
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY")
BASE = "https://news.ycombinator.com"
TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
SESSION = requests.Session()
class FetchError(RuntimeError):
    pass

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(path_or_url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")
    # Accept either a path ("/?p=2") or a full URL (item pages)
    url = path_or_url if path_or_url.startswith("http") else f"{BASE}{path_or_url}"
    api_url = "https://api.proxiesapi.com"
    params = {"api_key": PROXIESAPI_KEY, "url": url}
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    r = SESSION.get(api_url, params=params, headers=headers, timeout=TIMEOUT)
    # Treat throttling and upstream errors as retryable; tenacity backs off and tries again
    if r.status_code in (429, 500, 502, 503, 504):
        raise FetchError(f"Retryable status: {r.status_code}")
    r.raise_for_status()
    return r.text
Why ProxiesAPI here? Not because HN needs it — but because your scraper architecture should stay the same as you move to tougher targets.
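A quick smoke test before moving on; this just fetches the front page through the proxy and confirms we got HTML back:
html = fetch("/")
print(len(html), "chars")
print("Hacker News" in html)  # should print True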
Step 2: Parse stories from the front page (no guessed selectors)
HN story rows:
- main row: tr.athing (title + link)
- metadata is in the next row: td.subtext (points, author, age, comments)
Parser:
import re
from bs4 import BeautifulSoup
def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None
def parse_front_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    stories = []
    for row in soup.select("tr.athing"):
        story_id = row.get("id")
        title_a = row.select_one("span.titleline > a")
        title = title_a.get_text(strip=True) if title_a else None
        href = title_a.get("href") if title_a else None
        # metadata lives in the <tr> that immediately follows the title row
        subtext_row = row.find_next_sibling("tr")
        subtext = subtext_row.select_one("td.subtext") if subtext_row else None
        points = author = age = None
        comments = None
        if subtext:
            score = subtext.select_one("span.score")
            points = parse_int(score.get_text(" ", strip=True) if score else "")
            user = subtext.select_one("a.hnuser")
            author = user.get_text(strip=True) if user else None
            age_a = subtext.select_one("span.age a")
            age = age_a.get_text(strip=True) if age_a else None
            # the last subtext link is the comments link ("N comments" or "discuss")
            links = subtext.select("a")
            if links:
                comments = parse_int(links[-1].get_text(" ", strip=True))
        stories.append({
            "id": story_id,
            "title": title,
            "url": href,
            "points": points,
            "author": author,
            "age": age,
            "comments": comments,
            "item_url": f"{BASE}/item?id={story_id}" if story_id else None,  # BASE from the fetcher above
        })
    # sanity check: a full front page has 30 stories
    if len(stories) < 20:
        raise RuntimeError(f"Too few stories parsed: {len(stories)}")
    return stories
Step 3: Crawl N pages (pagination)
HN supports ?p=N.
def crawl_front_pages(pages: int = 3) -> list[dict]:
    all_stories = []
    seen = set()
    for p in range(1, pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        html = fetch(path)
        batch = parse_front_page(html)
        for s in batch:
            sid = s.get("id")
            if not sid or sid in seen:
                continue
            seen.add(sid)
            all_stories.append(s)
        print("page", p, "stories", len(batch), "total unique", len(all_stories))
    return all_stories
stories = crawl_front_pages(5)
print("total unique stories:", len(stories))
print(stories[0])
Step 4: Scrape full comment threads for a story
Comments live on the item page (/item?id=...).
HN shows nesting via indentation in HTML. We’ll extract:
- comment id
- author
- age
- indent level (so you can rebuild tree)
- comment text
def parse_comments(item_html: str) -> list[dict]:
    soup = BeautifulSoup(item_html, "lxml")
    out = []
    for tr in soup.select("tr.athing.comtr"):
        cid = tr.get("id")
        # nesting is expressed as the width of a spacer image (multiples of 40px)
        ind = tr.select_one("td.ind img")
        indent = int(ind.get("width", 0)) if ind else 0
        user = tr.select_one("a.hnuser")
        author = user.get_text(strip=True) if user else None
        age_a = tr.select_one("span.age a")
        age = age_a.get_text(strip=True) if age_a else None
        comment = tr.select_one("span.commtext")
        text = comment.get_text("\n", strip=True) if comment else ""
        out.append({
            "id": cid,
            "author": author,
            "age": age,
            "indent": indent,
            "text": text,
        })
    return out
item_html = fetch(stories[0]["item_url"])
comments = parse_comments(item_html)
print("comments:", len(comments))
print(comments[:2])
If you want a real tree, you can post-process using indent (HN uses multiples of 40px for depth).
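Here's a minimal sketch of that post-processing. build_comment_tree is a hypothetical helper (not used elsewhere in this tutorial) and assumes indent is a clean multiple of 40:
def build_comment_tree(flat_comments: list[dict]) -> list[dict]:
    # Convert indent (px) to depth, then attach each comment to the most
    # recent comment one level shallower. Top-level comments become roots.
    roots: list[dict] = []
    stack: list[dict] = []  # stack[i] is the latest comment seen at depth i
    for c in flat_comments:
        node = {**c, "depth": c["indent"] // 40, "children": []}
        depth = node["depth"]
        # drop anything at the same depth or deeper before attaching
        del stack[depth:]
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots

tree = build_comment_tree(comments)
print("top-level comments:", len(tree))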
Export: JSONL (stories) + JSON (comments per story)
Stories to JSONL:
import json
stories = crawl_front_pages(3)
with open("hn_stories.jsonl", "w", encoding="utf-8") as f:
    for s in stories:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
print("wrote hn_stories.jsonl", len(stories))
Comments (one file per story id):
import json
story = stories[0]
item_html = fetch(story["item_url"])
comments = parse_comments(item_html)
out_path = f"hn_comments_{story['id']}.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(comments, f, ensure_ascii=False, indent=2)
print("wrote", out_path, len(comments))
Politeness + scaling tips
HN is very scrape-friendly, but good habits transfer:
- Use timeouts (no hanging jobs)
- Retry only on transient errors
- Don’t request detail pages unless needed
- Add caching when iterating on your parser (a minimal sketch follows this list)
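For the caching tip, a tiny on-disk cache wrapped around fetch is enough while you tweak selectors. This is a sketch, assuming the fetcher from Step 1 is in scope; cached_fetch and the .cache directory are arbitrary names:
import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_fetch(path_or_url: str) -> str:
    # Cache key = hash of the requested path/URL
    key = hashlib.sha256(path_or_url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch(path_or_url)  # the retrying fetcher from Step 1
    cache_file.write_text(html, encoding="utf-8")
    return html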
If you later scrape sites with stricter controls, ProxiesAPI helps by making the network layer more consistent — reducing retries, timeouts, and flaky blocks.
QA checklist
- Front page parser returns ~30 stories
- Pagination increases unique story count
- Comment parser returns non-empty text for active threads
- Export files are valid JSON/JSONL
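You can automate most of this checklist with a few asserts. A rough sketch; the thresholds are assumptions, and it relies on the functions and export file from the earlier steps:
import json

page1 = parse_front_page(fetch("/"))
assert 25 <= len(page1) <= 30, f"unexpected story count: {len(page1)}"

two_pages = crawl_front_pages(2)
assert len(two_pages) > len(page1), "pagination did not add unique stories"

# assumes the first story has at least one comment; pick a busier thread if not
sample = parse_comments(fetch(two_pages[0]["item_url"]))
assert any(c["text"] for c in sample), "no non-empty comment text parsed"

# every JSONL line must parse on its own
with open("hn_stories.jsonl", encoding="utf-8") as f:
    for line in f:
        json.loads(line)
print("QA checks passed")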
Next upgrades
- Turn flat comments into a tree
- Store into SQLite for incremental updates (sketch below)
- Add per-story crawl limits and backoff
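For the SQLite upgrade, the standard library is enough to get incremental updates via upserts. A sketch, assuming the stories list from Step 3; the table layout is just one option:
import sqlite3

conn = sqlite3.connect("hn.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS stories (
        id TEXT PRIMARY KEY,
        title TEXT,
        url TEXT,
        points INTEGER,
        author TEXT,
        age TEXT,
        comments INTEGER
    )
""")
# keep only the columns the table knows about
rows = [
    {k: s.get(k) for k in ("id", "title", "url", "points", "author", "age", "comments")}
    for s in stories
]
# Upsert so re-crawls refresh points/comment counts instead of duplicating rows
conn.executemany("""
    INSERT INTO stories (id, title, url, points, author, age, comments)
    VALUES (:id, :title, :url, :points, :author, :age, :comments)
    ON CONFLICT(id) DO UPDATE SET
        points = excluded.points,
        comments = excluded.comments,
        age = excluded.age
""", rows)
conn.commit()
conn.close()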
HN is friendly — but your next target won’t be. ProxiesAPI helps keep crawls stable when request volume grows and failures start coming from the network layer.