Scrape Hacker News: Top Stories + Comments (Python + ProxiesAPI)

Hacker News (HN) is one of the best “learn by doing” scraping targets because it’s mostly server-rendered HTML and the structure is consistent.

But we’re not going to write a toy script.

In this tutorial we’ll build a production-grade scraper that:

  • extracts top stories (id, title, url, points, author, age, comment count)
  • paginates across ?p=N
  • pulls full comment threads per story (flat list + indentation so you can rebuild the tree)
  • exports to JSON/JSONL

And we’ll wire the fetch layer through ProxiesAPI so you can reuse the same architecture on sites that aren’t friendly.

Hacker News front page (we’ll scrape story rows + subtext)



What we’re scraping (HN URL map)

  • Front page: https://news.ycombinator.com/
  • Pagination: https://news.ycombinator.com/?p=2
  • Item page (story + comments): https://news.ycombinator.com/item?id=ITEM_ID

Quick sanity check:

curl -s https://news.ycombinator.com/ | head -n 5

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

Step 1: Fetch pages via ProxiesAPI (timeouts + retries)

Set your key:

export PROXIESAPI_KEY="YOUR_API_KEY"

Fetcher:

import os
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY")
BASE = "https://news.ycombinator.com"

TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds
SESSION = requests.Session()

class FetchError(RuntimeError):
    pass

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(path_or_url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")

    url = path_or_url if path_or_url.startswith("http") else f"{BASE}{path_or_url}"

    api_url = "https://api.proxiesapi.com"
    params = {"api_key": PROXIESAPI_KEY, "url": url}

    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }

    r = SESSION.get(api_url, params=params, headers=headers, timeout=TIMEOUT)

    # retry only on transient upstream statuses; anything else fails fast
    if r.status_code in (429, 500, 502, 503, 504):
        raise FetchError(f"Retryable status: {r.status_code}")

    r.raise_for_status()
    return r.text

Why ProxiesAPI here? Not because HN needs it — but because your scraper architecture should stay the same as you move to tougher targets.
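
Quick check that the fetcher works end to end (assuming your PROXIESAPI_KEY is exported as above):

html = fetch("/")
print(len(html))    # expect a healthy chunk of HTML
print(html[:80])    # peek at the start of the document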


Step 2: Parse stories from the front page (no guessed selectors)

HN story rows:

  • main row: tr.athing (title + link)
  • metadata is in the next row: td.subtext (points, author, age, comments)

Parser:

import re
from bs4 import BeautifulSoup

def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def parse_front_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    stories = []
    for row in soup.select("tr.athing"):
        story_id = row.get("id")

        title_a = row.select_one("span.titleline > a")
        title = title_a.get_text(strip=True) if title_a else None
        href = title_a.get("href") if title_a else None

        # metadata (points, author, age, comments) lives in the <tr> right after tr.athing
        subtext_row = row.find_next_sibling("tr")
        subtext = subtext_row.select_one("td.subtext") if subtext_row else None

        points = author = age = None
        comments = None

        if subtext:
            score = subtext.select_one("span.score")
            points = parse_int(score.get_text(" ", strip=True) if score else "")

            user = subtext.select_one("a.hnuser")
            author = user.get_text(strip=True) if user else None

            age_a = subtext.select_one("span.age a")
            age = age_a.get_text(strip=True) if age_a else None

            # the comments link is the last <a> in the subtext ("123 comments" or "discuss")
            links = subtext.select("a")
            if links:
                comments = parse_int(links[-1].get_text(" ", strip=True))

        stories.append({
            "id": story_id,
            "title": title,
            "url": href,
            "points": points,
            "author": author,
            "age": age,
            "comments": comments,
            "item_url": f"{BASE}/item?id={story_id}" if story_id else None,
        })

    if len(stories) < 20:
        raise RuntimeError(f"Too few stories parsed: {len(stories)}")

    return stories
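
Before wiring up pagination, a quick smoke test of the parser on its own (reusing the fetch() helper from Step 1):

html = fetch("/")
stories = parse_front_page(html)
print(len(stories))                      # typically 30 on the front page
print(stories[0]["title"], stories[0]["points"])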

Step 3: Crawl N pages (pagination)

HN supports ?p=N.

def crawl_front_pages(pages: int = 3) -> list[dict]:
    all_stories = []
    seen = set()

    for p in range(1, pages + 1):
        path = "/" if p == 1 else f"/?p={p}"
        html = fetch(path)
        batch = parse_front_page(html)

        for s in batch:
            sid = s.get("id")
            if not sid or sid in seen:
                continue
            seen.add(sid)
            all_stories.append(s)

        print("page", p, "stories", len(batch), "total unique", len(all_stories))

    return all_stories

stories = crawl_front_pages(5)
print("total unique stories:", len(stories))
print(stories[0])

Step 4: Scrape full comment threads for a story

Comments live on the item page (/item?id=...).

HN shows nesting via indentation in HTML. We’ll extract:

  • comment id
  • author
  • age
  • indent level (so you can rebuild the tree)
  • comment text

Parser:

def parse_comments(item_html: str) -> list[dict]:
    soup = BeautifulSoup(item_html, "lxml")

    out = []
    for tr in soup.select("tr.athing.comtr"):
        cid = tr.get("id")

        # nesting is encoded by a spacer image: width 0 for top level, +40px per extra level
        ind = tr.select_one("td.ind img")
        indent = int(ind.get("width", 0)) if ind else 0

        user = tr.select_one("a.hnuser")
        author = user.get_text(strip=True) if user else None

        age_a = tr.select_one("span.age a")
        age = age_a.get_text(strip=True) if age_a else None

        comment = tr.select_one("span.commtext")
        text = comment.get_text("\n", strip=True) if comment else ""

        out.append({
            "id": cid,
            "author": author,
            "age": age,
            "indent": indent,
            "text": text,
        })

    return out

item_html = fetch(stories[0]["item_url"])
comments = parse_comments(item_html)
print("comments:", len(comments))
print(comments[:2])

If you want a real tree, you can post-process using indent (HN uses multiples of 40px for depth).
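
Here’s a minimal sketch of that post-processing. It assumes indent is the raw pixel width captured by parse_comments (40px per level) and that comments arrive in document order:

def build_comment_tree(comments: list[dict]) -> list[dict]:
    """Nest a flat, document-ordered comment list using indent (40px per level)."""
    roots = []
    stack = []  # (level, node) pairs for the current ancestor chain

    for c in comments:
        node = {**c, "children": []}
        level = c["indent"] // 40

        # drop ancestors that are at the same depth or deeper
        while stack and stack[-1][0] >= level:
            stack.pop()

        if stack:
            stack[-1][1]["children"].append(node)
        else:
            roots.append(node)

        stack.append((level, node))

    return roots

tree = build_comment_tree(comments)
print("top-level comments:", len(tree))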


Export: JSONL (stories) + JSON (comments per story)

Stories to JSONL:

import json

stories = crawl_front_pages(3)
with open("hn_stories.jsonl", "w", encoding="utf-8") as f:
    for s in stories:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")

print("wrote hn_stories.jsonl", len(stories))

Comments (one file per story id):

import json

story = stories[0]
item_html = fetch(story["item_url"])
comments = parse_comments(item_html)

out_path = f"hn_comments_{story['id']}.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(comments, f, ensure_ascii=False, indent=2)

print("wrote", out_path, len(comments))

Politeness + scaling tips

HN is very scrape-friendly, but good habits transfer:

  • Use timeouts (no hanging jobs)
  • Retry only on transient errors
  • Don’t request detail pages unless needed
  • Add caching when iterating on your parser (see the sketch below)
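
The caching point deserves a concrete example. A minimal on-disk cache around fetch() looks something like this (the .cache directory and the SHA-256 keying are arbitrary choices, not part of any API):

import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(path_or_url: str) -> str:
    """Return cached HTML if we've fetched this URL before, otherwise fetch and store it."""
    key = hashlib.sha256(path_or_url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.html"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    html = fetch(path_or_url)
    cache_file.write_text(html, encoding="utf-8")
    return html

Swap fetch_cached in for fetch while you iterate on selectors; delete the .cache directory when you want fresh pages.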

If you later scrape sites with stricter controls, ProxiesAPI helps by keeping the network layer consistent, which means fewer retries, timeouts, and flaky blocks.


QA checklist

  • Front page parser returns ~30 stories
  • Pagination increases unique story count
  • Comment parser returns non-empty text for active threads
  • Export files are valid JSON/JSONL
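
If you want to run those checks automatically, a rough script along these lines does the job (the thresholds are loose assertions, not guarantees):

stories = crawl_front_pages(2)
assert len(stories) > 30, "pagination should add unique stories beyond page 1"

# check comments on the most-discussed story so the thread is unlikely to be empty
top = max(stories, key=lambda s: s["comments"] or 0)
comments = parse_comments(fetch(top["item_url"]))
assert any(c["text"] for c in comments), "expected non-empty comment text"

print("QA passed:", len(stories), "stories,", len(comments), "comments")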

Next upgrades

  • Turn flat comments into a tree
  • Store into SQLite for incremental updates (a starting point is sketched below)
  • Add per-story crawl limits and backoff
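
For the SQLite upgrade, something like this is a reasonable starting point (the schema and upsert strategy here are assumptions, not a prescription):

import sqlite3

def save_stories(stories: list[dict], db_path: str = "hn.db") -> None:
    """Upsert stories by id so repeated crawls refresh points and comment counts."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS stories (
            id TEXT PRIMARY KEY,
            title TEXT,
            url TEXT,
            points INTEGER,
            author TEXT,
            age TEXT,
            comments INTEGER
        )
    """)
    conn.executemany(
        """INSERT INTO stories (id, title, url, points, author, age, comments)
           VALUES (:id, :title, :url, :points, :author, :age, :comments)
           ON CONFLICT(id) DO UPDATE SET
               points = excluded.points,
               comments = excluded.comments""",
        stories,  # extra dict keys (like item_url) are ignored by sqlite3
    )
    conn.commit()
    conn.close()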

Scale your crawl reliably with ProxiesAPI

HN is friendly — but your next target won’t be. ProxiesAPI helps keep crawls stable when request volume grows and failures start coming from the network layer.
