Scrape BBC News Headlines & Article URLs (Python + ProxiesAPI)

BBC News is a useful scraping target because it’s:

  • high-signal content (headlines + URLs)
  • structured enough to extract reliably
  • a realistic “homepage changes every hour” scenario

In this tutorial we’ll build a Python scraper that:

  1. fetches a BBC News page through ProxiesAPI
  2. extracts headline text, section label, and canonical article URL
  3. writes results to JSON Lines (JSONL) for easy streaming and ingestion

We’ll also capture a screenshot of the page we’re scraping.

BBC News homepage (we’ll extract headline links)

Keep news crawls stable with ProxiesAPI

News homepages can be noisy and change often. ProxiesAPI helps by making the network layer reliable when you’re crawling many pages or running frequently.


What we’re scraping

BBC News has multiple entry points. For a starter scraper:

  • Homepage: https://www.bbc.com/news

You can also scrape specific sections (e.g., business/technology) once the pipeline works.


Setup

python -m venv .venv
source .venv/bin/activate

pip install requests beautifulsoup4 lxml python-dotenv

ProxiesAPI integration (proxy URL)

Set a proxy URL in your environment:

export PROXIESAPI_PROXY_URL="http://YOUR_USERNAME:YOUR_PASSWORD@gw.proxiesapi.com:8080"

Then pass it to requests via the proxies= parameter.


Step 1: Fetch HTML with headers + timeouts

import os
import time
import random
import requests

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

session = requests.Session()


def fetch(url: str) -> str:
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-GB,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    proxies = None
    if PROXY_URL:
        proxies = {"http": PROXY_URL, "https": PROXY_URL}

    r = session.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
    r.raise_for_status()

    time.sleep(random.uniform(0.5, 1.2))  # polite jitter between requests
    return r.text
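The fetch() above treats any failure as fatal. For frequent crawls, a small retry wrapper with exponential backoff and jitter helps; this is a sketch with illustrative defaults, not part of the original script (in practice you'd catch requests.RequestException rather than Exception):

```python
import random
import time

# Retry a zero-argument callable with exponential backoff plus jitter.
# Exception is caught here only to keep the sketch self-contained; in the
# scraper you would catch requests.RequestException specifically.
def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    last_exc = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                # Double the delay each attempt, plus a random jitter term
                time.sleep(base_delay * (2 ** i) + random.uniform(0, base_delay / 2))
    raise last_exc

# Usage: html = with_retries(lambda: fetch(NEWS_URL))
```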

Step 2: Extract headline candidates

BBC uses a mix of card modules. Instead of keying off brittle class names, a good first pass is to:

  • collect all anchor tags that look like article links
  • filter out nav/footer links
  • normalize to canonical URLs

Heuristics we’ll use:

  • href starts with /news/ or full https://www.bbc.com/news/…
  • anchor has non-trivial text (headline)

import re
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

BBC_BASE = "https://www.bbc.com"
NEWS_URL = "https://www.bbc.com/news"


def normalize_url(href: str) -> str | None:
    if not href:
        return None
    url = urljoin(BBC_BASE, href)
    # strip the fragment; keep the query string (drop it too if you want stricter canonical URLs)
    p = urlparse(url)
    clean = p._replace(fragment="").geturl()
    return clean


def extract_headlines(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    items = []
    seen = set()

    for a in soup.select("a"):
        text = a.get_text(" ", strip=True)
        if not text or len(text) < 12:  # too short to be a headline (likely nav/button text)
            continue

        href = a.get("href")
        if not href:
            continue

        # Keep article-ish links only
        if not (href.startswith("/news/") or href.startswith("https://www.bbc.com/news/")):
            continue

        url = normalize_url(href)
        if not url or url in seen:
            continue

        # Drop common non-article paths
        if re.search(r"/news/(live|av)/", url):
            continue

        # Section label (best-effort): often nearby in parent card
        card = a.find_parent(["article", "div", "li"])
        section = None
        if card:
            # Look for a short label element in the same card
            label = card.find(string=re.compile(r"^[A-Z][A-Za-z &]{2,30}$"))
            if label:
                section = str(label).strip()

        seen.add(url)
        items.append({
            "headline": text,
            "url": url,
            "section": section,
        })

    return items

Why heuristics beat brittle selectors

News homepages change. A selector like .gs-c-promo-heading__title might work today and fail next month.

Starting with link-based extraction gives you robustness. Once you have stable output, you can narrow selectors for cleaner results.


Step 3: Export to JSONL

JSON Lines is great for scraping pipelines:

  • each row is independent
  • you can stream output line-by-line
  • it plays nicely with tools like jq, BigQuery, and Python ingestion jobs

import json
from datetime import datetime, timezone


def export_jsonl(rows: list[dict], path: str):
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            r2 = {**r, "scraped_at": ts, "source": "bbc.com/news"}
            f.write(json.dumps(r2, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    html = fetch(NEWS_URL)
    rows = extract_headlines(html)

    # Keep top N for a clean file; adjust as needed
    rows = rows[:60]

    export_jsonl(rows, "bbc_headlines.jsonl")
    print("wrote bbc_headlines.jsonl", len(rows))

    # Quick peek
    print(rows[:3])

Make it more “production” (optional upgrades)

  1. Deduplicate across runs

    • store URL hashes in SQLite
    • only emit new URLs
  2. Scrape multiple sections

    • https://www.bbc.com/news/business
    • https://www.bbc.com/news/technology
  3. Enrich per-article

    • fetch each article URL
    • extract publish time + author + summary paragraphs
  4. Respect crawl load

    • keep concurrency low
    • add jitter
    • cache responses for short intervals
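Upgrade #1 can be sketched with the standard library's sqlite3 module. The file name, table schema, and helper names below are illustrative, not from the original tutorial:

```python
import hashlib
import sqlite3

# Open (or create) a small SQLite store of URL hashes seen in past runs.
def open_seen_db(path: str = "seen_urls.sqlite") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url_hash TEXT PRIMARY KEY)")
    return conn

# Return only rows whose URL has not been recorded before; record the rest.
def emit_new(conn: sqlite3.Connection, rows: list[dict]) -> list[dict]:
    fresh = []
    for r in rows:
        h = hashlib.sha256(r["url"].encode("utf-8")).hexdigest()
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (url_hash) VALUES (?)", (h,)
        )
        if cur.rowcount:  # 1 only when the hash was newly inserted
            fresh.append(r)
    conn.commit()
    return fresh
```

Call emit_new() on the extracted rows before export_jsonl() and only genuinely new headlines reach the output file on each run.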

Where ProxiesAPI fits (honestly)

Scraping one page occasionally usually works without proxies.

But if you’re crawling multiple BBC sections, doing it frequently, or running from unstable networks, you’ll start seeing:

  • timeouts
  • intermittent 403s
  • inconsistent HTML

A proxy layer like ProxiesAPI helps keep your fetch step predictable so you can focus on parsing and data quality.


QA checklist

  • Headline text looks human-readable (not nav text)
  • URLs are canonical and start with https://www.bbc.com/news/…
  • JSONL validates (one JSON object per line)
  • Duplicates are low (basic seen set is working)
  • You can re-run without frequent blocks/timeouts
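The JSONL check in the list above is easy to automate. This helper (a sketch; the required field names match what export_jsonl emits) raises on the first malformed or incomplete line:

```python
import json

# Confirm each line of a JSONL file parses as one JSON object and carries
# the expected fields; returns the number of valid rows.
def validate_jsonl(path: str, required=("headline", "url")) -> int:
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            obj = json.loads(line)  # raises on malformed JSON
            missing = [k for k in required if k not in obj]
            if missing:
                raise ValueError(f"line {lineno} missing fields: {missing}")
            count += 1
    return count
```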

Related guides

Scrape Live Stock Prices from Yahoo Finance (Python + ProxiesAPI)
Fetch Yahoo Finance quote pages via ProxiesAPI, parse price + change + market cap, and export clean rows to CSV. Includes selector rationale and a screenshot.
Scrape GitHub Repository Data (Stars, Releases, Issues) with Python + ProxiesAPI
Scrape GitHub repo pages as HTML (not just the API): stars, forks, open issues/PRs, latest release, and recent issues. Includes defensive selectors, CSV export, and a screenshot.
How to Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)
Pull Craigslist listings for a chosen city + category, normalize fields, follow listing pages for details, and export clean CSV with retries and anti-block tips.
How to Scrape AutoTrader Used Car Listings with Python (Make/Model/Price/Mileage)
Scrape AutoTrader search results into a clean dataset: title, price, mileage, year, location, and dealer vs private hints. Includes ProxiesAPI fetch, robust selectors, and export to JSON.