Scrape BBC News Headlines & Article URLs (Python + ProxiesAPI)

BBC News is a useful scraping target because it’s:

  • high-signal content (headlines + URLs)
  • structured enough to extract reliably
  • a realistic “homepage changes every hour” scenario

In this tutorial we’ll build a Python scraper that:

  1. fetches a BBC News page through ProxiesAPI
  2. extracts headline text, section label, and canonical article URL
  3. writes results to JSON Lines (JSONL) for easy streaming and ingestion

We’ll also capture a screenshot of the page we’re scraping.

BBC News homepage (we’ll extract headline links)

Keep news crawls stable with ProxiesAPI

News homepages can be noisy and change often. ProxiesAPI helps by making the network layer reliable when you’re crawling many pages or running frequently.


What we’re scraping

BBC News has multiple entry points. For a starter scraper:

  • Homepage: https://www.bbc.com/news

You can also scrape specific sections (e.g., business/technology) once the pipeline works.


Setup

python -m venv .venv
source .venv/bin/activate

pip install requests beautifulsoup4 lxml python-dotenv

ProxiesAPI integration (proxy URL)

Set a proxy URL in your environment:

export PROXIESAPI_PROXY_URL="http://YOUR_USERNAME:YOUR_PASSWORD@gw.proxiesapi.com:8080"

Then pass it to requests via the proxies= parameter.


Step 1: Fetch HTML with headers + timeouts

import os
import time
import random
import requests

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds

session = requests.Session()


def fetch(url: str) -> str:
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-GB,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    proxies = None
    if PROXY_URL:
        proxies = {"http": PROXY_URL, "https": PROXY_URL}

    r = session.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
    r.raise_for_status()

    time.sleep(random.uniform(0.5, 1.2))  # polite jitter between requests
    return r.text
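The fetch() above treats any failure as fatal. For frequent crawls, a small retry wrapper with exponential backoff and jitter helps; this is a sketch with illustrative defaults, not part of the original script (in practice you'd catch requests.RequestException rather than Exception):

```python
import random
import time

# Retry a zero-argument callable with exponential backoff plus jitter.
# Exception is caught here only to keep the sketch self-contained; in the
# scraper you would catch requests.RequestException specifically.
def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    last_exc = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                # Double the delay each attempt, plus a random jitter term
                time.sleep(base_delay * (2 ** i) + random.uniform(0, base_delay / 2))
    raise last_exc

# Usage: html = with_retries(lambda: fetch(NEWS_URL))
```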

Step 2: Extract headline candidates

BBC uses a mix of card modules. Instead of keying off brittle class names, a good first pass is to:

  • collect all anchor tags that look like article links
  • filter out nav/footer links
  • normalize to canonical URLs

Heuristics we’ll use:

  • href starts with /news/ or full https://www.bbc.com/news/…
  • anchor has non-trivial text (headline)

import re
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

BBC_BASE = "https://www.bbc.com"
NEWS_URL = "https://www.bbc.com/news"


def normalize_url(href: str) -> str | None:
    if not href:
        return None
    url = urljoin(BBC_BASE, href)
    # strip the fragment; keep the query string (drop it too if you want stricter canonical URLs)
    p = urlparse(url)
    clean = p._replace(fragment="").geturl()
    return clean


def extract_headlines(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    items = []
    seen = set()

    for a in soup.select("a"):
        text = a.get_text(" ", strip=True)
        if not text or len(text) < 12:  # too short to be a headline (likely nav/button text)
            continue

        href = a.get("href")
        if not href:
            continue

        # Keep article-ish links only
        if not (href.startswith("/news/") or href.startswith("https://www.bbc.com/news/")):
            continue

        url = normalize_url(href)
        if not url or url in seen:
            continue

        # Drop common non-article paths
        if re.search(r"/news/(live|av)/", url):
            continue

        # Section label (best-effort): often nearby in parent card
        card = a.find_parent(["article", "div", "li"])
        section = None
        if card:
            # Look for a short label element in the same card
            label = card.find(string=re.compile(r"^[A-Z][A-Za-z &]{2,30}$"))
            if label:
                section = str(label).strip()

        seen.add(url)
        items.append({
            "headline": text,
            "url": url,
            "section": section,
        })

    return items

Why heuristics beat brittle selectors

News homepages change. A selector like .gs-c-promo-heading__title might work today and fail next month.

Starting with link-based extraction gives you robustness. Once you have stable output, you can narrow selectors for cleaner results.


Step 3: Export to JSONL

JSON Lines is great for scraping pipelines:

  • each row is independent
  • you can stream output line-by-line
  • it plays nicely with tools like jq, BigQuery, and Python ingestion jobs

import json
from datetime import datetime, timezone


def export_jsonl(rows: list[dict], path: str):
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            r2 = {**r, "scraped_at": ts, "source": "bbc.com/news"}
            f.write(json.dumps(r2, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    html = fetch(NEWS_URL)
    rows = extract_headlines(html)

    # Keep top N for a clean file; adjust as needed
    rows = rows[:60]

    export_jsonl(rows, "bbc_headlines.jsonl")
    print("wrote bbc_headlines.jsonl", len(rows))

    # Quick peek
    print(rows[:3])

Make it more “production” (optional upgrades)

  1. Deduplicate across runs

    • store URL hashes in SQLite
    • only emit new URLs
  2. Scrape multiple sections

    • https://www.bbc.com/news/business
    • https://www.bbc.com/news/technology
  3. Enrich per-article

    • fetch each article URL
    • extract publish time + author + summary paragraphs
  4. Respect crawl load

    • keep concurrency low
    • add jitter
    • cache responses for short intervals
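Upgrade #1 can be sketched with the standard library's sqlite3 module. The file name, table schema, and helper names below are illustrative, not from the original tutorial:

```python
import hashlib
import sqlite3

# Open (or create) a small SQLite store of URL hashes seen in past runs.
def open_seen_db(path: str = "seen_urls.sqlite") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url_hash TEXT PRIMARY KEY)")
    return conn

# Return only rows whose URL has not been recorded before; record the rest.
def emit_new(conn: sqlite3.Connection, rows: list[dict]) -> list[dict]:
    fresh = []
    for r in rows:
        h = hashlib.sha256(r["url"].encode("utf-8")).hexdigest()
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (url_hash) VALUES (?)", (h,)
        )
        if cur.rowcount:  # 1 only when the hash was newly inserted
            fresh.append(r)
    conn.commit()
    return fresh
```

Call emit_new() on the extracted rows before export_jsonl() and only genuinely new headlines reach the output file on each run.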

Where ProxiesAPI fits (honestly)

Scraping one page occasionally usually works without proxies.

But if you’re crawling multiple BBC sections, doing it frequently, or running from unstable networks, you’ll start seeing:

  • timeouts
  • intermittent 403s
  • inconsistent HTML

A proxy layer like ProxiesAPI helps keep your fetch step predictable so you can focus on parsing and data quality.


QA checklist

  • Headline text looks human-readable (not nav text)
  • URLs are canonical and start with https://www.bbc.com/news/…
  • JSONL validates (one JSON object per line)
  • Duplicates are low (basic seen set is working)
  • You can re-run without frequent blocks/timeouts
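The JSONL check in the list above is easy to automate. This helper (a sketch; the required field names match what export_jsonl emits) raises on the first malformed or incomplete line:

```python
import json

# Confirm each line of a JSONL file parses as one JSON object and carries
# the expected fields; returns the number of valid rows.
def validate_jsonl(path: str, required=("headline", "url")) -> int:
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            obj = json.loads(line)  # raises on malformed JSON
            missing = [k for k in required if k not in obj]
            if missing:
                raise ValueError(f"line {lineno} missing fields: {missing}")
            count += 1
    return count
```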

Related guides

Scrape Live Stock Prices from Yahoo Finance (Python + ProxiesAPI)
Fetch Yahoo Finance quote pages via ProxiesAPI, parse price + change + market cap, and export clean rows to CSV. Includes selector rationale and a screenshot.
Scrape GitHub Repository Data (Stars, Releases, Issues) with Python + ProxiesAPI
Scrape GitHub repo pages as HTML (not just the API): stars, forks, open issues/PRs, latest release, and recent issues. Includes defensive selectors, CSV export, and a screenshot.
How to Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)
Pull Craigslist listings for a chosen city + category, normalize fields, follow listing pages for details, and export clean CSV with retries and anti-block tips.
How to Scrape AutoTrader Used Car Listings with Python (Make/Model/Price/Mileage)
Scrape AutoTrader search results into a clean dataset: title, price, mileage, year, location, and dealer vs private hints. Includes ProxiesAPI fetch, robust selectors, and export to JSON.