Scrape Sports Scores from ESPN with Python (via ProxiesAPI)
ESPN has a clean scoreboard experience that’s perfect for building a practical scraper:
- it’s updated frequently
- it’s request-heavy if you crawl multiple sports/days
- it’s the kind of site that can intermittently throttle or serve different markup
In this guide we’ll build a real Python scraper that:
- fetches an ESPN scoreboard page
- extracts each game row: teams, score, status, start time
- optionally follows game links for a little extra metadata
- exports results to CSV
- uses a fetch layer that can route through ProxiesAPI when you scale
We’ll keep the extraction honest: parse the HTML you actually receive, and write selectors that fail loudly when ESPN changes markup.

Sports sites change and throttle. ProxiesAPI gives you a proxy-backed fetch URL plus optional JS rendering so your scraper finishes more runs with fewer network headaches.
What we’re scraping (ESPN scoreboard)
ESPN scoreboard URLs vary by sport and date. For example:
- NBA scoreboard: https://www.espn.com/nba/scoreboard
- NFL scoreboard: https://www.espn.com/nfl/scoreboard
Date parameters and navigation often change the URL, but you can start with the main scoreboard page and iterate.
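If you later want to crawl multiple days, it helps to generate the candidate URLs up front. The `/_/date/YYYYMMDD` suffix below is a pattern observed in ESPN's scoreboard navigation, not a documented API — verify it against your own browser session before relying on it:

```python
from datetime import date, timedelta

def scoreboard_urls(sport: str, days_back: int = 1, days_forward: int = 1) -> list[str]:
    """Build scoreboard URLs for a window of days around today.

    Uses the /_/date/YYYYMMDD suffix observed in ESPN navigation;
    confirm the pattern still holds for your sport before scaling.
    """
    base = f"https://www.espn.com/{sport}/scoreboard"
    today = date.today()
    urls = [base]  # the undated page shows "today" by default
    for offset in range(-days_back, days_forward + 1):
        if offset == 0:
            continue  # already covered by the undated base URL
        d = today + timedelta(days=offset)
        urls.append(f"{base}/_/date/{d.strftime('%Y%m%d')}")
    return urls

# usage: for url in scoreboard_urls("nba"): html = fetch(url)
```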
Quick sanity check
```shell
curl -sL "https://www.espn.com/nba/scoreboard" | head -n 15
```
If the response is mostly empty or looks like a “please enable JS” shell, you’ll need either:
- a different endpoint ESPN exposes (sometimes there are JSON feeds), or
- to fetch through ProxiesAPI with JS rendering enabled (if available on your plan), or
- a browser automation approach (Playwright) for this target.
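A cheap way to make that decision programmatically is a heuristic shell detector. The thresholds below (page size, link count) are assumptions tuned by eye, not ESPN-specific constants — adjust them against real responses:

```python
def looks_like_js_shell(html: str) -> bool:
    """Rough heuristic: a server-rendered scoreboard is large and link-heavy;
    a 'please enable JS' shell is mostly <script> payload with a tiny body.

    The 20k-char and 10-link thresholds are guesses — tune them against
    responses you actually receive.
    """
    lowered = html.lower()
    too_small = len(html) < 20_000
    too_few_links = lowered.count("<a ") < 10
    return too_small or too_few_links

# usage:
# html = fetch("https://www.espn.com/nba/scoreboard")
# if looks_like_js_shell(html):
#     print("got a shell; try JS rendering or a JSON endpoint")
```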
This tutorial focuses on HTML extraction, and shows where to switch the fetch layer.
Setup
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (with the `lxml` parser) for parsing
Step 1: A production fetch() with retries + ProxiesAPI
Two rules that make scrapers survive:
- timeouts always
- retries with exponential backoff
And when you scale, route the same request through ProxiesAPI.
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # connect, read
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()

def fetch(url: str, *, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                # ProxiesAPI simple proxy-backed fetch URL.
                # Note: some accounts support extra params (rendering, country, etc.).
                proxied = (
                    "http://api.proxiesapi.com/?key="
                    + quote(proxiesapi_key)
                    + "&url="
                    + quote(url, safe="")
                )
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            if attempt < retries:
                # Exponential backoff with jitter; no sleep after the final attempt.
                time.sleep((2 ** attempt) + random.random())
    raise RuntimeError(f"fetch failed after {retries} retries: {last}")
```
If you want to test both modes:
```python
html_direct = fetch("https://www.espn.com/nba/scoreboard")
print("direct chars:", len(html_direct))

# html_proxy = fetch("https://www.espn.com/nba/scoreboard", proxiesapi_key="YOUR_KEY")
# print("proxied chars:", len(html_proxy))
```
Step 2: Inspect the HTML and choose selectors
ESPN’s markup shifts. You can’t rely on “one magic class” forever.
Practical approach:
- Save a snapshot of the HTML you received.
- Find repeated “game card” containers.
- Build selectors around structure (not just long class strings).
Save a local snapshot:
```python
html = fetch("https://www.espn.com/nba/scoreboard")
with open("espn_scoreboard.html", "w", encoding="utf-8") as f:
    f.write(html)
print("wrote espn_scoreboard.html")
```
Open it and look for repeated blocks like:
- a wrapper for each event/game
- team names
- score numbers
- status text (final, Q4, scheduled time)
On many ESPN pages you’ll find some combination of:
- `section`/`div` wrappers per event
- links to the game recap (`/game/_/gameId/...`)
- team name text within nested spans
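Before committing to selectors, a quick structural census of your snapshot confirms those repeated blocks actually exist. `census` is our own helper, and the signals it counts mirror the bullets above:

```python
import re

def census(html: str) -> dict:
    """Count structural signals in a saved scoreboard snapshot so you can
    confirm it really contains repeated game blocks before writing selectors."""
    return {
        "game_links": len(re.findall(r"/game/_/gameId/", html)),
        "sections": html.lower().count("<section"),
        "final_markers": len(re.findall(r"\bFinal\b", html)),
    }

# usage:
# with open("espn_scoreboard.html", encoding="utf-8") as f:
#     print(census(f.read()))
```

If `game_links` is zero, you likely received a JS shell and should revisit the fetch layer before touching the parser.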
We’ll implement parsing in a way that is easy to adapt: selectors are centralized.
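One way to centralize them is a single selectors dict plus a lookup that fails loudly, per the intro's advice. The selector strings below are placeholders (only the `gameId` link pattern comes from this guide) — tighten them from your snapshot:

```python
# All markup assumptions live in one dict; when ESPN shifts its HTML,
# you edit these strings, not the parsing logic. The team/score values
# are illustrative placeholders — replace them with real selectors
# taken from your saved snapshot.
SELECTORS = {
    "game_link": 'a[href*="/game/_/gameId/"]',
    "team_name": "span",  # placeholder
    "score": "span",      # placeholder
}

def select_one_or_fail(container, key: str):
    """Return the first match for a named selector, or raise so a markup
    change surfaces as a loud error instead of silently empty data."""
    el = container.select_one(SELECTORS[key])
    if el is None:
        raise ValueError(
            f"selector {key!r} ({SELECTORS[key]}) matched nothing; "
            "ESPN markup may have changed"
        )
    return el
```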
Step 3: Parse game rows into structured records
This parser tries a few common patterns:
- game containers are “cards” with a game link inside
- within a card, there are two teams
- each team has a name, and possibly a score
- status is present near the top/bottom of the card
If ESPN changes markup, you typically only edit a couple of selectors.
```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.espn.com"

def clean(text: str | None) -> str | None:
    if not text:
        return None
    t = re.sub(r"\s+", " ", text).strip()
    return t or None

def parse_int(text: str | None) -> int | None:
    if not text:
        return None
    m = re.search(r"\d+", text)
    return int(m.group(0)) if m else None

def parse_scoreboard(html: str, page_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    games: list[dict] = []

    # Heuristic: find links that look like game pages and walk up to a container.
    # ESPN game links often contain /game/_/gameId/
    game_links = soup.select('a[href*="/game/_/gameId/"]')

    seen = set()
    for a in game_links:
        href = a.get("href")
        if not href:
            continue
        game_url = urljoin(BASE, href)
        if game_url in seen:
            continue
        seen.add(game_url)

        # Find a reasonable container around the link
        card = a
        for _ in range(6):
            if not card:
                break
            # Stop when the container has enough text to plausibly be a game card
            if getattr(card, "get_text", None):
                txt = clean(card.get_text(" ", strip=True)) or ""
                if len(txt) > 40:
                    break
            card = card.parent
        container = card if card else a

        # Team names: look for two repeated name elements inside container.
        # These selectors are intentionally broad; you should tighten them
        # based on a real snapshot.
        name_candidates = [
            el.get_text(" ", strip=True)
            for el in container.select("span, div")
            if el.get_text(strip=True)
        ]

        # Try to detect team-like names by excluding very short/very long tokens.
        # This is a fallback; ideally you target specific selectors once you
        # inspect your snapshot.
        team_names = []
        for t in name_candidates:
            t = clean(t)
            if not t:
                continue
            if len(t) < 3 or len(t) > 40:
                continue
            # Skip obvious non-team tokens
            if t.lower() in {"final", "preview", "recap", "tickets"}:
                continue
            team_names.append(t)

        # De-dupe while preserving order
        uniq_names = []
        seen_name = set()
        for n in team_names:
            if n in seen_name:
                continue
            seen_name.add(n)
            uniq_names.append(n)

        # Scores: many cards show numeric scores; grab a few numbers.
        nums = [parse_int(clean(el.get_text(strip=True))) for el in container.select("span, div")]
        nums = [n for n in nums if n is not None]

        # Status tends to include words like Final, Q1, Half, or a time.
        status = None
        status_el = container.find(string=re.compile(r"Final|Q\d|Half|AM|PM", re.I))
        if status_el:
            status = clean(str(status_el))

        games.append({
            "page_url": page_url,
            "game_url": game_url,
            "teams_guess": uniq_names[:6],
            "scores_guess": nums[:6],
            "status_guess": status,
        })

    return games
```
This parser is deliberately conservative: it gives you a structured starting point even when ESPN changes HTML.
For a production scraper, you’ll do one more pass:
- open `espn_scoreboard.html`
- identify the exact game card container selector
- tighten team name selectors to those elements
That turns “guessy” output into stable output.
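After that pass, the parser for a single card becomes a few direct lookups. The `.team-name`/`.team-score` selectors here are hypothetical placeholders standing in for whatever your snapshot shows:

```python
def parse_card_strict(card) -> dict:
    """Parse one game card once real selectors are known.

    ".team-name" and ".team-score" are HYPOTHETICAL selectors —
    substitute the ones you find in your saved snapshot.
    """
    names = [el.get_text(strip=True) for el in card.select(".team-name")]
    scores = [el.get_text(strip=True) for el in card.select(".team-score")]
    if len(names) != 2:
        # Fail loudly instead of emitting half a record
        raise ValueError(f"expected 2 team names, got {len(names)}")
    return {"away_team": names[0], "home_team": names[1], "scores": scores}
```

Because the function raises on a mismatch, a markup change shows up as an error in your logs rather than a CSV full of blanks.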
Step 4: Convert guesses into a clean schema
You usually want a normalized record like:
- `home_team`, `away_team`
- `home_score`, `away_score`
- `status` (Final / In Progress / Scheduled)
- `game_url`
Here’s a helper that attempts to map the first two team names + first two scores:
```python
def normalize_game(g: dict) -> dict:
    teams = g.get("teams_guess") or []
    scores = g.get("scores_guess") or []

    away_team = teams[0] if len(teams) > 0 else None
    home_team = teams[1] if len(teams) > 1 else None
    away_score = scores[0] if len(scores) > 0 else None
    home_score = scores[1] if len(scores) > 1 else None

    return {
        "away_team": away_team,
        "home_team": home_team,
        "away_score": away_score,
        "home_score": home_score,
        "status": g.get("status_guess"),
        "game_url": g.get("game_url"),
        "page_url": g.get("page_url"),
    }
```
Step 5: Export to CSV
```python
import csv

def to_csv(rows: list[dict], path: str) -> None:
    if not rows:
        raise ValueError("no rows")
    fields = ["away_team", "home_team", "away_score", "home_score", "status", "game_url", "page_url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})

if __name__ == "__main__":
    url = "https://www.espn.com/nba/scoreboard"
    html = fetch(url)  # or fetch(url, proxiesapi_key="YOUR_KEY")
    raw = parse_scoreboard(html, page_url=url)
    normalized = [normalize_game(g) for g in raw]
    to_csv(normalized, "espn_scores.csv")
    print("wrote espn_scores.csv", len(normalized))
```
Where ProxiesAPI fits (honestly)
If you’re scraping one scoreboard page occasionally, you may be fine without proxies.
ProxiesAPI becomes useful when you:
- crawl many sports + dates (lots of repetitive requests)
- follow each game to a detail/recap page
- run scheduled scrapes (hourly/daily) where intermittent blocks hurt
The key idea: keep your extraction logic the same, and swap the fetch layer to use the ProxiesAPI URL.
QA checklist
- `curl -sL` shows real HTML (not an empty shell)
- Your snapshot contains multiple repeated game blocks
- Team names map correctly to home/away for 3–5 spot checks
- CSV outputs sane rows (no null spam)
- Retries/backoff work (simulate by disconnecting the network)
Next upgrades
- add date selection (crawl yesterday/today/tomorrow)
- scrape additional fields: venue, broadcast network, odds (if present)
- store in SQLite for incremental updates
- tighten selectors based on your saved HTML snapshot
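For the SQLite upgrade, an upsert keyed on `game_url` keeps one row per game and lets later runs overwrite scores and status as games finish. `upsert_games` is a sketch of that idea, reusing the column names from the Step 5 CSV:

```python
import sqlite3

def upsert_games(db_path: str, rows: list[dict]) -> None:
    """Incrementally store normalized games: one row per game_url,
    with the newest scores/status winning on conflict.

    Column names mirror the CSV fields from Step 5; rows must carry
    game_url, away_team, home_team, away_score, home_score, status.
    """
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS games (
            game_url   TEXT PRIMARY KEY,
            away_team  TEXT,
            home_team  TEXT,
            away_score INTEGER,
            home_score INTEGER,
            status     TEXT
        )
    """)
    con.executemany("""
        INSERT INTO games (game_url, away_team, home_team, away_score, home_score, status)
        VALUES (:game_url, :away_team, :home_team, :away_score, :home_score, :status)
        ON CONFLICT(game_url) DO UPDATE SET
            away_score = excluded.away_score,
            home_score = excluded.home_score,
            status = excluded.status
    """, rows)
    con.commit()
    con.close()
```

Re-running an hourly scrape then updates in-progress games in place instead of duplicating them.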