Scrape Stock Prices and Financial Data with Python (Yahoo Finance) + ProxiesAPI

Yahoo Finance is one of the fastest ways to bootstrap a daily stock dataset without paying for a market data vendor.

In this guide we’ll scrape a small-but-useful set of fields from Yahoo Finance quote pages (price, change, market cap, PE, 52-week range, etc.), then store it as:

  • CSV (easy to inspect)
  • SQLite (easy to query, dedupe, and keep history)

We’ll also build the part that usually breaks in production: the fetch layer.

  • timeouts
  • retries + backoff
  • randomized delays
  • proxy rotation via ProxiesAPI

Yahoo Finance quote page (we’ll scrape key/value fields from the summary panel)

Keep your finance scrapers stable with ProxiesAPI

Quote pages and endpoints can get flaky at scale. ProxiesAPI gives you a cleaner network layer (rotation, retries, fewer blocks) so your daily dataset stays reliable.


What we’re scraping (Yahoo Finance structure)

A quote page looks like:

  • https://finance.yahoo.com/quote/AAPL/
  • https://finance.yahoo.com/quote/TSLA/

On the page you’ll see multiple “cards”. Two patterns matter:

  • Top header: current price + change
  • Key/Value tables: “Previous Close”, “Open”, “Market Cap”, “PE Ratio (TTM)”, etc.

These tables are usually rendered as rows with two columns (label/value). We’ll parse them by selecting rows and reading both cells.

Important note (honesty)

Yahoo Finance is not an official market data API. HTML structure can change and the site may rate-limit or block aggressive traffic.

This tutorial is for educational / internal use. If you need guaranteed data correctness and licensing, use a paid provider.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml python-dotenv

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for robust parsing
  • python-dotenv for local configuration

Create a .env (or set env vars in your CI):

PROXIESAPI_KEY="YOUR_KEY"

Step 1: Build a production-ish fetch layer

When scraping quote pages for multiple tickers every day, you’ll hit transient failures:

  • 429s / bot challenges
  • occasional 5xx
  • slow responses

A resilient fetch layer does:

  • timeouts (no hanging)
  • retries (but capped)
  • backoff (don’t hammer)
  • rotating proxy via ProxiesAPI

Below is a template you can reuse across many sites.

import os
import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from dotenv import load_dotenv

load_dotenv()

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

# You’ll need to confirm your exact ProxiesAPI endpoint + auth format.
# Many proxy services offer either:
# 1) a proxy URL like http(s)://USER:PASS@HOST:PORT
# 2) an API gateway URL you call with ?url=... and your key
#
# This example assumes ProxiesAPI provides a standard proxy endpoint.
PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")  # e.g. http://user:pass@gw.proxiesapi.com:10000

TIMEOUT = (10, 30)  # connect, read


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def make_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    })
    return s


def fetch(url: str, session: requests.Session, max_attempts: int = 4) -> FetchResult:
    last_exc: Optional[Exception] = None

    for attempt in range(1, max_attempts + 1):
        # jitter to look less “botty” and reduce burstiness
        time.sleep(random.uniform(0.8, 2.0))

        try:
            proxies = None
            if PROXY_URL:
                proxies = {"http": PROXY_URL, "https": PROXY_URL}

            r = session.get(url, timeout=TIMEOUT, proxies=proxies)

            # treat common temporary blocks as retryable
            if r.status_code in (403, 429, 500, 502, 503, 504):
                wait = min(10, 1.5 ** attempt) + random.uniform(0, 0.7)
                time.sleep(wait)
                continue

            r.raise_for_status()
            return FetchResult(url=url, status_code=r.status_code, text=r.text)

        except Exception as e:
            last_exc = e
            wait = min(10, 1.5 ** attempt) + random.uniform(0, 0.7)
            time.sleep(wait)

    raise RuntimeError(f"Failed to fetch {url} after {max_attempts} attempts") from last_exc

How ProxiesAPI fits

If you’re fetching 10 tickers/day from one machine, you might be fine without proxies.

If you’re fetching hundreds (or your IP gets flagged), ProxiesAPI is most useful as a drop-in network layer:

  • rotate IPs
  • isolate failures
  • reduce “one IP got blocked, everything breaks” events

Step 2: Parse quote data from HTML

Yahoo Finance’s HTML changes, so we’ll avoid overly brittle selectors.

Common stable patterns:

  • price in a prominent element near the header
  • key/value rows across “summary” tables

We’ll do two passes:

  • Parse key/value rows into a dictionary
  • Parse price + change separately
import re
from bs4 import BeautifulSoup


def normalize_space(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())


def parse_kv_tables(soup: BeautifulSoup) -> dict:
    """Return a dict like {"Previous Close": "189.12", "Market Cap": "287B", ...}."""
    out = {}

    # Many quote pages include rows where the label/value are adjacent cells.
    # We look for table rows and pull the first two cells.
    for row in soup.select("table tr"):
        cells = row.select("td")
        if len(cells) < 2:
            continue

        k = normalize_space(cells[0].get_text(" ", strip=True))
        v = normalize_space(cells[1].get_text(" ", strip=True))
        if not k or not v:
            continue

        # avoid extremely long text blobs
        if len(k) > 60 or len(v) > 120:
            continue

        out[k] = v

    return out


def parse_price_block(soup: BeautifulSoup) -> dict:
    """Try to extract current price and daily change."""

    text = soup.get_text("\n", strip=True)

    # Heuristic: price is typically the first large float near the top.
    # This is intentionally conservative; you should validate on 3–5 tickers.
    m = re.search(r"\b(\d{1,5}(?:,\d{3})*(?:\.\d+)?)\b", text)

    price = m.group(1) if m else None
    return {"Price": price}


def parse_quote_page(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    kv = parse_kv_tables(soup)
    price = parse_price_block(soup)

    # merge with price taking precedence
    return {**kv, **price}

Selector sanity check

Before you automate, do a 30-second manual check:

curl -L -s "https://finance.yahoo.com/quote/AAPL/" | head -n 20

If you’re getting a “consent” page or bot check HTML, that’s where:

  • headers
  • pacing
  • proxies

…make a difference.


Step 3: Turn tickers into daily rows

We’ll scrape multiple tickers and produce a normalized row per ticker/day.

from datetime import date


def scrape_ticker(ticker: str, session: requests.Session) -> dict:
    url = f"https://finance.yahoo.com/quote/{ticker}/"
    res = fetch(url, session=session)
    data = parse_quote_page(res.text)

    # A minimal schema (add fields as you need)
    return {
        "ticker": ticker.upper(),
        "as_of": date.today().isoformat(),
        "price": data.get("Price"),
        "previous_close": data.get("Previous Close"),
        "open": data.get("Open"),
        "day_range": data.get("Day's Range") or data.get("Day Range"),
        "week52_range": data.get("52 Week Range"),
        "market_cap": data.get("Market Cap"),
        "pe_ttm": data.get("PE Ratio (TTM)"),
        "eps_ttm": data.get("EPS (TTM)"),
        "volume": data.get("Volume"),
        "avg_volume": data.get("Avg. Volume"),
    }


def scrape_many(tickers: list[str]) -> list[dict]:
    s = make_session()
    rows = []
    for t in tickers:
        row = scrape_ticker(t, session=s)
        rows.append(row)
        print("scraped", t, "price", row.get("price"))
    return rows


if __name__ == "__main__":
    tickers = ["AAPL", "MSFT", "TSLA"]
    rows = scrape_many(tickers)
    print(rows)

Step 4: Export to CSV

import csv


def write_csv(rows: list[dict], path: str = "stocks_daily.csv") -> None:
    if not rows:
        return

    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            w.writerow(r)


# usage:
# write_csv(rows)

CSV is fine. SQLite is better for daily runs.

  • easy to append
  • easy to query “last 30 days”
  • dedupe with a unique constraint
import sqlite3


def init_db(db_path: str = "stocks.db") -> None:
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS stock_prices (
          ticker TEXT NOT NULL,
          as_of TEXT NOT NULL,
          price TEXT,
          previous_close TEXT,
          open TEXT,
          day_range TEXT,
          week52_range TEXT,
          market_cap TEXT,
          pe_ttm TEXT,
          eps_ttm TEXT,
          volume TEXT,
          avg_volume TEXT,
          PRIMARY KEY (ticker, as_of)
        );
        """
    )
    con.commit()
    con.close()


def upsert_rows(rows: list[dict], db_path: str = "stocks.db") -> None:
    con = sqlite3.connect(db_path)
    cur = con.cursor()

    cur.executemany(
        """
        INSERT OR REPLACE INTO stock_prices (
          ticker, as_of, price, previous_close, open, day_range, week52_range,
          market_cap, pe_ttm, eps_ttm, volume, avg_volume
        ) VALUES (
          :ticker, :as_of, :price, :previous_close, :open, :day_range, :week52_range,
          :market_cap, :pe_ttm, :eps_ttm, :volume, :avg_volume
        );
        """,
        rows,
    )

    con.commit()
    con.close()


# usage:
# init_db(); upsert_rows(rows)

QA checklist

  • Can you fetch 3 tickers without getting consent/bot HTML?
  • Are key fields present (Previous Close, Open, Market Cap) for at least 2 tickers?
  • Does your run complete in under 2 minutes for 50 tickers? (tune pacing)
  • Do retries recover from occasional 5xx/429?

Next upgrades

  • Incremental updates: skip tickers already scraped today
  • Structured typing: parse 2.87T into a numeric value
  • Scheduling: run at market close and store EOD snapshots
  • Alerting: notify when a ticker changes >X%

If you want to turn this into a reliable daily job, focus less on parsing tricks and more on the fetch layer + monitoring. That’s where ProxiesAPI earns its keep.

Keep your finance scrapers stable with ProxiesAPI

Quote pages and endpoints can get flaky at scale. ProxiesAPI gives you a cleaner network layer (rotation, retries, fewer blocks) so your daily dataset stays reliable.

Related guides

Scrape Live Stock Prices from Yahoo Finance (Python + ProxiesAPI)
Fetch Yahoo Finance quote pages via ProxiesAPI, parse price + change + market cap, and export clean rows to CSV. Includes selector rationale and a screenshot.
tutorial#python#yahoo-finance#stocks
Scrape Google Maps Business Listings with Python: Search → Place Details → Reviews (ProxiesAPI)
Extract local leads from Google Maps: search results → place details → reviews, with a resilient fetch pipeline and a screenshot-driven selector approach.
tutorial#python#google-maps#local-leads
Scrape Book Data from Goodreads (Titles, Authors, Ratings, and Reviews)
A practical Goodreads scraper in Python: collect book title/author/rating count/review count + key metadata using robust selectors, ProxiesAPI in the fetch layer, and export to JSON/CSV.
tutorial#python#goodreads#books
Scrape Restaurant Data from TripAdvisor (Reviews, Ratings, and Locations)
Build a practical TripAdvisor scraper in Python: discover restaurant listing URLs, extract name/rating/review count/address, and export clean CSV/JSON with ProxiesAPI in the fetch layer.
tutorial#python#web-scraping#beautifulsoup