Scrape Financial Data from Yahoo Finance

Yahoo Finance is still one of the fastest ways to assemble a practical market dataset for internal dashboards, side projects, and research.

On a single quote page you can usually find:

  • the quote header
  • current market price
  • day range / 52-week range
  • valuation fields like market cap or PE ratio
  • historical price tables linked from the same symbol

The catch is reliability. Yahoo Finance is not a public, officially supported scraping API. HTML layouts move around, some background endpoints are throttled, and repeated direct requests from a single IP can get rate-limited quickly.

That makes this a good example of where a simple, honest scraping stack matters:

  • parse the rendered quote page
  • normalize the fields you need
  • retry conservatively
  • add ProxiesAPI when your request volume grows

Yahoo Finance quote page for Microsoft

Keep finance scrapers steady with ProxiesAPI

Yahoo Finance pages are useful, but they do rate-limit and occasionally block aggressive requests. ProxiesAPI fits cleanly into the fetch layer so your scraper can retry politely without tying the whole workflow to one IP.


What we are scraping

For each ticker we want three buckets of data:

  1. Quote header

    • company name
    • ticker
    • current market price
  2. Summary statistics

    • previous close
    • open
    • bid / ask
    • market cap
    • PE ratio
    • 52-week range
  3. Historical rows

    • date
    • open / high / low / close
    • adjusted close
    • volume

In practice, quote headers and summary stats usually come from the HTML quote page, while historical rows are easiest to export from the symbol's history page once you already know which symbols you care about.


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas python-dotenv

Create a .env file:

PROXIESAPI_PROXY_URL="http://USER:PASS@gateway.example:9000"

PROXIESAPI_PROXY_URL is intentionally generic here because teams wire ProxiesAPI in different ways. The scraper below expects a standard proxy URL that requests can pass as both the HTTP and HTTPS proxy.
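
As a minimal sketch of how that single URL maps onto the proxies argument that requests accepts, the helper below validates the value and builds the mapping. The name build_proxies is hypothetical and not part of the scraper that follows; the URL is the same placeholder used in the .env example.

```python
from typing import Optional
from urllib.parse import urlsplit


def build_proxies(proxy_url: Optional[str]) -> Optional[dict]:
    """Return a requests-style proxies mapping, or None when no proxy is set."""
    if not proxy_url:
        return None
    parts = urlsplit(proxy_url)
    # Reject values without an http(s) scheme or a host. The URL itself is
    # kept out of the error message because it may contain credentials.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("proxy URL must look like http://USER:PASS@host:port")
    # The same gateway handles both plain and TLS traffic.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("http://USER:PASS@gateway.example:9000")
no_proxy = build_proxies(None)
print(proxies)
print(no_proxy)
```

Failing early on a malformed value is a design choice: a typo in .env then surfaces at startup rather than as a confusing connection error in the middle of a batch.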


Step 1: Build a fetch layer that can survive rate limits

import os
import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from dotenv import load_dotenv

load_dotenv()

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
TIMEOUT = (10, 30)


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def make_session() -> requests.Session:
    session = requests.Session()
    session.headers.update(
        {
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/125.0 Safari/537.36"
            ),
            "Accept-Language": "en-US,en;q=0.9",
        }
    )
    return session


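def fetch(url: str, session: requests.Session, attempts: int = 4) -> FetchResult:
    """GET a URL with jittered pacing and capped exponential backoff."""
    last_error: Optional[Exception] = None
    proxies = {"http": PROXY_URL, "https": PROXY_URL} if PROXY_URL else None

    for attempt in range(1, attempts + 1):
        # Short random pause before every request so calls are not evenly spaced.
        time.sleep(random.uniform(0.8, 1.8))
        try:
            response = session.get(url, timeout=TIMEOUT, proxies=proxies)
        except requests.RequestException as exc:
            # Network-level failure: timeout, connection reset, proxy error.
            last_error = exc
        else:
            if response.status_code not in (403, 429, 500, 502, 503, 504):
                # Other error codes (such as 404) fail fast instead of retrying.
                response.raise_for_status()
                return FetchResult(url=url, status_code=response.status_code, text=response.text)
            # Record the retryable status so the final error explains what happened.
            last_error = requests.HTTPError(
                f"HTTP {response.status_code} for {url}", response=response
            )

        if attempt < attempts:
            # Capped exponential backoff with jitter before the next attempt.
            time.sleep(min(10, 1.6 ** attempt) + random.uniform(0, 0.5))

    raise RuntimeError(f"Failed to fetch {url} after {attempts} attempts") from last_error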

This is where ProxiesAPI belongs. It does not magically fix selectors. It gives the fetch layer a cleaner network path so retries are less likely to come from the exact same IP that just got throttled.
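
To make the retry pacing concrete, here is a small sketch that prints the base wait produced by the same min(10, 1.6 ** attempt) expression used in fetch, before the random jitter is added. The function name backoff_base is hypothetical and exists only for this illustration.

```python
def backoff_base(attempt: int, factor: float = 1.6, cap: float = 10.0) -> float:
    """Base wait in seconds for a given attempt number, before jitter."""
    return min(cap, factor ** attempt)


# Attempts 1 through 6; the cap takes over from the fifth attempt onward.
schedule = [round(backoff_base(n), 2) for n in range(1, 7)]
print(schedule)  # [1.6, 2.56, 4.1, 6.55, 10.0, 10.0]
```

With the default of four attempts the cap is never reached, so raising attempts mainly adds 10-second waits rather than ever-longer ones.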


Step 2: Parse the quote page

The quote page URL pattern is simple:

def quote_url(symbol: str) -> str:
    return f"https://finance.yahoo.com/quote/{symbol}/"

Now parse the HTML returned by the fetch layer:

import re
from bs4 import BeautifulSoup


def clean(text: str) -> str:
    return re.sub(r"\s+", " ", (text or "").strip())


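def parse_quote_page(html: str) -> dict:
    """Extract the name, price, and summary statistics from a quote page."""
    soup = BeautifulSoup(html, "lxml")

    # First <h1> on the page; used as the company name.
    heading = soup.select_one("h1")
    # Price element, matched by its data-field attribute rather than by CSS class.
    price = soup.select_one('fin-streamer[data-field="regularMarketPrice"]')

    # Collect every table row with at least two cells as a label -> value pair.
    stats = {}
    for row in soup.select("table tr"):
        cells = row.select("td")
        if len(cells) < 2:
            continue
        label = clean(cells[0].get_text(" ", strip=True))
        value = clean(cells[1].get_text(" ", strip=True))
        if label and value:
            stats[label] = value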

    return {
        "name": clean(heading.get_text(" ", strip=True)) if heading else None,
        "price": clean(price.get_text(" ", strip=True)) if price else None,
        "previous_close": stats.get("Previous Close"),
        "open": stats.get("Open"),
        "bid": stats.get("Bid"),
        "ask": stats.get("Ask"),
        "days_range": stats.get("Day's Range"),
        "fifty_two_week_range": stats.get("52 Week Range"),
        "volume": stats.get("Volume"),
        "avg_volume": stats.get("Avg. Volume"),
        "market_cap": stats.get("Market Cap"),
        "beta": stats.get("Beta (5Y Monthly)"),
        "pe_ratio": stats.get("PE Ratio (TTM)"),
        "eps": stats.get("EPS (TTM)"),
        "earnings_date": stats.get("Earnings Date"),
        "target_est": stats.get("1y Target Est"),
    }

The main trick is to avoid overfitting to one exact card container. Yahoo often keeps the row pattern stable even when it moves sections around.
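
The parser returns every field as a display string. As a hedged sketch of the normalization step, the helpers below convert abbreviated numbers and ranges into floats. The function names are hypothetical, the sample values are illustrative rather than real quotes, and the K/M/B/T suffix convention and the " - " range separator are assumptions about how the page formats these fields.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed suffix convention for abbreviated figures such as market cap.
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_number(text: Optional[str]) -> Optional[float]:
    """Convert strings like '3.12T', '18,204,511' or '35.1' to a float."""
    if not text:
        return None
    raw = text.replace(",", "").strip()
    multiplier = 1
    if raw and raw[-1].upper() in SUFFIXES:
        multiplier = SUFFIXES[raw[-1].upper()]
        raw = raw[:-1]
    try:
        # Decimal avoids float rounding noise when scaling by the suffix.
        return float(Decimal(raw) * multiplier)
    except InvalidOperation:
        return None  # placeholders such as '--' or 'N/A'


def parse_range(text: Optional[str]) -> Optional[tuple]:
    """Split a range like '344.79 - 468.35' into a (low, high) tuple."""
    if not text or " - " not in text:
        return None
    low, high = (parse_number(part) for part in text.split(" - ", 1))
    if low is None or high is None:
        return None
    return (low, high)


print(parse_number("3.12T"))           # 3120000000000.0
print(parse_number("18,204,511"))      # 18204511.0
print(parse_range("344.79 - 468.35"))  # (344.79, 468.35)
```

Returning None for anything unparseable keeps a single odd field from aborting a whole row; those gaps can then be counted during validation.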


Step 3: Pull historical rows and export to CSV

Yahoo’s historical price pages are easier to scrape if you already know the symbol list and only need a small number of rows per ticker.

import pandas as pd


def history_url(symbol: str) -> str:
    return f"https://finance.yahoo.com/quote/{symbol}/history/"


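def parse_history_table(html: str) -> list[dict]:
    """Return one dict per full price row in the history table."""
    soup = BeautifulSoup(html, "lxml")
    rows = []

    for row in soup.select("table tbody tr"):
        cells = [clean(td.get_text(" ", strip=True)) for td in row.select("td")]
        # Keep only full seven-column price rows; shorter rows (such as
        # dividend or split notices) are skipped.
        if len(cells) != 7:
            continue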
        rows.append(
            {
                "date": cells[0],
                "open": cells[1],
                "high": cells[2],
                "low": cells[3],
                "close": cells[4],
                "adj_close": cells[5],
                "volume": cells[6],
            }
        )

    return rows


def scrape_symbol(symbol: str, session: requests.Session) -> tuple[dict, list[dict]]:
    quote = parse_quote_page(fetch(quote_url(symbol), session).text)
    history = parse_history_table(fetch(history_url(symbol), session).text)
    quote["symbol"] = symbol
    return quote, history


def export(symbols: list[str], out_csv: str = "yahoo_finance_quotes.csv") -> None:
    session = make_session()
    quote_rows = []

    for symbol in symbols:
        quote, history = scrape_symbol(symbol, session)
        quote["history_rows"] = len(history)
        quote_rows.append(quote)

        pd.DataFrame(history).to_csv(f"{symbol.lower()}_history.csv", index=False)

    pd.DataFrame(quote_rows).to_csv(out_csv, index=False)


if __name__ == "__main__":
    export(["MSFT", "AAPL", "NVDA"])

That produces:

  • one summary CSV with the latest quote metadata
  • one historical CSV per symbol
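
The historical rows are written exactly as scraped, so every column is still text. If typed values are needed downstream, a conversion step along these lines could run before the CSV is written. This is a sketch under assumptions: normalize_history_row is a hypothetical helper, the sample row is illustrative, and the "Jun 14, 2024" date format is an assumption about how the table renders dates.

```python
from datetime import datetime

PRICE_FIELDS = ("open", "high", "low", "close", "adj_close")


def normalize_history_row(row: dict, date_format: str = "%b %d, %Y") -> dict:
    """Convert one scraped history row from display strings to typed values."""
    # ISO dates sort correctly as plain text and load cleanly into pandas.
    out = {"date": datetime.strptime(row["date"], date_format).date().isoformat()}
    for field in PRICE_FIELDS:
        out[field] = float(row[field].replace(",", ""))
    out["volume"] = int(row["volume"].replace(",", ""))
    return out


sample = {
    "date": "Jun 14, 2024",
    "open": "438.28",
    "high": "443.14",
    "low": "436.72",
    "close": "442.57",
    "adj_close": "441.06",
    "volume": "13,582,000",
}
normalized = normalize_history_row(sample)
print(normalized)
```

A row with a placeholder such as "-" in any numeric column raises ValueError here, which is deliberate: it is easier to notice and handle a rejected row than a silently wrong number.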

Practical notes for production

1. Expect partial failures

Some symbols will succeed while others get rate-limited. Log the failures separately and retry them later instead of killing the whole batch.
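
One way to structure that, sketched with a hypothetical run_batch helper and a stand-in scrape function that simulates a rate-limited symbol. In a real run the callable would wrap scrape_symbol with a shared session.

```python
from typing import Callable


def run_batch(symbols: list, scrape_one: Callable[[str], dict]) -> tuple:
    """Scrape each symbol independently and collect failures separately."""
    results, failures = {}, {}
    for symbol in symbols:
        try:
            results[symbol] = scrape_one(symbol)
        except Exception as exc:
            # Record the error and move on so one bad symbol cannot stop the run.
            failures[symbol] = repr(exc)
    return results, failures


def fake_scrape(symbol: str) -> dict:
    """Stand-in for a real scrape; AAPL simulates a rate-limited request."""
    if symbol == "AAPL":
        raise RuntimeError("Failed to fetch (simulated 429)")
    return {"symbol": symbol}


results, failures = run_batch(["MSFT", "AAPL", "NVDA"], fake_scrape)
print(sorted(results))  # ['MSFT', 'NVDA']
print(list(failures))   # ['AAPL']
```

The keys of failures double as the retry list for a later pass.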

2. Validate labels across a sample set

Before trusting the scraper, test 5 to 10 symbols across:

  • mega-cap stocks
  • ETFs
  • ADRs
  • one symbol with an upcoming earnings date

That helps catch empty fields early.
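
A lightweight check for this, as a sketch: list which expected fields came back empty for each symbol. The function name, the default set of required fields, and the sample quote are all illustrative choices, not part of the scraper above.

```python
def missing_fields(
    quote: dict,
    required: tuple = ("name", "price", "market_cap", "pe_ratio"),
) -> list:
    """Return the required field names that are absent, None, or empty."""
    return [field for field in required if not quote.get(field)]


# Illustrative parsed quote with one empty field.
sample_quote = {
    "symbol": "MSFT",
    "name": "Microsoft Corporation (MSFT)",
    "price": "442.57",
    "market_cap": "3.29T",
    "pe_ratio": None,
}
gaps = missing_fields(sample_quote)
print(gaps)  # ['pe_ratio']
```

Running this over the sample set shows whether a gap is specific to one instrument type or a sign that a label has changed for everything.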

3. Store raw HTML for debugging

When a selector breaks, having the original response saved locally is much faster than re-running blind.
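
A minimal sketch of that habit, using a hypothetical save_raw_html helper that writes each response to a timestamped file. The demo writes into a temporary directory so it leaves nothing behind; a real job would point it at a persistent folder.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_raw_html(symbol: str, html: str, directory: str = "raw_html") -> Path:
    """Write one response body to <directory>/<symbol>_<UTC timestamp>.html."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"{symbol.lower()}_{stamp}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw_html("MSFT", "<html>sample</html>", directory=tmp)
    round_trip = saved.read_text(encoding="utf-8")

print(saved.name)
print(round_trip)
```

Saved files can be fed straight back into parse_quote_page or parse_history_table, so a selector fix can be tested without sending another request.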

4. Be honest about the source

Yahoo Finance is great for internal monitoring and exploratory datasets. If you need contractual uptime, licensing guarantees, or intraday precision, use a paid market-data vendor instead.


When ProxiesAPI is worth it

If you scrape 3 symbols once a day, you might not need a proxy layer at all.

You should consider ProxiesAPI when:

  • you scrape hundreds of quote pages per run
  • jobs run from CI or one static cloud IP
  • 429s start appearing regularly
  • retries from the same IP keep failing

In that setup, ProxiesAPI is not “the scraper.” It is the infrastructure that makes your scraper less fragile.


Wrap-up

The reliable Yahoo Finance workflow is simple:

  • fetch quote pages conservatively
  • parse the table rows you actually need
  • export normalized CSV files
  • add ProxiesAPI when scaling the request volume makes one-IP scraping unreliable

That gives you a useful finance dataset without pretending Yahoo Finance is a formal API product.
