Scrape Financial Data from Yahoo Finance

Jun 12, 2026 · tutorial · #python, #yahoo-finance, #web-scraping, #csv, #beautifulsoup, #proxies

Yahoo Finance is still one of the fastest ways to assemble a practical market dataset for internal dashboards, side projects, and research.

A single quote page usually includes:

- the quote header
- the current market price
- the day range and 52-week range
- valuation fields such as market cap and PE ratio
- a link to the historical price table for the same symbol

The catch is reliability. Yahoo Finance is not an officially supported public API for scraping. HTML layouts change, some background endpoints are throttled, and repeated requests from a single IP can hit rate limits quickly.

That makes this a good example of where a simple, honest scraping stack matters:

- parse the HTML of the quote page
- normalize the fields you need
- retry conservatively
- add ProxiesAPI when your request volume grows

Yahoo Finance quote page for Microsoft

Keep finance scrapers steady with ProxiesAPI

Yahoo Finance pages are useful, but they do rate-limit and occasionally block aggressive requests. ProxiesAPI fits cleanly into the fetch layer so your scraper can retry politely without tying the whole workflow to one IP.


What we are scraping

For each ticker we want three buckets of data:

Quote header
- company name
- ticker
- current market price
Summary statistics
- previous close
- open
- bid / ask
- market cap
- PE ratio
- 52-week range
Historical rows
- date
- open / high / low / close
- adjusted close
- volume

In practice, quote headers and summary stats usually come from the HTML quote page, while historical rows are easiest to export from each symbol's history page once you already know which symbols you care about.

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas python-dotenv

Create a .env file:

PROXIESAPI_PROXY_URL="http://USER:PASS@gateway.example:9000"

PROXIESAPI_PROXY_URL is intentionally generic here because teams wire ProxiesAPI in different ways. The scraper below expects a standard proxy URL that requests can pass as both the HTTP and HTTPS proxy.
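Before wiring the variable into the scraper, it can help to validate it once and to keep credentials out of logs. The sketch below is one way to do that with the standard library only. The helper names `proxies_from_env` and `mask_credentials` are hypothetical, and the restriction to `http`/`https` schemes is an assumption that matches the example URL above, not a ProxiesAPI requirement.

```python
import os
from typing import Optional
from urllib.parse import urlsplit


def proxies_from_env(var: str = "PROXIESAPI_PROXY_URL") -> Optional[dict]:
    """Return a requests-style proxies mapping, or None when the variable is unset."""
    url = os.getenv(var)
    if not url:
        return None
    parts = urlsplit(url)
    # Assumption: only plain http/https proxy URLs are expected here.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"{var} does not look like a proxy URL")
    # The same URL is used for both plain and TLS traffic.
    return {"http": url, "https": url}


def mask_credentials(url: str) -> str:
    """Hide the user:password part of a proxy URL so it is safe to log."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    if parts.username:
        return f"{parts.scheme}://***@{host}"
    return f"{parts.scheme}://{host}"
```

Logging `mask_credentials(url)` at startup confirms which gateway is in use without leaking the password into log files.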

Step 1: Build a fetch layer that can survive rate limits

import os
import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from dotenv import load_dotenv

load_dotenv()

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
TIMEOUT = (10, 30)


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def make_session() -> requests.Session:
    session = requests.Session()
    session.headers.update(
        {
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/125.0 Safari/537.36"
            ),
            "Accept-Language": "en-US,en;q=0.9",
        }
    )
    return session


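RETRYABLE_STATUSES = {403, 429, 500, 502, 503, 504}


def fetch(url: str, session: requests.Session, attempts: int = 4) -> FetchResult:
    last_error: Optional[Exception] = None
    proxies = {"http": PROXY_URL, "https": PROXY_URL} if PROXY_URL else None

    for attempt in range(1, attempts + 1):
        # Small jittered pause before every request to keep the request rate polite.
        time.sleep(random.uniform(0.8, 1.8))
        try:
            response = session.get(url, timeout=TIMEOUT, proxies=proxies)
        except requests.RequestException as exc:
            # Network-level failure (timeout, connection reset, proxy error): retry.
            last_error = exc
        else:
            if response.status_code not in RETRYABLE_STATUSES:
                # Other errors such as 404 raise immediately instead of being retried.
                response.raise_for_status()
                return FetchResult(url=url, status_code=response.status_code, text=response.text)
            last_error = requests.HTTPError(
                f"HTTP {response.status_code} for {url}", response=response
            )

        if attempt < attempts:
            # Exponential backoff with jitter, capped at 10 seconds.
            time.sleep(min(10, 1.6 ** attempt) + random.uniform(0, 0.5))

    raise RuntimeError(f"Failed to fetch {url} after {attempts} attempts") from last_error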

This is where ProxiesAPI belongs. It does not fix selectors. It routes requests through a proxy, so a retry is less likely to leave from the same IP that was just throttled.
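To see how conservative the retry policy is, it helps to print the wait times produced by the `min(10, 1.6 ** attempt)` expression in `fetch`. The helper name `backoff_schedule` below is hypothetical. It only reproduces the base wait that follows each failed attempt and leaves out the random jitter.

```python
def backoff_schedule(attempts: int = 4, base: float = 1.6, cap: float = 10.0) -> list[float]:
    """Base wait in seconds following failed attempt 1..attempts, before jitter."""
    return [min(cap, base ** attempt) for attempt in range(1, attempts + 1)]


for attempt, wait in enumerate(backoff_schedule(), start=1):
    print(f"attempt {attempt}: wait about {wait:.2f}s plus up to 0.5s of jitter")
```

With the defaults the base waits are roughly 1.6, 2.6, 4.1 and 6.6 seconds. The 10-second cap only starts to apply from a fifth attempt, since 1.6 ** 5 is about 10.5.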

Step 2: Parse the quote page

The quote page URL pattern is simple:

def quote_url(symbol: str) -> str:
    return f"https://finance.yahoo.com/quote/{symbol}/"
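Some tickers contain characters that are not URL-safe, for example a leading caret on index symbols. A minimal sketch of a more defensive URL builder follows. The name `safe_quote_url` is hypothetical, and trimming plus upper-casing the input is an assumption about how symbols reach the scraper, not something Yahoo documents.

```python
from urllib.parse import quote


def safe_quote_url(symbol: str) -> str:
    """Build a quote URL with the ticker trimmed, upper-cased and percent-encoded."""
    ticker = symbol.strip().upper()
    if not ticker:
        raise ValueError("symbol must not be empty")
    # safe="" percent-encodes every reserved character, including "^" and "=".
    return f"https://finance.yahoo.com/quote/{quote(ticker, safe='')}/"
```

Plain tickers such as `MSFT` or `BRK-B` pass through unchanged, because letters, digits and hyphens are never percent-encoded.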

Now parse the returned HTML:

import re
from bs4 import BeautifulSoup


def clean(text: str) -> str:
    return re.sub(r"\s+", " ", (text or "").strip())


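def parse_quote_page(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # The first <h1> is treated as the company name; the price is read from the
    # first <fin-streamer> element tagged with the regularMarketPrice field.
    heading = soup.select_one("h1")
    price = soup.select_one('fin-streamer[data-field="regularMarketPrice"]')

    # Collect every two-cell table row as a label -> value pair, regardless of
    # which card or section the table sits in.
    stats = {}
    for row in soup.select("table tr"):
        cells = row.select("td")
        if len(cells) < 2:
            continue
        label = clean(cells[0].get_text(" ", strip=True))
        value = clean(cells[1].get_text(" ", strip=True))
        if label and value:
            stats[label] = value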

    return {
        "name": clean(heading.get_text(" ", strip=True)) if heading else None,
        "price": clean(price.get_text(" ", strip=True)) if price else None,
        "previous_close": stats.get("Previous Close"),
        "open": stats.get("Open"),
        "bid": stats.get("Bid"),
        "ask": stats.get("Ask"),
        "days_range": stats.get("Day's Range"),
        "fifty_two_week_range": stats.get("52 Week Range"),
        "volume": stats.get("Volume"),
        "avg_volume": stats.get("Avg. Volume"),
        "market_cap": stats.get("Market Cap"),
        "beta": stats.get("Beta (5Y Monthly)"),
        "pe_ratio": stats.get("PE Ratio (TTM)"),
        "eps": stats.get("EPS (TTM)"),
        "earnings_date": stats.get("Earnings Date"),
        "target_est": stats.get("1y Target Est"),
    }

The main trick is to avoid overfitting to one exact card container. The parser collects every label/value row on the page and looks fields up by label, and Yahoo often keeps that row pattern stable even when it moves sections around.
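The values returned by `parse_quote_page` are still display strings, so the "normalize the fields you need" step deserves a concrete shape. The sketch below is one possible approach, not the only one. The helper names (`to_number`, `to_range`, `normalize_quote`) are hypothetical, and the input formats (thousands separators, `K`/`M`/`B`/`T` suffixes, `low - high` ranges) are assumptions about how the page prints numbers. Anything that does not match, such as `N/A` or a bid shown as `412.10 x 800`, becomes `None`.

```python
import re
from typing import Optional

_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
NUMERIC_FIELDS = ("price", "previous_close", "open", "volume", "avg_volume",
                  "market_cap", "beta", "pe_ratio", "eps", "target_est")
RANGE_FIELDS = ("days_range", "fifty_two_week_range")


def to_number(text: Optional[str]) -> Optional[float]:
    """Convert strings like '1,234.5' or '3.06T' to a float; None if not numeric."""
    if not text:
        return None
    match = re.fullmatch(r"(-?[\d,]*\.?\d+)\s*([KMBT]?)", text.strip(), flags=re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    return value * _SUFFIXES.get(match.group(2).upper(), 1.0)


def to_range(text: Optional[str]) -> Optional[tuple[float, float]]:
    """Split a 'low - high' string such as '410.00 - 415.50' into two floats."""
    if not text:
        return None
    parts = re.split(r"\s+-\s+", text.strip())
    if len(parts) != 2:
        return None
    low, high = to_number(parts[0]), to_number(parts[1])
    if low is None or high is None:
        return None
    return (low, high)


def normalize_quote(raw: dict) -> dict:
    """Return a copy of a parsed quote with numeric fields typed and ranges split."""
    out = dict(raw)
    for key in NUMERIC_FIELDS:
        out[key] = to_number(raw.get(key))
    for key in RANGE_FIELDS:
        bounds = to_range(raw.get(key))
        out[f"{key}_low"], out[f"{key}_high"] = bounds if bounds else (None, None)
    return out
```

Keeping the raw parse and the normalization separate means a layout change only breaks the parser, while a formatting change only breaks the converters.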

Step 3: Pull historical rows and export to CSV

Yahoo’s historical price pages are easier to scrape if you already know the symbol list and only need a small number of rows per ticker.
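The cells in the history table are text as well, so a typing step is usually needed before any analysis. The sketch below assumes the row keys used by the history parser in this step (`date`, `open`, `high`, `low`, `close`, `adj_close`, `volume`) and assumes dates printed like `Jun 11, 2026` under an English locale. Both the date format and the helper names are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional


def to_float(text: str) -> Optional[float]:
    """Parse '1,010.50' style numbers; return None for placeholders such as '-'."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def type_history_row(row: dict) -> dict:
    """Convert one row of strings into a date, five floats and an integer volume."""
    # Assumption: dates look like 'Jun 11, 2026' (abbreviated English month name).
    typed = {"date": datetime.strptime(row["date"], "%b %d, %Y").date()}
    for key in ("open", "high", "low", "close", "adj_close"):
        typed[key] = to_float(row[key])
    volume = to_float(row["volume"])
    typed["volume"] = int(volume) if volume is not None else None
    return typed
```

Applying `type_history_row` to each parsed row before building the DataFrame gives real date and numeric columns in the exported CSV.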

import pandas as pd


def history_url(symbol: str) -> str:
    return f"https://finance.yahoo.com/quote/{symbol}/history/"


def parse_history_table(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    rows = []

    for row in soup.select("table tbody tr"):
        cells = [clean(td.get_text(" ", strip=True)) for td in row.select("td")]
        if len(cells) != 7:
            continue
        rows.append(
            {
                "date": cells[0],
                "open": cells[1],
                "high": cells[2],
                "low": cells[3],
                "close": cells[4],
                "adj_close": cells[5],
                "volume": cells[6],
            }
        )

    return rows


def scrape_symbol(symbol: str, session: requests.Session) -> tuple[dict, list[dict]]:
    quote = parse_quote_page(fetch(quote_url(symbol), session).text)
    history = parse_history_table(fetch(history_url(symbol), session).text)
    quote["symbol"] = symbol
    return quote, history


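def export(symbols: list[str], out_csv: str = "yahoo_finance_quotes.csv") -> None:
    session = make_session()
    quote_rows = []

    for symbol in symbols:
        try:
            quote, history = scrape_symbol(symbol, session)
        except (RuntimeError, requests.RequestException) as exc:
            # One failing ticker should not abort the whole batch.
            print(f"Skipping {symbol}: {exc}")
            continue

        quote["history_rows"] = len(history)
        quote_rows.append(quote)

        # Write one history file per symbol, but only when rows were parsed.
        if history:
            pd.DataFrame(history).to_csv(f"{symbol.lower()}_history.csv", index=False)

    # One summary row per symbol that was scraped successfully.
    pd.DataFrame(quote_rows).to_csv(out_csv, index=False)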


if __name__ == "__main__":
    export(["MSFT", "AAPL", "NVDA"])

That produces: