Scrape Stock Prices and Financial Data with Python (Step-by-Step)

May 02, 2026 · tutorial · #python, #stocks, #finance, #web-scraping, #csv, #sqlite, #requests, #datasets, #proxies

If you want stock prices and basic financial time series in a dataset you control, you don’t need a Bloomberg terminal to start.

In this tutorial we’ll build a small-but-real pipeline in Python that:

downloads a symbol list (tickers)
fetches historical OHLCV for each symbol as CSV
normalizes columns + dates
exports:
- a combined CSV (easy to inspect)
- a SQLite database (easy to query and keep history)

We’ll do it against Stooq (a public market data site with simple CSV endpoints). That keeps the focus on the engineering patterns you’ll reuse everywhere:

resilient network layer (timeouts, retries, backoff)
polite pacing
checkpointing so reruns don’t restart from scratch
optional proxy routing with ProxiesAPI

Stooq daily data download page (we’ll use the CSV endpoints behind this)

Keep your data pipelines stable with ProxiesAPI

When your daily dataset grows (more symbols, more downloads), transient 429/5xx and network flakiness show up fast. ProxiesAPI helps you keep the fetch layer consistent with a cleaner proxy + retry strategy.

Get 1,000 free API calls View pricing

What we’re scraping (and why Stooq)

Stooq exposes historical price data through simple CSV URLs like:

https://stooq.com/q/d/l/?s=aapl.us&i=d (AAPL daily)
https://stooq.com/q/d/l/?s=tsla.us&i=d (TSLA daily)

Where:

s= is the symbol (Stooq uses lowercase, and U.S. equities often end with .us)
i= is interval:
- d = daily
- w = weekly
- m = monthly

The response is a plain CSV:

Date,Open,High,Low,Close,Volume
2026-04-24,....

That’s ideal for a tutorial because we can build a dataset builder without brittle HTML parsing.

Important note (honesty)

Stooq is not an official exchange feed, and any public data source can change terms or availability.

If you need licensed, guaranteed market data for a commercial product, use a paid provider. This guide is for learning / internal datasets.

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests pandas tenacity python-dotenv

We’ll use:

requests for HTTP
pandas for parsing and merging CSVs
tenacity for clean retries with backoff
python-dotenv for local env vars

Create .env:

PROXIESAPI_PROXY_URL="http://user:pass@gw.proxiesapi.com:10000"

Why this format? Most proxy services (including ProxiesAPI-style setups) provide a standard proxy gateway (host/port + credentials). requests can route through it with proxies={"http": ..., "https": ...}.

If your ProxiesAPI account uses a different integration (API gateway URL you call with ?url=), keep the same retry and timeout patterns and swap the session.get() implementation.

Step 1: A resilient fetch layer (timeouts + retries + optional ProxiesAPI)

Scraping (and downloading CSVs) fails in boring ways:

network timeouts
intermittent 5xx
temporary rate limits

A production-ish fetch layer should:

always set timeouts
retry a small number of times
back off between attempts
optionally route through proxies

import os
import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type

load_dotenv()

PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")

TIMEOUT = (10, 30)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)"
})


@dataclass
class FetchConfig:
    proxiesapi_proxy_url: Optional[str] = None
    min_delay_s: float = 0.4
    max_delay_s: float = 1.2


class FetchError(RuntimeError):
    pass


def _proxies(cfg: FetchConfig):
    if not cfg.proxiesapi_proxy_url:
        return None
    return {
        "http": cfg.proxiesapi_proxy_url,
        "https": cfg.proxiesapi_proxy_url,
    }


@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=12),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch_bytes(url: str, cfg: FetchConfig) -> bytes:
    # polite jitter
    time.sleep(random.uniform(cfg.min_delay_s, cfg.max_delay_s))

    try:
        r = session.get(url, timeout=TIMEOUT, proxies=_proxies(cfg))
    except requests.RequestException as e:
        raise FetchError(f"request failed: {e}")

    if r.status_code >= 500:
        raise FetchError(f"server error {r.status_code}")

    if r.status_code == 429:
        # retry with backoff (tenacity handles waiting)
        raise FetchError("rate limited (429)")

    r.raise_for_status()
    return r.content

This is intentionally reusable. You can drop it into almost any scraper.

Step 2: Decide the symbols to download

Stooq has many assets (equities, ETFs, indices). For a tutorial dataset we’ll pick a small set of U.S. tickers.

You can hardcode a watchlist:

SYMBOLS = [
    "aapl.us",
    "msft.us",
    "goog.us",
    "amzn.us",
    "tsla.us",
]

Or load from a file (recommended):

from pathlib import Path

symbols = [
    s.strip().lower()
    for s in Path("symbols.txt").read_text(encoding="utf-8").splitlines()
    if s.strip() and not s.strip().startswith("#")
]

Example symbols.txt:

# US equities (Stooq uses .us)
aapl.us
msft.us
tsla.us

Step 3: Download historical OHLCV CSV for each symbol

Stooq historical endpoint:

https://stooq.com/q/d/l/?s={symbol}&i=d

We’ll download each symbol and parse into a DataFrame.

import io
import pandas as pd


def stooq_hist_url(symbol: str, interval: str = "d") -> str:
    return f"https://stooq.com/q/d/l/?s={symbol}&i={interval}"


def download_symbol_df(symbol: str, cfg: FetchConfig) -> pd.DataFrame:
    url = stooq_hist_url(symbol, interval="d")
    raw = fetch_bytes(url, cfg)

    df = pd.read_csv(io.BytesIO(raw))

    # normalize
    df.columns = [c.strip().lower() for c in df.columns]
    df["date"] = pd.to_datetime(df["date"], errors="coerce")

    # Some sources return '-' or empty values; coerce numerics
    for col in ["open", "high", "low", "close", "volume"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    df["symbol"] = symbol

    # drop rows with missing date
    df = df.dropna(subset=["date"]).sort_values("date")
    return df

Step 4: Merge into a dataset + write CSV

Now let’s build a combined table:

Columns:

date
symbol
open, high, low, close, volume

from pathlib import Path


def build_dataset(symbols: list[str], cfg: FetchConfig) -> pd.DataFrame:
    frames = []
    for i, sym in enumerate(symbols, start=1):
        try:
            df = download_symbol_df(sym, cfg)
            frames.append(df)
            print(f"[{i}/{len(symbols)}] {sym}: {len(df)} rows")
        except Exception as e:
            print(f"[{i}/{len(symbols)}] {sym}: FAILED: {e}")

    if not frames:
        raise RuntimeError("no data downloaded")

    out = pd.concat(frames, ignore_index=True)
    out = out[["date", "symbol", "open", "high", "low", "close", "volume"]]
    return out


if __name__ == "__main__":
    cfg = FetchConfig(proxiesapi_proxy_url=PROXIESAPI_PROXY_URL)
    symbols = ["aapl.us", "msft.us", "goog.us", "amzn.us", "tsla.us"]

    ds = build_dataset(symbols, cfg)
    Path("out").mkdir(exist_ok=True)

    csv_path = Path("out/stocks_ohlcv.csv")
    ds.to_csv(csv_path, index=False)
    print("wrote", csv_path, "rows", len(ds))

Step 5: Store in SQLite (so you can keep history)

CSV is great for inspection, but SQLite is perfect for:

incremental updates
deduping
querying by date ranges

import sqlite3


def write_sqlite(df: pd.DataFrame, path: str = "out/stocks.sqlite"):
    conn = sqlite3.connect(path)
    try:
        # write table
        df.to_sql("prices", conn, if_exists="append", index=False)

        # add an index (run once; ignore errors if it exists)
        try:
            conn.execute("CREATE INDEX idx_prices_symbol_date ON prices(symbol, date)")
        except Exception:
            pass

        # optional dedupe: keep distinct rows
        conn.execute(
            "CREATE TABLE IF NOT EXISTS prices_dedup AS SELECT DISTINCT * FROM prices"
        )
        conn.commit()
    finally:
        conn.close()

A more “correct” dedupe strategy is to use a table with a unique constraint and INSERT OR REPLACE, but the above keeps the tutorial straightforward.

Step 6: QA checklist (what to verify)

Before you trust a dataset, do quick sanity checks:

Spot-check one symbol’s first and last date
Ensure volume is numeric (not strings)
Confirm no duplicated (symbol, date) rows (if you plan to append daily)
Verify the CSV endpoint hasn’t changed (a sudden 0-row download is a red flag)

Example quick check:

print(ds.groupby("symbol")["date"].agg(["min", "max", "count"]))

Where ProxiesAPI fits (honestly)

For Stooq, you can usually download CSVs without a proxy.

But the pattern you built matters: when you scale to many symbols or switch to harder targets (investor sites, filings, portals, JS-heavy pages), your failure rate climbs.

ProxiesAPI helps you keep the network layer stable:

proxy rotation so you don’t concentrate traffic on one IP
fewer transient blocks when volume increases
cleaner separation between your scraper logic and “how requests get to the site”