HTML Table Scraping with Python: Parse Tables into CSV Reliably

Jun 22, 2026 · guides · #html table scraping python, #python, #pandas, #beautifulsoup, #web-scraping, #csv

If you search for html table scraping python, most tutorials show the easy case:

import pandas as pd
tables = pd.read_html(url)

And to be fair, that works surprisingly often.

But the real-world problem is not “can Python read a table?” The real problem is:

the page has six tables
headers are duplicated or merged
rows contain notes and footnotes
the site returns partial HTML or challenge pages
one table loads fine, another is malformed

So the reliable workflow is not one function. It is a sequence:

fetch the real HTML
identify the right table
parse with read_html() first
fall back to BeautifulSoup when the table is messy
normalize headers and cells
validate before export

That is the workflow we’ll build here.

Tables are easy until the HTML gets messy

The trick is not just `pandas.read_html()`. It’s knowing when to switch to BeautifulSoup, how to validate the extracted table, and how to keep the fetch layer stable when your scraper moves beyond friendly sites.

Get 1,000 free API calls View pricing

When `pandas.read_html()` is enough

If the page has:

server-rendered HTML
real <table> markup
clean headers
predictable rows

then read_html() is the fastest route from webpage to DataFrame.

import pandas as pd

url = "https://example.com/table-page"
tables = pd.read_html(url)
print("tables found:", len(tables))
print(tables[0].head())

That is step one, not the whole solution.

The reliable workflow

Here is the process I recommend for production-ish table scraping:

Step	Goal	Tool
fetch	make sure the HTML is real	`requests`
inspect	count and label candidate tables	`BeautifulSoup`
parse	extract structured rows fast	`pandas.read_html()`
clean	fix headers, footnotes, blanks	`pandas`
validate	confirm shape and columns	custom checks
export	save usable CSV	`to_csv()`

This sequence is what keeps “works once” from turning into “works reliably”.

Step 1: Fetch the HTML yourself first

Do not pass a URL directly into read_html() until you know what the response actually looks like.

import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=(10, 30))
    response.raise_for_status()
    html = response.text

    if len(html) < 1000:
        raise ValueError(f"suspiciously short HTML: {len(html)} chars")

    return html

Why start here?

you can inspect status codes
you can detect challenge pages
you can save raw HTML for debugging
you keep transport separate from parsing

That separation matters even more when you later add retries or ProxiesAPI.

Step 2: Count the tables before you parse them

Use BeautifulSoup to inspect the page and understand what you are dealing with.

from bs4 import BeautifulSoup


def inspect_tables(html: str) -> None:
    soup = BeautifulSoup(html, "lxml")
    tables = soup.select("table")
    print("table count:", len(tables))

    for i, table in enumerate(tables[:10], start=1):
        headers = [th.get_text(" ", strip=True) for th in table.select("th")[:8]]
        print(f"table {i} headers:", headers)

This tells you:

how many tables exist
which one probably matters
whether the table has <th> headers at all

Without this step, you are often guessing.

Step 3: Parse with `pandas.read_html()`

Once you have the actual HTML string, pass that to read_html() instead of giving it the URL directly.

from io import StringIO
import pandas as pd


def parse_tables(html: str) -> list[pd.DataFrame]:
    return pd.read_html(StringIO(html))

Then choose the table by its columns, not just its position:

def choose_table(tables: list[pd.DataFrame], required_keywords: list[str]) -> pd.DataFrame:
    for df in tables:
        cols = [str(col) for col in df.columns]
        if all(any(keyword.lower() in col.lower() for col in cols) for keyword in required_keywords):
            return df.copy()
    raise ValueError("Could not find target table")

Example:

html = fetch_html("https://example.com/stats")
tables = parse_tables(html)
target = choose_table(tables, ["Country", "Population"])
print(target.head())

This is much safer than hard-coding “use table 0”.

Step 4: Flatten messy headers

Some tables come with multi-row or multi-index headers. That is where many “quick” scrapers fall apart.

A practical normalization pass:

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    if isinstance(out.columns, pd.MultiIndex):
        out.columns = [
            " ".join(str(part).strip() for part in col if str(part) != "nan").strip()
            for col in out.columns
        ]
    else:
        out.columns = [str(col).strip() for col in out.columns]

    out.columns = [
        col.lower().replace("\n", " ").replace("/", "_").replace("  ", " ").strip()
        for col in out.columns
    ]
    return out

This turns ugly column structures into something you can actually reference in code.

Step 5: Clean row values before export

Real tables often include:

blank rows
repeated header rows inside the body
footnote markers
extra whitespace

Here is a solid cleanup pass:

import re


def clean_cell(value):
    if pd.isna(value):
        return None
    text = str(value)
    text = re.sub(r"\[[^\]]+\]", "", text)   # remove footnotes like [1]
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out = out.dropna(how="all")

    for col in out.columns:
        out[col] = out[col].map(clean_cell)

    # remove repeated header rows accidentally parsed into the body
    first_col = out.columns[0]
    out = out[out[first_col] != first_col]

    return out.reset_index(drop=True)

This is where “parse into CSV reliably” actually happens.

Step 6: Fall back to BeautifulSoup when `read_html()` struggles

Sometimes read_html() misses the table or mangles the structure. Then it is better to parse the table manually.

def extract_table_with_bs4(html: str, table_selector: str) -> pd.DataFrame:
    soup = BeautifulSoup(html, "lxml")
    table = soup.select_one(table_selector)
    if table is None:
        raise ValueError(f"Table not found for selector: {table_selector}")

    headers = [th.get_text(" ", strip=True) for th in table.select("tr th")]
    rows = []

    for tr in table.select("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.select("td")]
        if not cells:
            continue
        rows.append(cells)

    if headers and rows and len(headers) == len(rows[0]):
        return pd.DataFrame(rows, columns=headers)

    return pd.DataFrame(rows)

Use this fallback when:

the table has broken markup
headers are inconsistent
read_html() says “No tables found”
the page mixes tables with layout elements

Step 7: Validate before writing CSV

Do not export blindly.

Validate:

expected columns exist
row count is plausible
key column is not mostly empty

def validate_table(df: pd.DataFrame, required_columns: list[str], min_rows: int = 1) -> None:
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    if len(df) < min_rows:
        raise ValueError(f"too few rows: {len(df)}")

    first_required = required_columns[0]
    if df[first_required].isna().all():
        raise ValueError(f"column {first_required} is entirely empty")

This protects you from silently shipping garbage CSVs.

Full example

import pandas as pd


def scrape_table_to_csv(url: str, required_keywords: list[str], out_path: str) -> pd.DataFrame:
    html = fetch_html(url)
    tables = parse_tables(html)
    df = choose_table(tables, required_keywords)
    df = flatten_columns(df)
    df = clean_table(df)
    validate_table(df, [required_keywords[0].lower()], min_rows=3)
    df.to_csv(out_path, index=False)
    return df


if __name__ == "__main__":
    df = scrape_table_to_csv(
        url="https://example.com/stats",
        required_keywords=["Country", "Population"],
        out_path="table_export.csv",
    )
    print(df.head(10).to_string(index=False))
    print("saved table_export.csv rows:", len(df))

That gives you one reusable pattern instead of a one-off notebook snippet.