HTML Table Scraping with Python: Parse Tables into CSV Reliably

If you search for html table scraping python, most tutorials show the easy case:

import pandas as pd
tables = pd.read_html(url)

And to be fair, that works surprisingly often.

But the real-world problem is not “can Python read a table?” The real problem is:

  • the page has six tables
  • headers are duplicated or merged
  • rows contain notes and footnotes
  • the site returns partial HTML or challenge pages
  • one table loads fine, another is malformed

So the reliable workflow is not one function. It is a sequence:

  1. fetch the real HTML
  2. identify the right table
  3. parse with read_html() first
  4. fall back to BeautifulSoup when the table is messy
  5. normalize headers and cells
  6. validate before export

That is the workflow we’ll build here.

Tables are easy until the HTML gets messy

The trick is not just `pandas.read_html()`. It’s knowing when to switch to BeautifulSoup, how to validate the extracted table, and how to keep the fetch layer stable when your scraper moves beyond friendly sites.


When pandas.read_html() is enough

If the page has:

  • server-rendered HTML
  • real <table> markup
  • clean headers
  • predictable rows

then read_html() is the fastest route from webpage to DataFrame.

import pandas as pd

url = "https://example.com/table-page"
tables = pd.read_html(url)
print("tables found:", len(tables))
print(tables[0].head())

That is step one, not the whole solution.


The reliable workflow

Here is the process I recommend for production-ish table scraping:

StepGoalTool
fetchmake sure the HTML is realrequests
inspectcount and label candidate tablesBeautifulSoup
parseextract structured rows fastpandas.read_html()
cleanfix headers, footnotes, blankspandas
validateconfirm shape and columnscustom checks
exportsave usable CSVto_csv()

This sequence is what keeps “works once” from turning into “works reliably”.


Step 1: Fetch the HTML yourself first

Do not pass a URL directly into read_html() until you know what the response actually looks like.

import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=(10, 30))
    response.raise_for_status()
    html = response.text

    if len(html) < 1000:
        raise ValueError(f"suspiciously short HTML: {len(html)} chars")

    return html

Why start here?

  • you can inspect status codes
  • you can detect challenge pages
  • you can save raw HTML for debugging
  • you keep transport separate from parsing

That separation matters even more when you later add retries or ProxiesAPI.


Step 2: Count the tables before you parse them

Use BeautifulSoup to inspect the page and understand what you are dealing with.

from bs4 import BeautifulSoup


def inspect_tables(html: str) -> None:
    soup = BeautifulSoup(html, "lxml")
    tables = soup.select("table")
    print("table count:", len(tables))

    for i, table in enumerate(tables[:10], start=1):
        headers = [th.get_text(" ", strip=True) for th in table.select("th")[:8]]
        print(f"table {i} headers:", headers)

This tells you:

  • how many tables exist
  • which one probably matters
  • whether the table has <th> headers at all

Without this step, you are often guessing.


Step 3: Parse with pandas.read_html()

Once you have the actual HTML string, pass that to read_html() instead of giving it the URL directly.

from io import StringIO
import pandas as pd


def parse_tables(html: str) -> list[pd.DataFrame]:
    return pd.read_html(StringIO(html))

Then choose the table by its columns, not just its position:

def choose_table(tables: list[pd.DataFrame], required_keywords: list[str]) -> pd.DataFrame:
    for df in tables:
        cols = [str(col) for col in df.columns]
        if all(any(keyword.lower() in col.lower() for col in cols) for keyword in required_keywords):
            return df.copy()
    raise ValueError("Could not find target table")

Example:

html = fetch_html("https://example.com/stats")
tables = parse_tables(html)
target = choose_table(tables, ["Country", "Population"])
print(target.head())

This is much safer than hard-coding “use table 0”.


Step 4: Flatten messy headers

Some tables come with multi-row or multi-index headers. That is where many “quick” scrapers fall apart.

A practical normalization pass:

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    if isinstance(out.columns, pd.MultiIndex):
        out.columns = [
            " ".join(str(part).strip() for part in col if str(part) != "nan").strip()
            for col in out.columns
        ]
    else:
        out.columns = [str(col).strip() for col in out.columns]

    out.columns = [
        col.lower().replace("\n", " ").replace("/", "_").replace("  ", " ").strip()
        for col in out.columns
    ]
    return out

This turns ugly column structures into something you can actually reference in code.


Step 5: Clean row values before export

Real tables often include:

  • blank rows
  • repeated header rows inside the body
  • footnote markers
  • extra whitespace

Here is a solid cleanup pass:

import re


def clean_cell(value):
    if pd.isna(value):
        return None
    text = str(value)
    text = re.sub(r"\[[^\]]+\]", "", text)   # remove footnotes like [1]
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out = out.dropna(how="all")

    for col in out.columns:
        out[col] = out[col].map(clean_cell)

    # remove repeated header rows accidentally parsed into the body
    first_col = out.columns[0]
    out = out[out[first_col] != first_col]

    return out.reset_index(drop=True)

This is where “parse into CSV reliably” actually happens.


Step 6: Fall back to BeautifulSoup when read_html() struggles

Sometimes read_html() misses the table or mangles the structure. Then it is better to parse the table manually.

def extract_table_with_bs4(html: str, table_selector: str) -> pd.DataFrame:
    soup = BeautifulSoup(html, "lxml")
    table = soup.select_one(table_selector)
    if table is None:
        raise ValueError(f"Table not found for selector: {table_selector}")

    headers = [th.get_text(" ", strip=True) for th in table.select("tr th")]
    rows = []

    for tr in table.select("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.select("td")]
        if not cells:
            continue
        rows.append(cells)

    if headers and rows and len(headers) == len(rows[0]):
        return pd.DataFrame(rows, columns=headers)

    return pd.DataFrame(rows)

Use this fallback when:

  • the table has broken markup
  • headers are inconsistent
  • read_html() says “No tables found”
  • the page mixes tables with layout elements

Step 7: Validate before writing CSV

Do not export blindly.

Validate:

  • expected columns exist
  • row count is plausible
  • key column is not mostly empty
def validate_table(df: pd.DataFrame, required_columns: list[str], min_rows: int = 1) -> None:
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    if len(df) < min_rows:
        raise ValueError(f"too few rows: {len(df)}")

    first_required = required_columns[0]
    if df[first_required].isna().all():
        raise ValueError(f"column {first_required} is entirely empty")

This protects you from silently shipping garbage CSVs.


Full example

import pandas as pd


def scrape_table_to_csv(url: str, required_keywords: list[str], out_path: str) -> pd.DataFrame:
    html = fetch_html(url)
    tables = parse_tables(html)
    df = choose_table(tables, required_keywords)
    df = flatten_columns(df)
    df = clean_table(df)
    validate_table(df, [required_keywords[0].lower()], min_rows=3)
    df.to_csv(out_path, index=False)
    return df


if __name__ == "__main__":
    df = scrape_table_to_csv(
        url="https://example.com/stats",
        required_keywords=["Country", "Population"],
        out_path="table_export.csv",
    )
    print(df.head(10).to_string(index=False))
    print("saved table_export.csv rows:", len(df))

That gives you one reusable pattern instead of a one-off notebook snippet.


When the page is JS-rendered

Some “tables” are not real HTML tables at all. They are:

  • div grids
  • client-rendered components
  • API-backed views

Clues:

  • read_html() finds nothing
  • BeautifulSoup sees no meaningful <table> tags
  • the browser shows data, but the raw HTML does not

In those cases, the next step is:

  1. inspect the network calls
  2. hit the underlying JSON endpoint if possible
  3. use a headless browser only if needed

Do not keep forcing read_html() when the page is not actually serving tables.


Where ProxiesAPI fits

HTML table parsing and network reliability are different problems.

ProxiesAPI becomes useful when:

  • the same table scraper starts seeing 403 / 429
  • you need to scrape many similar sites
  • you compare region-specific tables
  • your fetch layer gets flaky, but the parser is already fine

That is why it is worth keeping this split:

  • requests / ProxiesAPI for transport
  • pandas / BeautifulSoup for parsing

When the network breaks, you can improve the transport without touching the table cleanup logic.


Final thoughts

Reliable table scraping is not about finding a single magic library.

It is about having a workflow that answers the right questions in order:

  1. did I fetch real HTML?
  2. which table is actually the one I want?
  3. can read_html() parse it cleanly?
  4. if not, where should BeautifulSoup take over?
  5. did I validate the final CSV before exporting it?

That is the difference between “I scraped a table once” and “I can keep this scraper running next month.”

Tables are easy until the HTML gets messy

The trick is not just `pandas.read_html()`. It’s knowing when to switch to BeautifulSoup, how to validate the extracted table, and how to keep the fetch layer stable when your scraper moves beyond friendly sites.

Related guides