HTML Table Scraping with Python: Parse Tables into CSV Reliably
If you search for html table scraping python, most tutorials show the easy case:
import pandas as pd
tables = pd.read_html(url)
And to be fair, that works surprisingly often.
But the real-world problem is not “can Python read a table?” The real problem is:
- the page has six tables
- headers are duplicated or merged
- rows contain notes and footnotes
- the site returns partial HTML or challenge pages
- one table loads fine, another is malformed
So the reliable workflow is not one function. It is a sequence:
- fetch the real HTML
- identify the right table
- parse with
read_html()first - fall back to
BeautifulSoupwhen the table is messy - normalize headers and cells
- validate before export
That is the workflow we’ll build here.
The trick is not just `pandas.read_html()`. It’s knowing when to switch to BeautifulSoup, how to validate the extracted table, and how to keep the fetch layer stable when your scraper moves beyond friendly sites.
When pandas.read_html() is enough
If the page has:
- server-rendered HTML
- real
<table>markup - clean headers
- predictable rows
then read_html() is the fastest route from webpage to DataFrame.
import pandas as pd
url = "https://example.com/table-page"
tables = pd.read_html(url)
print("tables found:", len(tables))
print(tables[0].head())
That is step one, not the whole solution.
The reliable workflow
Here is the process I recommend for production-ish table scraping:
| Step | Goal | Tool |
|---|---|---|
| fetch | make sure the HTML is real | requests |
| inspect | count and label candidate tables | BeautifulSoup |
| parse | extract structured rows fast | pandas.read_html() |
| clean | fix headers, footnotes, blanks | pandas |
| validate | confirm shape and columns | custom checks |
| export | save usable CSV | to_csv() |
This sequence is what keeps “works once” from turning into “works reliably”.
Step 1: Fetch the HTML yourself first
Do not pass a URL directly into read_html() until you know what the response actually looks like.
import requests
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
def fetch_html(url: str) -> str:
response = requests.get(url, headers=HEADERS, timeout=(10, 30))
response.raise_for_status()
html = response.text
if len(html) < 1000:
raise ValueError(f"suspiciously short HTML: {len(html)} chars")
return html
Why start here?
- you can inspect status codes
- you can detect challenge pages
- you can save raw HTML for debugging
- you keep transport separate from parsing
That separation matters even more when you later add retries or ProxiesAPI.
Step 2: Count the tables before you parse them
Use BeautifulSoup to inspect the page and understand what you are dealing with.
from bs4 import BeautifulSoup
def inspect_tables(html: str) -> None:
soup = BeautifulSoup(html, "lxml")
tables = soup.select("table")
print("table count:", len(tables))
for i, table in enumerate(tables[:10], start=1):
headers = [th.get_text(" ", strip=True) for th in table.select("th")[:8]]
print(f"table {i} headers:", headers)
This tells you:
- how many tables exist
- which one probably matters
- whether the table has
<th>headers at all
Without this step, you are often guessing.
Step 3: Parse with pandas.read_html()
Once you have the actual HTML string, pass that to read_html() instead of giving it the URL directly.
from io import StringIO
import pandas as pd
def parse_tables(html: str) -> list[pd.DataFrame]:
return pd.read_html(StringIO(html))
Then choose the table by its columns, not just its position:
def choose_table(tables: list[pd.DataFrame], required_keywords: list[str]) -> pd.DataFrame:
for df in tables:
cols = [str(col) for col in df.columns]
if all(any(keyword.lower() in col.lower() for col in cols) for keyword in required_keywords):
return df.copy()
raise ValueError("Could not find target table")
Example:
html = fetch_html("https://example.com/stats")
tables = parse_tables(html)
target = choose_table(tables, ["Country", "Population"])
print(target.head())
This is much safer than hard-coding “use table 0”.
Step 4: Flatten messy headers
Some tables come with multi-row or multi-index headers. That is where many “quick” scrapers fall apart.
A practical normalization pass:
def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
out = df.copy()
if isinstance(out.columns, pd.MultiIndex):
out.columns = [
" ".join(str(part).strip() for part in col if str(part) != "nan").strip()
for col in out.columns
]
else:
out.columns = [str(col).strip() for col in out.columns]
out.columns = [
col.lower().replace("\n", " ").replace("/", "_").replace(" ", " ").strip()
for col in out.columns
]
return out
This turns ugly column structures into something you can actually reference in code.
Step 5: Clean row values before export
Real tables often include:
- blank rows
- repeated header rows inside the body
- footnote markers
- extra whitespace
Here is a solid cleanup pass:
import re
def clean_cell(value):
if pd.isna(value):
return None
text = str(value)
text = re.sub(r"\[[^\]]+\]", "", text) # remove footnotes like [1]
text = re.sub(r"\s+", " ", text).strip()
return text or None
def clean_table(df: pd.DataFrame) -> pd.DataFrame:
out = df.copy()
out = out.dropna(how="all")
for col in out.columns:
out[col] = out[col].map(clean_cell)
# remove repeated header rows accidentally parsed into the body
first_col = out.columns[0]
out = out[out[first_col] != first_col]
return out.reset_index(drop=True)
This is where “parse into CSV reliably” actually happens.
Step 6: Fall back to BeautifulSoup when read_html() struggles
Sometimes read_html() misses the table or mangles the structure. Then it is better to parse the table manually.
def extract_table_with_bs4(html: str, table_selector: str) -> pd.DataFrame:
soup = BeautifulSoup(html, "lxml")
table = soup.select_one(table_selector)
if table is None:
raise ValueError(f"Table not found for selector: {table_selector}")
headers = [th.get_text(" ", strip=True) for th in table.select("tr th")]
rows = []
for tr in table.select("tr"):
cells = [td.get_text(" ", strip=True) for td in tr.select("td")]
if not cells:
continue
rows.append(cells)
if headers and rows and len(headers) == len(rows[0]):
return pd.DataFrame(rows, columns=headers)
return pd.DataFrame(rows)
Use this fallback when:
- the table has broken markup
- headers are inconsistent
read_html()says “No tables found”- the page mixes tables with layout elements
Step 7: Validate before writing CSV
Do not export blindly.
Validate:
- expected columns exist
- row count is plausible
- key column is not mostly empty
def validate_table(df: pd.DataFrame, required_columns: list[str], min_rows: int = 1) -> None:
missing = [col for col in required_columns if col not in df.columns]
if missing:
raise ValueError(f"missing columns: {missing}")
if len(df) < min_rows:
raise ValueError(f"too few rows: {len(df)}")
first_required = required_columns[0]
if df[first_required].isna().all():
raise ValueError(f"column {first_required} is entirely empty")
This protects you from silently shipping garbage CSVs.
Full example
import pandas as pd
def scrape_table_to_csv(url: str, required_keywords: list[str], out_path: str) -> pd.DataFrame:
html = fetch_html(url)
tables = parse_tables(html)
df = choose_table(tables, required_keywords)
df = flatten_columns(df)
df = clean_table(df)
validate_table(df, [required_keywords[0].lower()], min_rows=3)
df.to_csv(out_path, index=False)
return df
if __name__ == "__main__":
df = scrape_table_to_csv(
url="https://example.com/stats",
required_keywords=["Country", "Population"],
out_path="table_export.csv",
)
print(df.head(10).to_string(index=False))
print("saved table_export.csv rows:", len(df))
That gives you one reusable pattern instead of a one-off notebook snippet.
When the page is JS-rendered
Some “tables” are not real HTML tables at all. They are:
- div grids
- client-rendered components
- API-backed views
Clues:
read_html()finds nothingBeautifulSoupsees no meaningful<table>tags- the browser shows data, but the raw HTML does not
In those cases, the next step is:
- inspect the network calls
- hit the underlying JSON endpoint if possible
- use a headless browser only if needed
Do not keep forcing read_html() when the page is not actually serving tables.
Where ProxiesAPI fits
HTML table parsing and network reliability are different problems.
ProxiesAPI becomes useful when:
- the same table scraper starts seeing 403 / 429
- you need to scrape many similar sites
- you compare region-specific tables
- your fetch layer gets flaky, but the parser is already fine
That is why it is worth keeping this split:
requests/ ProxiesAPI for transportpandas/BeautifulSoupfor parsing
When the network breaks, you can improve the transport without touching the table cleanup logic.
Final thoughts
Reliable table scraping is not about finding a single magic library.
It is about having a workflow that answers the right questions in order:
- did I fetch real HTML?
- which table is actually the one I want?
- can
read_html()parse it cleanly? - if not, where should BeautifulSoup take over?
- did I validate the final CSV before exporting it?
That is the difference between “I scraped a table once” and “I can keep this scraper running next month.”
The trick is not just `pandas.read_html()`. It’s knowing when to switch to BeautifulSoup, how to validate the extracted table, and how to keep the fetch layer stable when your scraper moves beyond friendly sites.