Scrape a WordPress Site via sitemap_index.xml (Python): Crawl, Extract, Dedupe, Export
WordPress powers a huge slice of the “normal web”: blogs, small publications, niche sites, company pages.
If you’ve ever tried scraping WordPress by clicking “Next page” forever, you know it’s a trap: pagination changes, archives are inconsistent, and you’ll miss content.
The clean approach is sitemap-first crawling.
In this tutorial, we’ll build a real, production-style WordPress crawler in Python:
- Start from sitemap_index.xml
- Discover post URLs
- Fetch HTML
- Extract metadata + clean text
- Deduplicate
- Export to CSV/JSON
When your crawl goes from 50 URLs to 50,000, reliability matters more than clever parsing. ProxiesAPI gives you stable proxy rotation + consistent networking so your scraper keeps moving.
The core idea (pipeline)
The sitemap-first approach is basically this:
sitemap_index.xml
↓
child sitemaps (post-sitemap.xml, page-sitemap.xml, …)
↓
URL list (dedupe)
↓
fetch HTML (timeouts + retries)
↓
extract → validate → export
Target site (example)
WordPress sitemaps typically live at one of these:
- https://example.com/sitemap_index.xml (Yoast / common)
- https://example.com/sitemap.xml (some setups)
For this guide, we’ll use a WordPress site that exposes sitemap_index.xml.
If you’re following along with your own target, replace the base URL.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
Why lxml? With it as the backend, BeautifulSoup parses dramatically faster and handles messy markup more reliably.
Step 1: Fetch sitemap_index.xml
import requests
SITEMAP_INDEX = "https://ma.tt/sitemap_index.xml"
resp = requests.get(SITEMAP_INDEX, timeout=30)
resp.raise_for_status()
print(resp.status_code)
print(resp.text[:400])
A typical sitemap index looks like this (trimmed):
<sitemapindex>
<sitemap>
<loc>https://ma.tt/post-sitemap.xml</loc>
<lastmod>2026-03-06T21:10:12+00:00</lastmod>
</sitemap>
...
</sitemapindex>
Step 2: Parse child sitemaps and collect URLs
We’ll parse the XML and collect:
- child sitemap URLs
- post/page URLs inside each sitemap
from bs4 import BeautifulSoup
def parse_sitemap_locs(xml_text: str) -> list[str]:
    soup = BeautifulSoup(xml_text, "xml")
    return [loc.get_text(strip=True) for loc in soup.select("loc")]
index_locs = parse_sitemap_locs(resp.text)
child_sitemaps = [u for u in index_locs if u.endswith(".xml")]
print("child sitemaps:", len(child_sitemaps))
print("example:", child_sitemaps[:3])
Now fetch each child sitemap and extract actual URLs:
def fetch_text(url: str) -> str:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return r.text
seen = set()
urls: list[str] = []
for sm_url in child_sitemaps:
    xml = fetch_text(sm_url)
    locs = parse_sitemap_locs(xml)
    # In a child sitemap, locs are usually *page URLs*, not more sitemaps.
    for u in locs:
        if u in seen:
            continue
        seen.add(u)
        urls.append(u)
print("urls:", len(urls))
print("sample:", urls[:5])
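The loop above assumes each child sitemap contains page URLs, but some plugins nest sitemap indexes another level deep. A recursive variant handles both; this is a sketch where fetch and parse are injected (pass the fetch_text and parse_sitemap_locs defined above):

```python
def collect_urls(sitemap_url, fetch, parse, seen=None):
    """Recursively walk a sitemap tree, returning deduped page URLs.

    fetch(url) -> xml text; parse(xml) -> list of <loc> values.
    Any .xml loc is treated as a nested sitemap and descended into.
    """
    if seen is None:
        seen = set()
    urls = []
    for loc in parse(fetch(sitemap_url)):
        if loc in seen:
            continue
        seen.add(loc)
        if loc.endswith(".xml"):
            urls.extend(collect_urls(loc, fetch, parse, seen))
        else:
            urls.append(loc)
    return urls
```

Usage: `collect_urls(SITEMAP_INDEX, fetch_text, parse_sitemap_locs)`. Injecting fetch/parse also makes the traversal trivially testable with a fake sitemap tree.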
Terminal sanity check (what you should see)
child sitemaps: 2
example: ['https://ma.tt/post-sitemap.xml', 'https://ma.tt/page-sitemap.xml']
urls: 250+
sample: ['https://ma.tt/2004/02/wordpress-and-movable-type/', ...]
Step 3: Fetch HTML and extract content (no guessed selectors)
WordPress themes vary, so we avoid brittle CSS selectors and instead use a layered extraction strategy:
- Prefer structured data (JSON-LD) when present
- Fall back to common semantic containers (article, main)
- As a last resort, take the full body text
Here’s a pragmatic extractor:
import json
import re
from bs4 import BeautifulSoup
def extract_post(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Title
    title = None
    if soup.title:
        title = soup.title.get_text(" ", strip=True)

    # Try JSON-LD
    jsonld_blocks = [
        s.get_text(strip=True)
        for s in soup.select('script[type="application/ld+json"]')
        if s.get_text(strip=True)
    ]
    published = None
    author = None
    for block in jsonld_blocks:
        try:
            data = json.loads(block)
        except Exception:
            continue
        # JSON-LD can be a dict or list
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            if item.get("@type") in ("Article", "BlogPosting", "NewsArticle"):
                title = item.get("headline") or title
                published = item.get("datePublished") or published
                a = item.get("author")
                if isinstance(a, dict):
                    author = a.get("name") or author

    # Main text
    main = soup.find("article") or soup.find("main") or soup.body
    text = ""
    if main:
        text = main.get_text(" ", strip=True)
        text = re.sub(r"\s+", " ", text).strip()

    return {
        "url": url,
        "title": title,
        "author": author,
        "published": published,
        "text": text,
        "text_len": len(text),
    }
Fetch + extract for a small sample first:
def fetch_html(url: str) -> str:
    r = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    return r.text
sample = urls[:10]
rows = []
for u in sample:
    html = fetch_html(u)
    rows.append(extract_post(html, u))

for r in rows[:2]:
    print(r["title"], r["published"], r["text_len"])
Step 4: Export to CSV/JSON
import csv
with open("wordpress_export.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["url", "title", "author", "published", "text_len", "text"])
    w.writeheader()
    w.writerows(rows)
print("wrote wordpress_export.csv", len(rows))
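For the JSON side of the export, JSON Lines (one object per line) is a good fit for crawl data because you can append and stream it. A minimal writer for the same rows:

```python
import json

def write_jsonl(rows: list[dict], path: str) -> None:
    """Write one JSON object per line (JSON Lines / .jsonl)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

`ensure_ascii=False` keeps non-ASCII characters readable instead of escaping them, which matters once your posts contain anything beyond plain English.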
Production hardening (what makes this reliable at scale)
1) Idempotency + dedupe
- Use url_norm (strip tracking params, normalize trailing slash)
- Store seen URLs (SQLite or a simple file-backed set)
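A minimal url_norm sketch; the tracking-parameter list here is my own starting set, so extend it for your targets:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Common tracking params to strip (an assumption: add others you encounter).
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
            "utm_content", "fbclid", "gclid"}

def url_norm(url: str) -> str:
    """Lowercase scheme/host, drop tracking params, force trailing slash."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING]
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))
```

The function is idempotent (normalizing twice gives the same result), which is exactly what you want before using URLs as dedupe keys.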
2) Retry policy
- Retry 429/5xx with exponential backoff + jitter
- Do not retry 404
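That policy translates into a small wrapper; this is one reasonable sketch, not the only shape it can take:

```python
import random
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}

def backoff(attempt: int) -> float:
    """Exponential backoff with jitter: 1-2s, 2-3s, 4-5s, ..."""
    return 2 ** attempt + random.random()

def fetch_with_retry(url: str, attempts: int = 4) -> requests.Response:
    for attempt in range(attempts):
        try:
            r = requests.get(url, timeout=30,
                             headers={"User-Agent": "Mozilla/5.0"})
        except requests.RequestException:
            r = None  # network errors are retryable
        if r is not None and r.status_code not in RETRYABLE:
            r.raise_for_status()  # 404 etc. raise immediately: no retry
            return r
        time.sleep(backoff(attempt))
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")
```

The jitter matters: without it, a fleet of workers that got blocked together retries together and gets blocked again.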
3) Soft-block detection (the silent killer)
A “blocked” page is often HTTP 200 with HTML that looks like:
- “Enable JavaScript”
- “Access denied”
- a generic placeholder page
Defend by fingerprinting:
- minimum text length
- presence of known boilerplate phrases
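A tiny fingerprint check along those lines; the phrase list and the 500-character floor are assumptions to tune against your own targets:

```python
BLOCK_PHRASES = (
    "enable javascript",
    "access denied",
    "are you a robot",
    "attention required",
)

def looks_blocked(html_text: str, min_len: int = 500) -> bool:
    """Heuristic: tiny pages or known boilerplate usually mean a soft block."""
    lowered = html_text.lower()
    if len(lowered) < min_len:
        return True
    return any(phrase in lowered for phrase in BLOCK_PHRASES)
```

Run it on every response before extraction; a page that fails the check should be retried (ideally through a different proxy), not written to your export.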
4) Caching
Cache HTML by URL hash. Most crawls re-run.
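A file-based cache keyed by URL hash is enough for most crawls; a minimal sketch (directory layout is my own choice):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache/html")

def cache_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    return cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def cached_fetch(url: str, fetch, cache_dir: Path = CACHE_DIR) -> str:
    """Return cached HTML if present; otherwise fetch(url) and store it."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Pass your fetch_html (or fetch_with_retry) as the fetch argument; re-runs then only hit the network for URLs you haven't seen.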
Where ProxiesAPI fits (honestly)
ProxiesAPI won’t magically make you scrape Cloudflare-walled giants.
What it will do well for this kind of crawl:
- stabilize networking across thousands of requests
- reduce transient blocks and variability
- keep throughput consistent
Minimal integration sketch
# Pseudocode: adapt to ProxiesAPI’s exact proxy endpoint format.
PROXY = "http://USER:PASS@proxy.proxiesapi.com:PORT"
r = requests.get(
    url,
    timeout=30,
    proxies={"http": PROXY, "https": PROXY},
    headers={"User-Agent": "Mozilla/5.0"},
)
QA checklist (before you ship)
- Does the sitemap crawl return a stable count?
- Do extracted titles look sane?
- Are you accidentally extracting nav/footer repeatedly?
- Does your dedupe prevent re-scraping?
- Did you export a sample and spot-check 5 URLs manually?
Next upgrades
- Incremental crawling: only fetch URLs newer than last run
- SQLite persistence: store URL → last_fetched → content_hash
- Add concurrency safely (after you’ve got retries + rate limits)
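For incremental crawling, the lastmod timestamps already present in the sitemap (see the index sample earlier) are the natural filter. A hedged sketch, assuming you record the time of your last run:

```python
from datetime import datetime, timezone

def newer_than(lastmod: str, cutoff: datetime) -> bool:
    """True if a sitemap <lastmod> value is newer than the last run."""
    try:
        ts = datetime.fromisoformat(lastmod)
    except ValueError:
        return True  # unparsable timestamp: fetch to be safe
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return ts > cutoff
```

This requires extending parse_sitemap_locs to return (loc, lastmod) pairs, but the payoff is large: a daily re-crawl of a 50,000-URL site shrinks to the handful of pages that actually changed.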