Scrape a WordPress Site via sitemap_index.xml (Python): Crawl, Extract, Dedupe, Export

WordPress powers a huge slice of the “normal web”: blogs, small publications, niche sites, company pages.

If you’ve ever tried scraping WordPress by clicking “Next page” forever, you know it’s a trap: pagination changes, archives are inconsistent, and you’ll miss content.

The clean approach is sitemap-first crawling.

In this tutorial, we’ll build a real, production-style WordPress crawler in Python:

  • Start from sitemap_index.xml
  • Discover post URLs
  • Fetch HTML
  • Extract metadata + clean text
  • Deduplicate
  • Export to CSV/JSON

Make this crawl reliable with ProxiesAPI

When your crawl goes from 50 URLs to 50,000, reliability matters more than clever parsing. ProxiesAPI gives you stable proxy rotation + consistent networking so your scraper keeps moving.


The core idea (pipeline)

The sitemap-first approach is basically this:

sitemap_index.xml
   ↓
child sitemaps (post-sitemap.xml, page-sitemap.xml, …)
   ↓
URL list (dedupe)
   ↓
fetch HTML (timeouts + retries)
   ↓
extract → validate → export



Target site (example)

WordPress sitemaps typically live at one of these:

  • https://example.com/sitemap_index.xml (Yoast / common)
  • https://example.com/sitemap.xml (some setups)

For this guide, we’ll use a WordPress site that exposes sitemap_index.xml.

If you’re following along with your own target, replace the base URL.
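To make "try the common locations" concrete, here's a small hypothetical helper (the function name and path list are my own; the two paths are conventions used by Yoast and core WordPress, not guarantees):

```python
# Common sitemap locations to try, in priority order.
# These are conventions (Yoast / core WordPress), not guarantees.
CANDIDATE_SITEMAP_PATHS = ["/sitemap_index.xml", "/sitemap.xml"]

def sitemap_candidates(base_url: str) -> list[str]:
    """Return likely sitemap URLs for a site, in the order worth trying."""
    base = base_url.rstrip("/")
    return [base + path for path in CANDIDATE_SITEMAP_PATHS]
```

Probe each candidate with a quick GET and use the first one that returns 200 with XML content.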


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Why lxml? With it installed, BeautifulSoup parses significantly faster and handles malformed markup more reliably.


Step 1: Fetch sitemap_index.xml

import requests

SITEMAP_INDEX = "https://ma.tt/sitemap_index.xml"

resp = requests.get(SITEMAP_INDEX, timeout=30)
resp.raise_for_status()
print(resp.status_code)
print(resp.text[:400])

A typical sitemap index looks like this (trimmed):

<sitemapindex>
  <sitemap>
    <loc>https://ma.tt/post-sitemap.xml</loc>
    <lastmod>2026-03-06T21:10:12+00:00</lastmod>
  </sitemap>
  ...
</sitemapindex>

Step 2: Parse child sitemaps and collect URLs

We’ll parse the XML and collect:

  • child sitemap URLs
  • post/page URLs inside each sitemap

from bs4 import BeautifulSoup

def parse_sitemap_locs(xml_text: str) -> list[str]:
    soup = BeautifulSoup(xml_text, "xml")
    return [loc.get_text(strip=True) for loc in soup.select("loc")]

index_locs = parse_sitemap_locs(resp.text)
child_sitemaps = [u for u in index_locs if u.endswith(".xml")]
print("child sitemaps:", len(child_sitemaps))
print("example:", child_sitemaps[:3])

Now fetch each child sitemap and extract actual URLs:

def fetch_text(url: str) -> str:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return r.text

seen = set()
urls: list[str] = []

for sm_url in child_sitemaps:
    xml = fetch_text(sm_url)
    locs = parse_sitemap_locs(xml)

    # In a child sitemap, locs are usually *page URLs*, not more sitemaps.
    for u in locs:
        if u in seen:
            continue
        seen.add(u)
        urls.append(u)

print("urls:", len(urls))
print("sample:", urls[:5])

Terminal sanity check (what you should see)

child sitemaps: 2
example: ['https://ma.tt/post-sitemap.xml', 'https://ma.tt/page-sitemap.xml']
urls: 250+
sample: ['https://ma.tt/2004/02/wordpress-and-movable-type/', ...]

Step 3: Fetch HTML and extract content (no guessed selectors)

WordPress themes vary, so we avoid brittle CSS selectors and instead use a layered extraction strategy:

  1. Prefer structured data (JSON-LD) when present
  2. Fall back to common semantic containers (article, main)
  3. As a last resort, take the largest text block

Here’s a pragmatic extractor:

import json
import re
from bs4 import BeautifulSoup

def extract_post(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Title
    title = None
    if soup.title:
        title = soup.title.get_text(" ", strip=True)

    # Try JSON-LD
    jsonld_blocks = [
        s.get_text(strip=True)
        for s in soup.select('script[type="application/ld+json"]')
        if s.get_text(strip=True)
    ]

    published = None
    author = None

    for block in jsonld_blocks:
        try:
            data = json.loads(block)
        except Exception:
            continue

        # JSON-LD can be a dict or list
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            if item.get("@type") in ("Article", "BlogPosting", "NewsArticle"):
                title = item.get("headline") or title
                published = item.get("datePublished") or published
                a = item.get("author")
                if isinstance(a, dict):
                    author = a.get("name") or author

    # Main text
    main = soup.find("article") or soup.find("main") or soup.body
    text = ""
    if main:
        text = main.get_text(" ", strip=True)
        text = re.sub(r"\s+", " ", text).strip()

    return {
        "url": url,
        "title": title,
        "author": author,
        "published": published,
        "text": text,
        "text_len": len(text),
    }

Fetch + extract for a small sample first:

def fetch_html(url: str) -> str:
    r = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    return r.text

sample = urls[:10]
rows = []
for u in sample:
    html = fetch_html(u)
    rows.append(extract_post(html, u))

for r in rows[:2]:
    print(r["title"], r["published"], r["text_len"])

Step 4: Export to CSV/JSON

import csv

with open("wordpress_export.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["url","title","author","published","text_len","text"])
    w.writeheader()
    w.writerows(rows)

print("wrote wordpress_export.csv", len(rows))

Production hardening (what makes this reliable at scale)

1) Idempotency + dedupe

  • Use url_norm (strip tracking params, normalize trailing slash)
  • Store seen URLs (SQLite or a simple file-backed set)
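A minimal url_norm sketch, assuming the usual tracking parameters (extend the set for your targets):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Tracking params to strip; extend for your targets.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "fbclid", "gclid"}

def url_norm(url: str) -> str:
    """Normalize a URL for dedupe: lowercase scheme/host, drop tracking
    params and fragment, add a trailing slash to extension-less paths."""
    parts = urlsplit(url)
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query)
         if k.lower() not in TRACKING_PARAMS]
    )
    path = parts.path or "/"
    # Heuristic: only add a trailing slash when the last segment has no extension.
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))
```

Dedupe against url_norm(u), not the raw URL, so /post and /post/?utm_source=x collapse to one entry.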

2) Retry policy

  • Retry 429/5xx with exponential backoff + jitter
  • Do not retry 404
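A sketch of that policy (the helper names and delay constants are my choices, not a fixed recipe):

```python
import random
import time

import requests

# Statuses worth retrying; 404 is deliberately absent.
RETRYABLE = {429, 500, 502, 503, 504}

def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.5) -> float:
    """Delay before retry N (0-indexed): base * 2**attempt plus random jitter."""
    return base * (2 ** attempt) + random.uniform(0, jitter)

def fetch_with_retries(url: str, max_tries: int = 4) -> requests.Response:
    """GET with exponential backoff + jitter on 429/5xx; 404 raises immediately."""
    for attempt in range(max_tries):
        resp = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
        if resp.status_code in RETRYABLE and attempt < max_tries - 1:
            time.sleep(backoff_delay(attempt))
            continue
        resp.raise_for_status()  # 404 and other client errors raise here, no retry
        return resp
```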

3) Soft-block detection (the silent killer)

A “blocked” page is often HTTP 200 with HTML that looks like:

  • “Enable JavaScript”
  • “Access denied”
  • a generic placeholder page

Defend by fingerprinting:

  • minimum text length
  • presence of known boilerplate phrases
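One way to sketch that fingerprint check (the phrase list and length threshold are assumptions you'd tune per target):

```python
# Phrases that commonly appear on block/placeholder pages.
BLOCK_PHRASES = ("enable javascript", "access denied",
                 "checking your browser", "captcha")
MIN_TEXT_LEN = 500  # assumption: real posts are rarely shorter

def looks_blocked(text: str, min_len: int = MIN_TEXT_LEN) -> bool:
    """Return True if extracted text looks like a soft-block page."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BLOCK_PHRASES):
        return True
    return len(text) < min_len
```

Run this on the extracted text (the text_len field from Step 3 is a natural input) and quarantine suspicious URLs for a retry instead of exporting them.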

4) Caching

Cache HTML by URL hash. Most crawls re-run.
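A minimal file-backed cache sketch (directory name and helper names are assumptions; swap in SQLite if you prefer):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache_html")  # assumption: local cache directory

def cache_path(url: str) -> Path:
    """One file per URL, keyed by the SHA-256 of the URL."""
    return CACHE_DIR / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")

def cached_fetch(url: str, fetch) -> str:
    """Return cached HTML if present; otherwise call fetch(url) and store it."""
    path = cache_path(url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    CACHE_DIR.mkdir(exist_ok=True)
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Pass your fetch_html function as the fetch argument; re-runs then skip every URL already on disk.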


Where ProxiesAPI fits (honestly)

ProxiesAPI won’t magically make you scrape Cloudflare-walled giants.

What it will do well for this kind of crawl:

  • stabilize networking across thousands of requests
  • reduce transient blocks and variability
  • keep throughput consistent

Minimal integration sketch

# Pseudocode: adapt to ProxiesAPI’s exact proxy endpoint format.
PROXY = "http://USER:PASS@proxy.proxiesapi.com:PORT"

r = requests.get(
    url,
    timeout=30,
    proxies={"http": PROXY, "https": PROXY},
    headers={"User-Agent": "Mozilla/5.0"},
)

QA checklist (before you ship)

  • Does the sitemap crawl return a stable count?
  • Do extracted titles look sane?
  • Are you accidentally extracting nav/footer repeatedly?
  • Does your dedupe prevent re-scraping?
  • Did you export a sample and spot-check 5 URLs manually?

Next upgrades

  • Incremental crawling: only fetch URLs newer than last run
  • SQLite persistence: store URL → last_fetched → content_hash
  • Add concurrency safely (after you’ve got retries + rate limits)
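The SQLite upgrade can be sketched like this (schema and function names are my assumptions, not a prescribed layout):

```python
import hashlib
import sqlite3
import time

def open_db(path: str = "crawl.db") -> sqlite3.Connection:
    """Open (or create) the crawl-state database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " last_fetched REAL,"
        " content_hash TEXT)"
    )
    return conn

def record_fetch(conn: sqlite3.Connection, url: str, html: str) -> bool:
    """Upsert the fetch; return True if the content changed since last run."""
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM pages WHERE url = ?", (url,)
    ).fetchone()
    changed = row is None or row[0] != new_hash
    conn.execute(
        "INSERT INTO pages (url, last_fetched, content_hash) VALUES (?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET last_fetched = excluded.last_fetched, "
        "content_hash = excluded.content_hash",
        (url, time.time(), new_hash),
    )
    conn.commit()
    return changed
```

With last_fetched stored per URL, the incremental crawl becomes: skip any sitemap entry whose lastmod is older than the stored timestamp.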
