Web Scraping Sitemaps: Find Every Indexable URL Fast

Jul 04, 2026 · guides · #sitemap scraping, #web-scraping, #xml, #python, #crawl-queue, #seo

Sitemap scraping is one of the highest-leverage habits in web scraping.

If a site publishes a decent XML sitemap, you can often skip huge amounts of blind crawling and jump straight to:

known indexable URLs
nested sitemap indexes
lastmod hints
content-type groupings like products, blog posts, categories, or locales

That is faster, cheaper, and easier to reason about than starting with a crawler that discovers everything the hard way.

Use sitemaps to cut waste before you scale fetches

The cheapest request is the one you never send. If sitemap scraping gives you a clean URL queue first, ProxiesAPI can spend its effort on the pages that actually matter.


What sitemap scraping is really for

A sitemap is not magic and it is not always complete. But it is often the best first source of truth for:

canonical content URLs
fresh pages to prioritize
which sections of a site exist at all

If you are scraping editorial sites, docs, ecommerce catalogs, or marketplaces, a sitemap pass can save hours of crawling noise.

The main sitemap formats you will see are:

Type	Root tag	What it contains	Why it matters
URL set	`urlset`	direct page URLs	fastest path to crawl targets
Sitemap index	`sitemapindex`	links to more sitemap files	how large sites segment content
Gzipped sitemap	`urlset` or `sitemapindex` once decompressed	compressed XML, usually served as `.xml.gz`	common on high-volume sites

The workflow is simple:

discover the sitemap entry points
expand nested indexes
normalize and dedupe URLs
push the results into your crawl queue

Step 1: Discover sitemap locations

Start with robots.txt. That is the cleanest source because many sites advertise sitemap URLs there explicitly.

import requests
from urllib.parse import urljoin


def find_sitemaps_from_robots(base_url: str) -> list[str]:
    robots_url = urljoin(base_url, "/robots.txt")
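    r = requests.get(robots_url, timeout=(10, 30))
    if r.status_code == 404:
        return []  # no robots.txt at all: move on to the fallback paths
    r.raise_for_status()

    urls = []
    for line in r.text.splitlines():
        line = line.strip()  # tolerate indented directives
        # The field name is case-insensitive; split on the first colon only
        # so the "https://" in the value stays intact.
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())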
    return urls

If robots.txt does not help, try a few common fallback paths:

/sitemap.xml
/sitemap_index.xml
/sitemaps.xml

Do not guess twenty paths before checking the obvious three.
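
A minimal sketch of that fallback probe, assuming you pass in your own fetch function. `probe_fallback_sitemaps` and its `fetch` parameter are hypothetical names, not part of any library; `fetch` is expected to return the response body as bytes and raise on failure. The body check is there because some sites answer unknown paths with an HTML page and a 200 status.

```python
from typing import Callable
from urllib.parse import urljoin

# The three fallback paths listed above, in the order they are tried.
FALLBACK_PATHS = ("/sitemap.xml", "/sitemap_index.xml", "/sitemaps.xml")


def probe_fallback_sitemaps(base_url: str, fetch: Callable[[str], bytes]) -> list[str]:
    """Return the fallback URLs whose body looks like a sitemap.

    `fetch` is a hypothetical injected callable: bytes in, exception on failure.
    """
    found = []
    for path in FALLBACK_PATHS:
        url = urljoin(base_url, path)
        try:
            body = fetch(url)
        except Exception:
            continue  # 404, timeout, etc.: try the next path
        head = body[:2048].lower()
        # Accept gzip magic bytes or a recognisable sitemap root tag.
        if body[:2] == b"\x1f\x8b" or b"<urlset" in head or b"<sitemapindex" in head:
            found.append(url)
    return found
```

Any function with that shape works as `fetch`, for example a thin wrapper around `requests.get(...).content` that calls `raise_for_status()` first.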

Step 2: Parse XML and gzip variants

Sitemap scraping breaks when people assume every sitemap is a plain XML file.

In reality, many large sites serve:

plain XML
gzipped XML
indexes that point to dozens or hundreds of child sitemaps

So your parser should handle all three cleanly.

import gzip
import xml.etree.ElementTree as ET


def fetch_bytes(url: str) -> bytes:
    r = requests.get(url, timeout=(10, 60))
    r.raise_for_status()
    return r.content


def parse_xml_bytes(raw: bytes) -> ET.Element:
    # gzip streams always start with the two magic bytes 0x1f 0x8b
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)

That tiny gzip check solves a surprising number of “why is my XML parser exploding?” failures.
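
You can verify the sniffing logic without touching the network. The sketch below is an offline check under stated assumptions: `parse_sitemap_bytes` is a stand-in name for the parser above, and the sample XML is made up. It compresses a small sitemap in memory and confirms that both forms parse to the same root tag.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def parse_sitemap_bytes(raw: bytes) -> ET.Element:
    """Parse sitemap bytes, decompressing first when they are gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)


# Made-up sample document, used only for this check.
plain = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc></url>"
    b"</urlset>"
)
packed = gzip.compress(plain)

plain_root = parse_sitemap_bytes(plain)
packed_root = parse_sitemap_bytes(packed)
assert plain_root.tag == packed_root.tag
```

Sniffing the bytes is more dependable than trusting the `.gz` extension: `requests` already decompresses responses sent with `Content-Encoding: gzip`, so a `.xml.gz` URL can arrive as plain XML, while a file served as a raw gzip download arrives still compressed.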

Step 3: Expand nested sitemap indexes recursively

This is the core of sitemap scraping: one sitemap can point to many more.

NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def expand_sitemap(url: str, seen: set[str] | None = None) -> list[dict]:
    if seen is None:
        seen = set()
    if url in seen:
        return []
    seen.add(url)

    root = parse_xml_bytes(fetch_bytes(url))
    tag = root.tag.lower()

    rows = []

    if tag.endswith("sitemapindex"):
        for node in root.findall("sm:sitemap", NAMESPACES):
            loc = node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip()
            if loc:
                rows.extend(expand_sitemap(loc, seen))
        return rows

    if tag.endswith("urlset"):
        for node in root.findall("sm:url", NAMESPACES):
            rows.append(
                {
                    "loc": node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip(),
                    "lastmod": node.findtext("sm:lastmod", default="", namespaces=NAMESPACES).strip() or None,
                }
            )
        return rows

    raise ValueError(f"Unexpected sitemap root tag: {root.tag}")

This gives you a flat list of URL rows even when the site has a deep sitemap tree.
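
On very large sites you may prefer a variant that keeps going when one child sitemap fails. What follows is one possible sketch, not the only way to do it: `expand_iteratively`, `fetch_root`, and `max_sitemaps` are hypothetical names, and `fetch_root` is assumed to return a parsed root element or raise. It uses a queue instead of recursion, records failures instead of aborting, and tags each row with the sitemap it came from.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Callable

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def expand_iteratively(
    start_url: str,
    fetch_root: Callable[[str], ET.Element],  # hypothetical: parsed root or raises
    max_sitemaps: int = 1000,  # assumed safety cap, tune to taste
) -> tuple[list[dict], list[tuple[str, str]]]:
    rows, errors = [], []
    seen = set()
    pending = deque([start_url])

    while pending and len(seen) < max_sitemaps:
        url = pending.popleft()
        if url in seen:
            continue  # guards against indexes that reference each other
        seen.add(url)

        try:
            root = fetch_root(url)
        except Exception as exc:
            errors.append((url, repr(exc)))  # record it and keep going
            continue

        if root.tag == SM + "sitemapindex":
            for loc in root.iterfind(f"{SM}sitemap/{SM}loc"):
                if loc.text and loc.text.strip():
                    pending.append(loc.text.strip())
        elif root.tag == SM + "urlset":
            for node in root.iterfind(f"{SM}url"):
                loc = (node.findtext(f"{SM}loc") or "").strip()
                if not loc:
                    continue
                lastmod = (node.findtext(f"{SM}lastmod") or "").strip() or None
                rows.append({"loc": loc, "lastmod": lastmod, "source": url})
        else:
            errors.append((url, f"unexpected root tag {root.tag}"))

    return rows, errors
```

The `errors` list tells you which child sitemaps to retry later, and the `source` field is what lets you segment the queue by sitemap afterwards.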

Step 4: Normalize and dedupe the crawl queue

Never send sitemap URLs straight into a production crawl without cleanup.

At minimum:

strip whitespace
ignore empty loc fields
dedupe exact URLs
optionally drop obvious tracking parameters

from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def build_queue(rows: list[dict]) -> list[dict]:
    out = []
    seen = set()

    for row in rows:
        loc = row.get("loc")
        if not loc:
            continue
        normalized = canonicalize_url(loc)
        if normalized in seen:
            continue
        seen.add(normalized)
        out.append(
            {
                "url": normalized,
                "lastmod": row.get("lastmod"),
            }
        )

    return out
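
The optional tracking-parameter step could look like the sketch below. `strip_tracking` and the two blocklists are illustrative assumptions; which parameters count as tracking noise depends on the site, so review the lists before relying on them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed blocklists for illustration only; adjust per target site.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}


def strip_tracking(url: str) -> str:
    """Drop the fragment and any query parameter on the blocklists."""
    parts = urlsplit(url.strip())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    # urlencode re-encodes the surviving pairs, so unusual escaping in the
    # original query string may come out normalized.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Run it before the dedupe step so that two URLs differing only in `utm_*` values collapse into one queue entry.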

Full example

import csv
import gzip
import requests
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlsplit, urlunsplit

NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_sitemaps_from_robots(base_url: str) -> list[str]:
    robots_url = urljoin(base_url, "/robots.txt")
    r = requests.get(robots_url, timeout=(10, 30))
    if r.status_code == 404:
        return []  # no robots.txt: nothing advertised
    r.raise_for_status()
    return [
        line.split(":", 1)[1].strip()
        for line in r.text.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]


def fetch_bytes(url: str) -> bytes:
    r = requests.get(url, timeout=(10, 60))
    r.raise_for_status()
    return r.content


def parse_xml_bytes(raw: bytes) -> ET.Element:
    # gzip streams always start with the two magic bytes 0x1f 0x8b
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)


def expand_sitemap(url: str, seen: set[str] | None = None) -> list[dict]:
    if seen is None:
        seen = set()
    if url in seen:
        return []
    seen.add(url)

    root = parse_xml_bytes(fetch_bytes(url))
    tag = root.tag.lower()

    if tag.endswith("sitemapindex"):
        rows = []
        for node in root.findall("sm:sitemap", NAMESPACES):
            loc = node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip()
            if loc:
                rows.extend(expand_sitemap(loc, seen))
        return rows

    if tag.endswith("urlset"):
        return [
            {
                "loc": node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip(),
                "lastmod": node.findtext("sm:lastmod", default="", namespaces=NAMESPACES).strip() or None,
            }
            for node in root.findall("sm:url", NAMESPACES)
        ]

    raise ValueError(f"Unexpected sitemap root tag: {root.tag}")


def canonicalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def build_queue(rows: list[dict]) -> list[dict]:
    queue = []
    seen = set()
    for row in rows:
        loc = row.get("loc")
        if not loc:
            continue
        url = canonicalize_url(loc)
        if url in seen:
            continue
        seen.add(url)
        queue.append({"url": url, "lastmod": row.get("lastmod")})
    return queue


def write_csv(rows: list[dict], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "lastmod"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    base_url = "https://example.com"
    sitemap_urls = find_sitemaps_from_robots(base_url)
    all_rows = []
    for sitemap_url in sitemap_urls:
        all_rows.extend(expand_sitemap(sitemap_url))
    queue = build_queue(all_rows)
    write_csv(queue, "sitemap_queue.csv")
    print("queue size:", len(queue))

Practical sitemap scraping rules

1. Treat `lastmod` as a hint, not a promise

Some sites update it carefully. Others stamp the current time on everything. Use it for prioritization, not blind trust.

2. Segment queues by sitemap source

If a sitemap index splits content into:

/post-sitemap.xml
/category-sitemap.xml
/product-sitemap.xml

keep that label. It is extremely useful later for:

crawl frequency rules
parser routing
debugging weird sections
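
A small sketch of keeping that label, assuming each row carries a `source` field with the URL of the sitemap it came from. That field is hypothetical: the `expand_sitemap` function above does not record it, so you would need to add it yourself. `source_label` and `group_by_source` are illustrative names as well.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def source_label(sitemap_url: str) -> str:
    """Turn a sitemap URL into a short label, e.g. 'post-sitemap'."""
    name = PurePosixPath(urlsplit(sitemap_url).path).name
    for suffix in (".gz", ".xml"):  # strip ".gz" first, then ".xml"
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name or "unknown"


def group_by_source(rows: list[dict]) -> dict[str, list[str]]:
    """Group page URLs by the label of the sitemap that listed them."""
    groups = defaultdict(list)
    for row in rows:
        groups[source_label(row["source"])].append(row["loc"])
    return dict(groups)
```

With the groups in hand you can give each label its own crawl schedule or route it to a dedicated parser.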

3. Expect bad hygiene

Real-world sitemap scraping often uncovers:

stale URLs
redirected URLs
non-canonical variants
empty lastmod values

That is normal. The sitemap is still valuable even when it is imperfect.

4. Combine sitemaps with crawling, not instead of crawling

Best use:

sitemap scraping for fast discovery
ordinary crawling for pagination gaps, orphaned pages, and fresh links not yet listed

This is not a religion. It is a queue-building shortcut.

When sitemap scraping is the wrong first move

It is less useful when:

the target has no XML sitemap
the sitemap is tiny and most of the site's content sits behind JS-driven search flows
the content you need is not indexable public content at all

In those cases, normal crawl discovery or API inspection may be better.

But when a decent sitemap exists, ignoring it is usually wasteful.

Bottom line

Sitemap scraping is not glamorous, but it is one of the fastest ways to upgrade a scraper from noisy exploration to intentional crawling.

Do the boring work first:

find the sitemap
expand nested indexes
parse gzip variants
dedupe into a real queue

Once you do that, every downstream fetch gets cheaper and cleaner.
