Web Scraping Sitemaps: Find Every Indexable URL Fast

Sitemap scraping is one of the highest-leverage habits in web scraping.

If a site publishes a decent XML sitemap, you can often skip huge amounts of blind crawling and jump straight to:

  • known indexable URLs
  • nested sitemap indexes
  • lastmod hints
  • content-type groupings like products, blog posts, categories, or locales

That is faster, cheaper, and easier to reason about than starting with a crawler that discovers everything the hard way.

Use sitemaps to cut waste before you scale fetches

The cheapest request is the one you never send. If sitemap scraping gives you a clean URL queue first, ProxiesAPI can spend its effort on the pages that actually matter.


What sitemap scraping is really for

A sitemap is not magic and it is not always complete. But it is often the best first source of truth for:

  • canonical content URLs
  • fresh pages to prioritize
  • which sections of a site exist at all

If you are scraping editorial sites, docs, ecommerce catalogs, or marketplaces, a sitemap pass can save hours of crawling noise.

The main sitemap formats you will see are:

Type            | Root tag / extension | What it contains            | Why it matters
URL set         | urlset               | direct page URLs            | fastest path to crawl targets
Sitemap index   | sitemapindex         | links to more sitemap files | how large sites segment content
Gzipped sitemap | .xml.gz              | compressed XML              | common on high-volume sites

The workflow is simple:

  1. discover the sitemap entry points
  2. expand nested indexes
  3. normalize and dedupe URLs
  4. push the results into your crawl queue

Step 1: Discover sitemap locations

Start with robots.txt. That is the cleanest source because many sites list their sitemap URLs there explicitly, one per "Sitemap:" line.

import requests
from urllib.parse import urljoin


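def find_sitemaps_from_robots(base_url: str) -> list[str]:
    robots_url = urljoin(base_url, "/robots.txt")
    # timeout is (connect, read) in seconds
    r = requests.get(robots_url, timeout=(10, 30))
    r.raise_for_status()

    urls = []
    for line in r.text.splitlines():
        line = line.strip()
        # Match the "Sitemap:" directive case-insensitively.
        if line.lower().startswith("sitemap:"):
            # Split on the first colon only; the URL itself contains "://".
            urls.append(line.split(":", 1)[1].strip())
    return urls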

If robots.txt does not help, try a few common fallback paths:

  • /sitemap.xml
  • /sitemap_index.xml
  • /sitemaps.xml

Do not guess twenty paths before checking the obvious three.
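
As a rough sketch of that fallback step, the helper below tries the three paths in order and returns the first one that responds. The names FALLBACK_PATHS, http_ok, and first_fallback_sitemap are illustrative and not part of the code above. The check is status-based only, so a site that answers 200 with an HTML error page would still pass; treat the result as a candidate until it parses as XML.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# The three fallback paths listed above, in the order they are tried.
FALLBACK_PATHS = ("/sitemap.xml", "/sitemap_index.xml", "/sitemaps.xml")


def http_ok(url: str) -> bool:
    """Return True if a GET on url answers 200; network errors count as False."""
    import requests  # imported lazily so the helper below can be tested offline

    try:
        # stream=True avoids downloading a large sitemap just to read the status.
        with requests.get(url, timeout=(10, 30), stream=True) as r:
            return r.status_code == 200
    except requests.RequestException:
        return False


def first_fallback_sitemap(
    base_url: str, is_available: Callable[[str], bool] = http_ok
) -> Optional[str]:
    """Return the first fallback sitemap URL that is available, or None."""
    for path in FALLBACK_PATHS:
        candidate = urljoin(base_url, path)
        if is_available(candidate):
            return candidate
    return None
```

Passing the availability check in as a function keeps the path logic separate from the network layer, so it can be swapped for a proxied fetch or a stub in tests.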


Step 2: Parse XML and gzip variants

Sitemap scraping breaks when people assume every sitemap is a plain XML file.

In reality, many large sites serve:

  • plain XML
  • gzipped XML
  • indexes that point to dozens or hundreds of child sitemaps

So your parser should handle all three cleanly.

import gzip
import xml.etree.ElementTree as ET

import requests


def fetch_bytes(url: str) -> bytes:
    r = requests.get(url, timeout=(10, 60))
    r.raise_for_status()
    return r.content


def parse_xml_bytes(raw: bytes) -> ET.Element:
    # Gzip streams start with the magic bytes 0x1f 0x8b; decompress before parsing.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)

That tiny gzip check solves a surprising number of “why is my XML parser exploding?” failures.
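
To see the check work without touching the network, here is a small offline sketch. SAMPLE_XML and parse_maybe_gzipped are made-up names for this demo; the sample document is a one-URL sitemap compressed in memory with gzip.compress.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

# A tiny made-up sitemap used only for this demonstration.
SAMPLE_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc></url>"
    b"</urlset>"
)


def parse_maybe_gzipped(raw: bytes) -> ET.Element:
    """Decompress when the payload looks like gzip, then parse as XML."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)


plain_root = parse_maybe_gzipped(SAMPLE_XML)
gzipped_root = parse_maybe_gzipped(gzip.compress(SAMPLE_XML))

# Both inputs end up as the same parsed document.
print(plain_root.tag == gzipped_root.tag)
```

Sniffing the bytes instead of trusting the .gz extension or the Content-Type header means the same function copes with servers that label compressed files inconsistently.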


Step 3: Expand nested sitemap indexes recursively

This is the core of sitemap scraping: one sitemap can point to many more.

NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def expand_sitemap(url: str, seen: set[str] | None = None) -> list[dict]:
    # `seen` holds sitemap URLs already fetched, so loops and repeats are skipped.
    if seen is None:
        seen = set()
    if url in seen:
        return []
    seen.add(url)

    root = parse_xml_bytes(fetch_bytes(url))
    # root.tag includes the XML namespace in braces, so match on the suffix.
    tag = root.tag.lower()

    rows = []

    # A sitemap index lists child sitemaps: recurse into each one.
    if tag.endswith("sitemapindex"):
        for node in root.findall("sm:sitemap", NAMESPACES):
            loc = node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip()
            if loc:
                rows.extend(expand_sitemap(loc, seen))
        return rows

    # A URL set lists pages: collect loc plus the optional lastmod.
    if tag.endswith("urlset"):
        for node in root.findall("sm:url", NAMESPACES):
            rows.append(
                {
                    "loc": node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip(),
                    "lastmod": node.findtext("sm:lastmod", default="", namespaces=NAMESPACES).strip() or None,
                }
            )
        return rows

    raise ValueError(f"Unexpected sitemap root tag: {root.tag}")

This gives you a flat list of URL rows even when the site has a deep sitemap tree.


Step 4: Normalize and dedupe the crawl queue

Never send sitemap URLs straight into a production crawl without cleanup.

At minimum:

  • strip whitespace
  • ignore empty loc fields
  • dedupe exact URLs
  • optionally drop obvious tracking parameters

from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    # Trim whitespace and drop the #fragment; scheme, host, path, and query are kept.
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def build_queue(rows: list[dict]) -> list[dict]:
    out = []
    seen = set()

    for row in rows:
        loc = row.get("loc")
        if not loc:
            continue
        normalized = canonicalize_url(loc)
        if normalized in seen:
            continue
        seen.add(normalized)
        out.append(
            {
                "url": normalized,
                "lastmod": row.get("lastmod"),
            }
        )

    return out
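
The optional tracking-parameter step from the list above is not handled by canonicalize_url. One possible sketch follows. The parameter list (utm_* prefixes plus gclid and fbclid) is an assumption for illustration, not an exhaustive or authoritative set, and strip_tracking_params is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; adjust these for the sites you crawl.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def strip_tracking_params(url: str) -> str:
    """Drop known tracking parameters and the fragment, keep everything else."""
    parts = urlsplit(url.strip())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_NAMES
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_tracking_params("https://example.com/p?id=7&utm_source=x&gclid=abc#top"))
```

Note that urlencode rebuilds the query string, so percent-escaping of the remaining parameters may change form (for example a space becomes "+"). If a site is sensitive to that, filter the raw query text instead.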

Full example

import csv
import gzip
import requests
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlsplit, urlunsplit

NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_sitemaps_from_robots(base_url: str) -> list[str]:
    robots_url = urljoin(base_url, "/robots.txt")
    r = requests.get(robots_url, timeout=(10, 30))
    r.raise_for_status()
    return [
        line.split(":", 1)[1].strip()
        for line in r.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]


def fetch_bytes(url: str) -> bytes:
    r = requests.get(url, timeout=(10, 60))
    r.raise_for_status()
    return r.content


def parse_xml_bytes(raw: bytes) -> ET.Element:
    # Gzip streams start with the magic bytes 0x1f 0x8b; decompress before parsing.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)


def expand_sitemap(url: str, seen: set[str] | None = None) -> list[dict]:
    if seen is None:
        seen = set()
    if url in seen:
        return []
    seen.add(url)

    root = parse_xml_bytes(fetch_bytes(url))
    tag = root.tag.lower()

    if tag.endswith("sitemapindex"):
        rows = []
        for node in root.findall("sm:sitemap", NAMESPACES):
            loc = node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip()
            if loc:
                rows.extend(expand_sitemap(loc, seen))
        return rows

    if tag.endswith("urlset"):
        return [
            {
                "loc": node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip(),
                "lastmod": node.findtext("sm:lastmod", default="", namespaces=NAMESPACES).strip() or None,
            }
            for node in root.findall("sm:url", NAMESPACES)
        ]

    raise ValueError(f"Unexpected sitemap root tag: {root.tag}")


def canonicalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def build_queue(rows: list[dict]) -> list[dict]:
    queue = []
    seen = set()
    for row in rows:
        loc = row.get("loc")
        if not loc:
            continue
        url = canonicalize_url(loc)
        if url in seen:
            continue
        seen.add(url)
        queue.append({"url": url, "lastmod": row.get("lastmod")})
    return queue


def write_csv(rows: list[dict], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "lastmod"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    base_url = "https://example.com"
    sitemap_urls = find_sitemaps_from_robots(base_url)
    all_rows = []
    for sitemap_url in sitemap_urls:
        all_rows.extend(expand_sitemap(sitemap_url))
    queue = build_queue(all_rows)
    write_csv(queue, "sitemap_queue.csv")
    print("queue size:", len(queue))

Practical sitemap scraping rules

1. Treat lastmod as a hint, not a promise

Some sites update it carefully. Others stamp the current time on everything. Use it for prioritization, not blind trust.
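
One way to use lastmod for ordering without trusting it blindly is sketched below. It assumes queue rows shaped like {"url": ..., "lastmod": ...}; parse_lastmod and prioritize are illustrative names. Values that are missing or fail to parse are treated as unknown and sorted last instead of raising.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_lastmod(value: Optional[str]) -> Optional[datetime]:
    """Parse a lastmod string; return None when it is missing or malformed."""
    if not value:
        return None
    text = value.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    # Date-only values come back naive; assume UTC so all values are comparable.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def prioritize(queue: list[dict]) -> list[dict]:
    """Newest lastmod first; rows without a usable lastmod go last, order kept."""
    oldest = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(
        queue,
        key=lambda row: parse_lastmod(row.get("lastmod")) or oldest,
        reverse=True,
    )
```

This only decides fetch order. Whether a page really changed is better confirmed after fetching, for example by comparing a content hash against the previous crawl.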

2. Segment queues by sitemap source

If a sitemap index splits content into:

  • /post-sitemap.xml
  • /category-sitemap.xml
  • /product-sitemap.xml

keep that label. It is extremely useful later for:

  • crawl frequency rules
  • parser routing
  • debugging weird sections
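
A minimal sketch of keeping that label, assuming each row carries a "source" key holding the sitemap URL it came from. The expand_sitemap function shown earlier does not record this, so it would need a small change to add it; sitemap_label and group_by_source are hypothetical helpers.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def sitemap_label(sitemap_url: str) -> str:
    """Turn a sitemap URL into a short label, e.g. post-sitemap.xml -> post-sitemap."""
    name = posixpath.basename(urlsplit(sitemap_url).path).lower()
    # Strip .gz first, then .xml, so post-sitemap.xml.gz also works.
    for suffix in (".gz", ".xml"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name or "unknown"


def group_by_source(rows: list[dict]) -> dict[str, list[str]]:
    """Group page URLs by the label of the sitemap that listed them."""
    groups: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        groups[sitemap_label(row["source"])].append(row["loc"])
    return dict(groups)


rows = [
    {"loc": "https://example.com/blog/a", "source": "https://example.com/post-sitemap.xml"},
    {"loc": "https://example.com/shop/b", "source": "https://example.com/product-sitemap.xml.gz"},
    {"loc": "https://example.com/blog/c", "source": "https://example.com/post-sitemap.xml"},
]
print(group_by_source(rows))
```
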

3. Expect bad hygiene

Real-world sitemap scraping often uncovers:

  • stale URLs
  • redirected URLs
  • non-canonical variants
  • empty lastmod values

That is normal. The sitemap is still valuable even when it is imperfect.

4. Combine sitemaps with crawling, not instead of crawling

Best use:

  • sitemap scraping for fast discovery
  • ordinary crawling for pagination gaps, orphaned pages, and fresh links not yet listed
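
A small sketch of how the two sources can be combined, assuming both are already normalized URL strings. merge_discovery is an illustrative name, and the two "only" lists are there mainly for debugging coverage gaps.

```python
from typing import Iterable


def merge_discovery(
    sitemap_urls: Iterable[str], crawled_urls: Iterable[str]
) -> dict[str, list[str]]:
    """Union both sources and report what each one found that the other missed."""
    sitemap_set, crawl_set = set(sitemap_urls), set(crawled_urls)
    return {
        "queue": sorted(sitemap_set | crawl_set),
        # Found by following links but not listed: orphans or not-yet-listed pages.
        "crawl_only": sorted(crawl_set - sitemap_set),
        # Listed but never linked to during the crawl: possibly stale or unlinked.
        "sitemap_only": sorted(sitemap_set - crawl_set),
    }
```
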

This is not a religion. It is a queue-building shortcut.


When sitemap scraping is the wrong first move

It is less useful when:

  • the target has no XML sitemap
  • the sitemap is tiny and most of the content sits behind JavaScript-driven search flows
  • the content you need is not indexable public content at all

In those cases, normal crawl discovery or API inspection may be better.

But when a decent sitemap exists, ignoring it is usually wasteful.


Bottom line

Sitemap scraping is not glamorous, but it is one of the fastest ways to upgrade a scraper from noisy exploration to intentional crawling.

Do the boring work first:

  1. find the sitemap
  2. parse XML and gzip variants
  3. expand nested indexes
  4. dedupe into a real queue

Once you do that, every downstream fetch gets cheaper and cleaner.
