Web Scraping Sitemaps: Find Every Indexable URL Fast
Sitemap scraping is one of the highest-leverage habits in web scraping.
If a site publishes a decent XML sitemap, you can often skip huge amounts of blind crawling and jump straight to:
- known indexable URLs
- nested sitemap indexes
- lastmod hints
- content-type groupings like products, blog posts, categories, or locales
That is faster, cheaper, and easier to reason about than starting with a crawler that discovers everything the hard way.
The cheapest request is the one you never send. If sitemap scraping gives you a clean URL queue first, ProxiesAPI can spend its effort on the pages that actually matter.
What sitemap scraping is really for
A sitemap is not magic and it is not always complete. But it is often the best first source of truth for:
- canonical content URLs
- fresh pages to prioritize
- which sections of a site exist at all
If you are scraping editorial sites, docs, ecommerce catalogs, or marketplaces, a sitemap pass can save hours of crawling noise.
The main sitemap formats you will see are:
| Type | Root tag or file extension | What it contains | Why it matters |
|---|---|---|---|
| URL set | urlset | direct page URLs | fastest path to crawl targets |
| Sitemap index | sitemapindex | links to more sitemap files | how large sites segment content |
| Gzipped sitemap | .xml.gz | compressed XML | common on high-volume sites |
The workflow is simple:
- discover the sitemap entry points
- expand nested indexes
- normalize and dedupe URLs
- push the results into your crawl queue
Step 1: Discover sitemap locations
Start with robots.txt. That is the cleanest source because many sites advertise sitemap URLs there explicitly.
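import requests
from urllib.parse import urljoin

def find_sitemaps_from_robots(base_url: str) -> list[str]:
    # robots.txt lives at the site root, whatever path base_url has.
    robots_url = urljoin(base_url, "/robots.txt")
    r = requests.get(robots_url, timeout=(10, 30))
    r.raise_for_status()
    urls = []
    for line in r.text.splitlines():
        # The directive name is case-insensitive and may be indented.
        if line.strip().lower().startswith("sitemap:"):
            # Split on the first colon only so the URL's own "https:" survives.
            urls.append(line.split(":", 1)[1].strip())
    return urls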
If robots.txt does not help, try a few common fallback paths:
- /sitemap.xml
- /sitemap_index.xml
- /sitemaps.xml
Do not guess twenty paths before checking the obvious three.
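A minimal sketch of that fallback check, using only the standard library. The names `candidate_sitemap_urls`, `url_exists`, and `first_fallback_sitemap` are hypothetical helpers, not part of any library, and treating an HTTP 200 as "exists" is an assumption: some sites return 200 with an HTML error page, so the result should still go through the XML parser.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional
from urllib.parse import urljoin

# The three fallback paths listed above, tried in order.
FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemaps.xml"]


def candidate_sitemap_urls(base_url: str) -> list[str]:
    """Build absolute fallback URLs on the site root."""
    return [urljoin(base_url, path) for path in FALLBACK_PATHS]


def url_exists(url: str) -> bool:
    """Hypothetical helper: True if a GET answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False


def first_fallback_sitemap(
    base_url: str, exists: Callable[[str], bool] = url_exists
) -> Optional[str]:
    """Return the first fallback URL that exists, or None if none do."""
    for url in candidate_sitemap_urls(base_url):
        if exists(url):
            return url
    return None


# Offline usage: a set stands in for the remote site.
fake_site = {"https://example.com/sitemap_index.xml"}
print(first_fallback_sitemap("https://example.com", exists=fake_site.__contains__))
```

Passing the existence check in as a parameter keeps the probing logic testable without touching the network.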
Step 2: Parse XML and gzip variants
Sitemap scraping breaks when people assume every sitemap is a plain XML file.
In reality, many large sites serve:
- plain XML
- gzipped XML
- indexes that point to dozens or hundreds of child sitemaps
So your parser should handle all three cleanly.
import gzip
import xml.etree.ElementTree as ET

# requests is already imported in Step 1.
def fetch_bytes(url: str) -> bytes:
    r = requests.get(url, timeout=(10, 60))
    r.raise_for_status()
    return r.content

def parse_xml_bytes(raw: bytes) -> ET.Element:
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)
That tiny gzip check solves a surprising number of “why is my XML parser exploding?” failures.
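To see why sniffing the first two bytes is enough, here is a small offline sketch. The sample XML and the names `SAMPLE_XML` and `GZIP_MAGIC` are made up for illustration; the function mirrors the parser described above, and nothing touches the network.

```python
import gzip
import xml.etree.ElementTree as ET

# Illustrative one-URL sitemap; the content is invented for this example.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
</urlset>"""

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def parse_xml_bytes(raw: bytes) -> ET.Element:
    """Decompress if the payload looks gzipped, then parse as XML."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)


plain_root = parse_xml_bytes(SAMPLE_XML)
gzipped_root = parse_xml_bytes(gzip.compress(SAMPLE_XML))

# Both inputs yield the same root element.
print(plain_root.tag == gzipped_root.tag)  # True
```

Checking the content rather than the .gz extension also covers servers that hand back compressed bytes from a URL ending in .xml.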
Step 3: Expand nested sitemap indexes recursively
This is the core of sitemap scraping: one sitemap can point to many more.
NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def expand_sitemap(url: str, seen: set[str] | None = None) -> list[dict]:
    # Track visited sitemaps so a self-referencing index cannot loop forever.
    if seen is None:
        seen = set()
    if url in seen:
        return []
    seen.add(url)
    root = parse_xml_bytes(fetch_bytes(url))
    # root.tag includes the namespace, e.g. "{http://...}urlset".
    tag = root.tag.lower()
    rows = []
    if tag.endswith("sitemapindex"):
        # An index lists child sitemaps: recurse into each one.
        for node in root.findall("sm:sitemap", NAMESPACES):
            loc = node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip()
            if loc:
                rows.extend(expand_sitemap(loc, seen))
        return rows
    if tag.endswith("urlset"):
        # A URL set lists pages directly: collect loc and lastmod.
        for node in root.findall("sm:url", NAMESPACES):
            rows.append(
                {
                    "loc": node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip(),
                    "lastmod": node.findtext("sm:lastmod", default="", namespaces=NAMESPACES).strip() or None,
                }
            )
        return rows
    raise ValueError(f"Unexpected sitemap root tag: {root.tag}")
This gives you a flat list of URL rows even when the site has a deep sitemap tree.
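The recursion is easier to trust once you run it against a fake site. The sketch below is a simplified variant, not the function above: it takes the fetcher as a parameter and returns bare URLs. `FAKE_SITE`, `fake_fetch`, and `expand` are hypothetical names, and the index deliberately lists itself to show the seen-set guard.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

# In-memory stand-in for a site: one index, two child URL sets.
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/products.xml</loc></sitemap>"
        # A self-reference, to exercise the cycle guard.
        "<sitemap><loc>https://example.com/sitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/post-1</loc></url>"
        "<url><loc>https://example.com/post-2</loc></url>"
        "</urlset>"
    ),
    "https://example.com/products.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/product-1</loc></url>"
        "</urlset>"
    ),
}


def fake_fetch(url: str) -> bytes:
    """Return the stored XML for a URL instead of making a request."""
    return FAKE_SITE[url].encode("utf-8")


def expand(url, fetch, seen=None):
    """Flatten a sitemap tree into a list of page URLs."""
    seen = set() if seen is None else seen
    if url in seen:
        return []  # already visited: skip to avoid infinite recursion
    seen.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):
        urls = []
        for loc in root.iterfind("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                urls.extend(expand(child, fetch, seen))
        return urls
    return [(loc.text or "").strip() for loc in root.iterfind("sm:url/sm:loc", NS)]


for page in expand("https://example.com/sitemap.xml", fake_fetch):
    print(page)
```

The three page URLs come out in document order, and the self-referencing entry is skipped instead of recursing forever.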
Step 4: Normalize and dedupe the crawl queue
Never send sitemap URLs straight into a production crawl without cleanup.
At minimum:
- strip whitespace
- ignore empty loc fields
- dedupe exact URLs
- optionally drop obvious tracking parameters
from urllib.parse import urlsplit, urlunsplit

def canonicalize_url(url: str) -> str:
    # Trim whitespace and drop the #fragment; the query string is kept as-is.
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

def build_queue(rows: list[dict]) -> list[dict]:
    out = []
    seen = set()
    for row in rows:
        loc = row.get("loc")
        if not loc:
            continue  # skip empty loc fields
        normalized = canonicalize_url(loc)
        if normalized in seen:
            continue  # exact duplicate after normalization
        seen.add(normalized)
        out.append(
            {
                "url": normalized,
                "lastmod": row.get("lastmod"),
            }
        )
    return out
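The optional step from the checklist, dropping obvious tracking parameters, could look like the sketch below. The parameter list (utm_* plus gclid and fbclid) is an assumption about what counts as "obvious", and `drop_tracking_params` is a hypothetical helper; adjust the list per target site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend or trim for the site you are scraping.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def is_tracking_param(name: str) -> bool:
    """True for parameters that identify a campaign, not a page."""
    lowered = name.lower()
    return lowered.startswith(TRACKING_PREFIXES) or lowered in TRACKING_NAMES


def drop_tracking_params(url: str) -> str:
    """Strip whitespace, the fragment, and tracking query parameters."""
    parts = urlsplit(url.strip())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking_param(key)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(drop_tracking_params("https://example.com/p?id=7&utm_source=x&gclid=abc#top"))
# https://example.com/p?id=7
```

One caveat: urlencode re-encodes the surviving parameters, so unusual escaping in the original query may come out normalized. That is usually what you want for deduping, but worth knowing.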
Full example
import csv
import gzip
import requests
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlsplit, urlunsplit

NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def find_sitemaps_from_robots(base_url: str) -> list[str]:
    robots_url = urljoin(base_url, "/robots.txt")
    r = requests.get(robots_url, timeout=(10, 30))
    r.raise_for_status()
    return [
        line.split(":", 1)[1].strip()
        for line in r.text.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]

def fetch_bytes(url: str) -> bytes:
    r = requests.get(url, timeout=(10, 60))
    r.raise_for_status()
    return r.content

def parse_xml_bytes(raw: bytes) -> ET.Element:
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return ET.fromstring(raw)

def expand_sitemap(url: str, seen: set[str] | None = None) -> list[dict]:
    if seen is None:
        seen = set()
    if url in seen:
        return []
    seen.add(url)
    root = parse_xml_bytes(fetch_bytes(url))
    tag = root.tag.lower()
    if tag.endswith("sitemapindex"):
        # Index file: recurse into every child sitemap.
        rows = []
        for node in root.findall("sm:sitemap", NAMESPACES):
            loc = node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip()
            if loc:
                rows.extend(expand_sitemap(loc, seen))
        return rows
    if tag.endswith("urlset"):
        # URL set: collect page URLs with their lastmod hints.
        return [
            {
                "loc": node.findtext("sm:loc", default="", namespaces=NAMESPACES).strip(),
                "lastmod": node.findtext("sm:lastmod", default="", namespaces=NAMESPACES).strip() or None,
            }
            for node in root.findall("sm:url", NAMESPACES)
        ]
    raise ValueError(f"Unexpected sitemap root tag: {root.tag}")

def canonicalize_url(url: str) -> str:
    # Trim whitespace and drop the #fragment.
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

def build_queue(rows: list[dict]) -> list[dict]:
    queue = []
    seen = set()
    for row in rows:
        loc = row.get("loc")
        if not loc:
            continue
        url = canonicalize_url(loc)
        if url in seen:
            continue
        seen.add(url)
        queue.append({"url": url, "lastmod": row.get("lastmod")})
    return queue

def write_csv(rows: list[dict], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "lastmod"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    base_url = "https://example.com"
    sitemap_urls = find_sitemaps_from_robots(base_url)
    all_rows = []
    for sitemap_url in sitemap_urls:
        all_rows.extend(expand_sitemap(sitemap_url))
    queue = build_queue(all_rows)
    write_csv(queue, "sitemap_queue.csv")
    print("queue size:", len(queue))
Practical sitemap scraping rules
1. Treat lastmod as a hint, not a promise
Some sites update it carefully. Others stamp the current time on everything. Use it for prioritization, not blind trust.
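One way to use lastmod for prioritization without trusting it blindly is to sort by it and let anything missing or unparseable sink to the bottom. This is a sketch under assumptions: `parse_lastmod` and `prioritize` are hypothetical helpers, date-only values are treated as UTC midnight, and partial dates such as a bare year are treated as unknown.

```python
from datetime import datetime, timezone

# Rows with no usable lastmod sort as if they were this old.
FLOOR = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_lastmod(value):
    """Parse a lastmod string into an aware datetime, or None if unusable."""
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    # Assumption: values without a timezone are treated as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def prioritize(queue):
    """Newest lastmod first; rows without a usable lastmod go last."""
    return sorted(
        queue,
        key=lambda row: parse_lastmod(row.get("lastmod")) or FLOOR,
        reverse=True,
    )


sample = [
    {"url": "a", "lastmod": "2024-01-01"},
    {"url": "b", "lastmod": None},
    {"url": "c", "lastmod": "2024-06-01T10:00:00Z"},
    {"url": "d", "lastmod": "not-a-date"},
]
print([row["url"] for row in prioritize(sample)])  # ['c', 'a', 'b', 'd']
```

Because this only reorders the queue, a site that stamps the current time on everything costs you nothing worse than a meaningless order.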
2. Segment queues by sitemap source
If a sitemap index splits content into:
- /post-sitemap.xml
- /category-sitemap.xml
- /product-sitemap.xml
keep that label. It is extremely useful later for:
- crawl frequency rules
- parser routing
- debugging weird sections
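Keeping the label can be as simple as deriving it from the child sitemap's filename and attaching it to every row. The sketch below assumes the "-sitemap" naming pattern shown above; `sitemap_label` and `tag_rows` are hypothetical helpers, and other sites will need their own rule.

```python
import posixpath
import re
from urllib.parse import urlsplit


def sitemap_label(sitemap_url: str) -> str:
    """Derive a short label such as 'post' from '/post-sitemap.xml'."""
    name = posixpath.basename(urlsplit(sitemap_url).path)
    # Peel off ".gz" first so "x.xml.gz" reduces the same way as "x.xml".
    for suffix in (".gz", ".xml"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    # Drop a trailing "-sitemap" or numbered "-sitemap2" part.
    name = re.sub(r"-sitemap\d*$", "", name)
    return name or "unknown"


def tag_rows(rows: list[dict], sitemap_url: str) -> list[dict]:
    """Return copies of the rows with a 'source' field added."""
    label = sitemap_label(sitemap_url)
    return [{**row, "source": label} for row in rows]


rows = tag_rows(
    [{"loc": "https://example.com/shoes", "lastmod": None}],
    "https://example.com/product-sitemap.xml",
)
print(rows[0]["source"])  # product
```

With a source field on each row, crawl frequency and parser routing become a lookup on that one value.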
3. Expect bad hygiene
Real-world sitemap scraping often uncovers:
- stale URLs
- redirected URLs
- non-canonical variants
- empty lastmod values
That is normal. The sitemap is still valuable even when it is imperfect.
4. Combine sitemaps with crawling, not instead of crawling
Best use:
- sitemap scraping for fast discovery
- ordinary crawling for pagination gaps, orphaned pages, and fresh links not yet listed
This is not a religion. It is a queue-building shortcut.
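A small sketch of how the two discovery sources can feed one queue while recording where each URL came from. `merge_discovery` is a hypothetical helper and the URLs are made up; the point is that URLs found only by crawling show you what the sitemap is missing.

```python
def merge_discovery(sitemap_urls, crawled_urls):
    """Map each URL to where it was found: 'sitemap', 'crawl', or 'both'."""
    in_sitemap = set(sitemap_urls)
    in_crawl = set(crawled_urls)
    sources = {}
    for url in sorted(in_sitemap | in_crawl):
        if url in in_sitemap and url in in_crawl:
            sources[url] = "both"
        elif url in in_sitemap:
            sources[url] = "sitemap"
        else:
            sources[url] = "crawl"
    return sources


merged = merge_discovery(
    sitemap_urls=["https://example.com/a", "https://example.com/b"],
    crawled_urls=["https://example.com/b", "https://example.com/page/2"],
)

# URLs the sitemap never listed, such as pagination pages.
crawl_only = [url for url, source in merged.items() if source == "crawl"]
print(crawl_only)  # ['https://example.com/page/2']
```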
When sitemap scraping is the wrong first move
It is less useful when:
- the target has no XML sitemap
- the sitemap is tiny but the site is mostly JS-driven behind search flows
- the content you need is not indexable public content at all
In those cases, normal crawl discovery or API inspection may be better.
But when a decent sitemap exists, ignoring it is usually wasteful.
Bottom line
Sitemap scraping is not glamorous, but it is one of the fastest ways to upgrade a scraper from noisy exploration to intentional crawling.
Do the boring work first:
- find the sitemap
- expand nested indexes
- parse gzip variants
- dedupe into a real queue
Once you do that, every downstream fetch gets cheaper and cleaner.