XPath for Web Scraping: The Practical Cheat Sheet
XPath is the power tool for scraping: it can match based on structure, text content, and relationships like siblings and ancestors. CSS selectors are great, but when you need precision, XPath is often the fastest way to express it.
This post is a practical cheat sheet you can keep open while you build scrapers.
XPath is a sharp tool, but stability also depends on your fetch layer. A ProxiesAPI-friendly fetch wrapper (timeouts, retries, rotation) keeps the pipeline stable while your XPath stays the same.
Setup (Python + lxml)
python3 -m venv .venv
source .venv/bin/activate
pip install lxml
from lxml import html
doc = html.fromstring("""
<html><body>
<div class="card">
<a class="title" href="/p/1">Hello</a>
<span class="price">$19</span>
</div>
</body></html>
""")
The two rules that prevent most bugs
- Use .// inside loops (relative XPath), not // (global).
- Avoid absolute, index-heavy paths like /html/body/div[2]/div[1]... unless you enjoy rework.
Absolute vs relative
Absolute (searches the whole document from the root, even when called on a child node):
//a
Relative (from a context node):
.//a
When you loop over cards, you almost always want relative selectors.
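To make the difference concrete, here is a minimal sketch. The two-card HTML is made up for illustration. It calls both forms on the same card element and compares what comes back.

```python
from lxml import html

# Made-up sample markup: two cards, one link each.
doc = html.fromstring("""
<html><body>
  <div class="card"><a href="/p/1">First</a></div>
  <div class="card"><a href="/p/2">Second</a></div>
</body></html>
""")

cards = doc.xpath("//div[@class='card']")
first_card = cards[0]

# "//a" ignores the context node and searches the whole document.
global_hits = first_card.xpath("//a/@href")

# ".//a" searches only the descendants of first_card.
relative_hits = first_card.xpath(".//a/@href")

print(global_hits)    # ['/p/1', '/p/2']
print(relative_hits)  # ['/p/1']
```

The global form silently returns links from every card, which is how "every row has the same title" bugs usually start.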
Core patterns (copy/paste)
By attribute:
//a[@href]
//div[@data-id="123"]
Contains / starts-with:
//a[contains(@href, "/product/")]
//a[starts-with(@href, "https://")]
Text matching with whitespace normalization:
//button[normalize-space(.)="Next"]
//a[contains(normalize-space(.), "Read more")]
Multiple conditions:
//a[@href and contains(@class, "title")]
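The patterns above can be tried against a small document. This is only a sketch: the markup and variable names are invented for the demo, and the button text is deliberately padded with whitespace to show why normalize-space helps.

```python
from lxml import html

# Invented sample markup covering each pattern once.
doc = html.fromstring("""
<html><body>
  <a class="title main" href="/product/1">  Read more
     about widgets </a>
  <a href="https://example.com/about">About</a>
  <a name="anchor-only">No link</a>
  <div data-id="123">Target</div>
  <button>
    Next
  </button>
</body></html>
""")

# By attribute
with_href = doc.xpath("//a[@href]")
by_data_id = doc.xpath("//div[@data-id='123']/text()")

# Contains / starts-with
product_links = doc.xpath("//a[contains(@href, '/product/')]/@href")
https_links = doc.xpath("//a[starts-with(@href, 'https://')]/@href")

# Text matching: the exact comparison fails on the padded text,
# the normalized comparison succeeds.
exact_next = doc.xpath("//button[.='Next']")
next_button = doc.xpath("//button[normalize-space(.)='Next']")
read_more = doc.xpath("//a[contains(normalize-space(.), 'Read more')]/@href")

# Multiple conditions
titled = doc.xpath("//a[@href and contains(@class, 'title')]/@href")

print(len(with_href), by_data_id, product_links, https_links)
print(len(exact_next), len(next_button), read_more, titled)
```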
Extracting text vs attributes
def first(xs):
    return xs[0] if xs else None
title = first(doc.xpath("//a[@class=\"title\"]/text()"))
href = first(doc.xpath("//a[@class=\"title\"]/@href"))
print(title, href)
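One caveat: /text() returns only the text nodes that are direct children of the element. When the link contains nested markup, string(.) or lxml's text_content() gives the full text. A small sketch with made-up markup:

```python
from lxml import html

# Made-up fragment: the link text is split by a nested <b> element.
doc = html.fromstring(
    '<div><a class="title" href="/p/1">Hello <b>World</b></a></div>'
)
a = doc.xpath("//a[@class='title']")[0]

direct_text = a.xpath("./text()")   # only direct text nodes
all_text = a.xpath("string(.)")     # concatenated text of all descendants
same_text = a.text_content()        # lxml.html equivalent of string(.)
href = a.get("href")                # attribute access on the element

print(direct_text)  # ['Hello ']
print(all_text)     # Hello World
print(href)         # /p/1
```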
Card scraping template
Loop cards, then select relative inside each card:
cards = doc.xpath("//div[contains(@class, \"card\")]")
rows = []
for c in cards:
    rows.append({
        "title": first(c.xpath(".//a[contains(@class,\"title\")]/text()")),
        "url": first(c.xpath(".//a[contains(@class,\"title\")]/@href")),
        "price": first(c.xpath(".//span[contains(@class,\"price\")]/text()")),
    })
print(rows)
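Note that contains(@class, "card") is a substring test, so it also matches classes such as card-footer. If that becomes a problem, a common XPath 1.0 idiom is to pad the class attribute with spaces and match the whole token. The sketch below uses made-up markup, and CARD is just an illustrative constant name.

```python
from lxml import html

# Made-up markup: one real card and one element whose class merely contains "card".
doc = html.fromstring("""
<html><body>
  <div class="card featured"><a class="title" href="/p/1">Hello</a></div>
  <div class="card-footer"><a class="title" href="/about">About</a></div>
</body></html>
""")

# Substring match: picks up both divs.
loose = doc.xpath("//div[contains(@class, 'card')]")

# Token match: pad with spaces so only the whole class name "card" matches.
CARD = "//div[contains(concat(' ', normalize-space(@class), ' '), ' card ')]"
strict = doc.xpath(CARD)

print(len(loose))   # 2
print(len(strict))  # 1
```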
Navigation: parents, ancestors, siblings
Parent:
//a[@class="title"]/..
Nearest ancestor div:
//a[@class="title"]/ancestor::div[1]
First paragraph after a heading:
//h2[1]/following-sibling::p[1]
Closest heading before a paragraph:
//p[1]/preceding-sibling::h2[1]
This is the XPath superpower: grabbing the value next to a label even when it has no stable class name.
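Here is a short sketch of those four axes in action. The markup and ids are invented for the demo.

```python
from lxml import html

# Invented markup: a card with a wrapped link, plus a heading followed by paragraphs.
doc = html.fromstring("""
<html><body>
  <div class="card" id="c1">
    <p><a class="title" href="/p/1">Hello</a></p>
  </div>
  <div class="info">
    <h2>Shipping</h2>
    <p>Ships in 2 days</p>
    <p>Free on large orders</p>
  </div>
</body></html>
""")

# Parent of the link (the wrapping <p>).
parent = doc.xpath("//a[@class='title']/..")[0]

# Nearest enclosing <div>, however deep the link is nested.
nearest_div = doc.xpath("//a[@class='title']/ancestor::div[1]")[0]

# First <p> that follows the heading at the same level.
after_h2 = doc.xpath("//h2[1]/following-sibling::p[1]/text()")

# Closest <h2> that precedes a first-child paragraph at the same level.
before_p = doc.xpath("//p[1]/preceding-sibling::h2[1]/text()")

print(parent.tag, nearest_div.get("id"))  # p c1
print(after_h2)                           # ['Ships in 2 days']
print(before_p)                           # ['Shipping']
```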
Label/value scraping
If the page looks like:
<div class="spec"><span>Weight</span><span>1.2 kg</span></div>
Then:
//div[@class="spec"][span[1][normalize-space(.)="Weight"]]/span[2]/text()
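Wrapped in a function, the same expression can look up any label. This is a sketch: spec_value is a hypothetical helper name, the second spec row is made up, and the label is passed as an XPath variable ($label), which lxml accepts as a keyword argument to xpath().

```python
from lxml import html

# The first row mirrors the snippet above; the second row is made up.
doc = html.fromstring("""
<html><body>
  <div class="spec"><span>Weight</span><span>1.2 kg</span></div>
  <div class="spec"><span> Color </span><span>Black</span></div>
</body></html>
""")

def spec_value(doc, label):
    """Return the value next to `label`, or None if absent (hypothetical helper)."""
    hits = doc.xpath(
        "//div[@class='spec'][span[1][normalize-space(.)=$label]]/span[2]/text()",
        label=label,  # bound to $label, so no manual quote escaping is needed
    )
    return hits[0].strip() if hits else None

print(spec_value(doc, "Weight"))  # 1.2 kg
print(spec_value(doc, "Color"))   # Black
print(spec_value(doc, "Height"))  # None
```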
When to prefer XPath over CSS selectors
Use XPath when you need:
- matching based on text
- sibling/ancestor traversal
- label/value extraction
- more expressive constraints than CSS alone
Keep your scraper architecture clean (fetch → parse → export). That way selectors stay focused, and you can evolve your network layer (including ProxiesAPI) without rewriting everything.
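One way to sketch that separation is shown below. The function names (parse, export, run, fake_fetch) are illustrative assumptions, not part of any library. The fetch step is passed in as a plain callable so the network layer can be swapped without touching the selectors, and here it is faked with a static string so the example runs offline.

```python
import csv
import io

from lxml import html

def first(xs):
    return xs[0] if xs else None

def parse(page_source):
    """Parse step: HTML string in, list of dicts out. No network code here."""
    doc = html.fromstring(page_source)
    rows = []
    for card in doc.xpath("//div[contains(@class, 'card')]"):
        rows.append({
            "title": first(card.xpath(".//a[contains(@class, 'title')]/text()")),
            "url": first(card.xpath(".//a[contains(@class, 'title')]/@href")),
            "price": first(card.xpath(".//span[contains(@class, 'price')]/text()")),
        })
    return rows

def export(rows, out):
    """Export step: write rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["title", "url", "price"])
    writer.writeheader()
    writer.writerows(rows)

def run(fetch, url, out):
    """Wire the steps together; `fetch` is any callable mapping url -> HTML text."""
    export(parse(fetch(url)), out)

def fake_fetch(url):
    # Stand-in for a real fetch wrapper (timeouts, retries, rotation).
    return """
    <html><body>
      <div class="card">
        <a class="title" href="/p/1">Hello</a>
        <span class="price">$19</span>
      </div>
    </body></html>
    """

buf = io.StringIO()
run(fake_fetch, "https://example.com/products", buf)
print(buf.getvalue())
```

Replacing fake_fetch with a real HTTP wrapper changes nothing in parse or export.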