XPath for Web Scraping: The Practical Cheat Sheet

XPath is the power tool for scraping: it can match based on structure, text content, and relationships like siblings and ancestors. CSS selectors are great, but when you need precision, XPath is often the fastest way to express it.

This post is a practical cheat sheet you can keep open while you build scrapers.

Make selectors boring (and reliable)

XPath is a sharp tool, but stability also depends on your fetch layer. A ProxiesAPI-friendly fetch wrapper (timeouts, retries, rotation) keeps the pipeline stable while your XPath stays the same.


Setup (Python + lxml)

python3 -m venv .venv
source .venv/bin/activate
pip install lxml

Then, in Python:

from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="card">
    <a class="title" href="/p/1">Hello</a>
    <span class="price">$19</span>
  </div>
</body></html>
""")

The two rules that prevent most bugs

  • Use .// inside loops (relative XPath), not // (global).
  • Avoid absolute, index-heavy paths like /html/body/div[2]/div[1]... unless you enjoy rework.
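
To illustrate the second rule, here is a minimal sketch of how an index-heavy path breaks after a small layout change. The two page snippets and the banner div are invented for the demonstration; only the lxml calls come from the setup above.

```python
from lxml import html

# Version 1 of a hypothetical page.
page_v1 = html.fromstring("""
<html><body>
  <div><div class="card"><a class="title" href="/p/1">Hello</a></div></div>
</body></html>
""")

# The same page after the site adds a banner <div> at the top of <body>.
page_v2 = html.fromstring("""
<html><body>
  <div class="banner">Sale!</div>
  <div><div class="card"><a class="title" href="/p/1">Hello</a></div></div>
</body></html>
""")

brittle = "/html/body/div[1]/div[1]/a/@href"
robust = "//div[contains(@class, 'card')]//a[contains(@class, 'title')]/@href"

brittle_v1 = page_v1.xpath(brittle)  # finds the link
brittle_v2 = page_v2.xpath(brittle)  # div[1] is now the banner, so nothing matches
robust_v1 = page_v1.xpath(robust)
robust_v2 = page_v2.xpath(robust)

print(brittle_v1, brittle_v2)  # ['/p/1'] []
print(robust_v1, robust_v2)    # ['/p/1'] ['/p/1']
```

The attribute-anchored expression survives because it describes what the element is, not where it currently sits.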

Absolute vs relative

Absolute (searches the whole document from the root, even when you call it on a context node):

//a

Relative (from a context node):

.//a

When you loop over cards, you almost always want relative selectors.
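
A short sketch of the bug this prevents, using a made-up two-card page. Calling `//a` on a card element still searches the entire document, so every iteration returns the same first link.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="card"><a class="title" href="/p/1">First</a></div>
  <div class="card"><a class="title" href="/p/2">Second</a></div>
</body></html>
""")

cards = doc.xpath("//div[@class='card']")

# Bug: "//a" ignores the card and searches the whole document,
# so each iteration picks up the first link on the page.
wrong = [c.xpath("//a/text()")[0] for c in cards]

# Fix: ".//a" searches only the descendants of the current card.
right = [c.xpath(".//a/text()")[0] for c in cards]

print(wrong)  # ['First', 'First']
print(right)  # ['First', 'Second']
```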


Core patterns (copy/paste)

By attribute:

//a[@href]
//div[@data-id="123"]

Contains / starts-with:

//a[contains(@href, "/product/")]
//a[starts-with(@href, "https://")]

Text matching with whitespace normalization:

//button[normalize-space(.)="Next"]
//a[contains(normalize-space(.), "Read more")]

Multiple conditions:

//a[@href and contains(@class, "title")]
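
The patterns above can be tried against a small sample page. The HTML and the example.com URL below are assumptions made up for the demonstration. The last query shows why `normalize-space(.)` is worth the extra typing: an exact `text()` comparison fails as soon as the markup contains stray whitespace.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div data-id="123">
    <a class="title main" href="https://example.com/product/1">Widget</a>
    <a href="/about">About</a>
    <a name="top">No href here</a>
    <button>
      Next
    </button>
  </div>
</body></html>
""")

with_href = doc.xpath("//a[@href]")                       # 2 of the 3 links
by_data_id = doc.xpath('//div[@data-id="123"]')           # the wrapper div
product_links = doc.xpath('//a[contains(@href, "/product/")]/@href')
https_links = doc.xpath('//a[starts-with(@href, "https://")]/@href')
next_buttons = doc.xpath('//button[normalize-space(.)="Next"]')
title_links = doc.xpath('//a[@href and contains(@class, "title")]/text()')

# Exact text comparison misses the button because of the surrounding newlines.
exact_match = doc.xpath('//button[text()="Next"]')

print(len(with_href), product_links, title_links)
print(len(next_buttons), len(exact_match))  # 1 0
```

One caveat: `contains(@class, "title")` is a substring test, so it would also match a class such as `subtitle`.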

Extracting text vs attributes

def first(xs):
    return xs[0] if xs else None

title = first(doc.xpath('//a[@class="title"]/text()'))
href = first(doc.xpath('//a[@class="title"]/@href'))
print(title, href)
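
One thing to watch when extracting text: `text()` returns only the text nodes that are direct children of the element. If the link contains nested markup, part of the visible text is skipped. A minimal sketch, using an invented link with a nested `<b>` tag:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <a class="title" href="/p/1">Hello <b>world</b></a>
</body></html>
""")

# text() sees only the link's own text node, not the text inside <b>.
direct_text = doc.xpath('//a[@class="title"]/text()')

# string(...) concatenates all descendant text of the first matching node.
full_text = doc.xpath('string(//a[@class="title"])')

# lxml's text_content() does the same from the Python side.
link = doc.xpath('//a[@class="title"]')[0]
via_method = link.text_content()

print(direct_text)  # ['Hello ']
print(full_text)    # Hello world
print(via_method)   # Hello world
```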

Card scraping template

Loop cards, then select relative inside each card:

cards = doc.xpath('//div[contains(@class, "card")]')
rows = []
for c in cards:
    rows.append({
        "title": first(c.xpath('.//a[contains(@class, "title")]/text()')),
        "url": first(c.xpath('.//a[contains(@class, "title")]/@href')),
        "price": first(c.xpath('.//span[contains(@class, "price")]/text()')),
    })
print(rows)

Parent, ancestor, and sibling axes

Move up to a parent or ancestor, or sideways to a sibling:

//a[@class="title"]/..
//a[@class="title"]/ancestor::div[1]
//h2[1]/following-sibling::p[1]
//p[1]/preceding-sibling::h2[1]

This is the XPath superpower: grabbing the value next to a label even when it has no stable class name.
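
Here is a small sketch that runs the four axis expressions above against an invented card, so you can see what each one returns. The sample HTML is an assumption for illustration.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="card">
    <h2>Specs</h2>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <a class="title" href="/p/1">Hello</a>
  </div>
</body></html>
""")

# ".." steps up to the parent element.
parent = doc.xpath('//a[@class="title"]/..')[0]

# ancestor::div[1] is the nearest enclosing <div>, however deep the link is.
nearest_div = doc.xpath('//a[@class="title"]/ancestor::div[1]')[0]

# following-sibling::p[1] is the first <p> after the heading.
next_p = doc.xpath('//h2[1]/following-sibling::p[1]/text()')

# preceding-sibling::h2[1] is the closest <h2> before the paragraph.
prev_h2 = doc.xpath('//p[1]/preceding-sibling::h2[1]/text()')

print(parent.tag, nearest_div.get("class"))  # div card
print(next_p, prev_h2)  # ['First paragraph'] ['Specs']
```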


Label/value scraping

If the page looks like:

<div class="spec"><span>Weight</span><span>1.2 kg</span></div>

Then:

//div[@class="spec"][span[1][normalize-space(.)="Weight"]]/span[2]/text()
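
Wrapped in a helper, the same expression can look up any label. This is a sketch: `spec_value` is a hypothetical name, and the second spec row is invented to show the whitespace handling. The `$label` placeholder is an XPath variable, which lxml fills from a keyword argument so the label never has to be spliced into the expression string.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="spec"><span>Weight</span><span>1.2 kg</span></div>
  <div class="spec"><span> Color </span><span>Black</span></div>
</body></html>
""")


def spec_value(tree, label):
    """Return the value paired with `label`, or None if the label is absent."""
    hits = tree.xpath(
        '//div[@class="spec"][span[1][normalize-space(.)=$label]]/span[2]/text()',
        label=label,  # bound to $label in the expression
    )
    return hits[0].strip() if hits else None


print(spec_value(doc, "Weight"))  # 1.2 kg
print(spec_value(doc, "Color"))   # Black
print(spec_value(doc, "Height"))  # None
```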

When to prefer XPath over CSS selectors

Use XPath when you need:

  • matching based on text
  • sibling/ancestor traversal
  • label/value extraction
  • more expressive constraints than CSS alone

Keep your scraper architecture clean (fetch → parse → export). That way selectors stay focused, and you can evolve your network layer (including ProxiesAPI) without rewriting everything.
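
One possible shape for that split, as a rough sketch. The function names are illustrative, and `fetch` uses plain `urllib` as a stand-in for whatever network layer you run (proxy, retries, rotation). Only `fetch` would change if you swapped that layer; the selectors in `parse` stay the same.

```python
import csv
import io
import urllib.request

from lxml import html


def fetch(url, timeout=10):
    """Network layer (placeholder): return the page body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def parse(page):
    """Selector layer: turn HTML (str or bytes) into a list of dicts."""
    doc = html.fromstring(page)
    rows = []
    for card in doc.xpath('//div[contains(@class, "card")]'):
        titles = card.xpath('.//a[contains(@class, "title")]/text()')
        urls = card.xpath('.//a[contains(@class, "title")]/@href')
        rows.append({
            "title": titles[0] if titles else None,
            "url": urls[0] if urls else None,
        })
    return rows


def export(rows, out):
    """Output layer: write rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)


# Parse and export can be exercised without touching the network.
sample = '<html><body><div class="card"><a class="title" href="/p/1">Hello</a></div></body></html>'
rows = parse(sample)
buf = io.StringIO()
export(rows, buf)
print(buf.getvalue())
```

Because `parse` accepts a string, you can test selectors against saved HTML files and never hit the live site while iterating.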
