XPath for Web Scraping: The Practical Cheat Sheet

XPath is the power tool for scraping: it can match based on structure, text content, and relationships like siblings and ancestors. CSS selectors are great, but when you need precision, XPath is often the fastest way to express it.

This post is a practical cheat sheet you can keep open while you build scrapers.

Make selectors boring (and reliable)

XPath is a sharp tool, but stability also depends on your fetch layer. A ProxiesAPI-friendly fetch wrapper (timeouts, retries, rotation) keeps the pipeline stable while your XPath stays the same.


Setup (Python + lxml)

python3 -m venv .venv
source .venv/bin/activate
pip install lxml

Then, in Python, parse a small sample document to experiment with:

from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="card">
    <a class="title" href="/p/1">Hello</a>
    <span class="price">$19</span>
  </div>
</body></html>
""")

The two rules that prevent most bugs

  • Use .// inside loops (relative XPath), not // (global).
  • Avoid absolute, index-heavy paths like /html/body/div[2]/div[1]... unless you enjoy rework.

Absolute vs relative

Absolute (starts at the document root, even when you call it on a child node):

//a

Relative (from a context node):

.//a

When you loop over cards, you almost always want relative selectors.
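To make the difference concrete, here is a minimal sketch with two cards. The sample markup and variable names are assumptions for illustration only; the point is what `//` does when it is evaluated from inside a loop.

```python
from lxml import html

# Two cards, so a global query inside the loop is visibly wrong.
doc = html.fromstring("""
<html><body>
  <div class="card"><a class="title" href="/p/1">Hello</a></div>
  <div class="card"><a class="title" href="/p/2">World</a></div>
</body></html>
""")

cards = doc.xpath('//div[@class="card"]')

# Bug: // restarts from the document root, so every card returns the
# first link on the page.
global_hrefs = [c.xpath('//a/@href')[0] for c in cards]

# Fix: .// searches only below the current card.
relative_hrefs = [c.xpath('.//a/@href')[0] for c in cards]

print(global_hrefs)    # ['/p/1', '/p/1']
print(relative_hrefs)  # ['/p/1', '/p/2']
```

The global version fails silently: it returns plausible data, just the wrong data, which is why this bug tends to survive until someone checks the output by hand.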


Core patterns (copy/paste)

By attribute:

//a[@href]
//div[@data-id="123"]

Contains / starts-with:

//a[contains(@href, "/product/")]
//a[starts-with(@href, "https://")]

Text matching with whitespace normalization:

//button[normalize-space(.)="Next"]
//a[contains(normalize-space(.), "Read more")]

Multiple conditions:

//a[@href and contains(@class, "title")]
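The patterns above can be tried against a small document. The markup below (including the example.com URL) is made up for illustration; it is only a sketch of how each expression behaves in lxml.

```python
from lxml import html

# Hypothetical sample markup, chosen so each pattern has something to match.
doc = html.fromstring("""
<html><body>
  <div data-id="123">
    <a class="title big" href="/product/1">Read   more about A</a>
    <a href="https://example.com/x">External</a>
    <a name="anchor-only">No href</a>
    <button>
      Next
    </button>
  </div>
</body></html>
""")

# By attribute: presence, then exact value.
with_href = doc.xpath('//a[@href]')
by_data_id = doc.xpath('//div[@data-id="123"]')

# Substring and prefix matches on an attribute value.
product_links = doc.xpath('//a[contains(@href, "/product/")]/@href')
https_links = doc.xpath('//a[starts-with(@href, "https://")]/@href')

# normalize-space trims the ends and collapses internal runs of whitespace,
# so the padded button and the triple-spaced link text both match.
next_buttons = doc.xpath('//button[normalize-space(.)="Next"]')
read_more = doc.xpath('//a[contains(normalize-space(.), "Read more")]')

# Multiple conditions joined with "and".
title_links = doc.xpath('//a[@href and contains(@class, "title")]/@href')

print(len(with_href), len(by_data_id))  # 2 1
print(product_links, https_links)
print(len(next_buttons), len(read_more), title_links)
```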

Extracting text vs attributes

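def first(xs):
    # xpath() returns a list; take the first hit or None when nothing matched
    return xs[0] if xs else None

# text() selects text nodes, @href selects the attribute value
title = first(doc.xpath('//a[@class="title"]/text()'))
href = first(doc.xpath('//a[@class="title"]/@href'))
print(title, href)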
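One caveat worth seeing in code: `text()` only returns the text nodes that are direct children of the element. When the text is split across nested tags, `string(.)` or `normalize-space(.)` may be closer to what you want. A small sketch, using made-up markup:

```python
from lxml import html

# Hypothetical markup: the link text is split between a nested <b>
# element and a trailing text node.
doc = html.fromstring("""
<html><body>
  <a class="title" href="/p/1"><b>Hello</b> world</a>
</body></html>
""")

link = doc.xpath('//a[@class="title"]')[0]

# text() sees only the direct text child, not the text inside <b>.
direct_text = link.xpath('text()')

# string(.) concatenates all descendant text; normalize-space(.) also
# trims the ends and collapses whitespace runs.
full_text = link.xpath('string(.)')
clean_text = link.xpath('normalize-space(.)')

# Attributes: via XPath (a list) or via the element API (a string or None).
href_list = link.xpath('@href')
href_value = link.get('href')

print(direct_text)            # [' world']
print(full_text, clean_text)  # Hello world Hello world
print(href_list, href_value)  # ['/p/1'] /p/1
```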

Card scraping template

Loop over the cards, then use relative selectors inside each card:

# reuses doc and first() from the snippets above
cards = doc.xpath('//div[contains(@class, "card")]')
rows = []
for c in cards:
    # .// keeps every lookup scoped to the current card
    rows.append({
        "title": first(c.xpath('.//a[contains(@class, "title")]/text()')),
        "url": first(c.xpath('.//a[contains(@class, "title")]/@href')),
        "price": first(c.xpath('.//span[contains(@class, "price")]/text()')),
    })
print(rows)
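A note on `contains(@class, "card")`: it is a plain substring test, so it can also match class names such as `discard`. If that becomes a problem, a common XPath 1.0 idiom is to pad the class attribute with spaces and match a whole token. The markup below is invented for illustration:

```python
from lxml import html

# Hypothetical markup: "discard" contains the substring "card".
doc = html.fromstring("""
<html><body>
  <div class="card">A</div>
  <div class="card featured">B</div>
  <div class="discard">C</div>
</body></html>
""")

# Substring match: also picks up the unrelated "discard" element.
loose = doc.xpath('//div[contains(@class, "card")]/text()')

# Token match: pad the normalized class list with spaces, then look
# for " card " as a whole word.
strict = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " card ")]/text()'
)

print(loose)   # ['A', 'B', 'C']
print(strict)  # ['A', 'B']
```

The longer expression is uglier, so it is reasonable to start with plain `contains` and switch only when a page actually has colliding class names.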

Parent, ancestor, and sibling axes

Walk up to a parent or ancestor, or sideways to a sibling:

//a[@class="title"]/..
//a[@class="title"]/ancestor::div[1]
//h2[1]/following-sibling::p[1]
//p[1]/preceding-sibling::h2[1]

This is the XPath superpower: grabbing the value next to a label even when it has no stable class name.
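Here is how those axis expressions behave in lxml. The heading and paragraphs in the sample markup are assumptions added so the sibling axes have something to select.

```python
from lxml import html

# Hypothetical markup with a card, a heading, and two paragraphs.
doc = html.fromstring("""
<html><body>
  <div class="card"><a class="title" href="/p/1">Hello</a></div>
  <h2>Specs</h2>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</body></html>
""")

# Parent step and nearest <div> ancestor; here both land on the card.
parent = doc.xpath('//a[@class="title"]/..')[0]
ancestor = doc.xpath('//a[@class="title"]/ancestor::div[1]')[0]

# First <p> sibling after the heading.
after_h2 = doc.xpath('//h2[1]/following-sibling::p[1]/text()')

# Nearest <h2> sibling before the first paragraph. On reverse axes,
# [1] means "closest to the context node", not "first in the document".
before_p = doc.xpath('//p[1]/preceding-sibling::h2[1]/text()')

print(parent.tag, parent.get('class'))      # div card
print(ancestor.tag, ancestor.get('class'))  # div card
print(after_h2)                             # ['First paragraph']
print(before_p)                             # ['Specs']
```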


Label/value scraping

If the page looks like:

<div class="spec"><span>Weight</span><span>1.2 kg</span></div>

Then:

//div[@class="spec"][span[1][normalize-space(.)="Weight"]]/span[2]/text()
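Wrapped in Python, the same expression can take the label as a parameter. The sketch below passes it as an XPath variable (lxml accepts keyword arguments to `xpath()` for this), which avoids building the expression with string formatting. The helper name `spec_value` and the second spec row are hypothetical.

```python
from lxml import html

# The first row is the example from the text; the second is invented.
doc = html.fromstring("""
<html><body>
  <div class="spec"><span>Weight</span><span>1.2 kg</span></div>
  <div class="spec"><span> Color </span><span>Black</span></div>
</body></html>
""")

def spec_value(doc, label):
    """Return the value next to a label, or None if the label is absent."""
    hits = doc.xpath(
        '//div[@class="spec"][span[1][normalize-space(.)=$label]]/span[2]/text()',
        label=label,  # bound to $label inside the expression
    )
    return hits[0] if hits else None

weight = spec_value(doc, "Weight")
color = spec_value(doc, "Color")    # matches despite the padded label
missing = spec_value(doc, "Height")

# Alternatively, collect every label/value pair in one pass.
specs = {
    d.xpath('normalize-space(span[1])'): d.xpath('normalize-space(span[2])')
    for d in doc.xpath('//div[@class="spec"]')
}

print(weight, color, missing)  # 1.2 kg Black None
print(specs)
```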

When to prefer XPath over CSS selectors

Use XPath when you need:

  • matching based on text
  • sibling/ancestor traversal
  • label/value extraction
  • more expressive constraints than CSS alone

Keep your scraper architecture clean (fetch → parse → export). That way selectors stay focused, and you can evolve your network layer (including ProxiesAPI) without rewriting everything.
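As a rough sketch of that fetch → parse → export split, the following keeps each stage in its own function. Everything here is an assumption for illustration: the function names, the retry policy, the CSV columns, and the example.com URL. The fetch step uses only the standard library; a proxy-aware client would replace that one function and leave `parse` untouched.

```python
import csv
import io
import time
import urllib.request

from lxml import html

def fetch(url, timeout=10, retries=3, backoff=1.0):
    """Fetch layer: timeout plus simple retries. Proxy logic would live here."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:  # URLError and socket timeouts are OSError subclasses
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))

def first(xs):
    return xs[0] if xs else None

def parse(page):
    """Parse layer: only XPath lives here."""
    doc = html.fromstring(page)
    return [
        {
            "title": first(c.xpath('.//a[contains(@class, "title")]/text()')),
            "url": first(c.xpath('.//a[contains(@class, "title")]/@href')),
            "price": first(c.xpath('.//span[contains(@class, "price")]/text()')),
        }
        for c in doc.xpath('//div[contains(@class, "card")]')
    ]

def export(rows, out):
    """Export layer: write rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["title", "url", "price"])
    writer.writeheader()
    writer.writerows(rows)

def run(url, out, fetcher=fetch):
    export(parse(fetcher(url)), out)

# Offline demo: swap in a stub fetcher so nothing touches the network.
SAMPLE = b"""<html><body>
  <div class="card">
    <a class="title" href="/p/1">Hello</a>
    <span class="price">$19</span>
  </div>
</body></html>"""

buf = io.StringIO()
run("https://example.com/catalog", buf, fetcher=lambda url: SAMPLE)
print(buf.getvalue())
```

Passing the fetcher in as an argument is one way to test the selectors against saved HTML without making requests.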
