XPath for Web Scraping: The Practical Cheat Sheet

May 20, 2026 · guide · #xpath, #web-scraping, #python, #lxml, #selectors

XPath is the power tool for scraping: it can match based on structure, text content, and relationships like siblings and ancestors. CSS selectors are great, but when you need precision, XPath is often the fastest way to express it.

This post is a practical cheat sheet you can keep open while you build scrapers.

Make selectors boring (and reliable)

XPath is a sharp tool, but stability also depends on your fetch layer. A ProxiesAPI-friendly fetch wrapper (timeouts, retries, rotation) keeps the pipeline stable while your XPath stays the same.

Get 1,000 free API calls View pricing

Setup (Python + lxml)

python3 -m venv .venv
source .venv/bin/activate
pip install lxml

from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="card">
    <a class="title" href="/p/1">Hello</a>
    <span class="price">$19</span>
  </div>
</body></html>
""")

The two rules that prevent most bugs

Use .// inside loops (relative XPath), not // (global).
Avoid absolute, index-heavy paths like /html/body/div[2]/div[1]... unless you enjoy rework.

Absolute vs relative

Absolute (from the document root):

//a

Relative (from a context node):

.//a

When you loop over cards, you almost always want relative selectors.
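To see why this matters, here is a minimal sketch that runs both forms from the same context node. The two-card HTML and the variable names (`page`, `first_card`) are made up for illustration.

```python
from lxml import html

# Hypothetical sample page with two cards.
page = html.fromstring("""
<html><body>
  <div class="card"><a href="/p/1">First</a></div>
  <div class="card"><a href="/p/2">Second</a></div>
</body></html>
""")

first_card = page.xpath('//div[@class="card"]')[0]

# "//a" ignores the context node and searches the whole document.
global_hrefs = first_card.xpath("//a/@href")

# ".//a" only searches below the context node (this one card).
relative_hrefs = first_card.xpath(".//a/@href")

print(global_hrefs)    # ['/p/1', '/p/2']
print(relative_hrefs)  # ['/p/1']
```

Inside a loop, the global form would quietly give every card the same first link, which is the classic "all rows have identical values" bug.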

Core patterns (copy/paste)

By attribute:

//a[@href]
//div[@data-id="123"]

Contains / starts-with:

//a[contains(@href, "/product/")]
//a[starts-with(@href, "https://")]

Text matching with whitespace normalization:

//button[normalize-space(.)="Next"]
//a[contains(normalize-space(.), "Read more")]

Multiple conditions:

//a[@href and contains(@class, "title")]
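The patterns above can be tried against a small document. This is a sketch with invented sample HTML; only the XPath expressions come from the cheat sheet.

```python
from lxml import html

# Hypothetical sample markup with messy whitespace on purpose.
page = html.fromstring("""
<html><body>
  <a class="title main" href="/product/42">  Read   more  </a>
  <a href="https://example.com/about">About</a>
  <a name="anchor-only">No link</a>
  <button>
    Next
  </button>
</body></html>
""")

# Attribute presence: the third <a> has no href, so it is skipped.
with_href = page.xpath("//a[@href]")

# Substring and prefix tests on an attribute value.
product_links = page.xpath('//a[contains(@href, "/product/")]/@href')
https_links = page.xpath('//a[starts-with(@href, "https://")]/@href')

# normalize-space() trims the ends and collapses inner runs of whitespace,
# so "\n    Next\n  " compares equal to "Next".
next_buttons = page.xpath('//button[normalize-space(.)="Next"]')
read_more = page.xpath('//a[contains(normalize-space(.), "Read more")]/@href')

# Two conditions joined with "and".
titled = page.xpath('//a[@href and contains(@class, "title")]/@href')

print(len(with_href), product_links, https_links, len(next_buttons), read_more, titled)
```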

Extracting text vs attributes

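def first(xs):
    # XPath queries return lists; take the first match or None.
    return xs[0] if xs else None

# text() selects the link's text node, @href selects the attribute value.
title = first(doc.xpath('//a[@class="title"]/text()'))
href = first(doc.xpath('//a[@class="title"]/@href'))
print(title, href)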
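One caveat worth knowing: `/text()` returns only the text nodes that are direct children of the element. If the link wraps part of its label in another tag, `string()` (or `normalize-space()`) collects the full text instead. A small sketch, using a made-up link with nested markup:

```python
from lxml import html

# Hypothetical link whose label is split across a nested <b> tag.
page = html.fromstring(
    '<html><body>'
    '<a class="title" href="/p/1"><b>Hello</b> world</a>'
    '</body></html>'
)

# Only direct child text nodes: the text inside <b> is missed.
direct_text = page.xpath('//a[@class="title"]/text()')

# string() concatenates all descendant text of the first matched node.
full_text = page.xpath('string(//a[@class="title"])')

# Attributes come back as plain strings in a list.
hrefs = page.xpath('//a[@class="title"]/@href')

print(direct_text)  # [' world']
print(full_text)    # Hello world
print(hrefs)        # ['/p/1']
```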

Card scraping template

Loop cards, then select relative inside each card:

# Find every card once with a global query...
cards = doc.xpath('//div[contains(@class, "card")]')
rows = []
for c in cards:
    # ...then use relative queries (.//) so each row only sees its own card.
    rows.append({
        "title": first(c.xpath('.//a[contains(@class, "title")]/text()')),
        "url": first(c.xpath('.//a[contains(@class, "title")]/@href')),
        "price": first(c.xpath('.//span[contains(@class, "price")]/text()')),
    })
print(rows)
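A note on `contains(@class, "card")`: it is a substring test, so it also matches classes such as `card-footer`. If that causes false positives, a common XPath 1.0 idiom is to pad the class attribute with spaces and match the whole token. The sketch below uses invented markup and a hypothetical `CARD` constant to show the difference.

```python
from lxml import html

# Hypothetical page: one real card and one element with a look-alike class.
page = html.fromstring("""
<html><body>
  <div class="card featured"><a class="title" href="/p/1">Hello</a></div>
  <div class="card-footer"><a class="title" href="/about">About</a></div>
</body></html>
""")

# Substring match: also picks up "card-footer".
loose = page.xpath('//div[contains(@class, "card")]')

# Token match: wrap the class list in spaces and look for " card ".
CARD = '//div[contains(concat(" ", normalize-space(@class), " "), " card ")]'
strict = page.xpath(CARD)

print(len(loose))                      # 2
print(len(strict))                     # 1
print(strict[0].xpath(".//a/@href"))   # ['/p/1']
```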

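Axes: parent, ancestor, siblings

Move up to the parent or the nearest enclosing div, or sideways to the next or previous sibling:

//a[@class="title"]/..
//a[@class="title"]/ancestor::div[1]
//h2[1]/following-sibling::p[1]
//p[1]/preceding-sibling::h2[1]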

This is the XPath superpower: grabbing the value next to a label even when it has no stable class name.
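Here is a short sketch of those four axis expressions run against a made-up page, so you can see what each one lands on.

```python
from lxml import html

# Hypothetical page: a card plus a heading followed by two paragraphs.
page = html.fromstring("""
<html><body>
  <div class="card"><a class="title" href="/p/1">Hello</a></div>
  <h2>Shipping</h2>
  <p>Ships in 2 days.</p>
  <p>Free returns.</p>
</body></html>
""")

# ".." steps up to the parent element.
parent = page.xpath('//a[@class="title"]/..')[0]

# ancestor::div[1] is the nearest enclosing <div> (here, the same element).
ancestor = page.xpath('//a[@class="title"]/ancestor::div[1]')[0]

# following-sibling::p[1] is the first <p> after the heading at the same level.
after_heading = page.xpath('//h2[1]/following-sibling::p[1]/text()')

# preceding-sibling::h2[1] is the closest <h2> before the paragraph.
before_para = page.xpath('//p[1]/preceding-sibling::h2[1]/text()')

print(parent.tag, parent.get("class"))  # div card
print(ancestor.get("class"))            # card
print(after_heading)                    # ['Ships in 2 days.']
print(before_para)                      # ['Shipping']
```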

Label/value scraping

If the page looks like:

<div class="spec"><span>Weight</span><span>1.2 kg</span></div>

Then:

//div[@class="spec"][span[1][normalize-space(.)="Weight"]]/span[2]/text()
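If you need several labels, one option is to wrap the lookup in a helper. The sketch below is a variation on the expression above: it uses `following-sibling::span[1]` instead of a fixed `span[2]`, and passes the label through an XPath variable, which lxml binds from a keyword argument. The `spec_value` helper and the second spec row are assumptions for illustration.

```python
from lxml import html

# Hypothetical spec list; the second label has stray whitespace on purpose.
page = html.fromstring("""
<html><body>
  <div class="spec"><span>Weight</span><span>1.2 kg</span></div>
  <div class="spec"><span> Color </span><span>Black</span></div>
</body></html>
""")

def spec_value(tree, label):
    """Return the value next to a label, or None if the label is absent."""
    # $label is an XPath variable; lxml fills it from the label= keyword,
    # so there is no need to build the expression with string formatting.
    values = tree.xpath(
        '//div[@class="spec"]/span[normalize-space(.)=$label]'
        '/following-sibling::span[1]/text()',
        label=label,
    )
    return values[0].strip() if values else None

print(spec_value(page, "Weight"))  # 1.2 kg
print(spec_value(page, "Color"))   # Black
print(spec_value(page, "Height"))  # None
```

Using a variable also sidesteps quoting problems when a label contains quote characters.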

When to prefer XPath over CSS selectors

Use XPath when you need:

matching based on text
sibling/ancestor traversal
label/value extraction
more expressive constraints than CSS alone

Keep your scraper architecture clean (fetch → parse → export). That way selectors stay focused, and you can evolve your network layer (including ProxiesAPI) without rewriting everything.
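As a rough sketch of that split, assuming plain `urllib` for the network layer (swap in whatever client or proxy service you actually use), the three stages can be kept as separate functions. The function names, retry policy, and sample HTML are illustrative, not a prescribed design.

```python
import csv
import io
import time
import urllib.request

from lxml import html

def fetch(url, retries=3, timeout=10):
    """Network layer: return the page body as text, retrying on errors."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except OSError:  # URLError and socket timeouts are OSError subclasses
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff

def first(xs):
    return xs[0] if xs else None

def parse(page_text):
    """Parse layer: all XPath lives here and never touches the network."""
    tree = html.fromstring(page_text)
    rows = []
    for card in tree.xpath('//div[contains(@class, "card")]'):
        rows.append({
            "title": first(card.xpath('.//a[contains(@class, "title")]/text()')),
            "url": first(card.xpath('.//a[contains(@class, "title")]/@href')),
            "price": first(card.xpath('.//span[contains(@class, "price")]/text()')),
        })
    return rows

def export(rows, fp):
    """Export layer: write rows as CSV to any file-like object."""
    writer = csv.DictWriter(fp, fieldnames=["title", "url", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Offline demo: parse and export can be exercised without calling fetch().
sample = """
<html><body>
  <div class="card">
    <a class="title" href="/p/1">Hello</a>
    <span class="price">$19</span>
  </div>
</body></html>
"""
rows = parse(sample)
buf = io.StringIO()
export(rows, buf)
print(buf.getvalue())
```

Because `parse` takes text rather than a URL, you can test selectors against saved HTML and change `fetch` later without touching them.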
