How to Scrape E-Commerce Websites: A Practical Guide
E-commerce scraping sounds simple (“just grab the price”), until you ship a crawler and it fails on day 2.
The real problems show up when you try to scrape a catalog at scale:
- pagination that changes based on filters
- out-of-stock variants
- price formats and discounts
- bot protection (403/429), CAPTCHAs, and sudden HTML changes
- anti-scraping tricks (invisible duplicates, lazy-loaded data)
This guide is a practical playbook. You’ll learn a repeatable approach to scrape product data responsibly and reliably, without turning your code into a fragile mess.
E-commerce sites block aggressively once you crawl category pages at scale. ProxiesAPI gives you a stable proxy layer + rotation so retries work and your pipeline doesn’t die mid-crawl.
1) Define your data contract first (don’t scrape “everything”)
Before writing any code, decide what a “product row” means in your system.
A solid baseline schema:
- product_id (or canonical URL)
- name
- brand
- category
- price
- currency
- in_stock
- image_url
- rating / review_count (optional)
- scraped_at
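As a sketch, the contract can live in code so it's enforceable. ProductRow and validate_row below are illustrative names, not part of any library:

from typing import TypedDict, Optional

class ProductRow(TypedDict):
    product_id: str          # or canonical URL
    name: str
    brand: Optional[str]
    category: Optional[str]
    price: Optional[float]
    currency: Optional[str]
    in_stock: Optional[bool]
    image_url: Optional[str]
    rating: Optional[float]      # optional
    review_count: Optional[int]  # optional
    scraped_at: str              # ISO 8601 timestamp

REQUIRED = ("product_id", "name", "scraped_at")

def validate_row(row: dict) -> list[str]:
    # Return a list of problems instead of raising, so you can batch-report
    problems = [f for f in REQUIRED if not row.get(f)]
    if row.get("price") is not None and not row.get("currency"):
        problems.append("currency missing while price is set")
    return problems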
Why this matters:
- you can validate output automatically
- changes in site HTML become detectable (missing fields)
- you avoid scope creep
2) Choose the right scraping surface: category pages vs product pages
Most e-commerce sites have at least two surfaces:
Category/search listing pages
Pros:
- contain many products per request (efficient)
- good for discovery
Cons:
- often missing details (variants, full description)
Product detail pages (PDPs)
Pros:
- richest data
- clearer selectors (often)
Cons:
- expensive to crawl at scale
A proven pipeline:
- Crawl category/search pages → collect product URLs/ids
- Crawl PDPs for a subset (or for new/changed products)
- Store results, compute diffs, alert on changes (diff sketch below)
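The diff step is plain dictionary bookkeeping. A minimal sketch, assuming two snapshots keyed by canonical URL:

def diff_products(old: dict[str, dict], new: dict[str, dict]) -> dict:
    # Each snapshot maps canonical URL -> product row
    added = [u for u in new if u not in old]
    removed = [u for u in old if u not in new]
    price_changed = [
        u for u in new
        if u in old and new[u].get("price") != old[u].get("price")
    ]
    return {"added": added, "removed": removed, "price_changed": price_changed}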
3) Start with HTML parsing. Fall back to APIs only if needed.
Many e-commerce sites render pages server-side (or partially server-side). If the HTML contains the data, parse it.
When you need more:
- check for embedded JSON (application/ld+json, __NEXT_DATA__, window.__APOLLO_STATE__)
- check XHR endpoints in DevTools
Embedded JSON is often the most stable source without reverse-engineering private APIs.
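For example, on Next.js storefronts the server-rendered state ships in a script tag. A minimal sketch:

import json
from bs4 import BeautifulSoup

def extract_next_data(html: str) -> dict | None:
    # Next.js embeds page state as JSON in <script id="__NEXT_DATA__">
    soup = BeautifulSoup(html, "lxml")
    tag = soup.find("script", id="__NEXT_DATA__")
    if not tag:
        return None
    try:
        return json.loads(tag.get_text())
    except json.JSONDecodeError:
        return None

window.__APOLLO_STATE__ follows the same idea, except the JSON usually sits inside an inline assignment you slice out of the script text first.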
4) A production-grade fetch layer (timeouts, retries, and backoff)
Scrapers fail at the network layer more often than the parsing layer.
Here’s a reusable fetch layer in Python.
import os
import time
import random

import requests

TIMEOUT = (10, 30)  # connect, read

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()

PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")

def proxies():
    # Route through the proxy only when the env var is set
    if not PROXIESAPI_PROXY_URL:
        return None
    return {"http": PROXIESAPI_PROXY_URL, "https": PROXIESAPI_PROXY_URL}

def fetch(url: str, tries: int = 5) -> str:
    last_err = None
    for attempt in range(1, tries + 1):
        try:
            r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT, proxies=proxies())
            # Retry on common transient/blocking statuses
            if r.status_code in (403, 408, 429, 500, 502, 503, 504):
                raise requests.HTTPError(f"status {r.status_code}")
            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            if attempt == tries:
                break  # don't sleep after the final attempt
            # Exponential backoff with jitter, capped at ~30s
            backoff = min(30, 2 ** attempt) + random.random()
            print(f"attempt {attempt}/{tries} failed: {e}; sleeping {backoff:.1f}s")
            time.sleep(backoff)
    raise RuntimeError(f"failed after {tries} tries: {last_err}")
This code is intentionally boring. That’s the point.
- Timeouts prevent hanging workers.
- Retries handle transient failures.
- PROXIESAPI_PROXY_URL lets you switch proxying on/off without changing your code.
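Usage is then a one-liner (the URL below is a stand-in, not a real store):

# Set PROXIESAPI_PROXY_URL in the environment to enable proxying;
# leave it unset for direct requests.
html = fetch("https://example.com/collections/shoes")  # hypothetical category URL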
5) Selector strategy: prefer semantics over CSS class names
E-commerce sites love to ship new CSS class names.
Prefer selectors based on:
- stable attributes: data-*, itemprop, aria-label
- structured data: application/ld+json
- URL patterns (for category/product links)
Example: extract product cards from a category page.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

def clean_text(x: str | None) -> str | None:
    if not x:
        return None
    return re.sub(r"\s+", " ", x).strip() or None

def parse_category(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    # Heuristic: many stores link product cards with "/products/" or "/product/"
    for a in soup.select('a[href*="/product"], a[href*="/products/"]'):
        href = a.get("href")
        if not href:
            continue
        url = href if href.startswith("http") else urljoin(base_url, href)
        name = clean_text(a.get("aria-label") or a.get_text(" ", strip=True))
        img = a.select_one("img")
        # data-src covers lazy-loaded images
        img_url = (img.get("src") or img.get("data-src")) if img else None
        out.append({
            "name": name,
            "url": url,
            "image": img_url,
        })
    # Dedupe by URL
    deduped = {p["url"]: p for p in out if p.get("url")}
    return list(deduped.values())
You’ll tailor the URL heuristics to your target platform (Shopify, WooCommerce, Magento, custom).
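For example, many Shopify stores expose a public products.json feed you can read instead of scraping HTML. Some shops disable or throttle it, so treat this as a best-case shortcut:

import json

def shopify_products(store_url: str, page: int = 1, limit: int = 250) -> list[dict]:
    # Reuses fetch() from section 4; 250 is commonly the per-page maximum
    url = f"{store_url.rstrip('/')}/products.json?limit={limit}&page={page}"
    data = json.loads(fetch(url))
    return data.get("products", [])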
6) Pagination: handle 4 patterns
Pagination is the #1 reason e-commerce crawlers miss data.
Common patterns:
- ?page=2 query param
- cursor-based (?cursor=...)
- “Load more” button (XHR)
- infinite scroll (XHR)
Simple ?page= example
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def with_page(url: str, page: int) -> str:
    # Set/overwrite the ?page= param while preserving the rest of the URL
    u = urlparse(url)
    q = parse_qs(u.query)
    q["page"] = [str(page)]
    return urlunparse((u.scheme, u.netloc, u.path, u.params, urlencode(q, doseq=True), u.fragment))

def crawl_category(category_url: str, pages: int = 5) -> list[dict]:
    all_products = []
    seen = set()
    for p in range(1, pages + 1):
        url = with_page(category_url, p)
        html = fetch(url)
        batch = parse_category(html, base_url=category_url)
        for prod in batch:
            if not prod.get("url") or prod["url"] in seen:
                continue
            seen.add(prod["url"])
            all_products.append(prod)
        print(f"page {p}: {len(batch)} products (unique: {len(all_products)})")
    return all_products
For cursor/infinite scroll, you’ll need DevTools to capture the XHR and call it directly.
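Once you've captured the endpoint, the loop itself is simple. A minimal sketch; endpoint, cursor, items, and nextCursor are placeholder names you'd replace with whatever the real XHR uses:

import json

def crawl_cursor_api(endpoint: str, max_pages: int = 10) -> list[dict]:
    # Hypothetical cursor-paginated JSON API; reuses fetch() from section 4
    items: list[dict] = []
    cursor = None
    for _ in range(max_pages):
        url = endpoint if cursor is None else f"{endpoint}?cursor={cursor}"
        data = json.loads(fetch(url))
        items.extend(data.get("items", []))
        cursor = data.get("nextCursor")
        if not cursor:  # no cursor means we've reached the last page
            break
    return items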
7) Product detail pages: parse structured data first
Many PDPs include schema.org JSON-LD.
import json

def parse_jsonld_product(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.get_text(strip=True) or "")
        except Exception:
            continue
        # JSON-LD can be a single object or a list of objects
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if isinstance(node, dict) and node.get("@type") in ("Product", "ProductGroup"):
                return node
    return None
When JSON-LD exists, you often get:
- name
- image
- brand
- offers → price/currency/availability
That’s gold.
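A minimal normalizer for that node, mapping it onto the schema from section 1 (the shape handling is heuristic; real-world JSON-LD varies):

def normalize_jsonld(node: dict) -> dict:
    # offers, brand, and image each come in several schema.org shapes
    offers = node.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    brand = node.get("brand")
    if isinstance(brand, dict):
        brand = brand.get("name")
    image = node.get("image")
    if isinstance(image, list):
        image = image[0] if image else None
    # Missing availability counts as out of stock here -- adjust to taste
    availability = str(offers.get("availability") or "")
    return {
        "name": node.get("name"),
        "brand": brand,
        "price": offers.get("price"),
        "currency": offers.get("priceCurrency"),
        "in_stock": availability.endswith("InStock"),
        "image_url": image,
    }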
8) Data QA: treat missing fields as a breaking change
A reliable scraper includes QA checks. Examples:
- price should be numeric for >80% of products (see the parsing sketch below)
- currency should be present when price is present
- url should be unique
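Getting price numeric means parsing display strings first. A rough sketch that handles the common “$1,299.00” and “1.299,00 €” formats, returning None rather than guessing wrong:

import re
from decimal import Decimal, InvalidOperation

def parse_price(text: str | None) -> Decimal | None:
    if not text:
        return None
    m = re.search(r"\d[\d.,]*", text)
    if not m:
        return None
    raw = m.group(0)
    if "," in raw and "." in raw:
        # The later separator is the decimal point; the other marks thousands
        if raw.rfind(",") > raw.rfind("."):
            raw = raw.replace(".", "").replace(",", ".")
        else:
            raw = raw.replace(",", "")
    elif "," in raw:
        # "49,90" is a decimal; "1,299" is thousands -- guess by digit count
        head, _, tail = raw.rpartition(",")
        raw = head.replace(",", "") + "." + tail if len(tail) == 2 else raw.replace(",", "")
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None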
A simple QA report:
def qa_report(products: list[dict]):
    n = len(products)
    with_price = sum(1 for p in products if p.get("price") is not None)
    with_name = sum(1 for p in products if p.get("name"))
    print("total:", n)
    print("name coverage:", with_name, f"({with_name/n:.0%})" if n else "")
    print("price coverage:", with_price, f"({with_price/n:.0%})" if n else "")
When coverage drops, your crawler should alert you.
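One way to turn the report into an alert is a hard gate that fails the run (the threshold is illustrative):

def assert_coverage(products: list[dict], field: str, min_ratio: float = 0.8):
    # Raise when a field's coverage drops below the threshold, so a broken
    # selector fails loudly instead of silently shipping empty rows
    n = len(products)
    ok = sum(1 for p in products if p.get(field) is not None)
    if n and ok / n < min_ratio:
        raise RuntimeError(f"{field} coverage {ok/n:.0%} is below {min_ratio:.0%}")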
9) Rotating proxies: when you actually need them
You need rotation when:
- you paginate through many category pages
- you scrape multiple categories
- you run frequently (hourly/daily)
- the site rate limits aggressively
You don’t need rotation for:
- a one-off scrape of a handful of products
- a site that explicitly offers a public API
ProxiesAPI fits as the proxy layer in the fetch function above. Keep it configurable via environment variables.
10) Practical advice (from real crawlers)
- Start with a single category and crawl 2 pages.
- Log HTML samples when parsing fails.
- Cache responses while iterating on selectors (sketch below).
- Keep concurrency low; scale slowly.
- Store results with scraped_at so you can diff.
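A minimal disk cache for that middle tip, wrapping fetch() from section 4 (fetch_cached and the .cache directory are illustrative):

import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")

def fetch_cached(url: str) -> str:
    # Serve repeat requests from disk so selector iterations don't re-hit the site
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html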
Summary
E-commerce scraping is less about clever parsing and more about building a pipeline that survives:
- HTML changes
- pagination quirks
- transient network failures
- anti-bot protections
Use a strong fetch layer, parse semantically, validate outputs, and add proxy rotation only when scale demands it.
Platform specifics vary (Shopify, WooCommerce, Magento, custom), so tailor the selectors and pagination logic to your target store and test against a real category URL before scaling up.