How to Scrape E-Commerce Websites: A Practical Guide

E-commerce scraping sounds simple (“just grab the price”), until you ship a crawler and it fails on day 2.

The real problems show up when you try to scrape a catalog at scale:

  • pagination that changes based on filters
  • out-of-stock variants
  • price formats and discounts
  • bot protection (403/429), CAPTCHAs, and sudden HTML changes
  • anti-scraping tricks (invisible duplicates, lazy-loaded data)

This guide is a practical playbook. You’ll learn a repeatable approach to scrape product data responsibly and reliably, without turning your code into a fragile mess.

When your ecommerce crawl grows, ProxiesAPI keeps it steady

E-commerce sites block aggressively once you crawl category pages at scale. ProxiesAPI gives you a stable proxy layer + rotation so retries work and your pipeline doesn’t die mid-crawl.


1) Define your data contract first (don’t scrape “everything”)

Before writing any code, decide what a “product row” means in your system.

A solid baseline schema:

  • product_id (or canonical URL)
  • name
  • brand
  • category
  • price
  • currency
  • in_stock
  • image_url
  • rating / review_count (optional)
  • scraped_at

Why this matters:

  • you can validate output automatically
  • changes in site HTML become detectable (missing fields)
  • you avoid scope creep
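
Here is that baseline expressed as a dataclass, a minimal sketch you can adapt; the validate helper is a hypothetical example of turning the contract into automatic checks, not a library API.

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProductRow:
    product_id: str                     # or the canonical URL
    name: str
    brand: str | None = None
    category: str | None = None
    price: float | None = None
    currency: str | None = None
    in_stock: bool | None = None
    image_url: str | None = None
    rating: float | None = None
    review_count: int | None = None
    scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def validate(row: ProductRow) -> list[str]:
    # Return a list of problems; an empty list means the row passes.
    problems = []
    if not row.product_id:
        problems.append("missing product_id")
    if not row.name:
        problems.append("missing name")
    if row.price is not None and not row.currency:
        problems.append("price without currency")
    return problems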

2) Choose the right scraping surface: category pages vs product pages

Most ecommerce sites have at least two surfaces:

Category/search listing pages

Pros:

  • contain many products per request (efficient)
  • good for discovery

Cons:

  • often missing details (variants, full description)

Product detail pages (PDPs)

Pros:

  • richest data
  • clearer selectors (often)

Cons:

  • expensive to crawl at scale (one request per product)

A proven pipeline:

  1. Crawl category/search pages → collect product URLs/ids
  2. Crawl PDPs for a subset (or for new/changed products)
  3. Store results, compute diffs, alert on changes
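
Step 3 is where the value compounds. A minimal sketch of the diffing idea, assuming each crawl snapshot is a dict keyed by product URL with rows shaped like the schema from section 1:

def diff_snapshots(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list]:
    # Compare two crawl snapshots keyed by product URL.
    added = [url for url in new if url not in old]
    removed = [url for url in old if url not in new]
    price_changes = [
        (url, old[url].get("price"), new[url].get("price"))
        for url in new
        if url in old and old[url].get("price") != new[url].get("price")
    ]
    return {"added": added, "removed": removed, "price_changes": price_changes}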

3) Start with HTML parsing. Fall back to APIs only if needed.

Many ecommerce sites render pages server-side (or partially server-side). If the HTML contains the data, parse it.

When you need more:

  • check for embedded JSON (application/ld+json, __NEXT_DATA__, window.__APOLLO_STATE__)
  • check XHR endpoints in DevTools

Embedded JSON is often the most stable data source, and it doesn’t require reverse-engineering private APIs.
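
For example, Next.js storefronts embed their page state in a script tag with id __NEXT_DATA__. A hedged sketch of pulling it out; where the product data lives inside that payload varies per site, so inspect it before writing any extraction logic:

import json
from bs4 import BeautifulSoup


def extract_next_data(html: str) -> dict | None:
    # Return the parsed __NEXT_DATA__ payload if the page is a Next.js app.
    soup = BeautifulSoup(html, "lxml")
    script = soup.find("script", id="__NEXT_DATA__")
    if not script or not script.string:
        return None
    try:
        return json.loads(script.string)
    except json.JSONDecodeError:
        return None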


4) A production-grade fetch layer (timeouts, retries, and backoff)

Scrapers fail at the network layer more often than at the parsing layer.

Here’s a reusable fetch layer in Python.

import os
import time
import random
import requests

TIMEOUT = (10, 30)  # connect, read

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()

PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")


def proxies():
    if not PROXIESAPI_PROXY_URL:
        return None
    return {"http": PROXIESAPI_PROXY_URL, "https": PROXIESAPI_PROXY_URL}


def fetch(url: str, tries: int = 5) -> str:
    last_err = None

    for attempt in range(1, tries + 1):
        try:
            r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT, proxies=proxies())

            # Retry on common failure statuses
            if r.status_code in (403, 408, 429, 500, 502, 503, 504):
                raise requests.HTTPError(f"status {r.status_code}")

            r.raise_for_status()
            return r.text

        except Exception as e:
            last_err = e
            if attempt == tries:
                break  # don't sleep after the final failed attempt
            backoff = min(30, 2 ** attempt) + random.random()
            print(f"attempt {attempt}/{tries} failed: {e}; sleeping {backoff:.1f}s")
            time.sleep(backoff)

    raise RuntimeError(f"failed after {tries} tries: {last_err}")

This code is intentionally boring. That’s the point.

  • Timeouts prevent hanging workers.
  • Retries handle transient failures.
  • PROXIESAPI_PROXY_URL lets you switch proxying on/off without changing your code.
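
Usage is a one-liner once the environment variable is set (both URLs below are placeholders):

# Optional: export PROXIESAPI_PROXY_URL="http://user:pass@proxy.example.com:8080" in your shell
html = fetch("https://example-store.com/collections/shoes")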

5) Selector strategy: prefer semantics over CSS class names

Ecommerce sites love to ship new CSS class names, so selectors keyed to class names rot fast.

Prefer selectors based on:

  • stable attributes: data-*, itemprop, aria-label
  • structured data: application/ld+json
  • URL patterns (for category/product links)

Example: extract product cards from a category page.

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re


def clean_text(x: str | None) -> str | None:
    if not x:
        return None
    return re.sub(r"\s+", " ", x).strip() or None


def parse_category(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []

    # Heuristic: many stores link product cards with "/products/" or "/product/"
    for a in soup.select('a[href*="/product"], a[href*="/products/"]'):
        href = a.get("href")
        if not href:
            continue
        url = href if href.startswith("http") else urljoin(base_url, href)

        name = clean_text(a.get("aria-label") or a.get_text(" ", strip=True))

        img = a.select_one("img")
        img_url = (img.get("src") or img.get("data-src")) if img else None

        out.append({
            "name": name,
            "url": url,
            "image": img_url,
        })

    # Dedupe by URL
    deduped = {p["url"]: p for p in out if p.get("url")}
    return list(deduped.values())

You’ll tailor the URL heuristics to your target platform (Shopify, WooCommerce, Magento, custom).


6) Pagination: handle 4 patterns

Pagination is the #1 reason ecommerce crawlers miss data.

Common patterns:

  1. ?page=2 query param
  2. cursor-based (?cursor=...)
  3. “Load more” button (XHR)
  4. infinite scroll (XHR)

Simple ?page= example

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def with_page(url: str, page: int) -> str:
    u = urlparse(url)
    q = parse_qs(u.query)
    q["page"] = [str(page)]
    return urlunparse((u.scheme, u.netloc, u.path, u.params, urlencode(q, doseq=True), u.fragment))


def crawl_category(category_url: str, pages: int = 5) -> list[dict]:
    all_products = []
    seen = set()

    for p in range(1, pages + 1):
        url = with_page(category_url, p)
        html = fetch(url)
        batch = parse_category(html, base_url=category_url)
        if not batch:
            # An empty page usually means we've run past the last page
            break

        for prod in batch:
            if not prod.get("url") or prod["url"] in seen:
                continue
            seen.add(prod["url"])
            all_products.append(prod)

        print(f"page {p}: {len(batch)} products (unique: {len(all_products)})")

    return all_products

For cursor/infinite scroll, you’ll need DevTools to capture the XHR and call it directly.
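
A hedged sketch of what a cursor loop usually looks like once you’ve found the endpoint. The endpoint URL, query parameter, and response keys (items, next_cursor) are made up here; substitute whatever the real XHR uses:

import json


def crawl_cursor(api_url: str, max_pages: int = 20) -> list[dict]:
    # Follow a cursor-style JSON endpoint until it stops returning a next cursor.
    items, cursor = [], None

    for _ in range(max_pages):
        url = api_url if cursor is None else f"{api_url}?cursor={cursor}"
        data = json.loads(fetch(url))      # reuse the fetch layer from section 4

        items.extend(data.get("items", []))
        cursor = data.get("next_cursor")   # hypothetical key; check the real response
        if not cursor:
            break

    return items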


7) Product detail pages: parse structured data first

Many PDPs include schema.org JSON-LD.

import json


def parse_jsonld_product(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")

    for script in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.get_text(strip=True) or "")
        except Exception:
            continue

        # JSON-LD can be a list, a single node, or nodes wrapped in "@graph"
        if isinstance(data, dict) and "@graph" in data:
            nodes = data["@graph"]
        elif isinstance(data, list):
            nodes = data
        else:
            nodes = [data]
        for node in nodes:
            if isinstance(node, dict) and node.get("@type") in ("Product", "ProductGroup"):
                return node

    return None

When JSON-LD exists, you often get:

  • name
  • image
  • brand
  • offers → price/currency/availability

That’s gold.
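
A small sketch that flattens a JSON-LD Product node into the schema from section 1. Offers can be a single object or a list, brand can be a string or a nested object, and availability is usually a schema.org URL, so the mapping below is deliberately defensive:

def product_from_jsonld(node: dict) -> dict:
    # Map a schema.org Product node onto the flat schema from section 1.
    offers = node.get("offers") or {}
    if isinstance(offers, list):           # variant products often list several offers
        offers = offers[0] if offers else {}

    try:
        price = float(offers.get("price"))
    except (TypeError, ValueError):
        price = None

    brand = node.get("brand")
    if isinstance(brand, dict):
        brand = brand.get("name")

    image = node.get("image")
    if isinstance(image, list):
        image = image[0] if image else None

    availability = str(offers.get("availability") or "")

    return {
        "name": node.get("name"),
        "brand": brand,
        "price": price,
        "currency": offers.get("priceCurrency"),
        "in_stock": "InStock" in availability,   # e.g. "https://schema.org/InStock"
        "image_url": image,
    }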


8) Data QA: treat missing fields as a breaking change

A reliable scraper includes QA checks. Examples:

  • price should be numeric for >80% of products
  • currency should be present when price is present
  • url should be unique

A simple QA report:


def qa_report(products: list[dict]):
    n = len(products)
    with_price = sum(1 for p in products if p.get("price") is not None)
    with_name = sum(1 for p in products if p.get("name"))

    print("total:", n)
    print("name coverage:", with_name, f"({with_name/n:.0%})" if n else "")
    print("price coverage:", with_price, f"({with_price/n:.0%})" if n else "")

When coverage drops, your crawler should alert you.
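
One way to make that actionable is to fail loudly when coverage falls below a threshold; the 80% figure mirrors the check above, and the raise can be swapped for whatever alerting channel you use:

def assert_coverage(products: list[dict], field_name: str, min_ratio: float = 0.8) -> None:
    # Raise if too few products have the given field populated.
    if not products:
        raise RuntimeError("no products scraped at all")
    ratio = sum(1 for p in products if p.get(field_name) is not None) / len(products)
    if ratio < min_ratio:
        raise RuntimeError(f"{field_name} coverage dropped to {ratio:.0%} (threshold {min_ratio:.0%})")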


9) Rotating proxies: when you actually need them

You need rotation when:

  • you paginate through many category pages
  • you scrape multiple categories
  • you run frequently (hourly/daily)
  • the site rate limits aggressively

You don’t need rotation for:

  • a one-off scrape of a handful of products
  • a site that explicitly offers a public API

ProxiesAPI fits as the proxy layer in the fetch function above. Keep it configurable via environment variables.


10) Practical advice (from real crawlers)

  • Start with a single category and crawl 2 pages.
  • Log HTML samples when parsing fails.
  • Cache responses while iterating on selectors.
  • Keep concurrency low; scale slowly.
  • Store results with scraped_at so you can diff.
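
For the caching tip, here is a minimal disk cache wrapped around the fetch layer from section 4; the cache directory and hashing scheme are arbitrary choices:

import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)


def cached_fetch(url: str) -> str:
    # Fetch a URL, reusing a local copy while you iterate on selectors.
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"

    if path.exists():
        return path.read_text(encoding="utf-8")

    html = fetch(url)                  # fetch() from section 4
    path.write_text(html, encoding="utf-8")
    return html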

Summary

Ecommerce scraping is less about clever parsing and more about building a pipeline that survives:

  • HTML changes
  • pagination quirks
  • transient network failures
  • anti-bot protections

Use a strong fetch layer, parse semantically, validate outputs, and add proxy rotation only when scale demands it.

The snippets here are deliberately generic. Tailor the selectors and pagination logic to your target platform (Shopify, WooCommerce, Magento, or custom) and to a real category URL before running at scale.
