How to Scrape E-Commerce Websites: A Practical Guide
E-commerce scraping sounds simple (“just grab the price”), until you ship a crawler and it fails on day 2.
The real problems show up when you try to scrape a catalog at scale:
- pagination that changes based on filters
- out-of-stock variants
- price formats and discounts
- bot protection (403/429), CAPTCHAs, and sudden HTML changes
- anti-scraping tricks (invisible duplicates, lazy-loaded data)
This guide is a practical playbook. You’ll learn a repeatable approach to scrape product data responsibly and reliably, without turning your code into a fragile mess.
E-commerce sites block aggressively once you crawl category pages at scale. ProxiesAPI gives you a stable proxy layer + rotation so retries work and your pipeline doesn’t die mid-crawl.
1) Define your data contract first (don’t scrape “everything”)
Before writing any code, decide what a “product row” means in your system.
A solid baseline schema:
- product_id (or canonical URL)
- name
- brand
- category
- price
- currency
- in_stock
- image_url
- rating / review_count (optional)
- scraped_at
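As a sketch, the contract can live in code so it's enforceable. ProductRow and validate_row below are illustrative names, not part of any library:

from typing import TypedDict, Optional

class ProductRow(TypedDict):
    product_id: str          # or canonical URL
    name: str
    brand: Optional[str]
    category: Optional[str]
    price: Optional[float]
    currency: Optional[str]
    in_stock: Optional[bool]
    image_url: Optional[str]
    rating: Optional[float]      # optional
    review_count: Optional[int]  # optional
    scraped_at: str              # ISO 8601 timestamp

REQUIRED = ("product_id", "name", "scraped_at")

def validate_row(row: dict) -> list[str]:
    # Return a list of problems instead of raising, so you can batch-report
    problems = [f for f in REQUIRED if not row.get(f)]
    if row.get("price") is not None and not row.get("currency"):
        problems.append("currency missing while price is set")
    return problems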
Why this matters:
- you can validate output automatically
- changes in site HTML become detectable (missing fields)
- you avoid scope creep
2) Choose the right scraping surface: category pages vs product pages
Most e-commerce sites have at least two surfaces:
Category/search listing pages
Pros:
- contain many products per request (efficient)
- good for discovery
Cons:
- often missing details (variants, full description)
Product detail pages (PDPs)
Pros:
- richest data
- clearer selectors (often)
Cons:
- expensive to crawl at scale
A proven pipeline:
- Crawl category/search pages → collect product URLs/ids
- Crawl PDPs for a subset (or for new/changed products)
- Store results, compute diffs, alert on changes (diff sketch below)
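The diff step is plain dictionary bookkeeping. A minimal sketch, assuming two snapshots keyed by canonical URL:

def diff_products(old: dict[str, dict], new: dict[str, dict]) -> dict:
    # Each snapshot maps canonical URL -> product row
    added = [u for u in new if u not in old]
    removed = [u for u in old if u not in new]
    price_changed = [
        u for u in new
        if u in old and new[u].get("price") != old[u].get("price")
    ]
    return {"added": added, "removed": removed, "price_changed": price_changed}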
3) Start with HTML parsing. Fall back to APIs only if needed.
Many e-commerce sites render pages server-side (or partially server-side). If the HTML contains the data, parse it.
When you need more:
- check for embedded JSON (application/ld+json, __NEXT_DATA__, window.__APOLLO_STATE__)
- check XHR endpoints in DevTools
Embedded JSON is often the most stable source without reverse-engineering private APIs.
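For example, on Next.js storefronts the server-rendered state ships in a script tag. A minimal sketch:

import json
from bs4 import BeautifulSoup

def extract_next_data(html: str) -> dict | None:
    # Next.js embeds page state as JSON in <script id="__NEXT_DATA__">
    soup = BeautifulSoup(html, "lxml")
    tag = soup.find("script", id="__NEXT_DATA__")
    if not tag:
        return None
    try:
        return json.loads(tag.get_text())
    except json.JSONDecodeError:
        return None

window.__APOLLO_STATE__ follows the same idea, except the JSON usually sits inside an inline assignment you slice out of the script text first.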
4) A production-grade fetch layer (timeouts, retries, and backoff)
Scrapers fail at the network layer more often than the parsing layer.
Here’s a reusable fetch layer in Python.
import os
import time
import random

import requests

TIMEOUT = (10, 30)  # connect, read

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()

PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")

def proxies():
    # Route through the proxy only when the env var is set
    if not PROXIESAPI_PROXY_URL:
        return None
    return {"http": PROXIESAPI_PROXY_URL, "https": PROXIESAPI_PROXY_URL}

def fetch(url: str, tries: int = 5) -> str:
    last_err = None
    for attempt in range(1, tries + 1):
        try:
            r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT, proxies=proxies())
            # Retry on common transient/blocking statuses
            if r.status_code in (403, 408, 429, 500, 502, 503, 504):
                raise requests.HTTPError(f"status {r.status_code}")
            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            if attempt == tries:
                break  # don't sleep after the final attempt
            # Exponential backoff with jitter, capped at ~30s
            backoff = min(30, 2 ** attempt) + random.random()
            print(f"attempt {attempt}/{tries} failed: {e}; sleeping {backoff:.1f}s")
            time.sleep(backoff)
    raise RuntimeError(f"failed after {tries} tries: {last_err}")
This code is intentionally boring. That’s the point.
- Timeouts prevent hanging workers.
- Retries handle transient failures.
- PROXIESAPI_PROXY_URL lets you switch proxying on/off without changing your code.
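Usage is then a one-liner (the URL below is a stand-in, not a real store):

# Set PROXIESAPI_PROXY_URL in the environment to enable proxying;
# leave it unset for direct requests.
html = fetch("https://example.com/collections/shoes")  # hypothetical category URL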
5) Selector strategy: prefer semantics over CSS class names
E-commerce sites love to ship new CSS class names.
Prefer selectors based on:
- stable attributes: data-*, itemprop, aria-label
- structured data: application/ld+json
- URL patterns (for category/product links)
Example: extract product cards from a category page.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

def clean_text(x: str | None) -> str | None:
    if not x:
        return None
    return re.sub(r"\s+", " ", x).strip() or None

def parse_category(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    # Heuristic: many stores link product cards with "/products/" or "/product/"
    for a in soup.select('a[href*="/product"], a[href*="/products/"]'):
        href = a.get("href")
        if not href:
            continue
        url = href if href.startswith("http") else urljoin(base_url, href)
        name = clean_text(a.get("aria-label") or a.get_text(" ", strip=True))
        img = a.select_one("img")
        # data-src covers lazy-loaded images
        img_url = (img.get("src") or img.get("data-src")) if img else None
        out.append({
            "name": name,
            "url": url,
            "image": img_url,
        })
    # Dedupe by URL
    deduped = {p["url"]: p for p in out if p.get("url")}
    return list(deduped.values())
You’ll tailor the URL heuristics to your target platform (Shopify, WooCommerce, Magento, custom).
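For example, many Shopify stores expose a public products.json feed you can read instead of scraping HTML. Some shops disable or throttle it, so treat this as a best-case shortcut:

import json

def shopify_products(store_url: str, page: int = 1, limit: int = 250) -> list[dict]:
    # Reuses fetch() from section 4; 250 is commonly the per-page maximum
    url = f"{store_url.rstrip('/')}/products.json?limit={limit}&page={page}"
    data = json.loads(fetch(url))
    return data.get("products", [])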
6) Pagination: handle 4 patterns
Pagination is the #1 reason e-commerce crawlers miss data.
Common patterns:
- ?page=2 query param
- cursor-based (?cursor=...)
- “Load more” button (XHR)
- infinite scroll (XHR)
Simple ?page= example
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def with_page(url: str, page: int) -> str:
    # Set/overwrite the ?page= param while preserving the rest of the URL
    u = urlparse(url)
    q = parse_qs(u.query)
    q["page"] = [str(page)]
    return urlunparse((u.scheme, u.netloc, u.path, u.params, urlencode(q, doseq=True), u.fragment))

def crawl_category(category_url: str, pages: int = 5) -> list[dict]:
    all_products = []
    seen = set()
    for p in range(1, pages + 1):
        url = with_page(category_url, p)
        html = fetch(url)
        batch = parse_category(html, base_url=category_url)
        for prod in batch:
            if not prod.get("url") or prod["url"] in seen:
                continue
            seen.add(prod["url"])
            all_products.append(prod)
        print(f"page {p}: {len(batch)} products (unique: {len(all_products)})")
    return all_products
For cursor/infinite scroll, you’ll need DevTools to capture the XHR and call it directly.
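Once you've captured the endpoint, the loop itself is simple. A minimal sketch; endpoint, cursor, items, and nextCursor are placeholder names you'd replace with whatever the real XHR uses:

import json

def crawl_cursor_api(endpoint: str, max_pages: int = 10) -> list[dict]:
    # Hypothetical cursor-paginated JSON API; reuses fetch() from section 4
    items: list[dict] = []
    cursor = None
    for _ in range(max_pages):
        url = endpoint if cursor is None else f"{endpoint}?cursor={cursor}"
        data = json.loads(fetch(url))
        items.extend(data.get("items", []))
        cursor = data.get("nextCursor")
        if not cursor:  # no cursor means we've reached the last page
            break
    return items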
7) Product detail pages: parse structured data first
Many PDPs include schema.org JSON-LD.
import json

def parse_jsonld_product(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.get_text(strip=True) or "")
        except Exception:
            continue
        # JSON-LD can be a single object or a list of objects
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if isinstance(node, dict) and node.get("@type") in ("Product", "ProductGroup"):
                return node
    return None
When JSON-LD exists, you often get:
- name
- image
- brand
- offers → price/currency/availability
That’s gold.
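A minimal normalizer for that node, mapping it onto the schema from section 1 (the shape handling is heuristic; real-world JSON-LD varies):

def normalize_jsonld(node: dict) -> dict:
    # offers, brand, and image each come in several schema.org shapes
    offers = node.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    brand = node.get("brand")
    if isinstance(brand, dict):
        brand = brand.get("name")
    image = node.get("image")
    if isinstance(image, list):
        image = image[0] if image else None
    # Missing availability counts as out of stock here -- adjust to taste
    availability = str(offers.get("availability") or "")
    return {
        "name": node.get("name"),
        "brand": brand,
        "price": offers.get("price"),
        "currency": offers.get("priceCurrency"),
        "in_stock": availability.endswith("InStock"),
        "image_url": image,
    }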
8) Data QA: treat missing fields as a breaking change
A reliable scraper includes QA checks. Examples:
- price should be numeric for >80% of products (see the parsing sketch below)
- currency should be present when price is present
- url should be unique
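Getting price numeric means parsing display strings first. A rough sketch that handles the common “$1,299.00” and “1.299,00 €” formats, returning None rather than guessing wrong:

import re
from decimal import Decimal, InvalidOperation

def parse_price(text: str | None) -> Decimal | None:
    if not text:
        return None
    m = re.search(r"\d[\d.,]*", text)
    if not m:
        return None
    raw = m.group(0)
    if "," in raw and "." in raw:
        # The later separator is the decimal point; the other marks thousands
        if raw.rfind(",") > raw.rfind("."):
            raw = raw.replace(".", "").replace(",", ".")
        else:
            raw = raw.replace(",", "")
    elif "," in raw:
        # "49,90" is a decimal; "1,299" is thousands -- guess by digit count
        head, _, tail = raw.rpartition(",")
        raw = head.replace(",", "") + "." + tail if len(tail) == 2 else raw.replace(",", "")
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None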
A simple QA report:
def qa_report(products: list[dict]):
    n = len(products)
    with_price = sum(1 for p in products if p.get("price") is not None)
    with_name = sum(1 for p in products if p.get("name"))
    print("total:", n)
    print("name coverage:", with_name, f"({with_name/n:.0%})" if n else "")
    print("price coverage:", with_price, f"({with_price/n:.0%})" if n else "")
When coverage drops, your crawler should alert you.
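One way to turn the report into an alert is a hard gate that fails the run (the threshold is illustrative):

def assert_coverage(products: list[dict], field: str, min_ratio: float = 0.8):
    # Raise when a field's coverage drops below the threshold, so a broken
    # selector fails loudly instead of silently shipping empty rows
    n = len(products)
    ok = sum(1 for p in products if p.get(field) is not None)
    if n and ok / n < min_ratio:
        raise RuntimeError(f"{field} coverage {ok/n:.0%} is below {min_ratio:.0%}")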
9) Rotating proxies: when you actually need them
You need rotation when:
- you paginate through many category pages
- you scrape multiple categories
- you run frequently (hourly/daily)
- the site rate limits aggressively
You don’t need rotation for:
- a one-off scrape of a handful of products
- a site that explicitly offers a public API
ProxiesAPI fits as the proxy layer in the fetch function above. Keep it configurable via environment variables.
10) Practical advice (from real crawlers)
- Start with a single category and crawl 2 pages.
- Log HTML samples when parsing fails.
- Cache responses while iterating on selectors (sketch below).
- Keep concurrency low; scale slowly.
- Store results with scraped_at so you can diff.
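A minimal disk cache for that middle tip, wrapping fetch() from section 4 (fetch_cached and the .cache directory are illustrative):

import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")

def fetch_cached(url: str) -> str:
    # Serve repeat requests from disk so selector iterations don't re-hit the site
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html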
Summary
E-commerce scraping is less about clever parsing and more about building a pipeline that survives:
- HTML changes
- pagination quirks
- transient network failures
- anti-bot protections
Use a strong fetch layer, parse semantically, validate outputs, and add proxy rotation only when scale demands it.
Platform specifics vary (Shopify, WooCommerce, Magento, custom), so tailor the selectors and pagination logic to your target store and test against a real category URL before scaling up.