How to Scrape E-Commerce Websites: A Practical Guide

If you’re searching for ecommerce scraping, you’re probably trying to do one of these:

  • monitor competitor prices
  • build a product catalog
  • track stock/availability
  • compare variants (size, color, pack)

This guide is a practical, implementation-oriented playbook.

We’ll cover:

  • how ecommerce sites are typically structured
  • how to find category + pagination URLs
  • how to extract product cards reliably
  • how to scrape product detail pages (PDPs)
  • how to handle variants
  • rate limits, retries, and data quality
  • where ProxiesAPI fits in the fetch layer

Scale ecommerce scraping more reliably with ProxiesAPI

E-commerce crawls involve lots of repetitive requests (categories → pages → products). ProxiesAPI gives you a simple proxy-backed fetch URL that can improve stability as your URL volume grows.


1) The three-page model (category → listing → product detail)

Most ecommerce scraping pipelines are a crawl graph:

  1. Category pages (collections): “Men’s shoes”, “Laptops”, etc.
  2. Product listing pages (PLPs): paginated grids of product cards.
  3. Product detail pages (PDPs): the canonical source for price, variants, availability.

Your crawler should reflect that:

  • discover categories
  • crawl PLPs with pagination
  • enqueue PDP URLs
  • scrape PDPs into normalized records

2) Common pagination patterns (don’t assume)

There is no universal pagination scheme across ecommerce sites. Common patterns:

  • ?page=2
  • ?p=2
  • ?start=48
  • /page/2/
  • “Load more” (sometimes still server-rendered, sometimes JS)

How to confirm:

  • Click “Next” in the browser.
  • Copy the URL.
  • Compare to page 1.

If “Next” doesn’t change the URL, inspect the network tab: there may be an XHR endpoint.
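
Once you’ve confirmed which pattern the site uses, a small helper can build page URLs for you. This is a sketch covering the query-string patterns listed above; the 48-items-per-page offset for the start pattern is an assumption taken from the ?start=48 example and should be checked per site.

```python
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl


def page_url(base_url: str, page: int, pattern: str = "page") -> str:
    """Build a candidate pagination URL for common query-string patterns.

    pattern: "page" -> ?page=N, "p" -> ?p=N, "start" -> ?start=offset
    (the start offset assumes 48 items per page -- verify against the site).
    Existing query parameters (e.g. sort order) are preserved.
    """
    parts = urlparse(base_url)
    qs = dict(parse_qsl(parts.query))
    if pattern == "start":
        qs["start"] = str((page - 1) * 48)
    else:
        qs[pattern] = str(page)
    return urlunparse(parts._replace(query=urlencode(qs)))
```

Path-based patterns like /page/2/ need string handling instead, which is another reason to confirm the real pattern in the browser first.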


3) Product cards: what to extract at the listing level

From PLPs (grid/list pages), aim to extract:

  • product name
  • product URL (absolute)
  • price snippet (best-effort)
  • image URL (best-effort)
  • SKU/id if present

Do not rely on PLPs for the final truth. Treat them as URL discovery + rough preview.


4) Product detail pages: the “truth”

From PDPs, extract:

  • canonical title/name
  • canonical URL
  • current price + currency
  • list price (if present)
  • availability (“in stock”, “out of stock”, “ships in…”) as text
  • variant options (size/color)
  • images
  • product description

Also capture:

  • timestamp of scrape
  • source URL

5) A practical Python pipeline template

This example shows the pipeline shape (not tied to one platform).

Setup

pip install requests beautifulsoup4 lxml

The snippets below use "X | None" type hints, so Python 3.10+ is assumed.

Fetch with retries + optional ProxiesAPI

import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()


def fetch(url: str, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = f"http://api.proxiesapi.com/?key={quote(proxiesapi_key)}&url={quote(url, safe='')}"
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            if attempt < retries:
                # exponential backoff with jitter; no sleep after the final attempt
                time.sleep((2 ** attempt) + random.random())
    raise RuntimeError(f"failed after {retries} attempts: {last}")

Parse helpers

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def first_text(soup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            t = el.get_text(" ", strip=True)
            if t:
                return t
    return None


def parse_price(text: str | None) -> float | None:
    if not text:
        return None
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    return float(m.group(1).replace(",", "")) if m else None

Parse a listing page (PLP)


def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    products = []

    # common pattern: product cards are within <article> or <li>
    for card in soup.select("article, li"):
        a = card.select_one("a[href]")
        if not a:
            continue

        href = a.get("href")
        url = urljoin(base_url, href)

        name = first_text(card, ["h2", "h3", "[data-testid='product-title']"]) or a.get_text(" ", strip=True)
        price_text = first_text(card, [".price", "[data-testid='price']", "span.a-price span.a-offscreen"])  # examples

        if not url or not name:
            continue

        products.append({
            "name": name,
            "url": url,
            "price_text": price_text,
            "price": parse_price(price_text),
        })

    # de-dupe by URL
    uniq = {}
    for p in products:
        uniq[p["url"]] = p
    return list(uniq.values())

Parse a product detail page (PDP)


def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = first_text(soup, ["h1", "#productTitle", "[data-testid='product-title']"])  # examples
    price_text = first_text(soup, [
        "span.a-price span.a-offscreen",
        "[data-testid='price']",
        ".price",
    ])
    availability = first_text(soup, [
        "#availability",
        "[data-testid='availability']",
        ".stock",
    ])

    canonical = None
    can = soup.select_one('link[rel="canonical"]')
    if can:
        canonical = can.get("href")

    return {
        "title": title,
        "price_text": price_text,
        "price": parse_price(price_text),
        "availability": availability,
        "canonical_url": canonical,
    }

Orchestrate: crawl a few pages, then scrape PDPs

import time


def scrape_catalog(category_url: str, pages: int = 3, proxiesapi_key: str | None = None) -> list[dict]:
    all_products = []
    seen = set()

    for p in range(1, pages + 1):
        # assumes ?page=N pagination; confirm the site's real pattern first (see section 2)
        url = category_url if p == 1 else f"{category_url}?page={p}"
        html = fetch(url, proxiesapi_key=proxiesapi_key)
        listing = parse_listing(html, base_url=category_url)

        for item in listing:
            if item["url"] in seen:
                continue
            seen.add(item["url"])
            all_products.append(item)

        print(f"plp page {p}/{pages}: +{len(listing)} products (total {len(all_products)})")
        time.sleep(1.0)

    # now scrape PDPs
    out = []
    for i, item in enumerate(all_products, start=1):
        html = fetch(item["url"], proxiesapi_key=proxiesapi_key)
        details = parse_product(html)
        out.append({**item, **details})
        if i % 10 == 0:
            print("pdp", i, "/", len(all_products))
        time.sleep(1.0)

    return out

6) Variants: treat them as a first-class entity

Variants (size/color) are where ecommerce scrapers go to die.

Practical advice:

  • store a product table and a variant table
  • always keep:
    • product_id (canonical)
    • variant_id (sku or option tuple)
    • price, availability, option_values

If the site renders variants as separate URLs, it’s easier: each variant is a PDP URL.


7) Data quality: normalize now or regret later

Normalize these fields:

  • price → numeric + currency
  • availability → raw text + normalized enum (in_stock, out_of_stock, unknown)
  • URLs → canonicalized

And always store:

  • scraped_at ISO timestamp
  • source_url

8) Where ProxiesAPI fits

Ecommerce scraping is request-heavy:

  • category pages
  • many pagination pages
  • many product details

That repetitive pattern often triggers throttling.

ProxiesAPI gives you a simple proxy-backed fetch URL:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

If you combine ProxiesAPI with:

  • timeouts
  • retries with backoff
  • slower pagination

…you typically get more “complete runs” when scraping large catalogs.


Related guides

  • Scrape Product Data from Amazon (with Python + ProxiesAPI): extract Amazon product title, price, rating, and availability from a product page using requests + BeautifulSoup, with retries and proxy-backed fetching via ProxiesAPI.
  • Web Scraping with Python: The Complete 2026 Tutorial: a from-scratch, production-minded guide to web scraping in Python: requests + BeautifulSoup, pagination, retries, caching, proxies, and a reusable scraper template.
  • Build a Job Board with Data from Indeed (Python scraper tutorial): scrape Indeed job listings (title, company, location, salary, summary) with requests + BeautifulSoup, then save a clean dataset you can render as a simple job board. Includes pagination + ProxiesAPI fetch.
  • Retry Policies for Web Scrapers: What to Retry vs Fail Fast: a production-safe retry strategy with status-code rules, backoff, and a Python helper you can drop into any scraper.