How to Scrape E-Commerce Websites: A Practical Guide
If you searched for "ecommerce scraping," you're probably trying to do one of these:
- monitor competitor prices
- build a product catalog
- track stock/availability
- compare variants (size, color, pack)
This guide is a practical, implementation-oriented playbook.
We’ll cover:
- how ecommerce sites are typically structured
- how to find category + pagination URLs
- how to extract product cards reliably
- how to scrape product detail pages (PDPs)
- how to handle variants
- rate limits, retries, and data quality
- where ProxiesAPI fits in the fetch layer
E-commerce crawls involve lots of repetitive requests (categories → pages → products). ProxiesAPI gives you a simple proxy-backed fetch URL that can improve stability as your URL volume grows.
1) The three-page model (category → listing → product detail)
Most ecommerce scraping pipelines are a crawl graph:
- Category pages (collections): “Men’s shoes”, “Laptops”, etc.
- Product listing pages (PLPs): paginated grids of product cards.
- Product detail pages (PDPs): the canonical source for price, variants, availability.
Your crawler should reflect that:
- discover categories
- crawl PLPs with pagination
- enqueue PDP URLs
- scrape PDPs into normalized records
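As a sketch, that crawl graph reduces to a two-stage loop. Here `fetch`, `parse_listing`, and `parse_product` are stand-in callables (and in this sketch `parse_listing` is assumed to return product URLs, not full records):

```python
def crawl(category_url, fetch, parse_listing, parse_product, max_pages=3):
    """Two-stage crawl sketch: paginated listings -> PDP URLs -> records.

    fetch/parse_listing/parse_product are placeholders for your own
    implementations (see the pipeline template later in this guide).
    """
    # Stage 1: discover PDP URLs from paginated listing pages.
    pdp_urls, seen = [], set()
    for page in range(1, max_pages + 1):
        url = category_url if page == 1 else f"{category_url}?page={page}"
        for product_url in parse_listing(fetch(url), base_url=category_url):
            if product_url not in seen:
                seen.add(product_url)
                pdp_urls.append(product_url)
    # Stage 2: scrape each PDP into a record.
    return [parse_product(fetch(u)) for u in pdp_urls]
```

Keeping discovery and detail-scraping as separate stages makes it easy to checkpoint the PDP URL list between runs.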
2) Common pagination patterns (don’t assume)
Ecommerce pagination is rarely universal. Common patterns:
- ?page=2
- ?p=2
- ?start=48
- /page/2/
- "Load more" (sometimes still server-rendered, sometimes JS)
How to confirm:
- Click “Next” in the browser.
- Copy the URL.
- Compare to page 1.
If “Next” doesn’t change the URL, inspect the network tab: there may be an XHR endpoint.
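When the URL does change, you can confirm the pagination parameter programmatically by diffing the query strings of page 1 and page 2. A small illustration (the shop.example URLs are made up):

```python
from urllib.parse import urlsplit, parse_qsl

def diff_pagination(page1_url: str, page2_url: str) -> dict:
    """Return the query parameters that changed between page 1 and page 2."""
    q1 = dict(parse_qsl(urlsplit(page1_url).query))
    q2 = dict(parse_qsl(urlsplit(page2_url).query))
    return {k: (q1.get(k), v) for k, v in q2.items() if q1.get(k) != v}

# diff_pagination("https://shop.example/c/shoes",
#                 "https://shop.example/c/shoes?page=2")
# -> {"page": (None, "2")}
```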
3) Product cards: what to extract at the listing level
From PLPs (grid/list pages), aim to extract:
- product name
- product URL (absolute)
- price snippet (best-effort)
- image URL (best-effort)
- SKU/id if present
Do not rely on PLPs for the final truth. Treat them as URL discovery + rough preview.
4) Product detail pages: the “truth”
From PDPs, extract:
- canonical title/name
- canonical URL
- current price + currency
- list price (if present)
- availability (“in stock”, “out of stock”, “ships in…”) as text
- variant options (size/color)
- images
- product description
Also capture:
- timestamp of scrape
- source URL
5) A practical Python pipeline template
This example shows the pipeline shape (not tied to one platform).
Setup
```bash
pip install requests beautifulsoup4 lxml
```
Fetch with retries + optional ProxiesAPI
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()

def fetch(url: str, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    """Fetch a URL with exponential backoff; optionally route through ProxiesAPI."""
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = (
                    f"http://api.proxiesapi.com/?key={quote(proxiesapi_key)}"
                    f"&url={quote(url, safe='')}"
                )
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            sleep_s = (2 ** attempt) + random.random()
            time.sleep(sleep_s)
    raise RuntimeError(f"failed: {last}")
```
Parse helpers
```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def first_text(soup, selectors: list[str]) -> str | None:
    """Return the text of the first selector that matches and has content."""
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            t = el.get_text(" ", strip=True)
            if t:
                return t
    return None

def parse_price(text: str | None) -> float | None:
    """Best-effort numeric price from a snippet like '$1,299.00'."""
    if not text:
        return None
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    return float(m.group(1).replace(",", "")) if m else None
```
Parse a listing page (PLP)
```python
def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    products = []
    # common pattern: product cards live inside <article> or <li> elements
    for card in soup.select("article, li"):
        a = card.select_one("a[href]")
        if not a:
            continue
        href = a.get("href")
        url = urljoin(base_url, href)
        name = first_text(card, ["h2", "h3", "[data-testid='product-title']"]) or a.get_text(" ", strip=True)
        price_text = first_text(card, [".price", "[data-testid='price']", "span.a-price span.a-offscreen"])  # examples
        if not url or not name:
            continue
        products.append({
            "name": name,
            "url": url,
            "price_text": price_text,
            "price": parse_price(price_text),
        })
    # de-dupe by URL
    uniq = {}
    for p in products:
        uniq[p["url"]] = p
    return list(uniq.values())
```
Parse a product detail page (PDP)
```python
def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = first_text(soup, ["h1", "#productTitle", "[data-testid='product-title']"])  # examples
    price_text = first_text(soup, [
        "span.a-price span.a-offscreen",
        "[data-testid='price']",
        ".price",
    ])
    availability = first_text(soup, [
        "#availability",
        "[data-testid='availability']",
        ".stock",
    ])
    canonical = None
    can = soup.select_one('link[rel="canonical"]')
    if can:
        canonical = can.get("href")
    return {
        "title": title,
        "price_text": price_text,
        "price": parse_price(price_text),
        "availability": availability,
        "canonical_url": canonical,
    }
```
Orchestrate: crawl a few pages, then scrape PDPs
```python
import time

def scrape_catalog(category_url: str, pages: int = 3, proxiesapi_key: str | None = None) -> list[dict]:
    all_products = []
    seen = set()
    for p in range(1, pages + 1):
        url = category_url if p == 1 else f"{category_url}?page={p}"
        html = fetch(url, proxiesapi_key=proxiesapi_key)
        listing = parse_listing(html, base_url=category_url)
        for item in listing:
            if item["url"] in seen:
                continue
            seen.add(item["url"])
            all_products.append(item)
        print(f"plp page {p}/{pages}: +{len(listing)} products (total {len(all_products)})")
        time.sleep(1.0)

    # now scrape PDPs
    out = []
    for i, item in enumerate(all_products, start=1):
        html = fetch(item["url"], proxiesapi_key=proxiesapi_key)
        details = parse_product(html)
        out.append({**item, **details})
        if i % 10 == 0:
            print("pdp", i, "/", len(all_products))
        time.sleep(1.0)
    return out
```
6) Variants: treat them as a first-class entity
Variants (size/color) are where ecommerce scrapers go to die.
Practical advice:
- store a product table and a variant table
- always keep:
  - product_id (canonical)
  - variant_id (SKU, or the option tuple)
  - price, availability, option_values
If the site renders variants as separate URLs, it’s easier: each variant is a PDP URL.
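A minimal sketch of the two-table idea using SQLite; the table and column names here are illustrative, not a prescribed schema:

```python
import sqlite3

# Two-table schema sketch: one row per product, one row per variant.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    product_id    TEXT PRIMARY KEY,  -- canonical URL or site product id
    title         TEXT,
    canonical_url TEXT
);
CREATE TABLE IF NOT EXISTS variants (
    variant_id    TEXT PRIMARY KEY,  -- SKU, or a stable hash of the option tuple
    product_id    TEXT NOT NULL REFERENCES products(product_id),
    option_values TEXT,              -- e.g. JSON: {"size": "M", "color": "red"}
    price         REAL,
    availability  TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO products VALUES (?, ?, ?)",
             ("p1", "Trail Runner", "https://shop.example/p/trail-runner"))
conn.execute("INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
             ("p1-m-red", "p1", '{"size": "M", "color": "red"}', 79.99, "in_stock"))
conn.commit()
```

Separating the tables means a price change on one variant is a one-row update, and "how many variants are out of stock?" is a plain SQL query.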
7) Data quality: normalize now or regret later
Normalize these fields:
- price → numeric + currency
- availability → raw text + normalized enum (in_stock, out_of_stock, unknown)
- URLs → canonicalized
And always store:
- scraped_at (ISO timestamp)
- source_url
8) Where ProxiesAPI fits
Ecommerce scraping is request-heavy:
- category pages
- many pagination pages
- many product details
That repetitive pattern often triggers throttling.
ProxiesAPI gives you a simple proxy-backed fetch URL:
```bash
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
```
If you combine ProxiesAPI with:
- timeouts
- retries with backoff
- slower pagination
…you typically get more “complete runs” when scraping large catalogs.