How to Scrape E-Commerce Websites: A Practical Guide
If you searched for "ecommerce scraping," you're probably trying to do one of these:
- monitor competitor prices
- build a product catalog
- track stock/availability
- compare variants (size, color, pack)
This guide is a practical, implementation-oriented playbook.
We’ll cover:
- how ecommerce sites are typically structured
- how to find category + pagination URLs
- how to extract product cards reliably
- how to scrape product detail pages (PDPs)
- how to handle variants
- rate limits, retries, and data quality
- where ProxiesAPI fits in the fetch layer
E-commerce crawls involve lots of repetitive requests (categories → pages → products). ProxiesAPI gives you a simple proxy-backed fetch URL that can improve stability as your URL volume grows.
1) The three-page model (category → listing → product detail)
Most ecommerce scraping pipelines are a crawl graph:
- Category pages (collections): “Men’s shoes”, “Laptops”, etc.
- Product listing pages (PLPs): paginated grids of product cards.
- Product detail pages (PDPs): the canonical source for price, variants, availability.
Your crawler should reflect that:
- discover categories
- crawl PLPs with pagination
- enqueue PDP URLs
- scrape PDPs into normalized records
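As a sketch, that crawl graph reduces to a two-stage loop. Here `fetch`, `parse_listing`, and `parse_product` are stand-in callables (and in this sketch `parse_listing` is assumed to return product URLs, not full records):

```python
def crawl(category_url, fetch, parse_listing, parse_product, max_pages=3):
    """Two-stage crawl sketch: paginated listings -> PDP URLs -> records.

    fetch/parse_listing/parse_product are placeholders for your own
    implementations (see the pipeline template later in this guide).
    """
    # Stage 1: discover PDP URLs from paginated listing pages.
    pdp_urls, seen = [], set()
    for page in range(1, max_pages + 1):
        url = category_url if page == 1 else f"{category_url}?page={page}"
        for product_url in parse_listing(fetch(url), base_url=category_url):
            if product_url not in seen:
                seen.add(product_url)
                pdp_urls.append(product_url)
    # Stage 2: scrape each PDP into a record.
    return [parse_product(fetch(u)) for u in pdp_urls]
```

Keeping discovery and detail-scraping as separate stages makes it easy to checkpoint the PDP URL list between runs.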
2) Common pagination patterns (don’t assume)
Ecommerce pagination is rarely universal. Common patterns:
- ?page=2
- ?p=2
- ?start=48
- /page/2/
- "Load more" (sometimes still server-rendered, sometimes JS)
How to confirm:
- Click “Next” in the browser.
- Copy the URL.
- Compare to page 1.
If “Next” doesn’t change the URL, inspect the network tab: there may be an XHR endpoint.
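When the URL does change, you can confirm the pagination parameter programmatically by diffing the query strings of page 1 and page 2. A small illustration (the shop.example URLs are made up):

```python
from urllib.parse import urlsplit, parse_qsl

def diff_pagination(page1_url: str, page2_url: str) -> dict:
    """Return the query parameters that changed between page 1 and page 2."""
    q1 = dict(parse_qsl(urlsplit(page1_url).query))
    q2 = dict(parse_qsl(urlsplit(page2_url).query))
    return {k: (q1.get(k), v) for k, v in q2.items() if q1.get(k) != v}

# diff_pagination("https://shop.example/c/shoes",
#                 "https://shop.example/c/shoes?page=2")
# -> {"page": (None, "2")}
```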
3) Product cards: what to extract at the listing level
From PLPs (grid/list pages), aim to extract:
- product name
- product URL (absolute)
- price snippet (best-effort)
- image URL (best-effort)
- SKU/id if present
Do not rely on PLPs for the final truth. Treat them as URL discovery + rough preview.
4) Product detail pages: the “truth”
From PDPs, extract:
- canonical title/name
- canonical URL
- current price + currency
- list price (if present)
- availability (“in stock”, “out of stock”, “ships in…”) as text
- variant options (size/color)
- images
- product description
Also capture:
- timestamp of scrape
- source URL
5) A practical Python pipeline template
This example shows the pipeline shape (not tied to one platform).
Setup
```bash
pip install requests beautifulsoup4 lxml
```
Fetch with retries + optional ProxiesAPI
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()

def fetch(url: str, proxiesapi_key: str | None = None, retries: int = 4) -> str:
    """Fetch a URL with exponential backoff; optionally route through ProxiesAPI."""
    last = None
    for attempt in range(1, retries + 1):
        try:
            if proxiesapi_key:
                proxied = (
                    f"http://api.proxiesapi.com/?key={quote(proxiesapi_key)}"
                    f"&url={quote(url, safe='')}"
                )
                r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
            else:
                r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last = e
            sleep_s = (2 ** attempt) + random.random()
            time.sleep(sleep_s)
    raise RuntimeError(f"failed: {last}")
```
Parse helpers
```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def first_text(soup, selectors: list[str]) -> str | None:
    """Return the text of the first selector that matches and has content."""
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            t = el.get_text(" ", strip=True)
            if t:
                return t
    return None

def parse_price(text: str | None) -> float | None:
    """Best-effort numeric price from a snippet like '$1,299.00'."""
    if not text:
        return None
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    return float(m.group(1).replace(",", "")) if m else None
```
Parse a listing page (PLP)
```python
def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    products = []
    # common pattern: product cards live inside <article> or <li> elements
    for card in soup.select("article, li"):
        a = card.select_one("a[href]")
        if not a:
            continue
        href = a.get("href")
        url = urljoin(base_url, href)
        name = first_text(card, ["h2", "h3", "[data-testid='product-title']"]) or a.get_text(" ", strip=True)
        price_text = first_text(card, [".price", "[data-testid='price']", "span.a-price span.a-offscreen"])  # examples
        if not url or not name:
            continue
        products.append({
            "name": name,
            "url": url,
            "price_text": price_text,
            "price": parse_price(price_text),
        })
    # de-dupe by URL
    uniq = {}
    for p in products:
        uniq[p["url"]] = p
    return list(uniq.values())
```
Parse a product detail page (PDP)
```python
def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = first_text(soup, ["h1", "#productTitle", "[data-testid='product-title']"])  # examples
    price_text = first_text(soup, [
        "span.a-price span.a-offscreen",
        "[data-testid='price']",
        ".price",
    ])
    availability = first_text(soup, [
        "#availability",
        "[data-testid='availability']",
        ".stock",
    ])
    canonical = None
    can = soup.select_one('link[rel="canonical"]')
    if can:
        canonical = can.get("href")
    return {
        "title": title,
        "price_text": price_text,
        "price": parse_price(price_text),
        "availability": availability,
        "canonical_url": canonical,
    }
```
Orchestrate: crawl a few pages, then scrape PDPs
```python
import time

def scrape_catalog(category_url: str, pages: int = 3, proxiesapi_key: str | None = None) -> list[dict]:
    all_products = []
    seen = set()
    for p in range(1, pages + 1):
        url = category_url if p == 1 else f"{category_url}?page={p}"
        html = fetch(url, proxiesapi_key=proxiesapi_key)
        listing = parse_listing(html, base_url=category_url)
        for item in listing:
            if item["url"] in seen:
                continue
            seen.add(item["url"])
            all_products.append(item)
        print(f"plp page {p}/{pages}: +{len(listing)} products (total {len(all_products)})")
        time.sleep(1.0)

    # now scrape PDPs
    out = []
    for i, item in enumerate(all_products, start=1):
        html = fetch(item["url"], proxiesapi_key=proxiesapi_key)
        details = parse_product(html)
        out.append({**item, **details})
        if i % 10 == 0:
            print("pdp", i, "/", len(all_products))
        time.sleep(1.0)
    return out
```
6) Variants: treat them as a first-class entity
Variants (size/color) are where ecommerce scrapers go to die.
Practical advice:
- store a product table and a variant table
- always keep:
  - product_id (canonical)
  - variant_id (SKU, or the option tuple)
  - price, availability, option_values
If the site renders variants as separate URLs, it’s easier: each variant is a PDP URL.
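A minimal sketch of the two-table idea using SQLite; the table and column names here are illustrative, not a prescribed schema:

```python
import sqlite3

# Two-table schema sketch: one row per product, one row per variant.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    product_id    TEXT PRIMARY KEY,  -- canonical URL or site product id
    title         TEXT,
    canonical_url TEXT
);
CREATE TABLE IF NOT EXISTS variants (
    variant_id    TEXT PRIMARY KEY,  -- SKU, or a stable hash of the option tuple
    product_id    TEXT NOT NULL REFERENCES products(product_id),
    option_values TEXT,              -- e.g. JSON: {"size": "M", "color": "red"}
    price         REAL,
    availability  TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO products VALUES (?, ?, ?)",
             ("p1", "Trail Runner", "https://shop.example/p/trail-runner"))
conn.execute("INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
             ("p1-m-red", "p1", '{"size": "M", "color": "red"}', 79.99, "in_stock"))
conn.commit()
```

Separating the tables means a price change on one variant is a one-row update, and "how many variants are out of stock?" is a plain SQL query.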
7) Data quality: normalize now or regret later
Normalize these fields:
- price → numeric + currency
- availability → raw text + normalized enum (in_stock, out_of_stock, unknown)
- URLs → canonicalized
And always store:
- scraped_at (ISO timestamp)
- source_url
8) Where ProxiesAPI fits
Ecommerce scraping is request-heavy:
- category pages
- many pagination pages
- many product details
That repetitive pattern often triggers throttling.
ProxiesAPI gives you a simple proxy-backed fetch URL:
```bash
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
```
If you combine ProxiesAPI with:
- timeouts
- retries with backoff
- slower pagination
…you typically get more “complete runs” when scraping large catalogs.