Data Scraping for E-Commerce: Price Monitoring + Competitive Intel
If you run e-commerce operations, you care about three things:
- price: are competitors undercutting you?
- availability: is stock changing (and when)?
- merchandising: what’s new, what’s trending, what’s being pushed?
That’s why “data scraping for e-commerce” usually starts as a one-off script… and turns into a pipeline.
This guide is the playbook I’d want if I were building it solo: the data model, the crawl strategy, and the engineering choices that keep the job running.
E-commerce scraping isn’t hard — keeping it running for months is. ProxiesAPI helps reduce random failures when you’re crawling many SKUs across multiple sites on a schedule.
What to scrape (the intel checklist)
At minimum, treat each product page as two layers:
- Identity layer (slow-changing)
  - product_id (your internal stable key)
  - brand, title, category, breadcrumbs
  - canonical_url (normalized)
  - image_urls
- Market layer (fast-changing)
  - price, list_price, discount_pct
  - in_stock / stock_level (when available)
  - delivery (ETA, shipping cost, pickup)
  - seller (marketplaces)
  - rating, review_count
If you’re doing competitive intel, add:
- promotions: coupons, bundles, “limited time”
- variants: sizes/colors, each with its own stock/price
- search rank: where the SKU appears in category/search
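The two layers above can be sketched as a pair of record types. This is a minimal illustration, not a prescribed model: the class names `ProductIdentity` and `MarketSnapshot` are hypothetical, and computing `discount_pct` from `price` and `list_price` (instead of scraping it) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class ProductIdentity:
    """Slow-changing fields; refresh these rarely."""
    product_id: str                      # your internal stable key
    brand: str
    title: str
    category: str
    canonical_url: str                   # normalized URL
    breadcrumbs: tuple[str, ...] = ()
    image_urls: tuple[str, ...] = ()


@dataclass(frozen=True)
class MarketSnapshot:
    """Fast-changing fields; one instance per crawl."""
    product_id: str
    crawled_at: datetime
    price: Decimal
    list_price: Optional[Decimal] = None
    in_stock: Optional[bool] = None
    seller: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None

    @property
    def discount_pct(self) -> Optional[Decimal]:
        """Derived from price and list_price (an assumption, not scraped)."""
        if not self.list_price or self.list_price <= 0:
            return None
        pct = (self.list_price - self.price) / self.list_price * 100
        return pct.quantize(Decimal("0.1"))
```

Keeping the two types separate means a daily price crawl never has to rewrite titles or images, and an identity refresh never touches price history.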
Don’t start with scraping — start with a schema
Scraping fails when you don’t know what “good data” looks like.
Here’s a clean, practical schema that works for most price monitors:
| Table | Row grain | Key fields |
|---|---|---|
| products | 1 row per product identity | product_id, site, url, brand, title, category |
offers | 1 row per product per crawl | product_id, crawled_at, price, currency, in_stock, seller |
pages | 1 row per raw fetch | url, crawled_at, http_status, bytes, sha256, fetch_ms |
alerts | 1 row per triggered rule | product_id, rule, old_value, new_value, created_at |
Why pages matters: it gives you debuggability. When a scraper “suddenly got worse”, you can inspect raw bytes, status codes, and fetch times.
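As one possible concrete form, the four tables could be created in SQLite like this. Column types, primary keys, and the `alert_id` column are assumptions layered on the key fields listed above. `REAL` for price is a simplification; integer minor units (cents) avoid rounding surprises.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    product_id TEXT PRIMARY KEY,
    site TEXT NOT NULL,
    url TEXT NOT NULL,
    brand TEXT, title TEXT, category TEXT
);
CREATE TABLE IF NOT EXISTS offers (
    product_id TEXT NOT NULL REFERENCES products(product_id),
    crawled_at TEXT NOT NULL,            -- ISO 8601 timestamp
    price REAL, currency TEXT, in_stock INTEGER, seller TEXT,
    PRIMARY KEY (product_id, crawled_at)
);
CREATE TABLE IF NOT EXISTS pages (
    url TEXT NOT NULL,
    crawled_at TEXT NOT NULL,
    http_status INTEGER, bytes INTEGER, sha256 TEXT, fetch_ms INTEGER,
    PRIMARY KEY (url, crawled_at)
);
CREATE TABLE IF NOT EXISTS alerts (
    alert_id INTEGER PRIMARY KEY AUTOINCREMENT,   -- assumed surrogate key
    product_id TEXT NOT NULL REFERENCES products(product_id),
    rule TEXT NOT NULL, old_value TEXT, new_value TEXT,
    created_at TEXT NOT NULL
);
"""


def init_db(path=":memory:"):
    """Create the four tables if missing and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```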
Crawl scheduling (how often is “often enough”?)
Frequency is a business decision.
Use a tiered schedule:
- Tier 1 (hero SKUs): every 1–6 hours
- Tier 2 (core catalog): daily
- Tier 3 (long tail): weekly
Then add event-driven spikes:
- competitor promo days
- your own campaign windows
- seasonal peaks
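A tiered schedule with event-driven spikes can be reduced to a single "is this SKU due?" check. The sketch below assumes specific intervals picked from the ranges above and a hypothetical one-hour cadence during event windows; both are placeholders to tune.

```python
from datetime import datetime, timedelta

# Intervals picked from the tier ranges above; tune per catalog.
TIER_INTERVALS = {
    1: timedelta(hours=6),   # hero SKUs
    2: timedelta(days=1),    # core catalog
    3: timedelta(weeks=1),   # long tail
}
EVENT_INTERVAL = timedelta(hours=1)  # hypothetical cadence during promo windows


def is_due(tier, last_crawled, now, event_windows=()):
    """Return True if a SKU in `tier` should be crawled at `now`.

    event_windows is an iterable of (start, end) datetimes, such as
    competitor promo days or your own campaign windows.
    """
    if last_crawled is None:
        return True  # never crawled before
    interval = TIER_INTERVALS[tier]
    if any(start <= now < end for start, end in event_windows):
        interval = min(interval, EVENT_INTERVAL)
    return now - last_crawled >= interval
```

A cron job or worker loop can call `is_due` for every SKU and enqueue only those that return True.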
A simple scheduler heuristic
If you only have time for one rule:
- crawl more often when price volatility is high
Keep a rolling standard deviation of price changes and promote SKUs to higher tiers when volatility crosses a threshold.
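Here is a rough sketch of that heuristic. It assumes volatility is measured as the population standard deviation of relative price changes over the last 14 crawls, and that a threshold of 0.03 triggers promotion by one tier. The window, threshold, and function names are all illustrative choices.

```python
from statistics import pstdev


def price_volatility(prices, window=14):
    """Std dev of relative price changes over the last `window` changes."""
    recent = prices[-(window + 1):]
    changes = [(b - a) / a for a, b in zip(recent, recent[1:]) if a > 0]
    if len(changes) < 2:
        return 0.0  # not enough history to judge
    return pstdev(changes)


def next_tier(current_tier, prices, threshold=0.03):
    """Promote one tier (3 -> 2 -> 1) when volatility crosses the threshold."""
    if current_tier > 1 and price_volatility(prices) >= threshold:
        return current_tier - 1
    return current_tier
```

Demotion back to a slower tier is left out here; in practice you would likely want a lower threshold for demotion so SKUs do not flap between tiers.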
Change detection (what counts as “meaningful”?)
Raw diffs are noisy. Real alerts are rare.
Use a layered approach:
- Normalize first
  - parse currency symbols into a currency field
  - strip whitespace and HTML entities
  - standardize “In stock / Out of stock” into booleans
- Alert second
  - price_drop_pct >= 5%
  - in_stock flips from false → true
  - seller changes
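The normalize-then-alert order might look like the following. Everything here is a sketch under stated assumptions: the symbol map covers only three currencies, price parsing assumes US-style separators (comma for thousands, dot for decimals), and the stock phrases are examples, not a complete list.

```python
import html
import re
from decimal import Decimal

CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}  # illustrative subset


def normalize_price(raw):
    """Return (amount, currency) from text like '&nbsp;$1,299.00 '."""
    text = html.unescape(raw).strip()
    currency = next((c for s, c in CURRENCY_SYMBOLS.items() if s in text), None)
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)  # assumes 1,299.00 style
    if match is None:
        return None, currency
    return Decimal(match.group().replace(",", "")), currency


def normalize_stock(raw):
    """Map stock text to True / False, or None when unrecognized."""
    text = html.unescape(raw).strip().lower()
    if "out of stock" in text or "sold out" in text:
        return False
    if "in stock" in text:
        return True
    return None


def detect_changes(old, new, min_drop_pct=Decimal("5")):
    """Compare two normalized offers (dicts) and return fired rule names."""
    fired = []
    if old["price"] and new["price"] is not None and new["price"] < old["price"]:
        drop_pct = (old["price"] - new["price"]) / old["price"] * 100
        if drop_pct >= min_drop_pct:
            fired.append("price_drop")
    if old["in_stock"] is False and new["in_stock"] is True:
        fired.append("back_in_stock")
    if old["seller"] != new["seller"]:
        fired.append("seller_change")
    return fired
```

Returning None for unrecognized stock text, instead of guessing, keeps a site redesign from silently flipping every SKU to out of stock.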
Avoid alert storms
Add dampening:
- require the change to persist across 2 crawls before alerting
- rate limit alerts per SKU per day
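Both dampening rules fit in a small stateful helper. This sketch keeps state in memory and assumes a cap of three alerts per SKU per rolling 24 hours; the class name `AlertDampener` and its defaults are hypothetical, and a real pipeline would persist the state between runs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class AlertDampener:
    def __init__(self, persist_crawls=2, max_per_day=3):
        self.persist_crawls = persist_crawls
        self.max_per_day = max_per_day
        self._streak = {}               # (sku, rule) -> (value, consecutive count)
        self._sent = defaultdict(list)  # sku -> timestamps of alerts sent

    def should_alert(self, sku, rule, value, now):
        """`value` is the changed value seen this crawl, or None if the
        rule did not fire. Returns True at most once per persistent change."""
        key = (sku, rule)
        if value is None:
            self._streak.pop(key, None)  # change did not persist; reset
            return False
        prev_value, count = self._streak.get(key, (None, 0))
        count = count + 1 if prev_value == value else 1
        self._streak[key] = (value, count)
        if count != self.persist_crawls:
            return False  # not yet persistent, or already alerted on this streak
        recent = [t for t in self._sent[sku] if now - t < timedelta(days=1)]
        if len(recent) >= self.max_per_day:
            self._sent[sku] = recent
            return False  # rate limited for this SKU
        recent.append(now)
        self._sent[sku] = recent
        return True
```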
Reliability: retries, blocks, and “small HTML”
Most pipeline failures are not parsing bugs. They’re fetch failures:
- transient 5xx
- timeouts
- blocks/interstitials returning a tiny HTML page
Defensive tactics that work:
- exponential backoff retries (cap at ~20s)
- “small HTML” detection (payload size floor)
- realistic request headers (real browser UA, sane Accept-Language)
- jitter between requests
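Several of these tactics combine naturally into one fetch wrapper. The sketch below is transport-agnostic: it assumes you pass in your own `fetch(url)` callable returning `(status, body_bytes)`. The 2048-byte floor is a hypothetical value to calibrate per site, and the jitter shown applies to retry delays; pacing between distinct requests would be added separately.

```python
import random
import time

MIN_HTML_BYTES = 2048   # hypothetical floor; calibrate from real page sizes
MAX_BACKOFF_S = 20.0    # cap from the list above


class FetchError(Exception):
    """Raised when every attempt failed."""


def backoff_delay(attempt, base=1.0, cap=MAX_BACKOFF_S):
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retries(fetch, url, max_attempts=4, sleep=time.sleep):
    """Retry on network errors, 5xx, and suspiciously small 200 responses."""
    last_problem = "no attempts made"
    for attempt in range(max_attempts):
        try:
            status, body = fetch(url)
        except (TimeoutError, ConnectionError) as exc:
            last_problem = f"network error: {exc}"
        else:
            if 500 <= status < 600:
                last_problem = f"HTTP {status}"
            elif status == 200 and len(body) < MIN_HTML_BYTES:
                last_problem = f"small HTML ({len(body)} bytes)"
            else:
                return status, body  # success, or a non-retryable status
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    raise FetchError(f"{url}: gave up after {max_attempts} attempts ({last_problem})")
```

Injecting `fetch` and `sleep` keeps the retry logic testable without network access or real waiting.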
When to add proxies
If you’re crawling:
- a handful of pages, once/day → you may not need proxies
- hundreds/thousands of pages on a schedule → you probably do
ProxiesAPI is a good fit when you want a simple integration point: set proxies=... in your HTTP client and keep the rest of your system the same.
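To illustrate the single integration point, here is a standard-library sketch that routes requests through a proxy when one is configured. The proxy URL format is provider-specific and is not taken from any ProxiesAPI documentation; the environment variable name `SCRAPER_PROXY_URL` is hypothetical.

```python
import os
import urllib.request


def build_opener_with_proxy(proxy_url=None):
    """Return a urllib opener that uses the proxy if one is configured."""
    # SCRAPER_PROXY_URL is a hypothetical variable name for this sketch.
    proxy_url = proxy_url or os.environ.get("SCRAPER_PROXY_URL")
    if not proxy_url:
        return urllib.request.build_opener()  # direct connection
    proxies = {"http": proxy_url, "https": proxy_url}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
```

If you use the requests library instead, a dict of the same shape is what you pass as `proxies=`. Either way, the rest of the pipeline (parsing, storage, alerting) stays unchanged.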
Comparison: scraping approaches for e-commerce teams
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| Manual checks | tiny catalogs | zero engineering | doesn’t scale |
| Vendor tools | fast setup | dashboards + alerts | cost + limited flexibility |
| In-house scraper | competitive intel | custom logic | reliability burden |
| Scraper + ProxiesAPI | scale without proxy plumbing | fewer random failures | still need parsing + QA |
Recommendation (for solo builders): start with an in-house scraper, then add ProxiesAPI once you feel the pain — don’t over-engineer early.
A minimal “price monitor” workflow
If you want a practical MVP:
- store your product URL list (CSV or DB)
- crawl daily (tiered schedule later)
- parse price + stock + seller
- write offers rows
- compute diffs and raise alerts
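The steps above can be wired into one function that runs a single pass. This is a sketch with several assumptions: the URL list is a CSV with `product_id` and `url` columns, `fetch_and_parse` is a callable you supply that returns a dict with `price`, `in_stock`, and `seller`, and the only alert rule shown is a price drop of 5% or more.

```python
import csv
import sqlite3
from datetime import datetime, timezone


def run_once(conn, url_csv, fetch_and_parse, now=None):
    """Read URLs, parse each page, write offers rows, and return alerts."""
    now = now or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offers ("
        "product_id TEXT, crawled_at TEXT, price REAL, in_stock INTEGER, seller TEXT)"
    )
    alerts = []
    for row in csv.DictReader(url_csv):
        offer = fetch_and_parse(row["url"])
        # Look up the previous offer before inserting the new one.
        prev = conn.execute(
            "SELECT price FROM offers WHERE product_id = ? "
            "ORDER BY crawled_at DESC LIMIT 1",
            (row["product_id"],),
        ).fetchone()
        conn.execute(
            "INSERT INTO offers VALUES (?, ?, ?, ?, ?)",
            (row["product_id"], now, offer["price"],
             int(offer["in_stock"]), offer["seller"]),
        )
        if prev and prev[0] and offer["price"] <= prev[0] * 0.95:
            alerts.append((row["product_id"], "price_drop", prev[0], offer["price"]))
    conn.commit()
    return alerts
```

Pass an open file (or `io.StringIO` in tests) as `url_csv` and schedule `run_once` daily; retries, dampening, and tiering can be layered on once this loop is stable.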
Once the loop runs for 2–4 weeks without babysitting, add:
- backfills and re-crawls
- better normalization and entity resolution (same product across sites)
- screenshot capture for audit trails