How to Scrape Shopify Stores: Products, Prices, and Inventory (2026)
If you’re trying to scrape an e-commerce site in 2026, there’s a good chance it’s running on Shopify.
The good news: Shopify exposes a bunch of structured endpoints that are far easier to work with than scraping messy HTML.
The tricky part: every store has different themes/apps, and scraping at scale can trigger blocks.
This guide walks you through practical, repeatable ways to extract:
- product names and URLs
- prices (including variant prices)
- inventory/availability signals
And we’ll do it in a way that’s:
- robust to theme differences
- respectful (rate limits, ethical usage)
- easy to adapt to many stores
Shopify stores vary wildly, and rate limits/blocks show up fast when you monitor many stores. A proxy layer (like ProxiesAPI) can help keep your data collection consistent.
First: know what you can and can’t reliably get
Shopify stores can expose different “levels” of product data:
| Data | Often available? | Best source |
|---|---|---|
| Product title, handle, URL | Yes | `/products.json` or HTML → JSON endpoints |
| Variant prices | Yes | product JSON (`variants[]`) |
| Availability (in stock) | Sometimes | variant `available` field, or `inventory_quantity` when exposed |
| Exact inventory counts | Rare (public) | usually not available without authenticated APIs |
If you need exact inventory quantities, you often can’t get it ethically/legally from public endpoints.
But for many monitoring use cases (price tracking, assortment tracking), availability + price is enough.
The easiest win: /products.json
Many Shopify stores expose:
https://STORE_DOMAIN/products.json?limit=250&page=1
This returns a JSON payload with a products array.
Python example: fetch + parse
```python
import requests
from urllib.parse import urljoin

TIMEOUT = (10, 30)  # (connect, read) seconds

def fetch_products_json(store_base: str, page: int = 1, limit: int = 250) -> dict:
    url = urljoin(store_base, f"/products.json?limit={limit}&page={page}")
    r = requests.get(url, timeout=TIMEOUT, headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
    })
    r.raise_for_status()
    return r.json()
```
```python
def extract_products(payload: dict) -> list[dict]:
    out = []
    for p in payload.get("products", []):
        handle = p.get("handle")
        # relative URL; join with the store base when you need an absolute link
        product_url = f"/products/{handle}" if handle else None
        variants = p.get("variants", []) or []
        # choose a representative price range across variants
        prices = []
        for v in variants:
            if v.get("price") is not None:
                try:
                    prices.append(float(v["price"]))
                except (TypeError, ValueError):
                    pass
        out.append({
            "id": p.get("id"),
            "title": p.get("title"),
            "handle": handle,
            "url": product_url,
            "vendor": p.get("vendor"),
            "product_type": p.get("product_type"),
            "price_min": min(prices) if prices else None,
            "price_max": max(prices) if prices else None,
            "variant_count": len(variants),
        })
    return out
```
Pagination pattern
Keep requesting pages until you get fewer than `limit` products.
```python
def crawl_store_products(store_base: str, limit: int = 250, max_pages: int = 20) -> list[dict]:
    all_products = []
    for page in range(1, max_pages + 1):
        payload = fetch_products_json(store_base, page=page, limit=limit)
        batch = extract_products(payload)
        print("page", page, "products", len(batch))
        all_products.extend(batch)
        if len(payload.get("products", [])) < limit:
            break
    return all_products
```
More targeted: collections → products (best for large catalogs)
On bigger stores, `/products.json` may be disabled or throttled.
A more targeted route is:
- discover collection handles (from HTML nav, sitemap, or known URLs)
- fetch `https://STORE_DOMAIN/collections/{handle}/products.json?limit=250&page=1`
This gives you just products in that collection.
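As a sketch of the per-collection route (the helper names `collection_products_url` and `merge_unique` are introduced here, and `example.myshopify.com` / `sale` are placeholder values, not real targets):

```python
from urllib.parse import urljoin

def collection_products_url(store_base: str, handle: str,
                            page: int = 1, limit: int = 250) -> str:
    # Build the per-collection JSON URL; `handle` is a collection slug
    # such as "sale" or "new-arrivals" (hypothetical examples).
    path = f"/collections/{handle}/products.json?limit={limit}&page={page}"
    return urljoin(store_base, path)

def merge_unique(batches: list[list[dict]]) -> list[dict]:
    # The same product can appear in multiple collections,
    # so dedupe by product id when merging collection crawls.
    seen, out = set(), []
    for batch in batches:
        for p in batch:
            pid = p.get("id")
            if pid in seen:
                continue
            seen.add(pid)
            out.append(p)
    return out
```

Crawl each collection with the same pagination loop as before, then merge the batches through `merge_unique` so overlapping collections don't double-count products.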
Parse variant availability (inventory “signal”)
Within a product JSON object, variants often include:
- `available` (boolean)
- `inventory_management` (string or null)
- sometimes `inventory_quantity` (not always present)
Example extraction:
```python
def extract_variants(product: dict) -> list[dict]:
    out = []
    for v in product.get("variants", []) or []:
        out.append({
            "variant_id": v.get("id"),
            "title": v.get("title"),
            "price": v.get("price"),
            "available": v.get("available"),
            "sku": v.get("sku"),
        })
    return out
```
If `available` is missing, you can sometimes infer availability by:
- checking whether an "Add to cart" form exists in the HTML
- looking for `"available":true` in embedded JSON (e.g. `window.__st`, depending on theme)
But prefer JSON endpoints first.
When JSON endpoints are blocked: HTML → embedded JSON
Some stores restrict `/products.json`.
In that case, fetch the product page HTML and look for embedded JSON.
Common patterns:
- `<script type="application/ld+json">` (structured product data; often has price)
- `application/json` script tags used by themes
A minimal parser using BeautifulSoup:
```python
import json
from bs4 import BeautifulSoup

def extract_ld_json(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    for s in soup.select('script[type="application/ld+json"]'):
        try:
            out.append(json.loads(s.get_text(strip=True)))
        except Exception:
            continue
    return out
```
LD+JSON is not perfect, but it’s a stable fallback for:
- product name
- canonical URL
- offer price
Avoid blocks: practical advice
Shopify stores have WAF/CDN layers and app stacks. Blocks show up as:
- 403 responses
- HTML “challenge” pages
- sudden empty JSON
What helps:
- Slow down
  - start at 0.5–1.5 requests/second
- Retry carefully
  - exponential backoff
  - stop after N failures
- Cache
  - don't refetch the full catalog every minute
- Spread traffic
  - if you monitor many stores, distribute requests across time
- Use a stable proxy layer when needed
  - especially if your IP gets flagged during high-volume monitoring
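The retry advice can be sketched as a small wrapper. `fetch_with_backoff` is a hypothetical helper, and the attempt count and delays are illustrative defaults, not tuned values:

```python
import random
import time

def fetch_with_backoff(fetch, *, attempts=4, base_delay=1.0, sleep=time.sleep):
    # Retry a zero-arg callable (e.g. a lambda wrapping requests.get +
    # raise_for_status) with exponential backoff plus jitter.
    last_exc = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_exc = exc
            # 1s, 2s, 4s, ... plus jitter so parallel workers don't sync up.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
    # Stop after N failures rather than hammering a blocking store.
    raise last_exc
```

Injecting `sleep` as a parameter keeps the helper testable and lets you swap in an async-friendly variant later.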
Comparison: JSON endpoints vs HTML scraping
| Approach | Pros | Cons | Best for |
|---|---|---|---|
| `/products.json` | structured, fast, easy parsing | sometimes disabled/throttled | catalogs, price tracking |
| `collections/.../products.json` | targeted, scalable | needs collection discovery | large stores |
| HTML + embedded JSON | works when JSON endpoints blocked | more brittle, heavier pages | fallback / enrichment |
In practice, build your scraper with a tiered strategy:
- try JSON endpoint
- fall back to collections JSON
- fall back to HTML → embedded JSON
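The tiered strategy reduces to a simple fallback loop. `fetch_products_tiered` is a name introduced here, and each strategy is assumed to be a zero-argument callable returning a product list:

```python
def fetch_products_tiered(fetchers: list) -> list[dict]:
    # Try each strategy in priority order; return the first non-empty result.
    # `fetchers` might be [products_json, collections_json, html_embedded]
    # (hypothetical callables wrapping the approaches above).
    for fetch in fetchers:
        try:
            products = fetch()
        except Exception:
            # A blocked or disabled endpoint just means: try the next tier.
            continue
        if products:
            return products
    return []
```

Keeping each tier behind the same callable signature makes per-store quirks (one store blocks JSON, another has no sitemap) a configuration detail rather than a code change.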
A simple “store monitor” shape
If you’re monitoring prices/availability, you want incremental runs:
- store last-seen price per variant
- alert when price changes or availability flips
Even a SQLite database works great for this.
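A minimal sketch of that incremental shape with SQLite (the table name and change-event tuples are my own choices here, not a standard):

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""CREATE TABLE IF NOT EXISTS variant_state (
        variant_id INTEGER PRIMARY KEY,
        price REAL,
        available INTEGER)""")

def detect_changes(conn: sqlite3.Connection, variants: list[dict]) -> list[tuple]:
    # Compare freshly scraped variants against last-seen state,
    # emit change events, and persist the new state.
    changes = []
    for v in variants:
        row = conn.execute(
            "SELECT price, available FROM variant_state WHERE variant_id = ?",
            (v["variant_id"],)).fetchone()
        if row is not None:
            old_price, old_avail = row
            if old_price != v["price"]:
                changes.append(("price", v["variant_id"], old_price, v["price"]))
            if bool(old_avail) != bool(v["available"]):
                changes.append(("availability", v["variant_id"],
                                bool(old_avail), bool(v["available"])))
        conn.execute(
            "INSERT OR REPLACE INTO variant_state VALUES (?, ?, ?)",
            (v["variant_id"], v["price"], int(bool(v["available"]))))
    conn.commit()
    return changes
```

The first run seeds state and emits nothing; later runs emit only price changes and availability flips, which is exactly what you want to alert on.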
Ethics and Terms
E-commerce scraping has real risks. Do the boring-but-important checks:
- read the store’s Terms of Service
- don’t scrape personal data
- respect rate limits
- be transparent if used commercially
Summary
To scrape Shopify stores reliably in 2026:
- start with `/products.json` (limit 250 + page)
- use collections JSON for big catalogs
- treat inventory counts as usually private; use availability signals
- add retries/backoff and cache aggressively
- use a proxy layer only when scale requires it