Shopify Product Scraping (2026): Prices, Variants, Inventory—Without Breaking When Themes Change
Shopify is everywhere. If you’re building a price tracker, a product research tool, or a competitive monitoring system, you’ll likely need to scrape Shopify storefronts.
The problem: Shopify themes change.
If you scrape HTML with brittle selectors like `.product__title`, you’ll spend your life fixing parsers.
This guide shows a Shopify-first strategy that stays stable in 2026:
- Prefer platform JSON endpoints (`/products/<handle>.json`, `/collections/<handle>/products.json`)
- Use structured data (JSON-LD) as a secondary source
- Use HTML as a last resort with minimal assumptions
- Handle variants + availability signals in a way that doesn’t depend on theme markup
Shopify storefronts are consistent at the platform layer (JSON endpoints), but rate limits and blocks still happen at scale. ProxiesAPI helps you keep your crawler stable with IP rotation + consistent routing across runs.
What Shopify gives you for free (stable endpoints)
Most Shopify stores expose useful endpoints that do not depend on the theme:
1) Product JSON
https://STORE_DOMAIN/products/HANDLE.json
This returns:
- title
- vendor
- product type
- images
- variants (id, title, price, available, sku, etc.)
2) Collection products JSON
https://STORE_DOMAIN/collections/COLLECTION_HANDLE/products.json?limit=250&page=1
Great for crawling category pages in bulk.
3) Search suggestions / predictive search
Many stores enable predictive search endpoints. Not universal, but helpful.
Because these are platform endpoints, they’re much less likely to break when the store redesigns.
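Before committing to a full crawl, it’s worth probing whether a store actually exposes these endpoints. A minimal sketch — the `limit=1` query and the `Content-Type` check are just one reasonable heuristic, not an official detection method:

```python
# Probe whether a store answers on the platform JSON endpoints before
# scheduling a full crawl. Stores can disable or firewall these routes.
import requests

def products_probe_url(store: str) -> str:
    """Build the cheapest possible probe: one product from /products.json."""
    return f"{store.rstrip('/')}/products.json?limit=1"

def store_exposes_json(store: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint responds OK with a JSON content type."""
    try:
        r = requests.get(
            products_probe_url(store),
            headers={"Accept": "application/json"},
            timeout=timeout,
        )
        return r.ok and "json" in r.headers.get("Content-Type", "")
    except requests.RequestException:
        return False
```

If the probe fails, skip straight to the JSON-LD fallback described later instead of burning retries.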
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests tenacity beautifulsoup4 lxml
```
Step 1: Build a resilient fetch layer (with ProxiesAPI)
```python
import os
import random

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

TIMEOUT = (10, 40)  # (connect, read) seconds

UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

session = requests.Session()

def build_proxies():
    p = os.getenv("PROXIESAPI_PROXY_URL")
    return {"http": p, "https": p} if p else None

@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=1, max=20))
def get_json(url: str) -> dict:
    headers = {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "application/json,text/plain,*/*",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    }
    r = session.get(url, headers=headers, proxies=build_proxies(), timeout=TIMEOUT)
    if r.status_code in (403, 429, 503):
        # Raise so tenacity retries with exponential backoff + jitter
        raise RuntimeError(f"blocked/rate limited: {r.status_code}")
    r.raise_for_status()
    return r.json()
```
Step 2: Scrape a product via /products/<handle>.json
This is the most reliable starting point.
```python
from urllib.parse import urlparse

def normalize_store(url: str) -> str:
    # Accept https://store.com or bare store.com
    if not url.startswith("http"):
        url = "https://" + url
    u = urlparse(url)
    return f"{u.scheme}://{u.netloc}"

def product_json_url(store: str, handle: str) -> str:
    store = normalize_store(store)
    return f"{store}/products/{handle}.json"

def parse_product(product: dict) -> dict:
    p = product.get("product", product)
    variants = []
    for v in p.get("variants", []) or []:
        variants.append({
            "id": v.get("id"),
            "title": v.get("title"),
            "sku": v.get("sku"),
            "price": v.get("price"),
            "compare_at_price": v.get("compare_at_price"),
            "available": v.get("available"),
            "inventory_quantity": v.get("inventory_quantity"),
        })
    return {
        "id": p.get("id"),
        "handle": p.get("handle"),
        "title": p.get("title"),
        "vendor": p.get("vendor"),
        "product_type": p.get("product_type"),
        "tags": p.get("tags"),
        "created_at": p.get("created_at"),
        "updated_at": p.get("updated_at"),
        "variants": variants,
    }

if __name__ == "__main__":
    store = "https://example-store.com"
    handle = "my-product"
    data = get_json(product_json_url(store, handle))
    print(parse_product(data))
```
Notes on inventory
`inventory_quantity` is not always present (many storefronts hide it). But you can still infer availability from:
- the `available` boolean on each variant, i.e. whether adding to cart is enabled
If you truly need inventory counts, that’s usually not available without authenticated flows (and you probably shouldn’t scrape it).
Step 3: Crawl a collection in bulk
Collections are a great way to build a SKU list without scraping theme HTML.
```python
def collection_products_url(store: str, collection_handle: str, page: int = 1, limit: int = 250) -> str:
    store = normalize_store(store)
    return f"{store}/collections/{collection_handle}/products.json?limit={limit}&page={page}"

def crawl_collection(store: str, collection_handle: str, max_pages: int = 5, limit: int = 250) -> list[dict]:
    out = []
    for page in range(1, max_pages + 1):
        data = get_json(collection_products_url(store, collection_handle, page=page, limit=limit))
        products = data.get("products", [])
        if not products:
            break
        out.extend(products)
        print("page", page, "products", len(products), "total", len(out))
        # A short page (fewer than `limit` items) is likely the last one
        if len(products) < limit:
            break
    return out
```
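Collections can reorder or change while you paginate, so the same product may appear on two pages. A simple dedup-by-id pass over the crawl output keeps the first occurrence and preserves order:

```python
# Deduplicate crawled products by Shopify product id, preserving order.
def dedupe_products(products: list[dict]) -> list[dict]:
    seen: set = set()
    out: list[dict] = []
    for p in products:
        pid = p.get("id")
        if pid in seen:
            continue
        seen.add(pid)
        out.append(p)
    return out
```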
Step 4: When .json endpoints are blocked (fallback strategy)
Some stores disable or rate limit JSON endpoints.
Fallback ladder:
- JSON-LD on the product page (`application/ld+json` script tags)
- Shopify’s embedded state objects (varies)
- HTML selectors as a last resort
The goal is: avoid theme-specific selectors.
JSON-LD fallback
```python
import json

from bs4 import BeautifulSoup

def get_html(url: str) -> str:
    # Reuse the Session from Step 1; keep headers browser-like
    headers = {"User-Agent": random.choice(UA_POOL), "Accept": "text/html,*/*"}
    r = session.get(url, headers=headers, proxies=build_proxies(), timeout=TIMEOUT)
    if r.status_code in (403, 429, 503):
        raise RuntimeError(f"blocked: {r.status_code}")
    r.raise_for_status()
    return r.text

def parse_product_jsonld(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    for s in soup.select('script[type="application/ld+json"]'):
        raw = s.string
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except Exception:
            continue
        # A single script tag may hold one object or a list of objects
        objs = data if isinstance(data, list) else [data]
        for obj in objs:
            if isinstance(obj, dict) and obj.get("@type") == "Product":
                return obj
    return None
```
JSON-LD often includes:
- name
- description
- offers.price
- availability
Not always variants, but enough for a basic price tracker.
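Offer shapes vary: `offers` may be a single object or a list, and an `AggregateOffer` carries `lowPrice` rather than `price`. A defensive extraction sketch over the dict returned by `parse_product_jsonld` (the handled shapes are assumptions based on common schema.org markup, not a guarantee for every store):

```python
# Pull the basics out of a JSON-LD Product object, tolerating both
# a single Offer and a list, and both "price" and "lowPrice".
def jsonld_price(product_obj: dict) -> dict:
    offers = product_obj.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    return {
        "name": product_obj.get("name"),
        "price": offers.get("price") or offers.get("lowPrice"),
        "currency": offers.get("priceCurrency"),
        "availability": offers.get("availability"),
    }
```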
Comparison table: extraction methods
| Method | Best for | Stability | Data richness |
|---|---|---|---|
| `/products/<handle>.json` | variants + prices | High | High |
| `/collections/<handle>/products.json` | crawling SKU list | High | Medium |
| JSON-LD | basic product + price | Medium | Medium |
| HTML selectors | last resort | Low | Varies |
Practical advice to avoid breakage
- Treat Shopify as an API-first target: use JSON endpoints.
- Add caching so you don’t refetch unchanged products constantly.
- Keep concurrency low per domain (Shopify stores can be sensitive).
- Rotate IPs when crawling many stores (ProxiesAPI).
- Store raw responses for a small sample so you can debug schema shifts.
Where ProxiesAPI fits (honestly)
ProxiesAPI helps most when you:
- crawl many stores (each with different rate limits)
- refresh prices daily across thousands of URLs
- run on a single server IP that gets flagged
It won’t stop a store from blocking .json endpoints entirely. But combined with:
- sane per-domain rate limits
- retries with jitter
- caching
…it improves success rate and makes failures replayable.
QA checklist
- `/products/<handle>.json` returns a product object
- variant list includes price + availability
- collection crawl paginates without duplicates
- JSON-LD fallback extracts at least name + price
- failures are logged with status + URL for replay
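Several of these checks can run offline against saved fixtures. Here is a small schema-check sketch that mirrors the keys `parse_product` emits in this guide — the required-key sets are assumptions based on this article’s parser, not a Shopify contract:

```python
# Offline schema check for parsed records, so CI can catch schema drift
# without hitting any store. Returns a list of problems; empty means OK.
REQUIRED_PRODUCT_KEYS = {"id", "handle", "title", "variants"}
REQUIRED_VARIANT_KEYS = {"id", "price", "available"}

def validate_record(rec: dict) -> list[str]:
    problems = [f"missing {k}" for k in REQUIRED_PRODUCT_KEYS if k not in rec]
    for i, v in enumerate(rec.get("variants", []) or []):
        problems += [
            f"variant[{i}] missing {k}"
            for k in REQUIRED_VARIANT_KEYS if k not in v
        ]
    return problems
```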