How to Scrape Shopify Stores: Product, Price, Inventory
Shopify product scraping is attractive because many stores expose structured product data without forcing you to reverse-engineer every pixel on the page.
That does not mean every Shopify store is identical. Themes differ, some stores lock down public endpoints, and public inventory fields vary wildly.
The good news is that there is still a practical playbook that works across a large share of stores:
- try JSON endpoints first
- use collection pagination when you need breadth
- parse variant arrays instead of scraping price text from HTML
- treat exact inventory counts as optional, not guaranteed
This guide shows the most dependable way to collect:
- product titles
- handles and URLs
- prices
- compare-at prices
- variant SKUs
- availability signals
- basic collection pagination
One Shopify store is easy. Monitoring hundreds of stores, product pages, and collections is where a stable proxy layer starts paying for itself.
The three Shopify data sources that matter
When people talk about Shopify scraping, they often mix together three different things.
| Source | What it gives you | Reliability |
|---|---|---|
| /products.json | full product objects and variants | best first option when public |
| collection product JSON | product objects scoped to a single collection | great for catalog crawling |
| HTML product page | fallback when JSON is blocked or incomplete | most variable |
The winning strategy is simple: start at the JSON layer, then fall back to HTML only when necessary.
What you can usually extract
For public Shopify stores, you can often get:
| Field | Usually available? | Where to look |
|---|---|---|
| title | yes | products[].title |
| handle | yes | products[].handle |
| product URL | yes | build from handle |
| variant title | yes | variants[].title |
| price | yes | variants[].price |
| compare-at price | often | variants[].compare_at_price |
| SKU | often | variants[].sku |
| availability | often | variants[].available |
| exact inventory quantity | inconsistent | sometimes absent from public responses |
That last row matters. Many beginners promise "inventory scraping" when they really mean availability scraping. Those are not the same thing.
Start with /products.json
Many Shopify stores still expose:
https://STORE_DOMAIN/products.json?limit=250&page=1
For example, at the time of writing, public stores such as:
- kyliecosmetics.com
- gymshark.com
return structured product JSON from that endpoint.
If this works on your target store, use it before anything else.
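Before building on the endpoint, it helps to confirm that the decoded response really has the expected shape, since a locked-down store may answer with an error object or something else entirely. The check below is a minimal sketch under that assumption; the helper name looks_like_products_payload is illustrative and not part of any library.

```python
def looks_like_products_payload(payload: object) -> bool:
    """Return True when a decoded response resembles a /products.json body."""
    # The endpoint is expected to return an object with a "products" list.
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    # Every product should be an object carrying a "variants" key.
    # An empty list still passes: it is a valid (empty) page.
    return all(isinstance(p, dict) and "variants" in p for p in products)


if __name__ == "__main__":
    sample = {"products": [{"handle": "demo-tee", "variants": []}]}
    print(looks_like_products_payload(sample))
    print(looks_like_products_payload({"errors": "Not Found"}))
```

A failed check is a reasonable trigger for moving on to another data source rather than retrying the same URL.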
A practical Python fetcher
```python
from __future__ import annotations

import os
from urllib.parse import quote, urljoin

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 30)

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/136.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json,text/html;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)


def build_fetch_url(url: str) -> str:
    """Route the request through ProxiesAPI when PROXIESAPI_KEY is set; otherwise fetch directly."""
    api_key = os.getenv("PROXIESAPI_KEY", "").strip()
    if not api_key:
        return url
    return (
        "https://api.proxiesapi.com/?auth_key="
        + quote(api_key, safe="")
        + "&url="
        + quote(url, safe="")
    )


@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=15),
    retry=retry_if_exception_type(requests.RequestException),
)
def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body, retrying on any requests exception (up to 4 attempts)."""
    response = session.get(build_fetch_url(url), timeout=TIMEOUT)
    response.raise_for_status()
    return response.json()
```
Parse product and variant data the right way
The biggest mistake in Shopify product scraping is flattening everything at the product level and losing the variants. Prices, availability, and SKUs usually belong to variants.
```python
def extract_rows(store_base: str, payload: dict) -> list[dict]:
    """Flatten a /products.json payload into one row per variant."""
    rows = []
    for product in payload.get("products", []):
        handle = product.get("handle")
        product_url = urljoin(store_base, f"/products/{handle}") if handle else None
        for variant in product.get("variants", []):
            rows.append({
                "product_id": product.get("id"),
                "product_title": product.get("title"),
                "vendor": product.get("vendor"),
                "handle": handle,
                "product_url": product_url,
                "variant_id": variant.get("id"),
                "variant_title": variant.get("title"),
                "sku": variant.get("sku"),
                "price": variant.get("price"),
                "compare_at_price": variant.get("compare_at_price"),
                "available": variant.get("available"),
                # May be None: many stores omit exact counts from public responses.
                "inventory_quantity": variant.get("inventory_quantity"),
                "created_at": product.get("created_at"),
                "updated_at": product.get("updated_at"),
            })
    return rows
```
This gives you one row per variant, which is usually what you want for real monitoring.
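With variant-level rows in hand, price comparisons become a small post-processing step. The sketch below assumes price and compare_at_price arrive as decimal strings such as "30.00", which is common in public responses but worth verifying per store. The helper names to_decimal and discount_percent are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_decimal(value) -> Optional[Decimal]:
    """Convert a price field to Decimal; return None when missing or malformed."""
    if value is None or value == "":
        return None
    try:
        return Decimal(str(value))
    except InvalidOperation:
        return None


def discount_percent(price, compare_at_price) -> Optional[Decimal]:
    """Percentage off the compare-at price, or None when there is no real markdown."""
    current = to_decimal(price)
    original = to_decimal(compare_at_price)
    if current is None or original is None:
        return None
    if original <= 0 or current >= original:
        return None
    # Round to one decimal place for reporting.
    return ((original - current) / original * 100).quantize(Decimal("0.1"))


if __name__ == "__main__":
    print(discount_percent("30.00", "40.00"))
```

Decimal is used instead of float so that currency values are not distorted by binary rounding.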
Crawl multiple pages
```python
def crawl_products_json(store_base: str, max_pages: int = 5, limit: int = 250) -> list[dict]:
    all_rows = []
    for page in range(1, max_pages + 1):
        url = urljoin(store_base, f"/products.json?limit={limit}&page={page}")
        payload = fetch_json(url)
        products = payload.get("products", [])
        # An empty page means we have walked past the end of the catalog.
        if not products:
            break
        batch = extract_rows(store_base, payload)
        all_rows.extend(batch)
        print(f"page={page} products={len(products)} rows={len(batch)} total_rows={len(all_rows)}")
        # A short page is the last page, so skip the extra request.
        if len(products) < limit:
            break
    return all_rows
```
That is often enough for a whole-store catalog pull.
Collection-level crawling is often cleaner
For big stores, you may not want the whole catalog every time. Collection-level crawling keeps jobs more focused.
Typical pattern:
https://STORE_DOMAIN/collections/running/products.json?limit=250&page=1
This is useful for:
- category-specific monitoring
- smaller incremental jobs
- lower per-run bandwidth
- easier QA when a merchant has thousands of products
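The collection pattern above can be generated from a store base URL and a collection handle. This is a minimal sketch; collection_products_url is a hypothetical helper name, and the handle is percent-encoded defensively in case it contains unusual characters.

```python
from urllib.parse import quote, urlencode, urljoin


def collection_products_url(
    store_base: str, collection_handle: str, limit: int = 250, page: int = 1
) -> str:
    """Build the collection-scoped products.json URL for one page."""
    path = f"/collections/{quote(collection_handle, safe='')}/products.json"
    query = urlencode({"limit": limit, "page": page})
    return urljoin(store_base, f"{path}?{query}")


if __name__ == "__main__":
    # "example-store.test" is a placeholder domain, not a real store.
    print(collection_products_url("https://example-store.test", "running"))
```

The resulting URL returns the same kind of payload as the store-wide endpoint when the store exposes it, so the same fetch and parse functions can be reused.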
What about inventory?
This is where honest guidance matters.
You can usually get:
- available: true/false
- variant presence/absence
- sold-out states on product pages
You sometimes get:
- inventory_quantity
You should not assume:
- that every public Shopify store exposes exact inventory counts
Some stores expose inventory_quantity; some do not; some return values that are not operationally useful.
So if your business question is:
- "Is it in stock?" then public Shopify scraping is often enough.
- "Exactly how many units remain?" then public storefront data is much less reliable.
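One way to keep the two questions separate in code is to derive them from different fields and to allow an explicit "unknown" answer. A minimal sketch, assuming variant dicts shaped like those in the public JSON; the function names stock_signal and known_quantity are illustrative.

```python
from typing import Optional


def stock_signal(variant: dict) -> str:
    """Answer "is it in stock?" from the boolean availability flag only."""
    available = variant.get("available")
    if available is True:
        return "in_stock"
    if available is False:
        return "sold_out"
    return "unknown"  # field missing or not a boolean


def known_quantity(variant: dict) -> Optional[int]:
    """Answer "how many units?" only when the store actually exposes a count."""
    quantity = variant.get("inventory_quantity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(quantity, int) and not isinstance(quantity, bool):
        return quantity
    return None


if __name__ == "__main__":
    variant = {"available": True}
    print(stock_signal(variant), known_quantity(variant))
```

Returning None instead of 0 for a missing count keeps "not exposed" from being mistaken for "out of stock" downstream.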
Fallback: scrape the HTML product page
If /products.json is blocked, rate-limited, or incomplete, fall back to the product page and look for embedded structured data.
Common places to inspect:
| Fallback source | Why it helps |
|---|---|
| script[type="application/ld+json"] | often contains product metadata |
| inline JS objects | some themes serialize product data directly into the page |
| variant picker markup | can expose availability and variant IDs |
This is more brittle than JSON endpoints, which is why it should be your fallback, not your default.
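As a rough illustration of the first fallback source, the sketch below pulls application/ld+json blocks out of a page using only the standard library and keeps objects whose @type is "Product". It assumes the product object sits at the top level or inside a top-level list; themes that nest it differently (for example under @graph) would need extra handling. The names JsonLdCollector and extract_product_jsonld are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def extract_product_jsonld(html: str) -> list:
    """Return decoded JSON-LD objects whose @type is exactly "Product"."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    products = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Product":
                products.append(item)
    return products
```

Which fields appear inside the Product object depends on the theme, so treat every key as optional when mapping it onto your variant rows.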
End-to-end example
```python
import csv


def save_csv(rows: list[dict], path: str) -> None:
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    store = "https://www.kyliecosmetics.com"
    rows = crawl_products_json(store, max_pages=2)
    save_csv(rows, "shopify_products.csv")
    print(f"saved {len(rows)} variant rows")
```
Typical output:
page=1 products=250 rows=1380 total_rows=1380
page=2 products=87 rows=412 total_rows=1792
saved 1792 variant rows
JSON endpoints vs HTML scraping
| Method | Pros | Cons |
|---|---|---|
| /products.json | structured, fast, variant-rich | not guaranteed on every store |
| collection JSON | cleaner targeting, smaller jobs | depends on store exposing collection product JSON |
| HTML parsing | works when JSON is blocked | theme-dependent and more brittle |
For most teams, the right stack is:
- try /products.json
- fall back to collection JSON
- use HTML only for specific gaps
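That ordering can be expressed as a small fallback chain. The sketch below is deliberately generic: each source is a zero-argument callable supplied by the caller, and first_successful is a hypothetical helper, not part of any library. The loaders in the usage example are stand-ins that simulate a blocked endpoint rather than real network calls.

```python
from typing import Callable, List, Optional, Tuple


def first_successful(
    sources: List[Tuple[str, Callable[[], list]]]
) -> Tuple[Optional[str], list]:
    """Try each (name, loader) in order; return the first non-empty result."""
    for name, loader in sources:
        try:
            rows = loader()
        except Exception as exc:  # broad on purpose: any failure means "try the next source"
            print(f"{name} failed: {exc}")
            continue
        if rows:
            return name, rows
    return None, []


if __name__ == "__main__":
    def blocked_products_json() -> list:
        raise RuntimeError("simulated 403")

    source, rows = first_successful([
        ("products_json", blocked_products_json),
        ("collection_json", lambda: [{"sku": "DEMO-1"}]),
        ("html", lambda: [{"sku": "unused"}]),
    ])
    print(source, len(rows))
```

Recording which source produced each batch makes it easier to spot stores that have quietly dropped off the JSON path.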
When ProxiesAPI helps with Shopify product scraping
Shopify itself is not the whole problem. The problem appears when you scale to:
- hundreds of stores
- frequent refreshes
- many collection pages
- product-page fallbacks after JSON failures
That is when a stable proxy layer starts helping with:
- IP reputation
- regional consistency
- fewer noisy bans from one hot IP
It does not magically unlock hidden inventory fields, but it does make a broad crawl more reliable.
Practical advice
1. Model data at the variant level
If you collapse everything to one row per product, you lose the exact price and availability details that matter most.
2. Separate "availability" from "inventory quantity"
This sounds small, but it saves a lot of confusion in downstream analytics.
3. Keep the JSON path first
Do not pay the HTML-scraping tax unless you have to.
4. Add store-specific exceptions later
A good general scraper beats a giant pile of one-off rules on day one.
Bottom line
Shopify product scraping is one of the friendlier e-commerce scraping jobs because many stores expose structured product JSON. The winning move is to build around public JSON endpoints, keep variants as first-class rows, and treat exact inventory counts as optional rather than promised.
That approach gets you clean product, price, and availability data quickly. When you scale the crawl across many stores, ProxiesAPI helps keep the network side predictable without changing the core parser.