How to Scrape Shopify Stores: Product, Price, Inventory

Shopify product scraping is attractive because many stores expose structured product data without forcing you to reverse-engineer every pixel on the page.

That does not mean every Shopify store is identical. Themes differ, some stores lock down public endpoints, and public inventory fields vary wildly.

The good news is that there is still a practical playbook that works across a large share of stores:

  1. try JSON endpoints first
  2. use collection pagination when you need breadth
  3. parse variant arrays instead of scraping price text from HTML
  4. treat exact inventory counts as optional, not guaranteed

This guide shows the most dependable way to collect:

  • product titles
  • handles and URLs
  • prices
  • compare-at prices
  • variant SKUs
  • availability signals
  • basic collection pagination

Scale Shopify monitoring with ProxiesAPI

One Shopify store is easy. Monitoring hundreds of stores, product pages, and collections is where a stable proxy layer starts paying for itself.


The three Shopify data sources that matter

When people talk about Shopify scraping, they often mix together three different things.

| Source | What it gives you | Reliability |
|---|---|---|
| /products.json | full product objects and variants | best first option when public |
| collection product JSON | store- or collection-scoped inventory of products | great for catalog crawling |
| HTML product page | fallback when JSON is blocked or incomplete | most variable |

The winning strategy is simple: start at the JSON layer, then fall back to HTML only when necessary.


What you can usually extract

For public Shopify stores, you can often get:

| Field | Usually available? | Where to look |
|---|---|---|
| title | yes | products[].title |
| handle | yes | products[].handle |
| product URL | yes | build from handle |
| variant title | yes | variants[].title |
| price | yes | variants[].price |
| compare-at price | often | variants[].compare_at_price |
| SKU | often | variants[].sku |
| availability | often | variants[].available |
| exact inventory quantity | inconsistent | sometimes absent from public responses |

That last row matters. Many beginners promise "inventory scraping" when they really mean availability scraping. Those are not the same thing.


Start with /products.json

Many Shopify stores still expose:

https://STORE_DOMAIN/products.json?limit=250&page=1

For example, at the time of writing, public stores such as:

  • kyliecosmetics.com
  • gymshark.com

return structured product JSON from that endpoint.

If this works on your target store, use it before anything else.
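
Before building a crawl around this endpoint, it can help to confirm that the response really has the expected shape. The sketch below is one way to do that. The helper names `products_json_url` and `looks_like_product_feed` are hypothetical, and the shape check (a top-level `products` list whose items carry `variants`) is an assumption based on the fields used in this guide, not an official contract.

```python
from urllib.parse import urlencode, urljoin


def products_json_url(store_base: str, page: int = 1, limit: int = 250) -> str:
    """Build the /products.json URL for one page of a store (hypothetical helper)."""
    query = urlencode({"limit": limit, "page": page})
    return urljoin(store_base, "/products.json") + "?" + query


def looks_like_product_feed(payload: object) -> bool:
    """Return True when the decoded JSON has the shape this guide relies on.

    Assumption: a usable feed is a dict with a "products" list, and every
    product is a dict that carries a "variants" key. An empty list still
    counts as a valid shape (for example, a page past the end).
    """
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    return all(isinstance(p, dict) and "variants" in p for p in products)
```

If the check fails on the first page, that is a reasonable signal to move to collection JSON or the HTML fallback rather than retrying the same URL.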


A practical Python fetcher

from __future__ import annotations

import os
from urllib.parse import quote, urljoin

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/136.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json,text/html;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)


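def build_fetch_url(url: str) -> str:
    """Wrap the target URL in a ProxiesAPI request when PROXIESAPI_KEY is set."""
    api_key = os.getenv("PROXIESAPI_KEY", "").strip()
    if not api_key:
        # No key configured: fetch the store directly.
        return url
    # Percent-encode both values so the target URL's own query string survives.
    return (
        "https://api.proxiesapi.com/?auth_key="
        + quote(api_key, safe="")
        + "&url="
        + quote(url, safe="")
    )


# Retry up to 4 attempts with exponential backoff and jitter. HTTPError from
# raise_for_status() is a RequestException, so 4xx/5xx responses are retried too.
@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=15),
    retry=retry_if_exception_type(requests.RequestException),
)
def fetch_json(url: str) -> dict:
    """GET the URL (through the proxy when configured) and decode the JSON body."""
    response = session.get(build_fetch_url(url), timeout=TIMEOUT)
    response.raise_for_status()
    return response.json()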

Parse product and variant data the right way

The biggest mistake in Shopify product scraping is flattening everything at the product level and losing the variants. Prices, availability, and SKUs usually belong to variants.

def extract_rows(store_base: str, payload: dict) -> list[dict]:
    rows = []

    for product in payload.get("products", []):
        handle = product.get("handle")
        product_url = urljoin(store_base, f"/products/{handle}") if handle else None

        for variant in product.get("variants", []):
            rows.append({
                "product_id": product.get("id"),
                "product_title": product.get("title"),
                "vendor": product.get("vendor"),
                "handle": handle,
                "product_url": product_url,
                "variant_id": variant.get("id"),
                "variant_title": variant.get("title"),
                "sku": variant.get("sku"),
                "price": variant.get("price"),
                "compare_at_price": variant.get("compare_at_price"),
                "available": variant.get("available"),
                "inventory_quantity": variant.get("inventory_quantity"),
                "created_at": product.get("created_at"),
                "updated_at": product.get("updated_at"),
            })

    return rows

This gives you one row per variant, which is usually what you want for real monitoring.
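
The rows above keep `price` and `compare_at_price` exactly as the store returned them. For monitoring you will usually want numbers you can compare. The sketch below assumes those fields arrive as decimal strings such as "29.00" (verify this on your target store); `to_decimal` and `discount_percent` are hypothetical helper names, not part of any Shopify API.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_decimal(value: object) -> Optional[Decimal]:
    """Convert a raw price field to Decimal, or None when it is missing or malformed."""
    if value is None or value == "":
        return None
    try:
        number = Decimal(str(value))
    except InvalidOperation:
        return None
    # Reject NaN and infinities so later comparisons cannot raise.
    return number if number.is_finite() else None


def discount_percent(price: object, compare_at_price: object) -> Optional[Decimal]:
    """Return the markdown as a percentage (one decimal place), or None if not on sale."""
    current = to_decimal(price)
    original = to_decimal(compare_at_price)
    if current is None or original is None:
        return None
    if original <= 0 or current >= original:
        return None
    return ((original - current) / original * 100).quantize(Decimal("0.1"))
```

Using `Decimal` rather than `float` avoids rounding surprises when you later diff prices between crawls.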


Crawl multiple pages

def crawl_products_json(store_base: str, max_pages: int = 5, limit: int = 250) -> list[dict]:
    all_rows = []

    for page in range(1, max_pages + 1):
        url = urljoin(store_base, f"/products.json?limit={limit}&page={page}")
        payload = fetch_json(url)
        products = payload.get("products", [])

        if not products:
            break

        batch = extract_rows(store_base, payload)
        all_rows.extend(batch)
        print(f"page={page} products={len(products)} rows={len(batch)} total_rows={len(all_rows)}")

        if len(products) < limit:
            break

    return all_rows

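That is often enough for a whole-store catalog pull. With the defaults above, the crawl stops after 5 pages (at most 1,250 products), so raise max_pages for larger catalogs.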


Collection-level crawling is often cleaner

For big stores, you may not want the whole catalog every time. Collection-level crawling keeps jobs more focused.

Typical pattern:

https://STORE_DOMAIN/collections/running/products.json?limit=250&page=1

This is useful for:

  • category-specific monitoring
  • smaller incremental jobs
  • lower per-run bandwidth
  • easier QA when a merchant has thousands of products
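
A minimal sketch of collection-scoped crawling follows. It assumes the URL pattern shown above and that the same product can appear in more than one collection, so rows are de-duplicated by `variant_id`. Both helper names are hypothetical, and "running" is only an example handle.

```python
from urllib.parse import quote, urljoin


def collection_products_url(
    store_base: str, collection_handle: str, page: int = 1, limit: int = 250
) -> str:
    """Build the collection-scoped products.json URL (hypothetical helper)."""
    path = f"/collections/{quote(collection_handle, safe='')}/products.json"
    return urljoin(store_base, path) + f"?limit={limit}&page={page}"


def merge_unique_variants(*batches: list[dict]) -> list[dict]:
    """Combine variant rows from several collections, keeping the first copy of each.

    Rows without a variant_id are always kept, because there is no key
    to de-duplicate them on.
    """
    seen = set()
    merged = []
    for batch in batches:
        for row in batch:
            key = row.get("variant_id")
            if key is not None:
                if key in seen:
                    continue
                seen.add(key)
            merged.append(row)
    return merged
```

In practice you would fetch each collection URL, run the payload through the same variant-row extraction used for the whole-store crawl, and pass the resulting lists to `merge_unique_variants`.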

What about inventory?

This is where honest guidance matters.

You can usually get:

  • available: true/false
  • variant presence/absence
  • sold-out states on product pages

You sometimes get:

  • inventory_quantity

You should not assume:

  • that every public Shopify store exposes exact inventory counts

Some stores expose inventory_quantity; some do not; some return values that are not operationally useful.

So if your business question is:

  • "Is it in stock?" then public Shopify scraping is often enough.
  • "Exactly how many units remain?" then public storefront data is much less reliable.
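
One way to keep the two questions apart in code is to store a stock status and an optional quantity as separate fields. The sketch below assumes the `available` and `inventory_quantity` keys discussed above; `stock_signal` and its status labels are hypothetical names chosen for illustration.

```python
def stock_signal(variant: dict) -> dict:
    """Split a variant into an availability status and an optional exact count.

    status:   "in_stock", "sold_out", or "unknown" when `available` is absent.
    quantity: an integer only when the store exposes inventory_quantity as one;
              otherwise None, so downstream code never mistakes a gap for zero.
    """
    available = variant.get("available")
    if available is True:
        status = "in_stock"
    elif available is False:
        status = "sold_out"
    else:
        status = "unknown"

    quantity = variant.get("inventory_quantity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        quantity = None

    return {"status": status, "quantity": quantity}
```

Keeping `quantity` as None rather than 0 when the field is missing makes it obvious which stores actually expose counts.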

Fallback: scrape the HTML product page

If /products.json is blocked, rate-limited, or incomplete, fall back to the product page and look for embedded structured data.

Common places to inspect:

| Fallback source | Why it helps |
|---|---|
| script[type="application/ld+json"] | often contains product metadata |
| inline JS objects | some themes serialize product data directly into the page |
| variant picker markup | can expose availability and variant IDs |

This is more brittle than JSON endpoints, which is why it should be your fallback, not your default.
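
As a rough illustration of the first fallback source, the sketch below pulls `application/ld+json` blocks out of a page using only the standard library and keeps the ones whose `@type` is "Product". The class and function names are hypothetical, and the code assumes simple markup: it does not handle `@graph` wrappers or `@type` given as a list, which some themes use.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._chunks: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._chunks))


def extract_product_jsonld(html: str) -> list[dict]:
    """Return JSON-LD objects whose @type is "Product"; skip blocks that fail to parse."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()

    products = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Product":
                products.append(item)
    return products
```

Which fields appear inside the Product object depends on the theme, so treat every key as optional when you map it into your variant rows.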


End-to-end example

import csv


def save_csv(rows: list[dict], path: str) -> None:
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    store = "https://www.kyliecosmetics.com"
    rows = crawl_products_json(store, max_pages=2)
    save_csv(rows, "shopify_products.csv")
    print(f"saved {len(rows)} variant rows")

Typical output:

page=1 products=250 rows=1380 total_rows=1380
page=2 products=87 rows=412 total_rows=1792
saved 1792 variant rows

JSON endpoints vs HTML scraping

| Method | Pros | Cons |
|---|---|---|
| /products.json | structured, fast, variant-rich | not guaranteed on every store |
| collection JSON | cleaner targeting, smaller jobs | depends on store exposing collection product JSON |
| HTML parsing | works when JSON is blocked | theme-dependent and more brittle |

For most teams, the right stack is:

  1. try /products.json
  2. fall back to collection JSON
  3. use HTML only for specific gaps
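
The ordering above can be sketched as a small fallback chain. This is an illustration under assumptions, not a fixed recipe: `first_successful` is a hypothetical helper, each source is passed in as a zero-argument callable that returns a list of rows, and any exception or empty result is treated as "try the next source".

```python
from typing import Callable, Optional

# A source is a label plus a zero-argument callable that returns rows.
Source = tuple[str, Callable[[], list]]


def first_successful(sources: list[Source]) -> tuple[Optional[str], list]:
    """Try each source in order and return (label, rows) for the first that yields rows.

    Returns (None, []) when every source fails or comes back empty.
    """
    for label, fetch in sources:
        try:
            rows = fetch()
        except Exception:
            # Broad on purpose in this sketch: a blocked endpoint, a bad
            # status code, or unparseable content all mean "move on".
            continue
        if rows:
            return label, rows
    return None, []
```

Recording which label won for each store also tells you how much of your crawl depends on the more brittle HTML path.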

When ProxiesAPI helps with Shopify product scraping

Shopify itself is not the whole problem. The problem appears when you scale to:

  • hundreds of stores
  • frequent refreshes
  • many collection pages
  • product-page fallbacks after JSON failures

That is when a stable proxy layer starts helping with:

  • IP reputation
  • regional consistency
  • fewer noisy bans from one hot IP

It does not magically unlock hidden inventory fields, but it does make a broad crawl more reliable.


Practical advice

1. Model data at the variant level

If you collapse everything to one row per product, you lose the exact price and availability details that matter most.

2. Separate "availability" from "inventory quantity"

This sounds small, but it saves a lot of confusion in downstream analytics.

3. Keep the JSON path first

Do not pay the HTML-scraping tax unless you have to.

4. Add store-specific exceptions later

A good general scraper beats a giant pile of one-off rules on day one.


Bottom line

Shopify product scraping is one of the friendlier e-commerce scraping jobs because many stores expose structured product JSON. The winning move is to build around public JSON endpoints, keep variants as first-class rows, and treat exact inventory counts as optional rather than promised.

That approach gets you clean product, price, and availability data quickly. When you scale the crawl across many stores, ProxiesAPI helps keep the network side predictable without changing the core parser.
