How to Scrape Shopify Stores: Product, Price, Inventory

Jun 23, 2026 · guide · #shopify product scraping, #shopify, #ecommerce, #web-scraping, #python, #json, #proxiesapi

Shopify product scraping is attractive because many stores expose structured product data without forcing you to reverse-engineer every pixel on the page.

That does not mean every Shopify store is identical. Themes differ, some stores lock down public endpoints, and public inventory fields vary wildly.

The good news is that there is still a practical playbook that works across a large share of stores:

try JSON endpoints first
use collection pagination when you need breadth
parse variant arrays instead of scraping price text from HTML
treat exact inventory counts as optional, not guaranteed

This guide shows a dependable way to collect:

product titles
handles and URLs
prices
compare-at prices
variant SKUs
availability signals
basic collection pagination

Scale Shopify monitoring with ProxiesAPI

One Shopify store is easy. Monitoring hundreds of stores, product pages, and collections is where a stable proxy layer starts paying for itself.


The three Shopify data sources that matter

When people talk about Shopify scraping, they often mix together three different things.

Source	What it gives you	Reliability
`/products.json`	full product objects and variants	best first option when public
collection product JSON	collection-scoped list of products	great for catalog crawling
HTML product page	fallback when JSON is blocked or incomplete	most variable

The winning strategy is simple: start at the JSON layer, then fall back to HTML only when necessary.

What you can usually extract

For public Shopify stores, you can often get:

Field	Usually available?	Where to look
title	yes	`products[].title`
handle	yes	`products[].handle`
product URL	yes	build from handle
variant title	yes	`variants[].title`
price	yes	`variants[].price`
compare-at price	often	`variants[].compare_at_price`
SKU	often	`variants[].sku`
availability	often	`variants[].available`
exact inventory quantity	inconsistent	sometimes absent from public responses

That last row matters. Many beginners promise "inventory scraping" when they really mean availability scraping. Those are not the same thing.

Start with `/products.json`

Many Shopify stores still expose:

https://STORE_DOMAIN/products.json?limit=250&page=1

For example, at the time of writing, public stores such as:

kyliecosmetics.com
gymshark.com

return structured product JSON from that endpoint.

If this works on your target store, use it before anything else.
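
Before relying on the endpoint, it helps to confirm that the decoded response really has the expected shape rather than being an error document or something else entirely. The following is a minimal sketch, assuming the response body has already been parsed into a Python object; `looks_like_products_payload` is a hypothetical helper name, not part of any library.

```python
def looks_like_products_payload(payload: object) -> bool:
    """Return True if payload has the shape {"products": [ {...}, ... ]}."""
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    # An empty list is still a valid (if uninformative) response.
    return all(isinstance(item, dict) for item in products)
```

A check like this is cheap to run on the first page of each store and gives a clear signal for when to switch to a fallback source.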

A practical Python fetcher

from __future__ import annotations

import os
from urllib.parse import quote, urljoin

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/136.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json,text/html;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)


def build_fetch_url(url: str) -> str:
    api_key = os.getenv("PROXIESAPI_KEY", "").strip()
    if not api_key:
        return url
    return (
        "https://api.proxiesapi.com/?auth_key="
        + quote(api_key, safe="")
        + "&url="
        + quote(url, safe="")
    )


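# Retry up to 4 attempts with jittered exponential backoff (capped at 15 s).
# Note: RequestException also covers HTTP errors raised by raise_for_status(),
# so responses such as 403 or 404 are retried too.
@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=15),
    retry=retry_if_exception_type(requests.RequestException),
)
def fetch_json(url: str) -> dict:
    """Fetch a URL (through ProxiesAPI if a key is set) and decode it as JSON."""
    response = session.get(build_fetch_url(url), timeout=TIMEOUT)
    response.raise_for_status()
    return response.json()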

Parse product and variant data the right way

The biggest mistake in Shopify product scraping is flattening everything at the product level and losing the variants. Prices, availability, and SKUs usually belong to variants.

def extract_rows(store_base: str, payload: dict) -> list[dict]:
    rows = []

    for product in payload.get("products", []):
        handle = product.get("handle")
        product_url = urljoin(store_base, f"/products/{handle}") if handle else None

        for variant in product.get("variants", []):
            rows.append({
                "product_id": product.get("id"),
                "product_title": product.get("title"),
                "vendor": product.get("vendor"),
                "handle": handle,
                "product_url": product_url,
                "variant_id": variant.get("id"),
                "variant_title": variant.get("title"),
                "sku": variant.get("sku"),
                "price": variant.get("price"),
                "compare_at_price": variant.get("compare_at_price"),
                "available": variant.get("available"),
                "inventory_quantity": variant.get("inventory_quantity"),
                "created_at": product.get("created_at"),
                "updated_at": product.get("updated_at"),
            })

    return rows

This gives you one row per variant, which is usually what you want for real monitoring.
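
Price fields in these payloads commonly arrive as strings (for example "24.00"), and compare_at_price may be null. The sketch below normalizes them with `Decimal` and derives a discount percentage. The helper names `to_decimal` and `discount_percent` are illustrative, and the string format is an assumption about what a given store returns.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_decimal(value: object) -> Optional[Decimal]:
    """Convert a price-like value to Decimal, or None if missing or unparseable."""
    if value is None or value == "":
        return None
    try:
        parsed = Decimal(str(value))
    except InvalidOperation:
        return None
    # Reject NaN/Infinity so later comparisons are always safe.
    return parsed if parsed.is_finite() else None


def discount_percent(price: object, compare_at_price: object) -> Optional[Decimal]:
    """Percentage off relative to compare_at_price, rounded to one decimal place."""
    current = to_decimal(price)
    original = to_decimal(compare_at_price)
    if current is None or original is None:
        return None
    if original <= 0 or current >= original:
        return None  # no meaningful discount to report
    return ((original - current) / original * 100).quantize(Decimal("0.1"))
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are compared across crawl runs.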

Crawl multiple pages

def crawl_products_json(store_base: str, max_pages: int = 5, limit: int = 250) -> list[dict]:
    all_rows = []

    for page in range(1, max_pages + 1):
        url = urljoin(store_base, f"/products.json?limit={limit}&page={page}")
        payload = fetch_json(url)
        products = payload.get("products", [])

        if not products:
            break

        batch = extract_rows(store_base, payload)
        all_rows.extend(batch)
        print(f"page={page} products={len(products)} rows={len(batch)} total_rows={len(all_rows)}")

        if len(products) < limit:
            break

    return all_rows

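With the defaults, the crawl stops after 5 pages (up to 1,250 products), which is often enough for a whole-store catalog pull. Raise max_pages for larger catalogs.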
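
Page-based pagination can return the same product on two pages if the catalog changes while the crawl is running. A small sketch of removing duplicates by variant ID follows; `dedupe_by_variant` is a hypothetical helper, and it assumes rows carry a `variant_id` key as in the extraction code above.

```python
def dedupe_by_variant(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each variant_id, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = row.get("variant_id")
        if key is None:
            # No ID to compare on, so keep the row rather than guess.
            unique.append(row)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique
```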

Collection-level crawling is often cleaner

For big stores, you may not want the whole catalog every time. Collection-level crawling keeps jobs more focused.

Typical pattern:

https://STORE_DOMAIN/collections/running/products.json?limit=250&page=1

This is useful for:

category-specific monitoring
smaller incremental jobs
lower per-run bandwidth
easier QA when a merchant has thousands of products
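
The URL pattern above can be built with a small helper. This is a sketch under the assumption that the target store exposes collection product JSON at that path; `collection_products_url` is an illustrative name, and `running` is just the example handle from the pattern.

```python
from urllib.parse import quote, urlencode, urljoin


def collection_products_url(
    store_base: str, collection_handle: str, page: int = 1, limit: int = 250
) -> str:
    """Build the collection-scoped products.json URL for one page."""
    # Percent-encode the handle so unusual characters cannot break the path.
    path = f"/collections/{quote(collection_handle, safe='')}/products.json"
    query = urlencode({"limit": limit, "page": page})
    return urljoin(store_base, path) + "?" + query
```

The resulting URL can be passed to fetch_json, and if the response carries the same top-level products key, extract_rows should work on it unchanged.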

What about inventory?

This is where honest guidance matters.

You can usually get:

available: true/false
variant presence/absence
sold-out states on product pages

You should not assume:

that every public Shopify store exposes exact inventory counts

Some stores expose inventory_quantity; some do not; some return values that are not operationally useful.
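
One way to keep the two signals apart in your own data model is to store them as separate fields and use None when a field is absent, instead of guessing. A minimal sketch, with `stock_signal` and `StockSignal` as hypothetical names:

```python
from typing import Optional, TypedDict


class StockSignal(TypedDict):
    in_stock: Optional[bool]   # None means the store did not say
    quantity: Optional[int]    # None means no exact count was exposed


def stock_signal(variant: dict) -> StockSignal:
    """Split a variant into an availability flag and an optional quantity."""
    available = variant.get("available")
    quantity = variant.get("inventory_quantity")
    return {
        "in_stock": available if isinstance(available, bool) else None,
        # bool is a subclass of int, so exclude it explicitly.
        "quantity": (
            quantity
            if isinstance(quantity, int) and not isinstance(quantity, bool)
            else None
        ),
    }
```

Keeping "unknown" distinct from "zero" prevents downstream reports from treating a missing count as a sold-out product.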

So if your business question is:

"Is it in stock?" then public Shopify scraping is often enough.
"Exactly how many units remain?" then public storefront data is much less reliable.

Fallback: scrape the HTML product page

If /products.json is blocked, rate-limited, or incomplete, fall back to the product page and look for embedded structured data.

Common places to inspect:

Fallback source	Why it helps
`script[type="application/ld+json"]`	often contains product metadata
inline JS objects	some themes serialize product data directly into the page
variant picker markup	can expose availability and variant IDs

This is more brittle than JSON endpoints, which is why it should be your fallback, not your default.
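
As an illustration of the first row in that table, the sketch below pulls Product objects out of JSON-LD script tags using only the standard library. It assumes the page embeds a top-level object (or list of objects) whose @type is the string "Product"; themes that nest data under @graph or use a list for @type would need extra handling. `JsonLdCollector` and `extract_product_jsonld` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_product_jsonld(html: str) -> list[dict]:
    """Return every JSON-LD object in the page whose @type is "Product"."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    products = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                products.append(item)
    return products
```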

End-to-end example

import csv


def save_csv(rows: list[dict], path: str) -> None:
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    store = "https://www.kyliecosmetics.com"
    rows = crawl_products_json(store, max_pages=2)
    save_csv(rows, "shopify_products.csv")
    print(f"saved {len(rows)} variant rows")

Typical output:

page=1 products=250 rows=1380 total_rows=1380
page=2 products=87 rows=412 total_rows=1792
saved 1792 variant rows

JSON endpoints vs HTML scraping

Method	Pros	Cons
`/products.json`	structured, fast, variant-rich	not guaranteed on every store
collection JSON	cleaner targeting, smaller jobs	depends on store exposing collection product JSON
HTML parsing	works when JSON is blocked	theme-dependent and more brittle

For most teams, the right stack is:

try /products.json
fall back to collection JSON
use HTML only for specific gaps
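
That ordering can be expressed as a simple fallback chain. The sketch below is one possible shape, not a prescribed design: each source is a name plus a zero-argument loader function that you supply, and `first_successful` is a hypothetical helper name.

```python
from typing import Callable, Iterable, Optional


def first_successful(
    sources: Iterable[tuple[str, Callable[[], list]]],
) -> tuple[Optional[str], list]:
    """Try each (name, loader) pair in order; return the first non-empty result."""
    for name, loader in sources:
        try:
            rows = loader()
        except Exception as exc:  # broad on purpose: any failure moves on
            print(f"source={name} failed: {exc!r}")
            continue
        if rows:
            return name, rows
    return None, []
```

Recording which source produced the rows makes it easier to see later how often a store needed the HTML fallback.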

When ProxiesAPI helps with Shopify product scraping

A single store rarely causes trouble. The problems appear when you scale to:

hundreds of stores
frequent refreshes
many collection pages
product-page fallbacks after JSON failures

That is when a stable proxy layer starts helping with:

IP reputation
regional consistency
fewer noisy bans from one hot IP

It does not magically unlock hidden inventory fields, but it does make a broad crawl more reliable.

Practical advice

1. Model data at the variant level

If you collapse everything to one row per product, you lose the exact price and availability details that matter most.

2. Separate "availability" from "inventory quantity"

This sounds small, but it saves a lot of confusion in downstream analytics.

3. Keep the JSON path first

Do not pay the HTML-scraping tax unless you have to.

4. Add store-specific exceptions later

A good general scraper beats a giant pile of one-off rules on day one.

Shopify product scraping is one of the friendlier e-commerce scraping jobs because many stores expose structured product JSON. The winning move is to build around public JSON endpoints, keep variants as first-class rows, and treat exact inventory counts as optional rather than promised.

That approach gets you clean product, price, and availability data quickly. When you scale the crawl across many stores, ProxiesAPI helps keep the network side predictable without changing the core parser.
