Shopify Product Scraping (2026): Prices, Variants, Inventory—Without Breaking When Themes Change
Shopify is everywhere. If you’re building a price tracker, a product research tool, or a competitive monitoring system, you’ll likely need to scrape Shopify storefronts.
The problem: Shopify themes change.
If you scrape HTML with brittle selectors like `.product__title`, you’ll spend your life fixing parsers.
This guide shows a Shopify-first strategy that stays stable in 2026:
- Prefer platform JSON endpoints (`/products/<handle>.json`, `/collections/<handle>/products.json`)
- Use structured data (JSON-LD) as a secondary source
- Use HTML as a last resort with minimal assumptions
- Handle variants + availability signals in a way that doesn’t depend on theme markup
Shopify storefronts are consistent at the platform layer (JSON endpoints), but rate limits and blocks still happen at scale. ProxiesAPI helps you keep your crawler stable with IP rotation + consistent routing across runs.
What Shopify gives you for free (stable endpoints)
Most Shopify stores expose useful endpoints that do not depend on the theme:
1) Product JSON
https://STORE_DOMAIN/products/HANDLE.json
This returns:
- title
- vendor
- product type
- images
- variants (id, title, price, available, sku, etc.)
2) Collection products JSON
https://STORE_DOMAIN/collections/COLLECTION_HANDLE/products.json?limit=250&page=1
Great for crawling category pages in bulk.
3) Search suggestions / predictive search
Many stores enable predictive search endpoints. Not universal, but helpful.
Because these are platform endpoints, they’re much less likely to break when the store redesigns.
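Before committing to a full crawl, it’s worth probing whether a store actually exposes these endpoints. A minimal sketch — the `limit=1` query and the `Content-Type` check are just one reasonable heuristic, not an official detection method:

```python
# Probe whether a store answers on the platform JSON endpoints before
# scheduling a full crawl. Stores can disable or firewall these routes.
import requests

def products_probe_url(store: str) -> str:
    """Build the cheapest possible probe: one product from /products.json."""
    return f"{store.rstrip('/')}/products.json?limit=1"

def store_exposes_json(store: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint responds OK with a JSON content type."""
    try:
        r = requests.get(
            products_probe_url(store),
            headers={"Accept": "application/json"},
            timeout=timeout,
        )
        return r.ok and "json" in r.headers.get("Content-Type", "")
    except requests.RequestException:
        return False
```

If the probe fails, skip straight to the JSON-LD fallback described later instead of burning retries.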
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests tenacity beautifulsoup4 lxml
```
Step 1: Build a resilient fetch layer (with ProxiesAPI)
```python
import os
import random

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

TIMEOUT = (10, 40)  # (connect, read) seconds

UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

session = requests.Session()

def build_proxies():
    p = os.getenv("PROXIESAPI_PROXY_URL")
    return {"http": p, "https": p} if p else None

@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=1, max=20))
def get_json(url: str) -> dict:
    headers = {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "application/json,text/plain,*/*",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    }
    r = session.get(url, headers=headers, proxies=build_proxies(), timeout=TIMEOUT)
    if r.status_code in (403, 429, 503):
        # Raise so tenacity retries with exponential backoff + jitter
        raise RuntimeError(f"blocked/rate limited: {r.status_code}")
    r.raise_for_status()
    return r.json()
```
Step 2: Scrape a product via /products/<handle>.json
This is the most reliable starting point.
```python
from urllib.parse import urlparse

def normalize_store(url: str) -> str:
    # Accept https://store.com or bare store.com
    if not url.startswith("http"):
        url = "https://" + url
    u = urlparse(url)
    return f"{u.scheme}://{u.netloc}"

def product_json_url(store: str, handle: str) -> str:
    store = normalize_store(store)
    return f"{store}/products/{handle}.json"

def parse_product(product: dict) -> dict:
    p = product.get("product", product)
    variants = []
    for v in p.get("variants", []) or []:
        variants.append({
            "id": v.get("id"),
            "title": v.get("title"),
            "sku": v.get("sku"),
            "price": v.get("price"),
            "compare_at_price": v.get("compare_at_price"),
            "available": v.get("available"),
            "inventory_quantity": v.get("inventory_quantity"),
        })
    return {
        "id": p.get("id"),
        "handle": p.get("handle"),
        "title": p.get("title"),
        "vendor": p.get("vendor"),
        "product_type": p.get("product_type"),
        "tags": p.get("tags"),
        "created_at": p.get("created_at"),
        "updated_at": p.get("updated_at"),
        "variants": variants,
    }

if __name__ == "__main__":
    store = "https://example-store.com"
    handle = "my-product"
    data = get_json(product_json_url(store, handle))
    print(parse_product(data))
```
Notes on inventory
`inventory_quantity` is not always present (many storefronts hide it). But you can still infer availability from:
- the `available` boolean on each variant, i.e. whether adding to cart is enabled
If you truly need inventory counts, that’s usually not available without authenticated flows (and you probably shouldn’t scrape it).
Step 3: Crawl a collection in bulk
Collections are a great way to build a SKU list without scraping theme HTML.
```python
def collection_products_url(store: str, collection_handle: str, page: int = 1, limit: int = 250) -> str:
    store = normalize_store(store)
    return f"{store}/collections/{collection_handle}/products.json?limit={limit}&page={page}"

def crawl_collection(store: str, collection_handle: str, max_pages: int = 5, limit: int = 250) -> list[dict]:
    out = []
    for page in range(1, max_pages + 1):
        data = get_json(collection_products_url(store, collection_handle, page=page, limit=limit))
        products = data.get("products", [])
        if not products:
            break
        out.extend(products)
        print("page", page, "products", len(products), "total", len(out))
        # A short page (fewer than `limit` items) is likely the last one
        if len(products) < limit:
            break
    return out
```
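Collections can reorder or change while you paginate, so the same product may appear on two pages. A simple dedup-by-id pass over the crawl output keeps the first occurrence and preserves order:

```python
# Deduplicate crawled products by Shopify product id, preserving order.
def dedupe_products(products: list[dict]) -> list[dict]:
    seen: set = set()
    out: list[dict] = []
    for p in products:
        pid = p.get("id")
        if pid in seen:
            continue
        seen.add(pid)
        out.append(p)
    return out
```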
Step 4: When .json endpoints are blocked (fallback strategy)
Some stores disable or rate limit JSON endpoints.
Fallback ladder:
- JSON-LD on the product page (`application/ld+json` script tags)
- Shopify’s embedded state objects (varies)
- HTML selectors as a last resort
The goal is: avoid theme-specific selectors.
JSON-LD fallback
```python
import json

from bs4 import BeautifulSoup

def get_html(url: str) -> str:
    # Reuse the Session from Step 1; keep headers browser-like
    headers = {"User-Agent": random.choice(UA_POOL), "Accept": "text/html,*/*"}
    r = session.get(url, headers=headers, proxies=build_proxies(), timeout=TIMEOUT)
    if r.status_code in (403, 429, 503):
        raise RuntimeError(f"blocked: {r.status_code}")
    r.raise_for_status()
    return r.text

def parse_product_jsonld(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    for s in soup.select('script[type="application/ld+json"]'):
        raw = s.string
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except Exception:
            continue
        # A single script tag may hold one object or a list of objects
        objs = data if isinstance(data, list) else [data]
        for obj in objs:
            if isinstance(obj, dict) and obj.get("@type") == "Product":
                return obj
    return None
```
JSON-LD often includes:
- name
- description
- offers.price
- availability
Not always variants, but enough for a basic price tracker.
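Offer shapes vary: `offers` may be a single object or a list, and an `AggregateOffer` carries `lowPrice` rather than `price`. A defensive extraction sketch over the dict returned by `parse_product_jsonld` (the handled shapes are assumptions based on common schema.org markup, not a guarantee for every store):

```python
# Pull the basics out of a JSON-LD Product object, tolerating both
# a single Offer and a list, and both "price" and "lowPrice".
def jsonld_price(product_obj: dict) -> dict:
    offers = product_obj.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    return {
        "name": product_obj.get("name"),
        "price": offers.get("price") or offers.get("lowPrice"),
        "currency": offers.get("priceCurrency"),
        "availability": offers.get("availability"),
    }
```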
Comparison table: extraction methods
| Method | Best for | Stability | Data richness |
|---|---|---|---|
| `/products/<handle>.json` | variants + prices | High | High |
| `/collections/<handle>/products.json` | crawling SKU list | High | Medium |
| JSON-LD | basic product + price | Medium | Medium |
| HTML selectors | last resort | Low | Varies |
Practical advice to avoid breakage
- Treat Shopify as an API-first target: use JSON endpoints.
- Add caching so you don’t refetch unchanged products constantly.
- Keep concurrency low per domain (Shopify stores can be sensitive).
- Rotate IPs when crawling many stores (ProxiesAPI).
- Store raw responses for a small sample so you can debug schema shifts.
Where ProxiesAPI fits (honestly)
ProxiesAPI helps most when you:
- crawl many stores (each with different rate limits)
- refresh prices daily across thousands of URLs
- run on a single server IP that gets flagged
It won’t stop a store from blocking .json endpoints entirely. But combined with:
- sane per-domain rate limits
- retries with jitter
- caching
…it improves success rate and makes failures replayable.
QA checklist
- `/products/<handle>.json` returns a product object
- variant list includes price + availability
- collection crawl paginates without duplicates
- JSON-LD fallback extracts at least name + price
- failures are logged with status + URL for replay
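Several of these checks can run offline against saved fixtures. Here is a small schema-check sketch that mirrors the keys `parse_product` emits in this guide — the required-key sets are assumptions based on this article’s parser, not a Shopify contract:

```python
# Offline schema check for parsed records, so CI can catch schema drift
# without hitting any store. Returns a list of problems; empty means OK.
REQUIRED_PRODUCT_KEYS = {"id", "handle", "title", "variants"}
REQUIRED_VARIANT_KEYS = {"id", "price", "available"}

def validate_record(rec: dict) -> list[str]:
    problems = [f"missing {k}" for k in REQUIRED_PRODUCT_KEYS if k not in rec]
    for i, v in enumerate(rec.get("variants", []) or []):
        problems += [
            f"variant[{i}] missing {k}"
            for k in REQUIRED_VARIANT_KEYS if k not in v
        ]
    return problems
```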