Shopify Product Scraping (2026): Prices, Variants, Inventory—Without Breaking When Themes Change

Shopify is everywhere. If you’re building a price tracker, a product research tool, or a competitive monitoring system, you’ll likely need to scrape Shopify storefronts.

The problem: Shopify themes change.

If you scrape HTML with brittle selectors like .product__title, you’ll spend your life fixing parsers.

This guide shows a Shopify-first strategy that stays stable in 2026:

  1. Prefer platform JSON endpoints (.json, /products/<handle>.json)
  2. Use structured data (JSON-LD) as a secondary source
  3. Use HTML as a last resort with minimal assumptions
  4. Handle variants + availability signals in a way that doesn’t depend on theme markup

Scale Shopify crawling safely with ProxiesAPI

Shopify storefronts are consistent at the platform layer (JSON endpoints), but rate limits and blocks still happen at scale. ProxiesAPI helps you keep your crawler stable with IP rotation + consistent routing across runs.


What Shopify gives you for free (stable endpoints)

Most Shopify stores expose useful endpoints that do not depend on the theme:

1) Product JSON

  • https://STORE_DOMAIN/products/HANDLE.json

This returns:

  • title
  • vendor
  • product type
  • images
  • variants (id, title, price, available, sku, etc.)

2) Collection products JSON

  • https://STORE_DOMAIN/collections/COLLECTION_HANDLE/products.json?limit=250&page=1

Great for crawling category pages in bulk.

3) Predictive search

Many stores also enable predictive search endpoints. Not universal, but helpful.

Because these are platform endpoints, they’re much less likely to break when the store redesigns.
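As a sketch of the predictive search option: Shopify’s Ajax suggest endpoint lives at /search/suggest.json with resources[...] query parameters, but not every store enables it, so treat this as a discovery aid rather than a guaranteed API (suggest_url is a hypothetical helper):

```python
from urllib.parse import urlencode


def suggest_url(store: str, query: str, limit: int = 5) -> str:
    # Shopify's predictive search endpoint; availability varies by store.
    params = urlencode({
        "q": query,
        "resources[type]": "product",
        "resources[limit]": limit,
    })
    return f"{store.rstrip('/')}/search/suggest.json?{params}"
```

Feed the handles it returns into /products/HANDLE.json for full variant data.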


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests tenacity beautifulsoup4 lxml

Step 1: Build a resilient fetch layer (with ProxiesAPI)

import os
import random

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

TIMEOUT = (10, 40)

UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

session = requests.Session()


def build_proxies():
    p = os.getenv("PROXIESAPI_PROXY_URL")
    return {"http": p, "https": p} if p else None


@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=1, max=20))
def get_json(url: str) -> dict:
    headers = {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "application/json,text/plain,*/*",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    }

    r = session.get(url, headers=headers, proxies=build_proxies(), timeout=TIMEOUT)
    if r.status_code in (403, 429, 503):
        raise RuntimeError(f"blocked/rate limited: {r.status_code}")
    r.raise_for_status()
    return r.json()

Step 2: Scrape a product via /products/<handle>.json

This is the most reliable starting point.

from urllib.parse import urlparse


def normalize_store(url: str) -> str:
    # Accept https://store.com or store.com
    if not url.startswith("http"):
        url = "https://" + url
    u = urlparse(url)
    return f"{u.scheme}://{u.netloc}"


def product_json_url(store: str, handle: str) -> str:
    store = normalize_store(store)
    return f"{store}/products/{handle}.json"


def parse_product(product: dict) -> dict:
    p = product.get("product", product)

    variants = []
    for v in p.get("variants", []) or []:
        variants.append({
            "id": v.get("id"),
            "title": v.get("title"),
            "sku": v.get("sku"),
            "price": v.get("price"),
            "compare_at_price": v.get("compare_at_price"),
            "available": v.get("available"),
            "inventory_quantity": v.get("inventory_quantity"),
        })

    return {
        "id": p.get("id"),
        "handle": p.get("handle"),
        "title": p.get("title"),
        "vendor": p.get("vendor"),
        "product_type": p.get("product_type"),
        "tags": p.get("tags"),
        "created_at": p.get("created_at"),
        "updated_at": p.get("updated_at"),
        "variants": variants,
    }


if __name__ == "__main__":
    store = "https://example-store.com"
    handle = "my-product"
    data = get_json(product_json_url(store, handle))
    print(parse_product(data))

Notes on inventory

inventory_quantity is not always present (many storefronts hide it). But you can still infer availability from:

  • available boolean
  • whether adding to cart is enabled

If you truly need inventory counts, that’s usually not available without authenticated flows (and you probably shouldn’t scrape it).
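The availability inference above can be sketched as a small helper, assuming the dict shape returned by parse_product earlier (summarize_availability is a hypothetical name, not a Shopify field):

```python
def summarize_availability(product: dict) -> dict:
    """Infer stock signals from variant data without relying on inventory_quantity."""
    variants = product.get("variants", []) or []
    available = [v for v in variants if v.get("available")]
    return {
        "any_available": bool(available),
        "all_available": bool(variants) and len(available) == len(variants),
        "available_variant_ids": [v.get("id") for v in available],
    }
```

For example, a product with one in-stock and one sold-out variant yields any_available=True and all_available=False.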


Step 3: Crawl a collection in bulk

Collections are a great way to build a SKU list without scraping theme HTML.


def collection_products_url(store: str, collection_handle: str, page: int = 1, limit: int = 250) -> str:
    store = normalize_store(store)
    return f"{store}/collections/{collection_handle}/products.json?limit={limit}&page={page}"


def crawl_collection(store: str, collection_handle: str, max_pages: int = 5, limit: int = 250) -> list[dict]:
    out = []
    for page in range(1, max_pages + 1):
        data = get_json(collection_products_url(store, collection_handle, page=page, limit=limit))
        products = data.get("products", [])
        if not products:
            break

        out.extend(products)
        print("page", page, "products", len(products), "total", len(out))

        # A short page usually means we've reached the last page
        if len(products) < limit:
            break

    return out
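To turn the crawled pages into a clean SKU list, an order-preserving dedupe by product id helps (unique_handles is a hypothetical helper, assuming the products.json item shape with id and handle fields):

```python
def unique_handles(products: list[dict]) -> list[str]:
    # Deduplicate by product id while preserving crawl order.
    seen = set()
    handles = []
    for p in products:
        pid = p.get("id")
        if pid in seen:
            continue
        seen.add(pid)
        handles.append(p.get("handle"))
    return handles
```

Products can appear in multiple collections, so dedupe before fetching /products/HANDLE.json for each one.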

Step 4: When .json endpoints are blocked (fallback strategy)

Some stores disable or rate limit JSON endpoints.

Fallback ladder:

  1. JSON-LD on the product page (application/ld+json)
  2. Shopify’s embedded state objects (varies)
  3. HTML selectors as a last resort

The goal is: avoid theme-specific selectors.

JSON-LD fallback

import json
import random

from bs4 import BeautifulSoup


def get_html(url: str) -> str:
    # reuse your requests Session; keep headers browser-like
    headers = {"User-Agent": random.choice(UA_POOL), "Accept": "text/html,*/*"}
    r = session.get(url, headers=headers, proxies=build_proxies(), timeout=TIMEOUT)
    if r.status_code in (403, 429, 503):
        raise RuntimeError(f"blocked: {r.status_code}")
    r.raise_for_status()
    return r.text


def parse_product_jsonld(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    for s in soup.select('script[type="application/ld+json"]'):
        raw = s.string
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except Exception:
            continue

        objs = data if isinstance(data, list) else [data]
        for obj in objs:
            if not isinstance(obj, dict):
                continue
            # @type may be a string or a list of strings
            t = obj.get("@type")
            types = t if isinstance(t, list) else [t]
            if "Product" in types:
                return obj

    return None

JSON-LD often includes:

  • name
  • description
  • offers.price
  • availability

Not always variants, but enough for a basic price tracker.


Comparison table: extraction methods

| Method | Best for | Stability | Data richness |
| --- | --- | --- | --- |
| /products/<handle>.json | variants + prices | High | High |
| /collections/<handle>/products.json | crawling SKU list | High | Medium |
| JSON-LD | basic product + price | Medium | Medium |
| HTML selectors | last resort | Low | Varies |

Practical advice to avoid breakage

  • Treat Shopify as an API-first target: use JSON endpoints.
  • Add caching so you don’t refetch unchanged products constantly.
  • Keep concurrency low per domain (Shopify stores can be sensitive).
  • Rotate IPs when crawling many stores (ProxiesAPI).
  • Store raw responses for a small sample so you can debug schema shifts.
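The per-domain pacing advice can be sketched as a tiny throttle you call before each request (DomainThrottle is a hypothetical helper, not part of any library):

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_hit: dict[str, float] = {}

    def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(domain, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_hit[domain] = time.monotonic()
```

Call throttle.wait(url) just before session.get(...); requests to different domains proceed without blocking each other.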

Where ProxiesAPI fits (honestly)

ProxiesAPI helps most when you:

  • crawl many stores (each with different rate limits)
  • refresh prices daily across thousands of URLs
  • run on a single server IP that gets flagged

It won’t stop a store from blocking .json endpoints entirely. But combined with:

  • sane per-domain rate limits
  • retries with jitter
  • caching

…it improves success rate and makes failures replayable.


QA checklist

  • /products/<handle>.json returns a product object
  • variant list includes price + availability
  • collection crawl paginates without duplicates
  • JSON-LD fallback extracts at least name + price
  • failures are logged with status + URL for replay

Related guides

  • How to Scrape Shopify Stores: Products, Prices, and Inventory (2026) — discover product JSON endpoints, paginate collections, extract variants + availability, and reduce blocks while staying ethical.
  • How to Scrape Walmart Grocery Prices with Python (Search + Product Pages) — search for items, follow product links, extract price/size/availability, and export clean JSON.
  • Scrape Product Data from Target.com (Title, Price, Availability) with Python + ProxiesAPI — extract Target product-page data into clean JSON/CSV with resilient parsing, retries/timeouts, and a ProxiesAPI-ready fetch layer.