Scrape Costco Product Prices with Python (Search + Pagination + Product Pages)

Costco is a great example of a “real-world” ecommerce target, because one crawl has to handle all of these:

  • search pages (you start from a query)
  • listing pages (multiple results)
  • pagination (you need to crawl page 1…N)
  • product detail pages (true source of price + SKU-ish identifiers)

In this guide we’ll build a repeatable Costco price dataset with Python:

  • crawl search results for a query (e.g. protein)
  • collect product URLs across pagination
  • visit each product page and extract name, price, availability (where available)
  • export to CSV/JSON
  • add a resilient network layer with timeouts, retries, and ProxiesAPI integration

Costco search results page (we’ll extract product cards + pagination)

Keep Costco crawls stable with ProxiesAPI

Ecommerce targets tend to rate-limit and intermittently block repeat traffic. ProxiesAPI helps you run scheduled price crawls with fewer failures and less babysitting.


Important notes (before you start)

  • Websites change often. The selectors below are based on Costco’s current markup and designed to be easy to update.
  • Costco may show different content by region and may require consent/login for some flows.
  • Be respectful: crawl slowly, cache results, and don’t hammer endpoints.

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for parsing
  • pandas for easy CSV export (optional)

Step 1: Build a robust fetcher (timeouts + retries)

You want a single place to control:

  • headers
  • timeouts
  • retry/backoff
  • proxy routing (where ProxiesAPI fits)

from __future__ import annotations

import random
import time
from dataclasses import dataclass

import requests

TIMEOUT = (10, 30)  # connect, read

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


@dataclass
class FetchConfig:
    use_proxiesapi: bool = True
    proxiesapi_endpoint: str | None = None
    max_retries: int = 4
    min_sleep: float = 0.8
    max_sleep: float = 1.8


class Fetcher:
    def __init__(self, cfg: FetchConfig):
        self.cfg = cfg
        self.s = requests.Session()
        self.s.headers.update(DEFAULT_HEADERS)

    def _sleep_jitter(self):
        time.sleep(random.uniform(self.cfg.min_sleep, self.cfg.max_sleep))

    def get(self, url: str) -> str:
        last_err = None

        for attempt in range(1, self.cfg.max_retries + 1):
            try:
                self._sleep_jitter()

                # Where ProxiesAPI fits:
                # - If you have a ProxiesAPI HTTP(S) proxy endpoint, route traffic through it.
                # - Keep this as a config toggle so you can test without proxies.
                proxies = None
                if self.cfg.use_proxiesapi and self.cfg.proxiesapi_endpoint:
                    proxies = {
                        "http": self.cfg.proxiesapi_endpoint,
                        "https": self.cfg.proxiesapi_endpoint,
                    }

                r = self.s.get(url, timeout=TIMEOUT, proxies=proxies)

                # A few sites return 403/429 intermittently. Treat as retryable.
                if r.status_code in (403, 429, 500, 502, 503, 504):
                    raise requests.HTTPError(
                        f"HTTP {r.status_code} for {url}", response=r
                    )

                r.raise_for_status()
                return r.text

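            except requests.RequestException as e:
                last_err = e
                status = getattr(e.response, "status_code", None)
                # Client errors such as 404 won't succeed on retry, so fail fast.
                if status is not None and status < 500 and status not in (403, 429):
                    raise
                if attempt < self.cfg.max_retries:
                    # Gentle exponential backoff: 1.2s, 1.44s, 1.73s, ...
                    time.sleep(1.2 ** attempt)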

        raise RuntimeError(f"Failed after retries: {url}") from last_err
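
The `1.2 ** attempt` backoff above grows slowly (roughly 1.2 to 2.1 seconds over four attempts). If a target keeps answering with 429s, a steeper schedule with a cap and random jitter is a common alternative. Here is a minimal sketch of that idea; `backoff_delay` and its parameters are hypothetical names, not part of the Fetcher above:

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng=random.random) -> float:
    """Capped exponential backoff with full jitter.

    attempt is 1-based. The ceiling doubles on every attempt
    (base * 2**attempt) until it reaches cap; the returned delay is a
    random fraction of that ceiling so parallel workers don't retry in sync.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * rng()


if __name__ == "__main__":
    for attempt in range(1, 6):
        print(attempt, round(backoff_delay(attempt), 2))
```

To try it, you could replace the `time.sleep(...)` call in the retry loop with `time.sleep(backoff_delay(attempt))`. The `rng` argument exists only so the schedule can be tested deterministically.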

Configure ProxiesAPI

Set your proxy endpoint as an env var (example name):

export PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.proxiesapi.com:PORT"

Then in Python:

import os

cfg = FetchConfig(
    use_proxiesapi=True,
    proxiesapi_endpoint=os.getenv("PROXIESAPI_PROXY_URL"),
)
fetcher = Fetcher(cfg)

If you don’t have the endpoint yet, you can run with use_proxiesapi=False and still validate selectors.
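
A typo in the env var is easier to catch at startup than halfway through a crawl. The sketch below validates the value before building the `proxies` dict that `requests` expects. `proxies_from_env` is a hypothetical helper, and the only assumption about the endpoint is that it is an `http://` or `https://` URL like the example above:

```python
from __future__ import annotations

import os
from urllib.parse import urlparse


def proxies_from_env(var: str = "PROXIESAPI_PROXY_URL", env=None) -> dict | None:
    """Return a requests-style proxies dict, or None if the variable is unset.

    Raises ValueError when the value is set but is not a usable proxy URL.
    """
    env = os.environ if env is None else env
    raw = (env.get(var) or "").strip()
    if not raw:
        return None  # no proxy configured: caller can run direct

    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"{var} does not look like a proxy URL")

    # requests routes both plain and TLS traffic through the same endpoint.
    return {"http": raw, "https": raw}


if __name__ == "__main__":
    print(proxies_from_env())
```

Returning `None` when the variable is missing mirrors the `use_proxiesapi=False` path: the crawl still runs, just without the proxy.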


Step 2: Costco URLs we’ll crawl

Costco search URLs typically look like:

  • Search: https://www.costco.com/CatalogSearch?dept=All&keyword=protein

Pagination/parameters can vary; the practical approach is:

  1. Start from a search URL
  2. Parse product card URLs from the HTML
  3. Find the “next page” link (if any) and repeat
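
Step 1 of that approach is easy to get subtly wrong when the query contains spaces or `&`. Here is a small sketch that builds the start URL with `urlencode`; `build_search_url` is a hypothetical helper name, and the `dept`/`keyword` parameters come from the example URL above:

```python
from urllib.parse import urlencode

BASE = "https://www.costco.com"


def build_search_url(keyword: str, dept: str = "All") -> str:
    """Build a CatalogSearch URL with the keyword safely encoded."""
    params = {"dept": dept, "keyword": keyword}
    return f"{BASE}/CatalogSearch?{urlencode(params)}"


print(build_search_url("whey protein"))
# → https://www.costco.com/CatalogSearch?dept=All&keyword=whey+protein
```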

Step 3: Parse search/listing pages (product cards)

We’ll extract:

  • product name
  • product URL
  • displayed price, when a card shows one (the parser below skips it and reads the price from the product page instead)

from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://www.costco.com"


def parse_search_page(html: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")

    items: list[dict] = []

    # Product tiles commonly contain an anchor to the PDP.
    # Use a broad selector, then normalize.
    for a in soup.select('a[href*=".product"]'):
        href = a.get("href")
        if not href:
            continue

        url = href if href.startswith("http") else urljoin(BASE, href)

        # Use the anchor's visible text as the title.
        # Image-only anchors have no text, so this can be None.
        title = a.get_text(" ", strip=True) or None

        items.append({
            "title": title,
            "url": url,
        })

    # Pagination: look for a "next" link (site markup changes; keep logic forgiving).
    next_url = None
    next_a = soup.select_one('a[aria-label="Next"], a[rel="next"], a.pagination-next')
    if next_a and next_a.get("href"):
        href = next_a.get("href")
        next_url = href if href.startswith("http") else urljoin(BASE, href)

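    # Deduplicate by URL. A tile often links to the same product twice
    # (image + text), so prefer the entry that actually has a title.
    dedup = {}
    for it in items:
        prev = dedup.get(it["url"])
        if prev is None or (not prev["title"] and it["title"]):
            dedup[it["url"]] = it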

    return list(dedup.values()), next_url

Sanity check the parser

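from urllib.parse import quote_plus

query = "protein"
# Encode the query so multi-word searches still produce a valid URL.
start = f"{BASE}/CatalogSearch?dept=All&keyword={quote_plus(query)}"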

html = fetcher.get(start)
items, next_url = parse_search_page(html)

print("items", len(items))
print("next", next_url)
print(items[:3])
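
If the item count looks inflated, the same product may be appearing under several URL variants. The sketch below canonicalizes URLs before deduplicating. It assumes query strings and fragments on product URLs are navigation noise and never select a different product, which is worth verifying on real URLs first. `canonical_product_url` is a hypothetical helper and the sample URLs are made up:

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_product_url(url: str) -> str:
    """Drop the query string and fragment, and lowercase the host.

    Assumption: everything after '?' or '#' is tracking/navigation noise.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


# Made-up URLs that only differ in query, fragment, or host casing.
urls = [
    "https://www.costco.com/example-item.product.100000001.html?langId=-1",
    "https://www.costco.com/example-item.product.100000001.html#reviews",
    "https://WWW.costco.com/example-item.product.100000001.html",
]
unique = {canonical_product_url(u) for u in urls}
print(len(unique), unique)
```

Applying this before the dedup step also gives you a cleaner key if you later store results by product URL.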

Step 4: Parse a Costco product page (PDP)

On the product page, you want:

  • a stable product identifier (often embedded in the URL or in structured data)
  • title
  • price
  • availability / stock messaging (when present)

A reliable strategy:

  1. Prefer structured data (application/ld+json) if available
  2. Fall back to visible DOM selectors

import json
import re


def extract_ld_json(soup: BeautifulSoup) -> list[dict]:
    out = []
    for s in soup.select('script[type="application/ld+json"]'):
        raw = s.get_text("\n", strip=True)
        if not raw:
            continue
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                out.append(data)
            elif isinstance(data, list):
                out.extend([d for d in data if isinstance(d, dict)])
        except Exception:
            continue
    return out


def parse_product_page(url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = None
    price = None
    currency = None
    availability = None

    # 1) Try JSON-LD
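    for block in extract_ld_json(soup):
        # Products sometimes live under @graph
        graph = block.get("@graph") if isinstance(block.get("@graph"), list) else None
        candidates = graph if graph else [block]
        for obj in candidates:
            if not isinstance(obj, dict):
                continue
            # @type may be a string or a list of strings.
            types = obj.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Product" not in types:
                continue
            title = title or obj.get("name")

            # offers may be a single Offer or a list of them; use the first.
            offers = obj.get("offers")
            if isinstance(offers, list):
                offers = offers[0] if offers else None
            if isinstance(offers, dict):
                price = price or offers.get("price")
                currency = currency or offers.get("priceCurrency")
                availability = availability or offers.get("availability")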

    # 2) Fall back to visible selectors
    if not title:
        h1 = soup.select_one("h1")
        title = h1.get_text(" ", strip=True) if h1 else None

    if not price:
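        # Last resort: take the first string on the page that looks like $12.34.
        # This can match an unrelated amount (e.g. a promo banner) and will miss
        # prices whose digits are split across separate elements, so treat it as a hint.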
        text = soup.get_text("\n", strip=True)
        m = re.search(r"\$(\d{1,4}(?:,\d{3})*(?:\.\d{2})?)", text)
        if m:
            price = m.group(1)
            currency = currency or "USD"

    return {
        "url": url,
        "title": title,
        "price": price,
        "currency": currency,
        "availability": availability,
    }
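
The parser returns `price` and `availability` exactly as found, so the same field can hold `"1,299.99"`, `24.99`, or a schema.org URL depending on which path matched. Here is a hedged sketch of a normalization pass you could run over each row before export. `normalize_price` and `normalize_availability` are hypothetical helpers, and the schema.org URL form is an assumption about what the JSON-LD contains:

```python
from __future__ import annotations

from decimal import Decimal, InvalidOperation


def normalize_price(value) -> Decimal | None:
    """Convert values like "1,299.99", "$24.99", or 24.99 to a Decimal."""
    if value is None:
        return None
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # not a number (e.g. "call for price")


def normalize_availability(value: str | None) -> str | None:
    """Reduce "https://schema.org/InStock" to "InStock"; pass short labels through."""
    if not value:
        return None
    return value.rstrip("/").rsplit("/", 1)[-1]


print(normalize_price("1,299.99"))
print(normalize_availability("https://schema.org/InStock"))
```

`Decimal` avoids float rounding surprises when you later compare prices across runs.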

Step 5: Crawl end-to-end (search → products)

Now we stitch it together:

  • crawl up to max_pages of search results
  • collect unique product URLs
  • fetch + parse each product page

from urllib.parse import urlencode


def crawl_costco_search(keyword: str, max_pages: int = 5) -> list[dict]:
    params = {"dept": "All", "keyword": keyword}
    url = f"{BASE}/CatalogSearch?{urlencode(params)}"

    products: dict[str, dict] = {}

    pages = 0
    while url and pages < max_pages:
        pages += 1
        html = fetcher.get(url)
        items, next_url = parse_search_page(html)

        for it in items:
            products[it["url"]] = it

        print(f"page {pages}: found {len(items)} items (total unique {len(products)})")
        url = next_url

    return list(products.values())


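def crawl_product_details(urls: list[str]) -> list[dict]:
    out = []
    for i, url in enumerate(urls, start=1):
        try:
            html = fetcher.get(url)
        except Exception as e:
            # One bad URL shouldn't throw away the rows already collected.
            print(f"{i}/{len(urls)} failed", url, e)
            continue
        data = parse_product_page(url, html)
        out.append(data)
        print(f"{i}/{len(urls)} parsed", data.get("title"), data.get("price"))
    return out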


items = crawl_costco_search("protein", max_pages=3)
urls = [it["url"] for it in items]
rows = crawl_product_details(urls[:25])  # start small

print("rows", len(rows))
print(rows[0] if rows else "no rows parsed")

Step 6: Export to CSV + JSON

import json
import pandas as pd

pd.DataFrame(rows).to_csv("costco_prices.csv", index=False)

with open("costco_prices.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)

print("wrote costco_prices.csv + costco_prices.json")
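
Since pandas is optional here, the CSV export can also be done with the standard library. This is a minimal sketch using `csv.DictWriter`; `write_rows_csv` and `FIELDS` are hypothetical names, with the columns taken from the dict that `parse_product_page` returns:

```python
import csv
import io

FIELDS = ["url", "title", "price", "currency", "availability"]


def write_rows_csv(rows: list[dict], fileobj) -> None:
    """Write rows with a fixed column order; missing or None values become empty cells."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow({k: row.get(k, "") for k in FIELDS})


# Made-up sample row; in the crawl you would pass `rows` instead.
sample = [{"url": "https://www.costco.com/example.product.1.html",
           "title": "Example", "price": "24.99"}]
buf = io.StringIO()
write_rows_csv(sample, buf)
print(buf.getvalue())
```

For a real file, open it with `open("costco_prices.csv", "w", newline="", encoding="utf-8")` so the `csv` module controls line endings.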

Practical production upgrades

If you’re turning this into a tracker (daily/weekly price checks):

  • Store results in SQLite/Postgres keyed by product URL
  • Cache HTML for debugging failed parses
  • Add concurrency cautiously (start with 2–4 threads)
  • Add alerting when a price changes beyond a threshold
  • Keep a block/failure rate dashboard (403/429/timeout counts)
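
The first and fourth upgrades fit together naturally: store each observation by product URL, then compare against the previous one. Below is a rough sketch with SQLite. The table layout, `open_db`, `record_price`, and the 5% default threshold are all assumptions for illustration:

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path: str = "costco_prices.db") -> sqlite3.Connection:
    """Open (or create) the price history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prices (
               url TEXT NOT NULL,
               checked_at TEXT NOT NULL,
               price REAL
           )"""
    )
    return conn


def record_price(conn: sqlite3.Connection, url: str, price: float,
                 threshold: float = 0.05) -> bool:
    """Store one observation and report whether the price moved noticeably.

    Returns True when the relative change versus the previous observation
    for this URL exceeds `threshold` (a fraction: 0.05 means 5%).
    """
    prev = conn.execute(
        "SELECT price FROM prices WHERE url = ? ORDER BY rowid DESC LIMIT 1",
        (url,),
    ).fetchone()
    conn.execute(
        "INSERT INTO prices (url, checked_at, price) VALUES (?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), price),
    )
    conn.commit()
    if prev is None or not prev[0]:
        return False  # first sighting, or no usable previous price
    return abs(price - prev[0]) / prev[0] > threshold


if __name__ == "__main__":
    db = open_db(":memory:")
    url = "https://www.costco.com/example.product.1.html"  # made-up URL
    print(record_price(db, url, 20.0), record_price(db, url, 25.0))
```

Keeping every observation (rather than overwriting one row per URL) also gives you a price history for free.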

QA checklist

  • Search parser extracts mostly product URLs (spot-check 10)
  • Pagination finds next page or stops cleanly
  • Product parser returns non-empty title for most URLs
  • Price extraction succeeds for a meaningful subset
  • Exports are valid CSV/JSON
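
The "most URLs" and "meaningful subset" checks are easier to track as numbers. This sketch computes the share of rows with a non-empty value per field; `fill_rates` is a hypothetical helper and the sample rows are made up:

```python
def fill_rates(rows: list[dict], fields=("title", "price", "availability")) -> dict:
    """Return, per field, the fraction of rows that have a non-empty value."""
    total = len(rows)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {f: sum(1 for r in rows if r.get(f)) / total for f in fields}


# Made-up rows shaped like the output of parse_product_page.
sample_rows = [
    {"title": "A", "price": "10.00", "availability": "InStock"},
    {"title": "B", "price": "12.50", "availability": None},
    {"title": "C", "price": None, "availability": None},
    {"title": "D", "price": "8.99", "availability": None},
]
print(fill_rates(sample_rows))
# → {'title': 1.0, 'price': 0.75, 'availability': 0.25}
```

A sudden drop in any of these between runs is usually the first sign that the markup changed or that you are being served a block page.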

Where ProxiesAPI helps (honestly)

Ecommerce sites are where scraping reliability becomes a job:

  • IP-based rate limits
  • intermittent 403/429
  • different content per region

ProxiesAPI doesn’t “magically bypass everything,” but it does give you a stable proxy layer you can turn on when your crawl starts failing.

If you keep your network layer isolated (like Fetcher above), you can swap proxy settings without rewriting your parser.
