Scrape Costco Product Prices with Python (Search + Pagination + SKU Variants)

Costco is one of those sites where the idea is simple (search → product cards → product detail) but the reality is messy:

  • prices can be member-only
  • the same product may exist in multiple pack sizes / variants
  • pages can be personalized by location and inventory
  • anti-bot measures can appear when you crawl too aggressively

In this guide we’ll build a practical Costco scraper in Python that:

  • searches Costco for a query
  • paginates through results
  • extracts product name, price, unit size, and availability when present
  • follows product detail pages to normalize variants
  • exports a clean CSV

We’ll do it with requests + BeautifulSoup, and we’ll show where ProxiesAPI fits in (for reliability and scale).

Costco homepage (target site for this tutorial)

Keep retail crawls stable with ProxiesAPI

Retail sites change, block, and rate-limit fast. ProxiesAPI gives you a reliable network layer so your Costco crawl keeps working as you scale URL count and frequency.


What we’re scraping (Costco site structure)

Costco has multiple surfaces (warehouse, same-day, online). This post targets the Costco online catalog search and product pages.

Typical patterns you’ll run into:

  • Search results URLs that include query parameters and paging
  • Product pages containing a product title, item number / SKU, and pricing blocks

Because Costco’s markup changes and can vary by geography/account, the core approach is:

  1. Fetch HTML (with realistic headers + timeouts)
  2. Parse defensively (multiple selectors, fallbacks)
  3. Keep a raw sample for debugging (save HTML for one URL when things break)
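Point 3 can be a tiny helper that dumps the fetched HTML to disk so you can diagnose broken selectors offline (the directory and filename scheme here are illustrative choices):

```python
import hashlib
from pathlib import Path


def save_debug_html(url: str, html: str, out_dir: str = "debug_html") -> str:
    """Save raw HTML for one URL so broken selectors can be diagnosed offline."""
    Path(out_dir).mkdir(exist_ok=True)
    # Short, stable filename derived from the URL
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12] + ".html"
    path = Path(out_dir) / name
    path.write_text(html, encoding="utf-8")
    return str(path)
```

Call it from your exception handler whenever parsing returns nothing, then open the saved file in a browser to see what the server actually sent you.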

Ground rules (don’t get blocked instantly)

Before code:

  • Use a real User-Agent and set Accept-Language
  • Add delays between requests
  • Don’t hammer pagination (crawl only what you need)
  • Build in retries for transient errors (429/5xx)

If you’re doing this at any scale (hundreds/thousands of URLs), route requests through ProxiesAPI.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll also lean on the standard library:

  • csv
  • dataclasses
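Since we install nothing extra for it, here is a quick sketch of the row shape as a dataclass (field names match the CSV columns we export later):

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ProductRow:
    url: str
    title: Optional[str] = None
    sku: Optional[str] = None
    price: Optional[str] = None       # kept as a string, e.g. "$12.99"
    unit: Optional[str] = None        # e.g. "24ct"
    availability: Optional[str] = None


row = ProductRow(url="https://www.costco.com/example.product.html", price="$12.99")
print(asdict(row)["price"])  # → $12.99
```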

Step 1: A robust fetch() (headers, timeouts, retries)

Here’s a production-friendly HTTP wrapper.

Important notes:

  • We use a requests.Session() for connection reuse
  • We use connect/read timeouts so the crawler doesn’t hang
  • We retry on 429/5xx with backoff

import time
import random
from typing import Optional

import requests

TIMEOUT = (10, 30)

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()


def fetch(url: str, *, proxy_url: Optional[str] = None, max_retries: int = 4) -> str:
    proxies = None
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}

    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT, proxies=proxies)

            # Transient trouble: rate-limited (429) or server errors (5xx).
            # A 403 (blocked/flagged) raises below and is retried the same way.
            if r.status_code in (429, 500, 502, 503, 504):
                sleep_s = min(20, (2 ** attempt) + random.random())
                time.sleep(sleep_s)
                continue

            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_err = e
            time.sleep(min(20, (2 ** attempt) + random.random()))

    raise RuntimeError(f"fetch failed after {max_retries} attempts: {last_err}")

Where ProxiesAPI fits

If ProxiesAPI provides you a single outbound proxy endpoint, you can pass it as proxy_url.

You can also extend this wrapper to:

  • rotate proxy sessions
  • attach an API key in a proxy URL
  • capture block pages for debugging

(Exact integration details depend on your ProxiesAPI account settings and endpoint format.)


Step 2: Build a Costco search URL

Costco search URLs may change. The safest way to generate a search URL is:

  1. open Costco in your browser
  2. search for a product (e.g. “protein bar”)
  3. copy the results URL

Then you can parameterize the query.
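Parameterizing the query can be as small as URL-encoding the keyword into the path you copied. The CatalogSearch path below matches the illustrative example used later in this post; verify it against what your browser shows:

```python
from urllib.parse import quote


def costco_search_url(keyword: str) -> str:
    # Path/params copied from a browser search; adjust if Costco's URL differs.
    return f"https://www.costco.com/CatalogSearch?dept=All&keyword={quote(keyword)}"


print(costco_search_url("protein bar"))
# → https://www.costco.com/CatalogSearch?dept=All&keyword=protein%20bar
```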

For a lot of Costco-like retail sites, paging is either:

  • ?page=2
  • &page=2
  • or cursor-based

We’ll implement paging generically: we’ll parse the “next page” link when present, and stop when there isn’t one. If the site only exposes page=N parameters, that’s an easy fallback to bolt on.


Step 3: Parse search results (product cards)

On a typical retail results page, each product card gives you:

  • product title
  • product URL
  • maybe price (sometimes only on product page)

We’ll parse defensively by:

  • selecting anchors that look like product links
  • de-duplicating URLs

from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.costco.com"


def parse_search_results(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []
    seen = set()

    # Costco markup can change; this is a conservative approach:
    # - find anchors that look like product links
    for a in soup.select("a[href]"):
        href = a.get("href") or ""

        # Typical product pages include ".product." patterns, but this may vary.
        if "/product/" not in href and ".product" not in href:
            continue

        url = href if href.startswith("http") else urljoin(BASE, href)
        if url in seen:
            continue
        seen.add(url)

        title = a.get_text(" ", strip=True) or None

        out.append({
            "title_hint": title,
            "url": url,
        })

    return out

This isn’t “perfect”, but it’s resilient: when Costco changes CSS classes, anchors still exist.

In production, you’ll want to tighten selectors after inspecting the HTML you get.


Step 4: Parse a Costco product page (price + pack size + availability)

The product page is where we try to extract:

  • product name
  • item number / SKU (if present)
  • price
  • unit size / pack size (often in title or bullets)
  • availability text

Because exact selectors vary, we implement multiple fallbacks.

import re
from bs4 import BeautifulSoup


def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())


def parse_price(text: str) -> str | None:
    # Keep as string to avoid currency localization headaches.
    # Allows thousands separators, e.g. "$1,299.99".
    m = re.search(r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?", text or "")
    return m.group(0).replace(" ", "") if m else None


def parse_product_page(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Title
    title = None
    h1 = soup.select_one("h1")
    if h1:
        title = clean_text(h1.get_text(" ", strip=True))

    # SKU / item number heuristics ("Item # 123456" or "Item 123456");
    # require 5+ digits so incidental "Item 2 of 10" text doesn't match
    sku = None
    body_text = soup.get_text("\n", strip=True)
    m = re.search(r"Item\s*#?\s*(\d{5,})", body_text)
    if m:
        sku = m.group(1)

    # Price heuristics: scan likely price containers, fall back to whole page
    price = None
    for sel in [
        "span.price",
        "div.price",
        "span#price",
        "div#price",
        "*[data-testid*='price']",
    ]:
        el = soup.select_one(sel)
        if el:
            price = parse_price(el.get_text(" ", strip=True))
            if price:
                break

    if not price:
        price = parse_price(body_text)

    # Availability heuristics
    availability = None
    for phrase in ["Out of stock", "In stock", "Currently unavailable", "Available"]:
        if phrase.lower() in body_text.lower():
            availability = phrase
            break

    # Pack / unit size: often in title; also in bullets
    unit = None
    if title:
        m2 = re.search(r"(\d+[\s-]?(?:ct|count|oz|lb|lbs|g|kg|pack))\b", title.lower())
        if m2:
            unit = m2.group(1)

    return {
        "url": url,
        "title": title,
        "sku": sku,
        "price": price,
        "unit": unit,
        "availability": availability,
    }

This parsing style is what keeps your scrapers alive:

  • simple selectors first
  • then fallback to text heuristics
  • keep fields nullable

Step 5: Putting it together (search → paginate → detail pages)

Now we’ll:

  1. fetch a search results page
  2. extract product URLs
  3. fetch product detail pages
  4. write to CSV

Pagination is highly site-specific. We’ll implement two strategies:

  • Try to find a “next” link (rel=next, anchor text, etc.)
  • If none found, stop after the first page (safe default)

import csv


def find_next_page_url(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # 1) rel=next
    link = soup.select_one("link[rel='next'][href]")
    if link:
        href = link.get("href")
        return href if href.startswith("http") else urljoin(current_url, href)

    # 2) anchor with "Next" text
    for a in soup.select("a[href]"):
        if a.get_text(" ", strip=True).lower() in ("next", "next page", ">"):
            href = a.get("href")
            return href if href.startswith("http") else urljoin(current_url, href)

    return None


def crawl_costco_search(search_url: str, *, pages: int = 3, proxy_url: str | None = None) -> list[dict]:
    products = []
    seen_urls = set()

    url = search_url
    for page in range(1, pages + 1):
        html = fetch(url, proxy_url=proxy_url)
        cards = parse_search_results(html)

        print(f"page {page}: found {len(cards)} product links")

        for c in cards:
            if c["url"] in seen_urls:
                continue
            seen_urls.add(c["url"])

            # gentle pacing
            time.sleep(1.0 + random.random())

            detail_html = fetch(c["url"], proxy_url=proxy_url)
            item = parse_product_page(detail_html, c["url"])

            # Use title hint if product page title is missing
            if not item.get("title") and c.get("title_hint"):
                item["title"] = c["title_hint"]

            products.append(item)

        next_url = find_next_page_url(html, url)
        if not next_url:
            break
        url = next_url

        time.sleep(2.0 + random.random())

    return products


def write_csv(items: list[dict], path: str = "costco_products.csv"):
    fields = ["title", "sku", "price", "unit", "availability", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for it in items:
            w.writerow({k: it.get(k) for k in fields})


if __name__ == "__main__":
    # Replace with a URL copied from your browser after searching Costco.
    # Example (illustrative):
    # search_url = "https://www.costco.com/CatalogSearch?dept=All&keyword=protein%20bar"
    search_url = "https://www.costco.com/CatalogSearch?dept=All&keyword=coffee"

    # If you have a ProxiesAPI proxy endpoint, set it here.
    # proxy_url = "http://USERNAME:PASSWORD@proxy.proxiesapi.com:PORT"
    proxy_url = None

    items = crawl_costco_search(search_url, pages=2, proxy_url=proxy_url)
    print("items:", len(items))

    write_csv(items)
    print("wrote costco_products.csv")

Handling SKU variants (a practical data model)

In retail scraping, “variants” show up as:

  • same product title, different pack sizes (12ct vs 24ct)
  • same item with different flavors
  • same item with different shipping options / location availability

A simple model that works well:

  • product_group_id: a normalized key (e.g. normalized title)
  • variant_id: the SKU / item number when available, otherwise a hash of the URL
  • price: keep as string + currency
  • observed_at: timestamp for history

If you store into SQLite/Postgres, you can track price over time.


QA checklist

  • Spot-check 5 product pages manually vs scraped fields
  • Save one raw HTML response when parsing fails (so you can update selectors)
  • Use delays and retries
  • If you scale, use ProxiesAPI to stabilize request success rate

Next upgrades

  • Add structured logging (URL, status code, retry count)
  • Store results in SQLite (so re-runs update, not duplicate)
  • Implement a “changed price” alert workflow
  • Add location-specific parameters if Costco’s experience differs by region
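The SQLite upgrade can be sketched as an upsert keyed on URL, so re-runs update rows instead of duplicating them (table name and columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        url TEXT PRIMARY KEY,
        title TEXT, sku TEXT, price TEXT, unit TEXT, availability TEXT,
        observed_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")


def upsert(item: dict) -> None:
    # ON CONFLICT makes re-runs update existing rows instead of duplicating
    conn.execute(
        """INSERT INTO products (url, title, sku, price, unit, availability)
           VALUES (:url, :title, :sku, :price, :unit, :availability)
           ON CONFLICT(url) DO UPDATE SET
             title=excluded.title, sku=excluded.sku, price=excluded.price,
             unit=excluded.unit, availability=excluded.availability""",
        item,
    )
```

For true price history (rather than current state), append one row per observation instead of upserting, keyed on (url, observed_at).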
