Scrape Costco Product Prices with Python (Search + Pagination + SKU Variants)
Costco is one of those sites where the idea is simple (search → product cards → product detail) but the reality is messy:
- prices can be member-only
- the same product may exist in multiple pack sizes / variants
- pages can be personalized by location and inventory
- anti-bot measures can appear when you crawl too aggressively
In this guide we’ll build a practical Costco scraper in Python that:
- searches Costco for a query
- paginates through results
- extracts product name, price, unit size, and availability when present
- follows product detail pages to normalize variants
- exports a clean CSV
We’ll do it with requests + BeautifulSoup, and we’ll show where ProxiesAPI fits in (for reliability and scale).

Retail sites change, block, and rate-limit fast. ProxiesAPI gives you a reliable network layer so your Costco crawl keeps working as you scale URL count and frequency.
What we’re scraping (Costco site structure)
Costco has multiple surfaces (warehouse, same-day, online). This post targets the Costco online catalog search and product pages.
Typical patterns you’ll run into:
- Search results URLs that include query parameters and paging
- Product pages containing a product title, item number / SKU, and pricing blocks
Because Costco’s markup changes and can vary by geography/account, the core approach is:
- Fetch HTML (with realistic headers + timeouts)
- Parse defensively (multiple selectors, fallbacks)
- Keep a raw sample for debugging (save HTML for one URL when things break)
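For that debugging step, a minimal helper (names are illustrative) that dumps one raw response to disk so you can fix selectors offline:

```python
from pathlib import Path
from datetime import datetime, timezone

def save_debug_html(html: str, tag: str = "costco") -> Path:
    """Write one raw HTML response to a timestamped file for offline inspection."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(f"debug_{tag}_{ts}.html")
    path.write_text(html, encoding="utf-8")
    return path
```

Call it once inside your parser's failure branch, not on every request, so you keep exactly one sample per breakage.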
Ground rules (don’t get blocked instantly)
Before code:
- Use a real User-Agent and set Accept-Language
- Add delays between requests
- Don’t hammer pagination (crawl only what you need)
- Build in retries for transient errors (429/5xx)
If you’re doing this at any scale (hundreds/thousands of URLs), route requests through ProxiesAPI.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
We’ll also use the standard-library modules csv and dataclasses.
Step 1: A robust fetch() (headers, timeouts, retries)
Here’s a production-friendly HTTP wrapper.
Important notes:
- We use a requests.Session() for connection reuse
- We use connect/read timeouts so the crawler doesn’t hang
- We retry on 429/5xx with backoff
```python
import time
import random
from typing import Optional

import requests

TIMEOUT = (10, 30)  # (connect, read) seconds

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()

def fetch(url: str, *, proxy_url: Optional[str] = None, max_retries: int = 4) -> str:
    proxies = None
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT, proxies=proxies)
            # Retry transient rate-limit / server errors with capped exponential backoff
            if r.status_code in (429, 500, 502, 503, 504):
                time.sleep(min(20, (2 ** attempt) + random.random()))
                continue
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_err = e
            time.sleep(min(20, (2 ** attempt) + random.random()))
    raise RuntimeError(f"fetch failed after {max_retries} retries: {last_err}")
```
Where ProxiesAPI fits
If ProxiesAPI provides you a single outbound proxy endpoint, you can pass it as proxy_url.
You can also extend this wrapper to:
- rotate proxy sessions
- attach an API key in a proxy URL
- capture block pages for debugging
(Exact integration details depend on your ProxiesAPI account settings and endpoint format.)
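One way to sketch rotation, assuming you have a list of proxy endpoint URLs to cycle through (the endpoints themselves are placeholders, not real ProxiesAPI values):

```python
import itertools
from typing import Iterator, Optional

def proxy_rotator(proxy_urls: list[str]) -> Iterator[Optional[str]]:
    """Cycle through proxy endpoints forever; yield None forever if the list is empty."""
    if not proxy_urls:
        while True:
            yield None
    yield from itertools.cycle(proxy_urls)
```

Usage: create one rotator per crawl (`rot = proxy_rotator([...])`) and pass `next(rot)` as `proxy_url` to each `fetch()` call, so consecutive requests leave from different endpoints.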
Step 2: Build a Costco search URL
Costco search URLs may change. The safest way to generate a search URL is:
- open Costco in your browser
- search for a product (e.g. “protein bar”)
- copy the results URL
Then you can parameterize the query.
For a lot of Costco-like retail sites, paging is either:
- query-parameter based (e.g. ?page=2)
- cursor-based
We’ll implement paging in a generic way: parse “next page” links when present, and fall back to page=N if the site uses it.
Step 3: Parse search results (product cards)
On a typical retail results page, each product card gives you:
- product title
- product URL
- maybe price (sometimes only on product page)
We’ll parse defensively by:
- selecting anchors that look like product links
- de-duplicating URLs
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.costco.com"

def parse_search_results(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    seen = set()
    # Costco markup can change; this is a conservative approach:
    # find anchors that look like product links.
    for a in soup.select("a[href]"):
        href = a.get("href") or ""
        # Typical product pages include ".product" patterns, but this may vary.
        if "/product/" not in href and ".product" not in href:
            continue
        url = href if href.startswith("http") else urljoin(BASE, href)
        if url in seen:
            continue
        seen.add(url)
        title = a.get_text(" ", strip=True) or None
        out.append({
            "title_hint": title,
            "url": url,
        })
    return out
```
This isn’t “perfect”, but it’s resilient: when Costco changes CSS classes, anchors still exist.
In production, you’ll want to tighten selectors after inspecting the HTML you get.
Step 4: Parse a Costco product page (price + pack size + availability)
The product page is where we try to extract:
- product name
- item number / SKU (if present)
- price
- unit size / pack size (often in title or bullets)
- availability text
Because exact selectors vary, we implement multiple fallbacks.
```python
import re
from bs4 import BeautifulSoup

def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())

def parse_price(text: str) -> str | None:
    # Keep as string to avoid currency localization headaches
    m = re.search(r"\$\s?\d+(?:\.\d{2})?", text or "")
    return m.group(0).replace(" ", "") if m else None

def parse_product_page(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Title
    title = None
    h1 = soup.select_one("h1")
    if h1:
        title = clean_text(h1.get_text(" ", strip=True))

    # SKU / item number heuristics
    sku = None
    body_text = soup.get_text("\n", strip=True)
    m = re.search(r"Item\s*#\s*(\d+)", body_text)
    if m:
        sku = m.group(1)

    # Price heuristics: scan likely price containers, fall back to whole page
    price = None
    for sel in [
        "span.price",
        "div.price",
        "span#price",
        "div#price",
        "*[data-testid*='price']",
    ]:
        el = soup.select_one(sel)
        if el:
            price = parse_price(el.get_text(" ", strip=True))
            if price:
                break
    if not price:
        price = parse_price(body_text)

    # Availability heuristics
    availability = None
    for phrase in ["Out of stock", "In stock", "Currently unavailable", "Available"]:
        if phrase.lower() in body_text.lower():
            availability = phrase
            break

    # Pack / unit size: often in the title; also in bullets
    unit = None
    if title:
        m2 = re.search(r"(\d+\s?(?:ct|count|oz|lb|lbs|g|kg|pack))\b", title.lower())
        if m2:
            unit = m2.group(1)

    return {
        "url": url,
        "title": title,
        "sku": sku,
        "price": price,
        "unit": unit,
        "availability": availability,
    }
```
This parsing style is what keeps your scrapers alive:
- simple selectors first
- then fallback to text heuristics
- keep fields nullable
Step 5: Putting it together (search → paginate → detail pages)
Now we’ll:
- fetch a search results page
- extract product URLs
- fetch product detail pages
- write to CSV
Pagination is highly site-specific. We’ll implement two strategies:
- Try to find a “next” link (rel=next, anchor text, etc.)
- If none is found, stop after the first page (a safe default)
```python
import csv
from urllib.parse import urljoin

def find_next_page_url(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    # 1) <link rel="next">
    link = soup.select_one("link[rel='next'][href]")
    if link:
        href = link.get("href")
        return href if href.startswith("http") else urljoin(current_url, href)
    # 2) anchor with "Next" text
    for a in soup.select("a[href]"):
        if a.get_text(" ", strip=True).lower() in ("next", "next page", ">"):
            href = a.get("href")
            return href if href.startswith("http") else urljoin(current_url, href)
    return None

def crawl_costco_search(search_url: str, *, pages: int = 3, proxy_url: str | None = None) -> list[dict]:
    products = []
    seen_urls = set()
    url = search_url
    for page in range(1, pages + 1):
        html = fetch(url, proxy_url=proxy_url)
        cards = parse_search_results(html)
        print(f"page {page}: found {len(cards)} product links")
        for c in cards:
            if c["url"] in seen_urls:
                continue
            seen_urls.add(c["url"])
            # gentle pacing between detail requests
            time.sleep(1.0 + random.random())
            detail_html = fetch(c["url"], proxy_url=proxy_url)
            item = parse_product_page(detail_html, c["url"])
            # Use the card's title hint if the product page title is missing
            if not item.get("title") and c.get("title_hint"):
                item["title"] = c["title_hint"]
            products.append(item)
        next_url = find_next_page_url(html, url)
        if not next_url:
            break
        url = next_url
        time.sleep(2.0 + random.random())
    return products

def write_csv(items: list[dict], path: str = "costco_products.csv"):
    fields = ["title", "sku", "price", "unit", "availability", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for it in items:
            w.writerow({k: it.get(k) for k in fields})

if __name__ == "__main__":
    # Replace with a URL copied from your browser after searching Costco.
    # Example (illustrative):
    # search_url = "https://www.costco.com/CatalogSearch?dept=All&keyword=protein%20bar"
    search_url = "https://www.costco.com/CatalogSearch?dept=All&keyword=coffee"

    # If you have a ProxiesAPI proxy endpoint, set it here.
    # proxy_url = "http://USERNAME:PASSWORD@proxy.proxiesapi.com:PORT"
    proxy_url = None

    items = crawl_costco_search(search_url, pages=2, proxy_url=proxy_url)
    print("items:", len(items))
    write_csv(items)
    print("wrote costco_products.csv")
```
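If the site turns out to use page=N paging rather than next links, a small helper can rewrite the query string in place (the `currentPage` parameter name is an assumption — check a real paginated URL before relying on it):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def with_page_param(url: str, page: int, param: str = "currentPage") -> str:
    """Return url with its paging query parameter set to `page`.
    The parameter name is an assumption; inspect a real paginated URL."""
    parts = urlparse(url)
    qs = parse_qs(parts.query)
    qs[param] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(qs, doseq=True)))
```

You would call this in `crawl_costco_search` when `find_next_page_url` returns None but you know the site paginates.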
Handling SKU variants (a practical data model)
In retail scraping, “variants” show up as:
- same product title, different pack sizes (12ct vs 24ct)
- same item with different flavors
- same item with different shipping options / location availability
A simple model that works well:
- product_group_id: a normalized key (e.g. the normalized title)
- variant_id: the SKU or item number when available, otherwise a hash of the URL
- price: kept as a string, together with a currency field
- observed_at: a timestamp, so you can build price history
If you store into SQLite/Postgres, you can track price over time.
QA checklist
- Spot-check 5 product pages manually vs scraped fields
- Save one raw HTML response when parsing fails (so you can update selectors)
- Use delays and retries
- If you scale, use ProxiesAPI to stabilize request success rate
Next upgrades
- Add structured logging (URL, status code, retry count)
- Store results in SQLite (so re-runs update, not duplicate)
- Implement a “changed price” alert workflow
- Add location-specific parameters if Costco’s experience differs by region
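For the SQLite upgrade, an upsert keyed on URL is a simple way to make re-runs update rather than duplicate (table and column names here are illustrative):

```python
import sqlite3

def upsert_items(items: list[dict], db_path: str = "costco.db") -> None:
    """Insert scraped rows, updating existing rows that share the same URL."""
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS products (
            url TEXT PRIMARY KEY,
            title TEXT, sku TEXT, price TEXT, unit TEXT, availability TEXT
        )""")
    con.executemany("""
        INSERT INTO products (url, title, sku, price, unit, availability)
        VALUES (:url, :title, :sku, :price, :unit, :availability)
        ON CONFLICT(url) DO UPDATE SET
            title=excluded.title, sku=excluded.sku, price=excluded.price,
            unit=excluded.unit, availability=excluded.availability
    """, items)
    con.commit()
    con.close()
```

Note that `ON CONFLICT ... DO UPDATE` requires SQLite 3.24+, which ships with all recent Python builds.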