Scrape Costco Product Prices with Python (Search + Pagination + Product Pages)
Costco is a great example of a “real-world” ecommerce target:
- search pages (you start from a query)
- listing pages (multiple results)
- pagination (you need to crawl page 1…N)
- product detail pages (true source of price + SKU-ish identifiers)
In this guide we’ll build a repeatable Costco price dataset with Python:
- crawl search results for a query (e.g. protein)
- collect product URLs across pagination
- visit each product page and extract name, price, availability (where available)
- export to CSV/JSON
- add a resilient network layer with timeouts, retries, and ProxiesAPI integration

Ecommerce targets tend to rate-limit and intermittently block repeat traffic. ProxiesAPI helps you run scheduled price crawls with fewer failures and less babysitting.
Important notes (before you start)
- Websites change often. The selectors below are based on Costco’s current markup and designed to be easy to update.
- Costco may show different content by region and may require consent/login for some flows.
- Be respectful: crawl slowly, cache results, and don’t hammer endpoints.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas
We’ll use:
- requests for HTTP
- BeautifulSoup (with the lxml parser) for parsing
- pandas for easy CSV export (optional)
Step 1: Build a robust fetcher (timeouts + retries)
You want a single place to control:
- headers
- timeouts
- retry/backoff
- proxy routing (where ProxiesAPI fits)
from __future__ import annotations

import random
import time
from dataclasses import dataclass

import requests

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# A few sites return 403/429 intermittently. Treat these (and 5xx) as retryable.
RETRYABLE_STATUS = (403, 429, 500, 502, 503, 504)


@dataclass
class FetchConfig:
    use_proxiesapi: bool = True
    proxiesapi_endpoint: str | None = None
    max_retries: int = 4
    min_sleep: float = 0.8
    max_sleep: float = 1.8


class Fetcher:
    def __init__(self, cfg: FetchConfig):
        self.cfg = cfg
        self.s = requests.Session()
        self.s.headers.update(DEFAULT_HEADERS)

    def _sleep_jitter(self):
        # Random pause before every request so traffic is not perfectly regular.
        time.sleep(random.uniform(self.cfg.min_sleep, self.cfg.max_sleep))

    def get(self, url: str) -> str:
        # Where ProxiesAPI fits:
        # - If you have a ProxiesAPI HTTP(S) proxy endpoint, route traffic through it.
        # - Keep this as a config toggle so you can test without proxies.
        proxies = None
        if self.cfg.use_proxiesapi and self.cfg.proxiesapi_endpoint:
            proxies = {
                "http": self.cfg.proxiesapi_endpoint,
                "https": self.cfg.proxiesapi_endpoint,
            }

        last_err = None
        for attempt in range(1, self.cfg.max_retries + 1):
            try:
                self._sleep_jitter()
                r = self.s.get(url, timeout=TIMEOUT, proxies=proxies)
                if r.status_code in RETRYABLE_STATUS:
                    raise requests.HTTPError(
                        f"HTTP {r.status_code} for {url}", response=r
                    )
                r.raise_for_status()
                return r.text
            except requests.RequestException as e:
                last_err = e
                status = e.response.status_code if e.response is not None else None
                # Other HTTP errors (e.g. 404) will not improve on retry.
                if status is not None and status not in RETRYABLE_STATUS:
                    break
                # Back off before the next attempt; no need to wait after the last one.
                if attempt < self.cfg.max_retries:
                    time.sleep(1.2 ** attempt)
        raise RuntimeError(f"Failed to fetch: {url}") from last_err
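The retry loop above waits 1.2 ** attempt seconds between attempts, which grows slowly (roughly 1.2s, 1.4s, 1.7s, 2.1s). If blocks persist, a steeper curve with a cap and some jitter is a common alternative. The sketch below only computes delays so you can compare schedules; backoff_delay and its parameters are hypothetical names, not part of Fetcher.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 30.0,
                  jitter: float = 0.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay in seconds to wait after the given 1-based failed attempt."""
    rng = rng or random.Random()
    delay = min(cap, base ** attempt)
    # Optional jitter spreads retries out so parallel workers do not sync up.
    return delay + rng.uniform(0.0, jitter)


# The schedule used by the retry loop above (base 1.2, no jitter):
print([round(backoff_delay(a, base=1.2), 2) for a in range(1, 5)])
# A steeper alternative, capped at 30 seconds:
print([backoff_delay(a, base=2.0) for a in range(1, 7)])
```

Whichever curve you pick, keep the cap well below your scheduler's job timeout so a run of failures cannot stall the whole crawl.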
Configure ProxiesAPI
Set your proxy endpoint as an env var (example name):
export PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.proxiesapi.com:PORT"
Then in Python:
import os

cfg = FetchConfig(
    use_proxiesapi=True,
    proxiesapi_endpoint=os.getenv("PROXIESAPI_PROXY_URL"),
)
fetcher = Fetcher(cfg)
If you don’t have the endpoint yet, you can run with use_proxiesapi=False and still validate selectors.
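If you would rather have the toggle follow the environment, one option is a small helper that turns the env var into the proxies mapping requests expects, or None when it is unset. This is a minimal sketch: proxies_from_env is a hypothetical helper, PROXIESAPI_PROXY_URL is the example variable name from above, and it assumes the endpoint is a plain HTTP(S) proxy URL.

```python
import os
from typing import Mapping, Optional


def proxies_from_env(env: Mapping[str, str] = os.environ,
                     var: str = "PROXIESAPI_PROXY_URL") -> Optional[dict]:
    """Return a requests-style proxies mapping, or None to connect directly."""
    endpoint = (env.get(var) or "").strip()
    if not endpoint:
        return None  # variable unset or empty: run without a proxy
    if not endpoint.startswith(("http://", "https://")):
        # Assumption: only HTTP(S) proxy URLs are expected here.
        raise ValueError(f"{var} should start with http:// or https://")
    return {"http": endpoint, "https": endpoint}


# Passing a dict instead of os.environ makes the helper easy to test.
print(proxies_from_env({}))
print(proxies_from_env({"PROXIESAPI_PROXY_URL": "http://USER:PASS@proxy.example:8080"}))
```

Failing fast on a malformed value is a design choice: a typo in the env var then shows up at startup rather than as a confusing connection error mid-crawl.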
Step 2: Costco URLs we’ll crawl
Costco search URLs typically look like:
- Search: https://www.costco.com/CatalogSearch?dept=All&keyword=protein
Pagination/parameters can vary; the practical approach is:
- Start from a search URL
- Parse product card URLs from the HTML
- Find the “next page” link (if any) and repeat
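The three steps above amount to a small loop. Here is a sketch of that loop in isolation, with stand-in fetch and parse functions so it runs without network access. crawl_pages, FAKE_SITE, and the example paths are all hypothetical; the visited-set check is an extra safeguard in case a "next" link ever points back to a page already crawled.

```python
from typing import Callable, Optional


def crawl_pages(start_url: str,
                fetch: Callable[[str], str],
                parse: Callable[[str], tuple],
                max_pages: int = 5) -> list:
    """Follow next-page links until they run out, repeat, or hit max_pages."""
    seen_pages = set()
    items = []
    url: Optional[str] = start_url
    while url and url not in seen_pages and len(seen_pages) < max_pages:
        seen_pages.add(url)
        page_items, url = parse(fetch(url))  # parse returns (items, next_url)
        items.extend(page_items)
    return items


# Stand-in data: the "HTML" is just the URL, and parse looks it up here.
FAKE_SITE = {
    "/s?page=1": (["a", "b"], "/s?page=2"),
    "/s?page=2": (["c"], "/s?page=1"),  # loops back: must not crawl forever
}
result = crawl_pages("/s?page=1", fetch=lambda u: u,
                     parse=lambda html: FAKE_SITE[html])
print(result)  # ['a', 'b', 'c']
```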
Step 3: Parse search/listing pages (product cards)
We’ll extract:
- product name
- product URL
- optional displayed price (sometimes visible on cards; the parser below skips it, since the product page is the more reliable source)
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.costco.com"


def parse_search_page(html: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")

    # Product tiles commonly contain an anchor to the PDP.
    # Use a broad selector, then normalize to absolute URLs.
    # Results are keyed by URL so each product appears once.
    items: dict[str, dict] = {}
    for a in soup.select('a[href*=".product"]'):
        href = a.get("href")
        if not href:
            continue
        url = urljoin(BASE, href)  # absolute hrefs pass through unchanged

        # Try to pick a human-visible title from within the tile.
        title = a.get_text(" ", strip=True) or None

        # A tile often links to the same PDP more than once (image + title).
        # Keep the first entry unless a later anchor supplies a missing title.
        if url not in items or (title and not items[url]["title"]):
            items[url] = {"title": title, "url": url}

    # Pagination: look for a "next" link (site markup changes; keep logic forgiving).
    next_url = None
    next_a = soup.select_one('a[aria-label="Next"], a[rel="next"], a.pagination-next')
    if next_a and next_a.get("href"):
        next_url = urljoin(BASE, next_a["href"])

    return list(items.values()), next_url
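Deduplicating by URL only works if the same product always yields the same string. If search results append tracking parameters or fragments to product links, you can normalize before using the URL as a key. This is a sketch under one loud assumption: that the query string and fragment never select a different product. Check that against real links before relying on it. normalize_product_url and the example paths are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = "https://www.costco.com"


def normalize_product_url(href: str, base: str = BASE) -> str:
    """Absolute URL with the query string and fragment removed."""
    parts = urlsplit(urljoin(base, href))
    # Assumption: query/fragment are tracking noise, not product selectors.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


# Two hypothetical hrefs that point at the same page:
a = normalize_product_url("/example-item.product.100000.html?ref=search#reviews")
b = normalize_product_url("https://www.costco.com/example-item.product.100000.html")
print(a == b)
```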
Sanity check the parser
query = "protein"
start = f"{BASE}/CatalogSearch?dept=All&keyword={query}"
html = fetcher.get(start)
items, next_url = parse_search_page(html)
print("items", len(items))
print("next", next_url)
print(items[:3])
Step 4: Parse a Costco product page (PDP)
On the product page, you want:
- a stable product identifier (often embedded in the URL or in structured data)
- title
- price
- availability / stock messaging (when present)
A reliable strategy:
- Prefer structured data (application/ld+json) if available
- Fall back to visible DOM selectors
import json
import re


def extract_ld_json(soup: BeautifulSoup) -> list[dict]:
    out = []
    for s in soup.select('script[type="application/ld+json"]'):
        raw = s.get_text("\n", strip=True)
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON-LD blocks
        if isinstance(data, dict):
            out.append(data)
        elif isinstance(data, list):
            out.extend(d for d in data if isinstance(d, dict))
    return out


def parse_product_page(url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = None
    price = None
    currency = None
    availability = None

    # 1) Try JSON-LD
    for block in extract_ld_json(soup):
        # Products sometimes live under @graph
        graph = block.get("@graph") if isinstance(block.get("@graph"), list) else None
        candidates = graph if graph else [block]
        for obj in candidates:
            if not isinstance(obj, dict):
                continue
            if obj.get("@type") in ("Product", ["Product"]):
                title = title or obj.get("name")
                offers = obj.get("offers")
                # "offers" may be a single object or a list; use the first entry.
                if isinstance(offers, list) and offers:
                    offers = offers[0]
                if isinstance(offers, dict):
                    price = price or offers.get("price")
                    currency = currency or offers.get("priceCurrency")
                    availability = availability or offers.get("availability")

    # 2) Fall back to visible selectors
    if not title:
        h1 = soup.select_one("h1")
        title = h1.get_text(" ", strip=True) if h1 else None

    if not price:
        # Common pattern: price fragments split across spans.
        # Keep it flexible: look for something that looks like $12.34.
        # This takes the first match on the page, so spot-check the results.
        text = soup.get_text("\n", strip=True)
        m = re.search(r"\$(\d{1,4}(?:,\d{3})*(?:\.\d{2})?)", text)
        if m:
            price = m.group(1)
            currency = currency or "USD"

    return {
        "url": url,
        "title": title,
        "price": price,
        "currency": currency,
        "availability": availability,
    }
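parse_product_page returns the price in whatever form it was found: a string or number from JSON-LD, or a string like 1,299.99 from the regex fallback. Before comparing prices across runs it helps to normalize them. The sketch below uses Decimal and assumes US-style formatting (comma as thousands separator). to_decimal is a hypothetical helper, and SAMPLE_LD is invented for illustration, not copied from a real page.

```python
import json
from decimal import Decimal, InvalidOperation
from typing import Optional

# Invented JSON-LD for illustration only; not copied from a real page.
SAMPLE_LD = """
{"@type": "Product", "name": "Example Protein Powder",
 "offers": {"@type": "Offer", "price": "1,299.99", "priceCurrency": "USD"}}
"""


def to_decimal(price) -> Optional[Decimal]:
    """Convert '1,299.99', '$54.99', 54.99 or None to a Decimal (or None)."""
    if price is None:
        return None
    # Assumption: commas are thousands separators (US-style numbers).
    cleaned = str(price).replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # not a number, e.g. "See price in cart"


offer = json.loads(SAMPLE_LD)["offers"]
print(to_decimal(offer["price"]))
```

Decimal avoids the rounding surprises of float when you later compute price differences.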
Step 5: Crawl end-to-end (search → products)
Now we stitch it together:
- crawl up to max_pages of search results
- collect unique product URLs
- fetch + parse each product page
from urllib.parse import urlencode


def crawl_costco_search(keyword: str, max_pages: int = 5) -> list[dict]:
    params = {"dept": "All", "keyword": keyword}
    url = f"{BASE}/CatalogSearch?{urlencode(params)}"
    products: dict[str, dict] = {}
    pages = 0
    while url and pages < max_pages:
        pages += 1
        html = fetcher.get(url)
        items, next_url = parse_search_page(html)
        for it in items:
            products[it["url"]] = it
        print(f"page {pages}: found {len(items)} items (total unique {len(products)})")
        url = next_url
    return list(products.values())


def crawl_product_details(urls: list[str]) -> list[dict]:
    out = []
    for i, url in enumerate(urls, start=1):
        try:
            html = fetcher.get(url)
        except RuntimeError as e:
            # One unreachable product page should not abort the whole crawl.
            print(f"{i}/{len(urls)} skipped {url}: {e}")
            continue
        data = parse_product_page(url, html)
        out.append(data)
        print(f"{i}/{len(urls)} parsed", data.get("title"), data.get("price"))
    return out


items = crawl_costco_search("protein", max_pages=3)
urls = [it["url"] for it in items]
rows = crawl_product_details(urls[:25])  # start small
print("rows", len(rows))
if rows:
    print(rows[0])
Step 6: Export to CSV + JSON
import json

import pandas as pd

pd.DataFrame(rows).to_csv("costco_prices.csv", index=False)

with open("costco_prices.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)

print("wrote costco_prices.csv + costco_prices.json")
Practical production upgrades
If you’re turning this into a tracker (daily/weekly price checks):
- Store results in SQLite/Postgres keyed by product URL
- Cache HTML for debugging failed parses
- Add concurrency cautiously (start with 2–4 threads)
- Add alerting when a price changes beyond a threshold
- Keep a block/failure rate dashboard (403/429/timeout counts)
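As a starting point for the first and fourth items, here is a minimal sketch of a price history table in SQLite with a threshold check on insert. The table layout, function names, and the 5% default threshold are all assumptions for illustration; the example URL is a placeholder.

```python
import sqlite3
from typing import Optional


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the price history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prices (
               url TEXT NOT NULL,
               price REAL NOT NULL,
               seen_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def record_price(conn: sqlite3.Connection, url: str, price: float,
                 threshold: float = 0.05) -> Optional[float]:
    """Store a price; return the relative change if it meets the threshold."""
    row = conn.execute(
        "SELECT price FROM prices WHERE url = ? ORDER BY rowid DESC LIMIT 1",
        (url,),
    ).fetchone()
    conn.execute("INSERT INTO prices (url, price) VALUES (?, ?)", (url, price))
    conn.commit()
    if row is None or row[0] == 0:
        return None  # first sighting, or nothing sensible to compare against
    change = (price - row[0]) / row[0]
    return change if abs(change) >= threshold else None


conn = open_db()
url = "https://example.com/p/1"  # placeholder URL
first = record_price(conn, url, 50.0)   # no previous price
second = record_price(conn, url, 51.0)  # +2%, below the threshold
third = record_price(conn, url, 45.0)   # about -11.8%, worth an alert
print(first, second, third)
```

Keeping every observation rather than overwriting the latest price lets you chart history later and re-run alerts with a different threshold.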
QA checklist
- Search parser extracts mostly product URLs (spot-check 10)
- Pagination finds next page or stops cleanly
- Product parser returns non-empty title for most URLs
- Price extraction succeeds for a meaningful subset
- Exports are valid CSV/JSON
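Part of this checklist can be automated. The sketch below computes the share of rows with a non-empty title and price, so a scheduled run can flag a sudden drop (often a sign that selectors broke or pages are being blocked). qa_report is a hypothetical helper and the sample rows are invented.

```python
def qa_report(rows: list) -> dict:
    """Share of rows with a non-empty title and price (0.0 to 1.0)."""
    total = len(rows)
    if total == 0:
        return {"rows": 0, "title_rate": 0.0, "price_rate": 0.0}
    with_title = sum(1 for r in rows if r.get("title"))
    with_price = sum(1 for r in rows if r.get("price"))
    return {
        "rows": total,
        "title_rate": with_title / total,
        "price_rate": with_price / total,
    }


# Invented rows in the same shape parse_product_page returns.
sample_rows = [
    {"url": "u1", "title": "Item A", "price": "19.99"},
    {"url": "u2", "title": "Item B", "price": None},
    {"url": "u3", "title": None, "price": None},
    {"url": "u4", "title": "Item D", "price": "5.49"},
]
report = qa_report(sample_rows)
print(report)  # {'rows': 4, 'title_rate': 0.75, 'price_rate': 0.5}
```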
Where ProxiesAPI helps (honestly)
Ecommerce sites are where scraping reliability becomes a job:
- IP-based rate limits
- intermittent 403/429
- different content per region
ProxiesAPI doesn’t “magically bypass everything,” but it does give you a stable proxy layer you can turn on when your crawl starts failing.
If you keep your network layer isolated (like Fetcher above), you can swap proxy settings without rewriting your parser.