How to Scrape Walmart Product Data at Scale (Python + ProxiesAPI)
Walmart product pages are a classic e-commerce scraping target:
- the data you care about is there (title, price, availability, rating)
- page templates are mostly consistent
- at scale, request stability and retries matter more than clever selectors
In this tutorial we’ll build a Walmart product scraper in Python that:
- fetches product pages with sensible timeouts
- retries safely on transient failures
- extracts title, price, availability, and rating
- exports clean JSONL for downstream pipelines

When you go from 20 URLs to 20,000, the hard part isn’t parsing HTML — it’s keeping requests stable across retries, timeouts, and geo variance. ProxiesAPI gives you a clean proxy layer so your scraper can keep moving.
What we’re scraping (and what we’re not)
A Walmart product page URL typically looks like:
https://www.walmart.com/ip/PRODUCT-NAME/123456789
We’ll scrape publicly visible fields from the HTML. We are not:
- logging in
- adding items to cart
- calling private endpoints
If you need near-real-time price monitoring, you should still build a pipeline that:
- caches responses
- throttles requests
- refreshes only the SKUs that matter
Setup
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (lxml) for parsing
- `tenacity` for robust retries
Network layer: timeouts + retries (the part that saves you at scale)
Scraping at scale fails for boring reasons:
- DNS hiccups
- TLS handshakes that stall
- 5xx bursts
- throttling / soft blocks
The fix is a defensive fetch() with:
- explicit connect/read timeouts
- retry with exponential backoff
- sane headers
```python
import os
import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 35)  # connect, read

USER_AGENTS = [
    # keep a small, realistic pool
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str

def build_session() -> requests.Session:
    s = requests.Session()
    # If you use ProxiesAPI as an HTTP proxy, wire it here.
    # Example pattern (adjust to your ProxiesAPI docs/account):
    # PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
    # if PROXY_URL:
    #     s.proxies.update({"http": PROXY_URL, "https": PROXY_URL})
    s.headers.update({
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    })
    return s

session = build_session()

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch(url: str) -> FetchResult:
    # small jitter helps avoid synchronized bursts
    time.sleep(random.uniform(0.2, 0.7))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    r = session.get(url, headers=headers, timeout=TIMEOUT)
    # Raise on 4xx/5xx so tenacity retries when appropriate
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)
```
Where ProxiesAPI fits
At higher volume, you’ll eventually want a proxy layer.
ProxiesAPI can be used as that layer so:
- your IP reputation doesn’t hinge on one egress
- retries can rotate exit IPs (depending on your ProxiesAPI plan/mode)
- geo/region issues are easier to handle
In the code above, notice the single place you’d wire ProxiesAPI in: session.proxies.
That’s deliberate: keep parsing logic independent from networking.
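One way to keep that single wiring point tidy is to build the proxies mapping from an env var. A hedged sketch — `PROXIESAPI_PROXY_URL` is an assumed variable name, and the exact proxy endpoint/credentials format comes from your ProxiesAPI account, not from this post:

```python
import os

def proxy_config() -> dict[str, str]:
    """Build a requests-style proxies mapping from the environment.

    PROXIESAPI_PROXY_URL is a hypothetical env var name; set it to whatever
    proxy URL your ProxiesAPI plan gives you (e.g. "http://user:pass@host:port").
    """
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL", "")
    return {"http": proxy_url, "https": proxy_url} if proxy_url else {}

os.environ["PROXIESAPI_PROXY_URL"] = "http://user:pass@proxy.example:8080"
print(proxy_config())
```

With this shape, `build_session()` can just call `s.proxies.update(proxy_config())` and the rest of the scraper never mentions proxies again.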
Parsing Walmart pages reliably
E-commerce pages change. The trick is:
- Extract the most stable representation first (often JSON-LD)
- Fall back to HTML selectors for fields missing in structured data
- Keep selectors conservative and easy to update
Walmart product pages typically include application/ld+json blocks containing a Product.
We’ll parse:
- `name` (title)
- `offers.price` (price)
- `offers.availability` (availability)
- `aggregateRating.ratingValue` (rating)
```python
import json
import re
from typing import Any, Optional

from bs4 import BeautifulSoup

def safe_float(x: Any) -> Optional[float]:
    try:
        return float(str(x).strip())
    except Exception:
        return None

def safe_str(x: Any) -> Optional[str]:
    if x is None:
        return None
    s = str(x).strip()
    return s if s else None

def extract_jsonld_products(soup: BeautifulSoup) -> list[dict]:
    out: list[dict] = []
    for tag in soup.select('script[type="application/ld+json"]'):
        raw = tag.get_text("\n", strip=True)
        if not raw:
            continue
        # Some pages have multiple JSON objects in one script; handle best-effort
        try:
            data = json.loads(raw)
        except Exception:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            t = item.get("@type")
            if t == "Product":
                out.append(item)
            # Sometimes nested in @graph
            if "@graph" in item and isinstance(item["@graph"], list):
                for g in item["@graph"]:
                    if isinstance(g, dict) and g.get("@type") == "Product":
                        out.append(g)
    return out

def parse_walmart_product(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = None
    price = None
    availability = None
    rating = None

    products = extract_jsonld_products(soup)
    if products:
        p0 = products[0]
        title = safe_str(p0.get("name"))

        offers = p0.get("offers")
        # offers can be dict or list
        if isinstance(offers, list) and offers:
            offers = offers[0]
        if isinstance(offers, dict):
            price = safe_float(offers.get("price"))
            availability = safe_str(offers.get("availability"))

        agg = p0.get("aggregateRating")
        if isinstance(agg, dict):
            rating = safe_float(agg.get("ratingValue"))

    # Fallbacks (HTML)
    if not title:
        h1 = soup.select_one("h1")
        title = h1.get_text(" ", strip=True) if h1 else None

    if price is None:
        # Walmart price presentation varies; try a couple of conservative patterns
        # Look for typical "$"-prefixed numbers in price blocks
        text = soup.get_text("\n", strip=True)
        m = re.search(r"\$\s*(\d{1,4}(?:\.\d{2})?)", text)
        if m:
            price = safe_float(m.group(1))

    if availability is None:
        # best-effort: detect common phrases
        page_text = soup.get_text("\n", strip=True).lower()
        if "out of stock" in page_text:
            availability = "OutOfStock"
        elif "in stock" in page_text or "pickup" in page_text or "delivery" in page_text:
            availability = "InStock"

    return {
        "url": url,
        "title": title,
        "price": price,
        "availability": availability,
        "rating": rating,
    }
```
Why JSON-LD first?
It’s designed for machines, and it tends to survive UI redesigns longer than CSS classnames.
That said, don’t assume it’s always present or always complete — hence the fallbacks.
Scrape a list of product URLs (JSONL output)
Here’s a small end-to-end script:
```python
import json

URLS = [
    # Replace with your own Walmart product URLs
    "https://www.walmart.com/ip/123456789",
]

def run(urls: list[str]) -> None:
    with open("walmart_products.jsonl", "w", encoding="utf-8") as f:
        for url in urls:
            try:
                res = fetch(url)
                item = parse_walmart_product(res.text, url=url)
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
                print("ok", url, "->", item.get("title"), item.get("price"))
            except Exception as e:
                print("fail", url, type(e).__name__, str(e)[:200])

if __name__ == "__main__":
    run(URLS)
```
Run it:
```shell
python walmart_scrape.py
```
Scaling tips (what changes after your first 50 URLs)
When you scale this up, focus on operational correctness:
- Deduplicate URLs before fetching (store a canonical SKU ID)
- Persist failures (write failed URLs to a separate file for a later retry pass)
- Use concurrency carefully (start with 5–10 workers, not 200)
- Rotate proxies/IPs once you see elevated 403/429 rates
- Respect cache: re-fetch only when needed (price monitors can be interval-based)
If you add concurrency, keep retries per worker conservative to avoid turning a transient issue into a stampede.
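A bounded worker pool is usually enough; the sketch below uses a stand-in `fetch_and_parse` (in the real script it would call `fetch()` + `parse_walmart_product()`), and its failure handling writes errors into the result stream instead of crashing the run:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_and_parse(url: str) -> dict:
    # Stand-in for fetch() + parse_walmart_product() so the sketch is self-contained
    return {"url": url, "ok": True}

def scrape_concurrently(urls: list[str], max_workers: int = 8) -> list[dict]:
    """Run fetch_and_parse over urls with a small, bounded worker pool."""
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_and_parse, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results.append(fut.result())
            except Exception as e:
                # Persist failures for a later retry pass instead of aborting
                results.append({"url": url, "error": type(e).__name__})
    return results

print(len(scrape_concurrently([f"https://www.walmart.com/ip/{i}" for i in range(20)])))
```

Note that with tenacity's 5 attempts per URL, 10 workers can already generate 50 in-flight requests during an outage; that's why the worker count stays small.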
QA checklist
- For 3–5 URLs, titles match what you see in the browser
- Price is parsed as a number (float)
- Availability is not always None
- JSONL contains one JSON object per line
- Fetch uses timeouts (no hung processes)
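The output-file checks are easy to automate. A minimal QA pass over the JSONL lines (the helper name and returned summary shape are just illustrative):

```python
import json

def qa_jsonl(lines: list[str]) -> dict:
    """Quick sanity checks: every line parses, price is numeric or null,
    and at least some rows carry availability."""
    rows = [json.loads(line) for line in lines]  # raises if any line is malformed
    assert all(r.get("price") is None or isinstance(r["price"], (int, float)) for r in rows)
    with_availability = sum(1 for r in rows if r.get("availability") is not None)
    return {"rows": len(rows), "with_availability": with_availability}

sample = [
    '{"url": "u1", "title": "A", "price": 19.99, "availability": "InStock", "rating": 4.5}',
    '{"url": "u2", "title": "B", "price": null, "availability": null, "rating": null}',
]
print(qa_jsonl(sample))  # {'rows': 2, 'with_availability': 1}
```

If `with_availability` is 0 across a whole run, that's a strong hint your selectors or JSON-LD extraction broke.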
Next upgrades
- Extract more fields (brand, images, breadcrumbs, shipping)
- Store results in SQLite/Postgres for incremental updates
- Add structured logging + metrics for retry rate, error rate, and response time
- Add an “HTML snapshot” mode for debugging when selectors break