How to Scrape Walmart Product Data at Scale (Python + ProxiesAPI)
Walmart product pages are a classic e-commerce scraping target:
- the data you care about is there (title, price, availability, rating)
- page templates are mostly consistent
- at scale, request stability and retries matter more than clever selectors
In this tutorial we’ll build a Walmart product scraper in Python that:
- fetches product pages with sensible timeouts
- retries safely on transient failures
- extracts title, price, availability, and rating
- exports clean JSONL for downstream pipelines

When you go from 20 URLs to 20,000, the hard part isn’t parsing HTML — it’s keeping requests stable across retries, timeouts, and geo variance. ProxiesAPI gives you a clean proxy layer so your scraper can keep moving.
What we’re scraping (and what we’re not)
A Walmart product page URL typically looks like:
https://www.walmart.com/ip/PRODUCT-NAME/123456789
We’ll scrape publicly visible fields from the HTML. We are not:
- logging in
- adding items to cart
- calling private endpoints
If you need near-real-time price monitoring, you should still build a pipeline that:
- caches responses
- throttles requests
- refreshes only the SKUs that matter
Setup
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (lxml) for parsing
- `tenacity` for robust retries
Network layer: timeouts + retries (the part that saves you at scale)
Scraping at scale fails for boring reasons:
- DNS hiccups
- TLS handshakes that stall
- 5xx bursts
- throttling / soft blocks
The fix is a defensive fetch() with:
- explicit connect/read timeouts
- retry with exponential backoff
- sane headers
```python
import os
import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 35)  # connect, read

USER_AGENTS = [
    # keep a small, realistic pool
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str

def build_session() -> requests.Session:
    s = requests.Session()
    # If you use ProxiesAPI as an HTTP proxy, wire it here.
    # Example pattern (adjust to your ProxiesAPI docs/account):
    # PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
    # if PROXY_URL:
    #     s.proxies.update({"http": PROXY_URL, "https": PROXY_URL})
    s.headers.update({
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    })
    return s

session = build_session()

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch(url: str) -> FetchResult:
    # small jitter helps avoid synchronized bursts
    time.sleep(random.uniform(0.2, 0.7))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    r = session.get(url, headers=headers, timeout=TIMEOUT)
    # Raise on 4xx/5xx so tenacity retries when appropriate
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)
```
Where ProxiesAPI fits
At higher volume, you’ll eventually want a proxy layer.
ProxiesAPI can be used as that layer so:
- your IP reputation doesn’t hinge on one egress
- retries can rotate exit IPs (depending on your ProxiesAPI plan/mode)
- geo/region issues are easier to handle
In the code above, notice the single place you’d wire ProxiesAPI in: session.proxies.
That’s deliberate: keep parsing logic independent from networking.
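One way to keep that single wiring point tidy is to build the proxies mapping from an env var. A hedged sketch — `PROXIESAPI_PROXY_URL` is an assumed variable name, and the exact proxy endpoint/credentials format comes from your ProxiesAPI account, not from this post:

```python
import os

def proxy_config() -> dict[str, str]:
    """Build a requests-style proxies mapping from the environment.

    PROXIESAPI_PROXY_URL is a hypothetical env var name; set it to whatever
    proxy URL your ProxiesAPI plan gives you (e.g. "http://user:pass@host:port").
    """
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL", "")
    return {"http": proxy_url, "https": proxy_url} if proxy_url else {}

os.environ["PROXIESAPI_PROXY_URL"] = "http://user:pass@proxy.example:8080"
print(proxy_config())
```

With this shape, `build_session()` can just call `s.proxies.update(proxy_config())` and the rest of the scraper never mentions proxies again.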
Parsing Walmart pages reliably
E-commerce pages change. The trick is:
- Extract the most stable representation first (often JSON-LD)
- Fall back to HTML selectors for fields missing in structured data
- Keep selectors conservative and easy to update
Walmart product pages typically include application/ld+json blocks containing a Product.
We’ll parse:
- `name` (title)
- `offers.price` (price)
- `offers.availability` (availability)
- `aggregateRating.ratingValue` (rating)
```python
import json
import re
from typing import Any, Optional

from bs4 import BeautifulSoup

def safe_float(x: Any) -> Optional[float]:
    try:
        return float(str(x).strip())
    except Exception:
        return None

def safe_str(x: Any) -> Optional[str]:
    if x is None:
        return None
    s = str(x).strip()
    return s if s else None

def extract_jsonld_products(soup: BeautifulSoup) -> list[dict]:
    out: list[dict] = []
    for tag in soup.select('script[type="application/ld+json"]'):
        raw = tag.get_text("\n", strip=True)
        if not raw:
            continue
        # Some pages have multiple JSON objects in one script; handle best-effort
        try:
            data = json.loads(raw)
        except Exception:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            t = item.get("@type")
            if t == "Product":
                out.append(item)
            # Sometimes nested in @graph
            if "@graph" in item and isinstance(item["@graph"], list):
                for g in item["@graph"]:
                    if isinstance(g, dict) and g.get("@type") == "Product":
                        out.append(g)
    return out

def parse_walmart_product(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = None
    price = None
    availability = None
    rating = None

    products = extract_jsonld_products(soup)
    if products:
        p0 = products[0]
        title = safe_str(p0.get("name"))

        offers = p0.get("offers")
        # offers can be dict or list
        if isinstance(offers, list) and offers:
            offers = offers[0]
        if isinstance(offers, dict):
            price = safe_float(offers.get("price"))
            availability = safe_str(offers.get("availability"))

        agg = p0.get("aggregateRating")
        if isinstance(agg, dict):
            rating = safe_float(agg.get("ratingValue"))

    # Fallbacks (HTML)
    if not title:
        h1 = soup.select_one("h1")
        title = h1.get_text(" ", strip=True) if h1 else None

    if price is None:
        # Walmart price presentation varies; try a couple of conservative patterns
        # Look for typical "$"-prefixed numbers in price blocks
        text = soup.get_text("\n", strip=True)
        m = re.search(r"\$\s*(\d{1,4}(?:\.\d{2})?)", text)
        if m:
            price = safe_float(m.group(1))

    if availability is None:
        # best-effort: detect common phrases
        page_text = soup.get_text("\n", strip=True).lower()
        if "out of stock" in page_text:
            availability = "OutOfStock"
        elif "in stock" in page_text or "pickup" in page_text or "delivery" in page_text:
            availability = "InStock"

    return {
        "url": url,
        "title": title,
        "price": price,
        "availability": availability,
        "rating": rating,
    }
```
Why JSON-LD first?
It’s designed for machines, and it tends to survive UI redesigns longer than CSS classnames.
That said, don’t assume it’s always present or always complete — hence the fallbacks.
Scrape a list of product URLs (JSONL output)
Here’s a small end-to-end script:
```python
import json

URLS = [
    # Replace with your own Walmart product URLs
    "https://www.walmart.com/ip/123456789",
]

def run(urls: list[str]) -> None:
    with open("walmart_products.jsonl", "w", encoding="utf-8") as f:
        for url in urls:
            try:
                res = fetch(url)
                item = parse_walmart_product(res.text, url=url)
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
                print("ok", url, "->", item.get("title"), item.get("price"))
            except Exception as e:
                print("fail", url, type(e).__name__, str(e)[:200])

if __name__ == "__main__":
    run(URLS)
```
Run it:
```shell
python walmart_scrape.py
```
Scaling tips (what changes after your first 50 URLs)
When you scale this up, focus on operational correctness:
- Deduplicate URLs before fetching (store a canonical SKU ID)
- Persist failures (write failed URLs to a separate file for a later retry pass)
- Use concurrency carefully (start with 5–10 workers, not 200)
- Rotate proxies/IPs once you see elevated 403/429 rates
- Respect cache: re-fetch only when needed (price monitors can be interval-based)
If you add concurrency, keep retries per worker conservative to avoid turning a transient issue into a stampede.
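A bounded worker pool is usually enough; the sketch below uses a stand-in `fetch_and_parse` (in the real script it would call `fetch()` + `parse_walmart_product()`), and its failure handling writes errors into the result stream instead of crashing the run:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_and_parse(url: str) -> dict:
    # Stand-in for fetch() + parse_walmart_product() so the sketch is self-contained
    return {"url": url, "ok": True}

def scrape_concurrently(urls: list[str], max_workers: int = 8) -> list[dict]:
    """Run fetch_and_parse over urls with a small, bounded worker pool."""
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_and_parse, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results.append(fut.result())
            except Exception as e:
                # Persist failures for a later retry pass instead of aborting
                results.append({"url": url, "error": type(e).__name__})
    return results

print(len(scrape_concurrently([f"https://www.walmart.com/ip/{i}" for i in range(20)])))
```

Note that with tenacity's 5 attempts per URL, 10 workers can already generate 50 in-flight requests during an outage; that's why the worker count stays small.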
QA checklist
- For 3–5 URLs, titles match what you see in the browser
- Price is parsed as a number (float)
- Availability is not always None
- JSONL contains one JSON object per line
- Fetch uses timeouts (no hung processes)
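The output-file checks are easy to automate. A minimal QA pass over the JSONL lines (the helper name and returned summary shape are just illustrative):

```python
import json

def qa_jsonl(lines: list[str]) -> dict:
    """Quick sanity checks: every line parses, price is numeric or null,
    and at least some rows carry availability."""
    rows = [json.loads(line) for line in lines]  # raises if any line is malformed
    assert all(r.get("price") is None or isinstance(r["price"], (int, float)) for r in rows)
    with_availability = sum(1 for r in rows if r.get("availability") is not None)
    return {"rows": len(rows), "with_availability": with_availability}

sample = [
    '{"url": "u1", "title": "A", "price": 19.99, "availability": "InStock", "rating": 4.5}',
    '{"url": "u2", "title": "B", "price": null, "availability": null, "rating": null}',
]
print(qa_jsonl(sample))  # {'rows': 2, 'with_availability': 1}
```

If `with_availability` is 0 across a whole run, that's a strong hint your selectors or JSON-LD extraction broke.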
Next upgrades
- Extract more fields (brand, images, breadcrumbs, shipping)
- Store results in SQLite/Postgres for incremental updates
- Add structured logging + metrics for retry rate, error rate, and response time
- Add an “HTML snapshot” mode for debugging when selectors break