How to Scrape Walmart Product Data at Scale (Python + ProxiesAPI)

Walmart product pages are a classic e-commerce scraping target:

  • the data you care about is there (title, price, availability, rating)
  • page templates are mostly consistent
  • at scale, request stability and retries matter more than clever selectors

In this tutorial we’ll build a Walmart product scraper in Python that:

  • fetches product pages with sensible timeouts
  • retries safely on transient failures
  • extracts title, price, availability, and rating
  • exports clean JSONL for downstream pipelines

[Screenshot: Walmart product page. We’ll extract title, price, availability, and rating.]

Make large Walmart crawls more reliable with ProxiesAPI

When you go from 20 URLs to 20,000, the hard part isn’t parsing HTML — it’s keeping requests stable across retries, timeouts, and geo variance. ProxiesAPI gives you a clean proxy layer so your scraper can keep moving.


What we’re scraping (and what we’re not)

A Walmart product page URL typically looks like:

  • https://www.walmart.com/ip/PRODUCT-NAME/123456789
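
The trailing number is Walmart’s item ID, and it makes a better dedupe key than the full URL, whose slug and query string can vary. A minimal sketch (`walmart_item_id` is a hypothetical helper, not part of the scraper below):

```python
import re
from typing import Optional
from urllib.parse import urlsplit


def walmart_item_id(url: str) -> Optional[str]:
    """Extract the trailing numeric item ID from a Walmart /ip/ URL."""
    path = urlsplit(url).path  # drop query string and fragment
    m = re.search(r"/ip/(?:[^/]+/)?(\d+)/?$", path)
    return m.group(1) if m else None


print(walmart_item_id("https://www.walmart.com/ip/Some-Product-Name/123456789"))  # 123456789
print(walmart_item_id("https://www.walmart.com/ip/123456789?athbdg=L1600"))       # 123456789
```

Deduplicating on this ID before fetching saves requests when the same product appears under several slugs.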

We’ll scrape publicly visible fields from the HTML. We are not:

  • logging in
  • adding items to cart
  • calling private endpoints

If you need near-real-time price monitoring, you should still build a pipeline that:

  • caches responses
  • throttles requests
  • refreshes only the SKUs that matter
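
The cache/refresh idea can be sketched as a tiny interval cache (`IntervalCache` and `fetch_fn` are hypothetical names for illustration, not part of the scraper we build below):

```python
import time
from typing import Callable, Dict, Tuple


class IntervalCache:
    """Re-fetch a SKU only when its cached entry is older than `ttl` seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store: Dict[str, Tuple[float, str]] = {}  # sku -> (fetched_at, payload)

    def get(self, sku: str, fetch_fn: Callable[[str], str]) -> str:
        now = time.monotonic()
        hit = self._store.get(sku)
        if hit and now - hit[0] < self.ttl:
            return hit[1]                # fresh enough: skip the network
        payload = fetch_fn(sku)          # stale or missing: refresh
        self._store[sku] = (now, payload)
        return payload


calls = []
cache = IntervalCache(ttl=60.0)
cache.get("123456789", lambda s: calls.append(s) or f"html-for-{s}")
cache.get("123456789", lambda s: calls.append(s) or f"html-for-{s}")
print(len(calls))  # 1 -- the second call was served from cache
```

A real price monitor would persist the store and set `ttl` per SKU priority, but the shape is the same.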

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for parsing
  • tenacity for robust retries

Network layer: timeouts + retries (the part that saves you at scale)

Scraping at scale fails for boring reasons:

  • DNS hiccups
  • TLS handshakes that stall
  • 5xx bursts
  • throttling / soft blocks

The fix is a defensive fetch() with:

  • explicit connect/read timeouts
  • retry with exponential backoff
  • sane headers

import os
import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 35)  # connect, read

USER_AGENTS = [
    # keep a small, realistic pool
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def build_session() -> requests.Session:
    s = requests.Session()

    # If you use ProxiesAPI as an HTTP proxy, wire it here.
    # Example pattern (adjust to your ProxiesAPI docs/account):
    # PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
    # if PROXY_URL:
    #     s.proxies.update({"http": PROXY_URL, "https": PROXY_URL})

    s.headers.update({
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    })
    return s


session = build_session()


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch(url: str) -> FetchResult:
    # small jitter helps avoid synchronized bursts
    time.sleep(random.uniform(0.2, 0.7))

    headers = {"User-Agent": random.choice(USER_AGENTS)}
    r = session.get(url, headers=headers, timeout=TIMEOUT)

    # Retry only what is likely transient: 429 and 5xx raise HTTPError,
    # which is a RequestException, so tenacity retries those.
    if r.status_code == 429 or r.status_code >= 500:
        r.raise_for_status()

    # Other 4xx (404, 410, ...) won't improve on retry; fail fast with a
    # non-RequestException so tenacity doesn't burn attempts on them.
    if r.status_code >= 400:
        raise ValueError(f"non-retryable status {r.status_code} for {url}")

    return FetchResult(url=url, status_code=r.status_code, text=r.text)

Where ProxiesAPI fits

At higher volume, you’ll eventually want a proxy layer.

ProxiesAPI can be used as that layer so:

  • your IP reputation doesn’t hinge on one egress
  • retries can rotate exit IPs (depending on your ProxiesAPI plan/mode)
  • geo/region issues are easier to handle

In the code above, notice the single place you’d wire ProxiesAPI in: session.proxies.

That’s deliberate: keep parsing logic independent from networking.
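
Besides `session.proxies`, another common pattern is routing each request through a proxy API endpoint. The sketch below assumes that style; the hostname and parameter names (`api.proxiesapi.com`, `auth_key`, `url`) are placeholders to check against your ProxiesAPI account docs, not a confirmed API:

```python
import os
from urllib.parse import urlencode

# Hypothetical endpoint/param names -- verify the exact URL format
# in your ProxiesAPI dashboard before using this.
PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "")


def via_proxiesapi(target_url: str) -> str:
    """Wrap a target URL so the request is routed through the proxy API."""
    qs = urlencode({"auth_key": PROXIESAPI_KEY, "url": target_url})
    return f"http://api.proxiesapi.com/?{qs}"


# Then, in the driver: fetch(via_proxiesapi("https://www.walmart.com/ip/123456789"))
print(via_proxiesapi("https://www.walmart.com/ip/123456789"))
```

Either way, the networking decision stays in one function and the parser never sees it.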


Parsing Walmart pages reliably

E-commerce pages change. The trick is:

  1. Extract the most stable representation first (often JSON-LD)
  2. Fall back to HTML selectors for fields missing in structured data
  3. Keep selectors conservative and easy to update

Walmart product pages typically include application/ld+json blocks containing a Product.

We’ll parse:

  • name (title)
  • offers.price (price)
  • offers.availability (availability)
  • aggregateRating.ratingValue (rating)

import json
import re
from typing import Any, Optional

from bs4 import BeautifulSoup


def safe_float(x: Any) -> Optional[float]:
    try:
        return float(str(x).strip())
    except Exception:
        return None


def safe_str(x: Any) -> Optional[str]:
    if x is None:
        return None
    s = str(x).strip()
    return s if s else None


def extract_jsonld_products(soup: BeautifulSoup) -> list[dict]:
    out: list[dict] = []
    for tag in soup.select('script[type="application/ld+json"]'):
        raw = tag.get_text("\n", strip=True)
        if not raw:
            continue

        # Malformed or non-standard JSON (e.g. concatenated objects): skip it best-effort
        try:
            data = json.loads(raw)
        except Exception:
            continue

        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            t = item.get("@type")
            if t == "Product":
                out.append(item)
            # Sometimes nested in @graph
            if "@graph" in item and isinstance(item["@graph"], list):
                for g in item["@graph"]:
                    if isinstance(g, dict) and g.get("@type") == "Product":
                        out.append(g)
    return out


def parse_walmart_product(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = None
    price = None
    availability = None
    rating = None

    products = extract_jsonld_products(soup)
    if products:
        p0 = products[0]
        title = safe_str(p0.get("name"))

        offers = p0.get("offers")
        # offers can be dict or list
        if isinstance(offers, list) and offers:
            offers = offers[0]

        if isinstance(offers, dict):
            price = safe_float(offers.get("price"))
            availability = safe_str(offers.get("availability"))

        agg = p0.get("aggregateRating")
        if isinstance(agg, dict):
            rating = safe_float(agg.get("ratingValue"))

    # Fallbacks (HTML)
    if not title:
        h1 = soup.select_one("h1")
        title = h1.get_text(" ", strip=True) if h1 else None

    if price is None:
        # Walmart price presentation varies; try a conservative pattern:
        # a "$"-prefixed number, allowing thousands separators ("$1,299.99")
        text = soup.get_text("\n", strip=True)
        m = re.search(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)", text)
        if m:
            price = safe_float(m.group(1).replace(",", ""))

    if availability is None:
        # best-effort: detect common phrases
        page_text = soup.get_text("\n", strip=True).lower()
        if "out of stock" in page_text:
            availability = "OutOfStock"
        elif "in stock" in page_text or "pickup" in page_text or "delivery" in page_text:
            availability = "InStock"

    return {
        "url": url,
        "title": title,
        "price": price,
        "availability": availability,
        "rating": rating,
    }

Why JSON-LD first?

It’s designed for machines, and it tends to survive UI redesigns longer than CSS classnames.

That said, don’t assume it’s always present or always complete — hence the fallbacks.
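
To see the JSON-LD-first path in isolation, feed a synthetic page through the same extraction idea. The markup below is illustrative, not real Walmart HTML, and a plain regex stands in for BeautifulSoup just to keep the demo dependency-free:

```python
import json
import re

# Synthetic product page with a Product JSON-LD block (illustrative markup only)
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Acme Blender",
 "offers": {"price": "49.99", "availability": "https://schema.org/InStock"},
 "aggregateRating": {"ratingValue": "4.6"}}
</script>
</head><body><h1>Acme Blender</h1></body></html>
"""

m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
product = json.loads(m.group(1))

print(product["name"])                                   # Acme Blender
print(float(product["offers"]["price"]))                 # 49.99
print(float(product["aggregateRating"]["ratingValue"]))  # 4.6
```

Notice that no CSS classname appears anywhere: a UI redesign that renames every class leaves this extraction path untouched.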


Scrape a list of product URLs (JSONL output)

Here’s a small end-to-end driver. Save it together with the fetch and parse code above as walmart_scrape.py:

import json

URLS = [
    # Replace with your own Walmart product URLs
    "https://www.walmart.com/ip/123456789",
]


def run(urls: list[str]) -> None:
    with open("walmart_products.jsonl", "w", encoding="utf-8") as f:
        for url in urls:
            try:
                res = fetch(url)
                item = parse_walmart_product(res.text, url=url)
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
                print("ok", url, "->", item.get("title"), item.get("price"))
            except Exception as e:
                print("fail", url, type(e).__name__, str(e)[:200])


if __name__ == "__main__":
    run(URLS)

Run it:

python walmart_scrape.py

Scaling tips (what changes after your first 50 URLs)

When you scale this up, focus on operational correctness:

  • Deduplicate URLs before fetching (store a canonical SKU ID)
  • Persist failures (write failed URLs to a separate file for re-try)
  • Use concurrency carefully (start with 5–10 workers, not 200)
  • Rotate proxies/IPs once you see elevated 403/429 rates
  • Respect cache: re-fetch only when needed (price monitors can be interval-based)

If you add concurrency, keep retries per worker conservative to avoid turning a transient issue into a stampede.
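
Under that caveat, a minimal concurrency sketch: `scrape_one` below is a stand-in for the fetch-plus-parse pipeline from earlier sections, and the URLs are dummies.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_one(url: str) -> dict:
    # Stand-in for fetch() + parse_walmart_product() from the sections above
    if "bad" in url:
        raise RuntimeError("simulated transient failure")
    return {"url": url, "title": f"title for {url}"}


urls = ["https://example.com/ip/1", "https://example.com/ip/bad", "https://example.com/ip/3"]
results, failures = [], []

# Start small: 5-10 workers is usually plenty before you tune further
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(scrape_one, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            results.append(fut.result())
        except Exception as e:
            failures.append({"url": url, "error": type(e).__name__})

print(len(results), len(failures))  # 2 1
```

Writing `failures` to its own file (as suggested above) gives you a clean retry queue for the next run.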


QA checklist

  • For 3–5 URLs, titles match what you see in the browser
  • Price is parsed as a number (float)
  • Availability is not always None
  • JSONL contains one JSON object per line
  • Fetch uses timeouts (no hung processes)
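
The JSONL checks above can be automated with a few lines; `sample` below stands in for the contents of walmart_products.jsonl:

```python
import json

# Two sample rows standing in for the real walmart_products.jsonl output
sample = "\n".join([
    json.dumps({"url": "u1", "title": "A", "price": 19.99, "availability": "InStock", "rating": 4.5}),
    json.dumps({"url": "u2", "title": "B", "price": None, "availability": None, "rating": None}),
])

rows = []
for i, line in enumerate(sample.splitlines(), start=1):
    row = json.loads(line)  # fails loudly on a malformed line
    assert set(row) == {"url", "title", "price", "availability", "rating"}, f"line {i}: unexpected keys"
    assert row["price"] is None or isinstance(row["price"], float), f"line {i}: price should be a float"
    rows.append(row)

print(len(rows), "rows validated")
```

Run it against the real file (swap `sample.splitlines()` for iterating over the open file) after each change to the parser.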

Next upgrades

  • Extract more fields (brand, images, breadcrumbs, shipping)
  • Store results in SQLite/Postgres for incremental updates
  • Add structured logging + metrics for retry rate, error rate, and response time
  • Add an “HTML snapshot” mode for debugging when selectors break

Related guides

How to Scrape LinkedIn Job Postings (Public Jobs) with Python + ProxiesAPI
Collect role, company, location, and posted date from LinkedIn public job pages (no login) using robust HTML parsing, retries, and a clean export format. Includes a real screenshot.
Scrape Product Data from Amazon (with Python + ProxiesAPI)
Extract Amazon product title, price, rating, and availability from a product page using requests + BeautifulSoup, with retries and proxy-backed fetching via ProxiesAPI.
Scrape Book Data from Goodreads (Titles, Authors, Ratings, and Reviews)
A practical Goodreads scraper in Python: collect book title/author/rating count/review count + key metadata using robust selectors, ProxiesAPI in the fetch layer, and export to JSON/CSV.
Scrape Restaurant Data from TripAdvisor (Reviews, Ratings, and Locations)
Build a practical TripAdvisor scraper in Python: discover restaurant listing URLs, extract name/rating/review count/address, and export clean CSV/JSON with ProxiesAPI in the fetch layer.