Scrape Product Prices from Home Depot (Search + Category Pages) with Python + ProxiesAPI

Home Depot is one of those “looks simple, blocks hard” targets.

If you only need a handful of pages, you might get away with plain requests. But as soon as you do search + pagination + category browsing at scale, you’ll see:

  • inconsistent HTML depending on device/region
  • intermittent 403/429
  • “soft blocks” (you get HTML, but it’s a bot page)
  • price formatting differences (sale price, range price, “See lower price in cart”, etc.)

In this guide we’ll build a production-shaped scraper that extracts from listing pages (not individual product detail pages):

  • product name
  • product URL
  • current price (best-effort)
  • basic availability signal (in stock / out of stock / pickup/delivery badge when present)
  • pagination (search and category)

We’ll keep the parsing honest and resilient, and we’ll show where ProxiesAPI fits: in the network layer.

Home Depot listing page (we’ll scrape the product cards)

Keep Home Depot crawls stable with ProxiesAPI

Retail sites rate-limit aggressively and HTML can vary by region/device. ProxiesAPI gives you a reliable proxy layer so your scraper keeps working as your URL count grows.


What we’re scraping (two listing types)

Home Depot has multiple listing surfaces. Two common ones:

  1. Search results

Example (your exact URL will differ by query):

  • https://www.homedepot.com/s/dewalt%20drill

  2. Category pages

Example:

  • https://www.homedepot.com/b/Tools-Power-Tools-Drills/N-5yc1vZc2h8

Both typically render a grid of “product cards”. Our scraper will treat both as “listing pages” and attempt to extract the same fields.

A note on stability

Retail sites change markup often. So instead of betting everything on one fragile selector, we’ll use:

  • multiple extraction strategies (JSON-LD first, then HTML fallbacks)
  • normalization functions for price text
  • a “diagnostics mode” so you can quickly spot when you’re blocked or served a different template
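
The “diagnostics mode” can be as simple as a function that summarizes what kind of HTML came back. A minimal sketch, assuming these marker strings and page signals (they are guesses to tune against responses you actually observe):

```python
import re

# Phrases that often indicate a block page rather than a listing.
# These markers are assumptions -- adjust them to what you observe.
BLOCK_MARKERS = ("access denied", "unusual traffic", "are you a human")


def diagnose_html(html: str) -> dict:
    """Return quick signals about whether a page looks like a real listing."""
    lower = html.lower()
    return {
        "length": len(html),
        "looks_blocked": any(m in lower for m in BLOCK_MARKERS),
        # Listing pages normally contain product links and JSON-LD.
        "has_product_links": "/p/" in html,
        "has_jsonld": "application/ld+json" in html,
        # First ~200 chars, whitespace-collapsed, for eyeballing the template.
        "preview": re.sub(r"\s+", " ", html)[:200],
    }
```

Log this dict per fetch and a template change or soft block shows up immediately in the counters.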

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for parsing

Step 1: A fetch() that’s scraper-friendly

Key rules:

  • use a session (cookies matter)
  • set timeouts
  • send realistic headers
  • detect obvious soft-block pages
  • plug in ProxiesAPI without changing the parsing logic

Option A: Plain requests (works for small tests)

import requests

TIMEOUT = (10, 30)

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}


def fetch_html(url: str) -> str:
    r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text

Option B: Use ProxiesAPI for the same request layer

How you wire ProxiesAPI depends on the exact API shape you have enabled (gateway URL vs proxy host, auth method, etc.). The pattern is always the same:

  • keep fetch_html(url) as your single entry point
  • configure proxies/credentials once
  • retry on transient network errors

Below is a template you can adapt by setting PROXIESAPI_PROXY_URL (for example: http://USER:PASS@proxy.proxiesapi.com:PORT).

import os
import requests

TIMEOUT = (10, 30)

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")


def fetch_html(url: str) -> str:
    proxies = None
    if PROXIESAPI_PROXY_URL:
        proxies = {"http": PROXIESAPI_PROXY_URL, "https": PROXIESAPI_PROXY_URL}

    r = session.get(
        url,
        headers=DEFAULT_HEADERS,
        timeout=TIMEOUT,
        proxies=proxies,
    )

    r.raise_for_status()
    text = r.text

    # basic soft-block detection (keep it conservative)
    lower = text.lower()
    if "access denied" in lower or "unusual traffic" in lower:
        raise RuntimeError("Likely blocked/soft-blocked HTML received")

    return text
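
The bullets above mention retrying on transient errors, but the template itself doesn’t implement retries. One minimal sketch is a wrapper around any fetch callable with exponential backoff plus jitter; the attempt count and delays are assumptions to tune, and you may want to restrict retries to HTTP 403/429/5xx instead of any exception:

```python
import random
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0):
    """Call fetch(url), retrying transient failures with backoff + jitter."""
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:  # consider narrowing to HTTPError/RuntimeError
            last_exc = exc
            if attempt == max_attempts - 1:
                break
            # exponential backoff with jitter proportional to base_delay
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
    raise last_exc
```

Usage stays one line: fetch_with_retries(fetch_html, url), so the parsing code never changes.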

Step 2: Extract products from listing HTML

Home Depot pages often embed structured data. When present, JSON-LD is usually the most stable way to get name + URL + price.

We’ll implement:

  1. extract_from_jsonld(soup) – best-case
  2. extract_from_cards(soup) – HTML fallback

Helpers: price parsing

import re


def parse_price(text: str) -> float | None:
    """Extract a float price from text like '$199.00' or '199' or '199.00'."""
    if not text:
        return None

    # remove commas so "$1,299.00" parses cleanly
    t = text.replace(",", "")

    # prefer a $-anchored amount so surrounding text like "20V" isn't
    # mistaken for a price; fall back to any bare number
    m = re.search(r"\$\s*(\d+(?:\.\d{1,2})?)", t)
    if not m:
        m = re.search(r"(\d+(?:\.\d{1,2})?)", t)
    return float(m.group(1)) if m else None
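
Listing pages also show range prices (“$199.00 - $249.00”) and cart-only placeholders (“See Lower Price in Cart”), as noted earlier. A sketch of a richer normalizer that returns a (low, high) tuple and treats cart-only prices as unknown; the placeholder phrases are assumptions:

```python
import re

# Assumed placeholder phrases -- verify against real listing text.
CART_ONLY_PHRASES = ("see lower price in cart", "see low price in cart")


def parse_price_detailed(text: str):
    """Return (low, high) for a range, (p, p) for a single price, else None."""
    if not text:
        return None
    lower = text.lower()
    if any(p in lower for p in CART_ONLY_PHRASES):
        return None  # price hidden until cart; treat as unknown
    t = text.replace(",", "")
    nums = [float(m) for m in re.findall(r"\$\s*(\d+(?:\.\d{1,2})?)", t)]
    if not nums:
        return None
    return (min(nums), max(nums))
```

Downstream you can store low and high as separate columns and flag rows where they differ.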

Strategy 1: JSON-LD

import json
from bs4 import BeautifulSoup


def extract_products_from_jsonld(soup: BeautifulSoup) -> list[dict]:
    products = []

    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.get_text(" ", strip=True)
        if not raw:
            continue

        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue

        # JSON-LD can be a dict or list of dicts
        nodes = data if isinstance(data, list) else [data]

        for node in nodes:
            if not isinstance(node, dict):
                continue

            # Some pages embed ItemList → itemListElement
            if node.get("@type") == "ItemList" and isinstance(node.get("itemListElement"), list):
                for el in node["itemListElement"]:
                    item = el.get("item") if isinstance(el, dict) else None
                    if isinstance(item, dict) and item.get("@type") in ("Product", "Offer"):
                        products.append(item)
                continue

            # Some pages embed Product directly
            if node.get("@type") == "Product":
                products.append(node)

    out = []
    for p in products:
        name = p.get("name")
        url = p.get("url")

        price = None
        availability = None

        offers = p.get("offers")
        if isinstance(offers, dict):
            price = offers.get("price")
            availability = offers.get("availability")
        elif isinstance(offers, list) and offers:
            # pick the first offer with price
            for off in offers:
                if isinstance(off, dict) and off.get("price") is not None:
                    price = off.get("price")
                    availability = off.get("availability")
                    break

        # normalize price if it's a string
        if isinstance(price, str):
            price = parse_price(price)

        if name and url:
            out.append({
                "name": name,
                "url": url,
                "price": float(price) if isinstance(price, (int, float)) else None,
                "availability": availability,
                "source": "jsonld",
            })

    # de-dupe by url
    seen = set()
    deduped = []
    for item in out:
        u = item["url"]
        if u in seen:
            continue
        seen.add(u)
        deduped.append(item)

    return deduped
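
To make the shapes this function handles concrete, here is a synthetic ItemList payload walked with the same itemListElement → item logic. The structure is illustrative only, not a captured Home Depot response:

```python
import json

# Synthetic JSON-LD ItemList -- illustrative, not real Home Depot output.
raw = json.dumps({
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "position": 1, "item": {
            "@type": "Product",
            "name": "Example 20V Drill",
            "url": "https://www.homedepot.com/p/example/123",
            "offers": {"@type": "Offer", "price": "99.00",
                       "availability": "https://schema.org/InStock"},
        }},
    ],
})

node = json.loads(raw)
# same walk as extract_products_from_jsonld: ItemList -> itemListElement -> item
items = [el["item"] for el in node["itemListElement"] if "item" in el]
offer = items[0]["offers"]
```

Note that price arrives as a string here, which is why the extractor normalizes it before casting to float.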

Strategy 2: HTML product cards (fallback)

This is intentionally conservative: we look for anchors that look like product links and try to find nearby price text.

from bs4 import BeautifulSoup


def extract_products_from_cards(soup: BeautifulSoup) -> list[dict]:
    out = []

    # Common pattern: product card links are often /p/…
    for a in soup.select('a[href*="/p/"]'):
        href = a.get("href")
        if not href:
            continue

        url = href
        if url.startswith("/"):
            url = "https://www.homedepot.com" + url

        name = a.get_text(" ", strip=True)
        if not name or len(name) < 5:
            continue

        # look a few levels above the anchor for price-ish text,
        # stopping early if we run out of parents
        card = a
        for _ in range(4):
            if card.parent is None:
                break
            card = card.parent

        text = card.get_text(" ", strip=True)
        price = parse_price(text)

        out.append({
            "name": name,
            "url": url,
            "price": price,
            "availability": None,
            "source": "html",
        })

    # de-dupe by url
    seen = set()
    deduped = []
    for item in out:
        if item["url"] in seen:
            continue
        seen.add(item["url"])
        deduped.append(item)

    return deduped

Combine strategies

from bs4 import BeautifulSoup


def extract_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    products = extract_products_from_jsonld(soup)
    if products:
        return products

    return extract_products_from_cards(soup)

Step 3: Pagination (search + category)

Home Depot pagination patterns vary. A safe approach is:

  1. fetch the first page
  2. parse products
  3. find “next page” link (if present)
  4. repeat until max_pages or no new URL

from urllib.parse import urljoin


def find_next_page(soup: BeautifulSoup, current_url: str) -> str | None:
    # Many listing pages include a rel="next" link
    link = soup.select_one('link[rel="next"]')
    if link and link.get("href"):
        return urljoin(current_url, link["href"])

    # Fallback: anchor with aria-label mentioning Next
    a = soup.select_one('a[aria-label*="Next"], a[aria-label*="next"]')
    if a and a.get("href"):
        return urljoin(current_url, a["href"])

    return None
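
When no next link is present, Home Depot URLs often paginate via an offset query parameter instead (commonly Nao, in multiples of 24 per page — treat both the name and the page size as assumptions and verify against live URLs). A sketch that builds offset URLs as a fallback:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def with_offset(url: str, offset: int, param: str = "Nao") -> str:
    """Return url with the pagination offset parameter set or replaced."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query[param] = [str(offset)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def offset_pages(start_url: str, pages: int, page_size: int = 24):
    """Yield URLs for pages 1..pages, assuming a fixed products-per-page."""
    for i in range(pages):
        yield start_url if i == 0 else with_offset(start_url, i * page_size)
```

You can try find_next_page first and fall back to offset_pages only when it returns None for a page you know has more results.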


def crawl_listing(start_url: str, max_pages: int = 5) -> list[dict]:
    all_items = []
    seen_urls = set()

    url = start_url

    for page in range(1, max_pages + 1):
        html = fetch_html(url)
        soup = BeautifulSoup(html, "lxml")

        batch = extract_products(html)

        added = 0
        for item in batch:
            u = item.get("url")
            if not u or u in seen_urls:
                continue
            seen_urls.add(u)
            all_items.append(item)
            added += 1

        print(f"page {page}: batch={len(batch)} added={added} total={len(all_items)} url={url}")

        next_url = find_next_page(soup, url)
        if not next_url:
            break

        url = next_url

    return all_items

Run it (search)

if __name__ == "__main__":
    start = "https://www.homedepot.com/s/dewalt%20drill"
    items = crawl_listing(start, max_pages=3)
    print("items:", len(items))
    print(items[:3])

Run it (category)

if __name__ == "__main__":
    start = "https://www.homedepot.com/b/Tools-Power-Tools-Drills/N-5yc1vZc2h8"
    items = crawl_listing(start, max_pages=3)
    print("items:", len(items))

Export: CSV (for price monitoring)

import csv


def to_csv(items: list[dict], path: str = "home_depot_products.csv"):
    fieldnames = ["name", "price", "availability", "url", "source"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for it in items:
            w.writerow({k: it.get(k) for k in fieldnames})


# usage:
# to_csv(items)
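
For price monitoring over time, append-only JSON Lines with a timestamp per run can be more convenient than overwriting a CSV, since repeated crawls accumulate history. A sketch (the filename is an arbitrary choice):

```python
import json
from datetime import datetime, timezone


def append_jsonl(items, path="home_depot_prices.jsonl"):
    """Append one JSON object per product, stamped with the crawl time (UTC)."""
    ts = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        for it in items:
            row = dict(it, scraped_at=ts)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Each run appends rows, so grouping by url and sorting by scraped_at gives you a price series per product.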

Practical advice (what keeps this working)

  • Prefer JSON-LD when it exists. It’s designed for machines.
  • Keep your scraper tolerant of missing prices (some prices are “in cart” or personalized).
  • Add a “blocked HTML” detector and log the first ~2000 chars when it happens.
  • Don’t hammer pages. Crawl like a human: moderate concurrency, jitter, retries.
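
The “jitter” advice can be a one-liner between requests. A minimal sketch, with the base delay and jitter window as assumptions you should scale to your crawl volume:

```python
import random
import time


def polite_sleep(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep a randomized, human-ish interval between requests; return the delay."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Call polite_sleep() once per page inside the crawl loop; the randomness avoids the perfectly regular request spacing that rate limiters flag.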

Where ProxiesAPI fits (no overclaims)

Home Depot has sophisticated bot mitigation. Proxies alone don’t guarantee success.

What ProxiesAPI does help with is the boring part of scraping at scale:

  • reducing IP-based rate limits across many URLs
  • making retries more effective
  • stabilizing crawl runs when volume increases

Keep the rest of your system solid: good parsing, good logging, conservative crawl behavior.


QA checklist

  • Scraper returns at least 10 products for a common search (e.g. “dewalt drill”)
  • URLs look like real product pages (/p/...)
  • Price parses into floats for most items
  • Pagination stops when “next” disappears
  • When blocked, you see a clear error and can retry via ProxiesAPI
