Scrape Product Prices from Home Depot (Search + Category Pages) with Python + ProxiesAPI
Home Depot is one of those “looks simple, blocks hard” targets.
If you only need a handful of pages, you might get away with plain requests. But as soon as you do search + pagination + category browsing at scale, you’ll see:
- inconsistent HTML depending on device/region
- intermittent 403/429
- “soft blocks” (you get HTML, but it’s a bot page)
- price formatting differences (sale price, range price, “See lower price in cart”, etc.)
In this guide we’ll build a production-shaped scraper that extracts from listing pages (not individual product detail pages):
- product name
- product URL
- current price (best-effort)
- basic availability signal (in stock / out of stock / pickup/delivery badge when present)
- pagination (search and category)
We’ll keep the parsing honest and resilient, and we’ll show where ProxiesAPI fits: in the network layer.

Retail sites rate-limit aggressively and HTML can vary by region/device. ProxiesAPI gives you a reliable proxy layer so your scraper keeps working as your URL count grows.
What we’re scraping (two listing types)
Home Depot has multiple listing surfaces. Two common ones:
- Search results
Example (your exact URL will differ by query):
https://www.homedepot.com/s/dewalt%20drill
- Category pages
Example:
https://www.homedepot.com/b/Tools-Power-Tools-Drills/N-5yc1vZc2h8
Both typically render a grid of “product cards”. Our scraper will treat both as “listing pages” and attempt to extract the same fields.
A note on stability
Retail sites change markup often. So instead of betting everything on one fragile selector, we’ll use:
- multiple extraction strategies (JSON-LD first, then HTML fallbacks)
- normalization functions for price text
- a “diagnostics mode” so you can quickly spot when you’re blocked or served a different template
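That diagnostics mode doesn’t need to be fancy. Here is a minimal sketch, assuming you’re happy writing snapshots to a local debug_html/ folder; the helper name and the ~2,000-char preview are our choices, not part of any standard API:

import pathlib
import time

DEBUG_DIR = pathlib.Path("debug_html")

def save_debug_snapshot(html: str, label: str = "page") -> None:
    """Dump suspicious HTML to disk and print a short preview for triage."""
    DEBUG_DIR.mkdir(exist_ok=True)
    path = DEBUG_DIR / f"{label}-{int(time.time())}.html"
    path.write_text(html, encoding="utf-8")
    # the first ~2000 chars are usually enough to recognize a bot/interstitial page
    print(f"[debug] saved {path}; preview:\n{html[:2000]}")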
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
Step 1: A fetch() that’s scraper-friendly
Key rules:
- use a session (cookies matter)
- set timeouts
- send realistic headers
- detect obvious soft-block pages
- plug in ProxiesAPI without changing the parsing logic
Option A: Plain requests (works for small tests)
import requests

TIMEOUT = (10, 30)  # (connect, read) seconds

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}

def fetch_html(url: str) -> str:
    r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text
Option B: Use ProxiesAPI for the same request layer
How you wire ProxiesAPI depends on the exact API shape you have enabled (gateway URL vs proxy host, auth method, etc.). The pattern is always the same:
- keep fetch_html(url) as your single entry point
- configure proxies/credentials once
- retry on transient network errors
Below is a template you can adapt by setting PROXIESAPI_PROXY_URL (for example: http://USER:PASS@proxy.proxiesapi.com:PORT).
import os
import requests

TIMEOUT = (10, 30)

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# e.g. http://USER:PASS@proxy.proxiesapi.com:PORT
PROXIESAPI_PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")

def fetch_html(url: str) -> str:
    proxies = None
    if PROXIESAPI_PROXY_URL:
        proxies = {"http": PROXIESAPI_PROXY_URL, "https": PROXIESAPI_PROXY_URL}
    r = session.get(
        url,
        headers=DEFAULT_HEADERS,
        timeout=TIMEOUT,
        proxies=proxies,
    )
    r.raise_for_status()
    text = r.text
    # basic soft-block detection (keep it conservative)
    lower = text.lower()
    if "access denied" in lower or "unusual traffic" in lower:
        raise RuntimeError("Likely blocked/soft-blocked HTML received")
    return text
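The “retry on transient network errors” part is easiest as a wrapper around fetch_html, so the parsing code never sees a flaky request. A minimal sketch; the status-code set, attempt count, and backoff curve are starting points, not gospel:

import random
import time
import requests

# 403/429 can be intermittent on this target (see the intro), so retry them
TRANSIENT_STATUS = {403, 429, 500, 502, 503, 504}

def fetch_html_with_retries(url: str, attempts: int = 3) -> str:
    last_err: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch_html(url)
        except requests.HTTPError as e:
            status = e.response.status_code if e.response is not None else None
            if status not in TRANSIENT_STATUS:
                raise  # e.g. a 404 won't improve on retry
            last_err = e
        except (requests.ConnectionError, requests.Timeout, RuntimeError) as e:
            # RuntimeError is our soft-block signal from fetch_html above
            last_err = e
        # exponential backoff with jitter before the next attempt
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_err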
Step 2: Extract products from listing HTML
Home Depot pages often embed structured data. When present, JSON-LD is usually the most stable way to get name + URL + price.
We’ll implement:
- extract_products_from_jsonld(soup) – best case
- extract_products_from_cards(soup) – HTML fallback
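To make the parsing logic below concrete, here is roughly the ItemList shape we’re coding against. This is an illustrative mock, not a capture from a live Home Depot page; real payloads vary:

EXAMPLE_JSONLD = {
    "@context": "https://schema.org",
    "@type": "ItemList",
    "itemListElement": [
        {
            "@type": "ListItem",
            "position": 1,
            "item": {
                "@type": "Product",
                "name": "Example 20V Drill/Driver Kit",
                "url": "https://www.homedepot.com/p/example/123456789",
                "offers": {
                    "@type": "Offer",
                    "price": "199.00",
                    "availability": "https://schema.org/InStock",
                },
            },
        },
    ],
}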
Helpers: price parsing
import re

def parse_price(text: str) -> float | None:
    """Extract a float price from text like '$199.00' or '199' or '199.00'."""
    if not text:
        return None
    # strip commas so '1,299.00' matches cleanly; currency symbols are ignored by the regex
    t = text.replace(",", "")
    m = re.search(r"(\d+(?:\.\d{1,2})?)", t)
    return float(m.group(1)) if m else None
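A few quick sanity checks (made-up inputs that mirror the price formats called out at the top):

assert parse_price("$199.00") == 199.0
assert parse_price("$1,299.00") == 1299.0              # comma stripped before matching
assert parse_price("See lower price in cart") is None  # no digits → None
assert parse_price("") is None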
Strategy 1: JSON-LD
import json
from bs4 import BeautifulSoup

def extract_products_from_jsonld(soup: BeautifulSoup) -> list[dict]:
    products = []
    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.get_text(" ", strip=True)
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # JSON-LD can be a dict or a list of dicts
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            # Some pages embed ItemList → itemListElement
            if node.get("@type") == "ItemList" and isinstance(node.get("itemListElement"), list):
                for el in node["itemListElement"]:
                    item = el.get("item") if isinstance(el, dict) else None
                    if isinstance(item, dict) and item.get("@type") in ("Product", "Offer"):
                        products.append(item)
                continue
            # Some pages embed Product directly
            if node.get("@type") == "Product":
                products.append(node)

    out = []
    for p in products:
        name = p.get("name")
        url = p.get("url")
        price = None
        availability = None
        offers = p.get("offers")
        if isinstance(offers, dict):
            price = offers.get("price")
            availability = offers.get("availability")
        elif isinstance(offers, list) and offers:
            # pick the first offer that carries a price
            for off in offers:
                if isinstance(off, dict) and off.get("price") is not None:
                    price = off.get("price")
                    availability = off.get("availability")
                    break
        # normalize price if it's a string
        if isinstance(price, str):
            price = parse_price(price)
        if name and url:
            out.append({
                "name": name,
                "url": url,
                "price": float(price) if isinstance(price, (int, float)) else None,
                "availability": availability,
                "source": "jsonld",
            })

    # de-dupe by url
    seen = set()
    deduped = []
    for item in out:
        u = item["url"]
        if u in seen:
            continue
        seen.add(u)
        deduped.append(item)
    return deduped
Strategy 2: HTML product cards (fallback)
This is intentionally conservative: we look for anchors that look like product links and try to find nearby price text.
from bs4 import BeautifulSoup

def extract_products_from_cards(soup: BeautifulSoup) -> list[dict]:
    out = []
    # Common pattern: product card links are often /p/…
    for a in soup.select('a[href*="/p/"]'):
        href = a.get("href")
        if not href:
            continue
        url = href
        if url.startswith("/"):
            url = "https://www.homedepot.com" + url
        name = a.get_text(" ", strip=True)
        if not name or len(name) < 5:
            continue
        # walk up a few levels so the card's price text is in scope
        card = a
        for _ in range(4):
            if card.parent:
                card = card.parent
        # best-effort: take the first number in the card text; can misfire on busy cards
        text = card.get_text(" ", strip=True)
        price = parse_price(text)
        out.append({
            "name": name,
            "url": url,
            "price": price,
            "availability": None,
            "source": "html",
        })

    # de-dupe by url
    seen = set()
    deduped = []
    for item in out:
        if item["url"] in seen:
            continue
        seen.add(item["url"])
        deduped.append(item)
    return deduped
Combine strategies
from bs4 import BeautifulSoup

def extract_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    products = extract_products_from_jsonld(soup)
    if products:
        return products
    return extract_products_from_cards(soup)
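A quick way to verify it end-to-end, and to see which strategy fired (reusing the search URL from earlier):

html = fetch_html("https://www.homedepot.com/s/dewalt%20drill")
products = extract_products(html)
print(len(products), "products via", products[0]["source"] if products else "n/a")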
Step 3: Pagination (search + category)
Home Depot pagination patterns vary. A safe approach is:
- fetch the first page
- parse products
- find “next page” link (if present)
- repeat until max_pages or no new URL
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_next_page(soup: BeautifulSoup, current_url: str) -> str | None:
    # Many listing pages include a rel="next" link
    link = soup.select_one('link[rel="next"]')
    if link and link.get("href"):
        return urljoin(current_url, link["href"])
    # Fallback: anchor with an aria-label mentioning Next
    a = soup.select_one('a[aria-label*="Next"], a[aria-label*="next"]')
    if a and a.get("href"):
        return urljoin(current_url, a["href"])
    return None

def crawl_listing(start_url: str, max_pages: int = 5) -> list[dict]:
    all_items = []
    seen_urls = set()
    url = start_url
    for page in range(1, max_pages + 1):
        html = fetch_html(url)
        soup = BeautifulSoup(html, "lxml")
        batch = extract_products(html)
        added = 0
        for item in batch:
            u = item.get("url")
            if not u or u in seen_urls:
                continue
            seen_urls.add(u)
            all_items.append(item)
            added += 1
        print(f"page {page}: batch={len(batch)} added={added} total={len(all_items)} url={url}")
        next_url = find_next_page(soup, url)
        if not next_url:
            break
        url = next_url
    return all_items
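If both selectors ever come up empty, a last-resort fallback is building the next URL yourself. Home Depot category pages have historically used an Nao offset query parameter with 24 items per page, but we haven’t verified that here, so treat this sketch purely as a template and confirm the parameter against live pages first:

from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

PAGE_SIZE = 24  # assumption: items per listing page; verify on live pages

def next_page_by_offset(current_url: str) -> str:
    """Fallback: bump the 'Nao' offset parameter (unverified assumption)."""
    parts = urlparse(current_url)
    qs = parse_qs(parts.query)
    offset = int(qs.get("Nao", ["0"])[0]) + PAGE_SIZE
    qs["Nao"] = [str(offset)]
    return urlunparse(parts._replace(query=urlencode(qs, doseq=True)))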
Run it (search)
if __name__ == "__main__":
    start = "https://www.homedepot.com/s/dewalt%20drill"
    items = crawl_listing(start, max_pages=3)
    print("items:", len(items))
    print(items[:3])
Run it (category)
if __name__ == "__main__":
    start = "https://www.homedepot.com/b/Tools-Power-Tools-Drills/N-5yc1vZc2h8"
    items = crawl_listing(start, max_pages=3)
    print("items:", len(items))
Export: CSV (for price monitoring)
import csv

def to_csv(items: list[dict], path: str = "home_depot_products.csv"):
    fieldnames = ["name", "price", "availability", "url", "source"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for it in items:
            w.writerow({k: it.get(k) for k in fieldnames})

# usage:
# to_csv(items)
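If the data feeds a pipeline rather than a spreadsheet, JSON Lines is a handy alternative with the same fields (one object per line):

import json

def to_jsonl(items: list[dict], path: str = "home_depot_products.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for it in items:
            f.write(json.dumps(it, ensure_ascii=False) + "\n")

# usage:
# to_jsonl(items)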
Practical advice (what keeps this working)
- Prefer JSON-LD when it exists. It’s designed for machines.
- Keep your scraper tolerant of missing prices (some prices are “in cart” or personalized).
- Add a “blocked HTML” detector and log the first ~2000 chars when it happens.
- Don’t hammer pages. Crawl like a human: moderate concurrency, jitter, retries.
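The “jitter” part of that last item can be a one-liner helper; the 2–6 second bounds below are arbitrary, so tune them to your volume:

import random
import time

def polite_sleep(min_s: float = 2.0, max_s: float = 6.0) -> None:
    """Random delay so page fetches don't arrive on a metronome."""
    time.sleep(random.uniform(min_s, max_s))

# in crawl_listing, call polite_sleep() after each fetch_html(url)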
Where ProxiesAPI fits (no overclaims)
Home Depot has sophisticated bot mitigation. Proxies alone don’t guarantee success.
What ProxiesAPI does help with is the boring part of scraping at scale:
- reducing IP-based rate limits across many URLs
- making retries more effective
- stabilizing crawl runs when volume increases
Keep the rest of your system solid: good parsing, good logging, conservative crawl behavior.
QA checklist
- Scraper returns at least 10 products for a common search (e.g. “dewalt drill”)
- URLs look like real product pages (/p/...)
- Price parses into floats for most items
- Pagination stops when “next” disappears
- When blocked, you see a clear error and can retry via ProxiesAPI
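If you want that checklist runnable, a rough smoke test might look like the sketch below. It hits the live site, so run it sparingly; the >= 10 threshold comes straight from the first item:

def smoke_test():
    # first checklist item: a common search should yield at least 10 products
    items = crawl_listing("https://www.homedepot.com/s/dewalt%20drill", max_pages=1)
    assert len(items) >= 10, f"expected >= 10 products, got {len(items)}"
    # second item: URLs should look like real product pages
    assert all("/p/" in it["url"] for it in items), "unexpected URL shape"
    # third item: most prices should parse; require at least one to catch regressions
    priced = [it for it in items if isinstance(it["price"], float)]
    assert priced, "no prices parsed at all"
    print(f"OK: {len(items)} items, {len(priced)} priced")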