Scrape Product Data from Amazon (with Python + ProxiesAPI)
Amazon product pages are a classic scraping target — and also one of the quickest places to get blocked if you hammer requests from a single IP.
In this tutorial we’ll build a real Python scraper that extracts:
- product title
- price (best-effort, from common price blocks)
- rating (stars)
- review count
- availability / stock text
- canonical product URL
We’ll use requests + BeautifulSoup (server-rendered HTML parsing) and we’ll show exactly where ProxiesAPI fits in the network layer.
Amazon is sensitive to repeated requests. ProxiesAPI gives you a simple, stable way to proxy your HTTP fetches so your scraper fails less as you scale your URL count.
Important notes (so your scraper doesn’t break instantly)
- HTML varies by locale and experiments. Amazon A/B tests markup frequently.
- Don’t rely on a single selector. Use a fallback chain.
- Send a realistic User-Agent + Accept-Language. It reduces “robot page” responses.
- Expect intermittent failures. Build retries + backoff from day one.
Also: this guide does not claim to bypass protected challenges. If you receive a challenge page, treat it as a failed fetch and move on.
Quick sanity check: fetch HTML
Pick a single product URL (example):
https://www.amazon.com/dp/B0C7W6G2Q2
Try fetching headers-only first:
```bash
curl -I "https://www.amazon.com/dp/B0C7W6G2Q2" | head
```
If you get HTML, we’re good. If you get an interstitial or “robot check”, that’s exactly why we’ll add proxy-backed fetching and retries.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
Step 1: A robust fetch() (timeouts, headers, retries)
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # (connect, read) seconds

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}

session = requests.Session()


def fetch_direct(url: str) -> str:
    r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch_with_proxiesapi(url: str, api_key: str) -> str:
    # ProxiesAPI URL format used throughout this guide
    proxied = f"http://api.proxiesapi.com/?key={quote(api_key)}&url={quote(url, safe='')}"
    r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch(url: str, api_key: str | None = None, retries: int = 4) -> str:
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            html = fetch_with_proxiesapi(url, api_key) if api_key else fetch_direct(url)
            # Very lightweight "challenge-ish" detection
            lowered = html.lower()
            if "captcha" in lowered and "amazon" in lowered:
                raise RuntimeError("Possible robot check/captcha page")
            return html
        except Exception as e:
            last_err = e
            sleep_s = (2 ** attempt) + random.random()  # exponential backoff + jitter
            print(f"attempt {attempt}/{retries} failed: {e}. sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)
    raise RuntimeError(f"Failed to fetch after {retries} attempts: {last_err}")
```
ProxiesAPI curl (same URL)
```bash
API_KEY="YOUR_KEY"
URL="https://www.amazon.com/dp/B0C7W6G2Q2"
# -G + --data-urlencode keeps the target URL safely encoded,
# even if it contains its own query string
curl -sG "http://api.proxiesapi.com/" \
  --data-urlencode "key=$API_KEY" \
  --data-urlencode "url=$URL" | head -n 20
```
Step 2: Understand the page structure (selectors that actually exist)
Across many Amazon product pages, these are common IDs/classes:
- Title: `#productTitle`
- Price blocks (varies): `#priceblock_ourprice` (older), `#priceblock_dealprice` (older), `span.a-price > span.a-offscreen` (common modern)
- Rating: `span[data-hook="rating-out-of-text"]` or `i[data-hook="average-star-rating"] span`
- Review count: `span[data-hook="total-review-count"]`
- Availability: `#availability span`
- Canonical URL: `link[rel="canonical"]`
We’ll code with fallbacks so you don’t lose everything when one selector changes.
Step 3: Parse product fields with fallbacks
```python
import re

from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None


def parse_price(text: str | None) -> float | None:
    if not text:
        return None
    # Handles "$1,299.99", "₹1,299.99", etc.
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))


def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = first_text(soup, [
        "#productTitle",
        "h1#title span#productTitle",
        "h1 span.a-size-large",
    ])
    price_text = first_text(soup, [
        "#priceblock_ourprice",
        "#priceblock_dealprice",
        "span.a-price span.a-offscreen",
        "span.apexPriceToPay span.a-offscreen",
    ])
    rating_text = first_text(soup, [
        'span[data-hook="rating-out-of-text"]',
        'i[data-hook="average-star-rating"] span',
        "span.a-icon-alt",
    ])
    review_count = first_text(soup, [
        'span[data-hook="total-review-count"]',
        "#acrCustomerReviewText",
    ])
    availability = first_text(soup, [
        "#availability span",
        "#availability",
    ])
    canonical = None
    can = soup.select_one('link[rel="canonical"]')
    if can:
        canonical = can.get("href")
    return {
        "title": title,
        "price_text": price_text,
        "price": parse_price(price_text),
        "rating_text": rating_text,
        "review_count_text": review_count,
        "availability": availability,
        "canonical_url": canonical,
    }
```
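Before wiring everything together, it helps to build confidence in `parse_price()` by running it against a few currency strings. The function below repeats Step 3's `parse_price` verbatim so the snippet runs standalone:

```python
import re


def parse_price(text):
    # Same logic as Step 3's parse_price, repeated so this snippet runs standalone.
    if not text:
        return None
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))


print(parse_price("$1,299.99"))   # 1299.99
print(parse_price("₹74,999"))     # 74999.0
print(parse_price("From $9.99"))  # 9.99
print(parse_price(None))          # None
```

Note it grabs the first number-like token, so a string like "From $9.99" yields the leading price — acceptable for best-effort extraction, but worth knowing.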
Step 4: End-to-end script (fetch → parse → print JSON)
Create amazon_product_scrape.py:
```python
import json

# fetch() and parse_product() from Steps 1 and 3 should live in this file
# (or be imported from wherever you defined them).

PRODUCT_URL = "https://www.amazon.com/dp/B0C7W6G2Q2"

# Set to None to fetch direct (more likely to fail at scale)
PROXIESAPI_KEY = None  # "YOUR_KEY"

html = fetch(PRODUCT_URL, api_key=PROXIESAPI_KEY)
product = parse_product(html)
print(json.dumps(product, ensure_ascii=False, indent=2))
```
Example output (typical)
```json
{
  "title": "...",
  "price_text": "$...",
  "price": 1299.99,
  "rating_text": "4.6 out of 5 stars",
  "review_count_text": "2,341",
  "availability": "In Stock.",
  "canonical_url": "https://www.amazon.com/dp/B0C7W6G2Q2"
}
```
Practical tips for scraping Amazon without constant breakage
- Only scrape what you need. Every extra request increases block probability.
- Add caching. If you re-run often, store raw HTML and re-parse locally.
- Backoff on 503/429. Don’t retry aggressively.
- Rotate targets. Don’t crawl thousands of items from one category page in one burst.
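The caching tip can be sketched as a small disk cache keyed by a hash of the URL. This is a minimal sketch, not part of the guide's scraper: the cache directory name and 24-hour TTL are arbitrary choices, and `fetcher` is any callable such as Step 1's `fetch()`.

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path(".html_cache")  # arbitrary location; pick what suits your project
CACHE_TTL = 24 * 3600            # re-fetch after 24 hours


def cached_fetch(url, fetcher, ttl=CACHE_TTL):
    """Return cached HTML for url if fresh enough, else call fetcher(url) and cache it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists() and (time.time() - path.stat().st_mtime) < ttl:
        return path.read_text(encoding="utf-8")
    html = fetcher(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Usage with the Step 1 fetcher would look like `cached_fetch(PRODUCT_URL, lambda u: fetch(u, api_key=PROXIESAPI_KEY))` — re-runs then re-parse locally instead of hitting Amazon again.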
Where ProxiesAPI fits (honestly)
Your scraper’s biggest enemy is not BeautifulSoup — it’s network instability (temporary blocks, throttling, inconsistent responses).
ProxiesAPI fits as a simple drop-in for the HTTP fetch layer:
```bash
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.amazon.com/dp/B0C7W6G2Q2"
```
If you keep everything else identical (timeouts, headers, retries), proxy-backed fetching tends to produce fewer dead runs when you scale from “1 URL” to “1,000 URLs”.
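Scaling from one URL to many mostly means adding pacing and per-URL error isolation around the same fetch-and-parse pair. A sketch under those assumptions (`fetch_fn`/`parse_fn` stand in for Step 1's `fetch()` and Step 3's `parse_product()`; the delay range is an arbitrary choice):

```python
import random
import time


def scrape_many(urls, fetch_fn, parse_fn, delay_range=(2.0, 6.0)):
    """Fetch and parse each URL, isolating failures so one bad page
    doesn't kill the whole run. Returns (results, failures)."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(parse_fn(fetch_fn(url)))
        except Exception as e:
            failures.append((url, str(e)))
        time.sleep(random.uniform(*delay_range))  # pacing between requests
    return results, failures
```

Called as `scrape_many(urls, lambda u: fetch(u, api_key=PROXIESAPI_KEY), parse_product)`, this gives you a failure list to retry later instead of a crashed run.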
QA checklist
- Title is non-empty for 3 different products
- Price parsing works for at least one product with a visible price
- Rating + review count are present when the product has reviews
- Availability reads correctly for in-stock and out-of-stock examples
- `fetch()` uses timeouts and retries