How to Scrape Amazon Product Data, Reviews, and Prices
Amazon is the canonical e-commerce scraping target:
- huge product catalog
- lots of structured data (prices, ratings, bullets)
- reviews and Q&A that are useful for research
It’s also a site where naive scraping fails quickly.
In this guide we’ll build a scraper that:
- fetches a product page via ProxiesAPI
- extracts product fields (title, price, rating, review count)
- follows the “See all reviews” flow
- paginates review pages and extracts review rows
- exports everything to JSON
We’ll keep the claims honest: Amazon search pages are commonly blocked, and even product pages can intermittently fail. But with a good fetch layer + conservative crawling, you can pull a lot of useful data.
Amazon is sensitive to automation. ProxiesAPI helps reduce fetch failures and gives you a consistent request surface while you focus on parsing and data quality.
What we’re scraping (URLs)
We’ll focus on two URL types:
1) Product detail page
Typical pattern:
https://www.amazon.com/dp/ASIN
https://www.amazon.com/gp/product/ASIN
Where ASIN is the product’s 10-character alphanumeric ID.
2) Reviews pages
Common patterns:
https://www.amazon.com/product-reviews/ASIN
https://www.amazon.com/product-reviews/ASIN/?pageNumber=2
We’ll use those because they’re relatively stable.
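Going the other way, pulling the ASIN out of a product URL, is often handy too. A minimal sketch; the regex covers just the two patterns above, so treat it as an assumption rather than an exhaustive rule:
import re

def extract_asin(url: str) -> str | None:
    # Matches /dp/ASIN or /gp/product/ASIN; ASINs are 10 alphanumeric characters.
    m = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return m.group(1) if m else None

# extract_asin("https://www.amazon.com/dp/B00ZV9RDKK") -> "B00ZV9RDKK"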
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
(The snippets below use modern type hints like str | None, so Python 3.10+ is assumed.)
Step 1: Build a ProxiesAPI fetch helper
Canonical request:
curl -s "http://api.proxiesapi.com/?key=API_KEY&url=https://www.amazon.com/dp/B000000000" | head
Python helper (with conservative retries + block detection):
import time
import random
import requests
from urllib.parse import quote_plus
TIMEOUT = (10, 60)  # (connect, read) timeouts in seconds
def proxiesapi_url(target_url: str, api_key: str) -> str:
return f"http://api.proxiesapi.com/?key={quote_plus(api_key)}&url={quote_plus(target_url)}"
def looks_blocked(html: str) -> bool:
t = (html or "").lower()
# Amazon often returns a robot-check page or a minimal error page
markers = [
"robot check",
"enter the characters you see below",
"sorry, we just need to make sure you're not a robot",
"type the characters",
"to discuss automated access",
"captcha",
]
return any(m in t for m in markers)
def fetch_html(target_url: str, api_key: str, *, max_attempts: int = 6) -> str | None:
session = requests.Session()
headers = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
last_err = None
for attempt in range(1, max_attempts + 1):
try:
url = proxiesapi_url(target_url, api_key)
r = session.get(url, timeout=TIMEOUT, headers=headers)
if r.status_code >= 400:
raise requests.HTTPError(f"HTTP {r.status_code}")
html = r.text
if looks_blocked(html):
raise RuntimeError("blocked/captcha detected")
return html
except Exception as e:
last_err = e
sleep_s = min(40, (2 ** attempt)) + random.random()
time.sleep(sleep_s)
print("failed:", last_err)
return None
Step 2: Parse product data (real selectors)
Amazon’s HTML varies by locale and by ongoing experiments, so we’ll use a layered selector strategy:
- try several common ids for the same field
- fall back to OpenGraph / meta tags when possible
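Here’s what that fallback could look like as a minimal sketch you could wire into the parser below (an assumption on our part: not every Amazon page ships OpenGraph tags, so treat it as a last resort):
def meta_title(soup) -> str | None:
    # Hypothetical fallback: some pages expose <meta property="og:title">.
    og = soup.select_one('meta[property="og:title"]')
    if og and og.get("content"):
        return og["content"].strip()
    # <title> is a last resort; it often carries "Amazon.com:" noise.
    return soup.title.get_text(strip=True) if soup.title else None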
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def clean(s: str) -> str:
return re.sub(r"\s+", " ", (s or "").strip())
def text_or_none(el) -> str | None:
return clean(el.get_text(" ", strip=True)) if el else None
def parse_price(soup: BeautifulSoup) -> str | None:
# Common price ids
for sel in [
"span.a-price > span.a-offscreen",
"span#priceblock_ourprice",
"span#priceblock_dealprice",
"span#priceblock_saleprice",
]:
el = soup.select_one(sel)
if el:
return clean(el.get_text(strip=True))
return None
def parse_product(html: str, url: str) -> dict:
soup = BeautifulSoup(html, "lxml")
title = text_or_none(soup.select_one("#productTitle"))
price = parse_price(soup)
rating = None
rating_el = soup.select_one("i[data-hook='average-star-rating'] span") or soup.select_one("span[data-hook='rating-out-of-text']")
if rating_el:
rating = clean(rating_el.get_text(strip=True))
review_count = None
rc = soup.select_one("#acrCustomerReviewText")
if rc:
review_count = clean(rc.get_text(strip=True))
brand = None
brand_el = soup.select_one("#bylineInfo")
if brand_el:
brand = clean(brand_el.get_text(" ", strip=True))
bullets = [clean(li.get_text(" ", strip=True)) for li in soup.select("#feature-bullets li span.a-list-item")]
bullets = [b for b in bullets if b]
# Reviews URL: often linked in the reviews section
reviews_url = None
a_reviews = soup.select_one("a[data-hook='see-all-reviews-link-foot']") or soup.select_one("a[data-hook='see-all-reviews-link']")
if a_reviews and a_reviews.get("href"):
reviews_url = "https://www.amazon.com" + a_reviews.get("href")
return {
"url": url,
"title": title,
"price": price,
"rating": rating,
"review_count_text": review_count,
"brand_text": brand,
"bullets": bullets,
"reviews_url": reviews_url,
}
Test it:
API_KEY = "API_KEY"
ASIN = "B00ZV9RDKK" # replace with your product
product_url = f"https://www.amazon.com/dp/{ASIN}"
html = fetch_html(product_url, API_KEY)
if not html:
raise SystemExit("blocked")
product = parse_product(html, product_url)
print(product)
Step 3: Scrape reviews + pagination
Review rows are typically marked with data-hook="review".
We’ll extract:
- review id
- title
- rating
- date
- verified purchase (when present)
- body text
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
def set_page(url: str, page: int) -> str:
p = urlparse(url)
q = parse_qs(p.query)
q["pageNumber"] = [str(page)]
return urlunparse((p.scheme, p.netloc, p.path, p.params, urlencode(q, doseq=True), p.fragment))
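# A quick sanity check of set_page's behavior:
# set_page("https://www.amazon.com/product-reviews/B00ZV9RDKK", 2)
# -> "https://www.amazon.com/product-reviews/B00ZV9RDKK?pageNumber=2"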
def parse_reviews(html: str) -> list[dict]:
soup = BeautifulSoup(html, "lxml")
out = []
for div in soup.select("div[data-hook='review']"):
rid = div.get("id")
title = text_or_none(div.select_one("a[data-hook='review-title'] span")) or text_or_none(div.select_one("span[data-hook='review-title']"))
rating = text_or_none(div.select_one("i[data-hook='review-star-rating'] span")) or text_or_none(div.select_one("i[data-hook='cmps-review-star-rating'] span"))
date_text = text_or_none(div.select_one("span[data-hook='review-date']"))
verified = bool(div.select_one("span[data-hook='avp-badge']"))
body = text_or_none(div.select_one("span[data-hook='review-body']"))
out.append({
"id": rid,
"title": title,
"rating_text": rating,
"date_text": date_text,
"verified_purchase": verified,
"body": body,
})
return out
Crawler:
def crawl_reviews(reviews_url: str, api_key: str, pages: int = 3) -> list[dict]:
all_reviews = []
seen = set()
for page in range(1, pages + 1):
url = reviews_url if page == 1 else set_page(reviews_url, page)
html = fetch_html(url, api_key)
if not html:
print("blocked on review page", page)
break
batch = parse_reviews(html)
for r in batch:
rid = r.get("id")
if rid and rid in seen:
continue
if rid:
seen.add(rid)
all_reviews.append(r)
print("review page", page, "rows", len(batch), "total", len(all_reviews))
# be polite
time.sleep(1.0)
return all_reviews
Step 4: Save output (JSON)
import json
def save_json(obj, path: str) -> None:
with open(path, "w", encoding="utf-8") as f:
json.dump(obj, f, ensure_ascii=False, indent=2)
API_KEY = "API_KEY"
ASIN = "B00ZV9RDKK" # replace
product_url = f"https://www.amazon.com/dp/{ASIN}"
html = fetch_html(product_url, API_KEY)
if not html:
raise SystemExit("blocked")
product = parse_product(html, product_url)
print("product title:", product.get("title"))
reviews_url = product.get("reviews_url") or f"https://www.amazon.com/product-reviews/{ASIN}"
reviews = crawl_reviews(reviews_url, API_KEY, pages=3)
save_json({"product": product, "reviews": reviews}, f"amazon_{ASIN}.json")
print("saved", len(reviews), "reviews")
Throttling + block handling (what actually helps)
Practical advice for Amazon:
- Start small: 1 product → 3 review pages
- Sleep between pages: even 1–2 seconds helps
- Cache HTML: don’t refetch unchanged pages (see the caching sketch after this list)
- Detect blocks: stop early when you see a robot-check page
- Avoid search pages: they’re often more aggressively protected than /dp/ASIN pages
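For the caching point above, here is a minimal on-disk sketch (the cache directory and hashing scheme are our choices, not a requirement), reusing fetch_html from Step 1:
import hashlib
from pathlib import Path

CACHE_DIR = Path(".html_cache")  # arbitrary location; adjust as needed

def cached_fetch(target_url: str, api_key: str) -> str | None:
    # Key each cache entry by a hash of the target URL.
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(target_url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch_html(target_url, api_key)  # the Step 1 helper
    if html:
        path.write_text(html, encoding="utf-8")
    return html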
Also note that Amazon serves different HTML by locale.
If you’re scraping amazon.co.uk, adjust the domain in your URL builder and test selectors.
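A sketch of one way to parameterize the domain (the marketplace keys and helper name are ours):
MARKETPLACES = {
    "us": "www.amazon.com",
    "uk": "www.amazon.co.uk",
    "de": "www.amazon.de",
}

def product_url_for(asin: str, market: str = "us") -> str:
    # Selectors can differ per marketplace, so re-test parsing on each domain.
    return f"https://{MARKETPLACES[market]}/dp/{asin}"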
Comparison table: common approaches
| Approach | Works for | Pros | Cons |
|---|---|---|---|
| Direct Requests (no proxy) | small tests | simplest | blocks quickly |
| Requests + raw proxy pool | medium scale | control | operational overhead |
| ProxiesAPI fetch pattern | app integration + stability | simplest “proxy-backed fetch” | less low-level control |
| Paid datasets / official feeds | production apps | stable + legal clarity | cost |
QA checklist
- You can fetch /dp/ASIN HTML consistently
- Title + price parse correctly on 3 products
- Reviews parse returns non-empty rows
- Pagination increases unique review count
- You stop when blocked (don’t loop)
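To make the checklist repeatable, a small smoke test helps. This is a sketch reusing fetch_html and parse_product from earlier; the ASIN list is a placeholder:
def smoke_test(asins: list[str], api_key: str) -> None:
    # One quick pass per product: fetch, parse, report missing fields.
    for asin in asins:
        url = f"https://www.amazon.com/dp/{asin}"
        html = fetch_html(url, api_key)
        if not html:
            print(asin, "FAILED: blocked or fetch error")
            continue
        product = parse_product(html, url)
        missing = [k for k in ("title", "price") if not product.get(k)]
        print(asin, "OK" if not missing else f"missing: {missing}")

# smoke_test(["B00ZV9RDKK"], API_KEY)  # add a couple of your own ASINs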
Final thoughts
Scraping Amazon is less about clever parsing and more about discipline:
- stable fetch layer
- conservative crawling
- fast block detection
- incremental improvements over time
Once your pipeline is solid, you can expand to:
- multiple ASINs per run
- category discovery via external sources
- price tracking over time
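For the price-tracking idea, one simple approach is appending timestamped rows to a JSONL file (a sketch; the file layout is our choice):
import json
import time

def record_price(asin: str, price_text: str | None, path: str = "prices.jsonl") -> None:
    # Append one observation per run; build the time series later.
    row = {"asin": asin, "price_text": price_text, "ts": int(time.time())}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")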
But start with one product and get the fundamentals right.