How to Scrape Amazon Product Data, Reviews, and Prices

Amazon is the canonical e-commerce scraping target:

  • huge product catalog
  • lots of structured data (prices, ratings, bullets)
  • reviews and Q&A that are useful for research

It’s also a site where naive scraping fails quickly.

In this guide we’ll build a scraper that:

  1. fetches a product page via ProxiesAPI
  2. extracts product fields (title, price, rating, review count)
  3. follows the “See all reviews” flow
  4. paginates review pages and extracts review rows
  5. exports everything to JSON

We’ll keep the claims honest: Amazon search pages are commonly blocked, and even product pages can intermittently fail. But with a good fetch layer + conservative crawling, you can pull a lot of useful data.

Keep Amazon crawls stable with ProxiesAPI

Amazon is sensitive to automation. ProxiesAPI helps reduce fetch failures and gives you a consistent request surface while you focus on parsing and data quality.


What we’re scraping (URLs)

We’ll focus on two URL types:

1) Product detail page

Typical pattern:

  • https://www.amazon.com/dp/ASIN
  • https://www.amazon.com/gp/product/ASIN

Here ASIN is Amazon’s 10-character alphanumeric product ID.

2) Reviews pages

Common patterns:

  • https://www.amazon.com/product-reviews/ASIN
  • https://www.amazon.com/product-reviews/ASIN/?pageNumber=2

We’ll use those because they’re relatively stable.
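
If you only have a product URL and need the ASIN (for example, to build a reviews URL), a small regex recovers it. This extract_asin helper is a hypothetical convenience, not required by the scraper below:

import re


def extract_asin(url: str) -> str | None:
    # matches /dp/ASIN, /gp/product/ASIN, and /product-reviews/ASIN
    m = re.search(r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})(?=[/?]|$)", url)
    return m.group(1) if m else None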


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Build a ProxiesAPI fetch helper

Canonical request:

curl -s "http://api.proxiesapi.com/?key=API_KEY&url=https://www.amazon.com/dp/B000000000" | head

Python helper (with conservative retries + block detection):

import time
import random
import requests
from urllib.parse import quote_plus

TIMEOUT = (10, 60)


def proxiesapi_url(target_url: str, api_key: str) -> str:
    return f"http://api.proxiesapi.com/?key={quote_plus(api_key)}&url={quote_plus(target_url)}"


def looks_blocked(html: str) -> bool:
    t = (html or "").lower()

    # Amazon often returns a robot-check page or a minimal error page
    markers = [
        "robot check",
        "enter the characters you see below",
        "sorry, we just need to make sure you're not a robot",
        "type the characters",
        "to discuss automated access",
        "captcha",
    ]
    return any(m in t for m in markers)


def fetch_html(target_url: str, api_key: str, *, max_attempts: int = 6) -> str | None:
    session = requests.Session()

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }

    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            url = proxiesapi_url(target_url, api_key)
            r = session.get(url, timeout=TIMEOUT, headers=headers)

            if r.status_code >= 400:
                raise requests.HTTPError(f"HTTP {r.status_code}")

            html = r.text
            if looks_blocked(html):
                raise RuntimeError("blocked/captcha detected")

            return html

        except Exception as e:
            last_err = e
            if attempt < max_attempts:
                # exponential backoff with jitter, capped at 40s
                sleep_s = min(40, 2 ** attempt) + random.random()
                time.sleep(sleep_s)

    print("failed:", last_err)
    return None

Step 2: Parse product data (real selectors)

Amazon’s HTML varies by locale and experiment.

So we’ll use a selector strategy:

  • try several common ids for the same field
  • fall back to OpenGraph / meta tags when possible

import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def clean(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())


def text_or_none(el) -> str | None:
    return clean(el.get_text(" ", strip=True)) if el else None


def parse_price(soup: BeautifulSoup) -> str | None:
    # Common price ids
    for sel in [
        "span.a-price > span.a-offscreen",
        "span#priceblock_ourprice",
        "span#priceblock_dealprice",
        "span#priceblock_saleprice",
    ]:
        el = soup.select_one(sel)
        if el:
            return clean(el.get_text(strip=True))
    return None


def parse_product(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = text_or_none(soup.select_one("#productTitle"))
    if not title:
        # OpenGraph fallback for layouts that omit #productTitle
        og = soup.select_one("meta[property='og:title']")
        if og and og.get("content"):
            title = clean(og["content"])

    price = parse_price(soup)

    rating = None
    rating_el = soup.select_one("i[data-hook='average-star-rating'] span") or soup.select_one("span[data-hook='rating-out-of-text']")
    if rating_el:
        rating = clean(rating_el.get_text(strip=True))

    review_count = None
    rc = soup.select_one("#acrCustomerReviewText")
    if rc:
        review_count = clean(rc.get_text(strip=True))

    brand = None
    brand_el = soup.select_one("#bylineInfo")
    if brand_el:
        brand = clean(brand_el.get_text(" ", strip=True))

    bullets = [clean(li.get_text(" ", strip=True)) for li in soup.select("#feature-bullets li span.a-list-item")]
    bullets = [b for b in bullets if b]

    # Reviews URL: often linked in the reviews section
    reviews_url = None
    a_reviews = soup.select_one("a[data-hook='see-all-reviews-link-foot']") or soup.select_one("a[data-hook='see-all-reviews-link']")
    if a_reviews and a_reviews.get("href"):
        # urljoin handles both relative and absolute hrefs
        reviews_url = urljoin("https://www.amazon.com", a_reviews["href"])

    return {
        "url": url,
        "title": title,
        "price": price,
        "rating": rating,
        "review_count_text": review_count,
        "brand_text": brand,
        "bullets": bullets,
        "reviews_url": reviews_url,
    }

Test it:

API_KEY = "API_KEY"  # your ProxiesAPI key
ASIN = "B00ZV9RDKK"  # replace with your product
product_url = f"https://www.amazon.com/dp/{ASIN}"

html = fetch_html(product_url, API_KEY)
if not html:
    raise SystemExit("blocked")

product = parse_product(html, product_url)
print(product)

Step 3: Scrape reviews + pagination

Review rows are typically marked with data-hook="review".

We’ll extract:

  • review id
  • title
  • rating
  • date
  • verified purchase (when present)
  • body text

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def set_page(url: str, page: int) -> str:
    p = urlparse(url)
    q = parse_qs(p.query)
    q["pageNumber"] = [str(page)]
    return urlunparse((p.scheme, p.netloc, p.path, p.params, urlencode(q, doseq=True), p.fragment))
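
# Example usage (placeholder ASIN from this guide):
#   set_page("https://www.amazon.com/product-reviews/B00ZV9RDKK/?pageNumber=1", 3)
#   -> "https://www.amazon.com/product-reviews/B00ZV9RDKK/?pageNumber=3"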


def parse_reviews(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []

    for div in soup.select("div[data-hook='review']"):
        rid = div.get("id")

        title = text_or_none(div.select_one("a[data-hook='review-title'] span")) or text_or_none(div.select_one("span[data-hook='review-title']"))

        rating = text_or_none(div.select_one("i[data-hook='review-star-rating'] span")) or text_or_none(div.select_one("i[data-hook='cmps-review-star-rating'] span"))

        date_text = text_or_none(div.select_one("span[data-hook='review-date']"))

        verified = bool(div.select_one("span[data-hook='avp-badge']"))

        body = text_or_none(div.select_one("span[data-hook='review-body']"))

        out.append({
            "id": rid,
            "title": title,
            "rating_text": rating,
            "date_text": date_text,
            "verified_purchase": verified,
            "body": body,
        })

    return out

Crawler:


def crawl_reviews(reviews_url: str, api_key: str, pages: int = 3) -> list[dict]:
    all_reviews = []
    seen = set()

    for page in range(1, pages + 1):
        url = reviews_url if page == 1 else set_page(reviews_url, page)
        html = fetch_html(url, api_key)
        if not html:
            print("blocked on review page", page)
            break

        batch = parse_reviews(html)
        if not batch:
            # no rows parsed: likely past the last page, or the layout changed
            break
        for r in batch:
            rid = r.get("id")
            if rid and rid in seen:
                continue
            if rid:
                seen.add(rid)
            all_reviews.append(r)

        print("review page", page, "rows", len(batch), "total", len(all_reviews))

        # be polite
        time.sleep(1.0)

    return all_reviews

Step 4: Save output (JSON)

import json


def save_json(obj, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)


API_KEY = "API_KEY"  # your ProxiesAPI key
ASIN = "B00ZV9RDKK"  # replace
product_url = f"https://www.amazon.com/dp/{ASIN}"

html = fetch_html(product_url, API_KEY)
if not html:
    raise SystemExit("blocked")

product = parse_product(html, product_url)
print("product title:", product.get("title"))

reviews_url = product.get("reviews_url") or f"https://www.amazon.com/product-reviews/{ASIN}"
reviews = crawl_reviews(reviews_url, API_KEY, pages=3)

save_json({"product": product, "reviews": reviews}, f"amazon_{ASIN}.json")
print("saved", len(reviews), "reviews")

Throttling + block handling (what actually helps)

Practical advice for Amazon:

  • Start small: 1 product → 3 review pages
  • Sleep between pages: even 1–2 seconds helps
  • Cache HTML: don’t refetch unchanged pages (see the sketch after this list)
  • Detect blocks: stop early when you see a robot-check page
  • Avoid search pages: they’re often more aggressively protected than /dp/ASIN pages
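
The “Cache HTML” point is worth a sketch. Here is a minimal disk cache keyed by a hash of the URL; cached_fetch is a hypothetical wrapper around fetch_html from Step 1, not something the scraper above requires.

import hashlib
import os


def cached_fetch(target_url: str, api_key: str, cache_dir: str = ".cache") -> str | None:
    # reuse HTML from disk when this exact URL was fetched before
    os.makedirs(cache_dir, exist_ok=True)
    name = hashlib.sha256(target_url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(cache_dir, name)

    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()

    html = fetch_html(target_url, api_key)
    if html:
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
    return html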

Also note that Amazon serves different HTML by locale.

If you’re scraping amazon.co.uk, adjust the domain in your URL builder and test selectors.
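
For example, a marketplace-aware URL builder might look like this (the domain map is illustrative; re-test selectors per locale):

MARKETPLACE_DOMAINS = {
    "us": "www.amazon.com",
    "uk": "www.amazon.co.uk",
    "de": "www.amazon.de",
}


def build_product_url(asin: str, marketplace: str = "us") -> str:
    return f"https://{MARKETPLACE_DOMAINS[marketplace]}/dp/{asin}"


def build_reviews_url(asin: str, marketplace: str = "us") -> str:
    return f"https://{MARKETPLACE_DOMAINS[marketplace]}/product-reviews/{asin}"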


Comparison table: common approaches

Approach                        Works for                    Pros                           Cons
Direct Requests (no proxy)      small tests                  simplest                       blocks quickly
Requests + raw proxy pool       medium scale                 control                        operational overhead
ProxiesAPI fetch pattern        app integration + stability  simplest “proxy-backed fetch”  less low-level control
Paid datasets / official feeds  production apps              stable + legal clarity         cost

QA checklist

  • You can fetch /dp/ASIN HTML consistently
  • Title + price parse correctly on 3 products
  • Reviews parse returns non-empty rows
  • Pagination increases unique review count
  • You stop when blocked (don’t loop)
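
Parts of this checklist are easy to automate. A small sanity script over the JSON from Step 4 (the filename assumes the example ASIN; adjust to yours):

import json

with open("amazon_B00ZV9RDKK.json", encoding="utf-8") as f:
    data = json.load(f)

product, reviews = data["product"], data["reviews"]
assert product["title"], "title did not parse"
assert product["price"], "price did not parse"

ids = [r["id"] for r in reviews if r["id"]]
assert len(ids) == len(set(ids)), "duplicate review ids"
print("OK:", len(reviews), "reviews for", product["title"][:40])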

Final thoughts

Scraping Amazon is less about clever parsing and more about discipline:

  • stable fetch layer
  • conservative crawling
  • fast block detection
  • incremental improvements over time

Once your pipeline is solid, you can expand to:

  • multiple ASINs per run (a minimal loop is sketched below)
  • category discovery via external sources
  • price tracking over time
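
A minimal multi-ASIN loop, reusing fetch_html, parse_product, and save_json from the steps above (the ASIN list is illustrative):

ASINS = ["B00ZV9RDKK"]  # extend with your own ASINs

for asin in ASINS:
    url = f"https://www.amazon.com/dp/{asin}"
    html = fetch_html(url, API_KEY)
    if not html:
        print("skipping", asin, "(blocked)")
        continue
    save_json({"product": parse_product(html, url)}, f"amazon_{asin}.json")
    time.sleep(2.0)  # conservative pacing between products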

But start with one product and get the fundamentals right.
