How to Scrape Amazon Product Data, Reviews, and Prices
Amazon is the canonical e-commerce scraping target:
- huge product catalog
- lots of structured data (prices, ratings, bullets)
- reviews and Q&A that are useful for research
It’s also a site where naive scraping fails quickly.
In this guide we’ll build a scraper that:
- fetches a product page via ProxiesAPI
- extracts product fields (title, price, rating, review count)
- follows the “See all reviews” flow
- paginates review pages and extracts review rows
- exports everything to JSON
We’ll keep the claims honest: Amazon search pages are commonly blocked, and even product pages can intermittently fail. But with a good fetch layer + conservative crawling, you can pull a lot of useful data.
Amazon is sensitive to automation. ProxiesAPI helps reduce fetch failures and gives you a consistent request surface while you focus on parsing and data quality.
What we’re scraping (URLs)
We’ll focus on two URL types:
1) Product detail page
Typical pattern:
https://www.amazon.com/dp/ASIN
https://www.amazon.com/gp/product/ASIN
Where ASIN is the product’s 10-character alphanumeric ID.
2) Reviews pages
Common patterns:
https://www.amazon.com/product-reviews/ASIN
https://www.amazon.com/product-reviews/ASIN/?pageNumber=2
We’ll use those because they’re relatively stable.
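Going the other way, pulling the ASIN out of a product URL, is often handy too. A minimal sketch; the regex covers just the two patterns above, so treat it as an assumption rather than an exhaustive rule:
import re

def extract_asin(url: str) -> str | None:
    # Matches /dp/ASIN or /gp/product/ASIN; ASINs are 10 alphanumeric characters.
    m = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return m.group(1) if m else None

# extract_asin("https://www.amazon.com/dp/B00ZV9RDKK") -> "B00ZV9RDKK"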
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
(The snippets below use modern type hints like str | None, so Python 3.10+ is assumed.)
Step 1: Build a ProxiesAPI fetch helper
Canonical request:
curl -s "http://api.proxiesapi.com/?key=API_KEY&url=https://www.amazon.com/dp/B000000000" | head
Python helper (with conservative retries + block detection):
import time
import random
import requests
from urllib.parse import quote_plus
TIMEOUT = (10, 60)  # (connect, read) timeouts in seconds
def proxiesapi_url(target_url: str, api_key: str) -> str:
return f"http://api.proxiesapi.com/?key={quote_plus(api_key)}&url={quote_plus(target_url)}"
def looks_blocked(html: str) -> bool:
t = (html or "").lower()
# Amazon often returns a robot-check page or a minimal error page
markers = [
"robot check",
"enter the characters you see below",
"sorry, we just need to make sure you're not a robot",
"type the characters",
"to discuss automated access",
"captcha",
]
return any(m in t for m in markers)
def fetch_html(target_url: str, api_key: str, *, max_attempts: int = 6) -> str | None:
session = requests.Session()
headers = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
last_err = None
for attempt in range(1, max_attempts + 1):
try:
url = proxiesapi_url(target_url, api_key)
r = session.get(url, timeout=TIMEOUT, headers=headers)
if r.status_code >= 400:
raise requests.HTTPError(f"HTTP {r.status_code}")
html = r.text
if looks_blocked(html):
raise RuntimeError("blocked/captcha detected")
return html
except Exception as e:
last_err = e
sleep_s = min(40, (2 ** attempt)) + random.random()
time.sleep(sleep_s)
print("failed:", last_err)
return None
Step 2: Parse product data (real selectors)
Amazon’s HTML varies by locale and by ongoing experiments, so we’ll use a layered selector strategy:
- try several common ids for the same field
- fall back to OpenGraph / meta tags when possible
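Here’s what that fallback could look like as a minimal sketch you could wire into the parser below (an assumption on our part: not every Amazon page ships OpenGraph tags, so treat it as a last resort):
def meta_title(soup) -> str | None:
    # Hypothetical fallback: some pages expose <meta property="og:title">.
    og = soup.select_one('meta[property="og:title"]')
    if og and og.get("content"):
        return og["content"].strip()
    # <title> is a last resort; it often carries "Amazon.com:" noise.
    return soup.title.get_text(strip=True) if soup.title else None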
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def clean(s: str) -> str:
return re.sub(r"\s+", " ", (s or "").strip())
def text_or_none(el) -> str | None:
return clean(el.get_text(" ", strip=True)) if el else None
def parse_price(soup: BeautifulSoup) -> str | None:
# Common price ids
for sel in [
"span.a-price > span.a-offscreen",
"span#priceblock_ourprice",
"span#priceblock_dealprice",
"span#priceblock_saleprice",
]:
el = soup.select_one(sel)
if el:
return clean(el.get_text(strip=True))
return None
def parse_product(html: str, url: str) -> dict:
soup = BeautifulSoup(html, "lxml")
title = text_or_none(soup.select_one("#productTitle"))
price = parse_price(soup)
rating = None
rating_el = soup.select_one("i[data-hook='average-star-rating'] span") or soup.select_one("span[data-hook='rating-out-of-text']")
if rating_el:
rating = clean(rating_el.get_text(strip=True))
review_count = None
rc = soup.select_one("#acrCustomerReviewText")
if rc:
review_count = clean(rc.get_text(strip=True))
brand = None
brand_el = soup.select_one("#bylineInfo")
if brand_el:
brand = clean(brand_el.get_text(" ", strip=True))
bullets = [clean(li.get_text(" ", strip=True)) for li in soup.select("#feature-bullets li span.a-list-item")]
bullets = [b for b in bullets if b]
# Reviews URL: often linked in the reviews section
reviews_url = None
a_reviews = soup.select_one("a[data-hook='see-all-reviews-link-foot']") or soup.select_one("a[data-hook='see-all-reviews-link']")
if a_reviews and a_reviews.get("href"):
reviews_url = "https://www.amazon.com" + a_reviews.get("href")
return {
"url": url,
"title": title,
"price": price,
"rating": rating,
"review_count_text": review_count,
"brand_text": brand,
"bullets": bullets,
"reviews_url": reviews_url,
}
Test it:
API_KEY = "API_KEY"
ASIN = "B00ZV9RDKK" # replace with your product
product_url = f"https://www.amazon.com/dp/{ASIN}"
html = fetch_html(product_url, API_KEY)
if not html:
raise SystemExit("blocked")
product = parse_product(html, product_url)
print(product)
Step 3: Scrape reviews + pagination
Review rows are typically marked with data-hook="review".
We’ll extract:
- review id
- title
- rating
- date
- verified purchase (when present)
- body text
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
def set_page(url: str, page: int) -> str:
p = urlparse(url)
q = parse_qs(p.query)
q["pageNumber"] = [str(page)]
return urlunparse((p.scheme, p.netloc, p.path, p.params, urlencode(q, doseq=True), p.fragment))
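# A quick sanity check of set_page's behavior:
# set_page("https://www.amazon.com/product-reviews/B00ZV9RDKK", 2)
# -> "https://www.amazon.com/product-reviews/B00ZV9RDKK?pageNumber=2"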
def parse_reviews(html: str) -> list[dict]:
soup = BeautifulSoup(html, "lxml")
out = []
for div in soup.select("div[data-hook='review']"):
rid = div.get("id")
title = text_or_none(div.select_one("a[data-hook='review-title'] span")) or text_or_none(div.select_one("span[data-hook='review-title']"))
rating = text_or_none(div.select_one("i[data-hook='review-star-rating'] span")) or text_or_none(div.select_one("i[data-hook='cmps-review-star-rating'] span"))
date_text = text_or_none(div.select_one("span[data-hook='review-date']"))
verified = bool(div.select_one("span[data-hook='avp-badge']"))
body = text_or_none(div.select_one("span[data-hook='review-body']"))
out.append({
"id": rid,
"title": title,
"rating_text": rating,
"date_text": date_text,
"verified_purchase": verified,
"body": body,
})
return out
Crawler:
def crawl_reviews(reviews_url: str, api_key: str, pages: int = 3) -> list[dict]:
all_reviews = []
seen = set()
for page in range(1, pages + 1):
url = reviews_url if page == 1 else set_page(reviews_url, page)
html = fetch_html(url, api_key)
if not html:
print("blocked on review page", page)
break
batch = parse_reviews(html)
for r in batch:
rid = r.get("id")
if rid and rid in seen:
continue
if rid:
seen.add(rid)
all_reviews.append(r)
print("review page", page, "rows", len(batch), "total", len(all_reviews))
# be polite
time.sleep(1.0)
return all_reviews
Step 4: Save output (JSON)
import json
def save_json(obj, path: str) -> None:
with open(path, "w", encoding="utf-8") as f:
json.dump(obj, f, ensure_ascii=False, indent=2)
API_KEY = "API_KEY"
ASIN = "B00ZV9RDKK" # replace
product_url = f"https://www.amazon.com/dp/{ASIN}"
html = fetch_html(product_url, API_KEY)
if not html:
raise SystemExit("blocked")
product = parse_product(html, product_url)
print("product title:", product.get("title"))
reviews_url = product.get("reviews_url") or f"https://www.amazon.com/product-reviews/{ASIN}"
reviews = crawl_reviews(reviews_url, API_KEY, pages=3)
save_json({"product": product, "reviews": reviews}, f"amazon_{ASIN}.json")
print("saved", len(reviews), "reviews")
Throttling + block handling (what actually helps)
Practical advice for Amazon:
- Start small: 1 product → 3 review pages
- Sleep between pages: even 1–2 seconds helps
- Cache HTML: don’t refetch unchanged pages (see the caching sketch after this list)
- Detect blocks: stop early when you see a robot-check page
- Avoid search pages: they’re often more aggressively protected than /dp/ASIN pages
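For the caching point above, here is a minimal on-disk sketch (the cache directory and hashing scheme are our choices, not a requirement), reusing fetch_html from Step 1:
import hashlib
from pathlib import Path

CACHE_DIR = Path(".html_cache")  # arbitrary location; adjust as needed

def cached_fetch(target_url: str, api_key: str) -> str | None:
    # Key each cache entry by a hash of the target URL.
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(target_url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch_html(target_url, api_key)  # the Step 1 helper
    if html:
        path.write_text(html, encoding="utf-8")
    return html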
Also note that Amazon serves different HTML by locale.
If you’re scraping amazon.co.uk, adjust the domain in your URL builder and test selectors.
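A sketch of one way to parameterize the domain (the marketplace keys and helper name are ours):
MARKETPLACES = {
    "us": "www.amazon.com",
    "uk": "www.amazon.co.uk",
    "de": "www.amazon.de",
}

def product_url_for(asin: str, market: str = "us") -> str:
    # Selectors can differ per marketplace, so re-test parsing on each domain.
    return f"https://{MARKETPLACES[market]}/dp/{asin}"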
Comparison table: common approaches
| Approach | Works for | Pros | Cons |
|---|---|---|---|
| Direct Requests (no proxy) | small tests | simplest | blocks quickly |
| Requests + raw proxy pool | medium scale | control | operational overhead |
| ProxiesAPI fetch pattern | app integration + stability | simplest “proxy-backed fetch” | less low-level control |
| Paid datasets / official feeds | production apps | stable + legal clarity | cost |
QA checklist
- You can fetch /dp/ASIN HTML consistently
- Title + price parse correctly on 3 products
- Reviews parse returns non-empty rows
- Pagination increases unique review count
- You stop when blocked (don’t loop)
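To make the checklist repeatable, a small smoke test helps. This is a sketch reusing fetch_html and parse_product from earlier; the ASIN list is a placeholder:
def smoke_test(asins: list[str], api_key: str) -> None:
    # One quick pass per product: fetch, parse, report missing fields.
    for asin in asins:
        url = f"https://www.amazon.com/dp/{asin}"
        html = fetch_html(url, api_key)
        if not html:
            print(asin, "FAILED: blocked or fetch error")
            continue
        product = parse_product(html, url)
        missing = [k for k in ("title", "price") if not product.get(k)]
        print(asin, "OK" if not missing else f"missing: {missing}")

# smoke_test(["B00ZV9RDKK"], API_KEY)  # add a couple of your own ASINs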
Final thoughts
Scraping Amazon is less about clever parsing and more about discipline:
- stable fetch layer
- conservative crawling
- fast block detection
- incremental improvements over time
Once your pipeline is solid, you can expand to:
- multiple ASINs per run
- category discovery via external sources
- price tracking over time
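For the price-tracking idea, one simple approach is appending timestamped rows to a JSONL file (a sketch; the file layout is our choice):
import json
import time

def record_price(asin: str, price_text: str | None, path: str = "prices.jsonl") -> None:
    # Append one observation per run; build the time series later.
    row = {"asin": asin, "price_text": price_text, "ts": int(time.time())}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")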
But start with one product and get the fundamentals right.