Scrape Product Data from Amazon (with Python + ProxiesAPI)

Amazon product pages are a classic scraping target — and also one of the quickest places to get blocked if you hammer requests from a single IP.

(Screenshot: the Amazon product page we’ll scrape, showing title, price, rating, and availability)

In this tutorial we’ll build a real Python scraper that extracts:

  • product title
  • price (best-effort, from common price blocks)
  • rating (stars)
  • review count
  • availability / stock text
  • canonical product URL

We’ll use requests + BeautifulSoup (server-rendered HTML parsing) and we’ll show exactly where ProxiesAPI fits in the network layer.

Make your Amazon fetches more reliable with ProxiesAPI

Amazon is sensitive to repeated requests. ProxiesAPI gives you a simple, stable way to proxy your HTTP fetches so your scraper fails less as you scale your URL count.


Important notes (so your scraper doesn’t break instantly)

  1. HTML varies by locale and experiments. Amazon A/B tests markup frequently.
  2. Don’t rely on a single selector. Use a fallback chain.
  3. Send a realistic User-Agent + Accept-Language. It reduces “robot page” responses.
  4. Expect intermittent failures. Build retries + backoff from day one.

Also: this guide does not claim to bypass protected challenges. If you receive a challenge page, treat it as a failed fetch and move on.


Quick sanity check: fetch HTML

Pick a single product URL (example):

https://www.amazon.com/dp/B0C7W6G2Q2

Try fetching headers-only first:

curl -I "https://www.amazon.com/dp/B0C7W6G2Q2" | head

curl -I returns headers only, so check the status line: a 200 means we’re good. A redirect to an interstitial or a “robot check” page is exactly why we’ll add proxy-backed fetching and retries.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: A robust fetch() (timeouts, headers, retries)

import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # connect, read

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}

session = requests.Session()


def fetch_direct(url: str) -> str:
    r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch_with_proxiesapi(url: str, api_key: str) -> str:
    # ProxiesAPI format required by the guide
    proxied = f"http://api.proxiesapi.com/?key={quote(api_key)}&url={quote(url, safe='')}"
    r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch(url: str, api_key: str | None = None, retries: int = 4) -> str:
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            html = fetch_with_proxiesapi(url, api_key) if api_key else fetch_direct(url)

            # very lightweight “challenge-ish” detection
            lowered = html.lower()
            if "captcha" in lowered or "robot check" in lowered:
                raise RuntimeError("Possible robot check/captcha page")

            return html
        except Exception as e:
            last_err = e
            if attempt == retries:
                break  # no point sleeping after the final attempt
            sleep_s = (2 ** attempt) + random.random()
            print(f"attempt {attempt}/{retries} failed: {e}. sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)

    raise RuntimeError(f"Failed to fetch after {retries} attempts: {last_err}")
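To make the retry timing concrete, here is the base delay sequence the exponential backoff above produces (backoff_schedule is just an illustrative helper, not part of the scraper; fetch() adds random jitter on top of each value):

```python
def backoff_schedule(retries: int) -> list[float]:
    # base delay (seconds) before each retry attempt: 2**1, 2**2, ...
    return [float(2 ** attempt) for attempt in range(1, retries + 1)]


print(backoff_schedule(4))  # [2.0, 4.0, 8.0, 16.0]
```

With the default retries=4, a fully failing URL costs roughly 30 seconds of sleep before giving up, which is a reasonable ceiling for a batch job.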

ProxiesAPI curl (same URL)

API_KEY="YOUR_KEY"
URL="https://www.amazon.com/dp/B0C7W6G2Q2"
# -G + --data-urlencode keeps the target URL safely percent-encoded
curl -s -G "http://api.proxiesapi.com/" \
  --data-urlencode "key=$API_KEY" \
  --data-urlencode "url=$URL" | head -n 20

Step 2: Understand the page structure (selectors that actually exist)

Across many Amazon product pages, these are common IDs/classes:

  • Title: #productTitle
  • Price blocks (varies):
    • #priceblock_ourprice (older)
    • #priceblock_dealprice (older)
    • span.a-price > span.a-offscreen (common modern)
  • Rating: span[data-hook="rating-out-of-text"] or i[data-hook="average-star-rating"] span
  • Review count: span[data-hook="total-review-count"]
  • Availability: #availability span
  • Canonical URL: link[rel="canonical"]

We’ll code with fallbacks so you don’t lose everything when one selector changes.


Step 3: Parse product fields with fallbacks

import re
from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None


def parse_price(text: str | None) -> float | None:
    if not text:
        return None
    # handles "$1,299.99" or "₹1,299.99" etc.
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))


def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = first_text(soup, [
        "#productTitle",
        "h1#title span#productTitle",
        "h1 span.a-size-large",
    ])

    price_text = first_text(soup, [
        "#priceblock_ourprice",
        "#priceblock_dealprice",
        "span.a-price span.a-offscreen",
        "span.apexPriceToPay span.a-offscreen",
    ])

    rating_text = first_text(soup, [
        'span[data-hook="rating-out-of-text"]',
        'i[data-hook="average-star-rating"] span',
        'span.a-icon-alt',
    ])

    review_count = first_text(soup, [
        'span[data-hook="total-review-count"]',
        "#acrCustomerReviewText",
    ])

    availability = first_text(soup, [
        "#availability span",
        "#availability",
    ])

    canonical = None
    can = soup.select_one('link[rel="canonical"]')
    if can:
        canonical = can.get("href")

    return {
        "title": title,
        "price_text": price_text,
        "price": parse_price(price_text),
        "rating_text": rating_text,
        "review_count_text": review_count,
        "availability": availability,
        "canonical_url": canonical,
    }

Step 4: End-to-end script (fetch → parse → print JSON)

Create amazon_product_scrape.py (keep the fetch and parsing functions from Steps 1 and 3, with their imports, in the same file):

import json

PRODUCT_URL = "https://www.amazon.com/dp/B0C7W6G2Q2"

# set to None to fetch direct (more likely to fail at scale)
PROXIESAPI_KEY = None  # "YOUR_KEY"

html = fetch(PRODUCT_URL, api_key=PROXIESAPI_KEY)
product = parse_product(html)
print(json.dumps(product, ensure_ascii=False, indent=2))
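Once you scale past a single URL, JSON Lines (one object per line) is a convenient output format: it is append-friendly and trivial to re-parse. A minimal stdlib sketch; write_jsonl is our own helper name, not a library function:

```python
import io
import json


def write_jsonl(records: list[dict], fp) -> None:
    # one JSON object per line: safe to append to mid-crawl, easy to stream later
    for rec in records:
        fp.write(json.dumps(rec, ensure_ascii=False) + "\n")


buf = io.StringIO()
write_jsonl([{"title": "Widget", "price": 19.99},
             {"title": "Gadget", "price": None}], buf)
print(buf.getvalue(), end="")
```

In the real script you would pass an open file handle instead of a StringIO and call write_jsonl once per scraped batch.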

Example output (typical)

{
  "title": "...",
  "price_text": "$...",
  "price": 1299.99,
  "rating_text": "4.6 out of 5 stars",
  "review_count_text": "2,341",
  "availability": "In Stock.",
  "canonical_url": "https://www.amazon.com/dp/B0C7W6G2Q2"
}

Practical tips for scraping Amazon without constant breakage

  • Only scrape what you need. Every extra request increases block probability.
  • Add caching. If you re-run often, store raw HTML and re-parse locally.
  • Backoff on 503/429. Don’t retry aggressively.
  • Rotate targets. Don’t crawl thousands of items from one category page in one burst.
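The caching tip can be sketched with nothing but the stdlib: store raw HTML on disk keyed by a hash of the URL, and re-parse locally on re-runs. cache_path, fetch_cached, and the .html_cache directory are illustrative names, not part of any library:

```python
import hashlib
from pathlib import Path


def cache_path(url: str, cache_dir: str = ".html_cache") -> Path:
    # key raw HTML files by a hash of the URL so any URL maps to a valid filename
    Path(cache_dir).mkdir(exist_ok=True)
    return Path(cache_dir) / (hashlib.sha256(url.encode()).hexdigest() + ".html")


def fetch_cached(url: str, fetch_fn) -> str:
    # fetch_fn is any callable url -> html, e.g. the fetch() built earlier
    p = cache_path(url)
    if p.exists():
        return p.read_text(encoding="utf-8")
    html = fetch_fn(url)
    p.write_text(html, encoding="utf-8")
    return html


# demo with a fake fetcher so we don't hit the network
html = fetch_cached("https://example.com/item", lambda u: "<html>demo</html>")
print(len(html) > 0)
```

Re-running your parser against cached HTML costs zero requests, which is exactly what you want while iterating on selectors.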

Where ProxiesAPI fits (honestly)

Your scraper’s biggest enemy is not BeautifulSoup — it’s network instability (temporary blocks, throttling, inconsistent responses).

ProxiesAPI fits as a simple drop-in for the HTTP fetch layer:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.amazon.com/dp/B0C7W6G2Q2"

If you keep everything else identical (timeouts, headers, retries), proxy-backed fetching tends to produce fewer dead runs when you scale from “1 URL” to “1,000 URLs”.


QA checklist

  • Title is non-empty for 3 different products
  • Price parsing works for at least one product with a visible price
  • Rating + review count are present when the product has reviews
  • Availability reads correctly for in-stock and out-of-stock examples
  • fetch() uses timeouts and retries

Related guides

Scrape Vinted Listings with Python: Search, Prices, Images, and Pagination
Build a dataset from Vinted search results (title, price, size, condition, seller, images) with a production-minded Python scraper + a proxy-backed fetch layer via ProxiesAPI.
(tutorial · python · vinted · ecommerce)

Scrape Flight Prices from Google Flights (Python + ProxiesAPI)
Build a routes→prices dataset from Google Flights with pagination-safe requests, retries, and a proof screenshot. Includes export to CSV/JSON and pragmatic anti-blocking guidance.
(tutorial · python · google · google-flights)

Scrape Stack Overflow Questions and Answers by Tag (Python + ProxiesAPI)
Extract Stack Overflow question lists and accepted answers for a tag with robust retries, respectful rate limits, and a validation screenshot. Export to JSON/CSV.
(tutorial · python · stack-overflow · web-scraping)