Scrape Product Data from Amazon (with Python + ProxiesAPI)

Amazon product pages are a classic scraping target — and also one of the quickest places to get blocked if you hammer requests from a single IP.

Amazon product page we’ll scrape (title, price, rating, availability)

In this tutorial we’ll build a real Python scraper that extracts:

  • product title
  • price (best-effort, from common price blocks)
  • rating (stars)
  • review count
  • availability / stock text
  • canonical product URL

We’ll use requests + BeautifulSoup (server-rendered HTML parsing) and we’ll show exactly where ProxiesAPI fits in the network layer.

Make your Amazon fetches more reliable with ProxiesAPI

Amazon is sensitive to repeated requests. ProxiesAPI gives you a simple, stable way to route your HTTP fetches through a proxy, so your scraper fails less often as you scale up the number of URLs.


Important notes (so your scraper doesn’t break instantly)

  1. HTML varies by locale and experiments. Amazon A/B tests markup frequently.
  2. Don’t rely on a single selector. Use a fallback chain.
  3. Send a realistic User-Agent + Accept-Language. It reduces “robot page” responses.
  4. Expect intermittent failures. Build retries + backoff from day one.

Also: this guide does not claim to bypass protected challenges. If you receive a challenge page, treat it as a failed fetch and move on.


Quick sanity check: fetch HTML

Pick a single product URL (example):

https://www.amazon.com/dp/B0C7W6G2Q2

Try a headers-only request first, just to see what status code you get:

curl -sI "https://www.amazon.com/dp/B0C7W6G2Q2" | head

A 200 status is a good sign that a full GET will return real HTML. A 503, or a redirect to an interstitial / “robot check” page, is exactly why we’ll add proxy-backed fetching and retries.
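Whether you fetch directly or through a proxy, it helps to classify a response body as “probably a challenge page” before you try to parse it. A minimal heuristic sketch — the marker strings are assumptions based on commonly seen robot-check pages, not a stable contract:

```python
def looks_like_challenge(html: str) -> bool:
    """Heuristic: does this HTML look like a robot-check/captcha page?

    The marker strings below are guesses from commonly reported challenge
    pages; tune them against the pages you actually receive.
    """
    lowered = html.lower()
    markers = ("captcha", "robot check", "automated access")
    return any(m in lowered for m in markers)
```

If this returns True, treat the fetch as failed rather than parsing the page; the retry loop in Step 1 applies a similar check.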


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: A robust fetch() (timeouts, headers, retries)

import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # connect, read

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}

session = requests.Session()


def fetch_direct(url: str) -> str:
    r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch_with_proxiesapi(url: str, api_key: str) -> str:
    # ProxiesAPI format required by the guide
    proxied = f"http://api.proxiesapi.com/?key={quote(api_key)}&url={quote(url, safe='')}"
    r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch(url: str, api_key: str | None = None, retries: int = 4) -> str:
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            html = fetch_with_proxiesapi(url, api_key) if api_key else fetch_direct(url)

            # very lightweight “challenge-ish” detection
            lowered = html.lower()
            if "captcha" in lowered and "amazon" in lowered:
                raise RuntimeError("Possible robot check/captcha page")

            return html
        except Exception as e:
            last_err = e
            if attempt < retries:  # no pointless sleep after the final attempt
                sleep_s = (2 ** attempt) + random.random()
                print(f"attempt {attempt}/{retries} failed: {e}. sleeping {sleep_s:.1f}s")
                time.sleep(sleep_s)

    raise RuntimeError(f"Failed to fetch after {retries} attempts: {last_err}")

ProxiesAPI curl (same URL)

API_KEY="YOUR_KEY"
URL="https://www.amazon.com/dp/B0C7W6G2Q2"
# -G + --data-urlencode makes curl URL-encode the target, matching the Python code
curl -sG "http://api.proxiesapi.com/" \
  --data-urlencode "key=$API_KEY" \
  --data-urlencode "url=$URL" | head -n 20

Step 2: Understand the page structure (selectors that actually exist)

Across many Amazon product pages, these are common IDs/classes:

  • Title: #productTitle
  • Price blocks (varies):
    • #priceblock_ourprice (older)
    • #priceblock_dealprice (older)
    • span.a-price > span.a-offscreen (common modern)
  • Rating: span[data-hook="rating-out-of-text"] or i[data-hook="average-star-rating"] span
  • Review count: span[data-hook="total-review-count"]
  • Availability: #availability span
  • Canonical URL: link[rel="canonical"]

We’ll code with fallbacks so you don’t lose everything when one selector changes.


Step 3: Parse product fields with fallbacks

import re
from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None


def parse_price(text: str | None) -> float | None:
    if not text:
        return None
    # handles "$1,299.99" or "₹1,299.99" etc.
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))


def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = first_text(soup, [
        "#productTitle",
        "h1#title span#productTitle",
        "h1 span.a-size-large",
    ])

    price_text = first_text(soup, [
        "#priceblock_ourprice",
        "#priceblock_dealprice",
        "span.a-price span.a-offscreen",
        "span.apexPriceToPay span.a-offscreen",
    ])

    rating_text = first_text(soup, [
        'span[data-hook="rating-out-of-text"]',
        'i[data-hook="average-star-rating"] span',
        'span.a-icon-alt',
    ])

    review_count = first_text(soup, [
        'span[data-hook="total-review-count"]',
        "#acrCustomerReviewText",
    ])

    availability = first_text(soup, [
        "#availability span",
        "#availability",
    ])

    canonical = None
    can = soup.select_one('link[rel="canonical"]')
    if can:
        canonical = can.get("href")

    return {
        "title": title,
        "price_text": price_text,
        "price": parse_price(price_text),
        "rating_text": rating_text,
        "review_count_text": review_count,
        "availability": availability,
        "canonical_url": canonical,
    }
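Before pointing parse_product() at live pages, it’s worth checking the selector chain against a small offline fixture. The markup below is a hand-written approximation of the page structure we target (not captured Amazon output); with the functions above in scope you could call parse_product(FIXTURE) instead of the inline selects:

```python
import re
from bs4 import BeautifulSoup

# Hand-written fixture approximating the selectors we rely on.
FIXTURE = """
<html><head>
<link rel="canonical" href="https://www.amazon.com/dp/B0C7W6G2Q2">
</head><body>
<span id="productTitle"> Example Widget, 2-Pack </span>
<span class="a-price"><span class="a-offscreen">$1,299.99</span></span>
<span data-hook="rating-out-of-text">4.6 out of 5</span>
<div id="availability"><span> In Stock. </span></div>
</body></html>
"""

soup = BeautifulSoup(FIXTURE, "html.parser")  # html.parser is enough for a tiny fixture
title = soup.select_one("#productTitle").get_text(" ", strip=True)
price_text = soup.select_one("span.a-price span.a-offscreen").get_text(strip=True)
price = float(re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", price_text).group(1).replace(",", ""))
availability = soup.select_one("#availability span").get_text(" ", strip=True)
```

If the title, price, and availability come back clean here, a failure on a live page points at markup drift (or a challenge page), not at your parsing logic.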

Step 4: End-to-end script (fetch → parse → print JSON)

Create amazon_product_scrape.py (with the fetch() code from Step 1 and the parsing code from Step 3 above it), then add:

import json

PRODUCT_URL = "https://www.amazon.com/dp/B0C7W6G2Q2"

# set to None to fetch direct (more likely to fail at scale)
PROXIESAPI_KEY = None  # "YOUR_KEY"

html = fetch(PRODUCT_URL, api_key=PROXIESAPI_KEY)
product = parse_product(html)
print(json.dumps(product, ensure_ascii=False, indent=2))

Example output (typical)

{
  "title": "...",
  "price_text": "$...",
  "price": 1299.99,
  "rating_text": "4.6 out of 5 stars",
  "review_count_text": "2,341",
  "availability": "In Stock.",
  "canonical_url": "https://www.amazon.com/dp/B0C7W6G2Q2"
}

Practical tips for scraping Amazon without constant breakage

  • Only scrape what you need. Every extra request increases block probability.
  • Add caching. If you re-run often, store raw HTML and re-parse locally.
  • Backoff on 503/429. Don’t retry aggressively.
  • Rotate targets. Don’t crawl thousands of items from one category page in one burst.
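The caching tip above can be as simple as a directory of raw HTML keyed by a hash of the URL. A minimal sketch — cache_path, fetch_cached, and the .html_cache directory name are my own conventions, and there is no expiry policy here:

```python
import hashlib
import pathlib


def cache_path(url: str, cache_dir: str = ".html_cache") -> pathlib.Path:
    # One file per URL, keyed by SHA-256 so filenames stay filesystem-safe.
    d = pathlib.Path(cache_dir)
    d.mkdir(parents=True, exist_ok=True)
    return d / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")


def fetch_cached(url: str, fetch_fn, cache_dir: str = ".html_cache") -> str:
    """Return cached HTML if present, else fetch via fetch_fn and store it."""
    p = cache_path(url, cache_dir)
    if p.exists():
        return p.read_text(encoding="utf-8")
    html = fetch_fn(url)
    p.write_text(html, encoding="utf-8")
    return html
```

Usage with the Step 1 fetcher would look like fetch_cached(PRODUCT_URL, lambda u: fetch(u, api_key=PROXIESAPI_KEY)) — re-runs then re-parse from disk instead of re-hitting Amazon.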

Where ProxiesAPI fits (honestly)

Your scraper’s biggest enemy is not BeautifulSoup — it’s network instability (temporary blocks, throttling, inconsistent responses).

ProxiesAPI fits as a simple drop-in for the HTTP fetch layer:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.amazon.com/dp/B0C7W6G2Q2"

If you keep everything else identical (timeouts, headers, retries), proxy-backed fetching tends to produce fewer dead runs when you scale from “1 URL” to “1,000 URLs”.
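That scale-up step is mostly about pacing and error isolation: one dead URL shouldn’t kill the run. A minimal batch-loop sketch — scrape_many and the 1–3 second delay range are suggestions of mine, not part of any ProxiesAPI contract:

```python
import random
import time


def scrape_many(urls, fetch_fn, parse_fn, delay_range=(1.0, 3.0)):
    """Fetch and parse each URL, collecting failures instead of crashing."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(parse_fn(fetch_fn(url)))
        except Exception as e:
            failures.append((url, str(e)))  # record the failure and keep going
        time.sleep(random.uniform(*delay_range))  # polite jitter between requests
    return results, failures
```

Plugged into this tutorial’s code, fetch_fn would be lambda u: fetch(u, api_key=PROXIESAPI_KEY) and parse_fn would be parse_product; the failures list tells you which URLs to retry later.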


QA checklist

  • Title is non-empty for 3 different products
  • Price parsing works for at least one product with a visible price
  • Rating + review count are present when the product has reviews
  • Availability reads correctly for in-stock and out-of-stock examples
  • fetch() uses timeouts and retries

Related guides

Web Scraping with Python: The Complete 2026 Tutorial
A from-scratch, production-minded guide to web scraping in Python: requests + BeautifulSoup, pagination, retries, caching, proxies, and a reusable scraper template.

Build a Job Board with Data from Indeed (Python scraper tutorial)
Scrape Indeed job listings (title, company, location, salary, summary) with Python (requests + BeautifulSoup), then save a clean dataset you can render as a simple job board. Includes pagination + ProxiesAPI fetch.

Scrape OpenStreetMap Wiki pages with Python
Collect category pages and linked wiki entries into a structured index for research or monitoring.

How to Scrape MDN Docs Pages with Python
Extract headings and table-of-contents structure from MDN docs pages with Python and BeautifulSoup.