Scrape Product Data from Amazon (with Python + ProxiesAPI)
Amazon product pages are a classic scraping target — and also one of the quickest places to get blocked if you hammer requests from a single IP.
In this tutorial we’ll build a real Python scraper that extracts:
- product title
- price (best-effort, from common price blocks)
- rating (stars)
- review count
- availability / stock text
- canonical product URL
We’ll use requests + BeautifulSoup (server-rendered HTML parsing) and we’ll show exactly where ProxiesAPI fits in the network layer.
Amazon is sensitive to repeated requests. ProxiesAPI gives you a simple, stable way to proxy your HTTP fetches so your scraper fails less as you scale your URL count.
Important notes (so your scraper doesn’t break instantly)
- HTML varies by locale and experiments. Amazon A/B tests markup frequently.
- Don’t rely on a single selector. Use a fallback chain.
- Send a realistic User-Agent + Accept-Language. It reduces “robot page” responses.
- Expect intermittent failures. Build retries + backoff from day one.
Also: this guide does not claim to bypass protected challenges. If you receive a challenge page, treat it as a failed fetch and move on.
Quick sanity check: fetch HTML
Pick a single product URL (example):
https://www.amazon.com/dp/B0C7W6G2Q2
Try fetching headers-only first:
```bash
curl -I "https://www.amazon.com/dp/B0C7W6G2Q2" | head
```
If you get HTML, we’re good. If you get an interstitial or “robot check”, that’s exactly why we’ll add proxy-backed fetching and retries.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
Step 1: A robust fetch() (timeouts, headers, retries)
```python
import random
import time
from urllib.parse import quote

import requests

TIMEOUT = (10, 30)  # (connect, read) seconds

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}

session = requests.Session()


def fetch_direct(url: str) -> str:
    r = session.get(url, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch_with_proxiesapi(url: str, api_key: str) -> str:
    # ProxiesAPI URL format used throughout this guide
    proxied = f"http://api.proxiesapi.com/?key={quote(api_key)}&url={quote(url, safe='')}"
    r = session.get(proxied, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def fetch(url: str, api_key: str | None = None, retries: int = 4) -> str:
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            html = fetch_with_proxiesapi(url, api_key) if api_key else fetch_direct(url)
            # Very lightweight "challenge-ish" detection
            lowered = html.lower()
            if "captcha" in lowered and "amazon" in lowered:
                raise RuntimeError("Possible robot check/captcha page")
            return html
        except Exception as e:
            last_err = e
            sleep_s = (2 ** attempt) + random.random()  # exponential backoff + jitter
            print(f"attempt {attempt}/{retries} failed: {e}. sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)
    raise RuntimeError(f"Failed to fetch after {retries} attempts: {last_err}")
```
ProxiesAPI curl (same URL)
```bash
API_KEY="YOUR_KEY"
URL="https://www.amazon.com/dp/B0C7W6G2Q2"
# -G + --data-urlencode keeps the target URL safely encoded,
# even if it contains its own query string
curl -sG "http://api.proxiesapi.com/" \
  --data-urlencode "key=$API_KEY" \
  --data-urlencode "url=$URL" | head -n 20
```
Step 2: Understand the page structure (selectors that actually exist)
Across many Amazon product pages, these are common IDs/classes:
- Title: `#productTitle`
- Price blocks (varies): `#priceblock_ourprice` (older), `#priceblock_dealprice` (older), `span.a-price > span.a-offscreen` (common modern)
- Rating: `span[data-hook="rating-out-of-text"]` or `i[data-hook="average-star-rating"] span`
- Review count: `span[data-hook="total-review-count"]`
- Availability: `#availability span`
- Canonical URL: `link[rel="canonical"]`
We’ll code with fallbacks so you don’t lose everything when one selector changes.
Step 3: Parse product fields with fallbacks
```python
import re

from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    for sel in selectors:
        el = soup.select_one(sel)
        if not el:
            continue
        txt = el.get_text(" ", strip=True)
        if txt:
            return txt
    return None


def parse_price(text: str | None) -> float | None:
    if not text:
        return None
    # Handles "$1,299.99", "₹1,299.99", etc.
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))


def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = first_text(soup, [
        "#productTitle",
        "h1#title span#productTitle",
        "h1 span.a-size-large",
    ])
    price_text = first_text(soup, [
        "#priceblock_ourprice",
        "#priceblock_dealprice",
        "span.a-price span.a-offscreen",
        "span.apexPriceToPay span.a-offscreen",
    ])
    rating_text = first_text(soup, [
        'span[data-hook="rating-out-of-text"]',
        'i[data-hook="average-star-rating"] span',
        "span.a-icon-alt",
    ])
    review_count = first_text(soup, [
        'span[data-hook="total-review-count"]',
        "#acrCustomerReviewText",
    ])
    availability = first_text(soup, [
        "#availability span",
        "#availability",
    ])
    canonical = None
    can = soup.select_one('link[rel="canonical"]')
    if can:
        canonical = can.get("href")
    return {
        "title": title,
        "price_text": price_text,
        "price": parse_price(price_text),
        "rating_text": rating_text,
        "review_count_text": review_count,
        "availability": availability,
        "canonical_url": canonical,
    }
```
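Before wiring everything together, it helps to build confidence in `parse_price()` by running it against a few currency strings. The function below repeats Step 3's `parse_price` verbatim so the snippet runs standalone:

```python
import re


def parse_price(text):
    # Same logic as Step 3's parse_price, repeated so this snippet runs standalone.
    if not text:
        return None
    m = re.search(r"([0-9][0-9,]*\.?[0-9]{0,2})", text)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))


print(parse_price("$1,299.99"))   # 1299.99
print(parse_price("₹74,999"))     # 74999.0
print(parse_price("From $9.99"))  # 9.99
print(parse_price(None))          # None
```

Note it grabs the first number-like token, so a string like "From $9.99" yields the leading price — acceptable for best-effort extraction, but worth knowing.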
Step 4: End-to-end script (fetch → parse → print JSON)
Create amazon_product_scrape.py:
```python
import json

# fetch() and parse_product() from Steps 1 and 3 should live in this file
# (or be imported from wherever you defined them).

PRODUCT_URL = "https://www.amazon.com/dp/B0C7W6G2Q2"

# Set to None to fetch direct (more likely to fail at scale)
PROXIESAPI_KEY = None  # "YOUR_KEY"

html = fetch(PRODUCT_URL, api_key=PROXIESAPI_KEY)
product = parse_product(html)
print(json.dumps(product, ensure_ascii=False, indent=2))
```
Example output (typical)
```json
{
  "title": "...",
  "price_text": "$...",
  "price": 1299.99,
  "rating_text": "4.6 out of 5 stars",
  "review_count_text": "2,341",
  "availability": "In Stock.",
  "canonical_url": "https://www.amazon.com/dp/B0C7W6G2Q2"
}
```
Practical tips for scraping Amazon without constant breakage
- Only scrape what you need. Every extra request increases block probability.
- Add caching. If you re-run often, store raw HTML and re-parse locally.
- Backoff on 503/429. Don’t retry aggressively.
- Rotate targets. Don’t crawl thousands of items from one category page in one burst.
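The caching tip can be sketched as a small disk cache keyed by a hash of the URL. This is a minimal sketch, not part of the guide's scraper: the cache directory name and 24-hour TTL are arbitrary choices, and `fetcher` is any callable such as Step 1's `fetch()`.

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path(".html_cache")  # arbitrary location; pick what suits your project
CACHE_TTL = 24 * 3600            # re-fetch after 24 hours


def cached_fetch(url, fetcher, ttl=CACHE_TTL):
    """Return cached HTML for url if fresh enough, else call fetcher(url) and cache it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists() and (time.time() - path.stat().st_mtime) < ttl:
        return path.read_text(encoding="utf-8")
    html = fetcher(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Usage with the Step 1 fetcher would look like `cached_fetch(PRODUCT_URL, lambda u: fetch(u, api_key=PROXIESAPI_KEY))` — re-runs then re-parse locally instead of hitting Amazon again.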
Where ProxiesAPI fits (honestly)
Your scraper’s biggest enemy is not BeautifulSoup — it’s network instability (temporary blocks, throttling, inconsistent responses).
ProxiesAPI fits as a simple drop-in for the HTTP fetch layer:
```bash
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.amazon.com/dp/B0C7W6G2Q2"
```
If you keep everything else identical (timeouts, headers, retries), proxy-backed fetching tends to produce fewer dead runs when you scale from “1 URL” to “1,000 URLs”.
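Scaling from one URL to many mostly means adding pacing and per-URL error isolation around the same fetch-and-parse pair. A sketch under those assumptions (`fetch_fn`/`parse_fn` stand in for Step 1's `fetch()` and Step 3's `parse_product()`; the delay range is an arbitrary choice):

```python
import random
import time


def scrape_many(urls, fetch_fn, parse_fn, delay_range=(2.0, 6.0)):
    """Fetch and parse each URL, isolating failures so one bad page
    doesn't kill the whole run. Returns (results, failures)."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(parse_fn(fetch_fn(url)))
        except Exception as e:
            failures.append((url, str(e)))
        time.sleep(random.uniform(*delay_range))  # pacing between requests
    return results, failures
```

Called as `scrape_many(urls, lambda u: fetch(u, api_key=PROXIESAPI_KEY), parse_product)`, this gives you a failure list to retry later instead of a crashed run.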
QA checklist
- Title is non-empty for 3 different products
- Price parsing works for at least one product with a visible price
- Rating + review count are present when the product has reviews
- Availability reads correctly for in-stock and out-of-stock examples
- `fetch()` uses timeouts and retries