Scrape Products from Amazon (Python) — Title, Price, Rating + Pagination

Amazon is one of the most-requested scraping targets because product data is structured and valuable:

  • titles + URLs for discovery
  • prices for monitoring
  • ratings + review counts for popularity signals
  • pagination for scale

But it’s also one of the easiest places to get blocked.

In this tutorial, we’ll build a practical Amazon search-results scraper in Python that extracts:

  • title
  • product_url
  • asin
  • price (best-effort)
  • rating + rating_count (best-effort)

…and follows pagination across multiple result pages.

We’ll use server-rendered HTML (no browser automation) and structure the code so you can later plug in ProxiesAPI at the network layer.

Amazon search results page (we’ll scrape product cards + pagination)

Make Amazon scraping more reliable with ProxiesAPI

Amazon is aggressive about bot detection. ProxiesAPI won’t magically bypass everything, but it gives you a consistent proxy layer and rotation so your scraper can retry intelligently instead of dying on the first 503/CAPTCHA.


Important note (CAPTCHAs + legality + ToS)

Amazon may show:

  • CAPTCHAs
  • “Robot Check” pages
  • 503 / throttling
  • localized experiences

Scraping may violate Amazon’s Terms of Service and can have legal/compliance implications depending on your use case and jurisdiction.

This guide focuses on:

  • how to parse the HTML you receive
  • how to detect blocks
  • how to build a scraper that fails safely

Use it responsibly.


What we’re scraping (Amazon search structure)

We’ll scrape a search results URL like:

https://www.amazon.com/s?k=wireless+mouse

On typical Amazon SERPs, each product card is a div with:

  • data-component-type="s-search-result"
  • data-asin="..."

That’s your anchor.

Pagination usually appears as a list of a.s-pagination-item links with a page= query parameter; the “Next” link additionally carries the s-pagination-next class (we’ll use that in Step 3).
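
To see the anchor in action, here’s a toy card (hand-written and heavily simplified; the ASIN is made up) parsed with the same selector we’ll use later. It runs once you’ve installed the libraries from the Setup section below:

from bs4 import BeautifulSoup

# Illustrative markup only - real Amazon cards carry many more attributes
sample = """
<div data-component-type="s-search-result" data-asin="B0EXAMPLE1">
  <h2><a href="/Example-Wireless-Mouse/dp/B0EXAMPLE1/">Example Wireless Mouse</a></h2>
  <span class="a-price"><span class="a-offscreen">$19.99</span></span>
</div>
"""

card = BeautifulSoup(sample, "lxml").select_one('div[data-component-type="s-search-result"]')
print(card["data-asin"])                              # B0EXAMPLE1
print(card.select_one("h2 a").get_text(strip=True))   # Example Wireless Mouse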

Quick sanity check (HTML returned)

curl -A "Mozilla/5.0" -s "https://www.amazon.com/s?k=wireless+mouse" | head -n 20

If you see a “Robot Check” form or something like /errors/validateCaptcha, you’re blocked. Don’t waste time parsing those pages.
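
The same check from Python, if you prefer:

import requests

r = requests.get(
    "https://www.amazon.com/s?k=wireless+mouse",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)
print(r.status_code)
print("blocked?", "Robot Check" in r.text or "validateCaptcha" in r.text)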


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup with the lxml parser for HTML parsing

Step 1: A fetch() wrapper with timeouts + retries

Amazon is flaky for bots. You want:

  • timeouts (never hang)
  • retry with backoff
  • block detection

Here’s a minimal but production-shaped wrapper:

import random
import time
from dataclasses import dataclass

import requests

TIMEOUT = (10, 30)  # connect, read

USER_AGENTS = [
    # Keep a small, realistic UA pool (don’t go crazy)
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def looks_blocked(html: str) -> bool:
    if not html:
        return True
    needles = [
        "Robot Check",
        "Enter the characters you see below",
        "/errors/validateCaptcha",
        "Sorry, we just need to make sure you're not a robot",
    ]
    h = html.lower()
    return any(n.lower() in h for n in needles)


def fetch(session: requests.Session, url: str, max_retries: int = 4) -> FetchResult:
    last_exc = None

    for attempt in range(1, max_retries + 1):
        try:
            headers = {
                "User-Agent": random.choice(USER_AGENTS),
                "Accept-Language": "en-US,en;q=0.9",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Connection": "keep-alive",
            }

            # --- ProxiesAPI integration point ---
            # If ProxiesAPI gives you an HTTP proxy URL (or rotating endpoint),
            # wire it here. Example shape (DO NOT hardcode credentials):
            # proxies = {"http": PROXY_URL, "https": PROXY_URL}
            # r = session.get(url, headers=headers, timeout=TIMEOUT, proxies=proxies)
            # -----------------------------------

            r = session.get(url, headers=headers, timeout=TIMEOUT)

            text = r.text or ""

            # treat obvious block pages as retryable
            if r.status_code in (429, 503) or looks_blocked(text):
                raise RuntimeError(f"blocked_or_throttled status={r.status_code}")

            r.raise_for_status()
            return FetchResult(url=url, status_code=r.status_code, text=text)

        except Exception as e:
            last_exc = e
            if attempt == max_retries:
                break  # out of attempts; no point sleeping before the raise
            sleep_s = min(12, 1.5 ** attempt) + random.random()
            print(f"attempt {attempt}/{max_retries} failed: {e} - sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)

    raise RuntimeError(f"failed to fetch after {max_retries} retries: {url}") from last_exc

That wrapper is intentionally honest:

  • it doesn’t claim it can bypass CAPTCHAs
  • it just helps you retry and detect blocks
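
A quick smoke test before moving on (this hits the live search URL, so expect retries or a block page now and then):

session = requests.Session()
res = fetch(session, "https://www.amazon.com/s?k=wireless+mouse")
print(res.status_code, len(res.text))  # expect 200 and a non-trivial amount of HTML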

Step 2: Parse product cards from the HTML

Now we parse the search-result cards.

Common useful fields:

  • data-asin (stable product identifier)
  • title link under h2 a
  • rating often under i.a-icon-star-small (varies)
  • price often under span.a-price > span.a-offscreen (varies)

Because Amazon’s DOM varies by category and experiment, we’ll implement:

  • primary selectors
  • fallbacks
  • graceful None values when a field is missing

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.amazon.com"


def parse_price(text: str):
    if not text:
        return None
    # e.g. "$19.99" → 19.99
    m = re.search(r"([0-9]+(?:\.[0-9]{2})?)", text.replace(",", ""))
    return float(m.group(1)) if m else None


def parse_int(text: str):
    if not text:
        return None
    m = re.search(r"(\d[\d,]*)", text)
    return int(m.group(1).replace(",", "")) if m else None


def parse_rating(text: str):
    if not text:
        return None
    # e.g. "4.5 out of 5 stars" → 4.5
    m = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*5", text)
    return float(m.group(1)) if m else None


def parse_search_page(html: str):
    soup = BeautifulSoup(html, "lxml")

    results = []
    for card in soup.select('div[data-component-type="s-search-result"]'):
        asin = card.get("data-asin") or None
        if not asin:
            continue

        title_a = card.select_one("h2 a")
        title = title_a.get_text(" ", strip=True) if title_a else None
        href = title_a.get("href") if title_a else None
        product_url = urljoin(BASE, href) if href else None

        # price (best-effort)
        price = None
        price_el = card.select_one("span.a-price > span.a-offscreen")
        if price_el:
            price = parse_price(price_el.get_text(strip=True))

        # rating
        rating = None
        rating_count = None

        rating_el = card.select_one("i.a-icon-star-small span.a-icon-alt") or card.select_one(
            "i.a-icon-star span.a-icon-alt"
        )
        if rating_el:
            rating = parse_rating(rating_el.get_text(" ", strip=True))

        count_el = card.select_one('span[aria-label$="ratings"]')
        if count_el:
            rating_count = parse_int(count_el.get("aria-label", ""))
        else:
            # common fallback: a link next to the rating
            count_link = card.select_one('a[href*="customerReviews"] span')
            if count_link:
                rating_count = parse_int(count_link.get_text(" ", strip=True))

        results.append(
            {
                "asin": asin,
                "title": title,
                "product_url": product_url,
                "price": price,
                "rating": rating,
                "rating_count": rating_count,
            }
        )

    return results

Tip: log a few parsed rows early

When scraping Amazon, your #1 debugging tool is:

  • print the first 3 parsed items
  • confirm they look sane
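
In code, assuming the session and fetch() from Step 1:

res = fetch(session, "https://www.amazon.com/s?k=wireless+mouse")
for row in parse_search_page(res.text)[:3]:
    print(row)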

Step 3: Find the next page URL (pagination)

Amazon pagination links vary, but you usually have a page= query parameter.

We’ll implement two approaches:

  1. Prefer a “Next” button.
  2. Fallback: if you know the page number, construct &page=N.

from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, parse_qs, urlencode


def find_next_page_url(html: str):
    soup = BeautifulSoup(html, "lxml")

    # Approach 1: explicit Next link
    next_a = soup.select_one("a.s-pagination-next")
    if next_a and next_a.get("href"):
        return urljoin(BASE, next_a.get("href"))

    return None


def set_page(url: str, page: int) -> str:
    # Simple fallback: append/replace the page parameter.
    # parse_qs decodes the query (e.g. "wireless+mouse" -> "wireless mouse"),
    # so rebuild it with urlencode to re-encode spaces and special characters.
    parsed = urlparse(url)
    q = parse_qs(parsed.query)
    q["page"] = [str(page)]

    base = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    return base + "?" + urlencode(q, doseq=True)
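
A quick check of the fallback (parse_qs decodes the query and urlencode re-encodes it, so k=wireless+mouse round-trips intact):

print(set_page("https://www.amazon.com/s?k=wireless+mouse", 3))
# https://www.amazon.com/s?k=wireless+mouse&page=3
print(set_page("https://www.amazon.com/s?k=wireless+mouse&page=3", 4))
# https://www.amazon.com/s?k=wireless+mouse&page=4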

Step 4: Crawl multiple pages (dedupe by ASIN)

Now we combine everything:

  • fetch first page
  • parse cards
  • resolve next page
  • repeat

import json


def crawl_amazon_search(start_url: str, pages: int = 3):
    session = requests.Session()
    seen = set()
    out = []

    url = start_url

    for i in range(1, pages + 1):
        print(f"\n=== page {i}: {url}")
        res = fetch(session, url)

        batch = parse_search_page(res.text)
        print("items parsed:", len(batch))
        if not batch:
            # nothing parsed: likely a layout change or a soft block; stop cleanly
            print("no product cards found; stopping early")
            break

        for item in batch:
            asin = item.get("asin")
            if not asin or asin in seen:
                continue
            seen.add(asin)
            out.append(item)

        # try “Next”
        nxt = find_next_page_url(res.text)
        if nxt:
            url = nxt
        else:
            # fallback: if next not found, try forcing &page=
            url = set_page(start_url, i + 1)

    return out


if __name__ == "__main__":
    start = "https://www.amazon.com/s?k=wireless+mouse"
    items = crawl_amazon_search(start, pages=5)

    with open("amazon_results.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)

    print("\nunique items:", len(items))
    print("first item:", items[0] if items else None)

This gives you a clean JSON file you can feed into:

  • a price-monitoring job (toy example below)
  • a data warehouse
  • a product discovery tool
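
For example, a toy price-monitoring pass over that file (the $15 and 1,000-rating thresholds are made up for illustration):

import json

with open("amazon_results.json", encoding="utf-8") as f:
    items = json.load(f)

# flag cheap, well-reviewed products
deals = [
    i for i in items
    if i["price"] is not None and i["price"] < 15 and (i["rating_count"] or 0) >= 1000
]
print(len(deals), "candidate deals")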

Making it more stable (practical anti-block checklist)

Amazon stability is a systems problem:

  1. Throttle: don’t hit 10 req/sec on a single IP.
  2. Retries: treat 503/429 as retryable.
  3. Detect blocks: don’t parse CAPTCHA pages.
  4. Rotate IPs: proxies can help reduce per-IP rate.
  5. Persist progress: so a mid-run failure doesn’t waste work (see the throttle + checkpoint sketch below).
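
Here’s a minimal sketch of items 1 and 5 layered onto the Step 4 crawler (the 2-5 second jittered delay and per-page checkpoint file are starting points, not tuned values):

import json
import random
import time

import requests


def crawl_with_checkpoints(start_url: str, pages: int = 3, out_path: str = "amazon_results.json"):
    session = requests.Session()
    seen, out = set(), []
    url = start_url

    for i in range(1, pages + 1):
        res = fetch(session, url)
        for item in parse_search_page(res.text):
            asin = item.get("asin")
            if asin and asin not in seen:
                seen.add(asin)
                out.append(item)

        # checkpoint after every page so a mid-run failure keeps earlier pages
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(out, f, ensure_ascii=False, indent=2)

        url = find_next_page_url(res.text) or set_page(start_url, i + 1)

        # throttle: polite jittered delay between pages
        time.sleep(random.uniform(2.0, 5.0))

    return out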

Where ProxiesAPI fits

ProxiesAPI typically fits at the fetch() layer:

  • you keep your parsing/crawling logic the same
  • you swap the network path to use a rotating proxy endpoint (sketched below)
  • you track success/failure by proxy session
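
One way to wire that in without touching the parser. This is a sketch, not ProxiesAPI’s documented API: the PROXIESAPI_PROXY_URL variable name and the URL shape are assumptions, so check your dashboard for the real endpoint format.

import os

import requests

# Assumption: ProxiesAPI hands you an HTTP(S) proxy endpoint with credentials.
# Keep it out of source control; read it from the environment instead.
PROXY_URL = os.environ.get("PROXIESAPI_PROXY_URL")  # hypothetical env var name


def make_session() -> requests.Session:
    session = requests.Session()
    if PROXY_URL:
        # route both plain and TLS traffic through the rotating endpoint
        session.proxies = {"http": PROXY_URL, "https": PROXY_URL}
    return session

Then swap requests.Session() for make_session() inside crawl_amazon_search() and everything else stays the same.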

If you’re getting blocked constantly, consider moving up the stack:

  • use a browser-based approach (Playwright)
  • reduce request volume
  • or switch to an approved data provider

QA checklist

  • You’re scraping search results, not product detail pages
  • Each row has a non-empty asin + title
  • Pagination increases unique ASIN count
  • You stop/slow down when block pages appear
  • You store results in a file/DB for repeatable runs

Next upgrades

  • Add SQLite storage keyed by asin (sketch below)
  • Add incremental refresh (only re-fetch changed categories)
  • Crawl product detail pages (specs, variations) carefully
  • Add Playwright fallback when HTML is gated
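
The first upgrade is small enough to sketch here, assuming SQLite 3.24+ for the upsert syntax:

import sqlite3


def save_items(items, db_path: str = "amazon.db"):
    # upsert keyed by asin so re-runs refresh prices instead of duplicating rows
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS products (
               asin TEXT PRIMARY KEY,
               title TEXT,
               product_url TEXT,
               price REAL,
               rating REAL,
               rating_count INTEGER
           )"""
    )
    con.executemany(
        """INSERT INTO products VALUES (:asin, :title, :product_url, :price, :rating, :rating_count)
           ON CONFLICT(asin) DO UPDATE SET
               title = excluded.title,
               product_url = excluded.product_url,
               price = excluded.price,
               rating = excluded.rating,
               rating_count = excluded.rating_count""",
        items,
    )
    con.commit()
    con.close()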

Related guides

  • Scrape Vinted Listings with Python: Search, Prices, Images, and Pagination. Build a dataset from Vinted search results (title, price, size, condition, seller, images) with a production-minded Python scraper + a proxy-backed fetch layer via ProxiesAPI.
  • Scrape Product Data from Amazon (with Python + ProxiesAPI). Extract Amazon product title, price, rating, and availability from a product page using requests + BeautifulSoup, with retries and proxy-backed fetching via ProxiesAPI.
  • How to Scrape Amazon Product Data, Reviews, and Prices. A practical blueprint for scraping Amazon product pages and review listings: extract core fields, follow pagination, handle throttling, and detect blocks. Includes ProxiesAPI fetch code and real selectors.
  • Scrape Product Prices from Home Depot (Search + Category Pages) with Python + ProxiesAPI. Extract product name, price, and availability from Home Depot listing pages (search + category) with pagination, resilient parsing, and an anti-block-friendly request layer.