Scrape Products from Amazon (Python) — Title, Price, Rating + Pagination
Amazon is one of the most-requested scraping targets because product data is structured and valuable:
- titles + URLs for discovery
- prices for monitoring
- ratings + review counts for popularity signals
- pagination for scale
But it’s also one of the easiest places to get blocked.
In this tutorial, we’ll build a practical Amazon search-results scraper in Python that extracts:
- title
- product_url
- asin
- price (best-effort)
- rating + rating_count (best-effort)

...and we'll collect these across multiple pages.
We’ll use server-rendered HTML (no browser automation) and structure the code so you can later plug in ProxiesAPI at the network layer.

Amazon is aggressive about bot detection. ProxiesAPI won’t magically bypass everything, but it gives you a consistent proxy layer and rotation so your scraper can retry intelligently instead of dying on the first 503/CAPTCHA.
Important note (CAPTCHAs + legality + ToS)
Amazon may show:
- CAPTCHAs
- “Robot Check” pages
- 503 / throttling
- localized experiences
Scraping may violate Amazon’s Terms of Service and can have legal/compliance implications depending on your use case and jurisdiction.
This guide focuses on:
- how to parse the HTML you receive
- how to detect blocks
- how to build a scraper that fails safely
Use it responsibly.
What we’re scraping (Amazon search structure)
We’ll scrape a search results URL like:
https://www.amazon.com/s?k=wireless+mouse
On typical Amazon SERPs, each product card is a div with:
- data-component-type="s-search-result"
- data-asin="..."
That’s your anchor.
Pagination usually appears as a list with a.s-pagination-item links and a page= parameter.
Quick sanity check (HTML returned)
curl -A "Mozilla/5.0" -s "https://www.amazon.com/s?k=wireless+mouse" | head -n 20
If you see a “Robot Check” form or something like /errors/validateCaptcha, you’re blocked. Don’t waste time parsing those pages.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (with lxml) for parsing
Step 1: A fetch() wrapper with timeouts + retries
Amazon is flaky for bots. You want:
- timeouts (never hang)
- retry with backoff
- block detection
Here’s a minimal but production-shaped wrapper:
import random
import time
from dataclasses import dataclass
import requests

TIMEOUT = (10, 30)  # connect, read

USER_AGENTS = [
    # Keep a small, realistic UA pool (don’t go crazy)
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
]

@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str

def looks_blocked(html: str) -> bool:
    if not html:
        return True
    needles = [
        "Robot Check",
        "Enter the characters you see below",
        "/errors/validateCaptcha",
        "Sorry, we just need to make sure you're not a robot",
    ]
    h = html.lower()
    return any(n.lower() in h for n in needles)

def fetch(session: requests.Session, url: str, max_retries: int = 4) -> FetchResult:
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            headers = {
                "User-Agent": random.choice(USER_AGENTS),
                "Accept-Language": "en-US,en;q=0.9",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Connection": "keep-alive",
            }
            # --- ProxiesAPI integration point ---
            # If ProxiesAPI gives you an HTTP proxy URL (or rotating endpoint),
            # wire it here. Example shape (DO NOT hardcode credentials):
            # proxies = {"http": PROXY_URL, "https": PROXY_URL}
            # r = session.get(url, headers=headers, timeout=TIMEOUT, proxies=proxies)
            # -----------------------------------
            r = session.get(url, headers=headers, timeout=TIMEOUT)
            text = r.text or ""
            # treat obvious block pages as retryable
            if r.status_code in (429, 503) or looks_blocked(text):
                raise RuntimeError(f"blocked_or_throttled status={r.status_code}")
            r.raise_for_status()
            return FetchResult(url=url, status_code=r.status_code, text=text)
        except Exception as e:
            last_exc = e
            sleep_s = min(12, 1.5 ** attempt) + random.random()
            print(f"attempt {attempt}/{max_retries} failed: {e} - sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)
    raise RuntimeError(f"failed to fetch after {max_retries} retries: {url}") from last_exc
That wrapper is intentionally honest:
- it doesn’t claim it can bypass CAPTCHAs
- it just helps you retry and detect blocks
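Before writing any parsing code, it's worth calling the wrapper once to confirm you're getting real HTML back (a minimal check; the search term is just an example):

session = requests.Session()
res = fetch(session, "https://www.amazon.com/s?k=wireless+mouse")
# Expect a 200 and a non-trivial amount of HTML; a tiny body usually means a block page
print(res.status_code, len(res.text))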
Step 2: Parse product cards from the HTML
Now we parse the search-result cards.
Common useful fields:
- data-asin (stable product identifier)
- title link under h2 a
- rating, often under i.a-icon-star-small (varies)
- price, often under span.a-price > span.a-offscreen (varies)
Because Amazon’s DOM varies by category and experiment, we’ll implement:
- primary selectors
- fallbacks
- graceful None values for missing fields
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.amazon.com"

def parse_price(text: str):
    if not text:
        return None
    # e.g. "$19.99" → 19.99
    m = re.search(r"([0-9]+(?:\.[0-9]{2})?)", text.replace(",", ""))
    return float(m.group(1)) if m else None

def parse_int(text: str):
    if not text:
        return None
    m = re.search(r"(\d[\d,]*)", text)
    return int(m.group(1).replace(",", "")) if m else None

def parse_rating(text: str):
    if not text:
        return None
    # e.g. "4.5 out of 5 stars" → 4.5
    m = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*5", text)
    return float(m.group(1)) if m else None

def parse_search_page(html: str):
    soup = BeautifulSoup(html, "lxml")
    results = []
    for card in soup.select('div[data-component-type="s-search-result"]'):
        asin = card.get("data-asin") or None
        if not asin:
            continue

        title_a = card.select_one("h2 a")
        title = title_a.get_text(" ", strip=True) if title_a else None
        href = title_a.get("href") if title_a else None
        product_url = urljoin(BASE, href) if href else None

        # price (best-effort)
        price = None
        price_el = card.select_one("span.a-price > span.a-offscreen")
        if price_el:
            price = parse_price(price_el.get_text(strip=True))

        # rating (best-effort)
        rating = None
        rating_count = None
        rating_el = card.select_one("i.a-icon-star-small span.a-icon-alt") or card.select_one(
            "i.a-icon-star span.a-icon-alt"
        )
        if rating_el:
            rating = parse_rating(rating_el.get_text(" ", strip=True))

        count_el = card.select_one('span[aria-label$="ratings"]')
        if count_el:
            rating_count = parse_int(count_el.get("aria-label", ""))
        else:
            # common fallback: a link next to the rating
            count_link = card.select_one('a[href*="customerReviews"] span')
            if count_link:
                rating_count = parse_int(count_link.get_text(" ", strip=True))

        results.append(
            {
                "asin": asin,
                "title": title,
                "product_url": product_url,
                "price": price,
                "rating": rating,
                "rating_count": rating_count,
            }
        )
    return results
Tip: log a few parsed rows early
When scraping Amazon, your #1 debugging tool is:
- print the first 3 parsed items
- confirm they look sane (see the snippet below)
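A quick sketch of that, using fetch() and parse_search_page() from above (the search term is just an example):

session = requests.Session()
res = fetch(session, "https://www.amazon.com/s?k=wireless+mouse")
items = parse_search_page(res.text)
# Eyeball titles, prices, and ratings before scaling up to multiple pages
for item in items[:3]:
    print(item)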
Step 3: Find the next page URL (pagination)
Amazon pagination links vary, but you usually have a page= query parameter.
We’ll implement two approaches:
- Prefer a “Next” button.
- Fallback: if you know the page number, construct &page=N.
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, parse_qs, urlencode

def find_next_page_url(html: str):
    soup = BeautifulSoup(html, "lxml")
    # Approach 1: explicit Next link
    next_a = soup.select_one("a.s-pagination-next")
    if next_a and next_a.get("href"):
        return urljoin(BASE, next_a.get("href"))
    return None

def set_page(url: str, page: int) -> str:
    # Simple fallback: append/replace the page parameter
    parsed = urlparse(url)
    q = parse_qs(parsed.query)
    q["page"] = [str(page)]
    # rebuild the query with urlencode so values like "wireless mouse" stay properly escaped
    base = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    return base + "?" + urlencode(q, doseq=True)
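A quick sanity check of the fallback (the parameter order in the rebuilt URL may differ, but page should be set):

print(set_page("https://www.amazon.com/s?k=wireless+mouse", 3))
# e.g. https://www.amazon.com/s?k=wireless+mouse&page=3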
Step 4: Crawl multiple pages (dedupe by ASIN)
Now we combine everything:
- fetch first page
- parse cards
- resolve next page
- repeat
import json

def crawl_amazon_search(start_url: str, pages: int = 3):
    session = requests.Session()
    seen = set()
    out = []
    url = start_url
    for i in range(1, pages + 1):
        print(f"\n=== page {i}: {url}")
        res = fetch(session, url)
        batch = parse_search_page(res.text)
        print("items parsed:", len(batch))
        for item in batch:
            asin = item.get("asin")
            if not asin or asin in seen:
                continue
            seen.add(asin)
            out.append(item)
        # try “Next”
        nxt = find_next_page_url(res.text)
        if nxt:
            url = nxt
        else:
            # fallback: if Next not found, try forcing &page=
            url = set_page(start_url, i + 1)
    return out

if __name__ == "__main__":
    start = "https://www.amazon.com/s?k=wireless+mouse"
    items = crawl_amazon_search(start, pages=5)
    with open("amazon_results.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    print("\nunique items:", len(items))
    print("first item:", items[0] if items else None)
This gives you a clean JSON file you can feed into:
- a price-monitoring job
- a data warehouse
- a product discovery tool
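For example, a quick sanity check over the output file from the run above (a small sketch; it assumes at least some rows have a parsed price):

import json

with open("amazon_results.json", encoding="utf-8") as f:
    rows = json.load(f)

# Show the five cheapest items that actually have a parsed price
priced = sorted((r for r in rows if r["price"] is not None), key=lambda r: r["price"])
for r in priced[:5]:
    print(f'{r["price"]:>8.2f}  {r["asin"]}  {(r["title"] or "")[:60]}')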
Making it more stable (practical anti-block checklist)
Amazon stability is a systems problem:
- Throttle: don’t hit 10 req/sec on a single IP.
- Retries: treat 503/429 as retryable.
- Detect blocks: don’t parse CAPTCHA pages.
- Rotate IPs: proxies can help reduce per-IP rate.
- Persist progress: so a mid-run failure doesn’t waste work (this and the throttling delay are sketched below).
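Throttling and progress persistence drop straight into the page loop of crawl_amazon_search(). A minimal sketch, with checkpoint_and_sleep() as a hypothetical helper name and arbitrary filename/delay choices:

import json
import random
import time

def checkpoint_and_sleep(out, path="amazon_results.partial.json", lo=3.0, hi=8.0):
    # Persist partial progress so a mid-run failure doesn't lose earlier pages
    with open(path, "w", encoding="utf-8") as f:
        json.dump(out, f, ensure_ascii=False, indent=2)
    # Polite, jittered delay before requesting the next page
    time.sleep(random.uniform(lo, hi))

# Call checkpoint_and_sleep(out) at the end of each iteration of the page loop
# in crawl_amazon_search().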
Where ProxiesAPI fits
ProxiesAPI typically fits at the fetch() layer:
- you keep your parsing/crawling logic the same
- you swap the network path to use a rotating proxy endpoint
- you track success/failure by proxy session
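Concretely, that usually means building the proxies dict once and passing it to session.get() inside fetch(). A minimal sketch, assuming ProxiesAPI gives you a rotating HTTP proxy endpoint that you keep in an environment variable (PROXIES_API_URL is a hypothetical name, not an official setting):

import os

# Hypothetical env var holding your rotating proxy endpoint; never hardcode credentials
PROXY_URL = os.environ.get("PROXIES_API_URL")
PROXIES = {"http": PROXY_URL, "https": PROXY_URL} if PROXY_URL else None

# Inside fetch(), replace the plain session.get(...) call with:
# r = session.get(url, headers=headers, timeout=TIMEOUT, proxies=PROXIES)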
If you’re getting blocked constantly, consider moving up the stack:
- use a browser-based approach (Playwright)
- reduce request volume
- or switch to an approved data provider
QA checklist
- You’re scraping search results, not product detail pages
- Each row has a non-empty asin + title
- You stop/slow down when block pages appear
- You store results in a file/DB for repeatable runs
Next upgrades
- Add SQLite storage keyed by asin (a small sketch follows this list)
- Add incremental refresh (only re-fetch changed categories)
- Crawl product detail pages (specs, variations) carefully
- Add Playwright fallback when HTML is gated
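For the first of those, a minimal sketch of SQLite storage keyed by asin, loading the amazon_results.json produced above (table and file names are arbitrary; the upsert syntax needs SQLite 3.24+):

import json
import sqlite3

conn = sqlite3.connect("amazon.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS products (
        asin TEXT PRIMARY KEY,
        title TEXT,
        product_url TEXT,
        price REAL,
        rating REAL,
        rating_count INTEGER
    )
    """
)

with open("amazon_results.json", encoding="utf-8") as f:
    rows = json.load(f)

# Upsert keyed by asin so re-runs refresh prices/ratings in place
conn.executemany(
    """
    INSERT INTO products (asin, title, product_url, price, rating, rating_count)
    VALUES (:asin, :title, :product_url, :price, :rating, :rating_count)
    ON CONFLICT(asin) DO UPDATE SET
        title = excluded.title,
        product_url = excluded.product_url,
        price = excluded.price,
        rating = excluded.rating,
        rating_count = excluded.rating_count
    """,
    rows,
)
conn.commit()
conn.close()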