How to Scrape Etsy Product Listings with Python (ProxiesAPI + Pagination)

Etsy search pages are one of the most common “I need this for my price tracker / product research / competitor monitor” targets.

They’re also a classic example of why scraping needs more than just parsing HTML:

  • requests get throttled quickly when you paginate
  • HTML changes (A/B tests)
  • you’ll see intermittent 403/429 responses

In this guide we’ll build a practical Etsy search scraper in Python that:

  • fetches multiple search pages (pagination)
  • extracts listings: title, price, rating, review count, shop name, listing URL
  • uses ProxiesAPI for a stable network layer (rotation + fewer blocks)
  • exports to JSONL/CSV for downstream pipelines

Etsy search results (we’ll scrape listing cards + pagination)

Make Etsy scraping stable with ProxiesAPI

Marketplace pages block aggressively at scale. ProxiesAPI gives you a clean, rotating proxy layer + retries so your scraper fails less and needs less babysitting.


What we’re scraping (Etsy search pages)

Example search URL:

https://www.etsy.com/search?q=linen%20shirt

Pagination is typically done via a ref=pagination link and/or a page= query param. In practice you’ll encounter URLs like:

  • page 1: https://www.etsy.com/search?q=linen%20shirt
  • page 2: https://www.etsy.com/search?q=linen%20shirt&page=2

Your first job is to verify how the site behaves today.

Quick sanity check

curl -I "https://www.etsy.com/search?q=linen%20shirt" | head -n 5

If you get 403/429 intermittently, that’s normal at higher volumes — which is exactly where a proxy layer helps.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml python-dotenv

We’ll use:

  • requests for HTTP
  • BeautifulSoup with the lxml parser for the server-rendered HTML
  • python-dotenv for environment config

Create a .env file:

PROXIESAPI_KEY="YOUR_KEY_HERE"

ProxiesAPI request helper (retries + timeouts)

A “toy” scraper dies on the first flaky response.

A production scraper treats the network as unreliable:

  • always set timeouts
  • retry transient failures
  • rotate IPs when blocked

Below is a simple helper that sends requests through ProxiesAPI.

Note: ProxiesAPI has multiple integration modes. This example uses a proxy endpoint style where you pass your destination URL as a parameter. If your account uses a different pattern, keep the retry logic and replace only the URL construction.

import os
import time
import urllib.parse

import requests
from dotenv import load_dotenv

load_dotenv()  # pull PROXIESAPI_KEY from .env into the environment

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY", "")

TIMEOUT = (10, 30)  # connect, read

session = requests.Session()
session.headers.update({
    # Keep this modest. Overly-botty headers don’t magically fix blocking.
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
})


def proxiesapi_url(target_url: str) -> str:
    # Common pattern: https://api.proxiesapi.com/?auth_key=...&url=...
    qs = urllib.parse.urlencode({
        "auth_key": PROXIESAPI_KEY,
        "url": target_url,
    })
    return f"https://api.proxiesapi.com/?{qs}"


def fetch_html(url: str, retries: int = 5) -> str:
    last_exc = None

    for attempt in range(1, retries + 1):
        try:
            r = session.get(proxiesapi_url(url), timeout=TIMEOUT)

            # Treat common anti-bot responses as retryable.
            if r.status_code in (403, 429, 500, 502, 503, 504):
                wait = min(2 ** attempt, 20)
                time.sleep(wait)
                continue

            r.raise_for_status()
            return r.text

        except requests.RequestException as e:
            last_exc = e
            wait = min(2 ** attempt, 20)
            time.sleep(wait)

    raise RuntimeError(f"Failed to fetch after {retries} tries: {url}") from last_exc

This isn’t fancy, but it’s the difference between “works on my laptop once” and “runs every day”.


Step 1: Identify stable selectors on Etsy

Etsy’s markup changes, and it often includes multiple list formats.

The safest approach is:

  1. find the listing card container selector that returns many results
  2. within each card, extract fields defensively (some are missing)
  3. never assume price/rating exists

Today, Etsy search results are usually rendered with listing cards that contain:

  • a link to the listing (often an <a> with /listing/ in the href)
  • a title element (sometimes h3)
  • a price element near a currency symbol
  • rating/review counts (if present)

We’ll use “pattern selectors” and validate outputs.


Step 2: Parse listing cards

import re
from bs4 import BeautifulSoup

BASE = "https://www.etsy.com"


def clean_text(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "")).strip()


def parse_price(text: str) -> str | None:
    # Keep as a string so you don’t lose currency, decimals, etc.
    t = clean_text(text)
    return t if t else None


def parse_rating(text: str) -> float | None:
    # Example: "4.8 out of 5 stars"
    m = re.search(r"(\d+(?:\.\d+)?)", text or "")
    return float(m.group(1)) if m else None


def parse_review_count(text: str) -> int | None:
    # Example: "(1,234)" or "123"
    if not text:
        return None
    t = text.replace(",", "")
    m = re.search(r"(\d+)", t)
    return int(m.group(1)) if m else None


def parse_search_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Strategy:
    # - Find links that look like listing URLs
    # - Walk up to a reasonable card container
    # Etsy is dynamic; this is intentionally resilient, not pretty.

    listing_links = soup.select('a[href*="/listing/"]')

    seen = set()
    out = []

    for a in listing_links:
        href = a.get("href") or ""
        if "/listing/" not in href:
            continue

        # Normalize to absolute URL
        url = href if href.startswith("http") else f"{BASE}{href}"

        # De-dupe: same listing link appears multiple times in a card
        m = re.search(r"/listing/(\d+)", url)
        listing_id = m.group(1) if m else url
        if listing_id in seen:
            continue
        seen.add(listing_id)

        # Heuristic: listing title is usually inside the same card.
        card = a
        for _ in range(6):
            if not card:
                break
            # Stop climbing when we hit a list item/article-ish container.
            if card.name in ("li", "article", "div"):
                # Cards often have data-listing-id or similar.
                if card.get("data-listing-id") or "listing" in " ".join(card.get("class", [])):
                    break
            card = card.parent

        container = card or a.parent

        title = None
        # Try common patterns
        h = container.select_one("h3") if container else None
        if h:
            title = clean_text(h.get_text(" ", strip=True))
        if not title:
            title = clean_text(a.get_text(" ", strip=True))

        # Price: find first element with a currency-ish pattern.
        price = None
        if container:
            price_el = container.select_one('[data-buy-box-region="price"], .currency-value')
            if price_el:
                price = parse_price(price_el.get_text(" ", strip=True))

        if not price and container:
            text = container.get_text(" ", strip=True)
            mprice = re.search(r"([$€£₹]\s*\d[\d,]*(?:\.\d{1,2})?)", text)
            price = mprice.group(1) if mprice else None

        # Rating + reviews
        rating = None
        reviews = None
        shop = None

        if container:
            # Rating often appears in aria-label on a star element
            star = container.select_one('[aria-label*="out of 5"]')
            if star:
                rating = parse_rating(star.get("aria-label", ""))

            # Review count may be near rating or in parentheses
            rt = container.get_text(" ", strip=True)
            mrevs = re.search(r"\((\d[\d,]*)\)", rt)
            reviews = parse_review_count(mrevs.group(1)) if mrevs else None

            # Shop name is commonly shown as a small label; we’ll use a soft heuristic.
            shop_el = container.select_one('p:has(a[href*="/shop/"])')
            if shop_el:
                shop_a = shop_el.select_one('a[href*="/shop/"]')
                if shop_a:
                    shop = clean_text(shop_a.get_text(" ", strip=True))

        out.append({
            "listing_id": listing_id,
            "title": title or None,
            "price": price,
            "rating": rating,
            "review_count": reviews,
            "shop": shop,
            "url": url,
        })

    # Filter obvious junk: keep entries that have URL + at least title.
    out = [x for x in out if x.get("url") and x.get("title")]

    return out

This parser uses heuristics because Etsy’s DOM isn’t a stable “API”. That’s the point: you want something that survives minor structure changes.


Step 3: Pagination (crawl multiple pages)

import urllib.parse


def build_search_url(query: str, page: int) -> str:
    qs = urllib.parse.urlencode({"q": query, "page": page})
    return f"https://www.etsy.com/search?{qs}"


def crawl_search(query: str, pages: int = 3) -> list[dict]:
    all_items = []
    seen = set()

    for p in range(1, pages + 1):
        url = build_search_url(query, p)
        html = fetch_html(url)
        batch = parse_search_page(html)

        for item in batch:
            lid = item.get("listing_id")
            if not lid or lid in seen:
                continue
            seen.add(lid)
            all_items.append(item)

        print(f"page {p}: {len(batch)} items, total unique: {len(all_items)}")

        # polite delay (even with proxies)
        time.sleep(1.5)

    return all_items


if __name__ == "__main__":
    items = crawl_search("linen shirt", pages=5)
    print("total:", len(items))
    print(items[0] if items else None)

Export: JSONL + CSV

import csv
import json


def export_jsonl(path: str, rows: list[dict]):
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


def export_csv(path: str, rows: list[dict]):
    if not rows:
        return
    cols = list(rows[0].keys())
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        w.writeheader()
        for r in rows:
            w.writerow(r)


if __name__ == "__main__":
    items = crawl_search("linen shirt", pages=3)
    export_jsonl("etsy_listings.jsonl", items)
    export_csv("etsy_listings.csv", items)
    print("wrote", len(items))

Common failure modes (and how to handle them)

1) 403/429 spikes after page 1

  • reduce concurrency
  • add backoff (already in fetch_html)
  • rotate IPs (ProxiesAPI)
  • store a “blocked” sample HTML so you can detect it programmatically
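The last bullet can be turned into a small detector. A minimal sketch — the marker strings below are assumptions, so replace them with phrases copied from a blocked response you actually saved:

```python
# Hypothetical block-page detector. The marker strings are assumptions --
# swap in phrases taken from a real blocked response you stored to disk.
BLOCK_MARKERS = (
    "captcha",
    "access denied",
    "unusual traffic",
)


def looks_blocked(html: str) -> bool:
    """Heuristic: does this HTML look like an anti-bot page, not results?"""
    lowered = html.lower()
    # Very short pages with no listing links are suspicious on their own.
    if len(lowered) < 2000 and "/listing/" not in lowered:
        return True
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

Call this on the HTML right after fetch_html returns; if it fires, treat the page as retryable instead of parsing junk.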

2) Missing price/rating/shop fields

Normal. Not every listing shows all metadata in search cards.

For a high-quality dataset, do a 2-step crawl:

  1. scrape search pages → collect listing URLs
  2. visit listing detail pages → extract canonical fields
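A sketch of the second step. It assumes a detail page exposes the title in an h1 and the price in the same [data-buy-box-region="price"] region the search parser already tries — both selectors are assumptions to verify in the browser. (html.parser is used here so the sketch runs without lxml; the lxml parser from the rest of the guide works the same way.)

```python
from bs4 import BeautifulSoup


def parse_listing_detail(html: str) -> dict:
    """Extract canonical fields from a listing detail page.

    Selector choices here are assumptions -- confirm them against a
    live listing page before relying on this in a pipeline.
    """
    soup = BeautifulSoup(html, "html.parser")
    title_el = soup.select_one("h1")
    price_el = soup.select_one('[data-buy-box-region="price"]')
    return {
        "title": title_el.get_text(" ", strip=True) if title_el else None,
        "price": price_el.get_text(" ", strip=True) if price_el else None,
    }
```

Feed it the listing URLs collected in step 1 via fetch_html, keeping the same polite delay between requests.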

3) HTML changes

Build a small validation layer:

  • if a page returns < 5 listings, flag it
  • store the HTML to disk for debugging
  • keep selectors in one file so changes are easy
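The first two bullets can be sketched as a small gate you run after each parse. MIN_LISTINGS and the debug directory name are arbitrary choices here, not anything Etsy-specific:

```python
import pathlib
import time

MIN_LISTINGS = 5  # assumption: tune this to what page 1 normally returns


def validate_batch(url: str, html: str, items: list[dict],
                   debug_dir: str = "debug_html") -> bool:
    """Flag suspiciously small batches and keep the HTML for debugging."""
    if len(items) >= MIN_LISTINGS:
        return True
    out_dir = pathlib.Path(debug_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fname = f"suspect_{int(time.time())}.html"
    (out_dir / fname).write_text(html, encoding="utf-8")
    print(f"WARNING: only {len(items)} listings from {url}; saved {fname}")
    return False
```

Wire it into crawl_search right after parse_search_page, and you get both the alert and the saved HTML in one step.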

Where ProxiesAPI fits (honestly)

You can scrape Etsy without proxies for small experiments.

But if you’re doing:

  • hundreds/thousands of listing pages
  • daily refreshes
  • multiple search terms

…a rotating proxy layer becomes the difference between “randomly breaks” and “reliable pipeline”.


QA checklist

  • page 1 returns a realistic number of listings
  • pagination increases unique listing count
  • you’re exporting valid JSONL/CSV
  • retries/backoff trigger on 403/429
  • you can spot-check 5 listings manually in the browser