Scrape Shopee Product Listings with Python (ProxiesAPI)

Shopee is one of the most popular e-commerce marketplaces in Southeast Asia, which makes it a common target for price monitoring, catalog intelligence, and availability tracking.

The catch: Shopee pages can be JS-heavy, and their behavior can be inconsistent across regions. The goal of this tutorial is to build a scraper that works when Shopee returns usable HTML, and to do it in a way that’s production-shaped:

  • robust HTTP fetching (timeouts + retries)
  • parsing with real selectors (and fallbacks)
  • clean output
  • CSV export
  • a screenshot of the target website (so you can visually confirm what you’re scraping)

Shopee product page (we’ll extract title, price, and sold count)

Make Shopee scraping more reliable with ProxiesAPI

Shopee is a high-demand e-commerce target. ProxiesAPI gives you a simple way to route requests through proxies and keep your scraper stable as you scale to more products and more categories.


What we’re scraping (and what we’re not)

Shopee has multiple surfaces:

  • Product detail pages (PDP): title, price, sold count, rating, variants
  • Category / search pages: many items, but often rendered client-side

In this guide we’ll focus on product pages because:

  1. they’re easier to validate (you know what a given product should say)
  2. they’re the right unit for monitoring (you usually track specific SKUs)

We’ll scrape:

  • title
  • price
  • currency (when available)
  • sold count (e.g., “2.3k sold”)
  • canonical_url

A note on “listings”

The phrase “product listings” usually suggests category pages, but in practice Shopee listing data is most reliably extracted from individual product pages.

If you specifically need category listings (many products), you typically have to:

  • call Shopee’s internal APIs (often signed)
  • or run a browser (Playwright) to render JS

This post stays on the honest side: HTML product pages via ProxiesAPI.


Requirements

  • Python 3.10+
  • A ProxiesAPI key

Install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml python-dotenv

Create a .env:

PROXIESAPI_KEY="YOUR_KEY"

Step 1: A reliable fetch layer using ProxiesAPI

ProxiesAPI works by requesting:

http://api.proxiesapi.com/?auth_key=KEY&url=TARGET_URL

We’ll wrap that in a fetch function with:

  • connect/read timeouts
  • retry with exponential backoff
  • a realistic User-Agent

import os
import time
import random
import urllib.parse

import requests
from dotenv import load_dotenv

load_dotenv()

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")
if not PROXIESAPI_KEY:
    raise RuntimeError("Missing PROXIESAPI_KEY in environment")

PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/"

TIMEOUT = (15, 45)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})


def proxiesapi_url(target_url: str) -> str:
    return (
        f"{PROXIESAPI_ENDPOINT}?auth_key={urllib.parse.quote(PROXIESAPI_KEY)}"
        f"&url={urllib.parse.quote(target_url, safe='')}"
    )


def fetch_html(url: str, *, retries: int = 4) -> str:
    last_err = None

    for attempt in range(1, retries + 1):
        try:
            r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
            r.raise_for_status()

            html = r.text
            if len(html) < 5000:
                # Shopee pages can be large; very small responses are often blocks/errors.
                raise RuntimeError(f"Response too small ({len(html)} bytes)")

            return html
        except Exception as e:
            last_err = e
            if attempt == retries:
                break

            sleep = (2 ** attempt) + random.uniform(0.0, 0.6)
            print(f"attempt {attempt} failed: {e} — sleeping {sleep:.1f}s")
            time.sleep(sleep)

    raise RuntimeError(f"Failed to fetch {url}: {last_err}")

Quick sanity check

Pick a Shopee product page from the region you care about (example domains include shopee.sg, shopee.ph, shopee.co.th).

html = fetch_html("https://shopee.sg/")
print("bytes:", len(html))
print(html[:200])

If this fails, it usually means:

  • the page is fully client-rendered for that region
  • your target is geo-dependent
  • you’re getting a bot-check page

In that case, switch to a specific product URL you can open in a normal browser.
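To catch the homepage-vs-product mistake early, you can sanity-check URLs before fetching them. This is a hedged sketch: the `-i.<shopid>.<itemid>` suffix is a common shape for Shopee product URLs, but confirm it against real URLs from the region you target.

```python
import re

# Hypothetical pattern: many Shopee product URLs end in "-i.<shopid>.<itemid>".
# Verify this against your region before relying on it.
PRODUCT_RE = re.compile(r"-i\.(\d+)\.(\d+)(?:\?|$)")


def looks_like_product_url(url: str) -> bool:
    """Best-effort check that a URL points at a product page, not a homepage."""
    return bool(PRODUCT_RE.search(url))


print(looks_like_product_url("https://shopee.sg/Some-Gadget-i.12345.67890"))  # True
print(looks_like_product_url("https://shopee.sg/"))                           # False
```

Run this on your URL list before you spend fetch quota on pages that can’t contain product data.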


Step 2: Extract data from the HTML (realistic approach)

Shopee’s HTML structure can vary, but there are two common places to look:

  1. Open Graph / meta tags (often stable)
  2. Embedded JSON (state blobs)

We’ll implement both.

2.1 Parse common meta tags

import re
import json
from bs4 import BeautifulSoup


def text_or_none(el):
    return el.get_text(strip=True) if el else None


def attr_or_none(el, attr: str):
    return el.get(attr) if el and el.has_attr(attr) else None


def parse_meta(soup: BeautifulSoup) -> dict:
    def meta(name=None, prop=None):
        if name:
            return soup.select_one(f"meta[name='{name}']")
        if prop:
            return soup.select_one(f"meta[property='{prop}']")
        return None

    title = attr_or_none(meta(prop="og:title"), "content")
    url = attr_or_none(meta(prop="og:url"), "content")
    price = attr_or_none(meta(prop="product:price:amount"), "content")
    currency = attr_or_none(meta(prop="product:price:currency"), "content")

    return {
        "title": title,
        "canonical_url": url,
        "price": price,
        "currency": currency,
    }

2.2 Extract embedded JSON when present

Many modern e-commerce pages embed a JSON blob (for hydration).

On Shopee, a practical technique is:

  • search for <script type="application/ld+json"> (structured data)
  • search for any script tags that contain product-like keys

def parse_ld_json(soup: BeautifulSoup) -> dict:
    out = {}
    for s in soup.select("script[type='application/ld+json']"):
        try:
            data = json.loads(s.get_text(strip=True))
        except Exception:
            continue

        # Sometimes it's a list, sometimes an object
        if isinstance(data, list):
            candidates = data
        else:
            candidates = [data]

        for obj in candidates:
            if not isinstance(obj, dict):
                continue
            if obj.get("@type") in ("Product", "ItemPage") or "offers" in obj:
                out["ldjson"] = obj
                # Try to read price
                offers = obj.get("offers")
                if isinstance(offers, dict):
                    out["price"] = offers.get("price") or out.get("price")
                    out["currency"] = offers.get("priceCurrency") or out.get("currency")
                out["title"] = obj.get("name") or out.get("title")
                return out

    return out
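For reference, here is the kind of JSON-LD object the function above is looking for. The field names follow schema.org’s Product/Offer vocabulary; the example values are made up.

```python
import json

# A made-up example of the schema.org Product shape the parser targets.
blob = json.loads("""
{
  "@type": "Product",
  "name": "Demo Gadget",
  "offers": {"price": "19.90", "priceCurrency": "SGD"}
}
""")

offers = blob.get("offers") or {}
print(blob.get("name"), offers.get("price"), offers.get("priceCurrency"))
# → Demo Gadget 19.90 SGD
```

If a page embeds this block, you get clean, machine-readable price data without touching Shopee’s CSS classes at all.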

2.3 Sold count (best-effort)

“Sold count” is frequently rendered as text like:

  • "2.1k sold"
  • "12 sold"

If it’s in the HTML, we can extract it with a regex search.


def parse_sold_count(html: str) -> str | None:
    # Keep it conservative: capture the most obvious pattern.
    m = re.search(r"\b(\d+(?:\.\d+)?\s*(?:k|m)?\s*)sold\b", html, flags=re.I)
    if not m:
        return None
    return m.group(0).strip()

Step 3: Build a complete product scraper

Now we combine the fetch + parse layers into a function that takes a list of product URLs and returns normalized rows.

from datetime import datetime


def scrape_shopee_products(urls: list[str]) -> list[dict]:
    rows = []

    for url in urls:
        html = fetch_html(url)
        soup = BeautifulSoup(html, "lxml")

        meta = parse_meta(soup)
        ld = parse_ld_json(soup)

        title = ld.get("title") or meta.get("title")
        price = ld.get("price") or meta.get("price")
        currency = ld.get("currency") or meta.get("currency")
        canonical_url = meta.get("canonical_url") or url
        sold = parse_sold_count(html)

        rows.append({
            "input_url": url,
            "canonical_url": canonical_url,
            "title": title,
            "price": price,
            "currency": currency,
            "sold": sold,
            "scraped_at": datetime.utcnow().isoformat() + "Z",
        })

    return rows

Example run

urls = [
    "https://shopee.sg/",  # replace with a real product URL
]

rows = scrape_shopee_products(urls)
print(rows[0])

Step 4: Export to CSV

import csv


def export_csv(rows: list[dict], path: str = "shopee_products.csv") -> None:
    if not rows:
        raise ValueError("No rows to export")

    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)

    print("wrote", path, "rows:", len(rows))

Put it together:

if __name__ == "__main__":
    urls = [
        # Use real Shopee product URLs for your region.
        "https://shopee.sg/",
    ]

    rows = scrape_shopee_products(urls)
    export_csv(rows)

Practical tips for scraping Shopee without getting blocked

  1. Use product pages, not search pages. Search/category often requires JS.
  2. Throttle requests. Even with proxies, hitting hundreds of pages/minute is asking for captchas.
  3. Cache results. If you re-scrape the same URL hourly, store raw HTML or parsed JSON to avoid waste.
  4. Validate data. Spot-check 10 products in a browser and compare.
  5. Handle “empty HTML”. Very small responses are often blocks; treat them as retriable errors.
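Tip 2 can be as simple as a randomized pause between fetches. Here’s a minimal sketch; the 2–5 second window is an assumed starting point, not a Shopee-specific recommendation.

```python
import time
import random


def polite_pause(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Sleep for a randomized interval between requests; return the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


# In the scrape loop:
# for url in urls:
#     html = fetch_html(url)
#     ...
#     polite_pause()
```

The jitter matters: fixed intervals produce a machine-regular request pattern that is easy to flag.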

Where ProxiesAPI helps (honest version)

ProxiesAPI doesn’t magically make every Shopee page scrapeable.

What it does help with:

  • routing through proxies without you managing a proxy pool
  • keeping your request layer consistent across sites
  • improving resilience when your crawler runs at scale

If you hit a wall on a specific Shopee surface (especially category/search pages), the next step is usually a browser-based approach (Playwright) or a dedicated API integration.


QA checklist

  • Open your product URL in a normal browser and confirm title/price/sold exist
  • Fetch via ProxiesAPI and confirm len(html) is not tiny
  • Print extracted fields for 3–5 products
  • Export CSV and open it (values in correct columns)
  • Add retry/backoff logs to monitor failures
