Scrape Shopee Product Listings with Python (ProxiesAPI)
Shopee is one of the most popular e-commerce marketplaces in Southeast Asia, which makes it a common target for price monitoring, catalog intelligence, and availability tracking.
The catch: Shopee pages can be JS-heavy and they can be inconsistent by region. The goal of this tutorial is to build a scraper that works when Shopee returns usable HTML, and to do it in a way that’s production-shaped:
- robust HTTP fetching (timeouts + retries)
- parsing with real selectors (and fallbacks)
- clean output
- CSV export
- a screenshot of the target website (so you can visually confirm what you’re scraping)

Shopee is a high-demand e-commerce target. ProxiesAPI gives you a simple way to route requests through proxies and keep your scraper stable as you scale to more products and more categories.
What we’re scraping (and what we’re not)
Shopee has multiple surfaces:
- Product detail pages (PDP): title, price, sold count, rating, variants
- Category / search pages: many items, but often rendered client-side
In this guide we’ll focus on product pages because:
- they’re easier to validate (you know what a given product should say)
- they’re the right unit for monitoring (you usually track specific SKUs)
We’ll scrape:
- title
- price
- currency (when available)
- sold count (e.g., “2.3k sold”)
- canonical_url
A note on “listings”
The title says “product listings”, but in practice Shopee “listing” data is most reliably extracted from product pages.
If you specifically need category listings (many products), you typically have to:
- call Shopee’s internal APIs (often signed)
- or run a browser (Playwright) to render JS
This post stays on the honest side: HTML product pages via ProxiesAPI.
Requirements
- Python 3.10+
- A ProxiesAPI key
Install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml python-dotenv
Create a .env:
PROXIESAPI_KEY="YOUR_KEY"
Step 1: A reliable fetch layer using ProxiesAPI
ProxiesAPI works by requesting:
http://api.proxiesapi.com/?auth_key=KEY&url=TARGET_URL
We’ll wrap that in a fetch function with:
- connect/read timeouts
- retry with exponential backoff
- a realistic User-Agent
import os
import time
import random
import urllib.parse

import requests
from dotenv import load_dotenv

load_dotenv()

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")
if not PROXIESAPI_KEY:
    raise RuntimeError("Missing PROXIESAPI_KEY in environment")

PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/"
TIMEOUT = (15, 45)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

def proxiesapi_url(target_url: str) -> str:
    return (
        f"{PROXIESAPI_ENDPOINT}?auth_key={urllib.parse.quote(PROXIESAPI_KEY)}"
        f"&url={urllib.parse.quote(target_url, safe='')}"
    )

def fetch_html(url: str, *, retries: int = 4) -> str:
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
            r.raise_for_status()
            html = r.text
            if len(html) < 5000:
                # Shopee pages can be large; very small responses are often blocks/errors.
                raise RuntimeError(f"Response too small ({len(html)} bytes)")
            return html
        except Exception as e:
            last_err = e
            if attempt == retries:
                break
            sleep = (2 ** attempt) + random.uniform(0.0, 0.6)
            print(f"attempt {attempt} failed: {e} — sleeping {sleep:.1f}s")
            time.sleep(sleep)
    raise RuntimeError(f"Failed to fetch {url}: {last_err}")
Quick sanity check
Pick a Shopee product page from the region you care about (example domains include shopee.sg, shopee.ph, shopee.co.th).
html = fetch_html("https://shopee.sg/")
print("bytes:", len(html))
print(html[:200])
If this fails, it usually means:
- the page is fully client-rendered for that region
- your target is geo-dependent
- you’re getting a bot-check page
In that case, switch to a specific product URL you can open in a normal browser.
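You can fold that triage into a small helper and treat a positive result as a retriable error. This is a sketch: the marker strings are assumptions, so inspect a real blocked response and adjust them.

```python
# Heuristic check for blocked / bot-check responses.
# The marker strings are assumptions -- tune them against real blocked pages.
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied")

def looks_blocked(html: str, min_bytes: int = 5000) -> bool:
    """True if the response is suspiciously small or contains bot-check phrases."""
    if len(html) < min_bytes:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

Call it right after fetch_html returns; if it reports True, back off and retry rather than parsing garbage.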
Step 2: Extract data from the HTML (realistic approach)
Shopee’s HTML structure can vary, but there are two common places to look:
- Open Graph / meta tags (often stable)
- Embedded JSON (state blobs)
We’ll implement both.
2.1 Parse common meta tags
import re
import json

from bs4 import BeautifulSoup

def text_or_none(el):
    return el.get_text(strip=True) if el else None

def attr_or_none(el, attr: str):
    return el.get(attr) if el and el.has_attr(attr) else None

def parse_meta(soup: BeautifulSoup) -> dict:
    def meta(name=None, prop=None):
        if name:
            return soup.select_one(f"meta[name='{name}']")
        if prop:
            return soup.select_one(f"meta[property='{prop}']")
        return None

    title = attr_or_none(meta(prop="og:title"), "content")
    url = attr_or_none(meta(prop="og:url"), "content")
    price = attr_or_none(meta(prop="product:price:amount"), "content")
    currency = attr_or_none(meta(prop="product:price:currency"), "content")

    return {
        "title": title,
        "canonical_url": url,
        "price": price,
        "currency": currency,
    }
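To sanity-check the meta-tag approach without a network call, you can run a synthetic snippet through the stdlib HTMLParser. The values below are made up; they just mirror the tags parse_meta looks for.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect property/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        if "property" in d and "content" in d:
            self.meta[d["property"]] = d["content"]

# Synthetic head section mirroring a product page's Open Graph tags.
sample = (
    '<head>'
    '<meta property="og:title" content="Example Widget">'
    '<meta property="product:price:amount" content="12.90">'
    '<meta property="product:price:currency" content="SGD">'
    '</head>'
)
p = MetaCollector()
p.feed(sample)
print(p.meta)
```

If a real page yields an empty dict here, the region you are fetching probably serves a client-rendered shell.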
2.2 Extract embedded JSON when present
Many modern e-commerce pages embed a JSON blob (for hydration).
On Shopee, a practical technique is:
- search for <script type="application/ld+json"> (structured data)
- search for any script tags that contain product-like keys
def parse_ld_json(soup: BeautifulSoup) -> dict:
    out = {}
    for s in soup.select("script[type='application/ld+json']"):
        try:
            data = json.loads(s.get_text(strip=True))
        except Exception:
            continue
        # Sometimes it's a list, sometimes an object
        if isinstance(data, list):
            candidates = data
        else:
            candidates = [data]
        for obj in candidates:
            if not isinstance(obj, dict):
                continue
            if obj.get("@type") in ("Product", "ItemPage") or "offers" in obj:
                out["ldjson"] = obj
                # Try to read price
                offers = obj.get("offers")
                if isinstance(offers, dict):
                    out["price"] = offers.get("price") or out.get("price")
                    out["currency"] = offers.get("priceCurrency") or out.get("currency")
                out["title"] = obj.get("name") or out.get("title")
                return out
    return out
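Here is the shape of data this targets, shown on a synthetic JSON-LD blob (the values are made up, but the keys follow the schema.org Product/Offer convention):

```python
import json

# A synthetic Product blob like the ones embedded in
# <script type="application/ld+json"> on many e-commerce pages.
ldjson = '''
{
  "@type": "Product",
  "name": "Example Widget",
  "offers": {"price": "12.90", "priceCurrency": "SGD"}
}
'''

obj = json.loads(ldjson)
offers = obj.get("offers") or {}
row = {
    "title": obj.get("name"),
    "price": offers.get("price"),
    "currency": offers.get("priceCurrency"),
}
print(row)  # {'title': 'Example Widget', 'price': '12.90', 'currency': 'SGD'}
```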
2.3 Sold count (best-effort)
“Sold count” is frequently rendered as text like:
- "2.1k sold"
- "12 sold"
If it’s in the HTML, we can extract it with a regex search.
def parse_sold_count(html: str) -> str | None:
    # Keep it conservative: capture the most obvious pattern.
    m = re.search(r"\b(\d+(?:\.\d+)?\s*(?:k|m)?\s*)sold\b", html, flags=re.I)
    if not m:
        return None
    return m.group(0).strip()
Step 3: Build a complete product scraper
Now we combine the fetch + parse layers into a function that takes a list of product URLs and returns normalized rows.
from datetime import datetime, timezone

def scrape_shopee_products(urls: list[str]) -> list[dict]:
    rows = []
    for url in urls:
        html = fetch_html(url)
        soup = BeautifulSoup(html, "lxml")

        meta = parse_meta(soup)
        ld = parse_ld_json(soup)

        title = ld.get("title") or meta.get("title")
        price = ld.get("price") or meta.get("price")
        currency = ld.get("currency") or meta.get("currency")
        canonical_url = meta.get("canonical_url") or url
        sold = parse_sold_count(html)

        rows.append({
            "input_url": url,
            "canonical_url": canonical_url,
            "title": title,
            "price": price,
            "currency": currency,
            "sold": sold,
            "scraped_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        })
    return rows
Example run
urls = [
    "https://shopee.sg/",  # replace with a real product URL
]
rows = scrape_shopee_products(urls)
print(rows[0])
Step 4: Export to CSV
import csv

def export_csv(rows: list[dict], path: str = "shopee_products.csv") -> None:
    if not rows:
        raise ValueError("No rows to export")
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)
    print("wrote", path, "rows:", len(rows))
Put it together:
if __name__ == "__main__":
    urls = [
        # Use real Shopee product URLs for your region.
        "https://shopee.sg/",
    ]
    rows = scrape_shopee_products(urls)
    export_csv(rows)
Practical tips for scraping Shopee without getting blocked
- Use product pages, not search pages. Search/category often requires JS.
- Throttle requests. Even with proxies, hitting hundreds of pages/minute is asking for captchas.
- Cache results. If you re-scrape the same URL hourly, store raw HTML or parsed JSON to avoid waste.
- Validate data. Spot-check 10 products in a browser and compare.
- Handle “empty HTML”. Very small responses are often blocks; treat them as retriable errors.
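The throttling advice can be sketched as a tiny helper. The 2–3.5 second range is a conservative starting point, not a Shopee-specific recommendation.

```python
import random
import time

def polite_sleep(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep for base plus a random jitter; returns the delay actually used.
    Randomized spacing makes traffic look less like a fixed-rate bot."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay
```

Call polite_sleep() between fetch_html calls inside your URL loop, and increase base if you start seeing bot checks.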
Where ProxiesAPI helps (honest version)
ProxiesAPI doesn’t magically make every Shopee page scrapeable.
What it does help with:
- routing through proxies without you managing a proxy pool
- keeping your request layer consistent across sites
- improving resilience when your crawler runs at scale
If you hit a wall on a specific Shopee surface (especially category/search pages), the next step is usually a browser-based approach (Playwright) or a dedicated API integration.
QA checklist
- Open your product URL in a normal browser and confirm title/price/sold exist
- Fetch via ProxiesAPI and confirm len(html) is not tiny
- Print extracted fields for 3–5 products
- Export CSV and open it (values in correct columns)
- Add retry/backoff logs to monitor failures
That's the full pipeline: fetch through ProxiesAPI, parse meta tags and JSON-LD with fallbacks, and export to CSV. As you scale to more products and categories, ProxiesAPI keeps the proxy side simple so you can focus on parsing and data quality.