Scrape Shopee Product Listings with Python (ProxiesAPI)
Shopee is one of the most popular e-commerce marketplaces in Southeast Asia, which makes it a common target for price monitoring, catalog intelligence, and availability tracking.
The catch: Shopee pages can be JS-heavy and they can be inconsistent by region. The goal of this tutorial is to build a scraper that works when Shopee returns usable HTML, and to do it in a way that’s production-shaped:
- robust HTTP fetching (timeouts + retries)
- parsing with real selectors (and fallbacks)
- clean output
- CSV export
- a screenshot of the target website (so you can visually confirm what you’re scraping)

Shopee is a high-demand e-commerce target. ProxiesAPI gives you a simple way to route requests through proxies and keep your scraper stable as you scale to more products and more categories.
What we’re scraping (and what we’re not)
Shopee has multiple surfaces:
- Product detail pages (PDP): title, price, sold count, rating, variants
- Category / search pages: many items, but often rendered client-side
In this guide we’ll focus on product pages because:
- they’re easier to validate (you know what a given product should say)
- they’re the right unit for monitoring (you usually track specific SKUs)
We’ll scrape:
- title
- price
- currency (when available)
- sold count (e.g., “2.3k sold”)
- canonical_url
A note on “listings”
The title says “product listings”, but in practice Shopee “listing” data is most reliably extracted from product pages.
If you specifically need category listings (many products), you typically have to:
- call Shopee’s internal APIs (often signed)
- or run a browser (Playwright) to render JS
This post stays on the honest side: HTML product pages via ProxiesAPI.
Requirements
- Python 3.10+
- A ProxiesAPI key
Install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml python-dotenv
Create a .env:
PROXIESAPI_KEY="YOUR_KEY"
Step 1: A reliable fetch layer using ProxiesAPI
ProxiesAPI works by requesting:
http://api.proxiesapi.com/?auth_key=KEY&url=TARGET_URL
We’ll wrap that in a fetch function with:
- connect/read timeouts
- retry with exponential backoff
- a realistic User-Agent
import os
import time
import random
import urllib.parse

import requests
from dotenv import load_dotenv

load_dotenv()

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")
if not PROXIESAPI_KEY:
    raise RuntimeError("Missing PROXIESAPI_KEY in environment")

PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/"
TIMEOUT = (15, 45)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

def proxiesapi_url(target_url: str) -> str:
    return (
        f"{PROXIESAPI_ENDPOINT}?auth_key={urllib.parse.quote(PROXIESAPI_KEY)}"
        f"&url={urllib.parse.quote(target_url, safe='')}"
    )

def fetch_html(url: str, *, retries: int = 4) -> str:
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
            r.raise_for_status()
            html = r.text
            if len(html) < 5000:
                # Shopee pages can be large; very small responses are often blocks/errors.
                raise RuntimeError(f"Response too small ({len(html)} bytes)")
            return html
        except Exception as e:
            last_err = e
            if attempt == retries:
                break
            sleep = (2 ** attempt) + random.uniform(0.0, 0.6)
            print(f"attempt {attempt} failed: {e} — sleeping {sleep:.1f}s")
            time.sleep(sleep)
    raise RuntimeError(f"Failed to fetch {url}: {last_err}")
Quick sanity check
Pick a Shopee product page from the region you care about (example domains include shopee.sg, shopee.ph, shopee.co.th).
html = fetch_html("https://shopee.sg/")
print("bytes:", len(html))
print(html[:200])
If this fails, it usually means:
- the page is fully client-rendered for that region
- your target is geo-dependent
- you’re getting a bot-check page
In that case, switch to a specific product URL you can open in a normal browser.
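You can fold that triage into a small helper and treat a positive result as a retriable error. This is a sketch: the marker strings are assumptions, so inspect a real blocked response and adjust them.

```python
# Heuristic check for blocked / bot-check responses.
# The marker strings are assumptions -- tune them against real blocked pages.
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied")

def looks_blocked(html: str, min_bytes: int = 5000) -> bool:
    """True if the response is suspiciously small or contains bot-check phrases."""
    if len(html) < min_bytes:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

Call it right after fetch_html returns; if it reports True, back off and retry rather than parsing garbage.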
Step 2: Extract data from the HTML (realistic approach)
Shopee’s HTML structure can vary, but there are two common places to look:
- Open Graph / meta tags (often stable)
- Embedded JSON (state blobs)
We’ll implement both.
2.1 Parse common meta tags
import re
import json

from bs4 import BeautifulSoup

def text_or_none(el):
    return el.get_text(strip=True) if el else None

def attr_or_none(el, attr: str):
    return el.get(attr) if el and el.has_attr(attr) else None

def parse_meta(soup: BeautifulSoup) -> dict:
    def meta(name=None, prop=None):
        if name:
            return soup.select_one(f"meta[name='{name}']")
        if prop:
            return soup.select_one(f"meta[property='{prop}']")
        return None

    title = attr_or_none(meta(prop="og:title"), "content")
    url = attr_or_none(meta(prop="og:url"), "content")
    price = attr_or_none(meta(prop="product:price:amount"), "content")
    currency = attr_or_none(meta(prop="product:price:currency"), "content")

    return {
        "title": title,
        "canonical_url": url,
        "price": price,
        "currency": currency,
    }
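To sanity-check the meta-tag approach without a network call, you can run a synthetic snippet through the stdlib HTMLParser. The values below are made up; they just mirror the tags parse_meta looks for.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect property/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        if "property" in d and "content" in d:
            self.meta[d["property"]] = d["content"]

# Synthetic head section mirroring a product page's Open Graph tags.
sample = (
    '<head>'
    '<meta property="og:title" content="Example Widget">'
    '<meta property="product:price:amount" content="12.90">'
    '<meta property="product:price:currency" content="SGD">'
    '</head>'
)
p = MetaCollector()
p.feed(sample)
print(p.meta)
```

If a real page yields an empty dict here, the region you are fetching probably serves a client-rendered shell.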
2.2 Extract embedded JSON when present
Many modern e-commerce pages embed a JSON blob (for hydration).
On Shopee, a practical technique is:
- search for <script type="application/ld+json"> (structured data)
- search for any script tags that contain product-like keys
def parse_ld_json(soup: BeautifulSoup) -> dict:
    out = {}
    for s in soup.select("script[type='application/ld+json']"):
        try:
            data = json.loads(s.get_text(strip=True))
        except Exception:
            continue
        # Sometimes it's a list, sometimes an object
        if isinstance(data, list):
            candidates = data
        else:
            candidates = [data]
        for obj in candidates:
            if not isinstance(obj, dict):
                continue
            if obj.get("@type") in ("Product", "ItemPage") or "offers" in obj:
                out["ldjson"] = obj
                # Try to read price
                offers = obj.get("offers")
                if isinstance(offers, dict):
                    out["price"] = offers.get("price") or out.get("price")
                    out["currency"] = offers.get("priceCurrency") or out.get("currency")
                out["title"] = obj.get("name") or out.get("title")
                return out
    return out
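Here is the shape of data this targets, shown on a synthetic JSON-LD blob (the values are made up, but the keys follow the schema.org Product/Offer convention):

```python
import json

# A synthetic Product blob like the ones embedded in
# <script type="application/ld+json"> on many e-commerce pages.
ldjson = '''
{
  "@type": "Product",
  "name": "Example Widget",
  "offers": {"price": "12.90", "priceCurrency": "SGD"}
}
'''

obj = json.loads(ldjson)
offers = obj.get("offers") or {}
row = {
    "title": obj.get("name"),
    "price": offers.get("price"),
    "currency": offers.get("priceCurrency"),
}
print(row)  # {'title': 'Example Widget', 'price': '12.90', 'currency': 'SGD'}
```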
2.3 Sold count (best-effort)
“Sold count” is frequently rendered as text like:
- "2.1k sold"
- "12 sold"
If it’s in the HTML, we can extract it with a regex search.
def parse_sold_count(html: str) -> str | None:
    # Keep it conservative: capture the most obvious pattern.
    m = re.search(r"\b(\d+(?:\.\d+)?\s*(?:k|m)?\s*)sold\b", html, flags=re.I)
    if not m:
        return None
    return m.group(0).strip()
Step 3: Build a complete product scraper
Now we combine the fetch + parse layers into a function that takes a list of product URLs and returns normalized rows.
from datetime import datetime, timezone

def scrape_shopee_products(urls: list[str]) -> list[dict]:
    rows = []
    for url in urls:
        html = fetch_html(url)
        soup = BeautifulSoup(html, "lxml")

        meta = parse_meta(soup)
        ld = parse_ld_json(soup)

        title = ld.get("title") or meta.get("title")
        price = ld.get("price") or meta.get("price")
        currency = ld.get("currency") or meta.get("currency")
        canonical_url = meta.get("canonical_url") or url
        sold = parse_sold_count(html)

        rows.append({
            "input_url": url,
            "canonical_url": canonical_url,
            "title": title,
            "price": price,
            "currency": currency,
            "sold": sold,
            "scraped_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        })
    return rows
Example run
urls = [
    "https://shopee.sg/",  # replace with a real product URL
]
rows = scrape_shopee_products(urls)
print(rows[0])
Step 4: Export to CSV
import csv

def export_csv(rows: list[dict], path: str = "shopee_products.csv") -> None:
    if not rows:
        raise ValueError("No rows to export")
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)
    print("wrote", path, "rows:", len(rows))
Put it together:
if __name__ == "__main__":
    urls = [
        # Use real Shopee product URLs for your region.
        "https://shopee.sg/",
    ]
    rows = scrape_shopee_products(urls)
    export_csv(rows)
Practical tips for scraping Shopee without getting blocked
- Use product pages, not search pages. Search/category often requires JS.
- Throttle requests. Even with proxies, hitting hundreds of pages/minute is asking for captchas.
- Cache results. If you re-scrape the same URL hourly, store raw HTML or parsed JSON to avoid waste.
- Validate data. Spot-check 10 products in a browser and compare.
- Handle “empty HTML”. Very small responses are often blocks; treat them as retriable errors.
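The throttling advice can be sketched as a tiny helper. The 2–3.5 second range is a conservative starting point, not a Shopee-specific recommendation.

```python
import random
import time

def polite_sleep(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep for base plus a random jitter; returns the delay actually used.
    Randomized spacing makes traffic look less like a fixed-rate bot."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay
```

Call polite_sleep() between fetch_html calls inside your URL loop, and increase base if you start seeing bot checks.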
Where ProxiesAPI helps (honest version)
ProxiesAPI doesn’t magically make every Shopee page scrapeable.
What it does help with:
- routing through proxies without you managing a proxy pool
- keeping your request layer consistent across sites
- improving resilience when your crawler runs at scale
If you hit a wall on a specific Shopee surface (especially category/search pages), the next step is usually a browser-based approach (Playwright) or a dedicated API integration.
QA checklist
- Open your product URL in a normal browser and confirm title/price/sold exist
- Fetch via ProxiesAPI and confirm len(html) is not tiny
- Print extracted fields for 3–5 products
- Export CSV and open it (values in correct columns)
- Add retry/backoff logs to monitor failures
That's the full pipeline: fetch through ProxiesAPI, parse meta tags and JSON-LD with fallbacks, and export to CSV. As you scale to more products and categories, ProxiesAPI keeps the proxy side simple so you can focus on parsing and data quality.