Scrape Vinted Listings with Python: Search → Listings → Images (with ProxiesAPI)
Vinted is a goldmine of secondhand fashion data: pricing, condition, brand, size, seller metadata, and—crucially—high-quality item photos.
In this guide, we’ll build a real Python scraper that follows the exact flow you’d use in production:
- Search Vinted for items (e.g., “nike dunk”, “patagonia fleece”)
- Paginate through results safely
- Open listing pages to extract richer fields
- Collect image URLs (and optionally download them)
We’ll also show where ProxiesAPI fits in: not as “magic”, but as a network layer that helps keep crawls stable as volume grows.

Marketplaces like Vinted can rate-limit or challenge repeated requests. ProxiesAPI gives you a stable proxy layer and consistent request behavior when you scale from a few pages to thousands of listings.
What we’re scraping (Vinted page structure)
Vinted is a modern web app. In many locales, the search results page is server-rendered enough to scrape the listing cards and links, but details (and some attributes) can vary by region and A/B tests.
The safe approach is:
- use the search page HTML to find listing URLs
- for each URL, fetch the listing detail page and parse consistent fields
Target URLs
Typical entry points:
- Home: https://www.vinted.com/
- Search: https://www.vinted.com/catalog?search_text=...
- Listing: https://www.vinted.com/items/<id>-<slug>
Vinted’s exact query parameters may differ by region, but the scraper below is resilient because it:
- extracts listing links rather than relying on guessed API endpoints
- parses JSON embedded in the HTML when available
- falls back to HTML selectors for core fields
Setup
Create a virtualenv and install dependencies:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml python-dotenv
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (with `lxml`) for HTML parsing
- `.env` for configuration
Create a .env file:
```
PROXIESAPI_KEY="YOUR_PROXIESAPI_KEY"
```
Step 1: Build a fetcher (timeouts, retries, headers)
Scrapers fail in boring ways: timeouts, 429s, 5xx, and occasional HTML that changes. Start with a fetcher you can trust.
```python
import os
import time
from dataclasses import dataclass
from typing import Optional

import requests
from dotenv import load_dotenv

load_dotenv()

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "").strip()
TIMEOUT = (10, 30)  # connect, read

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
}


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str
    final_url: str


class HttpClient:
    def __init__(self):
        self.s = requests.Session()
        self.s.headers.update(DEFAULT_HEADERS)

    def _via_proxiesapi(self, url: str) -> str:
        """Wrap a target URL through ProxiesAPI.

        NOTE: Keep this conservative and transparent. We just build a proxy URL.
        If ProxiesAPI is not configured, we fetch directly.
        """
        if not PROXIESAPI_KEY:
            return url
        # Common pattern: pass the destination as a query param.
        # If your ProxiesAPI account uses a different format, adjust here.
        return f"https://api.proxiesapi.com/?auth_key={PROXIESAPI_KEY}&url={requests.utils.quote(url, safe='')}"

    def get_html(self, url: str, *, use_proxy: bool = True, max_retries: int = 3) -> FetchResult:
        last_exc: Optional[Exception] = None
        for attempt in range(1, max_retries + 1):
            try:
                fetch_url = self._via_proxiesapi(url) if use_proxy else url
                r = self.s.get(fetch_url, timeout=TIMEOUT, allow_redirects=True)
                # If ProxiesAPI is used, r.url will be the proxy URL; keep both.
                if r.status_code in (429, 500, 502, 503, 504):
                    backoff = min(2 ** attempt, 10)
                    time.sleep(backoff)
                    continue
                r.raise_for_status()
                return FetchResult(url=url, status_code=r.status_code, text=r.text, final_url=r.url)
            except Exception as e:
                last_exc = e
                time.sleep(min(2 ** attempt, 10))
        raise RuntimeError(f"Failed to fetch {url} after {max_retries} retries: {last_exc}")
```
Why this structure works:
- Timeouts prevent hangs
- Retries with backoff smooth temporary bans / spikes
- ProxiesAPI wrapper is contained to one function
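The backoff schedule is easy to reason about in isolation. A minimal sketch of the same `min(2 ** attempt, 10)` rule used in `get_html`, extracted into a helper (`backoff_seconds` is a name introduced here for illustration):

```python
# Exponential backoff, capped at 10 seconds — the rule get_html applies
# before retrying after a 429/5xx response.
def backoff_seconds(attempt: int, cap: int = 10) -> int:
    return min(2 ** attempt, cap)

# Attempts 1..4 wait 2s, 4s, 8s, then hit the cap.
schedule = [backoff_seconds(a) for a in range(1, 5)]
print(schedule)  # [2, 4, 8, 10]
```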
Step 2: Scrape Vinted search results (listing cards → URLs)
The first job is to convert a keyword into listing URLs.
Build a search URL
```python
from urllib.parse import urlencode

BASE = "https://www.vinted.com"

def build_search_url(query: str, page: int = 1) -> str:
    params = {
        "search_text": query,
        "page": page,
    }
    return f"{BASE}/catalog?{urlencode(params)}"
```
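It's worth checking what this actually produces before firing requests. A condensed, self-contained restatement of `build_search_url` shows that `urlencode` handles the space in a multi-word query for you:

```python
from urllib.parse import urlencode

BASE = "https://www.vinted.com"

def build_search_url(query: str, page: int = 1) -> str:
    return f"{BASE}/catalog?{urlencode({'search_text': query, 'page': page})}"

# Spaces are encoded as '+' by urlencode.
url = build_search_url("nike dunk", page=2)
print(url)  # https://www.vinted.com/catalog?search_text=nike+dunk&page=2
```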
Extract listing URLs from HTML
Vinted’s markup can change, so we use a hybrid strategy:
- Collect all links that look like listing URLs (`/items/…`)
- De-duplicate
- Filter out non-item links
```python
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

ITEM_PATH_RE = re.compile(r"^/items/\d+")

def extract_listing_urls_from_search(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    urls: list[str] = []
    seen = set()
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if ITEM_PATH_RE.match(href):
            full = urljoin(BASE, href)
            if full not in seen:
                seen.add(full)
                urls.append(full)
    return urls
```
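You can sanity-check the regex-and-dedupe logic without fetching anything or even importing BeautifulSoup — just run `ITEM_PATH_RE` against the kinds of hrefs a results page would contain (the hrefs below are made-up examples):

```python
import re
from urllib.parse import urljoin

BASE = "https://www.vinted.com"
ITEM_PATH_RE = re.compile(r"^/items/\d+")

hrefs = [
    "/items/123456-patagonia-fleece",  # listing link: matches
    "/items/123456-patagonia-fleece",  # duplicate: dropped
    "/catalog?search_text=fleece",     # search link: skipped
    "/member/987",                     # profile link: skipped
]

urls, seen = [], set()
for href in hrefs:
    if ITEM_PATH_RE.match(href):
        full = urljoin(BASE, href)
        if full not in seen:
            seen.add(full)
            urls.append(full)

print(urls)  # ['https://www.vinted.com/items/123456-patagonia-fleece']
```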
Putting it together (paginate)
```python
def crawl_search(query: str, pages: int = 3, *, use_proxy: bool = True) -> list[str]:
    client = HttpClient()
    all_urls: list[str] = []
    seen = set()
    for page in range(1, pages + 1):
        url = build_search_url(query, page=page)
        res = client.get_html(url, use_proxy=use_proxy)
        batch = extract_listing_urls_from_search(res.text)
        print(f"page {page}: found {len(batch)} listing urls")
        # Some pages may contain repeated links; dedupe globally.
        for u in batch:
            if u in seen:
                continue
            seen.add(u)
            all_urls.append(u)
        # Be polite; tune for your needs.
        time.sleep(1.0)
    return all_urls

if __name__ == "__main__":
    urls = crawl_search("patagonia fleece", pages=2)
    print("unique listing urls:", len(urls))
    print(urls[:5])
```
Step 3: Scrape a listing detail page (title, price, brand, images)
Now the interesting part: extract structured data from a listing page.
On many modern sites, listing pages include embedded JSON (often in a script tag). When it exists, parsing that JSON is more stable than scraping spans.
We’ll try two approaches:
- Approach A: parse embedded JSON if present
- Approach B: fallback to HTML selectors for title/price
Extract images + basic fields
```python
import json

def _find_embedded_json(soup: BeautifulSoup) -> dict | None:
    # Vinted (and many Next.js apps) may embed state in script tags.
    # This function is defensive: it searches for JSON blobs and returns
    # the first parseable dict.
    scripts = soup.select("script")
    for sc in scripts:
        txt = sc.string
        if not txt:
            continue
        t = txt.strip()
        if not t:
            continue
        if t.startswith("{") and t.endswith("}") and len(t) > 200:
            try:
                obj = json.loads(t)
                if isinstance(obj, dict):
                    return obj
            except Exception:
                pass
    return None
```
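To see the idea without BeautifulSoup, here is a stdlib-only toy: pull script-tag contents with a regex and try `json.loads` on anything that looks like an object. (The toy HTML is fabricated, and the `len > 200` guard from the real function is dropped because the sample blob is tiny.)

```python
import json
import re

html = '<html><script>{"item": {"title": "Fleece", "price": "25.0"}}</script></html>'

embedded = None
# Grab each <script>…</script> body and keep the first parseable JSON dict.
for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, re.S):
    t = m.group(1).strip()
    if t.startswith("{") and t.endswith("}"):
        try:
            obj = json.loads(t)
        except ValueError:
            continue
        if isinstance(obj, dict):
            embedded = obj
            break

print(embedded["item"]["title"])  # Fleece
```

On real pages the blob is usually large and deeply nested, which is exactly why the parser below walks it generically instead of assuming a schema.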
```python
def parse_listing(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    data = {
        "url": url,
        "title": None,
        "price": None,
        "currency": None,
        "brand": None,
        "size": None,
        "condition": None,
        "images": [],
    }

    # Try JSON first
    js = _find_embedded_json(soup)
    if js:
        # We can't assume an exact schema (it varies by deployment/locale),
        # so we search for image URLs anywhere in the JSON.
        imgs = []

        def walk(x):
            if isinstance(x, dict):
                for k, v in x.items():
                    if isinstance(v, (dict, list)):
                        walk(v)
                    elif isinstance(v, str) and ("vinted" in v) and (".jpg" in v or ".png" in v):
                        imgs.append(v)
            elif isinstance(x, list):
                for i in x:
                    walk(i)

        walk(js)
        # De-dupe while preserving order
        seen = set()
        for u in imgs:
            if u in seen:
                continue
            seen.add(u)
            data["images"].append(u)

    # Fallback HTML selectors for title/price if JSON wasn't helpful
    if not data["title"]:
        h1 = soup.select_one("h1")
        if h1:
            data["title"] = h1.get_text(" ", strip=True)

    # Price: try meta first
    price_meta = soup.select_one('meta[property="product:price:amount"], meta[itemprop="price"]')
    if price_meta and price_meta.get("content"):
        data["price"] = price_meta.get("content")
    currency_meta = soup.select_one('meta[property="product:price:currency"], meta[itemprop="priceCurrency"]')
    if currency_meta and currency_meta.get("content"):
        data["currency"] = currency_meta.get("content")

    # If we didn't find image URLs from JSON, also try og:image
    if not data["images"]:
        og = soup.select_one('meta[property="og:image"]')
        if og and og.get("content"):
            data["images"] = [og.get("content")]

    return data
```
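The recursive `walk` is the part most worth testing on its own. Run it against a hand-built dict shaped roughly like embedded state (the hostnames here are invented for the example) and confirm it collects Vinted-looking image URLs at any depth while skipping everything else:

```python
imgs = []

def walk(x):
    # Recurse through dicts/lists, collecting strings that look like
    # Vinted-hosted .jpg/.png URLs — same filter as parse_listing.
    if isinstance(x, dict):
        for v in x.values():
            if isinstance(v, (dict, list)):
                walk(v)
            elif isinstance(v, str) and "vinted" in v and (".jpg" in v or ".png" in v):
                imgs.append(v)
    elif isinstance(x, list):
        for i in x:
            walk(i)

sample = {
    "item": {
        "photos": [
            {"url": "https://images.vinted.net/photos/1.jpg"},
            {"url": "https://images.vinted.net/photos/2.jpg"},
        ],
        "seller": {"avatar": "https://example.com/a.jpg"},  # not a vinted URL: skipped
    }
}
walk(sample)
print(imgs)
# ['https://images.vinted.net/photos/1.jpg', 'https://images.vinted.net/photos/2.jpg']
```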
Fetch + parse listing details at scale
```python
def crawl_listing_details(urls: list[str], *, use_proxy: bool = True, limit: int = 30) -> list[dict]:
    client = HttpClient()
    out: list[dict] = []
    for i, url in enumerate(urls[:limit], start=1):
        res = client.get_html(url, use_proxy=use_proxy)
        item = parse_listing(res.text, url)
        out.append(item)
        print(f"{i}/{min(limit, len(urls))} title={item.get('title')!r} images={len(item.get('images') or [])}")
        time.sleep(1.0)
    return out
```
Step 4 (optional): Download listing images
Once you have image URLs, downloading is straightforward. The main thing is respecting bandwidth and timeouts.
```python
from pathlib import Path

def download_images(items: list[dict], out_dir: str = "vinted_images") -> None:
    client = HttpClient()
    base = Path(out_dir)
    base.mkdir(parents=True, exist_ok=True)
    for item in items:
        url = item.get("url")
        images = item.get("images") or []
        if not images:
            continue
        # Make a stable folder name
        safe = (url.split("/items/")[-1] if "/items/" in url else "item").split("?")[0]
        folder = base / safe
        folder.mkdir(parents=True, exist_ok=True)
        for idx, img_url in enumerate(images[:10], start=1):
            try:
                r = client.s.get(img_url, timeout=TIMEOUT)
                r.raise_for_status()
                ext = ".jpg" if ".jpg" in img_url else ".png" if ".png" in img_url else ".bin"
                path = folder / f"{idx:02d}{ext}"
                path.write_bytes(r.content)
            except Exception as e:
                print("failed image", img_url, e)
            time.sleep(0.5)
```
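The folder-naming expression is dense, so here it is extracted into a helper (`folder_name` is a name introduced for illustration) with the two cases it handles: a normal listing URL with query parameters, and an unexpected URL that falls back to a generic name:

```python
def folder_name(url: str) -> str:
    # Derive a stable folder name from the listing URL,
    # the same way download_images builds `safe`.
    return (url.split("/items/")[-1] if "/items/" in url else "item").split("?")[0]

print(folder_name("https://www.vinted.com/items/123456-patagonia-fleece?ref=catalog"))
# 123456-patagonia-fleece
print(folder_name("https://example.com/no-items-path"))
# item
```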
Practical notes (what breaks, and how to fix it)
1) Pagination isn’t always “page=2”
Some locales or experiments may use different params. If you notice you’re getting the same results on every page:
- print the search URL you’re hitting
- print the first 3 listing URLs on each page
- check whether the HTML contains a “next page” link, then follow it
A robust improvement is to parse a “next” URL from the HTML (when present) instead of constructing it.
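One stdlib-only way to probe for that is to look for a `rel="next"` link element. Whether Vinted emits one is an assumption that varies by locale and experiment, so treat this as a fallback check, not a guarantee:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Record the href of the first <a>/<link> tag with rel="next"."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("a", "link") and a.get("rel") == "next" and a.get("href"):
            self.next_href = self.next_href or a["href"]

# Toy page head with a pagination hint.
html = '<head><link rel="next" href="/catalog?search_text=fleece&page=3"></head>'
p = NextLinkFinder()
p.feed(html)
next_url = urljoin("https://www.vinted.com", p.next_href)
print(next_url)  # https://www.vinted.com/catalog?search_text=fleece&page=3
```

If no such link exists, fall back to incrementing `page=` and stop when two consecutive pages return identical listing URLs.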
2) Bot challenges / rate limiting
If you start seeing:
- HTTP 429
- 403/401
- HTML that looks like a challenge page
Then you need to reduce concurrency, add delays, and use a stable proxy layer. That’s where ProxiesAPI helps.
3) Always scrape “cards → details”
Card data is often incomplete. Details pages are richer and closer to the source-of-truth.
End-to-end example (search → details → JSON export)
```python
import json

def main():
    query = "patagonia fleece"
    urls = crawl_search(query, pages=2, use_proxy=True)
    items = crawl_listing_details(urls, use_proxy=True, limit=25)
    with open("vinted_listings.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    print("wrote vinted_listings.json", len(items))

if __name__ == "__main__":
    main()
```
Where ProxiesAPI fits (honestly)
You can often scrape a few pages of Vinted directly.
But if you’re building a dataset (hundreds/thousands of listing pages), the failure modes stack up:
- inconsistent rate limits
- IP reputation decay during long crawls
- intermittent 5xx/429 bursts
ProxiesAPI helps by giving you a consistent proxy layer you can route requests through—without rewriting your scraper.
QA checklist
- Search crawl returns unique `/items/…` URLs
- Listing parser extracts at least `title`, `price` (when available), and `images`
- JSON export loads cleanly and matches your expectations
- You have delays/timeouts (no infinite hangs)
- You can re-run without duplicating data (add a `seen` set / persistent store when scaling)