Scrape Marktplaats Listings with Python (Search + Pagination + CSV Export)
Marktplaats (the Netherlands’ biggest classifieds marketplace) is a good real-world scraping target because:
- results are list-based and repeatable
- pagination is explicit
- listing cards contain the exact fields you want (title, price, location)
In this tutorial we’ll build a practical scraper in Python that:
- fetches a Marktplaats search results page
- parses listing cards (title, price, location, url)
- follows pagination for N pages
- exports to CSV
- optionally routes requests through ProxiesAPI for stability

As you move from one search page to dozens (and from one keyword to many), the network layer becomes your bottleneck. ProxiesAPI gives you a simple fetch URL wrapper so your Python extraction code stays focused on parsing — not blocks and flaky responses.
What we’re scraping (page structure)
A Marktplaats search is typically reached from the website UI, but the end result is a URL with a query.
Example (illustrative):
https://www.marktplaats.nl/q/iphone/
On the search results page you’ll usually see:
- a repeating “card” per listing (title, price, location)
- a link to the listing detail page
- pagination controls (next page)
Because Marktplaats can change its HTML and can vary by category, the right approach is to inspect the page and write selectors that match actual attributes (and to keep a couple of fallbacks).
Setup
Create a virtual environment and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for reliable parsing
- csv (stdlib) for export
Step 1: Fetch HTML with timeouts (and a real User-Agent)
A surprising number of scrapers “work” until they hang. requests applies no timeout by default, so a stalled connection can block a crawl indefinitely. Always set timeouts.
import requests

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
    "Accept-Language": "en-US,en;q=0.9",
})

def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text

url = "https://www.marktplaats.nl/q/iphone/"
html = fetch_html(url)
print("bytes:", len(html))
print(html[:200])
Terminal sanity check
curl -s "https://www.marktplaats.nl/q/iphone/" | head -n 5
Step 2: Parse listing cards (title, price, location, url)
Marktplaats HTML is not guaranteed to stay stable, so we’ll parse with a three-step heuristic:
- find candidate card containers
- within each card, find the main link + visible title
- extract price and location if present
We’ll also normalize whitespace and build absolute URLs.
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.marktplaats.nl"

# Matches "€250", "€ 250" and "€ 1.250,00" (a heuristic, not a full price parser).
PRICE_RE = re.compile(r"€\s*\d[\d.,]*")

def clean(text: str | None) -> str | None:
    """Collapse runs of whitespace; return None for empty input."""
    if not text:
        return None
    t = " ".join(text.split())
    return t or None

def parse_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    listings: list[dict] = []
    # Heuristic: search result pages tend to have many links to ".../a/..." style paths.
    # We'll look for anchors that look like listing links and climb to a card container.
    anchors = soup.select("a[href]")
    seen_urls = set()
    for a in anchors:
        href = a.get("href") or ""
        # Marktplaats listing URLs often contain "/a/". This is a heuristic.
        if "/a/" not in href:
            continue
        url = urljoin(BASE, href)
        if url in seen_urls:
            continue
        title = clean(a.get_text(" ", strip=True))
        if not title or len(title) < 6:
            # too short; often nav/ads
            continue
        # Try to locate a reasonable container to search for price/location nearby.
        card = a
        for _ in range(5):
            if not card or not getattr(card, "name", None):
                break
            if card.name in {"article", "li", "div"}:
                # stop at a common container
                break
            card = card.parent
        container = card if card else a
        text_blob = clean(container.get_text(" ", strip=True)) or ""
        # Price heuristic: a € sign followed by digits, with or without a space between them.
        m = PRICE_RE.search(text_blob)
        price = m.group(0).rstrip(".,") if m else None
        # Location heuristic: Marktplaats cards often show a city + date ("City · Today").
        # This keeps everything before the first " · ", so expect noise until you
        # tighten it against the real markup.
        location = None
        parts = text_blob.split(" · ")
        if len(parts) >= 2:
            location = clean(parts[0])
        listings.append({
            "title": title,
            "price": price,
            "location": location,
            "url": url,
        })
        seen_urls.add(url)
    return listings

listings = parse_listings(html)
print("parsed:", len(listings))
print(listings[0] if listings else None)
Why this approach?
For a tutorial that survives minor site changes, it’s better to:
- anchor on “listing links” (the most stable concept)
- extract nearby text
- keep heuristics minimal and transparent
If you need perfect extraction (e.g., separating “price” from “bidding” labels), the next step is to inspect the DOM and tighten selectors for the specific page layout you’re targeting.
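As one illustration of tightening the price field, the sketch below separates a numeric amount from a non-numeric label. It assumes Dutch-style formatting (dot for thousands, comma for decimals) and that non-numeric cards carry a text label such as "Bieden"; both are assumptions to verify against the pages you target. parse_price is a hypothetical helper, not part of the scraper above.

```python
import re
from decimal import Decimal

# Assumed format: "€ 1.250,00" (dot = thousands separator, comma = decimals).
_AMOUNT = re.compile(r"€\s*(\d+(?:\.\d{3})*)(?:,(\d{1,2}))?")

def parse_price(text):
    """Return (amount, label).

    amount is a Decimal in euros when a number is found, else None.
    label is the cleaned raw text when no number is found, else None.
    """
    if not text:
        return None, None
    cleaned = " ".join(text.split())
    m = _AMOUNT.search(cleaned)
    if not m:
        # No amount present: keep the label (e.g. a bidding marker) as-is.
        return None, cleaned
    whole = m.group(1).replace(".", "")        # drop thousands separators
    cents = (m.group(2) or "0").ljust(2, "0")  # "5" -> "50", missing -> "00"
    return Decimal(f"{whole}.{cents}"), None
```

Keeping the label next to the amount means the CSV can still distinguish "no price found" from "seller asked for bids".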
Step 3: Pagination (crawl N result pages)
The cleanest pagination strategy is:
- start from a search URL
- parse a “next page” link, follow it
- stop after max_pages or when no next link exists
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

def find_next_page_url(current_url: str, html: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    # Attempt 1: <link rel="next"> in the document head (best case)
    ln = soup.select_one("link[rel='next'][href]")
    if ln:
        return urljoin(BASE, ln.get("href"))
    # Attempt 2: an anchor explicitly marked rel="next"
    a = soup.select_one("a[rel='next'][href]")
    if a:
        return urljoin(BASE, a.get("href"))
    # Fallback: if no explicit next link exists, try incrementing a common query param.
    # Some sites use ?p=2 or ?page=2. We'll only do this if a param exists already.
    parsed = urlparse(current_url)
    qs = parse_qs(parsed.query)
    for key in ("p", "page"):
        if key in qs:
            try:
                n = int(qs[key][0])
            except ValueError:
                # the param is present but not numeric; try the next key
                continue
            qs[key] = [str(n + 1)]
            new_query = urlencode(qs, doseq=True)
            return urlunparse(parsed._replace(query=new_query))
    return None

def crawl_search(start_url: str, max_pages: int = 3) -> list[dict]:
    all_rows: list[dict] = []
    seen = set()
    url = start_url
    for page in range(1, max_pages + 1):
        html = fetch_html(url)
        batch = parse_listings(html)
        for row in batch:
            u = row.get("url")
            if not u or u in seen:
                continue
            seen.add(u)
            all_rows.append(row)
        print(f"page={page} url={url} batch={len(batch)} total={len(all_rows)}")
        nxt = find_next_page_url(url, html)
        # Stop when there is no next page or the "next" link points back at this page.
        if not nxt or nxt == url:
            break
        url = nxt
    return all_rows

rows = crawl_search("https://www.marktplaats.nl/q/iphone/", max_pages=5)
print("total unique:", len(rows))
Step 4: Export to CSV
import csv

def export_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["title", "price", "location", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fieldnames})

export_csv(rows, "marktplaats_listings.csv")
print("wrote marktplaats_listings.csv", len(rows))
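One caveat if the file is meant for Excel: a plain UTF-8 CSV opened by double-click may show the € sign as garbled characters, because Excel relies on a byte-order mark to detect UTF-8. A possible variant, sketched here under the hypothetical name export_csv_excel, writes with Python's utf-8-sig codec, which prepends that mark.

```python
import csv

def export_csv_excel(rows, path):
    """Write rows as CSV with a UTF-8 byte-order mark so Excel detects the encoding."""
    fieldnames = ["title", "price", "location", "url"]
    # "utf-8-sig" writes the BOM (EF BB BF) before the first row.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        # extrasaction="ignore" drops keys that are not in fieldnames.
        w = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        w.writeheader()
        w.writerows(rows)
```

Other tools that read the file back should open it with the same utf-8-sig encoding, otherwise the mark ends up inside the first column name.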
Step 5: Route fetches through ProxiesAPI (optional)
When you scale scraping (more pages, more keywords, more frequent runs), failures come from the network layer:
- intermittent timeouts
- inconsistent responses
- blocked requests
With ProxiesAPI you can fetch through a simple URL wrapper:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://www.marktplaats.nl/q/iphone/" | head
In Python, wrap any target URL and reuse the same parsing code:
from urllib.parse import quote

def proxiesapi_wrap(target_url: str, api_key: str) -> str:
    # ProxiesAPI uses a simple querystring wrapper.
    # Keep the target URL URL-encoded.
    return f"http://api.proxiesapi.com/?key={api_key}&url={quote(target_url, safe='')}"

API_KEY = "API_KEY"
start = "https://www.marktplaats.nl/q/iphone/"
wrapped = proxiesapi_wrap(start, API_KEY)
html = fetch_html(wrapped)
print("bytes via proxies:", len(html))
Notice the win: your parser doesn’t change. Only the fetch URL changes.
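Intermittent timeouts are often transient, so a common complement to any fetch path is a retry with exponential backoff. The sketch below is one way to do that; fetch_with_retries is a hypothetical helper (not part of requests or ProxiesAPI) that wraps any fetch callable, such as fetch_html, and the attempt count and delays are illustrative defaults.

```python
import time
from typing import Callable

def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); on failure wait base_delay * 2**n seconds and try again."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # in real code, narrow this to requests.RequestException
            last_error = exc
            if attempt < attempts - 1:
                # Back off before the next try: 1s, 2s, 4s, ... with the defaults.
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

Passing the fetch function in as an argument keeps the helper independent of whether the URL is wrapped or direct, and the injectable sleep makes it easy to test without waiting.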
Common pitfalls (and how to avoid them)
- No timeouts → crawls hang forever.
- Too-specific selectors → break as soon as the site A/B tests layout.
- No dedupe → pagination repeats items, exports get messy.
- Not saving raw HTML samples → debugging becomes guesswork.
A simple production habit: when a parse returns zero rows, save the HTML to a debug/ folder and inspect it.
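That habit can be sketched in a few lines. save_debug_html is a hypothetical helper, and the hashed filename scheme is just one assumption about how to keep names short and filesystem-safe.

```python
import hashlib
from pathlib import Path

def save_debug_html(url: str, html: str, folder: str = "debug") -> Path:
    """Write html to <folder>/<hash-of-url>.html and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the filename is short and contains no unsafe characters.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = out_dir / name
    path.write_text(html, encoding="utf-8")
    return path
```

Inside crawl_search, calling save_debug_html(url, html) whenever batch is empty would capture exactly the page that failed to parse.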
QA checklist
- First page returns a non-zero count of listings
- URLs are absolute and unique
- Pagination increases total rows
- CSV opens cleanly in Excel/Sheets
- ProxiesAPI wrapper fetch returns HTML (even if you don’t always need it)
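The first two checklist items are easy to automate. A minimal sketch, assuming rows shaped like the dicts produced by parse_listings; check_rows is a hypothetical helper name.

```python
from urllib.parse import urlparse

def check_rows(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if not rows:
        problems.append("no listings parsed")
    urls = [r.get("url") or "" for r in rows]
    for u in urls:
        parts = urlparse(u)
        # An absolute URL needs both a scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"not an absolute URL: {u!r}")
    if len(set(urls)) != len(urls):
        problems.append("duplicate URLs found")
    return problems
```

Running check_rows(rows) after a crawl and printing the result gives a quick signal before the CSV is written.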