Scrape Funda.nl Property Listings with Python (Search + Pagination + Detail Pages)

Funda.nl is one of the most popular real-estate portals in the Netherlands. It’s a perfect “real world” scraping target because you need to handle:

  • search result pages (with filters)
  • pagination
  • de-duplication
  • detail pages (where the rich data lives)

In this guide we’ll build a production-shaped Python scraper that:

  1. fetches Funda search pages
  2. paginates until it runs out of results (with hard safety limits)
  3. extracts listing URLs + basic card fields
  4. visits each listing detail page and extracts structured fields
  5. exports JSONL/CSV

We’ll keep it honest: site structures change. The key is writing selectors that are explainable and easy to update.

Funda search results page (we’ll scrape cards + pagination)

Make your real-estate crawl reliable with ProxiesAPI

Funda pages can be sensitive to repetitive traffic and geo/rate limits. ProxiesAPI helps you keep the network layer stable as you scale search pages + detail pages to thousands of listings.


What we’re scraping (site shape)

Funda has:

  • Search results for a city/area + filters (price, size, property type)
  • Listing detail pages for each property

Typical flow:

  1. Start from a search URL (you can create this manually in your browser)
  2. On each results page, collect listing links
  3. Follow each listing and extract fields (address, price, bedrooms, area, features, agent)

Always review the site’s Terms of Service and respect robots/rate limits. For data you plan to resell or use commercially, get legal advice.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for parsing
  • pandas (optional) for CSV

Step 1: A fetch layer you can swap to ProxiesAPI

Scrapers fail most often at the network layer (timeouts, blocks, inconsistent responses), so we'll isolate fetching in a single function.

from __future__ import annotations

import os
import time
import random
from urllib.parse import urljoin

import requests

BASE = "https://www.funda.nl"
TIMEOUT = (10, 30)  # connect, read

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9,nl;q=0.8",
}


def polite_sleep(min_s: float = 0.7, max_s: float = 1.8) -> None:
    time.sleep(random.uniform(min_s, max_s))


def fetch(url: str) -> str:
    """Fetch HTML. If you use ProxiesAPI, this is the only function you change."""

    # Option A (direct):
    r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)

    # Option B (ProxiesAPI):
    # If ProxiesAPI provides a proxy endpoint or an API fetch endpoint in your account,
    # wire it here. Keep the rest of the scraper unchanged.
    # Example patterns (illustrative; use the exact method from your ProxiesAPI docs):
    #
    # PROXY = os.environ.get("PROXIESAPI_HTTP_PROXY")
    # proxies = {"http": PROXY, "https": PROXY} if PROXY else None
    # r = session.get(url, headers=DEFAULT_HEADERS, proxies=proxies, timeout=TIMEOUT)

    r.raise_for_status()
    return r.text

If you later need retries, add them here (with exponential backoff).
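
For example, a minimal retry wrapper might look like this. The fetch_with_retries helper is a hypothetical name that sits next to fetch(), and the attempt count and backoff factor are illustrative defaults, not Funda-specific:

def fetch_with_retries(url: str, attempts: int = 3, backoff: float = 1.5) -> str:
    """Retry transient network errors with exponential backoff; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except requests.RequestException as exc:
            if attempt == attempts - 1:
                raise
            # 1.0s, 1.5s, 2.25s, ... plus a little jitter so retries don't align
            sleep_s = backoff ** attempt + random.uniform(0, 0.5)
            print(f"[fetch] attempt {attempt + 1} failed ({exc}); retrying in {sleep_s:.1f}s")
            time.sleep(sleep_s)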


Step 2: Pick a stable search URL

The easiest way is:

  1. Open Funda in your browser
  2. Set the area (e.g., Amsterdam)
  3. Apply a filter (e.g., “for sale”)
  4. Copy the URL from the address bar

You’ll end up with a URL that looks like a city path plus query params.

In code, we’ll treat it as a start_url.
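
Concretely, that's just a string constant. The Amsterdam URL below is the same illustrative example used in Step 5; replace it with the filtered URL you copied:

start_url = "https://www.funda.nl/koop/amsterdam/"  # paste your copied, filtered search URL here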


Step 3: Parse search result cards (URLs + summary fields)

Funda’s DOM changes over time, so don’t rely on one fragile CSS class.

A practical strategy:

  • find result containers that contain links to /koop/ (buy) or /huur/ (rent)
  • from those links, build absolute URLs
  • also parse nearby text for price/title when present
from bs4 import BeautifulSoup


def absolute(href: str) -> str:
    return href if href.startswith("http") else urljoin(BASE, href)


def parse_search_page(html: str) -> tuple[list[dict], str | None]:
    """Return (listings, next_page_url)."""

    soup = BeautifulSoup(html, "lxml")

    listings: list[dict] = []

    # Heuristic: listing links usually contain /koop/ or /huur/ and end with a slash.
    anchors = soup.select("a[href]")
    for a in anchors:
        href = a.get("href") or ""
        if not href.startswith("/"):
            continue
        if "/koop/" not in href and "/huur/" not in href:
            continue
        if "#" in href:
            continue

        url = absolute(href)

        # Basic de-dup within page; we’ll also de-dup globally.
        listings.append({
            "url": url,
            "card_text": a.get_text(" ", strip=True) or None,
        })

    # De-dup URLs while preserving order
    seen = set()
    uniq = []
    for it in listings:
        if it["url"] in seen:
            continue
        seen.add(it["url"])
        uniq.append(it)

    # Try to find a "next" link. Many sites use rel="next".
    next_link = soup.select_one('a[rel="next"][href]')
    next_url = absolute(next_link.get("href")) if next_link else None

    return uniq, next_url

This “wide net” approach often catches extra navigation links. That’s OK because we’ll validate listing pages in the detail parser.
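
A quick smoke test helps here: fetch one results page and print the first few candidate URLs to confirm the heuristic is catching real listing links. Run it once and delete it; the URL is illustrative:

html = fetch("https://www.funda.nl/koop/amsterdam/")  # or your own filtered search URL
cards, next_url = parse_search_page(html)
print(f"{len(cards)} candidates, next page: {next_url}")
for card in cards[:5]:
    print(" ", card["url"])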


Step 4: Parse a listing detail page

On detail pages you want stable fields. Commonly available:

  • full address
  • price
  • living area (m²)
  • plot area (m²)
  • number of rooms
  • energy label
  • agent/broker

We’ll implement parsing defensively:

  • Try a few selectors
  • If missing, keep None
from bs4 import BeautifulSoup


def text_or_none(el) -> str | None:
    if not el:
        return None
    t = el.get_text(" ", strip=True)
    return t if t else None


def parse_listing_detail(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Examples of robust-ish strategies:
    # - use <h1> for title/address
    # - find price by searching for currency-like patterns

    h1 = soup.select_one("h1")
    title = text_or_none(h1)

    # Price: look for elements that contain €
    price = None
    for el in soup.select("*[class], *[data-test], p, span, div"):
        t = el.get_text(" ", strip=True)
        if "€" in t and len(t) < 60:
            price = t
            break

    # Key-value features: many property sites show a definition list or a table.
    features = {}

    # Definition lists
    for dl in soup.select("dl"):
        dts = dl.select("dt")
        dds = dl.select("dd")
        if len(dts) != len(dds) or len(dts) == 0:
            continue
        for dt, dd in zip(dts, dds):
            k = dt.get_text(" ", strip=True)
            v = dd.get_text(" ", strip=True)
            if k and v:
                features[k] = v

    # Fallback: table rows
    for tr in soup.select("table tr"):
        tds = tr.select("td")
        if len(tds) >= 2:
            k = tds[0].get_text(" ", strip=True)
            v = tds[1].get_text(" ", strip=True)
            if k and v and k not in features:
                features[k] = v

    return {
        "url": url,
        "title": title,
        "price": price,
        "features": features,
    }

This parser is intentionally “generic” because Funda’s markup can change and may differ across listing types.

If you want very stable extraction, a better long-term strategy is to:

  • locate the JSON embedded in the page (often in a <script type="application/ld+json"> or a Next.js state script)
  • parse it and use its fields
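
Here's a minimal sketch of that approach for the JSON-LD case. The structure inside the JSON is not guaranteed; inspect a real listing page before relying on specific keys:

import json
from bs4 import BeautifulSoup


def parse_json_ld(html: str) -> list[dict]:
    """Return every JSON-LD object embedded in the page, fields untouched."""
    soup = BeautifulSoup(html, "lxml")
    blocks: list[dict] = []
    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # A single script tag sometimes holds a list of objects.
        blocks.extend(data if isinstance(data, list) else [data])
    return blocks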

Step 5: Crawl search → pagination → details

Now we wire it all together with safety controls:

  • max_pages to avoid infinite loops
  • max_listings to cap run size
  • de-dup URLs
import json


def crawl(start_url: str, max_pages: int = 10, max_listings: int = 200) -> list[dict]:
    out: list[dict] = []
    seen_urls: set[str] = set()

    url = start_url
    pages = 0

    while url and pages < max_pages and len(out) < max_listings:
        pages += 1
        print(f"[search] page {pages}: {url}")
        html = fetch(url)
        listings, next_url = parse_search_page(html)

        print(f"  found {len(listings)} candidate links")

        for it in listings:
            if len(out) >= max_listings:
                break

            u = it["url"]
            if u in seen_urls:
                continue
            seen_urls.add(u)

            polite_sleep()

            try:
                detail_html = fetch(u)
            except Exception as e:
                print("  [detail] failed", u, e)
                continue

            data = parse_listing_detail(detail_html, u)

            # Minimal validation: if we can’t find a title, it might not be a listing.
            if not data.get("title"):
                continue

            out.append(data)
            print(f"  [detail] ok: {data.get('title')}")

        url = next_url
        polite_sleep(1.0, 2.5)

    return out


if __name__ == "__main__":
    START_URL = "https://www.funda.nl/koop/amsterdam/"  # replace with your filtered URL

    rows = crawl(START_URL, max_pages=5, max_listings=50)
    print("listings:", len(rows))

    with open("funda_listings.jsonl", "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

    print("wrote funda_listings.jsonl")

Making selectors actually stable

A quick checklist that saves hours:

  • Prefer attributes like data-* when present (sites use them for testing)
  • Prefer semantic tags: h1, dl dt/dd, script[type="application/ld+json"]
  • Avoid long “CSS class chains” that look auto-generated
  • Validate by spot-checking 3–5 pages
  • Log missing fields so you know what broke
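
For the last point, a small report over the rows you collected is enough. This sketch assumes the field names match the dict returned by parse_listing_detail:

from collections import Counter

EXPECTED_FIELDS = ("title", "price", "features")


def report_missing(rows: list[dict]) -> None:
    """Print how often each expected field came back empty."""
    missing: Counter[str] = Counter()
    for row in rows:
        for key in EXPECTED_FIELDS:
            if not row.get(key):
                missing[key] += 1
    for key, count in missing.items():
        print(f"[qa] {key} missing in {count}/{len(rows)} rows")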

Where ProxiesAPI fits

Real-estate crawls tend to:

  • hit the same domain repeatedly
  • trigger rate limits
  • face geo restrictions

ProxiesAPI is useful when you:

  • increase crawl speed and parallelism
  • crawl multiple cities and filters
  • run scheduled refreshes (daily/weekly)

Keep the integration limited to the fetch() layer so you can swap proxy settings, rotate identities, or add retries without rewriting your parsers.


QA checklist

  • Search parsing finds listing URLs (print first 5)
  • Pagination stops correctly (no infinite loop)
  • Detail parsing returns non-empty title for most rows
  • Output file is valid JSONL
  • Run size capped with max_pages / max_listings

Next upgrades

  • Extract structured JSON embedded in listing pages (often more stable than HTML)
  • Store runs in SQLite with last_seen timestamps
  • Add incremental crawling (only new listings)
  • Add concurrency with httpx + asyncio once your network layer is stable
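
For the concurrency upgrade, a minimal httpx + asyncio sketch looks like this (assumes pip install httpx; it reuses the same DEFAULT_HEADERS and leaves the parsers untouched):

import asyncio

import httpx


async def fetch_many(urls: list[str], concurrency: int = 5) -> dict[str, str]:
    """Fetch several detail pages concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(concurrency)
    results: dict[str, str] = {}

    async with httpx.AsyncClient(headers=DEFAULT_HEADERS, timeout=30) as client:

        async def fetch_one(u: str) -> None:
            async with sem:
                r = await client.get(u)
                if r.status_code == 200:
                    results[u] = r.text

        await asyncio.gather(*(fetch_one(u) for u in urls))

    return results

# Usage: html_by_url = asyncio.run(fetch_many([it["url"] for it in listings]))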
