Scrape Idealista Property Listings with Python

Idealista is one of the most useful public sources of real-estate data in Europe, but it is not a beginner-friendly site to scrape.

You usually want it for:

  • property market tracking
  • lead generation and enrichment
  • monitoring new listings in a neighborhood
  • building internal price comps

The catch is that Idealista aggressively defends its search pages. In practice, that means you should design the scraper in two layers:

  1. a parser that knows how to read listing cards
  2. a fetch layer that can swap between direct requests, a proxy/unblocker, or browser automation when traffic gets challenged
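The two-layer split can be sketched in a few lines. This is only an illustration of the idea, not the scraper built later in this guide: the names collect, fake_fetch, and fake_parse are hypothetical, and the fetcher here returns canned HTML instead of touching the network.

```python
from typing import Callable

# A fetcher takes a URL and returns HTML; a parser takes HTML and returns rows.
Fetcher = Callable[[str], str]
Parser = Callable[[str], list]


def collect(url: str, fetch: Fetcher, parse: Parser) -> list:
    """Fetch one page with whichever fetcher is supplied, then parse it."""
    return parse(fetch(url))


# Hypothetical stand-ins, only to show that the two layers are independent.
def fake_fetch(url: str) -> str:
    return "<article class='item'>Villa in Marbella</article>"


def fake_parse(html: str) -> list:
    return [{"raw": html}] if "item" in html else []


rows = collect("https://example.com/search", fake_fetch, fake_parse)
print(len(rows))  # 1
```

Because collect only sees a callable, swapping direct requests for a proxy or a browser means passing a different fetch function; the parser is untouched.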

In this guide, we will scrape:

  • listing title
  • listing URL
  • cover image
  • price
  • currency
  • location text
  • property details such as beds / square meters
  • short description
  • tags such as "luxury" or "sea views"

[Screenshot: Idealista anti-bot verification encountered during live capture]

Keep Idealista collection stable with ProxiesAPI

Idealista is quick to challenge repetitive traffic. A ProxiesAPI-backed fetch layer gives you a cleaner way to rotate requests and keep your parser focused on real listing pages instead of verification walls.


What makes Idealista tricky

Idealista search pages are still very parser-friendly once you have the HTML, but reaching that HTML reliably is the hard part.

Common failure modes:

  • a "please enable JS" style interstitial
  • a slider or bot-verification screen
  • geo-sensitive behavior
  • different HTML depending on language or country

That is why this tutorial keeps the parsing logic pure and makes the network layer replaceable.


Install the dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install httpx parsel

We will use:

  • httpx for fast HTTP requests
  • parsel for CSS/XPath extraction

Step 1: Build a fetch layer that can route through ProxiesAPI

Do not bury anti-block behavior inside your parser. Keep it in one place.

from __future__ import annotations

import os
import random
import time
from typing import Optional

import httpx

HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    ),
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
}

TIMEOUT = httpx.Timeout(30.0, connect=15.0)
BLOCK_MARKERS = [
    "enable js",
    "disable any ad blocker",
    "captcha",
    "verify you are human",
    "desliza hacia la derecha",
]


def looks_blocked(html: str) -> bool:
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def backoff(attempt: int) -> None:
    time.sleep(min(2 ** attempt, 20) + random.uniform(0.2, 0.8))


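def fetch_html(url: str, proxiesapi_template: Optional[str] = None, retries: int = 3) -> str:
    # Template style: embed the page URL in the proxy endpoint. A page URL that has
    # its own "?" or "&" should be percent-encoded (urllib.parse.quote) first.
    target = proxiesapi_template.format(url=url) if proxiesapi_template else url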

    with httpx.Client(headers=HEADERS, follow_redirects=True, timeout=TIMEOUT) as client:
        last_error = None
        for attempt in range(retries + 1):
            try:
                response = client.get(target)
                response.raise_for_status()
                html = response.text
                if looks_blocked(html):
                    raise RuntimeError("Idealista returned a verification page")
                return html
            except Exception as exc:
                last_error = exc
                if attempt == retries:
                    break
                backoff(attempt)

    # Chain the last underlying error so the original traceback is preserved.
    raise RuntimeError(f"Failed to fetch {url}: {last_error}") from last_error


if __name__ == "__main__":
    template = os.getenv("PROXIESAPI_URL_TEMPLATE")
    html = fetch_html(
        "https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
        proxiesapi_template=template,
    )
    print(html[:500])

Why this integration pattern is better than hardcoding an endpoint

Different ProxiesAPI accounts are often configured in one of two styles:

  • a target-URL template such as https://...&url={url}
  • a traditional proxy passed through the HTTP client

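The fetch_html function above handles the template style; for the proxy style, you would configure the proxy on the httpx client and request the page URL unchanged. Either way, the parser below does not care which one you use.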
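For the template style, one detail is worth spelling out: the page URL becomes a query-string value, so it is safest to percent-encode it. The sketch below assumes a placeholder template (EXAMPLE_TEMPLATE is made up, and so are its key parameter and the ?a=1&b=2 query on the page URL); substitute whatever your account actually provides.

```python
from typing import Optional
from urllib.parse import quote

# Placeholder template: the real endpoint and parameters come from your account.
EXAMPLE_TEMPLATE = "https://proxy.example.invalid/?key=YOUR_KEY&url={url}"


def build_target(url: str, template: Optional[str] = None) -> str:
    """Return the URL to request: the page itself, or the template with the page embedded."""
    if not template:
        return url
    # Percent-encode the whole page URL so its own "?" and "&" stay inside one value.
    return template.format(url=quote(url, safe=""))


# Hypothetical page URL with a query string, to show why encoding matters.
page = "https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/?a=1&b=2"
print(build_target(page))                    # unchanged: direct request
print(build_target(page, EXAMPLE_TEMPLATE))  # page URL embedded as one encoded value
```

Without the encoding step, the &b=2 part would be read as a parameter of the proxy endpoint rather than of the Idealista URL.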


Step 2: Parse one Idealista search page

Idealista listing cards are usually grouped under article.item, which makes the page much easier to parse than the anti-bot reputation suggests.

from __future__ import annotations

from urllib.parse import urljoin

from parsel import Selector


def clean_text(value: str | None) -> str | None:
    if not value:
        return None
    return " ".join(value.split())


def parse_search_page(html: str, base_url: str = "https://www.idealista.com") -> list[dict]:
    sel = Selector(text=html)
    listings = []

    for card in sel.css("section.items-list article.item"):
        # Skip promoted or ad units when they carry ad text blocks.
        if card.css("p.adv_txt"):
            continue

        relative_url = card.css("a.item-link::attr(href)").get()
        title = clean_text(card.css("a.item-link::attr(title)").get())
        price_text = clean_text(card.css("span.item-price::text").get())
        currency = clean_text(card.css("span.item-price span::text").get())
        location = clean_text(card.css("p.item-location::text, p.highlight-phrase::text").get())
        description = clean_text(card.css("div.item-description p::text").get())
        details = [clean_text(x) for x in card.css("div.item-detail-char span::text").getall()]
        details = [x for x in details if x]
        tags = [clean_text(x) for x in card.css("div.listing-tags-container span::text").getall()]
        tags = [x for x in tags if x]

        listings.append(
            {
                "title": title,
                "url": urljoin(base_url, relative_url) if relative_url else None,
                "image": card.css("img::attr(src), img::attr(data-src)").get(),
                "price_text": price_text,
                "currency": currency,
                "location": location,
                "details": details,
                "description": description,
                "tags": tags,
            }
        )

    return listings

What these selectors capture well

Field        Selector
title        a.item-link::attr(title)
listing URL  a.item-link::attr(href)
price        span.item-price::text
details      div.item-detail-char span::text
tags         div.listing-tags-container span::text

If Idealista changes a class name, the parser breaks in one place instead of throughout your whole script.
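The parser keeps price_text as a string. If you want a number for price comps, a small helper along these lines may be enough. It is a sketch under one loud assumption: prices are whole amounts, and any "," or "." in the text is a thousands separator, not a decimal point. The name parse_price is not part of the parser above.

```python
import re
from typing import Optional


def parse_price(price_text: Optional[str]) -> Optional[int]:
    """Turn a price string such as "1,250,000" or "1.250.000" into an int.

    Assumes whole-unit prices where "," and "." are only thousands separators.
    """
    if not price_text:
        return None
    digits = re.sub(r"\D", "", price_text)  # drop everything that is not a digit
    return int(digits) if digits else None


print(parse_price("1,250,000"))         # 1250000
print(parse_price("1.250.000 €"))       # 1250000
print(parse_price("Price on request"))  # None
```

Keeping the raw price_text alongside the parsed number makes it easy to spot rows where the assumption did not hold.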


Step 3: Add pagination

Idealista search results are commonly paginated by appending pagina-{n}.htm to the search URL.

import math
import random
import re
import time

from parsel import Selector


def extract_total_pages(html: str) -> int:
    sel = Selector(text=html)
    heading = sel.css("h1#h1-container::text").get("") or ""

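    # Heading shapes vary by locale, so keep the regex loose, but require a leading
    # digit so a stray "," or "." in the heading text is never matched on its own.
    match = re.search(r"(\d[\d.,]*)", heading)
    total_results = int(match.group(1).replace(",", "").replace(".", "")) if match else 30
    # Assumes 30 results per page and caps the crawl at 60 pages.
    return max(1, min(math.ceil(total_results / 30), 60))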


def scrape_search_results(search_url: str, max_pages: int = 3, proxiesapi_template: str | None = None) -> list[dict]:
    first_html = fetch_html(search_url, proxiesapi_template=proxiesapi_template)
    total_pages = min(extract_total_pages(first_html), max_pages)

    all_rows = parse_search_page(first_html)

    for page_num in range(2, total_pages + 1):
        page_url = f"{search_url.rstrip('/')}/pagina-{page_num}.htm"
        html = fetch_html(page_url, proxiesapi_template=proxiesapi_template)
        all_rows.extend(parse_search_page(html))
        time.sleep(random.uniform(2.0, 4.5))

    return all_rows

That short sleep matters. When teams say "scraping stopped working," the real cause is often request shape and pacing, not parsing.
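The pagination arithmetic and the pagina-{n}.htm pattern from the code above can be isolated and checked without any network access. The helper names page_count and build_page_url are introduced here for illustration; the 30-results-per-page figure and the 60-page cap are the same assumptions the code above makes.

```python
import math

RESULTS_PER_PAGE = 30  # assumption carried over from extract_total_pages
MAX_PAGES = 60         # assumption carried over from extract_total_pages


def page_count(total_results: int) -> int:
    """Number of result pages to request, clamped to the range 1..MAX_PAGES."""
    return max(1, min(math.ceil(total_results / RESULTS_PER_PAGE), MAX_PAGES))


def build_page_url(search_url: str, page_num: int) -> str:
    """Return the URL for a results page; page 1 is the search URL itself."""
    if page_num < 1:
        raise ValueError("page_num must be >= 1")
    if page_num == 1:
        return search_url
    return f"{search_url.rstrip('/')}/pagina-{page_num}.htm"


base = "https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/"
print(page_count(95))           # 4, because ceil(95 / 30) = 4
print(build_page_url(base, 2))  # .../con-chalets/pagina-2.htm
```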


Step 4: Export clean JSON or CSV

import csv
import json


def write_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, indent=2, ensure_ascii=False)


def write_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["title", "url", "image", "price_text", "currency", "location", "details", "description", "tags"]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(
                {
                    **row,
                    "details": " | ".join(row["details"]),
                    "tags": " | ".join(row["tags"]),
                }
            )

Run the full scrape:

if __name__ == "__main__":
    rows = scrape_search_results(
        "https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
        max_pages=2,
        proxiesapi_template=os.getenv("PROXIESAPI_URL_TEMPLATE"),
    )
    print(f"scraped {len(rows)} listings")
    write_json(rows, "idealista_listings.json")
    write_csv(rows, "idealista_listings.csv")

Handling block pages without poisoning your dataset

The worst scraper bug is not a crash. It is silently saving junk.

Before parsing, treat a page as suspect if any of these is true:

  • HTML length is unexpectedly tiny
  • page title contains verification language
  • no article.item cards found on a page that should contain listings
  • too many consecutive retries from the same route
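The first three checks can be sketched as one function that returns the reasons a page looks unsafe. The length threshold and the title markers below are assumptions to tune against real pages, and page_problems is a hypothetical name. The fourth check (consecutive retries per route) needs state across requests, so it is left out here.

```python
import re

MIN_HTML_LENGTH = 5_000                         # assumed threshold; tune on real pages
TITLE_MARKERS = ("captcha", "verify", "robot")  # assumed verification wording


def page_problems(html: str, expect_listings: bool = True) -> list:
    """Return the reasons a page looks unsafe to parse (empty list means OK)."""
    problems = []
    if len(html) < MIN_HTML_LENGTH:
        problems.append("html too short")

    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).lower() if match else ""
    if any(marker in title for marker in TITLE_MARKERS):
        problems.append("verification title")

    # Cheap check for <article class="... item ..."> without a full HTML parser.
    classes = re.findall(r"<article\b[^>]*\bclass=[\"']([^\"']*)[\"']", html, re.IGNORECASE)
    has_cards = any("item" in value.split() for value in classes)
    if expect_listings and not has_cards:
        problems.append("no listing cards")
    return problems


blocked = "<html><title>Verify you are human</title></html>"
print(page_problems(blocked))
# ['html too short', 'verification title', 'no listing cards']
```

Returning reasons instead of a bare boolean makes it easier to log why a page was rejected.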

A good production pattern is:

  1. try direct fetch
  2. if block detected, retry through ProxiesAPI
  3. if that still fails, queue the URL for browser capture later

That way you do not spend browser resources on every request.
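The escalation pattern above might look like the following. It is a sketch, not the fetch_html from Step 1: the Blocked exception, fetch_with_fallback, and the stand-in fetchers are hypothetical, and fetch_html would need a small change to raise a dedicated exception for verification pages before it could be plugged in.

```python
from typing import Callable, List, Optional

Fetcher = Callable[[str], str]


class Blocked(Exception):
    """Hypothetical signal: the fetcher got a verification page, not listings."""


def fetch_with_fallback(
    url: str, direct: Fetcher, proxied: Fetcher, browser_queue: List[str]
) -> Optional[str]:
    """Try the cheap routes in order; queue the URL for a browser if both are blocked."""
    for fetch in (direct, proxied):
        try:
            return fetch(url)
        except Blocked:
            continue  # escalate to the next route
    browser_queue.append(url)
    return None


# Stand-in fetchers, for illustration only.
def always_blocked(url: str) -> str:
    raise Blocked(url)


def always_ok(url: str) -> str:
    return "<html>listing page</html>"


queue: List[str] = []
print(fetch_with_fallback("https://example.com/a", always_blocked, always_ok, queue))
print(fetch_with_fallback("https://example.com/b", always_blocked, always_blocked, queue))
print(queue)  # ['https://example.com/b']
```

Only Blocked triggers escalation here; other errors (timeouts, HTTP 5xx) propagate, so they can be retried on the same route instead of being pushed to a browser.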


When to use browser automation instead of HTML parsing

Use a browser only when one of these is true:

  • you need to clear a challenge page
  • you need network requests that only appear after client-side hydration
  • you need screenshots or visual verification

For bulk search-result scraping, parsed HTML is cheaper and easier to maintain.


Final thoughts

Idealista is a classic example of a target where parsing is easy but collection is hard. Once you separate those concerns, the project becomes much more manageable.

The scraper in this guide is intentionally boring:

  • stable selectors
  • explicit block detection
  • replaceable ProxiesAPI fetch layer
  • JSON/CSV export you can hand to analytics or ops

That is exactly what you want for a scraper that needs to run tomorrow, not just today.
