Scrape UK Property Prices from Rightmove Sold Prices (Python + Dataset Builder)
Rightmove is one of the most-used property portals in the UK. If you’re trying to build a pricing model, track neighborhood trends, or just analyze the market, the sold prices pages are a gold mine.
In this tutorial we’ll build a repeatable dataset builder that:
- crawls a Rightmove sold-prices search
- paginates through result pages
- extracts each listing’s key fields
- deduplicates by a stable ID
- writes a clean CSV you can re-run daily/weekly
We’ll keep it practical: real selectors, defensive parsing, and “don’t hang forever” networking.

Property portals can throttle aggressively when you paginate and fan out into detail pages. ProxiesAPI helps keep the network layer consistent so your dataset builds finish reliably.
What we’re scraping (site structure)
Rightmove sold listings typically follow this pattern:
- Search results page (sold prices): a URL with query parameters + pagination.
- Each result links to a property page.
- The property page includes address, property type, and a sold price history section (when available).
Important: Rightmove’s HTML changes over time. The goal is to build a scraper that fails loudly (so you notice) instead of silently writing garbage.
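As a concrete example of failing loudly, a small guard like this (a hypothetical helper, not part of any library) can sit after each parse step:

```python
def assert_nonempty(listings: list, page_url: str) -> None:
    """Fail loudly: parsing zero listings usually means the markup
    changed or you received a block page -- not an empty market."""
    if not listings:
        raise RuntimeError(f"parsed 0 listings from {page_url}; inspect the HTML")
```

Raising immediately beats silently appending empty rows to your CSV for an hour.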
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- tenacity for robust retries with backoff
Step 1: A network layer that won’t betray you
You want three things:
- real timeouts (connect + read)
- retries on transient failures (429/5xx)
- a single place to add ProxiesAPI later
from __future__ import annotations
import os
import random
import time
from dataclasses import dataclass
import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type
TIMEOUT = (10, 30) # connect, read
BASE_HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/123.0.0.0 Safari/537.36"
),
"Accept-Language": "en-GB,en;q=0.9",
}
class FetchError(RuntimeError):
pass
@dataclass
class HttpClient:
session: requests.Session
proxiesapi_url: str | None = None
def build_url(self, url: str) -> str:
"""Optionally route the request via ProxiesAPI.
Keep this honest: ProxiesAPI is for *reliability* when you scale.
Your code should still work without it.
"""
if not self.proxiesapi_url:
return url
# Example pattern (adjust to your ProxiesAPI docs):
# proxiesapi_url might be something like:
# https://api.proxiesapi.com/v1/?api_key=...&url=
return f"{self.proxiesapi_url}{requests.utils.quote(url, safe='')}"
@retry(
reraise=True,
stop=stop_after_attempt(6),
wait=wait_exponential_jitter(initial=1, max=20),
retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def get(self, url: str) -> str:
target = self.build_url(url)
# small jitter reduces bursts when you paginate
time.sleep(random.uniform(0.2, 0.8))
r = self.session.get(target, headers=BASE_HEADERS, timeout=TIMEOUT)
# Treat rate limiting and server errors as retryable.
if r.status_code in (429, 500, 502, 503, 504):
raise FetchError(f"retryable status={r.status_code} url={url}")
r.raise_for_status()
return r.text
def make_client() -> HttpClient:
s = requests.Session()
proxiesapi_url = os.getenv("PROXIESAPI_URL") # optional
return HttpClient(session=s, proxiesapi_url=proxiesapi_url)
Configure ProxiesAPI (optional)
Create a .env file:
PROXIESAPI_URL="https://api.proxiesapi.com/v1/?api_key=YOUR_KEY&url="
If you don’t set it, requests go directly to Rightmove.
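Setup installed python-dotenv, and calling its load_dotenv() at startup is the usual way to pull the variable in before make_client() runs. If you'd rather not depend on it at runtime, a minimal stdlib loader covers this one-variable case (a sketch, assuming simple KEY="value" lines):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY="value" lines, one per line.
    Existing environment variables are not overwritten."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

Call it once at the top of your script, before make_client().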
Step 2: Start from a sold-prices search URL
Rightmove has many query parameters. The simplest workflow is:
- perform a sold-prices search manually in your browser
- copy the resulting URL
- use it as the seed URL for your dataset run
Example (your parameters will differ):
https://www.rightmove.co.uk/house-prices/area.html?locationIdentifier=REGION%5E87490
Pagination is often represented by a start index or page param.
Because this can change, we’ll implement pagination by:
- fetching the first page
- extracting “next page” link if present
- continuing until no next link
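If the next-link heuristic ever stops matching, one fallback is to drive pagination yourself with a query parameter. The parameter name index below is an assumption; verify it against real Rightmove result URLs before relying on it:

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def with_index(url: str, index: int) -> str:
    """Sketch: add or replace a paging query param on a search URL.
    The param name 'index' is an assumption -- check real URLs."""
    parts = urlparse(url)
    qs = parse_qs(parts.query)
    qs["index"] = [str(index)]
    return urlunparse(parts._replace(query=urlencode(qs, doseq=True)))
```

You would then loop over with_index(seed, 0), with_index(seed, 25), ... until a page yields no new listings.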
Step 3: Parse result pages (listing URLs + stable IDs)
Rightmove pages usually contain property links that include a numeric ID.
We’ll extract:
- listing_id
- listing_url
from __future__ import annotations
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin
BASE = "https://www.rightmove.co.uk"
LISTING_ID_RE = re.compile(r"(\d{6,})")
def parse_results_page(html: str) -> tuple[list[dict], str | None]:
soup = BeautifulSoup(html, "lxml")
# Try multiple selector strategies; Rightmove changes markup.
links = []
for a in soup.select("a[href*='/house-prices/']"):
href = a.get("href")
if not href:
continue
url = urljoin(BASE, href)
m = LISTING_ID_RE.search(url)
if not m:
continue
links.append({"listing_id": m.group(1), "url": url})
# Also include generic property links if present
for a in soup.select("a[href*='/properties/']"):
href = a.get("href")
if not href:
continue
url = urljoin(BASE, href)
m = LISTING_ID_RE.search(url)
if not m:
continue
links.append({"listing_id": m.group(1), "url": url})
# de-dupe within page
seen = set()
out = []
for item in links:
if item["listing_id"] in seen:
continue
seen.add(item["listing_id"])
out.append(item)
# Find next page link (best-effort)
next_a = soup.select_one("a[rel='next']")
if not next_a:
next_a = soup.find("a", string=re.compile(r"Next", re.I))
next_url = None
if next_a and next_a.get("href"):
next_url = urljoin(BASE, next_a.get("href"))
return out, next_url
If you run this and get zero results, inspect the HTML you’re receiving (you might be getting a bot check page). That’s where a proxy layer (or ProxiesAPI) often becomes necessary.
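A cheap heuristic can flag that situation before you waste a whole run. The markers below are illustrative guesses, not a definitive list of what a block page contains:

```python
def looks_like_block_page(html: str) -> bool:
    """Heuristic: very short bodies or captcha/consent markers
    usually mean a bot check was served instead of results."""
    lowered = html.lower()
    markers = ("captcha", "access denied", "unusual traffic", "are you a robot")
    return len(html) < 2000 or any(m in lowered for m in markers)
```

Check it right after each fetch and bail out (or switch to proxy routing) when it trips.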
Step 4: Parse a listing page (sold history + core fields)
For a dataset, you want clean, typed fields:
- address
- property_type
- bedrooms (when available)
- sold_date
- sold_price
Rightmove pages tend to expose structured data in JSON inside <script> tags (often application/ld+json). We’ll try that first, then fall back to HTML selectors.
import json
from datetime import datetime
def extract_json_ld(soup: BeautifulSoup) -> list[dict]:
out = []
for script in soup.select("script[type='application/ld+json']"):
try:
data = json.loads(script.get_text(strip=True) or "{}")
except json.JSONDecodeError:
continue
if isinstance(data, dict):
out.append(data)
elif isinstance(data, list):
out.extend([d for d in data if isinstance(d, dict)])
return out
def parse_listing_page(html: str, listing_url: str, listing_id: str) -> dict:
soup = BeautifulSoup(html, "lxml")
address = None
property_type = None
bedrooms = None
sold_date = None
sold_price = None
# 1) JSON-LD (best)
for blob in extract_json_ld(soup):
# common keys: "address", "name", "offers" etc.
if not address:
addr = blob.get("address")
if isinstance(addr, dict):
address = addr.get("streetAddress") or addr.get("name")
elif isinstance(addr, str):
address = addr
if not property_type:
property_type = blob.get("@type") if isinstance(blob.get("@type"), str) else None
# 2) HTML fallbacks
if not address:
h1 = soup.select_one("h1")
if h1:
address = h1.get_text(" ", strip=True)
# Sold price/date often appear in a summary block.
# Use regex to avoid brittle classnames.
text = soup.get_text("\n", strip=True)
m_price = re.search(r"Sold price\s*£?([\d,]+)", text, re.I)
if m_price:
sold_price = int(m_price.group(1).replace(",", ""))
m_date = re.search(r"Sold on\s*(\d{1,2}\s+[A-Za-z]+\s+\d{4})", text, re.I)
if m_date:
try:
sold_date = datetime.strptime(m_date.group(1), "%d %B %Y").date().isoformat()
except ValueError:
sold_date = m_date.group(1)
return {
"listing_id": listing_id,
"url": listing_url,
"address": address,
"property_type": property_type,
"bedrooms": bedrooms,
"sold_date": sold_date,
"sold_price_gbp": sold_price,
}
This parser is intentionally conservative. If you need richer sold history (multiple transactions), inspect the page HTML/JSON and extend the extraction.
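As a starting point for multi-transaction history, a regex pass over the page text can recover date/price pairs. Both patterns are assumptions; adjust them to the markup you actually receive:

```python
import re
from datetime import datetime

# Assumed pattern: a "<day> <Month> <year>" date followed within a few
# characters by a "£<amount>" price. Verify against real page text.
HISTORY_RE = re.compile(r"(\d{1,2}\s+[A-Za-z]+\s+\d{4})\D{0,40}£([\d,]+)", re.I)

def extract_sold_history(page_text: str) -> list:
    """Sketch: pull every date/price pair from visible page text."""
    rows = []
    for date_str, price_str in HISTORY_RE.findall(page_text):
        try:
            date_iso = datetime.strptime(date_str, "%d %B %Y").date().isoformat()
        except ValueError:
            date_iso = date_str  # keep the raw string if parsing fails
        rows.append({
            "sold_date": date_iso,
            "sold_price_gbp": int(price_str.replace(",", "")),
        })
    return rows
```

This returns one dict per transaction, oldest-to-newest in document order, which you can store alongside the single-row summary.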
Step 5: The dataset builder (paginate → fan out → write CSV)
Now we can build the full pipeline:
- start at a seed sold-prices URL
- collect listing IDs/URLs across pages
- de-dupe IDs
- fetch each listing page
- write a CSV
import csv
from pathlib import Path
def build_dataset(seed_url: str, out_csv: str = "rightmove_sold_prices.csv", max_pages: int = 25):
client = make_client()
# 1) crawl results pages
all_links: list[dict] = []
seen_ids: set[str] = set()
next_url = seed_url
page = 0
while next_url and page < max_pages:
page += 1
html = client.get(next_url)
links, next_url = parse_results_page(html)
added = 0
for item in links:
lid = item["listing_id"]
if lid in seen_ids:
continue
seen_ids.add(lid)
all_links.append(item)
added += 1
print(f"page={page} scraped_links={len(links)} added={added} total_unique={len(all_links)}")
if added == 0 and page >= 2:
# If we stop discovering new listings, stop early.
break
print("total listing urls:", len(all_links))
# 2) fetch listing pages
rows: list[dict] = []
for i, item in enumerate(all_links, start=1):
html = client.get(item["url"])
row = parse_listing_page(html, item["url"], item["listing_id"])
rows.append(row)
if i % 25 == 0:
print(f"fetched {i}/{len(all_links)}")
# 3) write CSV
out_path = Path(out_csv)
fieldnames = [
"listing_id",
"url",
"address",
"property_type",
"bedrooms",
"sold_date",
"sold_price_gbp",
]
with out_path.open("w", newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=fieldnames)
w.writeheader()
for r in rows:
w.writerow(r)
print("wrote", out_path, "rows=", len(rows))
if __name__ == "__main__":
# Paste a Rightmove sold-prices search URL here.
seed = "https://www.rightmove.co.uk/house-prices/area.html?locationIdentifier=REGION%5E87490"
build_dataset(seed_url=seed, out_csv="rightmove_sold_prices.csv", max_pages=15)
Debugging checklist (Rightmove-specific)
If you get blocked or parse zero links, check:
- Are you receiving a bot-check/consent page instead of results?
- Does parse_results_page() find any property links?
- Did Rightmove change the pagination pattern?
Practical fix order:
- Print the first 500 chars of the HTML you fetched.
- Save it to debug.html and open it locally.
- Add/adjust selectors based on the real markup.
- If responses vary (sometimes HTML, sometimes blocks), add ProxiesAPI routing.
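The first two items on that list can live in one tiny helper you call whenever a parse comes back empty:

```python
from pathlib import Path

def dump_debug(html: str, path: str = "debug.html") -> None:
    """Print a quick preview and save the full HTML for inspection."""
    print(html[:500])  # first 500 chars: enough to spot a block page
    Path(path).write_text(html, encoding="utf-8")
```

Open the saved file in a browser and compare it with what you see on rightmove.co.uk in a normal session.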
Where ProxiesAPI fits (honestly)
For small runs (one area, a few pages), you might get away without proxies.
But the moment you:
- paginate deeper
- run multiple areas
- re-run on a schedule
- parallelize listing fetches
…you’ll hit throttling.
ProxiesAPI is useful here because it makes the network layer more stable (fewer random failures), so your dataset job finishes consistently.
Next upgrades
- store results in SQLite with listing_id as the primary key (incremental updates)
- normalize addresses with a geocoder (careful with rate limits)
- extract full sold history (multiple transactions) if present
- add a “resume” mode that skips already-scraped IDs
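The SQLite and resume upgrades fit together naturally. A sketch, with illustrative table and column names:

```python
import sqlite3

def upsert_rows(db_path: str, rows: list) -> None:
    """listing_id as primary key; INSERT OR REPLACE keeps the
    latest row per listing, so re-runs are incremental."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS sold_prices (
               listing_id TEXT PRIMARY KEY,
               url TEXT, address TEXT, property_type TEXT,
               bedrooms INTEGER, sold_date TEXT, sold_price_gbp INTEGER)"""
    )
    con.executemany(
        """INSERT OR REPLACE INTO sold_prices
           VALUES (:listing_id, :url, :address, :property_type,
                   :bedrooms, :sold_date, :sold_price_gbp)""",
        rows,
    )
    con.commit()
    con.close()

def already_scraped(db_path: str) -> set:
    """IDs to skip in a resume run."""
    con = sqlite3.connect(db_path)
    try:
        ids = {r[0] for r in con.execute("SELECT listing_id FROM sold_prices")}
    except sqlite3.OperationalError:  # table doesn't exist yet
        ids = set()
    con.close()
    return ids
```

In build_dataset, filter all_links against already_scraped() before the fan-out loop, and call upsert_rows() instead of (or in addition to) writing the CSV.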