Scrape UK Property Prices from Rightmove Sold Prices (Python + Dataset Builder)

Rightmove is one of the most-used property portals in the UK. If you’re trying to build a pricing model, track neighborhood trends, or just analyze the market, the sold prices pages are a gold mine.

In this tutorial we’ll build a repeatable dataset builder that:

  • crawls a Rightmove sold-prices search
  • paginates through result pages
  • extracts each listing’s key fields
  • deduplicates by a stable ID
  • writes a clean CSV you can re-run daily/weekly

We’ll keep it practical: real selectors, defensive parsing, and “don’t hang forever” networking.

[Screenshot: Rightmove sold prices search results. We’ll scrape the results pages and drill into individual listing pages.]

Make your dataset runs stable with ProxiesAPI

Property portals can throttle aggressively when you paginate and fan out into detail pages. ProxiesAPI helps keep the network layer consistent so your dataset builds finish reliably.


What we’re scraping (site structure)

Rightmove sold listings typically follow this pattern:

  • Search results page (sold prices): a URL with query parameters + pagination.
  • Each result links to a property page.
  • The property page includes address, property type, and a sold price history section (when available).

Important: Rightmove’s HTML changes over time. The goal is to build a scraper that fails loudly (so you notice) instead of silently writing garbage.
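
For example, a minimal “fail loudly” guard could look like this (illustrative; ensure_nonempty is a helper we introduce here, not a library function):

def ensure_nonempty(items: list, context: str) -> list:
    # Raise instead of silently producing an empty dataset when parsing
    # finds nothing; that usually means the markup changed.
    if not items:
        raise RuntimeError(f"parsed 0 items ({context}); markup may have changed")
    return items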


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for HTML parsing
  • tenacity for robust retries with backoff
  • python-dotenv for the optional ProxiesAPI config (loaded from a .env file)

The snippets below assume Python 3.10+ (for annotations like str | None) and are written as one file, so later snippets reuse earlier imports.

Step 1: A network layer that won’t betray you

You want three things:

  1. real timeouts (connect + read)
  2. retries on transient failures (429/5xx)
  3. a single place to add ProxiesAPI later
from __future__ import annotations

import os
import random
import time
from dataclasses import dataclass

import requests
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type

TIMEOUT = (10, 30)  # connect, read
BASE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-GB,en;q=0.9",
}


class FetchError(RuntimeError):
    pass


@dataclass
class HttpClient:
    session: requests.Session
    proxiesapi_url: str | None = None

    def build_url(self, url: str) -> str:
        """Optionally route the request via ProxiesAPI.

        Keep this honest: ProxiesAPI is for *reliability* when you scale.
        Your code should still work without it.
        """
        if not self.proxiesapi_url:
            return url

        # Example pattern (adjust to your ProxiesAPI docs):
        # proxiesapi_url might be something like:
        # https://api.proxiesapi.com/v1/?api_key=...&url=
        return f"{self.proxiesapi_url}{requests.utils.quote(url, safe='')}"

    @retry(
        reraise=True,
        stop=stop_after_attempt(6),
        wait=wait_exponential_jitter(initial=1, max=20),
        retry=retry_if_exception_type((requests.RequestException, FetchError)),
    )
    def get(self, url: str) -> str:
        target = self.build_url(url)

        # small jitter reduces bursts when you paginate
        time.sleep(random.uniform(0.2, 0.8))

        r = self.session.get(target, headers=BASE_HEADERS, timeout=TIMEOUT)

        # Treat rate limiting and server errors as retryable.
        if r.status_code in (429, 500, 502, 503, 504):
            raise FetchError(f"retryable status={r.status_code} url={url}")

        r.raise_for_status()
        return r.text


def make_client() -> HttpClient:
    load_dotenv()  # pick up PROXIESAPI_URL from a local .env file, if present
    s = requests.Session()
    proxiesapi_url = os.getenv("PROXIESAPI_URL")  # optional
    return HttpClient(session=s, proxiesapi_url=proxiesapi_url)
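
A quick smoke test of the client (the URL here is a placeholder; use a real sold-prices page):

client = make_client()
html = client.get("https://www.rightmove.co.uk/house-prices/")  # placeholder URL
print(f"fetched {len(html)} characters")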

Configure ProxiesAPI (optional)

Create a .env file:

PROXIESAPI_URL="https://api.proxiesapi.com/v1/?api_key=YOUR_KEY&url="

If you don’t set it, requests go directly to Rightmove.


Step 2: Start from a sold-prices search URL

Rightmove has many query parameters. The simplest workflow is:

  1. perform a sold-prices search manually in your browser
  2. copy the resulting URL
  3. use it as the seed URL for your dataset run

Example (your parameters will differ):

https://www.rightmove.co.uk/house-prices/area.html?locationIdentifier=REGION%5E87490

Pagination is often represented by a start index or page param; a fallback for that case is sketched after the list below.

Because this can change, we’ll implement pagination by:

  • fetching the first page
  • extracting “next page” link if present
  • continuing until no next link
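
If the next link ever disappears but your search URL paginates with an index parameter, rewriting that parameter is a workable fallback. A sketch, assuming the parameter is called index (copy whatever your own URL actually uses):

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def with_index(url: str, index: int) -> str:
    # Rebuild the URL with an updated index=N query parameter.
    # The name "index" is an assumption; check the real param in your browser.
    parts = urlparse(url)
    q = parse_qs(parts.query)
    q["index"] = [str(index)]
    return urlunparse(parts._replace(query=urlencode(q, doseq=True)))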

Step 3: Parse result pages (listing URLs + stable IDs)

Rightmove pages usually contain property links that include a numeric ID.

We’ll extract:

  • listing_id
  • listing_url
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.rightmove.co.uk"

# Property URLs usually embed a long numeric ID; capture the first 6+ digit run.
LISTING_ID_RE = re.compile(r"(\d{6,})")


def parse_results_page(html: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")

    # Try multiple selector strategies; Rightmove changes markup.
    links = []

    for a in soup.select("a[href*='/house-prices/']"):
        href = a.get("href")
        if not href:
            continue
        url = urljoin(BASE, href)
        m = LISTING_ID_RE.search(url)
        if not m:
            continue
        links.append({"listing_id": m.group(1), "url": url})

    # Also include generic property links if present
    for a in soup.select("a[href*='/properties/']"):
        href = a.get("href")
        if not href:
            continue
        url = urljoin(BASE, href)
        m = LISTING_ID_RE.search(url)
        if not m:
            continue
        links.append({"listing_id": m.group(1), "url": url})

    # de-dupe within page
    seen = set()
    out = []
    for item in links:
        if item["listing_id"] in seen:
            continue
        seen.add(item["listing_id"])
        out.append(item)

    # Find next page link (best-effort)
    next_a = soup.select_one("a[rel='next']")
    if not next_a:
        next_a = soup.find("a", string=re.compile(r"Next", re.I))

    next_url = None
    if next_a and next_a.get("href"):
        next_url = urljoin(BASE, next_a.get("href"))

    return out, next_url

If you run this and get zero results, inspect the HTML you’re receiving (you might be getting a bot-check page instead of results). That’s where a proxy layer (or ProxiesAPI) often becomes necessary.
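
One cheap way to notice that early is a block-detection heuristic. A sketch (the marker strings are assumptions, not an exhaustive list; tune them against pages you actually receive):

BLOCK_MARKERS = ("captcha", "access denied", "unusual activity")

def looks_blocked(html: str) -> bool:
    # Bot-check/consent pages tend to be short and contain telltale phrases.
    low = html.lower()
    return len(html) < 2000 or any(m in low for m in BLOCK_MARKERS)

Call it right after client.get() and dump the HTML for inspection when it returns True.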


Step 4: Parse a listing page (sold history + core fields)

For a dataset, you want clean, typed fields:

  • address
  • property_type
  • bedrooms (when available)
  • sold_date
  • sold_price

Rightmove pages tend to expose structured data in JSON inside <script> tags (often application/ld+json). We’ll try that first, then fall back to HTML selectors.

import json
from datetime import datetime


def extract_json_ld(soup: BeautifulSoup) -> list[dict]:
    out = []
    for script in soup.select("script[type='application/ld+json']"):
        try:
            data = json.loads(script.get_text(strip=True) or "{}")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            out.append(data)
        elif isinstance(data, list):
            out.extend([d for d in data if isinstance(d, dict)])
    return out


def parse_listing_page(html: str, listing_url: str, listing_id: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    address = None
    property_type = None
    bedrooms = None
    sold_date = None
    sold_price = None

    # 1) JSON-LD (best)
    for blob in extract_json_ld(soup):
        # common keys: "address", "name", "offers" etc.
        if not address:
            addr = blob.get("address")
            if isinstance(addr, dict):
                address = addr.get("streetAddress") or addr.get("name")
            elif isinstance(addr, str):
                address = addr

        if not property_type:
            property_type = blob.get("@type") if isinstance(blob.get("@type"), str) else None

    # 2) HTML fallbacks
    if not address:
        h1 = soup.select_one("h1")
        if h1:
            address = h1.get_text(" ", strip=True)

    # Sold price/date often appear in a summary block.
    # Use regex to avoid brittle classnames.
    text = soup.get_text("\n", strip=True)

    m_price = re.search(r"Sold price\s*£?([\d,]+)", text, re.I)
    if m_price:
        sold_price = int(m_price.group(1).replace(",", ""))

    m_date = re.search(r"Sold on\s*(\d{1,2}\s+[A-Za-z]+\s+\d{4})", text, re.I)
    if m_date:
        try:
            sold_date = datetime.strptime(m_date.group(1), "%d %B %Y").date().isoformat()
        except ValueError:
            sold_date = m_date.group(1)

    return {
        "listing_id": listing_id,
        "url": listing_url,
        "address": address,
        "property_type": property_type,
        "bedrooms": bedrooms,
        "sold_date": sold_date,
        "sold_price_gbp": sold_price,
    }

This parser is intentionally conservative. If you need richer sold history (multiple transactions), inspect the page HTML/JSON and extend the extraction.
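
As a starting point for that extension, here’s a hedged sketch that pulls (date, price) pairs out of the page text; the pattern is an assumption and should be verified against the live markup:

HISTORY_RE = re.compile(r"(\d{1,2}\s+[A-Za-z]+\s+\d{4})\D{0,40}?£([\d,]+)")

def extract_sold_history(text: str) -> list[dict]:
    # Collect every nearby date/price pair the pattern finds. Ordering is
    # not guaranteed; sort by parsed date if you need chronology.
    rows = []
    for date_str, price_str in HISTORY_RE.findall(text):
        rows.append({
            "sold_date": date_str,
            "sold_price_gbp": int(price_str.replace(",", "")),
        })
    return rows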


Step 5: The dataset builder (paginate → fan out → write CSV)

Now we can build the full pipeline:

  1. start at a seed sold-prices URL
  2. collect listing IDs/URLs across pages
  3. de-dupe IDs
  4. fetch each listing page
  5. write a CSV
import csv
from pathlib import Path


def build_dataset(seed_url: str, out_csv: str = "rightmove_sold_prices.csv", max_pages: int = 25):
    client = make_client()

    # 1) crawl results pages
    all_links: list[dict] = []
    seen_ids: set[str] = set()

    next_url = seed_url
    page = 0

    while next_url and page < max_pages:
        page += 1
        html = client.get(next_url)
        links, next_url = parse_results_page(html)

        added = 0
        for item in links:
            lid = item["listing_id"]
            if lid in seen_ids:
                continue
            seen_ids.add(lid)
            all_links.append(item)
            added += 1

        print(f"page={page} scraped_links={len(links)} added={added} total_unique={len(all_links)}")

        if added == 0 and page >= 2:
            # If we stop discovering new listings, stop early.
            break

    print("total listing urls:", len(all_links))

    # 2) fetch listing pages
    rows: list[dict] = []
    for i, item in enumerate(all_links, start=1):
        html = client.get(item["url"])
        row = parse_listing_page(html, item["url"], item["listing_id"])
        rows.append(row)
        if i % 25 == 0:
            print(f"fetched {i}/{len(all_links)}")

    # 3) write CSV
    out_path = Path(out_csv)
    fieldnames = [
        "listing_id",
        "url",
        "address",
        "property_type",
        "bedrooms",
        "sold_date",
        "sold_price_gbp",
    ]

    with out_path.open("w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            w.writerow(r)

    print("wrote", out_path, "rows=", len(rows))


if __name__ == "__main__":
    # Paste a Rightmove sold-prices search URL here.
    seed = "https://www.rightmove.co.uk/house-prices/area.html?locationIdentifier=REGION%5E87490"
    build_dataset(seed_url=seed, out_csv="rightmove_sold_prices.csv", max_pages=15)

Debugging checklist (Rightmove-specific)

If you get blocked or parse zero links, check:

  • Are you receiving a bot-check/consent page instead of results?
  • Does parse_results_page() find any property links?
  • Did Rightmove change the pagination pattern?

Practical fix order (a helper for the first two steps is sketched after this list):

  1. Print the first 500 chars of the HTML you fetched.
  2. Save it to debug.html and open it locally.
  3. Add/adjust selectors based on the real markup.
  4. If responses vary (sometimes HTML, sometimes blocks), add ProxiesAPI routing.
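
A tiny helper for steps 1 and 2:

from pathlib import Path

def dump_debug(html: str, path: str = "debug.html") -> None:
    # Preview the response, then save it in full for local inspection.
    print(html[:500])
    Path(path).write_text(html, encoding="utf-8")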

Where ProxiesAPI fits (honestly)

For small runs (one area, a few pages), you might get away without proxies.

But the moment you:

  • paginate deeper
  • run multiple areas
  • re-run on a schedule
  • parallelize listing fetches

…you’ll hit throttling.

ProxiesAPI is useful here because it makes the network layer more stable (fewer random failures), so your dataset job finishes consistently.


Next upgrades

  • store results in SQLite with listing_id as the primary key for incremental updates (see the sketch below)
  • normalize addresses with a geocoder (careful with rate limits)
  • extract full sold history (multiple transactions) if present
  • add a “resume” mode that skips already-scraped IDs
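
For the first of those upgrades, a minimal sketch (it assumes the row dicts produced by parse_listing_page; the table and column names are our own choices):

import sqlite3

def upsert_rows(rows: list[dict], db_path: str = "rightmove.db") -> None:
    # INSERT OR REPLACE keeps exactly one row per listing_id, so re-runs
    # update existing listings instead of duplicating them.
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS sold_prices (
               listing_id TEXT PRIMARY KEY,
               url TEXT, address TEXT, property_type TEXT,
               bedrooms INTEGER, sold_date TEXT, sold_price_gbp INTEGER
           )"""
    )
    con.executemany(
        """INSERT OR REPLACE INTO sold_prices
           VALUES (:listing_id, :url, :address, :property_type,
                   :bedrooms, :sold_date, :sold_price_gbp)""",
        rows,
    )
    con.commit()
    con.close()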
