Scrape UK Property Prices from Rightmove (Dataset Builder + Screenshots)

Rightmove is one of the biggest UK property portals. If you’re building a market research dataset (sold prices, property attributes, locations), you typically need two layers of scraping:

  1. search / results pages → discover listing URLs
  2. detail pages → extract structured fields (price, address, bedrooms, sold date, etc.)

In this guide, we’ll build a production-grade Rightmove sold-price dataset builder in Python:

  • a results crawl driven by a configurable start URL (multi-page pagination is covered under “Next upgrades”)
  • robust HTML parsing (no “magic” selectors you can’t explain)
  • retries + backoff for transient errors
  • exports to CSV and JSON Lines
  • (optional) ProxiesAPI integration for more stable crawling

Rightmove results page (we’ll crawl results → then listing details)

Keep Rightmove crawls stable with ProxiesAPI

Real estate sites often throttle aggressive crawls. ProxiesAPI helps you keep your dataset builder reliable when you scale to many pages and detail URLs.


Important note (ethics + stability)

Property sites are sensitive to heavy traffic. Be respectful:

  • scrape only what you need
  • add delays and caching
  • prefer off-peak runs
  • don’t hammer detail pages with high concurrency

This tutorial is meant for legitimate use cases (analytics, research, internal tooling). Always check the site’s terms and applicable law.
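
The "add delays and caching" advice above can be sketched as a small on-disk cache, so that re-running your parser during development does not re-download pages you already have. This is a minimal sketch, not part of the original pipeline: the cached_fetch name, the .cache directory, and the fetch callable are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: str = ".cache") -> str:
    """Return cached HTML for url if present; otherwise call fetch(url) and store it."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)

    # Hash the URL so it is always safe to use as a filename.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = folder / f"{key}.html"

    if path.exists():
        return path.read_text(encoding="utf-8")

    html = fetch(url)  # only hit the network on a cache miss
    path.write_text(html, encoding="utf-8")
    return html
```

Once you have the fetch_html() function from Step 1, you could pass something like lambda u: fetch_html(session, u).text as the fetch argument. The cache never expires in this sketch, so delete the directory when you want fresh pages.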


What we’re scraping (Rightmove pages)

Rightmove has multiple “surfaces” (for sale, to rent, sold). The exact URLs change over time, but the overall shape stays the same:

  • results pages with a list of properties
  • property detail pages with the fields you actually want

Your scraper should be written so that:

  • you can swap out the start URL (a results page you captured)
  • your parser is resilient to minor DOM changes

In practice you’ll start with a known-good results URL (from your browser) and treat it as configuration.
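
If you treat the start URL as configuration, one way to get repeatable URL building is to rewrite a single query parameter and leave everything else untouched. The sketch below assumes the results pages are selected by a 1-based integer query parameter; the parameter name "page" is hypothetical, so check the URL in your browser as you click through results and substitute whatever the site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url: str, name: str, value: str) -> str:
    """Return url with query parameter `name` set to `value`, keeping the others."""
    parts = urlsplit(url)
    # Drop any existing occurrence of the parameter, then append the new value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(start_url: str, pages: int, param: str = "page") -> list[str]:
    # ASSUMPTION: pages are numbered 1..N via a query parameter called `param`.
    return [with_query_param(start_url, param, str(n)) for n in range(1, pages + 1)]


if __name__ == "__main__":
    for u in page_urls("https://example.com/results?area=x", pages=3):
        print(u)
```

Some sites paginate by item offset rather than page number. In that case the same helper works, but you would pass the offset values instead of 1..N.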


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

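We’ll use:

  • requests for HTTP
  • BeautifulSoup with the lxml parser for HTML parsing
  • tenacity for declarative retries with backoff

The type hints in the code below (for example str | None) need Python 3.10 or newer.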

Step 1: A reliable fetch() with headers, timeouts, retries

Rightmove (like many high-traffic sites) can return:

  • 403/429 if you look bot-like
  • 5xx occasionally
  • HTML that differs slightly per request

Start with a solid network layer.

import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 30)  # connect, read

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}


class FetchError(Exception):
    pass


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type(FetchError),
)
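def fetch_html(session: requests.Session, url: str) -> FetchResult:
    try:
        r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
    except (requests.ConnectionError, requests.Timeout) as e:
        # Network-level failures are transient too, so make them retryable.
        raise FetchError(f"network error: {e}") from e

    if r.status_code in (403, 429):
        # Treat as retryable; you may want to rotate IPs here.
        raise FetchError(f"blocked: {r.status_code}")

    if r.status_code >= 500:
        raise FetchError(f"server error: {r.status_code}")

    # Other 4xx responses (e.g. 404) raise requests.HTTPError and are not retried.
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)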


def polite_sleep(min_s: float = 1.0, max_s: float = 2.5) -> None:
    time.sleep(random.uniform(min_s, max_s))
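
A 429 response sometimes carries a Retry-After header telling you how long to wait. Whether Rightmove sends it is not something this guide has verified, so treat the following as an optional refinement. The retry_after_seconds helper is a hypothetical name; it handles both forms the HTTP spec allows (a number of seconds, or an HTTP date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return the wait in seconds suggested by Retry-After, or None if absent/unparseable."""
    # requests' response.headers is case-insensitive; a plain dict is not.
    value = headers.get("Retry-After")
    if not value:
        return None
    value = value.strip()

    # Form 1: delay in whole seconds, e.g. "120".
    if value.isdigit():
        return float(value)

    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Inside fetch_html() you could call this on r.headers when the status is 429 and sleep for the returned value (capped at something sensible) before raising FetchError.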

Step 2: Parse listing URLs from a results page

The most stable approach is:

  1. parse all links on the results page
  2. keep only links that match the property detail URL pattern
  3. normalize to absolute URLs
  4. de-duplicate

Because Rightmove changes CSS class names, avoid relying on single fragile selectors.

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

RIGHTMOVE_BASE = "https://www.rightmove.co.uk"

# Rightmove property links often contain "/properties/".
PROPERTY_PATH_RE = re.compile(r"/properties/\d+")


def extract_property_urls(results_html: str) -> list[str]:
    soup = BeautifulSoup(results_html, "lxml")

    out: list[str] = []
    seen: set[str] = set()

    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue

        m = PROPERTY_PATH_RE.search(href)
        if not m:
            continue

        # keep only the matching path portion
        path = m.group(0)
        abs_url = urljoin(RIGHTMOVE_BASE, path)

        if abs_url not in seen:
            seen.add(abs_url)
            out.append(abs_url)

    return out

Sanity check

session = requests.Session()
start_url = "PASTE_A_RIGHTMOVE_SOLD_PRICE_RESULTS_URL_HERE"

res = fetch_html(session, start_url)
urls = extract_property_urls(res.text)
print("found", len(urls), "property urls")
print(urls[:5])

If this returns zero, it usually means:

  • you pasted a URL that requires JS rendering or consent state
  • you got served a block page
  • Rightmove changed URL patterns (update the regex)
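
To tell these three cases apart quickly, a rough triage function can help. The sketch below uses only the standard library; the phrases in BLOCK_HINTS are illustrative guesses, not strings taken from real Rightmove responses, so replace them with whatever you actually see in a saved block page.

```python
import re

# ASSUMPTION: example phrases only; adjust after inspecting a real block page.
BLOCK_HINTS = ("access denied", "unusual traffic", "captcha", "are you a robot")

# Matches the start of an anchor tag that carries an href attribute.
ANCHOR_RE = re.compile(r"<a\s[^>]*href=", re.I)


def diagnose_results_html(html: str) -> str:
    """Give a rough reason why a results page produced zero property URLs."""
    lowered = html.lower()
    if any(hint in lowered for hint in BLOCK_HINTS):
        return "possible block page"
    if not ANCHOR_RE.search(html):
        return "no links found (JS-rendered shell or consent wall?)"
    return "links present (check the property URL regex)"
```

Saving the raw HTML to disk next to the diagnosis makes it much easier to confirm which case you hit.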

Step 3: Parse a Rightmove property detail page

Detail pages usually include both visible text and embedded JSON.

A robust strategy:

  • try to extract fields from embedded JSON first (if present)
  • fall back to HTML selectors for a small set of important fields

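Below is a “hybrid” parser that extracts:

  • address
  • price_text (sold / guide price)
  • property_type
  • bedrooms (if available)

It also records the top-level keys of any embedded JSON it finds, so you can see what is available before mapping specific fields. The agent name is not extracted here; add it once you have confirmed where it appears on the page.

import json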
from typing import Any


def _first_text(el) -> str | None:
    if not el:
        return None
    t = el.get_text(" ", strip=True)
    return t or None


def try_extract_embedded_json(soup: BeautifulSoup) -> dict[str, Any] | None:
    # Rightmove pages often have JSON in <script> tags.
    # We search for a tag that looks like JSON (heuristic), then parse.
    for s in soup.select("script"):
        txt = (s.string or "").strip()
        if not txt:
            continue

        # Heuristic: some pages embed a JSON blob with "property" keys.
        if "\"property\"" in txt and txt.startswith("{"):
            try:
                return json.loads(txt)
            except json.JSONDecodeError:
                # Not valid JSON after all; try the next <script> tag.
                continue

    return None


def parse_property_detail(html: str, url: str) -> dict[str, Any]:
    soup = BeautifulSoup(html, "lxml")

    data: dict[str, Any] = {"url": url}

    embedded = try_extract_embedded_json(soup)
    if embedded:
        data["embedded_json_keys"] = list(embedded.keys())[:20]

    # HTML fallbacks (keep them minimal + explainable)
    # Address often appears in a prominent heading.
    address = _first_text(soup.select_one("h1"))

    # Price text (varies). We try common patterns.
    price = _first_text(soup.select_one("[data-test='property-price']"))
    if not price:
        price = _first_text(soup.select_one("span[property='price']"))

    # Type/bedrooms often appear in key facts.
    keyfacts = [_first_text(x) for x in soup.select("li")]

    bedrooms = None
    property_type = None

    for t in keyfacts:
        if not t:
            continue
        # Match "3 bed", "3 beds", "3 bedroom" and "3 bedrooms".
        m = re.search(r"\b(\d+)\s*bed(?:room)?s?\b", t, re.I)
        if bedrooms is None and m:
            bedrooms = int(m.group(1))
        if property_type is None and any(k in t.lower() for k in ["flat", "apartment", "terraced", "semi-detached", "detached", "bungalow"]):
            property_type = t

    data.update({
        "address": address,
        "price_text": price,
        "bedrooms": bedrooms,
        "property_type": property_type,
    })

    return data

Sanity check (single page)

url = "PASTE_A_RIGHTMOVE_PROPERTY_URL_HERE"
res = fetch_html(session, url)
row = parse_property_detail(res.text, url)
print(row)
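
Once embedded_json_keys shows you what the page exposes, you will want to pull specific fields out of the nested JSON without a chain of try/except blocks. A small path-walking helper is one way to do that. The dig name and the sample structure below are hypothetical; the real key names depend on what your own pages contain.

```python
from typing import Any, Union


def dig(obj: Any, *path: Union[str, int], default: Any = None) -> Any:
    """Walk nested dicts/lists along path; return default if any step is missing."""
    current = obj
    for key in path:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif (
            isinstance(current, list)
            and isinstance(key, int)
            and -len(current) <= key < len(current)
        ):
            current = current[key]
        else:
            return default
    return current


# HYPOTHETICAL shape, for illustration only.
sample = {"property": {"address": {"displayAddress": "1 Example Street"}, "bedrooms": 3}}

if __name__ == "__main__":
    print(dig(sample, "property", "address", "displayAddress"))
    print(dig(sample, "property", "agent", "name", default="unknown"))
```

In parse_property_detail() you could try dig(embedded, ...) first and only fall back to the HTML selectors when it returns None.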

Step 4: Crawl results pages → then crawl details

Your pipeline:

  1. fetch results page
  2. extract property URLs
  3. for each URL: fetch + parse detail
  4. export

We’ll keep it sequential (simpler, fewer blocks). You can add concurrency later.



def build_dataset(start_results_url: str, max_properties: int = 200) -> list[dict]:
    session = requests.Session()

    results = fetch_html(session, start_results_url)
    property_urls = extract_property_urls(results.text)

    rows: list[dict] = []

    for i, url in enumerate(property_urls[:max_properties], start=1):
        try:
            polite_sleep(1.0, 2.5)
            detail = fetch_html(session, url)
            rows.append(parse_property_detail(detail.text, url))
            print(f"[{i}/{min(len(property_urls), max_properties)}] ok {url}")
        except Exception as e:
            print(f"[{i}] failed {url}: {e}")
            continue

    return rows


rows = build_dataset("PASTE_RESULTS_URL_HERE", max_properties=50)
print("rows:", len(rows))
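
If you do add concurrency later, a small bounded thread pool keeps the change contained. This is a sketch under the assumption that you wrap "sleep, fetch, parse" for one URL into a single callable; fetch_many and fetch_one are hypothetical names. Keep max_workers low, in line with the note above about not hammering detail pages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def fetch_many(
    urls: Iterable[str],
    fetch_one: Callable[[str], dict],
    max_workers: int = 2,
) -> list[dict]:
    """Run fetch_one over urls with a small worker pool; failed URLs are skipped."""

    def safe(url: str) -> Optional[dict]:
        try:
            return fetch_one(url)
        except Exception as exc:  # keep the crawl going on a single bad page
            print(f"failed {url}: {exc}")
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        results = list(pool.map(safe, urls))

    return [r for r in results if r is not None]
```

A requests.Session is not documented as thread-safe, so a cautious choice is to create one session per worker rather than sharing the one from build_dataset().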

Step 5: Export to CSV + JSONL

import csv
import json


def export_csv(rows: list[dict], path: str) -> None:
    if not rows:
        return

    keys = sorted({k for r in rows for k in r.keys()})
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=keys)
        w.writeheader()
        w.writerows(rows)


def export_jsonl(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


export_csv(rows, "rightmove_sold_prices.csv")
export_jsonl(rows, "rightmove_sold_prices.jsonl")
print("wrote exports")
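
The exports above keep price_text exactly as it appeared on the page. For analysis you will usually want a numeric column as well. Here is a minimal sketch of a normalizer; parse_price_gbp is a hypothetical helper, and it deliberately returns None for shorthand such as "£1.2m" rather than guessing.

```python
import re
from typing import Optional

# Digits with optional thousands separators, optional pence, optional m/k suffix.
PRICE_RE = re.compile(r"£\s*(\d[\d,]*)(?:\.\d+)?([mkMK]\b)?")


def parse_price_gbp(text: Optional[str]) -> Optional[int]:
    """Convert text like 'Guide price £1,250,000' to 1250000 (whole pounds)."""
    if not text:
        return None
    m = PRICE_RE.search(text)
    if not m or m.group(2):
        # No price found, or shorthand ("£1.2m") that this sketch does not handle.
        return None
    return int(m.group(1).replace(",", ""))


if __name__ == "__main__":
    for raw in ["£425,000", "Guide price £1,250,000", "POA", "£1.2m"]:
        print(repr(raw), "->", parse_price_gbp(raw))
```

Keep the original price_text column alongside the parsed value so you can audit rows where parsing returned None.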

Where ProxiesAPI fits (honestly)

Rightmove can be sensitive to repeated requests.

When you scale from “50 properties once” to “50,000 properties nightly”, your biggest problems become:

  • block pages / throttling
  • uneven latency
  • higher failure rates on retries

That’s where ProxiesAPI can help — as a network reliability layer.

A simple integration pattern is to route your GET through ProxiesAPI while keeping your parsing code unchanged.

import os

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")


def fetch_html_via_proxiesapi(session: requests.Session, url: str) -> FetchResult:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY in your environment")

    # Example pattern: pass target URL as a parameter to ProxiesAPI.
    # Adjust the endpoint/params to match your ProxiesAPI account/docs.
    proxiesapi_url = "https://api.proxiesapi.com"
    params = {"auth_key": PROXIESAPI_KEY, "url": url}

    r = session.get(proxiesapi_url, params=params, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)

Use it for the fetch step only. Keep your parsers site-specific and testable.
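
One way to keep the fetch step swappable is to pass it in as a plain callable, so the same crawl loop runs against a direct fetcher, a proxied fetcher, or canned HTML in a test. The crawl function and the Fetcher/Parser aliases below are hypothetical names for this sketch.

```python
from typing import Callable

Fetcher = Callable[[str], str]        # takes a URL, returns HTML
Parser = Callable[[str, str], dict]   # takes (html, url), returns one row


def crawl(urls: list[str], fetch: Fetcher, parse: Parser) -> list[dict]:
    """Fetch and parse each URL; the network layer is whatever `fetch` does."""
    return [parse(fetch(url), url) for url in urls]


if __name__ == "__main__":
    # Offline example: a fake fetcher returning canned HTML, and a toy parser.
    canned = {"https://example.com/p/1": "<h1>1 Example Street</h1>"}

    def fake_fetch(url: str) -> str:
        return canned[url]

    def toy_parse(html: str, url: str) -> dict:
        return {"url": url, "length": len(html)}

    print(crawl(list(canned), fake_fetch, toy_parse))
```

With the functions from this guide, fetch could be lambda u: fetch_html(session, u).text or lambda u: fetch_html_via_proxiesapi(session, u).text, and parse would be parse_property_detail.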


QA checklist

  • Start results URL returns HTML (not a block page)
  • extract_property_urls() finds non-zero URLs
  • parse_property_detail() returns address/price for at least 3 spot checks
  • exports open cleanly in Excel/Sheets
  • you’re sleeping between requests
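
Part of this checklist can be automated. A quick fill-rate report over the scraped rows shows at a glance whether a selector has silently stopped matching. This is a sketch; fill_rates is a hypothetical helper and the threshold you alert on is up to you.

```python
def fill_rates(rows: list[dict], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of rows where the value is present and non-empty."""
    if not rows:
        return {field: 0.0 for field in fields}
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) not in (None, "")) / total
        for field in fields
    }


if __name__ == "__main__":
    demo = [
        {"address": "1 Example Street", "price_text": "£425,000"},
        {"address": "2 Example Street", "price_text": None},
    ]
    print(fill_rates(demo, ["address", "price_text", "bedrooms"]))
```

A sudden drop in the fill rate for address or price_text between runs is usually a sign of a DOM change or a block page rather than a real change in the data.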

Next upgrades

  • add pagination across multiple results pages (by iterating your results URL parameters)
  • store seen URLs in SQLite so reruns only fetch new ones
  • build a “sold prices delta” job that tracks changes over time
  • add Playwright for pages that require JS or consent flows
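
As a starting point for the SQLite upgrade, the sketch below stores each property URL in a one-column table and reports whether it was new. The table name, file name, and function names are assumptions for illustration.

```python
import sqlite3


def open_seen_db(path: str = "seen.db") -> sqlite3.Connection:
    """Open (or create) the database that remembers which URLs were fetched."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
    return conn


def mark_if_new(conn: sqlite3.Connection, url: str) -> bool:
    """Record url; return True only the first time it is seen."""
    # INSERT OR IGNORE skips the row when the primary key already exists.
    cur = conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_seen_db(":memory:")
    print(mark_if_new(conn, "https://example.com/properties/1"))
    print(mark_if_new(conn, "https://example.com/properties/1"))
```

In build_dataset() you would call mark_if_new() before fetching a detail page and skip the URL when it returns False. If you would rather retry failed pages on the next run, mark the URL only after a successful parse.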