Scrape UK Property Prices from Rightmove (Dataset Builder + Screenshots)

May 03, 2026 · tutorial · #python, #rightmove, #real-estate, #web-scraping, #requests, #beautifulsoup, #csv, #json

Rightmove is one of the biggest UK property portals. If you’re building a market research dataset (sold prices, property attributes, locations), you typically need two layers of scraping:

search / results pages → discover listing URLs
detail pages → extract structured fields (price, address, bedrooms, sold date, etc.)

In this guide, we’ll build a Rightmove sold-price dataset builder in Python, with:

a start results URL treated as configuration (crawling multiple results pages is left to “Next upgrades”)
robust HTML parsing (no “magic” selectors you can’t explain)
retries + backoff for transient errors
exports to CSV and JSON Lines
(optional) ProxiesAPI integration for more stable crawling

Rightmove results page (we’ll crawl results → then listing details)

Keep Rightmove crawls stable with ProxiesAPI

Real estate sites often throttle aggressive crawls. ProxiesAPI helps you keep your dataset builder reliable when you scale to many pages and detail URLs.


Important note (ethics + stability)

Property sites are sensitive to heavy traffic. Be respectful:

scrape only what you need
add delays and caching
prefer off-peak runs
don’t hammer detail pages with high concurrency

This tutorial is meant for legitimate use cases (analytics, research, internal tooling). Always check the site’s terms and applicable law.
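
The "add delays and caching" advice above can be sketched with a small on-disk cache, so that reruns during development do not hit the site again. This is a minimal sketch using only the standard library; `cache_path`, `cached_get`, and the cache directory layout are hypothetical names chosen for illustration, and `fetch` stands in for whatever function you use to download a page.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, url: str) -> Path:
    # Hash the URL so any URL maps to a safe, fixed-length filename.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def cached_get(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return cached HTML for url, calling fetch(url) only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

This cache never expires entries, which suits one-off research runs. For a recurring job you would probably want to add a maximum age check before trusting a cached file.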

What we’re scraping (Rightmove pages)

Rightmove has multiple “surfaces” (for sale, to rent, sold). The exact URLs change over time, but the overall shape stays the same:

results pages with a list of properties
property detail pages with the fields you actually want

Your scraper should be written so that:

you can swap out the start URL (a results page you captured)
your parser is resilient to minor DOM changes

In practice you’ll start with a known-good results URL (from your browser) and treat it as configuration.

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

requests for HTTP
BeautifulSoup(lxml) for parsing
tenacity for clean retries

Step 1: A reliable fetch() with headers, timeouts, retries

Rightmove (like many high-traffic sites) can return:

403/429 if you look bot-like
5xx occasionally
HTML that differs slightly per request

Start with a solid network layer.

import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 30)  # connect, read

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}


class FetchError(Exception):
    pass


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type(FetchError),
)
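def fetch_html(session: requests.Session, url: str) -> FetchResult:
    try:
        r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
    except (requests.ConnectionError, requests.Timeout) as exc:
        # Network-level failures are transient too, so make them retryable.
        raise FetchError(f"network error: {exc}") from exc

    if r.status_code in (403, 429):
        # Treat as retryable; you may want to rotate IPs here.
        raise FetchError(f"blocked: {r.status_code}")

    if r.status_code >= 500:
        raise FetchError(f"server error: {r.status_code}")

    # Other 4xx responses (e.g. 404) raise HTTPError and are not retried.
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)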


def polite_sleep(min_s: float = 1.0, max_s: float = 2.5) -> None:
    time.sleep(random.uniform(min_s, max_s))

Step 2: Parse listing URLs from a results page

The most stable approach is:

parse all links on the results page
keep only links that match the property detail URL pattern
normalize to absolute URLs
de-duplicate

Because Rightmove changes CSS class names, avoid relying on single fragile selectors.

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

RIGHTMOVE_BASE = "https://www.rightmove.co.uk"

# Rightmove property links often contain "/properties/".
PROPERTY_PATH_RE = re.compile(r"/properties/\d+")


def extract_property_urls(results_html: str) -> list[str]:
    soup = BeautifulSoup(results_html, "lxml")

    out: list[str] = []
    seen: set[str] = set()

    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue

        m = PROPERTY_PATH_RE.search(href)
        if not m:
            continue

        # keep only the matching path portion
        path = m.group(0)
        abs_url = urljoin(RIGHTMOVE_BASE, path)

        if abs_url not in seen:
            seen.add(abs_url)
            out.append(abs_url)

    return out

Sanity check

session = requests.Session()
start_url = "PASTE_A_RIGHTMOVE_SOLD_PRICE_RESULTS_URL_HERE"

res = fetch_html(session, start_url)
urls = extract_property_urls(res.text)
print("found", len(urls), "property urls")
print(urls[:5])

If this returns zero, it usually means:

you pasted a URL that requires JS rendering or consent state
you got served a block page
Rightmove changed URL patterns (update the regex)
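
One way to tell a block page apart from a real results page is a cheap heuristic check on the HTML before parsing. The sketch below is an assumption-heavy example: the marker phrases and the `min_length` threshold are placeholders, not values taken from Rightmove, so replace them with what you actually observe in blocked responses.

```python
def looks_like_block_page(html: str, min_length: int = 2000) -> bool:
    """Heuristic only: flag pages with common block wording or suspiciously little HTML."""
    lowered = html.lower()
    # Hypothetical markers; swap in phrases you actually see when blocked.
    markers = ("captcha", "access denied", "unusual traffic")
    if any(marker in lowered for marker in markers):
        return True
    # Real results pages tend to be large; a tiny body is a warning sign.
    return len(html) < min_length
```

A check like this can produce false positives (for example, a normal page that loads a captcha script), so it is better used to log a warning than to discard data silently.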

Step 3: Parse a Rightmove property detail page

Detail pages usually include both visible text and embedded JSON.

A robust strategy:

try to extract fields from embedded JSON first (if present)
fall back to HTML selectors for a small set of important fields

Below is a “hybrid” parser that extracts:

url
address
price_text (sold / guide price)
property_type
bedrooms (if available)
embedded_json_keys (top-level keys of the embedded JSON, if any was found)

import json
from typing import Any


def _first_text(el) -> str | None:
    if not el:
        return None
    t = el.get_text(" ", strip=True)
    return t or None


def try_extract_embedded_json(soup: BeautifulSoup) -> dict[str, Any] | None:
    # Rightmove pages often have JSON in <script> tags.
    # We search for a tag that looks like JSON (heuristic), then parse.
    for s in soup.select("script"):
        txt = (s.string or "").strip()
        if not txt:
            continue

        # Heuristic: some pages embed a JSON blob with "property" keys.
        if "\"property\"" in txt and txt.startswith("{"):
            try:
                return json.loads(txt)
            except json.JSONDecodeError:
                # Not valid JSON after all (e.g. a JS object literal); try the next tag.
                continue

    return None


def parse_property_detail(html: str, url: str) -> dict[str, Any]:
    soup = BeautifulSoup(html, "lxml")

    data: dict[str, Any] = {"url": url}

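    embedded = try_extract_embedded_json(soup)
    if embedded:
        # We only record the top-level keys here so you can inspect the structure
        # and decide which fields to map; the values below come from the HTML.
        data["embedded_json_keys"] = list(embedded.keys())[:20]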

    # HTML fallbacks (keep them minimal + explainable)
    # Address often appears in a prominent heading.
    address = _first_text(soup.select_one("h1"))

    # Price text (varies). We try common patterns.
    price = _first_text(soup.select_one("[data-test='property-price']"))
    if not price:
        price = _first_text(soup.select_one("span[property='price']"))

    # Type/bedrooms often appear in key facts.
    # Note: this scans every <li> on the page, so treat matches as best-effort.
    keyfacts = [_first_text(x) for x in soup.select("li")]

    property_types = ["flat", "apartment", "terraced", "semi-detached", "detached", "bungalow"]
    bedrooms = None
    property_type = None

    for t in keyfacts:
        if not t:
            continue
        # Matches "3 bed", "3 beds", "3 bedroom" and "3 bedrooms".
        m = re.search(r"\b(\d+)\s*bed(?:room)?s?\b", t, re.I)
        if bedrooms is None and m:
            bedrooms = int(m.group(1))
        if property_type is None and any(k in t.lower() for k in property_types):
            property_type = t

    data.update({
        "address": address,
        "price_text": price,
        "bedrooms": bedrooms,
        "property_type": property_type,
    })

    return data
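
Once you know which keys the embedded JSON contains, a small recursive lookup can pull values out of it without hard-coding the full nesting path. This is a sketch only: the helper names are hypothetical and the sample payload is invented for illustration, not a real Rightmove structure.

```python
from typing import Any, Iterator


def iter_values_for_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from iter_values_for_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values_for_key(item, key)


def first_value(node: Any, key: str) -> Any:
    """Return the first value found under `key`, or None if the key never appears."""
    return next(iter_values_for_key(node, key), None)


# Made-up payload shape, for illustration only.
sample = {"property": {"address": {"displayAddress": "1 Example Road"}, "bedrooms": 3}}
print(first_value(sample, "bedrooms"), first_value(sample, "displayAddress"))
```

Searching by key name tolerates the payload being re-nested, at the cost of possibly matching an unrelated key with the same name, so spot-check the values it returns.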

Sanity check (single page)

url = "PASTE_A_RIGHTMOVE_PROPERTY_URL_HERE"
res = fetch_html(session, url)
row = parse_property_detail(res.text, url)
print(row)

Step 4: Crawl results pages → then crawl details

Your pipeline:

fetch results page
extract property URLs
for each URL: fetch + parse detail
export

We’ll keep it sequential (simpler, fewer blocks). You can add concurrency later.

def build_dataset(start_results_url: str, max_properties: int = 200) -> list[dict]:
    session = requests.Session()

    results = fetch_html(session, start_results_url)
    property_urls = extract_property_urls(results.text)

    rows: list[dict] = []

    for i, url in enumerate(property_urls[:max_properties], start=1):
        try:
            polite_sleep(1.0, 2.5)
            detail = fetch_html(session, url)
            rows.append(parse_property_detail(detail.text, url))
            print(f"[{i}/{min(len(property_urls), max_properties)}] ok {url}")
        except Exception as e:
            print(f"[{i}] failed {url}: {e}")
            continue

    return rows


rows = build_dataset("PASTE_RESULTS_URL_HERE", max_properties=50)
print("rows:", len(rows))

Step 5: Export to CSV + JSONL

import csv
import json


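def export_csv(rows: list[dict], path: str) -> None:
    if not rows:
        return

    keys = sorted({k for r in rows for k in r.keys()})
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=keys)
        w.writeheader()
        for r in rows:
            # Serialize lists/dicts (e.g. embedded_json_keys) as JSON so cells stay parseable.
            w.writerow({
                k: json.dumps(v, ensure_ascii=False) if isinstance(v, (list, dict)) else v
                for k, v in r.items()
            })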


def export_jsonl(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


export_csv(rows, "rightmove_sold_prices.csv")
export_jsonl(rows, "rightmove_sold_prices.jsonl")
print("wrote exports")

Where ProxiesAPI fits (honestly)

Rightmove can be sensitive to repeated requests.

When you scale from “50 properties once” to “50,000 properties nightly”, your biggest problems become:

block pages / throttling
uneven latency
higher failure rates on retries

That’s where ProxiesAPI can help — as a network reliability layer.

A simple integration pattern is to route your GET through ProxiesAPI while keeping your parsing code unchanged.

import os

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")


def fetch_html_via_proxiesapi(session: requests.Session, url: str) -> FetchResult:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY in your environment")

    # Example pattern: pass target URL as a parameter to ProxiesAPI.
    # Adjust the endpoint/params to match your ProxiesAPI account/docs.
    proxiesapi_url = "https://api.proxiesapi.com"
    params = {"auth_key": PROXIESAPI_KEY, "url": url}

    r = session.get(proxiesapi_url, params=params, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)

Use it for the fetch step only. Keep your parsers site-specific and testable.
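
One way to keep the fetch step swappable is to pass it into the crawl loop as a plain function. The sketch below assumes a hypothetical `crawl()` helper that takes any `fetch(url) -> html` callable and any `parse(html, url) -> row` callable; it is not part of the code above, just an illustration of the pattern.

```python
from typing import Any, Callable

Fetch = Callable[[str], str]                   # url -> html
Parse = Callable[[str, str], dict[str, Any]]   # (html, url) -> row


def crawl(urls: list[str], fetch: Fetch, parse: Parse) -> list[dict[str, Any]]:
    """Fetch and parse each URL, skipping (and reporting) the ones that fail."""
    rows: list[dict[str, Any]] = []
    for url in urls:
        try:
            rows.append(parse(fetch(url), url))
        except Exception as exc:
            print(f"failed {url}: {exc}")
    return rows
```

With this shape, switching between direct requests and ProxiesAPI is a one-line change, for example `crawl(urls, lambda u: fetch_html_via_proxiesapi(session, u).text, parse_property_detail)`, and the parser can be tested offline with a fake `fetch` that returns saved HTML.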

QA checklist

Start results URL returns HTML (not a block page)
extract_property_urls() finds non-zero URLs
parse_property_detail() returns address/price for at least 3 spot checks
exports open cleanly in Excel/Sheets
you’re sleeping between requests
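
The spot checks above can be backed by a quick completeness report over the collected rows. This is a minimal sketch with a hypothetical helper name; it only counts whether a field is present and non-empty, not whether the value is correct.

```python
from collections import Counter


def field_fill_rates(rows: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of rows that have a non-empty value for each field."""
    if not rows:
        return {f: 0.0 for f in fields}
    filled: Counter = Counter()
    for row in rows:
        for f in fields:
            # Treat missing keys, None and empty strings as "not filled".
            if row.get(f) not in (None, ""):
                filled[f] += 1
    return {f: filled[f] / len(rows) for f in fields}
```

A sudden drop in the fill rate for `address` or `price_text` between runs is usually a sign that a selector stopped matching or that block pages are being parsed as listings.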

Next upgrades

add pagination across multiple results pages (by iterating your results URL parameters)
store seen URLs in SQLite so reruns only fetch new ones
build a “sold prices delta” job that tracks changes over time
add Playwright for pages that require JS or consent flows
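
The pagination upgrade can be sketched as a small URL builder that rewrites one query parameter on your captured results URL. The parameter name `page` and the starting value `1` are assumptions for illustration; check the address bar while paging through results in your browser and use whatever parameter and step the site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url: str, name: str, value: int) -> str:
    """Return url with query parameter `name` set to `value`, replacing any existing one."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(start_url: str, pages: int, param: str = "page", first: int = 1) -> list[str]:
    # `param` and `first` are assumptions; copy the real ones from your browser.
    return [with_query_param(start_url, param, first + i) for i in range(pages)]
```

Each generated URL can then be passed through `fetch_html()` and `extract_property_urls()`, with the combined list de-duplicated before crawling detail pages.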
