Scrape UK Property Prices from Rightmove (Sold Prices Dataset Builder + Screenshots)

May 08, 2026 · tutorial · #python, #rightmove, #real-estate, #web-scraping, #beautifulsoup, #csv, #proxies

Rightmove is one of the most useful sources for UK property research. The challenge is that the Sold Prices experience is built for humans (search, filters, pagination), not for exporting a dataset.

In this guide you’ll build a repeatable dataset builder that:

starts from a Sold Prices search URL
follows pagination safely
extracts sold-price cards (address, price, sold date, property type where available)
de-duplicates results
exports a clean CSV you can load into a spreadsheet / database

We’ll keep the scraping honest:

we’ll parse the HTML the server actually returns, and check selectors against it instead of guessing
we’ll use timeouts + retries
we’ll show exactly where ProxiesAPI fits (network stability), without pretending it “unblocks everything”

Rightmove Sold Prices search results (we’ll scrape listing cards + pagination)

Keep Rightmove crawls stable with ProxiesAPI

Property sites can rate-limit, geo-fence, or intermittently block requests. ProxiesAPI gives you a consistent proxy layer so your dataset jobs finish reliably, even as URL counts grow.


What we’re scraping (Rightmove Sold Prices)

Rightmove has multiple surfaces. For this tutorial we focus on:

Sold Prices search results pages (multiple result cards)
pagination (next page / page index)

You’ll typically start from a URL you can produce manually by applying filters in your browser.

Quick sanity check (HTML is there)

Before writing any parser, confirm Rightmove returns HTML (not a blank JS shell):

curl -sL "https://www.rightmove.co.uk/house-prices.html" | head -n 5

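If you get an HTML document back, that is a good first sign. To be sure it is not an empty JavaScript shell, also search the response for an address or price you can see in the browser; if it is there, you can parse the page with BeautifulSoup.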

Note: Rightmove’s exact Sold Prices URLs can change over time. The scraper below is structured so you only need to update a small set of CSS selectors if Rightmove tweaks markup.
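The sanity check above can also be scripted so it runs before every crawl. This is a minimal sketch, assuming a page counts as "usable" when its visible text (outside script and style blocks) contains a few £ amounts. The function name `looks_server_rendered` and the threshold of 3 are illustrative choices, not anything Rightmove defines, and the check assumes the pound sign appears literally rather than as an HTML entity.

```python
import re


def looks_server_rendered(html: str, min_prices: int = 3) -> bool:
    """Heuristic: does the raw HTML carry price text outside <script>/<style>?"""
    # Drop script/style bodies so prices embedded in JSON blobs don't count.
    visible = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    # Drop the remaining tags, keeping only text nodes.
    visible = re.sub(r"(?s)<[^>]+>", " ", visible)
    prices = re.findall(r"£\s*\d[\d,]*", visible)
    return len(prices) >= min_prices


# Hypothetical inputs: an empty JS shell vs. a page with rendered results.
shell = (
    "<html><body><div id='root'></div>"
    "<script>var x='£1,000 £2,000 £3,000'</script></body></html>"
)
rendered = (
    "<html><body><li>1 High St £425,000</li>"
    "<li>2 High St £310,000</li><li>3 High St £99,950</li></body></html>"
)

print(looks_server_rendered(shell), looks_server_rendered(rendered))
```

If the check fails on a real page, the data is probably loaded by JavaScript and an HTML-only parser will come back empty.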

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

requests for HTTP
BeautifulSoup with the lxml parser for fast, lenient HTML parsing
tenacity for retry/backoff

The fetch layer (with ProxiesAPI + retries)

Most scraping failures are network-ish:

transient 5xx
timeouts
occasional 403/429 bursts

So we make fetching robust first.

Option A: Direct requests (baseline)

import random
import time
import requests

TIMEOUT = (10, 30)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0 Safari/537.36"
    ),
    "Accept-Language": "en-GB,en;q=0.9",
})


def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True)
    r.raise_for_status()
    return r.text

Option B: Same fetch, routed via ProxiesAPI

ProxiesAPI is typically used by pointing your HTTP client at a proxy endpoint.

Because proxy providers differ in exact connection details, this snippet is intentionally “drop-in”: you set your proxy URL in an environment variable and the rest of your code stays the same.

import os

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")  # e.g. http://USER:PASS@gateway.proxiesapi.com:1234

proxies = None
if PROXY_URL:
    proxies = {
        "http": PROXY_URL,
        "https": PROXY_URL,
    }


def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)
    r.raise_for_status()
    return r.text

If PROXIESAPI_PROXY_URL is not set, you’ll run direct.

Add retries (recommended)

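from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class RetryableStatus(Exception):
    """Raised for status codes worth retrying (blocks, transient 5xx)."""


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    # Retry connection problems, timeouts, and the statuses flagged below.
    # Other HTTP errors (e.g. 404) fail immediately instead of being retried.
    retry=retry_if_exception_type(
        (requests.ConnectionError, requests.Timeout, RetryableStatus)
    ),
)
def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)

    # Treat common block-ish responses as retryable.
    if r.status_code in (403, 429, 500, 502, 503):
        raise RetryableStatus(f"status={r.status_code} url={url}")

    r.raise_for_status()
    return r.text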

Step 1: Identify listing cards + fields

Rightmove’s Sold Prices results are laid out as repeated “cards” (list items / divs) containing:

address text
sold price
sold date (or transaction date)
a details link

Because markup changes, don’t hardcode one brittle selector and pray.

Instead, build your parser around:

finding the card container
extracting fields using a small set of fallback selectors

Here’s a practical parser you can adapt quickly.

import re
from bs4 import BeautifulSoup


def clean_text(s: str | None) -> str | None:
    if not s:
        return None
    s = re.sub(r"\s+", " ", s).strip()
    return s or None


def parse_money(text: str | None) -> int | None:
    if not text:
        return None
    # e.g. "£425,000" → 425000
    m = re.search(r"£\s*([\d,]+)", text)
    if not m:
        return None
    return int(m.group(1).replace(",", ""))


def parse_sold_results(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Card candidates — try common patterns.
    # Update these if Rightmove changes markup.
    card_selectors = [
        "div[class*='soldPrice']",
        "div[class*='SoldPrice']",
        "li[class*='soldPrice']",
        "div[class*='propertyCard']",
    ]

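    # Use the first selector that matches a plausible number of cards.
    # The threshold of 5 filters out selectors that only hit a stray wrapper,
    # but it also means pages with fewer than 5 results yield nothing;
    # lower it if your searches are narrow.
    cards = []
    for sel in card_selectors:
        found = soup.select(sel)
        if len(found) >= 5:
            cards = found
            break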

    out = []
    for c in cards:
        # Address
        address = None
        for sel in ["address", "h2", "h3", "span[class*='address']"]:
            el = c.select_one(sel)
            t = clean_text(el.get_text(" ", strip=True) if el else None)
            if t and len(t) > 6:
                address = t
                break

        # Price: selecting elements by their text with CSS is brittle,
        # so scan the card text for the first £ value instead.
        money = parse_money(c.get_text(" ", strip=True))

        # Sold date: prefer an explicit "Sold on 12 Jan 2024".
        sold_date = None
        txt = clean_text(c.get_text(" ", strip=True)) or ""
        m = re.search(r"Sold\s+on\s+(\d{1,2}\s+[A-Za-z]+\s+\d{4})", txt)
        if m:
            sold_date = m.group(1)
        else:
            # Fall back to "12 Jan 2024" or "Jan 2024". Requiring a month name
            # avoids matching arbitrary "word + 4 digits" text in the card.
            months = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
            m2 = re.search(rf"\b((?:\d{{1,2}}\s+)?{months}\s+\d{{4}})\b", txt)
            sold_date = m2.group(1) if m2 else None

        # Details link (if present)
        link = None
        a = c.select_one("a[href]")
        if a and a.get("href"):
            href = a.get("href")
            if href.startswith("http"):
                link = href
            else:
                link = "https://www.rightmove.co.uk" + href

        out.append({
            "address": address,
            "sold_price_gbp": money,
            "sold_date": sold_date,
            "details_url": link,
        })

    # Filter obviously-bad rows
    out = [r for r in out if r.get("sold_price_gbp") and r.get("address")]
    return out
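The parser keeps `sold_date` as the raw string it found, which sorts badly in a spreadsheet. Below is a minimal sketch of a normaliser, assuming the two date shapes mentioned in the parser comments ("12 Jan 2024" and "Jan 2024") and English month names. `to_iso_date` is a hypothetical helper, not part of the scraper above, and `strptime` month parsing depends on the process locale being English or C.

```python
from datetime import datetime
from typing import Optional

# (input format, output format) pairs, tried in order.
_FORMATS = (
    ("%d %b %Y", "%Y-%m-%d"),  # 12 Jan 2024
    ("%d %B %Y", "%Y-%m-%d"),  # 12 January 2024
    ("%b %Y", "%Y-%m"),        # Jan 2024 (month precision only)
    ("%B %Y", "%Y-%m"),        # January 2024
)


def to_iso_date(text: Optional[str]) -> Optional[str]:
    """Convert a scraped date string to ISO form, or None if unrecognised."""
    if not text:
        return None
    text = " ".join(text.split())  # collapse stray whitespace
    for in_fmt, out_fmt in _FORMATS:
        try:
            return datetime.strptime(text, in_fmt).strftime(out_fmt)
        except ValueError:
            continue
    return None


print(to_iso_date("12 Jan 2024"))  # 2024-01-12
print(to_iso_date("Jan 2024"))     # 2024-01
```

Month-only dates are kept at month precision instead of being padded with a made-up day.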

Why this “selector-light” approach works

For production scraping, you want the fewest selectors you can maintain.

If you tie your scraper to 12 class names, a minor CSS refactor breaks you.
If you identify cards in a resilient way and extract values from text, you have fewer moving parts.
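To make the point concrete, here is a small sketch showing text-based extraction surviving a markup change. Both HTML fragments are invented for illustration (they are not real Rightmove markup), and the regex-based tag stripping is a demo shortcut; in the scraper itself BeautifulSoup's `get_text` does that job.

```python
import re


def card_text(fragment: str) -> str:
    """Strip tags from an HTML fragment and collapse whitespace (demo only)."""
    text = re.sub(r"(?s)<[^>]+>", " ", fragment)
    return re.sub(r"\s+", " ", text).strip()


def extract_price_and_date(fragment: str) -> tuple:
    """Pull (price_in_gbp, date_string) out of the card's text, ignoring markup."""
    text = card_text(fragment)
    price = re.search(r"£\s*([\d,]+)", text)
    date = re.search(r"(\d{1,2}\s+[A-Za-z]{3,9}\s+\d{4})", text)
    return (
        int(price.group(1).replace(",", "")) if price else None,
        date.group(1) if date else None,
    )


# Hypothetical "before" and "after" markup for the same sale.
old_markup = (
    '<div class="soldPrice"><span class="price">£425,000</span>'
    '<span class="date">12 Jan 2024</span></div>'
)
new_markup = (
    '<li class="card-v2"><b data-x="p">£425,000</b> '
    "<em>Sold on 12 Jan 2024</em></li>"
)

print(extract_price_and_date(old_markup))  # (425000, '12 Jan 2024')
print(extract_price_and_date(new_markup))  # (425000, '12 Jan 2024')
```

Only the card container selector would need updating after a change like this; the field extraction is untouched.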

Step 2: Pagination (crawl multiple result pages)

The crawl shape is:

Fetch start URL
Parse result cards
Find “next page” URL
Repeat until you hit page limit / no next link

Because pagination markup changes, we’ll implement a couple of strategies:

look for a link whose rel/name indicates next
fallback to searching for “Next” anchor text

from urllib.parse import urljoin


def find_next_page_url(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Strategy 1: rel=next
    a = soup.select_one("a[rel='next'][href]")
    if a:
        return urljoin(current_url, a.get("href"))

    # Strategy 2: explicit 'Next' label
    for a in soup.select("a[href]"):
        t = (a.get_text(" ", strip=True) or "").lower()
        if t in ("next", "next page", "next >", ">"):
            return urljoin(current_url, a.get("href"))

    return None

And the crawl loop:

import csv


def crawl_sold_prices(start_url: str, max_pages: int = 10) -> list[dict]:
    url = start_url
    page = 0
    seen = set()     # dedupe keys for rows already collected
    visited = set()  # URLs already fetched; guards against pagination loops
    all_rows: list[dict] = []

    while url and url not in visited and page < max_pages:
        visited.add(url)
        page += 1
        html = fetch_html(url)
        rows = parse_sold_results(html)

        new_count = 0
        for r in rows:
            key = (r.get("address"), r.get("sold_price_gbp"), r.get("sold_date"))
            if key in seen:
                continue
            seen.add(key)
            all_rows.append(r)
            new_count += 1

        print(f"page {page}: parsed={len(rows)} new={new_count} total={len(all_rows)}")

        url = find_next_page_url(html, url)

        # polite pacing between pages (tune for your use case)
        if url:
            time.sleep(random.uniform(1.0, 2.5))

    return all_rows


def write_csv(rows: list[dict], path: str) -> None:
    cols = ["address", "sold_price_gbp", "sold_date", "details_url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in cols})


if __name__ == "__main__":
    START = "https://www.rightmove.co.uk/house-prices.html"  # replace with a Sold Prices result URL
    data = crawl_sold_prices(START, max_pages=5)
    write_csv(data, "rightmove_sold_prices.csv")
    print("wrote rightmove_sold_prices.csv", len(data))
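The dedupe key in the crawl loop compares addresses exactly, so the same sale can slip through twice if one page renders the address with different casing or punctuation. A minimal sketch of a more forgiving key follows; `dedupe_key` and `dedupe` are hypothetical helpers, and the normalisation rules (case-fold, punctuation to spaces, collapsed whitespace) are an assumption about what counts as "the same address", not a rule from Rightmove.

```python
import re


def dedupe_key(row: dict) -> tuple:
    """Build a comparison key that ignores case, punctuation, and spacing."""
    address = (row.get("address") or "").casefold()
    address = re.sub(r"[^\w\s]", " ", address)      # punctuation -> space
    address = re.sub(r"\s+", " ", address).strip()  # collapse whitespace
    return (address, row.get("sold_price_gbp"), row.get("sold_date"))


def dedupe(rows: list) -> list:
    """Keep the first row for each key, preserving input order."""
    seen = set()
    out = []
    for row in rows:
        key = dedupe_key(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


# Invented sample rows: the first two describe the same sale.
rows = [
    {"address": "12 High Street, Leeds", "sold_price_gbp": 425000, "sold_date": "12 Jan 2024"},
    {"address": "12 high street  Leeds", "sold_price_gbp": 425000, "sold_date": "12 Jan 2024"},
    {"address": "12 High Street, Leeds", "sold_price_gbp": 300000, "sold_date": "3 Mar 2015"},
]
print(len(dedupe(rows)))  # 2
```

Price and date stay in the key on purpose: the same address legitimately appears once per sale.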

Screenshot proof (why it matters)

When you’re building a dataset pipeline, screenshots are useful for:

auditing what your scraper saw on day 0
debugging when parsing drops (markup changed)
sharing evidence with stakeholders

In this post we captured the Sold Prices results page:

/public/images/posts/scrape-rightmove-sold-prices-dataset/rightmove-sold-prices-results.jpg

Common Rightmove scraping pitfalls (and fixes)

1) You get intermittent 403/429

Fix:

add retries + exponential backoff
reduce concurrency
route traffic through a proxy layer (ProxiesAPI)
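To get a feel for what exponential backoff means in wall-clock time, the sketch below computes a generic capped schedule. It is a standalone illustration of the idea (double the delay each attempt, cap it, optionally add jitter), not a reproduction of tenacity's internals; `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 20.0,
                   jitter: float = 0.0, rng=None) -> list:
    """Seconds to wait after each failed attempt: base * 2**n, capped at `cap`."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))
        # Optional jitter spreads retries out so parallel workers don't sync up.
        delay += rng.uniform(0, jitter)
        delays.append(delay)
    return delays


print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 20.0]
```

Without jitter, six failed attempts add up to 51 seconds of waiting, which is worth knowing before you raise the attempt count on a long crawl.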

2) Your selectors stop matching

Fix:

log a small HTML sample on failure
keep selectors centralized (one file / one section)
prefer resilient extraction from text when possible
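For the "log a small HTML sample" fix, a sketch like the following is usually enough. It assumes you want one truncated file per URL in a local folder; `save_html_sample`, the file naming scheme, and the 20,000-character cap are all illustrative choices.

```python
import hashlib
import tempfile
from pathlib import Path


def save_html_sample(html: str, url: str, out_dir: str, max_chars: int = 20000) -> Path:
    """Write a truncated copy of the page to disk, named after a hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    path = Path(out_dir) / f"sample-{digest}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Record the source URL in a comment so the file is self-describing.
    path.write_text(f"<!-- source: {url} -->\n" + html[:max_chars], encoding="utf-8")
    return path


# Demo with a throwaway directory and a placeholder URL.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_html_sample("<html>" + "x" * 50 + "</html>", "https://example.com/p1", tmp, max_chars=20)
    saved_name = saved.name
    saved_body = saved.read_text(encoding="utf-8")
    print(saved_name, len(saved_body))
```

Call it from the crawl loop whenever a page parses to zero rows, so you can compare the saved markup against your selectors afterwards.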

3) Pagination is inconsistent

Fix:

implement multiple “find next” strategies
cap pages per run
maintain a queue of discovered URLs
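The "queue of discovered URLs" and "cap pages per run" fixes combine naturally into one small structure. Here is a minimal sketch; the `Frontier` class and its method names are hypothetical, and it assumes URLs can be compared as plain strings (no normalisation of query parameter order).

```python
from collections import deque


class Frontier:
    """FIFO queue of URLs to fetch, with dedupe and a per-run page budget."""

    def __init__(self, start_urls, max_pages: int):
        self._queue = deque()
        self._seen = set()
        self._budget = max_pages
        for url in start_urls:
            self.add(url)

    def add(self, url) -> bool:
        """Queue a URL unless it is empty or was already queued. Returns True if added."""
        if not url or url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next(self):
        """Return the next URL to fetch, or None when the queue or budget is exhausted."""
        if self._budget <= 0 or not self._queue:
            return None
        self._budget -= 1
        return self._queue.popleft()


frontier = Frontier(["https://example.com/page1"], max_pages=2)
first = frontier.next()
frontier.add("https://example.com/page1")  # ignored: already seen
frontier.add("https://example.com/page2")
frontier.add("https://example.com/page3")
second = frontier.next()
third = frontier.next()  # None: budget of 2 pages is spent
print(first, second, third)
```

Every "find next" strategy can feed its result into `add`, so it no longer matters which strategy found the link or whether two of them found the same one.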

Where ProxiesAPI fits (honestly)

Rightmove (like many property portals) can be sensitive to:

repeated requests from one IP
bursty traffic patterns
long crawls that run for hours

ProxiesAPI doesn’t magically guarantee access, but it improves crawl stability by:

giving you a consistent proxy endpoint
enabling IP rotation (depending on your plan/config)
reducing the impact of per-IP throttling

You still need good scraping hygiene: timeouts, retries, pacing, and respectful volume.

QA checklist

Start URL loads in a browser
First page yields at least 10 rows with price + address
Pagination advances and the total row count increases
CSV opens cleanly in Excel/Sheets
On network failure, your retries recover
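Part of this checklist can be automated. The sketch below checks the exported CSV for the "at least 10 rows with price + address" item; it assumes the column names written by `write_csv` above, while `check_rows` and its messages are hypothetical.

```python
import csv
import io

REQUIRED = ("address", "sold_price_gbp")


def check_rows(csv_file, min_rows: int = 10) -> list:
    """Return human-readable problems; an empty list means the file passes."""
    reader = csv.DictReader(csv_file)
    missing_cols = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing_cols:
        return [f"missing columns: {missing_cols}"]

    good = 0
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        address = (row["address"] or "").strip()
        price = (row["sold_price_gbp"] or "").strip()
        if not address:
            problems.append(f"line {line_no}: empty address")
        elif not price.isdigit():
            problems.append(f"line {line_no}: bad price {price!r}")
        else:
            good += 1
    if good < min_rows:
        problems.append(f"only {good} complete rows, expected at least {min_rows}")
    return problems


# Invented sample data: one good row, one without an address, one with a bad price.
sample = (
    "address,sold_price_gbp,sold_date,details_url\n"
    "1 High St,425000,12 Jan 2024,\n"
    ",310000,Jan 2024,\n"
    "2 High St,abc,,\n"
)
for problem in check_rows(io.StringIO(sample), min_rows=2):
    print(problem)
```

To run it against the real export, pass `open("rightmove_sold_prices.csv", newline="", encoding="utf-8")` instead of the `StringIO` object.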

Next upgrades

Add a details-page fetch (bedrooms, tenure, agent, EPC) with a second-stage crawler
Store into SQLite (incremental updates)
Add a “changed since last run” diff so you only process new transactions
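The SQLite and "changed since last run" upgrades can share one mechanism: a unique constraint plus `INSERT OR IGNORE`, so each run reports only the rows it had not stored before. This is a minimal sketch; the table name, schema, and `store_new_rows` helper are assumptions, and a missing `sold_date` is stored as an empty string because SQLite treats NULLs as distinct inside a UNIQUE constraint.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sold_prices (
    address        TEXT    NOT NULL,
    sold_price_gbp INTEGER NOT NULL,
    sold_date      TEXT    NOT NULL DEFAULT '',
    details_url    TEXT,
    UNIQUE (address, sold_price_gbp, sold_date)
)
"""


def store_new_rows(conn: sqlite3.Connection, rows: list) -> list:
    """Insert rows, skipping ones already stored. Returns only the new rows."""
    conn.execute(SCHEMA)
    new_rows = []
    for r in rows:
        cur = conn.execute(
            "INSERT OR IGNORE INTO sold_prices "
            "(address, sold_price_gbp, sold_date, details_url) VALUES (?, ?, ?, ?)",
            (r["address"], r["sold_price_gbp"], r.get("sold_date") or "", r.get("details_url")),
        )
        if cur.rowcount == 1:  # 0 means the unique constraint ignored the row
            new_rows.append(r)
    conn.commit()
    return new_rows


# Demo with an in-memory database and invented rows.
conn = sqlite3.connect(":memory:")
run1 = [
    {"address": "1 High St", "sold_price_gbp": 425000, "sold_date": "12 Jan 2024"},
    {"address": "2 High St", "sold_price_gbp": 310000, "sold_date": None},
]
run2 = run1 + [{"address": "3 High St", "sold_price_gbp": 99950, "sold_date": "Feb 2024"}]

new_first = store_new_rows(conn, run1)
new_second = store_new_rows(conn, run2)
print(len(new_first), len(new_second))  # 2 1
```

In a real pipeline you would connect to a file path instead of `:memory:` and pass the output of `crawl_sold_prices` as `rows`.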
