Scrape UK Property Prices from Rightmove (Sold Prices Dataset Builder + Screenshots)

Rightmove is one of the most useful sources for UK property research. The challenge is that the Sold Prices experience is built for humans (search, filters, pagination), not for exporting a dataset.

In this guide you’ll build a repeatable dataset builder that:

  • starts from a Sold Prices search URL
  • follows pagination safely
  • extracts sold-price cards (address, price, sold date, property type where available)
  • de-duplicates results
  • exports a clean CSV you can load into a spreadsheet / database

We’ll keep the scraping honest:

  • we’ll parse real server-rendered HTML (no “guessing” selectors)
  • we’ll use timeouts + retries
  • we’ll show exactly where ProxiesAPI fits (network stability), without pretending it “unblocks everything”

Rightmove Sold Prices search results (we’ll scrape listing cards + pagination)

Keep Rightmove crawls stable with ProxiesAPI

Property sites can rate-limit, geo-fence, or intermittently block requests. ProxiesAPI gives you a consistent proxy layer so your dataset jobs finish reliably, even as URL counts grow.


What we’re scraping (Rightmove Sold Prices)

Rightmove has multiple surfaces. For this tutorial we focus on:

  • Sold Prices search results pages (multiple result cards)
  • pagination (next page / page index)

You’ll typically start from a URL you can produce manually by applying filters in your browser.

Quick sanity check (HTML is there)

Before writing any parser, confirm Rightmove returns HTML (not a blank JS shell):

curl -sL "https://www.rightmove.co.uk/house-prices.html" | head -n 5

If you get an HTML document, you can parse it with BeautifulSoup.
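
The same check from Python, if you'd rather stay in one language. This is a minimal sketch: it only confirms the response looks like server-rendered HTML, not that any particular selector below will match.

import requests

resp = requests.get(
    "https://www.rightmove.co.uk/house-prices.html",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)
resp.raise_for_status()

# A server-rendered page contains real markup, not an empty JS shell.
print(resp.text[:200])
print("looks like HTML:", "<html" in resp.text.lower())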

Note: Rightmove’s exact Sold Prices URLs can change over time. The scraper below is structured so you only need to update a small set of CSS selectors if Rightmove tweaks markup.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup with the lxml parser for stable parsing
  • tenacity for retry/backoff

The fetch layer (with ProxiesAPI + retries)

Most scraping failures are network-ish:

  • transient 5xx
  • timeouts
  • occasional 403/429 bursts

So we make fetching robust first.

Option A: Direct requests (baseline)

import random
import time
import requests

TIMEOUT = (10, 30)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0 Safari/537.36"
    ),
    "Accept-Language": "en-GB,en;q=0.9",
})


def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True)
    r.raise_for_status()
    return r.text

Option B: Same fetch, routed via ProxiesAPI

ProxiesAPI is typically used by pointing your HTTP client at a proxy endpoint.

Because proxy providers differ in exact connection details, this snippet is intentionally “drop-in”: you set your proxy URL in an environment variable and the rest of your code stays the same.

import os

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")  # e.g. http://USER:PASS@gateway.proxiesapi.com:1234

proxies = None
if PROXY_URL:
    proxies = {
        "http": PROXY_URL,
        "https": PROXY_URL,
    }


def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)
    r.raise_for_status()
    return r.text

If PROXIESAPI_PROXY_URL is not set, you'll run direct.

Either way, wrap fetch_html in retries with exponential backoff, so transient failures don't kill a long crawl:

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)

    # Treat common block-ish responses as retryable.
    if r.status_code in (403, 429, 500, 502, 503):
        raise requests.HTTPError(f"status={r.status_code}")

    r.raise_for_status()
    return r.text

Step 1: Identify listing cards + fields

Rightmove’s Sold Prices results are laid out as repeated “cards” (list items / divs) containing:

  • address text
  • sold price
  • sold date (or transaction date)
  • a details link

Because markup changes, don’t hardcode one brittle selector and pray.

Instead, build your parser around:

  1. finding the card container
  2. extracting fields using a small set of fallback selectors

Here’s a practical parser you can adapt quickly.

import re
from bs4 import BeautifulSoup


def clean_text(s: str | None) -> str | None:
    if not s:
        return None
    s = re.sub(r"\s+", " ", s).strip()
    return s or None


def parse_money(text: str | None) -> int | None:
    if not text:
        return None
    # e.g. "£425,000" → 425000
    m = re.search(r"£\s*([\d,]+)", text)
    if not m:
        return None
    return int(m.group(1).replace(",", ""))


def parse_sold_results(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Card candidates — try common patterns.
    # Update these if Rightmove changes markup.
    card_selectors = [
        "div[class*='soldPrice']",
        "div[class*='SoldPrice']",
        "li[class*='soldPrice']",
        "div[class*='propertyCard']",
    ]

    cards = []
    for sel in card_selectors:
        found = soup.select(sel)
        # A real results page has many cards; require a handful of matches so a
        # false positive (a selector matching one stray element) isn't picked up.
        if len(found) >= 5:
            cards = found
            break

    out = []
    for c in cards:
        # Address
        address = None
        for sel in ["address", "h2", "h3", "span[class*='address']"]:
            el = c.select_one(sel)
            t = clean_text(el.get_text(" ", strip=True) if el else None)
            if t and len(t) > 6:
                address = t
                break

        # Price: the CSS ':contains' pseudo-class isn't reliably supported
        # across parsers, so scan the card's full text for a £ value instead.
        money = parse_money(c.get_text(" ", strip=True))

        # Sold date — look for something that resembles a month/year.
        sold_date = None
        txt = clean_text(c.get_text(" ", strip=True)) or ""
        # Example patterns: "Sold on 12 Jan 2024" or "Jan 2024"
        m = re.search(r"Sold\s+on\s+([0-9]{1,2}\s+\w+\s+\d{4})", txt)
        if m:
            sold_date = m.group(1)
        else:
            m2 = re.search(r"\b(\w+\s+\d{4})\b", txt)
            sold_date = m2.group(1) if m2 else None

        # Details link (if present)
        link = None
        a = c.select_one("a[href]")
        if a and a.get("href"):
            href = a.get("href")
            if href.startswith("http"):
                link = href
            else:
                link = "https://www.rightmove.co.uk" + href

        out.append({
            "address": address,
            "sold_price_gbp": money,
            "sold_date": sold_date,
            "details_url": link,
        })

    # Filter obviously-bad rows
    out = [r for r in out if r.get("sold_price_gbp") and r.get("address")]
    return out
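
Before wiring up pagination, it's worth a quick smoke test of the parser on a single live page. This assumes the fetch layer and the parser live in the same module:

html = fetch_html("https://www.rightmove.co.uk/house-prices.html")  # swap in your Sold Prices result URL
rows = parse_sold_results(html)
for row in rows[:5]:
    print(row)
print(f"parsed {len(rows)} rows")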

Why this “selector-light” approach works

For production scraping, you want the fewest selectors you can maintain.

  • If you tie your scraper to 12 class names, a minor CSS refactor breaks you.
  • If you identify cards in a resilient way and extract values from text, you have fewer moving parts.
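
One cheap way to keep that small selector set honest is a fixture test: save one real results page to disk and re-run the parser against it whenever you touch a selector. A sketch, where fixtures/sold_results_page1.html is a hypothetical file you'd save from an earlier crawl:

from pathlib import Path


def test_parser_against_fixture() -> None:
    # Re-run this after any selector change; it catches silent breakage.
    html = Path("fixtures/sold_results_page1.html").read_text(encoding="utf-8")
    rows = parse_sold_results(html)

    assert len(rows) >= 5, "card selector no longer matches"
    assert all(r["sold_price_gbp"] for r in rows), "price extraction broke"
    assert all(r["address"] for r in rows), "address extraction broke"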

Step 2: Pagination (crawl multiple result pages)

The crawl shape is:

  1. Fetch start URL
  2. Parse result cards
  3. Find “next page” URL
  4. Repeat until you hit page limit / no next link

Because pagination markup changes, we’ll implement a couple of strategies:

  • look for a link whose rel/name indicates next
  • fall back to searching for “Next” anchor text

from urllib.parse import urljoin


def find_next_page_url(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Strategy 1: rel=next
    a = soup.select_one("a[rel='next'][href]")
    if a:
        return urljoin(current_url, a.get("href"))

    # Strategy 2: explicit 'Next' label
    for a in soup.select("a[href]"):
        t = (a.get_text(" ", strip=True) or "").lower()
        if t in ("next", "next page", "next >", ">"):
            return urljoin(current_url, a.get("href"))

    return None

And the crawl loop:

import csv


def crawl_sold_prices(start_url: str, max_pages: int = 10) -> list[dict]:
    url = start_url
    page = 0
    seen = set()
    all_rows: list[dict] = []

    while url and page < max_pages:
        page += 1
        html = fetch_html(url)
        rows = parse_sold_results(html)

        new_count = 0
        for r in rows:
            key = (r.get("address"), r.get("sold_price_gbp"), r.get("sold_date"))
            if key in seen:
                continue
            seen.add(key)
            all_rows.append(r)
            new_count += 1

        print(f"page {page}: parsed={len(rows)} new={new_count} total={len(all_rows)}")

        url = find_next_page_url(html, url)

        # polite pacing (tune for your use case)
        time.sleep(random.uniform(1.0, 2.5))

    return all_rows


def write_csv(rows: list[dict], path: str) -> None:
    cols = ["address", "sold_price_gbp", "sold_date", "details_url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in cols})


if __name__ == "__main__":
    START = "https://www.rightmove.co.uk/house-prices.html"  # replace with a Sold Prices result URL
    data = crawl_sold_prices(START, max_pages=5)
    write_csv(data, "rightmove_sold_prices.csv")
    print("wrote rightmove_sold_prices.csv", len(data))

Screenshot proof (why it matters)

When you’re building a dataset pipeline, screenshots are useful for:

  • auditing what your scraper saw on day 0
  • debugging when parsing drops (markup changed)
  • sharing evidence with stakeholders

In this post we captured the Sold Prices results page:

  • /public/images/posts/scrape-rightmove-sold-prices-dataset/rightmove-sold-prices-results.jpg
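
If you want to capture that kind of proof as part of the run, a headless browser does it in a few lines. Here's a sketch using Playwright (an extra dependency: pip install playwright, then playwright install chromium; it isn't used elsewhere in this pipeline):

from playwright.sync_api import sync_playwright


def capture_screenshot(url: str, path: str) -> None:
    # Render the page once, then save a full-page screenshot for auditing.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=path, full_page=True)
        browser.close()


capture_screenshot(
    "https://www.rightmove.co.uk/house-prices.html",
    "rightmove-sold-prices-results.png",
)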

Common Rightmove scraping pitfalls (and fixes)

1) You get intermittent 403/429

Fix:

  • add retries + exponential backoff
  • reduce concurrency
  • route traffic through a proxy layer (ProxiesAPI)

2) Your selectors stop matching

Fix:

  • log a small HTML sample on failure (see the sketch after this list)
  • keep selectors centralized (one file / one section)
  • prefer resilient extraction from text when possible
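
A minimal way to do the first of those: when a page parses to zero rows, dump a trimmed, timestamped copy of the HTML so you can diff it against what your selectors expect. A sketch (the debug/ directory and filename scheme are arbitrary choices):

from datetime import datetime, timezone
from pathlib import Path


def dump_html_sample(html: str, page_num: int) -> None:
    # Keep a short sample of what the markup looked like at failure time.
    Path("debug").mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    Path(f"debug/page{page_num}_{stamp}.html").write_text(html[:20000], encoding="utf-8")

# Inside the crawl loop:
# rows = parse_sold_results(html)
# if not rows:
#     dump_html_sample(html, page)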

3) Pagination is inconsistent

Fix:

  • implement multiple “find next” strategies
  • cap pages per run
  • maintain a queue of discovered URLs
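
The queue idea can look like this: instead of following a single “next” link, keep a small frontier of discovered page URLs so one missing link doesn't end the crawl. A sketch that reuses fetch_html, parse_sold_results, and find_next_page_url from above (no pacing or row dedupe shown here):

from collections import deque


def crawl_with_queue(start_url: str, max_pages: int = 10) -> list[dict]:
    # Simple frontier: enqueue every next-page URL we discover, visit each once.
    frontier = deque([start_url])
    visited: set[str] = set()
    rows: list[dict] = []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        html = fetch_html(url)
        rows.extend(parse_sold_results(html))

        nxt = find_next_page_url(html, url)
        if nxt and nxt not in visited:
            frontier.append(nxt)

    return rows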

Where ProxiesAPI fits (honestly)

Rightmove (like many property portals) can be sensitive to:

  • repeated requests from one IP
  • bursty traffic patterns
  • long crawls that run for hours

ProxiesAPI doesn’t magically guarantee access, but it improves crawl stability by:

  • giving you a consistent proxy endpoint
  • enabling IP rotation (depending on your plan/config)
  • reducing the impact of per-IP throttling

You still need good scraping hygiene: timeouts, retries, pacing, and respectful volume.


QA checklist

  • Start URL loads in a browser
  • First page yields at least 10 rows with price + address
  • Pagination advances and total rows increases
  • CSV opens cleanly in Excel/Sheets
  • On network failure, your retries recover

Next upgrades

  • Add a details-page fetch (bedrooms, tenure, agent, EPC) with a second-stage crawler
  • Store into SQLite for incremental updates (see the sketch below)
  • Add a “changed since last run” diff so you only process new transactions
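
A minimal sketch of the SQLite step, assuming the row shape produced above (the table name and the unique key are choices for this example, not anything Rightmove defines):

import sqlite3


def upsert_rows(rows: list[dict], db_path: str = "rightmove.db") -> None:
    # Incremental storage: a UNIQUE constraint on (address, price, date) means
    # re-running the crawl only inserts transactions we haven't stored yet.
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sold_prices (
            address TEXT,
            sold_price_gbp INTEGER,
            sold_date TEXT,
            details_url TEXT,
            UNIQUE (address, sold_price_gbp, sold_date)
        )
        """
    )
    conn.executemany(
        """
        INSERT OR IGNORE INTO sold_prices
            (address, sold_price_gbp, sold_date, details_url)
        VALUES (:address, :sold_price_gbp, :sold_date, :details_url)
        """,
        rows,
    )
    conn.commit()
    conn.close()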
