Scrape Rightmove Sold Prices (Second Angle): Price History Dataset Builder

Rightmove sold prices are one of the most useful public datasets for UK property analysis.

If you’re building anything in proptech—valuation models, neighborhood dashboards, lead scoring, market trend reports—the shape of the problem is always the same:

  • collect sold listings across many postcodes/areas
  • normalize and de-duplicate
  • refresh regularly (incremental updates)
  • keep it reliable under rate limits and transient blocks

This guide shows a practical “dataset builder” approach.

You’ll end up with:

  • a crawler that paginates sold listings for an area
  • a normalized record schema
  • a SQLite database to dedupe + support incremental runs
  • an exporter to CSV

Rightmove sold prices flow (example)

Scale Rightmove crawling more safely with ProxiesAPI

Rightmove is a high-value dataset and a high-friction target. ProxiesAPI helps keep your crawl stable as you paginate, refresh areas, and run nightly incremental updates.


What we’re scraping (and why it’s tricky)

Rightmove pages can vary based on:

  • geo / consent flows
  • A/B tests
  • anti-bot measures

So we’ll use two principles:

  1. Screenshot-first: capture the pages you’re targeting so selector changes are easy to debug.
  2. Stability-first: retries, timeouts, dedupe, and incremental updates.

This tutorial focuses on HTML parsing patterns. If you find the content is loaded via XHR in your region, you can adapt the “fetch + parse + store” pipeline to that endpoint too.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv

Create .env:

PROXIESAPI_KEY=your_api_key_here

Step 1: Pick an area URL and take a screenshot

Rightmove sold listings are usually discoverable via the UI:

  • search an area (postcode/town)
  • filter to sold prices (note: “Sold STC” is a live-listing status; historical sold prices live under Rightmove’s House Prices section)

Save a screenshot of the sold-price listing page.



Step 2: A robust ProxiesAPI fetch function

import os
import time
import random
from urllib.parse import quote

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class TransientHTTPError(RuntimeError):
    pass


def proxiesapi_url(target_url: str, api_key: str) -> str:
    return f"https://api.proxiesapi.com/?auth_key={api_key}&url={quote(target_url, safe='')}"


@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, TransientHTTPError)),
)
def fetch_html(url: str, api_key: str, session: requests.Session | None = None) -> str:
    s = session or requests.Session()

    time.sleep(random.uniform(0.3, 0.9))

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.9",
    }

    gateway = proxiesapi_url(url, api_key)
    r = s.get(gateway, headers=headers, timeout=(10, 40))

    if r.status_code in (403, 408, 429, 500, 502, 503, 504):
        raise TransientHTTPError(f"Transient status {r.status_code}")

    r.raise_for_status()
    return r.text

Step 3: Extract sold listings from the HTML

Rightmove’s markup changes, so we’ll design for resilience:

  • don’t rely on “generated” classnames
  • prefer embedded JSON blocks when present
  • fall back to HTML scanning

Option A: Parse embedded JSON (preferred)

Many listing pages embed a JSON blob (often in a <script> tag) containing listing cards.

Here’s a helper that searches for large JSON objects and then extracts listing-like entries.

import json
import re
from bs4 import BeautifulSoup


def find_json_blobs(html: str) -> list[dict]:
    blobs = []
    for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        body = m.group(1).strip()
        if len(body) < 2000:
            continue

        # Try the whole body as JSON first (e.g. <script type="application/json">)
        try:
            j = json.loads(body)
        except ValueError:
            j = None
        if isinstance(j, dict):
            blobs.append(j)
            continue

        # Fall back: strip a `window.foo = {...};` style assignment wrapper,
        # which is how most app-state blobs are embedded
        m2 = re.match(r"^[\w.$\s]+=\s*(\{.*\})\s*;?\s*$", body, flags=re.S)
        if m2:
            try:
                j = json.loads(m2.group(1))
            except ValueError:
                j = None
            if isinstance(j, dict):
                blobs.append(j)

    return blobs


def normalize_price(price_text: str | None) -> int | None:
    if not price_text:
        return None
    m = re.search(r"([0-9][0-9,]*)", price_text.replace("£", ""))
    return int(m.group(1).replace(",", "")) if m else None


def extract_listings_best_effort(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # 1) Try JSON blobs
    for blob in find_json_blobs(html):
        # This is intentionally heuristic—Rightmove blob schema can vary.
        # We look for any dicts containing keys that smell like a property card.
        stack = [blob]
        while stack:
            cur = stack.pop()
            if isinstance(cur, dict):
                keys = set(cur.keys())
                if {"price", "displayAddress"}.issubset(keys) or {"displayAddress", "propertyUrl"}.issubset(keys):
                    price = cur.get("price")
                    price_text = None
                    if isinstance(price, dict):
                        display_prices = price.get("displayPrices") or []
                        if display_prices:
                            price_text = display_prices[0].get("displayPrice")

                    listing = {
                        "address": cur.get("displayAddress"),
                        "price_text": price_text,
                        "price_gbp": normalize_price(price_text),
                        "property_url": cur.get("propertyUrl") or cur.get("property_url"),
                        "id": cur.get("id") or cur.get("propertyId"),
                    }
                    # only keep if we have something meaningful
                    if listing.get("address") and (listing.get("property_url") or listing.get("id")):
                        return [listing]  # a minimal proof; we’ll rely on the HTML fallback for full pages

                for v in cur.values():
                    if isinstance(v, (dict, list)):
                        stack.append(v)
            elif isinstance(cur, list):
                stack.extend(cur)

    # 2) HTML fallback: find card links
    out = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if "property-for-sale" in href or "properties" in href or "property" in href:
            text = a.get_text(" ", strip=True)
            if not text:
                continue
            out.append({"link_text": text, "href": href})

    return out

This shows the pattern: attempt JSON first, then fall back.

In your real run, you’ll likely adapt this extractor to the exact blob key structure you see in your screenshot/HTML.


Step 4: Store in SQLite for dedupe + incremental updates

The simplest way to do incremental dataset building is SQLite.

We’ll store a normalized table keyed by a stable identifier (property URL or listing ID).

import sqlite3
from datetime import datetime, timezone


def init_db(path: str = "rightmove_sold.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sold_listings (
            id TEXT PRIMARY KEY,
            address TEXT,
            price_gbp INTEGER,
            price_text TEXT,
            property_url TEXT,
            first_seen TEXT,
            last_seen TEXT
        )
        """
    )
    return conn


def upsert_listing(conn: sqlite3.Connection, row: dict):
    now = datetime.now(timezone.utc).isoformat()  # utcnow() is deprecated in Python 3.12+
    lid = row.get("id") or row.get("property_url")
    if not lid:
        return

    conn.execute(
        """
        INSERT INTO sold_listings (id, address, price_gbp, price_text, property_url, first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
          address=excluded.address,
          price_gbp=excluded.price_gbp,
          price_text=excluded.price_text,
          property_url=excluded.property_url,
          last_seen=excluded.last_seen
        """,
        (
            lid,
            row.get("address"),
            row.get("price_gbp"),
            row.get("price_text"),
            row.get("property_url"),
            now,
            now,
        ),
    )
    conn.commit()

Why this works

  • First run: inserts everything.
  • Next run: updates last_seen and any changed fields.
  • You can later detect deltas (new listings since yesterday) via first_seen.
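The delta query relies on ISO-8601 timestamps comparing correctly as plain text. Here is a self-contained sketch against an in-memory copy of the Step 4 schema, seeded with two fake rows (one seen a week ago, one first seen today):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# In-memory copy of the sold_listings schema from Step 4
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sold_listings (
        id TEXT PRIMARY KEY,
        address TEXT,
        price_gbp INTEGER,
        price_text TEXT,
        property_url TEXT,
        first_seen TEXT,
        last_seen TEXT
    )
    """
)

now = datetime.now(timezone.utc)
week_ago = (now - timedelta(days=7)).isoformat()
today = now.isoformat()
conn.execute("INSERT INTO sold_listings (id, first_seen, last_seen) VALUES (?, ?, ?)",
             ("old-1", week_ago, today))
conn.execute("INSERT INTO sold_listings (id, first_seen, last_seen) VALUES (?, ?, ?)",
             ("new-1", today, today))

# "New since yesterday": ISO-8601 strings sort chronologically, so
# a plain text comparison is enough
cutoff = (now - timedelta(days=1)).isoformat()
new_ids = [r[0] for r in conn.execute(
    "SELECT id FROM sold_listings WHERE first_seen > ?", (cutoff,))]
print(new_ids)  # ['new-1']
```

The same query, run against your real database after a nightly crawl, gives you the "listings that appeared since yesterday" report for free.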

Step 5: Pagination strategy

Rightmove listing pages typically support pagination.

Your exact pagination parameters may be:

  • a ?index= offset
  • a ?page= number
  • an internal path segment

Use your browser’s address bar while paging to learn the URL pattern.

Then implement:

from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def with_query(url: str, **params) -> str:
    u = urlparse(url)
    q = parse_qs(u.query)
    for k, v in params.items():
        q[k] = [str(v)]
    new_q = urlencode(q, doseq=True)  # doseq preserves repeated query params
    return urlunparse((u.scheme, u.netloc, u.path, u.params, new_q, u.fragment))


def crawl_pages(base_url: str, pages: int, api_key: str):
    session = requests.Session()
    for p in range(1, pages + 1):
        url = with_query(base_url, page=p)
        html = fetch_html(url, api_key, session=session)
        yield p, html

Adjust page to whatever parameter your screenshot reveals.
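If your URLs use an offset parameter rather than a page number, the same crawl loop works with a small conversion helper. This sketch assumes an index= offset that steps by 24 results per page; both the parameter name and the step size are assumptions to confirm against your own URLs (the helper is repeated from Step 5 so the sketch runs standalone):

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def with_query(url: str, **params) -> str:
    # Same helper as in Step 5, repeated so this sketch runs standalone.
    u = urlparse(url)
    q = parse_qs(u.query)
    for k, v in params.items():
        q[k] = [str(v)]
    return urlunparse((u.scheme, u.netloc, u.path, u.params, urlencode(q, doseq=True), u.fragment))


def page_to_index(page: int, page_size: int = 24) -> int:
    # Page 1 -> index 0, page 2 -> index 24, ... (page_size=24 is an assumption)
    return (page - 1) * page_size


url = "https://www.rightmove.co.uk/house-prices/example.html?soldIn=2"
print(with_query(url, index=page_to_index(3)))
# https://www.rightmove.co.uk/house-prices/example.html?soldIn=2&index=48
```

In crawl_pages you would then swap `with_query(base_url, page=p)` for `with_query(base_url, index=page_to_index(p))`.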


Step 6: Full runnable dataset builder

Putting it together:

import os
import csv

from dotenv import load_dotenv


def build_dataset(area_url: str, pages: int = 3, db_path: str = "rightmove_sold.db"):
    load_dotenv()  # picks up PROXIESAPI_KEY from the .env created in Setup
    api_key = os.environ.get("PROXIESAPI_KEY")
    if not api_key:
        raise RuntimeError("Missing PROXIESAPI_KEY (set it in .env or the environment)")

    conn = init_db(db_path)

    for p, html in crawl_pages(area_url, pages=pages, api_key=api_key):
        listings = extract_listings_best_effort(html)
        print("page", p, "items", len(listings))

        for row in listings:
            # if you're using the JSON extractor, ensure `id` or `property_url` exists
            upsert_listing(conn, row)

    print("done")


def export_csv(db_path: str = "rightmove_sold.db", out_path: str = "rightmove_sold.csv"):
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "SELECT id, address, price_gbp, price_text, property_url, first_seen, last_seen FROM sold_listings"
    )

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["id", "address", "price_gbp", "price_text", "property_url", "first_seen", "last_seen"])
        for row in cur:
            w.writerow(row)

    print("wrote", out_path)


if __name__ == "__main__":
    # Replace with a real sold-price area URL you captured in your screenshot
    AREA_URL = "https://www.rightmove.co.uk/house-prices.html"

    build_dataset(AREA_URL, pages=2)
    export_csv()

Practical advice: keep the dataset clean

A solid “price history dataset builder” lives on boring details:

  • Normalize addresses (casefold, strip punctuation, keep postcode separately)
  • Store raw + normalized fields
  • Keep first_seen/last_seen timestamps
  • De-dupe early using URL or a stable listing ID
  • Write small incremental runs (nightly) rather than giant re-crawls
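The first two bullets can be sketched as a small helper. This is a minimal sketch: the postcode regex below is a simplified approximation of UK postcode formats (good enough for extraction, not full validation), and the sample address is hypothetical:

```python
import re

# Simplified UK postcode pattern: outward code + inward code.
# An approximation, not the full official spec.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b", re.I)


def normalize_address(raw: str) -> dict:
    """Split a display address into raw, normalized, and postcode fields."""
    postcode = None
    rest = raw
    m = POSTCODE_RE.search(raw)
    if m:
        postcode = f"{m.group(1).upper()} {m.group(2).upper()}"
        rest = raw[: m.start()] + raw[m.end():]

    # Casefold, strip punctuation, collapse whitespace
    norm = re.sub(r"[^\w\s]", " ", rest).casefold()
    norm = re.sub(r"\s+", " ", norm).strip()

    return {"raw": raw, "normalized": norm, "postcode": postcode}


row = normalize_address("12, High Street, Guildford GU1 3AA")
print(row["normalized"], "|", row["postcode"])  # 12 high street guildford | GU1 3AA
```

Storing all three fields per row (raw, normalized, postcode) lets you re-run matching logic later without re-scraping.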

Where ProxiesAPI fits (honestly)

Rightmove can be unreliable without a stability layer.

ProxiesAPI helps by:

  • smoothing out transient 403/429 spikes
  • improving success rates on long pagination runs
  • giving you a consistent network interface while you iterate on parsing

It won’t remove the need for good engineering—timeouts, retries, dedupe—but it makes those efforts pay off at scale.


Related guides

  • Scrape Rightmove Sold Prices with Python: Sold Listings + Price History Dataset (with ProxiesAPI). Crawl sold-property results, paginate, fetch property detail pages, and normalize into a clean dataset.
  • Scrape Stock Prices and Financial Data with Python (Yahoo Finance) + ProxiesAPI. Build a daily stock-price dataset: quote pages, parsed fields, CSV/SQLite, with retries, proxy rotation, and polite pacing.
  • Scrape Book Data from Goodreads (Titles, Authors, Ratings, and Reviews). A practical Goodreads scraper in Python: title, author, rating count, review count, and key metadata, exported to JSON/CSV.
  • Scrape Live Stock Prices from Yahoo Finance (Python + ProxiesAPI). Parse price, change, and market cap, and export clean rows to CSV.