Scrape Rightmove Sold Prices: A Price History Dataset Builder
Rightmove sold prices are one of the most useful public datasets for UK property analysis.
If you’re building anything in proptech—valuation models, neighborhood dashboards, lead scoring, market trend reports—the shape of the problem is always the same:
- collect sold listings across many postcodes/areas
- normalize and de-duplicate
- refresh regularly (incremental updates)
- keep it reliable under rate limits and transient blocks
This guide shows a practical “dataset builder” approach.
You’ll end up with:
- a crawler that paginates sold listings for an area
- a normalized record schema
- a SQLite database to dedupe + support incremental runs
- an exporter to CSV

Rightmove is a high-value dataset and a high-friction target. ProxiesAPI helps keep your crawl stable as you paginate, refresh areas, and run nightly incremental updates.
What we’re scraping (and why it’s tricky)
Rightmove pages can vary based on:
- geo / consent flows
- A/B tests
- anti-bot measures
So we’ll use two principles:
- Screenshot-first: capture the pages you’re targeting so selector changes are easy to debug.
- Stability-first: retries, timeouts, dedupe, and incremental updates.
This tutorial focuses on HTML parsing patterns. If you find the content is loaded via XHR in your region, you can adapt the “fetch + parse + store” pipeline to that endpoint too.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv
Create .env:
PROXIESAPI_KEY=your_api_key_here
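python-dotenv (installed above) can load this with load_dotenv(); if you'd rather avoid the dependency, a minimal stdlib-only loader looks like this sketch (load_env_file is a hypothetical helper, not part of python-dotenv):

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines; blank lines and '#' comments ignored."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables win over .env values
        os.environ.setdefault(key.strip(), value.strip())


load_env_file()
api_key = os.environ.get("PROXIESAPI_KEY")
```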
Step 1: Pick an area URL and take a screenshot
Rightmove sold listings are usually discoverable via the UI:
- search an area (postcode/town)
- filter to Sold STC / Sold Prices
Save a screenshot of the sold-price listing page; when the markup changes later, diffing the live page against that screenshot makes selector debugging much faster.
Step 2: A robust ProxiesAPI fetch function
import os
import time
import random
from urllib.parse import quote

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class TransientHTTPError(RuntimeError):
    pass


def proxiesapi_url(target_url: str, api_key: str) -> str:
    # Percent-encode the whole target URL so its own query string
    # survives the trip through the gateway.
    return f"https://api.proxiesapi.com/?auth_key={api_key}&url={quote(target_url, safe='')}"


@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, TransientHTTPError)),
)
def fetch_html(url: str, api_key: str, session: requests.Session | None = None) -> str:
    s = session or requests.Session()
    # Small jitter between requests keeps the crawl polite and less bursty.
    time.sleep(random.uniform(0.3, 0.9))
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.9",
    }
    gateway = proxiesapi_url(url, api_key)
    r = s.get(gateway, headers=headers, timeout=(10, 40))
    if r.status_code in (403, 408, 429, 500, 502, 503, 504):
        # Treat these as transient so tenacity retries with backoff.
        raise TransientHTTPError(f"Transient status {r.status_code}")
    r.raise_for_status()
    return r.text
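A quick offline sanity check on the gateway helper confirms that the target URL's own query string gets percent-encoded, so it can't bleed into the gateway's parameters (the helper is repeated here so the snippet runs standalone; the example URL is just a stand-in):

```python
from urllib.parse import quote


def proxiesapi_url(target_url: str, api_key: str) -> str:
    # Same helper as above, repeated so this snippet runs standalone.
    return f"https://api.proxiesapi.com/?auth_key={api_key}&url={quote(target_url, safe='')}"


gateway = proxiesapi_url("https://example.com/a?b=1&c=2", "KEY")
# The '?', '=' and '&' of the target URL must arrive encoded:
assert "auth_key=KEY" in gateway
assert "%3Fb%3D1%26c%3D2" in gateway
```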
Step 3: Extract sold listings from the HTML
Rightmove’s markup changes, so we’ll design for resilience:
- don’t rely on “generated” classnames
- prefer embedded JSON blocks when present
- fall back to HTML scanning
Option A: Parse embedded JSON (preferred)
Many listing pages embed a JSON blob (often in a <script> tag) containing listing cards.
Here’s a helper that searches for large JSON objects and then extracts listing-like entries.
import json
import re

from bs4 import BeautifulSoup


def find_json_blobs(html: str) -> list[dict]:
    """Collect large <script> bodies that parse as JSON objects."""
    blobs = []
    for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        body = m.group(1).strip()
        if len(body) < 2000:
            continue  # skip small scripts; listing data blobs are big
        # Try direct JSON
        try:
            j = json.loads(body)
            if isinstance(j, dict):
                blobs.append(j)
        except Exception:
            pass
    return blobs


def normalize_price(price_text: str | None) -> int | None:
    if not price_text:
        return None
    m = re.search(r"([0-9][0-9,]*)", price_text.replace("£", ""))
    return int(m.group(1).replace(",", "")) if m else None


def extract_listings_best_effort(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # 1) Try JSON blobs
    for blob in find_json_blobs(html):
        # This is intentionally heuristic—Rightmove blob schema can vary.
        # We look for any dicts containing keys that smell like a property card.
        stack = [blob]
        while stack:
            cur = stack.pop()
            if isinstance(cur, dict):
                keys = set(cur.keys())
                if {"price", "displayAddress"}.issubset(keys) or {"displayAddress", "propertyUrl"}.issubset(keys):
                    price = cur.get("price")
                    if isinstance(price, dict):
                        dp = price.get("displayPrices") or []
                        price_text = dp[0].get("displayPrice") if dp else None
                    elif isinstance(price, str):
                        # some blobs carry the display price as a plain string
                        price_text = price
                    else:
                        price_text = None
                    listing = {
                        "address": cur.get("displayAddress"),
                        "price_text": price_text,
                        "price_gbp": normalize_price(price_text),
                        "property_url": cur.get("propertyUrl") or cur.get("property_url"),
                        "id": cur.get("id") or cur.get("propertyId"),
                    }
                    # only keep if we have something meaningful
                    if listing.get("address") and (listing.get("property_url") or listing.get("id")):
                        return [listing]  # a minimal proof; we'll rely on HTML fallback for full pages
                for v in cur.values():
                    if isinstance(v, (dict, list)):
                        stack.append(v)
            elif isinstance(cur, list):
                stack.extend(cur)

    # 2) HTML fallback: find card links
    out = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if "property-for-sale" in href or "properties" in href or "property" in href:
            text = a.get_text(" ", strip=True)
            if not text:
                continue
            out.append({"link_text": text, "href": href})
    return out
This shows the pattern: attempt JSON first, then fall back.
In your real run, you’ll likely adapt this extractor to the exact blob key structure you see in your screenshot/HTML.
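Whatever the blob looks like, normalize_price should stay boring and predictable. A few assertions pin down its behavior on the kind of display strings you'll see (the sample strings are illustrative):

```python
import re


def normalize_price(price_text):
    # Same helper as above, repeated so this snippet runs standalone.
    if not price_text:
        return None
    m = re.search(r"([0-9][0-9,]*)", price_text.replace("£", ""))
    return int(m.group(1).replace(",", "")) if m else None


assert normalize_price("£425,000") == 425000
assert normalize_price("Guide price £1,250,000") == 1250000
assert normalize_price("POA") is None   # no digits -> no price
assert normalize_price(None) is None
```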
Step 4: Store in SQLite for dedupe + incremental updates
The simplest way to do incremental dataset building is SQLite.
We’ll store a normalized table keyed by a stable identifier (property URL or listing ID).
import sqlite3
from datetime import datetime, timezone


def init_db(path: str = "rightmove_sold.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sold_listings (
            id TEXT PRIMARY KEY,
            address TEXT,
            price_gbp INTEGER,
            price_text TEXT,
            property_url TEXT,
            first_seen TEXT,
            last_seen TEXT
        )
        """
    )
    return conn


def upsert_listing(conn: sqlite3.Connection, row: dict):
    # timezone-aware replacement for the deprecated datetime.utcnow()
    now = datetime.now(timezone.utc).isoformat()
    lid = row.get("id") or row.get("property_url")
    if not lid:
        return  # nothing stable to key on; skip rather than pollute the table
    conn.execute(
        """
        INSERT INTO sold_listings (id, address, price_gbp, price_text, property_url, first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            address=excluded.address,
            price_gbp=excluded.price_gbp,
            price_text=excluded.price_text,
            property_url=excluded.property_url,
            last_seen=excluded.last_seen
        """,
        (
            lid,
            row.get("address"),
            row.get("price_gbp"),
            row.get("price_text"),
            row.get("property_url"),
            now,
            now,
        ),
    )
    conn.commit()
Why this works
- First run: inserts everything.
- Next run: updates last_seen and any changed fields.
- You can later detect deltas (new listings since yesterday) via first_seen.
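As a sketch of the delta idea, a hypothetical new_since helper could pull listings first seen in the last N hours. It compares only the first 19 characters of the ISO timestamp ("YYYY-MM-DDTHH:MM:SS"), so it works whether or not the stored value carries microseconds or a timezone suffix:

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def new_since(conn: sqlite3.Connection, hours: int = 24) -> list[tuple]:
    """Listings whose first_seen falls inside the last `hours` hours."""
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).strftime("%Y-%m-%dT%H:%M:%S")
    cur = conn.execute(
        # substr(..., 1, 19) strips fractional seconds / timezone suffix
        "SELECT id, address, price_gbp FROM sold_listings WHERE substr(first_seen, 1, 19) >= ?",
        (cutoff,),
    )
    return cur.fetchall()
```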
Step 5: Pagination strategy
Rightmove listing pages typically support pagination.
Your exact pagination parameter may be:
- an offset parameter (?index=)
- a page number parameter (?page=)
- an internal path segment
Use your browser’s address bar while paging to learn the URL pattern.
Then implement:
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def with_query(url: str, **params) -> str:
    """Return `url` with the given query parameters added or overwritten."""
    u = urlparse(url)
    q = parse_qs(u.query)
    for k, v in params.items():
        q[k] = [str(v)]
    # doseq=True keeps any repeated parameters instead of dropping them
    new_q = urlencode(q, doseq=True)
    return urlunparse((u.scheme, u.netloc, u.path, u.params, new_q, u.fragment))


def crawl_pages(base_url: str, pages: int, api_key: str):
    session = requests.Session()
    for p in range(1, pages + 1):
        url = with_query(base_url, page=p)
        html = fetch_html(url, api_key, session=session)
        yield p, html
Adjust the page parameter name to whatever your URL inspection reveals.
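Two quick assertions show the behavior you want from with_query: appending a new parameter and overwriting an existing one, while leaving the rest of the URL alone (helper repeated so the snippet runs standalone; the SW1A URL is illustrative):

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def with_query(url: str, **params) -> str:
    # Same idea as the helper above, repeated so this snippet runs standalone.
    u = urlparse(url)
    q = parse_qs(u.query)
    for k, v in params.items():
        q[k] = [str(v)]
    new_q = urlencode(q, doseq=True)
    return urlunparse((u.scheme, u.netloc, u.path, u.params, new_q, u.fragment))


base = "https://www.rightmove.co.uk/house-prices/sw1a.html?radius=0.5"
assert with_query(base, page=2).endswith("radius=0.5&page=2")        # added
assert with_query(base + "&page=1", page=3).endswith("radius=0.5&page=3")  # overwritten
```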
Step 6: Full runnable dataset builder
Putting it together:
import os
import csv


def build_dataset(area_url: str, pages: int = 3, db_path: str = "rightmove_sold.db"):
    api_key = os.environ.get("PROXIESAPI_KEY")
    if not api_key:
        # fail loudly instead of using assert, which vanishes under python -O
        raise SystemExit("Missing PROXIESAPI_KEY")
    conn = init_db(db_path)
    for p, html in crawl_pages(area_url, pages=pages, api_key=api_key):
        listings = extract_listings_best_effort(html)
        print("page", p, "items", len(listings))
        for row in listings:
            # if you're using the JSON extractor, ensure `id` or `property_url` exists
            upsert_listing(conn, row)
    print("done")


def export_csv(db_path: str = "rightmove_sold.db", out_path: str = "rightmove_sold.csv"):
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "SELECT id, address, price_gbp, price_text, property_url, first_seen, last_seen FROM sold_listings"
    )
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["id", "address", "price_gbp", "price_text", "property_url", "first_seen", "last_seen"])
        for row in cur:
            w.writerow(row)
    print("wrote", out_path)


if __name__ == "__main__":
    # Replace with a real sold-price area URL you captured in your screenshot
    AREA_URL = "https://www.rightmove.co.uk/house-prices.html"
    build_dataset(AREA_URL, pages=2)
    export_csv()
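One optional addition for nightly runs: a cheap post-run health check. If price coverage suddenly collapses between runs, the extractor has probably drifted from the page markup. run_health_check is a hypothetical helper, not part of the pipeline above:

```python
import sqlite3


def run_health_check(db_path: str = "rightmove_sold.db") -> dict:
    """Cheap post-run stats: total rows and what fraction have a parsed price."""
    conn = sqlite3.connect(db_path)
    total = conn.execute("SELECT COUNT(*) FROM sold_listings").fetchone()[0]
    priced = conn.execute(
        "SELECT COUNT(*) FROM sold_listings WHERE price_gbp IS NOT NULL"
    ).fetchone()[0]
    conn.close()
    return {
        "rows": total,
        "priced": priced,
        "price_coverage": round(priced / total, 3) if total else 0.0,
    }
```

Log these numbers after each run and alert yourself when price_coverage drops sharply.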
Practical advice: keep the dataset clean
A solid “price history dataset builder” lives on boring details:
- Normalize addresses (casefold, strip punctuation, keep postcode separately)
- Store raw + normalized fields
- Keep first_seen / last_seen timestamps
- De-dupe early using URL or a stable listing ID
- Write small incremental runs (nightly) rather than giant re-crawls
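The first two bullets can be sketched as one small helper. normalize_address is hypothetical, and the postcode regex is an assumption that covers standard UK outward+inward formats rather than every edge case in the official spec:

```python
import re

# Assumed UK postcode pattern: outward code + inward code (not exhaustive).
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b", re.I)


def normalize_address(raw: str) -> dict:
    """Split out the postcode, then casefold and strip punctuation from the rest
    so near-duplicate addresses collapse to the same normalized string."""
    postcode = None
    rest = raw
    m = POSTCODE_RE.search(raw)
    if m:
        postcode = f"{m.group(1).upper()} {m.group(2).upper()}"
        rest = raw[: m.start()] + raw[m.end():]
    cleaned = re.sub(r"[^\w\s]", " ", rest).casefold()
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return {"address_norm": cleaned, "postcode": postcode}
```

Store both the raw address and this normalized pair; the normalized string plus postcode makes a far better dedupe key than the raw display address.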
Where ProxiesAPI fits (honestly)
Rightmove can be unreliable without a stability layer.
ProxiesAPI helps by:
- smoothing out transient 403/429 spikes
- improving success rates on long pagination runs
- giving you a consistent network interface while you iterate on parsing
It won’t remove the need for good engineering—timeouts, retries, dedupe—but it makes those efforts pay off at scale.