Scrape UK Property Prices from Rightmove with Python (Green List #17): Dataset Builder

Rightmove’s Sold House Prices section is a goldmine if you’re doing UK property research — but turning it into a usable dataset means doing three things well:

  1. crawl the sold results pages
  2. paginate safely (without duplicates)
  3. fetch each property detail page and normalize the fields

In this guide we’ll build a production-style Python scraper that exports a clean dataset (CSV + JSON). We’ll also show exactly where ProxiesAPI fits: as a wrapper around the HTTP fetch so your parsing logic doesn’t change.

Screenshot: Rightmove sold prices page (we’ll scrape the result cards and property links)

Scale Rightmove crawling reliably with ProxiesAPI

Rightmove scrapes often fail when you paginate and open lots of listing pages. ProxiesAPI gives you a simple fetch wrapper so your scraper stays focused on parsing — while the network layer stays stable.


What we’re scraping (Rightmove Sold House Prices)

Rightmove’s sold prices experience typically starts at:

  • https://www.rightmove.co.uk/house-prices.html

From there, you click into an area/street and you’ll land on a sold results page that contains:

  • a list of sold properties (cards / rows)
  • links to property detail pages
  • pagination (often as a “next” link or page numbers)

Important: Rightmove’s HTML can vary slightly between areas and over time. The workflow below is robust because it:

  • extracts links based on stable anchors (not brittle nth-child selectors)
  • deduplicates by property URL/id
  • retries network failures

Before we code: grab one real start URL

Open Rightmove in your browser and navigate to a sold prices results page you care about.

Example shape (yours will differ):

  • https://www.rightmove.co.uk/house-prices/London.html?type=DETACHED&soldIn=1 (illustrative)

Copy that URL — we’ll use it as START_URL.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity pandas

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for HTML parsing
  • tenacity for clean retries
  • pandas for CSV output (optional but convenient)

Step 1: A fetch() function with timeouts + retries

Scrapers fail far more often at the network layer than in the parser, so start with a solid fetch.

from __future__ import annotations

import random
import time
from urllib.parse import quote

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 40)  # connect, read
UA = "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"

session = requests.Session()
session.headers.update({"User-Agent": UA, "Accept-Language": "en-GB,en;q=0.9"})


class FetchError(RuntimeError):
    pass


@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT)

    # basic anti-bot / unexpected response handling
    if r.status_code in (403, 429):
        raise FetchError(f"blocked status={r.status_code}")

    r.raise_for_status()

    if not r.text or len(r.text) < 5000:
        # tiny pages can be interstitials / error shells
        raise FetchError("unexpectedly small HTML")

    # polite jitter
    time.sleep(0.4 + random.random() * 0.6)

    return r.text
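
For intuition, wait_exponential(multiplier=1, min=1, max=20) produces a capped doubling schedule. Here’s a stdlib sketch of the same idea (tenacity’s exact attempt indexing varies slightly between versions, so treat the numbers as approximate):

```python
def backoff_waits(attempts: int, multiplier: float = 1, min_s: float = 1, max_s: float = 20) -> list[float]:
    # capped exponential backoff: multiplier * 2**n, clamped into [min_s, max_s]
    return [min(max_s, max(min_s, multiplier * 2 ** n)) for n in range(attempts)]

print(backoff_waits(6))  # [1, 2, 4, 8, 16, 20]
```

With stop_after_attempt(6) there are five waits between the six attempts, so a persistently failing URL costs on the order of half a minute before fetch() reraises.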

This is the baseline. Next, we’ll drop in ProxiesAPI with zero parser changes.


Step 2: Drop in ProxiesAPI (zero parser changes)

ProxiesAPI works as a URL wrapper:

http://api.proxiesapi.com/?key=API_KEY&url=ENCODED_TARGET_URL

In Python:

def proxiesapi_url(target_url: str, api_key: str) -> str:
    return "http://api.proxiesapi.com/?key=" + quote(api_key) + "&url=" + quote(target_url, safe="")

# example
# wrapped = proxiesapi_url("https://www.rightmove.co.uk/house-prices.html", "API_KEY")

To use ProxiesAPI, you only change the URL you pass to fetch().
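
A quick sanity check on the encoding (with a made-up target URL): quote(..., safe="") percent-encodes the target’s own ?, & and /, so they can’t collide with ProxiesAPI’s query parameters:

```python
from urllib.parse import quote


def proxiesapi_url(target_url: str, api_key: str) -> str:
    return "http://api.proxiesapi.com/?key=" + quote(api_key) + "&url=" + quote(target_url, safe="")


wrapped = proxiesapi_url("https://example.com/page.html?a=1&b=2", "API_KEY")
print(wrapped)
# http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fexample.com%2Fpage.html%3Fa%3D1%26b%3D2
```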


Step 3: Extract property links from a results page

Rightmove sold results pages contain links to property pages. We’ll extract unique property URLs.

from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

BASE = "https://www.rightmove.co.uk"


def parse_property_links(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")

    links: set[str] = set()

    # Strategy: collect anchors that look like property pages.
    # Rightmove often uses /house-prices/ or /property/ style paths.
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue

        abs_url = urljoin(BASE, href)
        path = urlparse(abs_url).path

        # Heuristic filter (adjust if your observed paths differ)
        if "/house-prices/" in path or "/property/" in path:
            # avoid index-like pages
            if abs_url.startswith(BASE):
                links.add(abs_url)

    return sorted(links)
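
You can sanity-check the filtering and dedupe logic without any HTML at all, by running the same heuristic over a few synthetic hrefs (these paths are illustrative, not real Rightmove URLs):

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.rightmove.co.uk"

# synthetic hrefs in the shapes a results page might contain
hrefs = [
    "/house-prices/detail/eg-123",
    "/house-prices/detail/eg-123",                  # duplicate card, deduped by the set
    "https://www.rightmove.co.uk/property/456789",
    "/contact-us",                                  # filtered out: not a property path
]

links: set[str] = set()
for href in hrefs:
    abs_url = urljoin(BASE, href)
    path = urlparse(abs_url).path
    if ("/house-prices/" in path or "/property/" in path) and abs_url.startswith(BASE):
        links.add(abs_url)

print(sorted(links))
# ['https://www.rightmove.co.uk/house-prices/detail/eg-123', 'https://www.rightmove.co.uk/property/456789']
```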

Pagination: find the “next page” URL

Rather than guessing page numbers, try to find a “Next” link (common pattern on many sites).

import re


def parse_next_page(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Look for anchor text that suggests next.
    for a in soup.select("a[href]"):
        txt = a.get_text(" ", strip=True).lower()
        if txt in {"next", "next page", "›"} or re.fullmatch(r"next\s*›?", txt):
            return urljoin(current_url, a.get("href"))

    # Fallback: common rel attribute
    rel = soup.select_one("a[rel='next']")
    if rel and rel.get("href"):
        return urljoin(current_url, rel.get("href"))

    return None

Because Rightmove can change markup, you may need to tweak this function after a quick HTML inspection.
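
For example, the rel fallback on its own behaves like this against a minimal synthetic snippet (using the stdlib html.parser here so the example has no lxml dependency):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = '<div><a href="?page=2" rel="next">Next</a></div>'  # synthetic pagination markup
soup = BeautifulSoup(html, "html.parser")

rel = soup.select_one("a[rel='next']")
next_url = urljoin("https://www.rightmove.co.uk/house-prices/London.html", rel["href"]) if rel else None
print(next_url)
# https://www.rightmove.co.uk/house-prices/London.html?page=2
```

Note that urljoin handles the query-only href correctly, keeping the base path and replacing the query string.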


Step 4: Parse a property detail page (sold price + date + address)

Property detail pages are where the real value lives. We’ll parse a handful of fields:

  • address
  • sold price
  • sold date
  • property type (if present)

import re


def clean_money(text: str) -> int | None:
    if not text:
        return None
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None


def parse_property_page(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Address is often in a heading near the top
    h1 = soup.select_one("h1")
    address = h1.get_text(" ", strip=True) if h1 else None

    text = soup.get_text("\n", strip=True)

    # Heuristics for sold price/date inside page text.
    # (You should verify these against the live page and refine selectors.)
    price = None
    m = re.search(r"Sold\s+for\s+£\s*([0-9,]+)", text, flags=re.IGNORECASE)
    if m:
        price = clean_money("£" + m.group(1))

    sold_date = None
    m2 = re.search(r"Sold\s+on\s+([0-9]{1,2}\s+[A-Za-z]+\s+[0-9]{4})", text, flags=re.IGNORECASE)
    if m2:
        sold_date = m2.group(1)

    return {
        "url": url,
        "address": address,
        "sold_price_gbp": price,
        "sold_date": sold_date,
    }

This is intentionally conservative: HTML structure varies. If you inspect the page and find stable attributes (like data-test), prefer those over regex.


Step 5: Crawl results → fetch property pages → export a dataset

Now we wire it up:

  • Start at START_URL
  • Extract property links on each page
  • Follow pagination until max_pages
  • Fetch each property page and parse fields
  • Write JSON + CSV

import json
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    start_url: str
    max_pages: int = 10
    max_properties: int = 200
    proxiesapi_key: str | None = None


def maybe_wrap(url: str, api_key: str | None) -> str:
    if not api_key:
        return url
    return proxiesapi_url(url, api_key)


def crawl(config: CrawlConfig) -> list[dict]:
    current = config.start_url
    page = 0

    seen_properties: set[str] = set()
    rows: list[dict] = []

    while current and page < config.max_pages and len(rows) < config.max_properties:
        page += 1
        html = fetch(maybe_wrap(current, config.proxiesapi_key))

        prop_links = parse_property_links(html)
        next_url = parse_next_page(html, current)

        print(f"page={page} properties_found={len(prop_links)} next={bool(next_url)}")

        for url in prop_links:
            if url in seen_properties:
                continue
            seen_properties.add(url)

            p_html = fetch(maybe_wrap(url, config.proxiesapi_key))
            row = parse_property_page(p_html, url)
            rows.append(row)

            if len(rows) >= config.max_properties:
                break

        current = next_url

    return rows


if __name__ == "__main__":
    START_URL = "PASTE_YOUR_RIGHTMOVE_SOLD_RESULTS_URL_HERE"

    cfg = CrawlConfig(
        start_url=START_URL,
        max_pages=8,
        max_properties=150,
        proxiesapi_key=None,  # set to "YOUR_KEY" to use ProxiesAPI
    )

    data = crawl(cfg)
    print("rows:", len(data))

    with open("rightmove_sold_prices.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print("wrote rightmove_sold_prices.json")

    try:
        import pandas as pd

        pd.DataFrame(data).to_csv("rightmove_sold_prices.csv", index=False)
        print("wrote rightmove_sold_prices.csv")
    except Exception as e:
        print("CSV export skipped:", e)

Practical notes (Rightmove scraping hygiene)

1) Respect crawl limits

Even if you can crawl thousands of pages, you probably don’t need to.

Start with:

  • max_pages=3
  • max_properties=50

Validate your extraction, then scale.

2) Deduplicate aggressively

Rightmove pages can repeat listings or show the same property in multiple contexts.

Always dedupe by:

  • property URL
  • (or) property id if you can reliably extract it

3) Expect some empty fields

Some properties won’t show a “Sold on” date or will have content loaded differently.

That’s okay — build a dataset that tolerates None.
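
A small coverage check tells you how tolerant the dataset needs to be. Here’s one way to measure the share of rows where a field actually parsed (the sample rows are made up, in the shape parse_property_page returns):

```python
def field_coverage(rows: list[dict], field: str) -> float:
    # fraction of rows where the field parsed to a non-None value
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows) if rows else 0.0


sample = [
    {"url": "u1", "address": "3 Example Street", "sold_price_gbp": 425000, "sold_date": "14 March 2023"},
    {"url": "u2", "address": "5 Example Street", "sold_price_gbp": None, "sold_date": None},
]

print(field_coverage(sample, "sold_price_gbp"))  # 0.5
```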


Where ProxiesAPI fits (honestly)

Rightmove is not a “hello world” site — when you:

  • paginate
  • open lots of details pages
  • run the job repeatedly

…you’ll hit more throttling and flaky responses.

ProxiesAPI helps by giving you a consistent fetch wrapper so you can keep your code focused on parsing + data modeling.


QA checklist

  • Start URL opens in your browser and shows sold listings
  • parse_property_links() returns real property links (print first 5)
  • Pagination finds a next page (or you tweak parse_next_page())
  • Parsed rows contain plausible addresses + prices
  • JSON/CSV files write successfully
