Scrape UK Property Prices from Rightmove with Python (Green List #17): Dataset Builder

Rightmove’s Sold House Prices section is a goldmine if you’re doing UK property research — but turning it into a usable dataset means doing three things well:

  1. crawl the sold results pages
  2. paginate safely (without duplicates)
  3. fetch each property detail page and normalize the fields

In this guide we’ll build a production-style Python scraper that exports a clean dataset (CSV + JSON). We’ll also show exactly where ProxiesAPI fits: as a wrapper around the HTTP fetch so your parsing logic doesn’t change.

Screenshot: Rightmove sold prices page (we’ll scrape the result cards and property links)

Scale Rightmove crawling reliably with ProxiesAPI

Rightmove scrapes often fail when you paginate and open lots of listing pages. ProxiesAPI gives you a simple fetch wrapper so your scraper stays focused on parsing — while the network layer stays stable.


What we’re scraping (Rightmove Sold House Prices)

Rightmove’s sold prices experience typically starts at:

  • https://www.rightmove.co.uk/house-prices.html

From there, you click into an area/street and you’ll land on a sold results page that contains:

  • a list of sold properties (cards / rows)
  • links to property detail pages
  • pagination (often as a “next” link or page numbers)

Important: Rightmove’s HTML can vary slightly between areas and over time. The workflow below is robust because it:

  • extracts links based on stable anchors (not brittle nth-child selectors)
  • deduplicates by property URL/id
  • retries network failures

Before we code: grab one real start URL

Open Rightmove in your browser and navigate to a sold prices results page you care about.

Example shape (yours will differ):

  • https://www.rightmove.co.uk/house-prices/London.html?type=DETACHED&soldIn=1 (illustrative)

Copy that URL — we’ll use it as START_URL.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity pandas

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for HTML parsing
  • tenacity for clean retries
  • pandas for CSV output (optional but convenient)

Step 1: A fetch() function with timeouts + retries

Scrapers fail far more often at the network layer than in the parser, so start with a solid fetch.

from __future__ import annotations

import random
import time
from urllib.parse import quote

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 40)  # connect, read
UA = "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"

session = requests.Session()
session.headers.update({"User-Agent": UA, "Accept-Language": "en-GB,en;q=0.9"})


class FetchError(RuntimeError):
    pass


@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT)

    # basic anti-bot / unexpected response handling
    if r.status_code in (403, 429):
        raise FetchError(f"blocked status={r.status_code}")

    r.raise_for_status()

    if not r.text or len(r.text) < 5000:
        # tiny pages can be interstitials / error shells
        raise FetchError("unexpectedly small HTML")

    # polite jitter
    time.sleep(0.4 + random.random() * 0.6)

    return r.text
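
For intuition, wait_exponential(multiplier=1, min=1, max=20) produces a capped doubling schedule. Here’s a stdlib sketch of the same idea (tenacity’s exact attempt indexing varies slightly between versions, so treat the numbers as approximate):

```python
def backoff_waits(attempts: int, multiplier: float = 1, min_s: float = 1, max_s: float = 20) -> list[float]:
    # capped exponential backoff: multiplier * 2**n, clamped into [min_s, max_s]
    return [min(max_s, max(min_s, multiplier * 2 ** n)) for n in range(attempts)]

print(backoff_waits(6))  # [1, 2, 4, 8, 16, 20]
```

With stop_after_attempt(6) there are five waits between the six attempts, so a persistently failing URL costs on the order of half a minute before fetch() reraises.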

This is the baseline. Next, we’ll drop in ProxiesAPI with zero parser changes.


Step 2: Drop in ProxiesAPI (zero parser changes)

ProxiesAPI works as a URL wrapper:

http://api.proxiesapi.com/?key=API_KEY&url=ENCODED_TARGET_URL

In Python:

def proxiesapi_url(target_url: str, api_key: str) -> str:
    return "http://api.proxiesapi.com/?key=" + quote(api_key) + "&url=" + quote(target_url, safe="")

# example
# wrapped = proxiesapi_url("https://www.rightmove.co.uk/house-prices.html", "API_KEY")

To use ProxiesAPI, you only change the URL you pass to fetch().
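
A quick sanity check on the encoding (with a made-up target URL): quote(..., safe="") percent-encodes the target’s own ?, & and /, so they can’t collide with ProxiesAPI’s query parameters:

```python
from urllib.parse import quote


def proxiesapi_url(target_url: str, api_key: str) -> str:
    return "http://api.proxiesapi.com/?key=" + quote(api_key) + "&url=" + quote(target_url, safe="")


wrapped = proxiesapi_url("https://example.com/page.html?a=1&b=2", "API_KEY")
print(wrapped)
# http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fexample.com%2Fpage.html%3Fa%3D1%26b%3D2
```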


Step 3: Extract property links from a results page

Rightmove sold results pages contain links to property pages. We’ll extract unique property URLs.

from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

BASE = "https://www.rightmove.co.uk"


def parse_property_links(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")

    links: set[str] = set()

    # Strategy: collect anchors that look like property pages.
    # Rightmove often uses /house-prices/ or /property/ style paths.
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue

        abs_url = urljoin(BASE, href)
        path = urlparse(abs_url).path

        # Heuristic filter (adjust if your observed paths differ)
        if "/house-prices/" in path or "/property/" in path:
            # avoid index-like pages
            if abs_url.startswith(BASE):
                links.add(abs_url)

    return sorted(links)
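
You can sanity-check the filtering and dedupe logic without any HTML at all, by running the same heuristic over a few synthetic hrefs (these paths are illustrative, not real Rightmove URLs):

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.rightmove.co.uk"

# synthetic hrefs in the shapes a results page might contain
hrefs = [
    "/house-prices/detail/eg-123",
    "/house-prices/detail/eg-123",                  # duplicate card, deduped by the set
    "https://www.rightmove.co.uk/property/456789",
    "/contact-us",                                  # filtered out: not a property path
]

links: set[str] = set()
for href in hrefs:
    abs_url = urljoin(BASE, href)
    path = urlparse(abs_url).path
    if ("/house-prices/" in path or "/property/" in path) and abs_url.startswith(BASE):
        links.add(abs_url)

print(sorted(links))
# ['https://www.rightmove.co.uk/house-prices/detail/eg-123', 'https://www.rightmove.co.uk/property/456789']
```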

Pagination: find the “next page” URL

Rather than guessing page numbers, try to find a “Next” link (common pattern on many sites).

import re


def parse_next_page(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Look for anchor text that suggests next.
    for a in soup.select("a[href]"):
        txt = a.get_text(" ", strip=True).lower()
        if txt in {"next", "next page", "›"} or re.fullmatch(r"next\s*›?", txt):
            return urljoin(current_url, a.get("href"))

    # Fallback: common rel attribute
    rel = soup.select_one("a[rel='next']")
    if rel and rel.get("href"):
        return urljoin(current_url, rel.get("href"))

    return None

Because Rightmove can change markup, you may need to tweak this function after a quick HTML inspection.
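
For example, the rel fallback on its own behaves like this against a minimal synthetic snippet (using the stdlib html.parser here so the example has no lxml dependency):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = '<div><a href="?page=2" rel="next">Next</a></div>'  # synthetic pagination markup
soup = BeautifulSoup(html, "html.parser")

rel = soup.select_one("a[rel='next']")
next_url = urljoin("https://www.rightmove.co.uk/house-prices/London.html", rel["href"]) if rel else None
print(next_url)
# https://www.rightmove.co.uk/house-prices/London.html?page=2
```

Note that urljoin handles the query-only href correctly, keeping the base path and replacing the query string.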


Step 4: Parse a property detail page (sold price + date + address)

Property detail pages are where the real value lives. We’ll parse a handful of fields:

  • address
  • sold price
  • sold date
  • property type (if present)

import re


def clean_money(text: str) -> int | None:
    if not text:
        return None
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None


def parse_property_page(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Address is often in a heading near the top
    h1 = soup.select_one("h1")
    address = h1.get_text(" ", strip=True) if h1 else None

    text = soup.get_text("\n", strip=True)

    # Heuristics for sold price/date inside page text.
    # (You should verify these against the live page and refine selectors.)
    price = None
    m = re.search(r"Sold\s+for\s+£\s*([0-9,]+)", text, flags=re.IGNORECASE)
    if m:
        price = clean_money("£" + m.group(1))

    sold_date = None
    m2 = re.search(r"Sold\s+on\s+([0-9]{1,2}\s+[A-Za-z]+\s+[0-9]{4})", text, flags=re.IGNORECASE)
    if m2:
        sold_date = m2.group(1)

    return {
        "url": url,
        "address": address,
        "sold_price_gbp": price,
        "sold_date": sold_date,
    }

This is intentionally conservative: HTML structure varies. If you inspect the page and find stable attributes (like data-test), prefer those over regex.


Step 5: Crawl results → fetch property pages → export a dataset

Now we wire it up:

  • Start at START_URL
  • Extract property links on each page
  • Follow pagination until max_pages
  • Fetch each property page and parse fields
  • Write JSON + CSV

import json
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    start_url: str
    max_pages: int = 10
    max_properties: int = 200
    proxiesapi_key: str | None = None


def maybe_wrap(url: str, api_key: str | None) -> str:
    if not api_key:
        return url
    return proxiesapi_url(url, api_key)


def crawl(config: CrawlConfig) -> list[dict]:
    current = config.start_url
    page = 0

    seen_properties: set[str] = set()
    rows: list[dict] = []

    while current and page < config.max_pages and len(rows) < config.max_properties:
        page += 1
        html = fetch(maybe_wrap(current, config.proxiesapi_key))

        prop_links = parse_property_links(html)
        next_url = parse_next_page(html, current)

        print(f"page={page} properties_found={len(prop_links)} next={bool(next_url)}")

        for url in prop_links:
            if url in seen_properties:
                continue
            seen_properties.add(url)

            p_html = fetch(maybe_wrap(url, config.proxiesapi_key))
            row = parse_property_page(p_html, url)
            rows.append(row)

            if len(rows) >= config.max_properties:
                break

        current = next_url

    return rows


if __name__ == "__main__":
    START_URL = "PASTE_YOUR_RIGHTMOVE_SOLD_RESULTS_URL_HERE"

    cfg = CrawlConfig(
        start_url=START_URL,
        max_pages=8,
        max_properties=150,
        proxiesapi_key=None,  # set to "YOUR_KEY" to use ProxiesAPI
    )

    data = crawl(cfg)
    print("rows:", len(data))

    with open("rightmove_sold_prices.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print("wrote rightmove_sold_prices.json")

    try:
        import pandas as pd

        pd.DataFrame(data).to_csv("rightmove_sold_prices.csv", index=False)
        print("wrote rightmove_sold_prices.csv")
    except Exception as e:
        print("CSV export skipped:", e)

Practical notes (Rightmove scraping hygiene)

1) Respect crawl limits

Even if you can crawl thousands of pages, you probably don’t need to.

Start with:

  • max_pages=3
  • max_properties=50

Validate your extraction, then scale.

2) Deduplicate aggressively

Rightmove pages can repeat listings or show the same property in multiple contexts.

Always dedupe by:

  • property URL
  • (or) property id if you can reliably extract it

3) Expect some empty fields

Some properties won’t show a “Sold on” date or will have content loaded differently.

That’s okay — build a dataset that tolerates None.
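
A small coverage check tells you how tolerant the dataset needs to be. Here’s one way to measure the share of rows where a field actually parsed (the sample rows are made up, in the shape parse_property_page returns):

```python
def field_coverage(rows: list[dict], field: str) -> float:
    # fraction of rows where the field parsed to a non-None value
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows) if rows else 0.0


sample = [
    {"url": "u1", "address": "3 Example Street", "sold_price_gbp": 425000, "sold_date": "14 March 2023"},
    {"url": "u2", "address": "5 Example Street", "sold_price_gbp": None, "sold_date": None},
]

print(field_coverage(sample, "sold_price_gbp"))  # 0.5
```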


Where ProxiesAPI fits (honestly)

Rightmove is not a “hello world” site — when you:

  • paginate
  • open lots of details pages
  • run the job repeatedly

…you’ll hit more throttling and flaky responses.

ProxiesAPI helps by giving you a consistent fetch wrapper so you can keep your code focused on parsing + data modeling.


QA checklist

  • Start URL opens in your browser and shows sold listings
  • parse_property_links() returns real property links (print first 5)
  • Pagination finds a next page (or you tweak parse_next_page())
  • Parsed rows contain plausible addresses + prices
  • JSON/CSV files write successfully
