Scrape UK Property Prices from Rightmove Sold Prices (Python + Dataset Builder)
Rightmove is one of the most-used property portals in the UK. If you’re trying to build a pricing model, track neighborhood trends, or just analyze the market, the sold prices pages are a gold mine.
In this tutorial we’ll build a repeatable dataset builder that:
- crawls a Rightmove sold-prices search
- paginates through result pages
- extracts each listing’s key fields
- deduplicates by a stable ID
- writes a clean CSV you can re-run daily/weekly
We’ll keep it practical: real selectors, defensive parsing, and “don’t hang forever” networking.

Property portals can throttle aggressively when you paginate and fan out into detail pages. ProxiesAPI helps keep the network layer consistent so your dataset builds finish reliably.
What we’re scraping (site structure)
Rightmove sold listings typically follow this pattern:
- Search results page (sold prices): a URL with query parameters + pagination.
- Each result links to a property page.
- The property page includes address, property type, and a sold price history section (when available).
Important: Rightmove’s HTML changes over time. The goal is to build a scraper that fails loudly (so you notice) instead of silently writing garbage.
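As a concrete example of failing loudly, a small guard like this (a hypothetical helper, not part of any library) can sit after each parse step:

```python
def assert_nonempty(listings: list, page_url: str) -> None:
    """Fail loudly: parsing zero listings usually means the markup
    changed or you received a block page -- not an empty market."""
    if not listings:
        raise RuntimeError(f"parsed 0 listings from {page_url}; inspect the HTML")
```

Raising immediately beats silently appending empty rows to your CSV for an hour.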
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- tenacity for robust retries with backoff
Step 1: A network layer that won’t betray you
You want three things:
- real timeouts (connect + read)
- retries on transient failures (429/5xx)
- a single place to add ProxiesAPI later
from __future__ import annotations
import os
import random
import time
from dataclasses import dataclass
import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type
TIMEOUT = (10, 30) # connect, read
BASE_HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/123.0.0.0 Safari/537.36"
),
"Accept-Language": "en-GB,en;q=0.9",
}
class FetchError(RuntimeError):
pass
@dataclass
class HttpClient:
session: requests.Session
proxiesapi_url: str | None = None
def build_url(self, url: str) -> str:
"""Optionally route the request via ProxiesAPI.
Keep this honest: ProxiesAPI is for *reliability* when you scale.
Your code should still work without it.
"""
if not self.proxiesapi_url:
return url
# Example pattern (adjust to your ProxiesAPI docs):
# proxiesapi_url might be something like:
# https://api.proxiesapi.com/v1/?api_key=...&url=
return f"{self.proxiesapi_url}{requests.utils.quote(url, safe='')}"
@retry(
reraise=True,
stop=stop_after_attempt(6),
wait=wait_exponential_jitter(initial=1, max=20),
retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def get(self, url: str) -> str:
target = self.build_url(url)
# small jitter reduces bursts when you paginate
time.sleep(random.uniform(0.2, 0.8))
r = self.session.get(target, headers=BASE_HEADERS, timeout=TIMEOUT)
# Treat rate limiting and server errors as retryable.
if r.status_code in (429, 500, 502, 503, 504):
raise FetchError(f"retryable status={r.status_code} url={url}")
r.raise_for_status()
return r.text
def make_client() -> HttpClient:
s = requests.Session()
proxiesapi_url = os.getenv("PROXIESAPI_URL") # optional
return HttpClient(session=s, proxiesapi_url=proxiesapi_url)
Configure ProxiesAPI (optional)
Create a .env file:
PROXIESAPI_URL="https://api.proxiesapi.com/v1/?api_key=YOUR_KEY&url="
If you don’t set it, requests go directly to Rightmove.
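Setup installed python-dotenv, and calling its load_dotenv() at startup is the usual way to pull the variable in before make_client() runs. If you'd rather not depend on it at runtime, a minimal stdlib loader covers this one-variable case (a sketch, assuming simple KEY="value" lines):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY="value" lines, one per line.
    Existing environment variables are not overwritten."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

Call it once at the top of your script, before make_client().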
Step 2: Start from a sold-prices search URL
Rightmove has many query parameters. The simplest workflow is:
- perform a sold-prices search manually in your browser
- copy the resulting URL
- use it as the seed URL for your dataset run
Example (your parameters will differ):
https://www.rightmove.co.uk/house-prices/area.html?locationIdentifier=REGION%5E87490
Pagination is often represented by a start index or page param.
Because this can change, we’ll implement pagination by:
- fetching the first page
- extracting “next page” link if present
- continuing until no next link
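If the next-link heuristic ever stops matching, one fallback is to drive pagination yourself with a query parameter. The parameter name index below is an assumption; verify it against real Rightmove result URLs before relying on it:

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def with_index(url: str, index: int) -> str:
    """Sketch: add or replace a paging query param on a search URL.
    The param name 'index' is an assumption -- check real URLs."""
    parts = urlparse(url)
    qs = parse_qs(parts.query)
    qs["index"] = [str(index)]
    return urlunparse(parts._replace(query=urlencode(qs, doseq=True)))
```

You would then loop over with_index(seed, 0), with_index(seed, 25), ... until a page yields no new listings.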
Step 3: Parse result pages (listing URLs + stable IDs)
Rightmove pages usually contain property links that include a numeric ID.
We’ll extract:
- listing_id
- listing_url
from __future__ import annotations
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin
BASE = "https://www.rightmove.co.uk"
LISTING_ID_RE = re.compile(r"(\d{6,})")
def parse_results_page(html: str) -> tuple[list[dict], str | None]:
soup = BeautifulSoup(html, "lxml")
# Try multiple selector strategies; Rightmove changes markup.
links = []
for a in soup.select("a[href*='/house-prices/']"):
href = a.get("href")
if not href:
continue
url = urljoin(BASE, href)
m = LISTING_ID_RE.search(url)
if not m:
continue
links.append({"listing_id": m.group(1), "url": url})
# Also include generic property links if present
for a in soup.select("a[href*='/properties/']"):
href = a.get("href")
if not href:
continue
url = urljoin(BASE, href)
m = LISTING_ID_RE.search(url)
if not m:
continue
links.append({"listing_id": m.group(1), "url": url})
# de-dupe within page
seen = set()
out = []
for item in links:
if item["listing_id"] in seen:
continue
seen.add(item["listing_id"])
out.append(item)
# Find next page link (best-effort)
next_a = soup.select_one("a[rel='next']")
if not next_a:
next_a = soup.find("a", string=re.compile(r"Next", re.I))
next_url = None
if next_a and next_a.get("href"):
next_url = urljoin(BASE, next_a.get("href"))
return out, next_url
If you run this and get zero results, inspect the HTML you’re receiving (you might be getting a bot check page). That’s where a proxy layer (or ProxiesAPI) often becomes necessary.
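A cheap heuristic can flag that situation before you waste a whole run. The markers below are illustrative guesses, not a definitive list of what a block page contains:

```python
def looks_like_block_page(html: str) -> bool:
    """Heuristic: very short bodies or captcha/consent markers
    usually mean a bot check was served instead of results."""
    lowered = html.lower()
    markers = ("captcha", "access denied", "unusual traffic", "are you a robot")
    return len(html) < 2000 or any(m in lowered for m in markers)
```

Check it right after each fetch and bail out (or switch to proxy routing) when it trips.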
Step 4: Parse a listing page (sold history + core fields)
For a dataset, you want clean, typed fields:
- address
- property_type
- bedrooms (when available)
- sold_date
- sold_price
Rightmove pages tend to expose structured data in JSON inside <script> tags (often application/ld+json). We’ll try that first, then fall back to HTML selectors.
import json
from datetime import datetime
def extract_json_ld(soup: BeautifulSoup) -> list[dict]:
out = []
for script in soup.select("script[type='application/ld+json']"):
try:
data = json.loads(script.get_text(strip=True) or "{}")
except json.JSONDecodeError:
continue
if isinstance(data, dict):
out.append(data)
elif isinstance(data, list):
out.extend([d for d in data if isinstance(d, dict)])
return out
def parse_listing_page(html: str, listing_url: str, listing_id: str) -> dict:
soup = BeautifulSoup(html, "lxml")
address = None
property_type = None
bedrooms = None
sold_date = None
sold_price = None
# 1) JSON-LD (best)
for blob in extract_json_ld(soup):
# common keys: "address", "name", "offers" etc.
if not address:
addr = blob.get("address")
if isinstance(addr, dict):
address = addr.get("streetAddress") or addr.get("name")
elif isinstance(addr, str):
address = addr
if not property_type:
property_type = blob.get("@type") if isinstance(blob.get("@type"), str) else None
# 2) HTML fallbacks
if not address:
h1 = soup.select_one("h1")
if h1:
address = h1.get_text(" ", strip=True)
# Sold price/date often appear in a summary block.
# Use regex to avoid brittle classnames.
text = soup.get_text("\n", strip=True)
m_price = re.search(r"Sold price\s*£?([\d,]+)", text, re.I)
if m_price:
sold_price = int(m_price.group(1).replace(",", ""))
m_date = re.search(r"Sold on\s*(\d{1,2}\s+[A-Za-z]+\s+\d{4})", text, re.I)
if m_date:
try:
sold_date = datetime.strptime(m_date.group(1), "%d %B %Y").date().isoformat()
except ValueError:
sold_date = m_date.group(1)
return {
"listing_id": listing_id,
"url": listing_url,
"address": address,
"property_type": property_type,
"bedrooms": bedrooms,
"sold_date": sold_date,
"sold_price_gbp": sold_price,
}
This parser is intentionally conservative. If you need richer sold history (multiple transactions), inspect the page HTML/JSON and extend the extraction.
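As a starting point for multi-transaction history, a regex pass over the page text can recover date/price pairs. Both patterns are assumptions; adjust them to the markup you actually receive:

```python
import re
from datetime import datetime

# Assumed pattern: a "<day> <Month> <year>" date followed within a few
# characters by a "£<amount>" price. Verify against real page text.
HISTORY_RE = re.compile(r"(\d{1,2}\s+[A-Za-z]+\s+\d{4})\D{0,40}£([\d,]+)", re.I)

def extract_sold_history(page_text: str) -> list:
    """Sketch: pull every date/price pair from visible page text."""
    rows = []
    for date_str, price_str in HISTORY_RE.findall(page_text):
        try:
            date_iso = datetime.strptime(date_str, "%d %B %Y").date().isoformat()
        except ValueError:
            date_iso = date_str  # keep the raw string if parsing fails
        rows.append({
            "sold_date": date_iso,
            "sold_price_gbp": int(price_str.replace(",", "")),
        })
    return rows
```

This returns one dict per transaction, oldest-to-newest in document order, which you can store alongside the single-row summary.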
Step 5: The dataset builder (paginate → fan out → write CSV)
Now we can build the full pipeline:
- start at a seed sold-prices URL
- collect listing IDs/URLs across pages
- de-dupe IDs
- fetch each listing page
- write a CSV
import csv
from pathlib import Path
def build_dataset(seed_url: str, out_csv: str = "rightmove_sold_prices.csv", max_pages: int = 25):
client = make_client()
# 1) crawl results pages
all_links: list[dict] = []
seen_ids: set[str] = set()
next_url = seed_url
page = 0
while next_url and page < max_pages:
page += 1
html = client.get(next_url)
links, next_url = parse_results_page(html)
added = 0
for item in links:
lid = item["listing_id"]
if lid in seen_ids:
continue
seen_ids.add(lid)
all_links.append(item)
added += 1
print(f"page={page} scraped_links={len(links)} added={added} total_unique={len(all_links)}")
if added == 0 and page >= 2:
# If we stop discovering new listings, stop early.
break
print("total listing urls:", len(all_links))
# 2) fetch listing pages
rows: list[dict] = []
for i, item in enumerate(all_links, start=1):
html = client.get(item["url"])
row = parse_listing_page(html, item["url"], item["listing_id"])
rows.append(row)
if i % 25 == 0:
print(f"fetched {i}/{len(all_links)}")
# 3) write CSV
out_path = Path(out_csv)
fieldnames = [
"listing_id",
"url",
"address",
"property_type",
"bedrooms",
"sold_date",
"sold_price_gbp",
]
with out_path.open("w", newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=fieldnames)
w.writeheader()
for r in rows:
w.writerow(r)
print("wrote", out_path, "rows=", len(rows))
if __name__ == "__main__":
# Paste a Rightmove sold-prices search URL here.
seed = "https://www.rightmove.co.uk/house-prices/area.html?locationIdentifier=REGION%5E87490"
build_dataset(seed_url=seed, out_csv="rightmove_sold_prices.csv", max_pages=15)
Debugging checklist (Rightmove-specific)
If you get blocked or parse zero links, check:
- Are you receiving a bot-check/consent page instead of results?
- Does parse_results_page() find any property links?
- Did Rightmove change the pagination pattern?
Practical fix order:
- Print the first 500 chars of the HTML you fetched.
- Save it to debug.html and open it locally.
- Add/adjust selectors based on the real markup.
- If responses vary (sometimes HTML, sometimes blocks), add ProxiesAPI routing.
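The first two items on that list can live in one tiny helper you call whenever a parse comes back empty:

```python
from pathlib import Path

def dump_debug(html: str, path: str = "debug.html") -> None:
    """Print a quick preview and save the full HTML for inspection."""
    print(html[:500])  # first 500 chars: enough to spot a block page
    Path(path).write_text(html, encoding="utf-8")
```

Open the saved file in a browser and compare it with what you see on rightmove.co.uk in a normal session.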
Where ProxiesAPI fits (honestly)
For small runs (one area, a few pages), you might get away without proxies.
But the moment you:
- paginate deeper
- run multiple areas
- re-run on a schedule
- parallelize listing fetches
…you’ll hit throttling.
ProxiesAPI is useful here because it makes the network layer more stable (fewer random failures), so your dataset job finishes consistently.
Next upgrades
- store results in SQLite with listing_id as the primary key (incremental updates)
- normalize addresses with a geocoder (careful with rate limits)
- extract full sold history (multiple transactions) if present
- add a “resume” mode that skips already-scraped IDs
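The SQLite and resume upgrades fit together naturally. A sketch, with illustrative table and column names:

```python
import sqlite3

def upsert_rows(db_path: str, rows: list) -> None:
    """listing_id as primary key; INSERT OR REPLACE keeps the
    latest row per listing, so re-runs are incremental."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS sold_prices (
               listing_id TEXT PRIMARY KEY,
               url TEXT, address TEXT, property_type TEXT,
               bedrooms INTEGER, sold_date TEXT, sold_price_gbp INTEGER)"""
    )
    con.executemany(
        """INSERT OR REPLACE INTO sold_prices
           VALUES (:listing_id, :url, :address, :property_type,
                   :bedrooms, :sold_date, :sold_price_gbp)""",
        rows,
    )
    con.commit()
    con.close()

def already_scraped(db_path: str) -> set:
    """IDs to skip in a resume run."""
    con = sqlite3.connect(db_path)
    try:
        ids = {r[0] for r in con.execute("SELECT listing_id FROM sold_prices")}
    except sqlite3.OperationalError:  # table doesn't exist yet
        ids = set()
    con.close()
    return ids
```

In build_dataset, filter all_links against already_scraped() before the fan-out loop, and call upsert_rows() instead of (or in addition to) writing the CSV.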