Scrape UK Property Prices from Rightmove (Sold Prices) with Python
Rightmove is one of the most useful public sources for UK property sold-price comps.
In this tutorial we’ll build a practical “dataset builder” that:
- searches Rightmove sold-property results for an area
- paginates through results incrementally
- extracts key fields from listing cards
- deduplicates by a stable listing identifier
- exports to CSV (and JSONL if you want)
We’ll keep the scraper honest:
- Rightmove HTML and APIs change over time
- some fields may be missing on some cards
- you should respect robots / ToS and throttle requests

Property portals rate-limit aggressively once you paginate or run daily jobs. ProxiesAPI gives you a consistent, proxy-backed request layer so your dataset builds don’t die mid-crawl.
What we’re scraping (page + structure)
Rightmove sold property results are served as a search results page that contains listing cards. The key things we want:
- a stable listing id (often present in links)
- address / title
- sold price (if shown)
- sold date (if shown)
- property type
- bedrooms
- estate agent (sometimes)
- listing URL
The big win: you can build a comps dataset without visiting every detail page by extracting from results cards first.
A quick sanity check
curl -sL "https://www.rightmove.co.uk/house-prices.html" | head -n 5
If you get blocked or challenged, that’s where a proxy layer helps.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- pandas just to make CSV export easy (optional)
ProxiesAPI request helper (drop-in)
ProxiesAPI typically fits as a network wrapper: you keep your parsing logic unchanged, but route requests through a proxy endpoint.
Below is a simple pattern that works well for “HTML-in, HTML-out” scrapers.
Create rightmove_scraper.py and set an env var:
export PROXIESAPI_KEY="YOUR_KEY"
import os
import time
import random
import urllib.parse

import requests

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "")
TIMEOUT = (15, 45)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
    "Accept-Language": "en-GB,en;q=0.9",
})

def fetch_html(url: str, *, use_proxiesapi: bool = True) -> str:
    """Fetch HTML. If use_proxiesapi is True, route the request via ProxiesAPI."""
    if use_proxiesapi:
        if not PROXIESAPI_KEY:
            raise RuntimeError("Set PROXIESAPI_KEY env var")
        # Generic proxy pattern: ProxiesAPI fetches the target URL and returns HTML.
        # Adjust the endpoint/params to your ProxiesAPI account’s format.
        proxied = (
            "https://api.proxiesapi.com"
            f"?api_key={urllib.parse.quote(PROXIESAPI_KEY)}"
            f"&url={urllib.parse.quote(url, safe='')}"
        )
        r = session.get(proxied, timeout=TIMEOUT)
    else:
        r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text

def polite_sleep(min_s=1.0, max_s=2.5):
    time.sleep(random.uniform(min_s, max_s))
Notes:
- The parsing stays the same whether you use proxies or not.
- If you run this daily or at scale, add retries and exponential backoff (a sketch follows below).
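Here's a minimal retry wrapper, assuming you want that now rather than later. The attempt count and backoff base are our own defaults to tune, and fetch_html_with_retries is our own helper name, not a library API:

def fetch_html_with_retries(url: str, max_attempts: int = 4) -> str:
    """Call fetch_html with exponential backoff + jitter between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_html(url)
        except requests.RequestException as exc:
            if attempt == max_attempts:
                raise
            # Back off 2s, 4s, 8s... plus jitter so retries don't align.
            delay = (2 ** attempt) + random.uniform(0, 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)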
Step 1: Build a sold-price search URL
Rightmove has multiple entry points (house prices pages, search results, etc.). For dataset-building, you want a URL that:
- returns a list of sold listings for a location
- supports pagination via a query param (commonly an index/start-style offset)
Because Rightmove URLs can be complex and change, treat the “URL builder” as a configuration step.
A pragmatic workflow:
- Go to Rightmove in your browser
- Search Sold prices for your target area
- Copy the resulting URL
- Paste it as BASE_RESULTS_URL
In code, we’ll take a base_url and then append/replace a pagination param.
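To make that concrete, here's a sketch. The URL below is illustrative only, not a working search; always paste your own:

# Illustrative only -- paste the URL from your own browser search instead.
base = "https://www.rightmove.co.uk/house-prices/example-area.html"
# Paging then means setting the offset param on that URL:
#   index=0  -> results 1-24
#   index=24 -> results 25-48
# The set_query_param helper in Step 3 does this without hand-editing strings.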
Step 2: Parse listing cards from HTML
We’ll extract:
- listing_id from links (best-effort)
- address/title
- price
- sold_date
- beds
- property_type
- url
Rightmove’s CSS selectors can change. The safest approach is:
- select “card containers”
- within each card, find the first link that looks like a property detail page
- parse text blocks defensively
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

RIGHTMOVE_BASE = "https://www.rightmove.co.uk"

def extract_int(text: str):
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None

def normalize_ws(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())

def parse_results_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    # Heuristic: results are often in list items / divs with links to /properties/
    # We’ll first collect property links, then climb to a likely container.
    for a in soup.select("a[href*='/properties/']"):
        href = a.get("href")
        if not href:
            continue
        url = href if href.startswith("http") else urljoin(RIGHTMOVE_BASE, href)
        # listing id is commonly in the URL path: /properties/<id>
        m = re.search(r"/properties/(\d+)", url)
        listing_id = m.group(1) if m else None
        # Find a container node to read card text
        card = a
        for _ in range(5):
            if card is None:
                break
            # stop at a block-like container
            if card.name in ("div", "li", "article"):
                break
            card = card.parent
        card_text = normalize_ws(card.get_text(" ", strip=True) if card else a.get_text(" ", strip=True))
        # Best-effort extraction. These patterns are not guaranteed.
        # Prices tend to look like £123,456
        pm = re.search(r"£\s?[\d,]+", card_text)
        price = pm.group(0).replace(" ", "") if pm else None
        # Sold date often contains month/year or "Sold" tokens
        dm = re.search(r"Sold\s+(in\s+)?([A-Za-z]{3,9}\s+\d{4})", card_text)
        sold_date = dm.group(2) if dm else None
        # Beds often appear as "3 bed" or "3 bedroom"
        bm = re.search(r"(\d+)\s+bed", card_text, re.IGNORECASE)
        beds = int(bm.group(1)) if bm else None
        # Property type is usually a word like "Terraced", "Semi-Detached" etc.
        tm = re.search(
            r"(Detached|Semi-Detached|Terraced|End of Terrace|Flat|Maisonette|Bungalow|Cottage)",
            card_text,
            re.IGNORECASE,
        )
        property_type = tm.group(1) if tm else None
        title = normalize_ws(a.get_text(" ", strip=True))
        out.append({
            "listing_id": listing_id,
            "title": title,
            "price": price,
            "sold_date": sold_date,
            "beds": beds,
            "property_type": property_type,
            "url": url,
        })
    # De-dupe by listing_id/url within the page
    uniq = []
    seen = set()
    for row in out:
        key = row.get("listing_id") or row.get("url")
        if not key or key in seen:
            continue
        seen.add(key)
        uniq.append(row)
    return uniq
This parser is designed to survive missing fields and still output something useful.
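A quick way to exercise it offline: save one results page to disk (page.html is an arbitrary file name) and feed it back through the parser, no network required:

# Offline test: parse a saved results page without hitting the network.
with open("page.html", encoding="utf-8") as f:
    rows = parse_results_page(f.read())
print(len(rows), "listings")
for row in rows[:3]:
    print(row)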
Step 3: Pagination + dedupe across pages
Rightmove results commonly use an index parameter (start offset) for paging.
We’ll implement pagination as:
- start at index=0
- step by page_size (often 24)
- stop when a page returns 0 new listings
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_query_param(url: str, key: str, value: str) -> str:
    parts = urlparse(url)
    q = parse_qs(parts.query)
    q[key] = [str(value)]
    new_query = urlencode(q, doseq=True)
    return urlunparse((parts.scheme, parts.netloc, parts.path, parts.params, new_query, parts.fragment))
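# Example behaviour (illustrative URLs, not real searches):
#   set_query_param("https://example.com/r?a=1", "index", "24")
#     -> "https://example.com/r?a=1&index=24"
#   set_query_param("https://example.com/r?index=0", "index", "24")
#     -> "https://example.com/r?index=24"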
def crawl_sold_results(base_results_url: str, pages: int = 10, page_size: int = 24):
    all_rows = []
    seen = set()
    for i in range(pages):
        index = i * page_size
        page_url = set_query_param(base_results_url, "index", str(index))
        html = fetch_html(page_url, use_proxiesapi=True)
        batch = parse_results_page(html)
        new_count = 0
        for row in batch:
            key = row.get("listing_id") or row.get("url")
            if not key or key in seen:
                continue
            seen.add(key)
            all_rows.append(row)
            new_count += 1
        print(f"page {i+1} index={index} batch={len(batch)} new={new_count} total={len(all_rows)}")
        if new_count == 0:
            break
        polite_sleep()
    return all_rows
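One gap worth noting: price stays a raw "£123,456" string. If you want a numeric column for analysis, a small helper can convert it (price_to_int is our own addition, not something Rightmove provides; it reuses the re module imported earlier):

def price_to_int(price):
    """Convert a '£123,456' string to the int 123456; None if unparseable."""
    if not price:
        return None
    digits = re.sub(r"[^\d]", "", price)
    return int(digits) if digits else None

You could then add a numeric column at export time, e.g. df["price_gbp"] = df["price"].map(price_to_int).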
Step 4: Export to CSV
import pandas as pd

def export_csv(rows: list[dict], path: str = "rightmove_sold_prices.csv"):
    df = pd.DataFrame(rows)
    # Keep stable column order
    cols = ["listing_id", "title", "price", "sold_date", "beds", "property_type", "url"]
    df = df[[c for c in cols if c in df.columns]]
    df.to_csv(path, index=False)
    print("wrote", path, len(df))
Full runnable script
Put it all together:
if __name__ == "__main__":
    # 1) In your browser, run a sold-prices search on Rightmove and paste the URL here.
    BASE_RESULTS_URL = "PASTE_RIGHTMOVE_SOLD_RESULTS_URL_HERE"
    rows = crawl_sold_results(BASE_RESULTS_URL, pages=20, page_size=24)
    export_csv(rows)
Run:
python rightmove_scraper.py
Common issues (and how to fix them)
1) The HTML looks different from what you see in your browser
Rightmove may render different HTML depending on headers / geo / bot signals.
Fixes:
- use a real desktop User-Agent
- add Accept-Language: en-GB
- fetch via ProxiesAPI (proxy-backed requests reduce challenge frequency)
2) Your selector returns zero items
Don’t guess selectors for hours.
Instead:
- save the HTML to a file: open("page.html", "w", encoding="utf-8").write(html)
- search for /properties/ in it
- build your extraction around links, not brittle class names
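For example, a quick debugging loop might look like this (assuming the fetch_html helper above and the BASE_RESULTS_URL from the full script; the file name is arbitrary):

html = fetch_html(BASE_RESULTS_URL)
# Save exactly what the server returned, then inspect it in an editor.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)
print(html.count("/properties/"), "occurrences of /properties/ in the raw HTML")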
3) Duplicates across pages
Some portals shuffle results between pages or repeat listings.
That’s why we dedupe by listing_id or url.
Where ProxiesAPI helps (realistically)
For Rightmove-style sites, you tend to get blocked when you:
- paginate deeply
- run from a single IP repeatedly (cron jobs)
- hit the site from cloud/VPS IP ranges
ProxiesAPI helps you keep the network layer stable while your parsing and export logic stays unchanged.
Next upgrades
- Store results in SQLite and do incremental updates (a starter sketch follows this list)
- Enrich each record by visiting the detail page (EPC rating, history, agent)
- Add retries with exponential backoff + jitter
- Schedule daily runs and only fetch “new since last run”
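As a starting point for the SQLite upgrade, here's a minimal upsert keyed on listing_id. The table and column names are our own choices, and ON CONFLICT ... DO UPDATE needs SQLite 3.24+ (bundled with any recent Python):

import sqlite3

def upsert_rows(rows: list[dict], db_path: str = "rightmove.db"):
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS sold_listings (
            listing_id TEXT PRIMARY KEY,
            title TEXT, price TEXT, sold_date TEXT,
            beds INTEGER, property_type TEXT, url TEXT
        )
    """)
    # Skip rows without a stable id; NULL keys can't be upserted meaningfully.
    keyed = [r for r in rows if r.get("listing_id")]
    con.executemany("""
        INSERT INTO sold_listings
            (listing_id, title, price, sold_date, beds, property_type, url)
        VALUES (:listing_id, :title, :price, :sold_date, :beds, :property_type, :url)
        ON CONFLICT(listing_id) DO UPDATE SET
            price = excluded.price,
            sold_date = excluded.sold_date
    """, keyed)
    con.commit()
    con.close()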
And if your crawls keep dying to rate limits, a proxy-backed request layer like ProxiesAPI is the piece to add first — everything else in this tutorial stays unchanged.