Scrape UK Property Prices from Rightmove (Sold Prices Dataset Builder + Screenshots)
Rightmove is one of the most useful sources for UK property research. The challenge is that the Sold Prices experience is built for humans (search, filters, pagination), not for exporting a dataset.
In this guide you’ll build a repeatable dataset builder that:
- starts from a Sold Prices search URL
- follows pagination safely
- extracts sold-price cards (address, price, sold date, property type where available)
- de-duplicates results
- exports a clean CSV you can load into a spreadsheet / database
We’ll keep the scraping honest:
- we’ll parse real server-rendered HTML (no “guessing” selectors)
- we’ll use timeouts + retries
- we’ll show exactly where ProxiesAPI fits (network stability), without pretending it “unblocks everything”

Property sites can rate-limit, geo-fence, or intermittently block requests. ProxiesAPI gives you a consistent proxy layer so your dataset jobs finish reliably, even as URL counts grow.
What we’re scraping (Rightmove Sold Prices)
Rightmove has multiple surfaces. For this tutorial we focus on:
- Sold Prices search results pages (multiple result cards)
- pagination (next page / page index)
You’ll typically start from a URL you can produce manually by applying filters in your browser.
Quick sanity check (HTML is there)
Before writing any parser, confirm Rightmove returns HTML (not a blank JS shell):
curl -sL "https://www.rightmove.co.uk/house-prices.html" | head -n 5
If you get an HTML document, you can parse it with BeautifulSoup.
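The same check from Python, once requests is installed (a minimal sketch, just to confirm you're getting a real HTML document back rather than an empty JS shell):

import requests

resp = requests.get("https://www.rightmove.co.uk/house-prices.html", timeout=30)
print(resp.status_code, len(resp.text))
print(resp.text[:200])  # should show the start of an HTML document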
Note: Rightmove’s exact Sold Prices URLs can change over time. The scraper below is structured so you only need to update a small set of CSS selectors if Rightmove tweaks markup.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for stable parsing
- tenacity for retry/backoff
The fetch layer (with ProxiesAPI + retries)
Most scraping failures are network-ish:
- transient 5xx
- timeouts
- occasional 403/429 bursts
So we make fetching robust first.
Option A: Direct requests (baseline)
import random
import time

import requests

TIMEOUT = (10, 30)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0 Safari/537.36"
    ),
    "Accept-Language": "en-GB,en;q=0.9",
})

def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True)
    r.raise_for_status()
    return r.text
Option B: Same fetch, routed via ProxiesAPI
ProxiesAPI is typically used by pointing your HTTP client at a proxy endpoint.
Because proxy providers differ in exact connection details, this snippet is intentionally “drop-in”: you set your proxy URL in an environment variable and the rest of your code stays the same.
import os

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")  # e.g. http://USER:PASS@gateway.proxiesapi.com:1234

proxies = None
if PROXY_URL:
    proxies = {
        "http": PROXY_URL,
        "https": PROXY_URL,
    }

def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)
    r.raise_for_status()
    return r.text
If PROXIESAPI_PROXY_URL is not set, requests simply go direct (no proxy) and the rest of the code stays the same.
Add retries (recommended)
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch_html(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)
    # Treat common block-ish responses as retryable.
    if r.status_code in (403, 429, 500, 502, 503):
        raise requests.HTTPError(f"status={r.status_code}")
    r.raise_for_status()
    return r.text
Step 1: Identify listing cards + fields
Rightmove’s Sold Prices results are laid out as repeated “cards” (list items / divs) containing:
- address text
- sold price
- sold date (or transaction date)
- a details link
Because markup changes, don’t hardcode one brittle selector and pray.
Instead, build your parser around:
- finding the card container
- extracting fields using a small set of fallback selectors
Here’s a practical parser you can adapt quickly.
import re

from bs4 import BeautifulSoup

def clean_text(s: str | None) -> str | None:
    if not s:
        return None
    s = re.sub(r"\s+", " ", s).strip()
    return s or None

def parse_money(text: str | None) -> int | None:
    if not text:
        return None
    # e.g. "£425,000" → 425000
    m = re.search(r"£\s*([\d,]+)", text)
    if not m:
        return None
    return int(m.group(1).replace(",", ""))
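# Quick illustrative check of the money parser (not part of the scraper itself):
# parse_money("Sold for £425,000") -> 425000; parse_money("no price") -> None
assert parse_money("Sold for £425,000") == 425000
assert parse_money("no price") is None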
def parse_sold_results(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Card candidates — try common patterns.
    # Update these if Rightmove changes markup.
    card_selectors = [
        "div[class*='soldPrice']",
        "div[class*='SoldPrice']",
        "li[class*='soldPrice']",
        "div[class*='propertyCard']",
    ]

    cards = []
    for sel in card_selectors:
        found = soup.select(sel)
        if len(found) >= 5:
            cards = found
            break

    out = []
    for c in cards:
        # Address
        address = None
        for sel in ["address", "h2", "h3", "span[class*='address']"]:
            el = c.select_one(sel)
            t = clean_text(el.get_text(" ", strip=True) if el else None)
            if t and len(t) > 6:
                address = t
                break

        # Price
        # BeautifulSoup doesn't support :contains reliably across parsers,
        # so we just scan the card text for a £ value.
        money = parse_money(c.get_text(" ", strip=True))

        # Sold date — look for something that resembles a month/year.
        sold_date = None
        txt = clean_text(c.get_text(" ", strip=True)) or ""
        # Example patterns: "Sold on 12 Jan 2024" or "Jan 2024"
        m = re.search(r"Sold\s+on\s+([0-9]{1,2}\s+\w+\s+\d{4})", txt)
        if m:
            sold_date = m.group(1)
        else:
            m2 = re.search(r"\b(\w+\s+\d{4})\b", txt)
            sold_date = m2.group(1) if m2 else None

        # Details link (if present)
        link = None
        a = c.select_one("a[href]")
        if a and a.get("href"):
            href = a.get("href")
            if href.startswith("http"):
                link = href
            else:
                link = "https://www.rightmove.co.uk" + href

        out.append({
            "address": address,
            "sold_price_gbp": money,
            "sold_date": sold_date,
            "details_url": link,
        })

    # Filter obviously-bad rows
    out = [r for r in out if r.get("sold_price_gbp") and r.get("address")]
    return out
Why this “selector-light” approach works
For production scraping, you want the fewest selectors you can maintain.
- If you tie your scraper to 12 class names, a minor CSS refactor breaks you.
- If you identify cards in a resilient way and extract values from text, you have fewer moving parts.
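One way to keep the fallback idea maintainable is a tiny helper that tries a short, ordered list of selectors and returns the first match. This is a sketch (the selector list shown is illustrative, not guaranteed to match Rightmove's current markup); it could replace the inline loops in parse_sold_results:

from bs4 import Tag

def select_first(node: Tag, selectors: list[str]) -> Tag | None:
    # Return the first element matched by any selector, in priority order.
    for sel in selectors:
        el = node.select_one(sel)
        if el is not None:
            return el
    return None

# Example usage (illustrative):
# ADDRESS_SELECTORS = ["address", "h2", "h3", "span[class*='address']"]
# el = select_first(card, ADDRESS_SELECTORS)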
Step 2: Pagination (crawl multiple result pages)
The crawl shape is:
- Fetch start URL
- Parse result cards
- Find “next page” URL
- Repeat until you hit page limit / no next link
Because pagination markup changes, we’ll implement a couple of strategies:
- look for a link whose rel/name indicates next
- fallback to searching for “Next” anchor text
from urllib.parse import urljoin

def find_next_page_url(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Strategy 1: rel=next
    a = soup.select_one("a[rel='next'][href]")
    if a:
        return urljoin(current_url, a.get("href"))

    # Strategy 2: explicit 'Next' label
    for a in soup.select("a[href]"):
        t = (a.get_text(" ", strip=True) or "").lower()
        if t in ("next", "next page", "next >", ">"):
            return urljoin(current_url, a.get("href"))

    return None
And the crawl loop:
import csv

def crawl_sold_prices(start_url: str, max_pages: int = 10) -> list[dict]:
    url = start_url
    page = 0
    seen = set()
    all_rows: list[dict] = []

    while url and page < max_pages:
        page += 1
        html = fetch_html(url)
        rows = parse_sold_results(html)

        new_count = 0
        for r in rows:
            key = (r.get("address"), r.get("sold_price_gbp"), r.get("sold_date"))
            if key in seen:
                continue
            seen.add(key)
            all_rows.append(r)
            new_count += 1

        print(f"page {page}: parsed={len(rows)} new={new_count} total={len(all_rows)}")

        url = find_next_page_url(html, url)

        # polite pacing (tune for your use case)
        time.sleep(random.uniform(1.0, 2.5))

    return all_rows

def write_csv(rows: list[dict], path: str) -> None:
    cols = ["address", "sold_price_gbp", "sold_date", "details_url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in cols})

if __name__ == "__main__":
    START = "https://www.rightmove.co.uk/house-prices.html"  # replace with a Sold Prices result URL
    data = crawl_sold_prices(START, max_pages=5)
    write_csv(data, "rightmove_sold_prices.csv")
    print("wrote rightmove_sold_prices.csv", len(data))
Screenshot proof (why it matters)
When you’re building a dataset pipeline, screenshots are useful for:
- auditing what your scraper saw on day 0
- debugging when parsing drops (markup changed)
- sharing evidence with stakeholders
In this post we captured the Sold Prices results page:
/public/images/posts/scrape-rightmove-sold-prices-dataset/rightmove-sold-prices-results.jpg
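If you want to capture that evidence automatically as part of a run, one option is a headless browser. This is an optional sketch using Playwright, which is an extra dependency not included in the setup above (pip install playwright, then playwright install chromium):

from playwright.sync_api import sync_playwright

def capture_screenshot(url: str, path: str) -> None:
    # Open the page in headless Chromium and save a full-page screenshot.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=60_000)
        page.screenshot(path=path, full_page=True)
        browser.close()

# capture_screenshot(START, "rightmove-sold-prices-results.png")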
Common Rightmove scraping pitfalls (and fixes)
1) You get intermittent 403/429
Fix:
- add retries + exponential backoff
- reduce concurrency
- route traffic through a proxy layer (ProxiesAPI)
2) Your selectors stop matching
Fix:
- log a small HTML sample on failure (a minimal sketch after this list)
- keep selectors centralized (one file / one section)
- prefer resilient extraction from text when possible
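A minimal sketch of the HTML-sample idea: if a page parses to zero rows, dump a truncated copy of the page so you can inspect what changed later (the directory name and size limit are arbitrary choices here):

from pathlib import Path

def log_html_sample(html: str, page: int, out_dir: str = "debug_html") -> None:
    # Save the first ~50 KB of the page so you can diff markup changes later.
    Path(out_dir).mkdir(exist_ok=True)
    sample_path = Path(out_dir) / f"page_{page}.html"
    sample_path.write_text(html[:50_000], encoding="utf-8")

# In the crawl loop:
# if not rows:
#     log_html_sample(html, page)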
3) Pagination is inconsistent
Fix:
- implement multiple “find next” strategies
- cap pages per run
- maintain a queue of discovered URLs (sketched below)
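Here's one minimal shape for that queue idea (an illustrative sketch, not wired into the crawler above): keep a frontier of URLs to visit and a set of URLs already visited, so rediscovered pages don't get fetched twice.

from collections import deque

def crawl_queue(start_url: str, max_pages: int = 10) -> list[dict]:
    frontier = deque([start_url])
    visited: set[str] = set()
    rows: list[dict] = []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        html = fetch_html(url)
        rows.extend(parse_sold_results(html))

        nxt = find_next_page_url(html, url)
        if nxt and nxt not in visited:
            frontier.append(nxt)

    return rows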
Where ProxiesAPI fits (honestly)
Rightmove (like many property portals) can be sensitive to:
- repeated requests from one IP
- bursty traffic patterns
- long crawls that run for hours
ProxiesAPI doesn’t magically guarantee access, but it improves crawl stability by:
- giving you a consistent proxy endpoint
- enabling IP rotation (depending on your plan/config)
- reducing the impact of per-IP throttling
You still need good scraping hygiene: timeouts, retries, pacing, and respectful volume.
QA checklist
- Start URL loads in a browser
- First page yields at least 10 rows with price + address
- Pagination advances and total rows increases
- CSV opens cleanly in Excel/Sheets
- On network failure, your retries recover
Next upgrades
- Add a details-page fetch (bedrooms, tenure, agent, EPC) with a second-stage crawler
- Store into SQLite for incremental updates (a minimal sketch after this list)
- Add a “changed since last run” diff so you only process new transactions
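For the SQLite upgrade, a minimal sketch (the table name and columns are illustrative, mirroring the CSV fields above; a uniqueness constraint plus INSERT OR IGNORE gives you cheap incremental updates across runs):

import sqlite3

def upsert_rows(rows: list[dict], db_path: str = "rightmove.db") -> None:
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS sold_prices (
            address TEXT,
            sold_price_gbp INTEGER,
            sold_date TEXT,
            details_url TEXT,
            UNIQUE(address, sold_price_gbp, sold_date)
        )
    """)
    con.executemany(
        "INSERT OR IGNORE INTO sold_prices (address, sold_price_gbp, sold_date, details_url) "
        "VALUES (:address, :sold_price_gbp, :sold_date, :details_url)",
        rows,
    )
    con.commit()
    con.close()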
Property sites can rate-limit, geo-fence, or intermittently block requests. ProxiesAPI gives you a consistent proxy layer so your dataset jobs finish reliably, even as URL counts grow.