Scrape UK Property Prices from Rightmove (Dataset Builder + Screenshots)
Rightmove’s Sold House Prices pages are one of the most useful public sources for UK property research.
In this tutorial, we’ll build a dataset builder that:
- starts from a Sold Prices search URL you choose
- crawls multiple results pages
- extracts the property detail URLs + key fields
- fetches each property detail page to enrich the record
- exports CSV (for spreadsheets) and JSONL (for pipelines)
- takes a real screenshot of the target pages (for proof / debugging)
We’ll use plain Python with requests + BeautifulSoup, and we’ll show where ProxiesAPI fits in the network layer.

Rightmove can rate-limit or serve different markup when you crawl many result pages and property details. ProxiesAPI helps keep large crawls consistent when you move from a few searches to city-wide datasets.
Important note (data + legality)
- Always review Rightmove’s Terms and the rules in your jurisdiction.
- Use polite rates and cache results when possible.
- Don’t scrape personal data.
This guide focuses on technical robustness (timeouts, retries, pagination, defensive parsing) so your code doesn’t break the moment you scale.
What we’re scraping (Rightmove structure)
Rightmove has multiple sections; for sold prices, you’ll often see URLs under patterns like:
- area pages: https://www.rightmove.co.uk/house-prices/...html
- result pages that list sold transactions / property cards
- property detail pages (sometimes on a different path)
The exact HTML can change over time and can differ by region and experiment cohort.
So the strategy is:
- Start from a seed Sold Prices page you can open in your browser
- Parse the HTML for:
- individual result cards (property rows)
- links to property details
- key visible fields (address, sold price, sold date)
- Follow the property links to enrich the record (when present)
- Keep parsing defensive (multiple selectors, fallbacks)
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas tenacity
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- tenacity for retries
- pandas for a simple CSV export (optional, but convenient)
ProxiesAPI integration (honest + minimal)
A lot of Rightmove exploration works fine from your own IP for small runs.
But datasets tend to have this shape:
- paginate through many result pages
- follow many property detail links
- re-run regularly
That’s where you start seeing:
- throttling (429)
- inconsistent responses
- intermittent blocks
We’ll implement the HTTP client so you can switch between:
- direct requests (baseline)
- ProxiesAPI-backed requests (more stable at scale)
Environment variables
Set these (example names — adapt to your ProxiesAPI account/docs):
export PROXIESAPI_KEY="YOUR_KEY"
If your ProxiesAPI integration uses a proxy URL (common pattern), you’ll pass it into requests via the proxies= argument.
Step 1: Build a robust fetch() with retries, timeouts, and optional proxy
import os
import random
import time
from urllib.parse import urljoin
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
TIMEOUT = (10, 40) # connect, read
BASE = "https://www.rightmove.co.uk"
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
]
class FetchError(Exception):
pass
def build_session() -> requests.Session:
s = requests.Session()
s.headers.update({
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"accept-language": "en-GB,en;q=0.9",
"cache-control": "no-cache",
"pragma": "no-cache",
})
return s
def proxiesapi_proxies() -> dict | None:
"""Return a requests-compatible proxies dict if configured.
This is intentionally generic: your ProxiesAPI account may provide a proxy endpoint or a fetch API.
If you use a proxy endpoint, it often looks like:
http://USERNAME:PASSWORD@proxy.proxiesapi.com:PORT
Put the full proxy URL in an env var and we’ll use it.
"""
proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
if not proxy_url:
return None
return {"http": proxy_url, "https": proxy_url}
session = build_session()
@retry(
reraise=True,
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=1, max=20),
retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str, *, use_proxy: bool = False) -> str:
headers = {"user-agent": random.choice(USER_AGENTS)}
proxies = proxiesapi_proxies() if use_proxy else None
r = session.get(url, headers=headers, timeout=TIMEOUT, proxies=proxies)
# Common "soft block" patterns:
if r.status_code in (403, 429):
raise FetchError(f"blocked/throttled: {r.status_code}")
r.raise_for_status()
# Defensive: ensure HTML-ish content
ct = (r.headers.get("content-type") or "").lower()
if "html" not in ct and "text" not in ct:
raise FetchError(f"unexpected content-type: {ct}")
# Basic politeness (jitter)
time.sleep(0.3 + random.random() * 0.4)
return r.text
def abs_url(href: str) -> str:
return href if href.startswith("http") else urljoin(BASE, href)
Notes:
- We retry on network failures and “blocked” responses.
- We rotate a small UA list.
- We optionally route traffic via ProxiesAPI using PROXIESAPI_PROXY_URL.
If your ProxiesAPI integration is a fetch API instead of a proxy, you’d replace session.get(url) with a call to that API and return the HTML body.
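As a sketch of that variant (the endpoint and both query-parameter names here are assumptions for illustration; check your ProxiesAPI dashboard for the real ones):

```python
import os
from urllib.parse import urlencode

# Hypothetical endpoint; confirm the real one in your ProxiesAPI account.
API_ENDPOINT = "http://api.proxiesapi.com/"

def proxiesapi_request_url(target_url: str) -> str:
    """Build a fetch-API request URL with the key and target as query params.

    The parameter names ("auth_key", "url") are placeholders, not the
    documented API: adapt them to what your dashboard shows.
    """
    key = os.environ["PROXIESAPI_KEY"]
    return API_ENDPOINT + "?" + urlencode({"auth_key": key, "url": target_url})

# Inside fetch(), you would then do roughly:
#   r = session.get(proxiesapi_request_url(url), timeout=TIMEOUT)
# and return r.text as before, keeping the same retry/status checks.
```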
Step 2: Parse sold-price result cards (defensive selectors)
Rightmove’s sold pages can vary. Instead of relying on one brittle selector, we’ll:
- look for multiple possible card containers
- extract links that look like property detail links
- parse visible price/date/address text
import re
from bs4 import BeautifulSoup
PRICE_RE = re.compile(r"£\s?([\d,]+)")
def clean_text(x: str) -> str:
return re.sub(r"\s+", " ", (x or "").strip())
def parse_price(text: str) -> int | None:
m = PRICE_RE.search(text or "")
if not m:
return None
return int(m.group(1).replace(",", ""))
def parse_list_page(html: str) -> tuple[list[dict], str | None]:
soup = BeautifulSoup(html, "lxml")
# Candidate containers: update as Rightmove changes
cards = soup.select("[data-test='sold-history-card'], [data-testid*='sold'], li, article")
items = []
seen_urls = set()
for c in cards:
a = c.select_one("a[href]")
if not a:
continue
href = a.get("href")
if not href:
continue
url = href
# Skip non-property links quickly
if any(x in url for x in ("/mortgages/", "/agent/", "/commercial-property/")):
continue
url = abs_url(url)
if url in seen_urls:
continue
seen_urls.add(url)
text = clean_text(c.get_text(" ", strip=True))
price = parse_price(text)
# Very rough sold date extraction; expect to refine per page type
sold_date = None
m = re.search(r"Sold\s+(?:in\s+)?([A-Za-z]+\s+\d{4}|\d{1,2}\s+[A-Za-z]+\s+\d{4})", text)
if m:
sold_date = m.group(1)
items.append({
"list_url": None, # fill at call-site
"property_url": url,
"sold_price_gbp": price,
"sold_date_raw": sold_date,
"card_text": text,
})
# Pagination: look for a "next" link
next_a = soup.select_one("a[rel='next'], a:has(svg[aria-label='Next']), a[aria-label*='Next']")
next_url = abs_url(next_a.get("href")) if next_a and next_a.get("href") else None
return items, next_url
This parser is intentionally conservative. For real production use, you’ll open your target page in DevTools and tighten selectors around the actual sold-card markup you see.
Step 3: Enrich each property by fetching its detail page
On a property page, useful fields might include:
- address (canonical)
- property type
- bedrooms
- “sold price history” table rows
Again: these can move around. We’ll implement a best-effort parser that won’t crash.
def parse_property_page(html: str) -> dict:
soup = BeautifulSoup(html, "lxml")
# Address candidates
h1 = soup.select_one("h1")
address = clean_text(h1.get_text(" ", strip=True)) if h1 else None
# Try to find a simple key/value list or table
meta = {}
for row in soup.select("table tr"):
th = row.select_one("th")
td = row.select_one("td")
if not th or not td:
continue
k = clean_text(th.get_text(" ", strip=True)).lower()
v = clean_text(td.get_text(" ", strip=True))
if k and v:
meta[k] = v
# A common pattern is data embedded as JSON in script tags
scripts = "\n".join([s.get_text(" ", strip=True) for s in soup.select("script") if s.get_text(strip=True)])
return {
"address": address,
"meta": meta,
"has_scripts": bool(scripts),
}
If you want a truly stable approach, you’ll often parse embedded JSON state (when available) rather than fragile text nodes.
Step 4: Crawl pages → build dataset
Now we tie it together:
- start from a Sold Prices URL
- parse cards
- follow pagination
- then fetch each property URL to enrich
import json
from dataclasses import dataclass
@dataclass
class CrawlConfig:
start_url: str
max_pages: int = 10
max_properties: int = 200
use_proxy: bool = True
def crawl_sold_prices(cfg: CrawlConfig) -> list[dict]:
out = []
next_url = cfg.start_url
page = 0
while next_url and page < cfg.max_pages and len(out) < cfg.max_properties:
page += 1
html = fetch(next_url, use_proxy=cfg.use_proxy)
items, new_next = parse_list_page(html)
for it in items:
it["list_url"] = next_url
out.append(it)
if len(out) >= cfg.max_properties:
break
print(f"page={page} items={len(items)} total={len(out)}")
next_url = new_next
return out
def enrich_properties(rows: list[dict], *, use_proxy: bool = True) -> list[dict]:
enriched = []
for i, r in enumerate(rows, 1):
url = r["property_url"]
try:
html = fetch(url, use_proxy=use_proxy)
details = parse_property_page(html)
except Exception as e:
details = {"error": str(e)}
merged = {**r, **details}
enriched.append(merged)
if i % 10 == 0:
print(f"enriched {i}/{len(rows)}")
return enriched
if __name__ == "__main__":
seed = "https://www.rightmove.co.uk/house-prices.html" # Replace with a specific sold-prices URL
cfg = CrawlConfig(
start_url=seed,
max_pages=5,
max_properties=50,
use_proxy=True,
)
rows = crawl_sold_prices(cfg)
rows = enrich_properties(rows, use_proxy=cfg.use_proxy)
with open("rightmove_sold.jsonl", "w", encoding="utf-8") as f:
for r in rows:
f.write(json.dumps(r, ensure_ascii=False) + "\n")
print("wrote rightmove_sold.jsonl", len(rows))
Export to CSV (optional)
import pandas as pd
def export_csv(rows: list[dict], path: str = "rightmove_sold.csv"):
df = pd.json_normalize(rows)
df.to_csv(path, index=False)
print("wrote", path, len(df))
CSV is great for analysis; JSONL is better for incremental pipelines.
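One reason JSONL suits pipelines: records can be appended and re-read one line at a time without loading the whole file. A minimal reader for the rightmove_sold.jsonl file written above:

```python
import json
from typing import Iterator

def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per line; skip blank lines so partial appends are safe."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```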
Screenshot step (proof + debugging)
When you’re building scrapers, screenshots are useful for:
- confirming you’re on the right page
- capturing the UI state for clients / compliance
- debugging when markup changes
You have two good options:
Option A: Manual screenshot (fastest)
Open your seed URL in a normal browser and save a screenshot.
Store it at:
public/images/posts/scrape-uk-property-prices-from-rightmove-dataset-builder-screenshots/rightmove-sold-prices.jpg
Option B: Automated screenshot with Playwright
pip install playwright
python -m playwright install chromium
from playwright.sync_api import sync_playwright
def screenshot(url: str, out_path: str):
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page(viewport={"width": 1280, "height": 720})
page.goto(url, wait_until="domcontentloaded", timeout=60000)
page.wait_for_timeout(1500)
page.screenshot(path=out_path, full_page=True)
browser.close()
if __name__ == "__main__":
screenshot(
"https://www.rightmove.co.uk/house-prices.html", # replace with a sold-prices search URL
"rightmove-sold-prices.jpg",
)
If the page is heavy or geo-sensitive, pair Playwright with ProxiesAPI (proxy settings) to make the render more consistent.
QA checklist
- Seed URL opens in your browser
- parse_list_page() returns property URLs (spot-check 5)
- Pagination finds next_url (or you provide explicit page URLs)
- Enrichment succeeds for most URLs
- Export files contain expected columns
- You captured a screenshot and checked it into the right folder
Next upgrades (if you’re going big)
- parse embedded JSON state when available (more stable than HTML selectors)
- add caching (SQLite) so you don’t re-fetch the same property URLs
- add de-duplication by a stable property id
- parallelize enrichment with a small worker pool (keep rate-limits in mind)
- schedule daily updates and only crawl new sold transactions