Scrape Real Estate Listings from Realtor.com (Python + ProxiesAPI)
Realtor.com is one of the biggest real-estate portals in the US — which also makes it a high-friction scraping target.
In this guide we’ll build a practical Python scraper that:
- visits a Realtor.com search results page
- extracts listing URLs + core fields (price, beds, baths, address)
- paginates through multiple result pages
- exports to CSV
- uses a ProxiesAPI-backed fetch function so you can scale more reliably

Real estate sites tend to rate-limit and fingerprint aggressively. ProxiesAPI gives you a stable network layer (rotating IPs + retries) so your scraper spends less time failing and more time collecting listings.
What we’re scraping (and what can break)
On Realtor.com, the results UI can change and it may be partially client-rendered. That means:
- selectors can shift (class names change, fields move)
- some data may be missing in HTML depending on geo/cookies
- rate limits / bot protections can trigger (timeouts, 403s, interstitials)
So the goal here is not “one magical selector”. The goal is a workflow:
- fetch reliably (timeouts + retries + proxy)
- detect what the page contains
- extract what’s available, gracefully
- iterate on selectors when the UI shifts
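The "detect" step above can be sketched as a tiny stdlib-only triage function. The marker strings here are assumptions — verify them against responses you actually see:

```python
import re

def classify_page(html: str) -> str:
    """Rough triage of a fetched page; marker strings are guesses, adjust to taste."""
    low = html.lower()
    # Interstitial / block-page markers (assumed wording, not guaranteed)
    if "unusual traffic" in low or "access denied" in low:
        return "blocked"
    # Detail links are the most stable signal of a real results page
    if "/realestateandhomes-detail/" in low:
        return "results"
    return "unknown"

print(classify_page("<a href='/realestateandhomes-detail/123'>$1,200,000</a>"))  # results
```

Routing on a classification like this (retry on "blocked", parse on "results", snapshot on "unknown") is what makes the workflow survive UI changes.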
Prereqs
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (with the lxml parser) for parsing
- csv from the standard library for export
Step 1: A safe fetch() with retries (and ProxiesAPI)
You’ll reuse this pattern everywhere.
Below is a drop-in fetch layer:
- sets realistic timeouts
- adds a browser-ish User-Agent
- retries transient failures
- optionally routes requests through ProxiesAPI
Note: ProxiesAPI integration depends on the exact endpoint/key format in your account. The code below is written to be explicit and easy to adapt: you only need to adjust PROXIESAPI_URL and the query parameters to match your ProxiesAPI docs.
import os
import time
import random
from urllib.parse import urlencode

import requests

TIMEOUT = (10, 35)  # connect, read
MAX_RETRIES = 5

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
}

def build_proxiesapi_url(target_url: str) -> str:
    """Return a ProxiesAPI-wrapped URL for a target.

    Adapt this to your ProxiesAPI account format.
    Common patterns are either:
    - https://api.proxiesapi.com/?auth_key=...&url=<encoded>
    - https://proxy.proxiesapi.com/?api_key=...&url=<encoded>
    """
    api_key = os.environ.get("PROXIESAPI_KEY")
    if not api_key:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")
    base = os.environ.get("PROXIESAPI_URL", "https://api.proxiesapi.com")
    qs = urlencode({
        "api_key": api_key,
        "url": target_url,
    })
    return f"{base}/?{qs}"

def fetch(url: str, *, use_proxiesapi: bool = True) -> str:
    attempt = 0
    while True:
        attempt += 1
        try:
            final_url = build_proxiesapi_url(url) if use_proxiesapi else url
            r = session.get(final_url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
            # Some anti-bot flows return 200 with an interstitial.
            # We still raise for typical HTTP errors.
            r.raise_for_status()
            text = r.text or ""
            if "unusual traffic" in text.lower() or "our systems have detected" in text.lower():
                raise RuntimeError("Blocked by interstitial (detected unusual traffic)")
            return text
        except Exception as e:
            if attempt >= MAX_RETRIES:
                raise
            # exponential backoff + jitter
            sleep_s = min(20, 2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"fetch failed (attempt {attempt}/{MAX_RETRIES}): {e} — sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)
If you want to debug selectors without proxies, just call:
html = fetch(url, use_proxiesapi=False)
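While you iterate on selectors, it helps to snapshot fetched HTML to disk so you can re-parse it offline instead of re-fetching. A minimal helper (the "snapshots" directory name is just a suggestion):

```python
import time
from pathlib import Path

def save_snapshot(html: str, label: str, out_dir: str = "snapshots") -> Path:
    """Write fetched HTML to disk so you can iterate on selectors offline."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    path = Path(out_dir) / f"{label}-{int(time.time())}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Call it right after a successful fetch (e.g. `save_snapshot(html, "sf-search")`), then feed the saved file back into your parser during development.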
Step 2: Find a stable entry point (a search URL)
Realtor.com search URLs are typically state/city/zip-based. Example pattern:
https://www.realtor.com/realestateandhomes-search/San-Francisco_CA
Pick one location as your baseline and don’t change it while building selectors.
SEARCH_URL = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"
html = fetch(SEARCH_URL)
print(len(html))
print(html[:200])
Step 3: Parse listing cards (defensive selectors)
Instead of betting on one brittle class name, we:
- look for anchors that resemble property detail links
- try multiple ways to locate price / beds / baths / address
- keep raw HTML snippets around during dev (optional)
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.realtor.com"

def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())

def parse_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    seen = set()
    # Heuristic: property detail links often contain "/realestateandhomes-detail/".
    # If Realtor changes this, update the substring.
    for a in soup.select("a[href]"):
        href = a.get("href") or ""
        if "/realestateandhomes-detail/" not in href:
            continue
        url = href if href.startswith("http") else urljoin(BASE, href)
        if url in seen:
            continue
        # Walk up to a likely card container
        card = a
        for _ in range(6):
            if not card:
                break
            if card.name in ("li", "div", "article"):
                # stop early when we hit a container with enough text
                if len(clean_text(card.get_text(" ", strip=True))) > 40:
                    break
            card = card.parent
        container = card if card else a.parent
        text_blob = clean_text(container.get_text(" ", strip=True) if container else a.get_text(" ", strip=True))
        # Price heuristic: "$" followed by digits/commas
        price = None
        m = re.search(r"\$\s?([0-9,]+)", text_blob)
        if m:
            price = "$" + m.group(1)
        # Beds/baths heuristics (e.g., "3 bed", "2.5 bath")
        beds = None
        baths = None
        mb = re.search(r"(\d+(?:\.\d+)?)\s*(?:bd|bed)s?", text_blob, re.I)
        if mb:
            beds = mb.group(1)
        mba = re.search(r"(\d+(?:\.\d+)?)\s*(?:ba|bath)s?", text_blob, re.I)
        if mba:
            baths = mba.group(1)
        # Address heuristic: look for something that resembles street + city.
        # This is intentionally loose; tighten it for your target.
        address = None
        # Try aria-label first
        if a.get("aria-label"):
            address = clean_text(a.get("aria-label"))
        else:
            # fallback: first ~80 chars of container text
            address = text_blob[:80] if text_blob else None
        out.append({
            "url": url,
            "price": price,
            "beds": beds,
            "baths": baths,
            "address": address,
        })
        seen.add(url)
    return out

listings = parse_listings(html)
print("listings:", len(listings))
print(listings[:2])
listings = parse_listings(html)
print("listings:", len(listings))
print(listings[:2])
Why this approach?
Real estate result pages frequently shuffle their DOM. Anchors to detail pages are often the most stable “spine” — if you can find detail links, you can usually back into the card.
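To see the price/beds/baths heuristics in isolation, here is a self-contained check against a synthetic card blob (the text is made up for illustration, not real Realtor markup):

```python
import re

# A plausible card text blob after whitespace normalization (invented example)
blob = "123 Example St, San Francisco, CA 94110 $1,250,000 3 bed 2.5 bath 1,450 sqft"

price = re.search(r"\$\s?([0-9,]+)", blob)                          # "$" then digits/commas
beds = re.search(r"(\d+(?:\.\d+)?)\s*(?:bd|bed)s?", blob, re.I)     # "3 bed" / "3 bds"
baths = re.search(r"(\d+(?:\.\d+)?)\s*(?:ba|bath)s?", blob, re.I)   # "2.5 bath" / "2.5 ba"

print(price.group(1), beds.group(1), baths.group(1))  # 1,250,000 3 2.5
```

Testing the regexes on saved text blobs like this is much faster than re-fetching live pages every time a pattern needs tuning.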
Step 4: Pagination
Realtor’s pagination can vary; sometimes it’s an explicit pg-2 style path, sometimes query params, sometimes JS.
So we’ll implement two strategies:
- try to find a “Next” link in HTML
- if not found, try a best-effort URL pattern and stop when results stop changing
from bs4 import BeautifulSoup

def find_next_url(html: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    # Try common patterns: rel=next, or an anchor whose text contains "Next"
    a = soup.select_one('a[rel="next"][href]')
    if a and a.get("href"):
        href = a.get("href")
        return href if href.startswith("http") else urljoin(BASE, href)
    for cand in soup.select("a[href]"):
        t = (cand.get_text(" ", strip=True) or "").lower()
        if "next" in t:
            href = cand.get("href")
            if href:
                return href if href.startswith("http") else urljoin(BASE, href)
    return None

def crawl_search(start_url: str, pages: int = 5) -> list[dict]:
    all_rows = []
    seen_urls = set()
    url = start_url
    for i in range(1, pages + 1):
        html = fetch(url)
        batch = parse_listings(html)
        new_count = 0
        for row in batch:
            if row["url"] in seen_urls:
                continue
            seen_urls.add(row["url"])
            all_rows.append(row)
            new_count += 1
        print(f"page {i}: batch={len(batch)} new={new_count} total={len(all_rows)}")
        next_url = find_next_url(html)
        if not next_url:
            print("no next link found — stopping")
            break
        url = next_url
    return all_rows

rows = crawl_search(SEARCH_URL, pages=3)
print("total unique listings:", len(rows))
rows = crawl_search(SEARCH_URL, pages=3)
print("total unique listings:", len(rows))
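The second strategy (a best-effort URL pattern) can be sketched as below. The /pg-N path style is an assumption based on commonly seen Realtor.com URLs — verify it against live pages before relying on it:

```python
import re

def page_url(start_url: str, page: int) -> str:
    """Best-effort: append or replace a /pg-N segment (assumed URL scheme)."""
    base = re.sub(r"/pg-\d+$", "", start_url.rstrip("/"))
    return base if page <= 1 else f"{base}/pg-{page}"

# In crawl_search, you could fall back to page_url(start_url, i + 1) when
# find_next_url() returns None, and stop once a page yields no new rows.
print(page_url("https://www.realtor.com/realestateandhomes-search/San-Francisco_CA", 2))
```

The "stop when results stop changing" guard matters here: a guessed URL past the last page often returns page 1 again, and the new-rows counter in crawl_search already detects that.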
Step 5: Export to CSV
import csv

def write_csv(path: str, rows: list[dict]) -> None:
    fields = ["url", "price", "beds", "baths", "address"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})

write_csv("realtor_listings.csv", rows)
print("wrote realtor_listings.csv", len(rows))
Practical anti-block checklist
- Use timeouts and retries (don’t hammer indefinitely)
- Back off when you see interstitials or repeated failures
- Keep pages small (2-3) during development
- Store HTML samples when selectors break (so you can adjust quickly)
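Backing off between page fetches is also worth building in from the start. A minimal politeness delay with jitter (the default interval is just a starting point, not a recommendation):

```python
import random
import time

def polite_sleep(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep a randomized interval between requests; tune to your tolerance."""
    s = base + random.uniform(0, jitter)
    time.sleep(s)
    return s
```

Dropping a `polite_sleep()` call into the crawl loop makes your traffic look less like a tight loop and costs almost nothing at development scale.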
Where ProxiesAPI fits (honestly)
Realtor.com is not a “toy” site. You can often fetch a few pages directly, but at higher volumes you’ll hit friction.
Use ProxiesAPI when:
- you need consistent success rates across many locations
- you need to run your scraper as a scheduled job
- you’re crawling detail pages in addition to search pages
QA checklist
- You can fetch your search URL consistently
- parse_listings() returns non-zero listings
- URLs are unique and look like property pages
- Pagination stops naturally when no “Next” is found
- CSV exports correctly
Next upgrades
- fetch each listing detail page for richer fields (sqft, year built, agent, etc.)
- store data in SQLite for incremental crawls
- implement per-city job queue + concurrency with rate limits
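For the SQLite upgrade, keying rows by URL makes re-crawls update in place instead of duplicating. A minimal sketch (the table name and schema are assumptions to adapt):

```python
import sqlite3

def upsert_listings(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Store rows keyed by URL so incremental crawls update rather than duplicate."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS listings (
            url TEXT PRIMARY KEY,
            price TEXT, beds TEXT, baths TEXT, address TEXT
        )
    """)
    conn.executemany(
        """INSERT INTO listings (url, price, beds, baths, address)
           VALUES (:url, :price, :beds, :baths, :address)
           ON CONFLICT(url) DO UPDATE SET
             price=excluded.price, beds=excluded.beds,
             baths=excluded.baths, address=excluded.address""",
        rows,
    )
    conn.commit()

# Re-inserting the same URL updates the row instead of adding a duplicate
conn = sqlite3.connect(":memory:")
upsert_listings(conn, [{"url": "u1", "price": "$1", "beds": "3", "baths": "2", "address": "x"}])
upsert_listings(conn, [{"url": "u1", "price": "$2", "beds": "3", "baths": "2", "address": "x"}])
print(conn.execute("SELECT price FROM listings WHERE url='u1'").fetchone()[0])  # $2
```

Note that `ON CONFLICT ... DO UPDATE` needs SQLite 3.24+, which ships with any recent Python.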