Scrape Real Estate Listings from Realtor.com (Python + ProxiesAPI)
Realtor.com is one of the biggest real-estate portals in the US — which also makes it a high-friction scraping target.
In this guide we’ll build a practical Python scraper that:
- visits a Realtor.com search results page
- extracts listing URLs + core fields (price, beds, baths, address)
- paginates through multiple result pages
- exports to CSV
- uses a ProxiesAPI-backed fetch function so you can scale more reliably

Real estate sites tend to rate-limit and fingerprint aggressively. ProxiesAPI gives you a stable network layer (rotating IPs + retries) so your scraper spends less time failing and more time collecting listings.
What we’re scraping (and what can break)
On Realtor.com, the results UI can change and it may be partially client-rendered. That means:
- selectors can shift (class names change, fields move)
- some data may be missing in HTML depending on geo/cookies
- rate limits / bot protections can trigger (timeouts, 403s, interstitials)
So the goal here is not “one magical selector”. The goal is a workflow:
- fetch reliably (timeouts + retries + proxy)
- detect what the page contains
- extract what’s available, gracefully
- iterate on selectors when the UI shifts
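The "detect" step above can be sketched as a tiny stdlib-only triage function. The marker strings here are assumptions — verify them against responses you actually see:

```python
import re

def classify_page(html: str) -> str:
    """Rough triage of a fetched page; marker strings are guesses, adjust to taste."""
    low = html.lower()
    # Interstitial / block-page markers (assumed wording, not guaranteed)
    if "unusual traffic" in low or "access denied" in low:
        return "blocked"
    # Detail links are the most stable signal of a real results page
    if "/realestateandhomes-detail/" in low:
        return "results"
    return "unknown"

print(classify_page("<a href='/realestateandhomes-detail/123'>$1,200,000</a>"))  # results
```

Routing on a classification like this (retry on "blocked", parse on "results", snapshot on "unknown") is what makes the workflow survive UI changes.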
Prereqs
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (with the lxml parser) for parsing
- csv from the standard library for export
Step 1: A safe fetch() with retries (and ProxiesAPI)
You’ll reuse this pattern everywhere.
Below is a drop-in fetch layer:
- sets realistic timeouts
- adds a browser-ish User-Agent
- retries transient failures
- optionally routes requests through ProxiesAPI
Note: ProxiesAPI integration depends on the exact endpoint/key format in your account. The code below is written to be explicit and easy to adapt: you only need to adjust PROXIESAPI_URL and the query parameters to match your ProxiesAPI docs.
import os
import time
import random
from urllib.parse import urlencode

import requests

TIMEOUT = (10, 35)  # connect, read
MAX_RETRIES = 5

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
}

def build_proxiesapi_url(target_url: str) -> str:
    """Return a ProxiesAPI-wrapped URL for a target.

    Adapt this to your ProxiesAPI account format.
    Common patterns are either:
    - https://api.proxiesapi.com/?auth_key=...&url=<encoded>
    - https://proxy.proxiesapi.com/?api_key=...&url=<encoded>
    """
    api_key = os.environ.get("PROXIESAPI_KEY")
    if not api_key:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")
    base = os.environ.get("PROXIESAPI_URL", "https://api.proxiesapi.com")
    qs = urlencode({
        "api_key": api_key,
        "url": target_url,
    })
    return f"{base}/?{qs}"

def fetch(url: str, *, use_proxiesapi: bool = True) -> str:
    attempt = 0
    while True:
        attempt += 1
        try:
            final_url = build_proxiesapi_url(url) if use_proxiesapi else url
            r = session.get(final_url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
            # Some anti-bot flows return 200 with an interstitial.
            # We still raise for typical HTTP errors.
            r.raise_for_status()
            text = r.text or ""
            if "unusual traffic" in text.lower() or "our systems have detected" in text.lower():
                raise RuntimeError("Blocked by interstitial (detected unusual traffic)")
            return text
        except Exception as e:
            if attempt >= MAX_RETRIES:
                raise
            # exponential backoff + jitter
            sleep_s = min(20, 2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"fetch failed (attempt {attempt}/{MAX_RETRIES}): {e} — sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)
If you want to debug selectors without proxies, just call:
html = fetch(url, use_proxiesapi=False)
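While you iterate on selectors, it helps to snapshot fetched HTML to disk so you can re-parse it offline instead of re-fetching. A minimal helper (the "snapshots" directory name is just a suggestion):

```python
import time
from pathlib import Path

def save_snapshot(html: str, label: str, out_dir: str = "snapshots") -> Path:
    """Write fetched HTML to disk so you can iterate on selectors offline."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    path = Path(out_dir) / f"{label}-{int(time.time())}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Call it right after a successful fetch (e.g. `save_snapshot(html, "sf-search")`), then feed the saved file back into your parser during development.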
Step 2: Find a stable entry point (a search URL)
Realtor.com search URLs are typically state/city/zip-based. Example pattern:
https://www.realtor.com/realestateandhomes-search/San-Francisco_CA
Pick one location as your baseline and don’t change it while building selectors.
SEARCH_URL = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"
html = fetch(SEARCH_URL)
print(len(html))
print(html[:200])
Step 3: Parse listing cards (defensive selectors)
Instead of betting on one brittle class name, we:
- look for anchors that resemble property detail links
- try multiple ways to locate price / beds / baths / address
- keep raw HTML snippets around during dev (optional)
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.realtor.com"

def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())

def parse_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    seen = set()
    # Heuristic: property detail links often contain "/realestateandhomes-detail/".
    # If Realtor changes this, update the substring.
    for a in soup.select("a[href]"):
        href = a.get("href") or ""
        if "/realestateandhomes-detail/" not in href:
            continue
        url = href if href.startswith("http") else urljoin(BASE, href)
        if url in seen:
            continue
        # Walk up to a likely card container
        card = a
        for _ in range(6):
            if not card:
                break
            if card.name in ("li", "div", "article"):
                # stop early when we hit a container with enough text
                if len(clean_text(card.get_text(" ", strip=True))) > 40:
                    break
            card = card.parent
        container = card if card else a.parent
        text_blob = clean_text(container.get_text(" ", strip=True) if container else a.get_text(" ", strip=True))
        # Price heuristic: "$" followed by digits/commas
        price = None
        m = re.search(r"\$\s?([0-9,]+)", text_blob)
        if m:
            price = "$" + m.group(1)
        # Beds/baths heuristics (e.g., "3 bed", "2.5 bath")
        beds = None
        baths = None
        mb = re.search(r"(\d+(?:\.\d+)?)\s*(?:bd|bed)s?", text_blob, re.I)
        if mb:
            beds = mb.group(1)
        mba = re.search(r"(\d+(?:\.\d+)?)\s*(?:ba|bath)s?", text_blob, re.I)
        if mba:
            baths = mba.group(1)
        # Address heuristic: look for something that resembles street + city.
        # This is intentionally loose; tighten it for your target.
        address = None
        # Try aria-label first
        if a.get("aria-label"):
            address = clean_text(a.get("aria-label"))
        else:
            # fallback: first ~80 chars of container text
            address = text_blob[:80] if text_blob else None
        out.append({
            "url": url,
            "price": price,
            "beds": beds,
            "baths": baths,
            "address": address,
        })
        seen.add(url)
    return out

listings = parse_listings(html)
print("listings:", len(listings))
print(listings[:2])
listings = parse_listings(html)
print("listings:", len(listings))
print(listings[:2])
Why this approach?
Real estate result pages frequently shuffle their DOM. Anchors to detail pages are often the most stable “spine” — if you can find detail links, you can usually back into the card.
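To see the price/beds/baths heuristics in isolation, here is a self-contained check against a synthetic card blob (the text is made up for illustration, not real Realtor markup):

```python
import re

# A plausible card text blob after whitespace normalization (invented example)
blob = "123 Example St, San Francisco, CA 94110 $1,250,000 3 bed 2.5 bath 1,450 sqft"

price = re.search(r"\$\s?([0-9,]+)", blob)                          # "$" then digits/commas
beds = re.search(r"(\d+(?:\.\d+)?)\s*(?:bd|bed)s?", blob, re.I)     # "3 bed" / "3 bds"
baths = re.search(r"(\d+(?:\.\d+)?)\s*(?:ba|bath)s?", blob, re.I)   # "2.5 bath" / "2.5 ba"

print(price.group(1), beds.group(1), baths.group(1))  # 1,250,000 3 2.5
```

Testing the regexes on saved text blobs like this is much faster than re-fetching live pages every time a pattern needs tuning.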
Step 4: Pagination
Realtor’s pagination can vary; sometimes it’s an explicit pg-2 style path, sometimes query params, sometimes JS.
So we’ll implement two strategies:
- try to find a “Next” link in HTML
- if not found, try a best-effort URL pattern and stop when results stop changing
from bs4 import BeautifulSoup

def find_next_url(html: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    # Try common patterns: rel=next, or an anchor whose text contains "Next"
    a = soup.select_one('a[rel="next"][href]')
    if a and a.get("href"):
        href = a.get("href")
        return href if href.startswith("http") else urljoin(BASE, href)
    for cand in soup.select("a[href]"):
        t = (cand.get_text(" ", strip=True) or "").lower()
        if "next" in t:
            href = cand.get("href")
            if href:
                return href if href.startswith("http") else urljoin(BASE, href)
    return None

def crawl_search(start_url: str, pages: int = 5) -> list[dict]:
    all_rows = []
    seen_urls = set()
    url = start_url
    for i in range(1, pages + 1):
        html = fetch(url)
        batch = parse_listings(html)
        new_count = 0
        for row in batch:
            if row["url"] in seen_urls:
                continue
            seen_urls.add(row["url"])
            all_rows.append(row)
            new_count += 1
        print(f"page {i}: batch={len(batch)} new={new_count} total={len(all_rows)}")
        next_url = find_next_url(html)
        if not next_url:
            print("no next link found — stopping")
            break
        url = next_url
    return all_rows

rows = crawl_search(SEARCH_URL, pages=3)
print("total unique listings:", len(rows))
rows = crawl_search(SEARCH_URL, pages=3)
print("total unique listings:", len(rows))
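The second strategy (a best-effort URL pattern) can be sketched as below. The /pg-N path style is an assumption based on commonly seen Realtor.com URLs — verify it against live pages before relying on it:

```python
import re

def page_url(start_url: str, page: int) -> str:
    """Best-effort: append or replace a /pg-N segment (assumed URL scheme)."""
    base = re.sub(r"/pg-\d+$", "", start_url.rstrip("/"))
    return base if page <= 1 else f"{base}/pg-{page}"

# In crawl_search, you could fall back to page_url(start_url, i + 1) when
# find_next_url() returns None, and stop once a page yields no new rows.
print(page_url("https://www.realtor.com/realestateandhomes-search/San-Francisco_CA", 2))
```

The "stop when results stop changing" guard matters here: a guessed URL past the last page often returns page 1 again, and the new-rows counter in crawl_search already detects that.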
Step 5: Export to CSV
import csv

def write_csv(path: str, rows: list[dict]) -> None:
    fields = ["url", "price", "beds", "baths", "address"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})

write_csv("realtor_listings.csv", rows)
print("wrote realtor_listings.csv", len(rows))
Practical anti-block checklist
- Use timeouts and retries (don’t hammer indefinitely)
- Back off when you see interstitials or repeated failures
- Keep pages small (2-3) during development
- Store HTML samples when selectors break (so you can adjust quickly)
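Backing off between page fetches is also worth building in from the start. A minimal politeness delay with jitter (the default interval is just a starting point, not a recommendation):

```python
import random
import time

def polite_sleep(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep a randomized interval between requests; tune to your tolerance."""
    s = base + random.uniform(0, jitter)
    time.sleep(s)
    return s
```

Dropping a `polite_sleep()` call into the crawl loop makes your traffic look less like a tight loop and costs almost nothing at development scale.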
Where ProxiesAPI fits (honestly)
Realtor.com is not a “toy” site. You can often fetch a few pages directly, but at higher volumes you’ll hit friction.
Use ProxiesAPI when:
- you need consistent success rates across many locations
- you need to run your scraper as a scheduled job
- you’re crawling detail pages in addition to search pages
QA checklist
- You can fetch your search URL consistently
- parse_listings() returns non-zero listings
- URLs are unique and look like property pages
- Pagination stops naturally when no “Next” is found
- CSV exports correctly
Next upgrades
- fetch each listing detail page for richer fields (sqft, year built, agent, etc.)
- store data in SQLite for incremental crawls
- implement per-city job queue + concurrency with rate limits
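For the SQLite upgrade, keying rows by URL makes re-crawls update in place instead of duplicating. A minimal sketch (the table name and schema are assumptions to adapt):

```python
import sqlite3

def upsert_listings(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Store rows keyed by URL so incremental crawls update rather than duplicate."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS listings (
            url TEXT PRIMARY KEY,
            price TEXT, beds TEXT, baths TEXT, address TEXT
        )
    """)
    conn.executemany(
        """INSERT INTO listings (url, price, beds, baths, address)
           VALUES (:url, :price, :beds, :baths, :address)
           ON CONFLICT(url) DO UPDATE SET
             price=excluded.price, beds=excluded.beds,
             baths=excluded.baths, address=excluded.address""",
        rows,
    )
    conn.commit()

# Re-inserting the same URL updates the row instead of adding a duplicate
conn = sqlite3.connect(":memory:")
upsert_listings(conn, [{"url": "u1", "price": "$1", "beds": "3", "baths": "2", "address": "x"}])
upsert_listings(conn, [{"url": "u1", "price": "$2", "beds": "3", "baths": "2", "address": "x"}])
print(conn.execute("SELECT price FROM listings WHERE url='u1'").fetchone()[0])  # $2
```

Note that `ON CONFLICT ... DO UPDATE` needs SQLite 3.24+, which ships with any recent Python.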