Scrape Real Estate Listings from Realtor.com (Python + ProxiesAPI)

Realtor.com is one of the biggest real-estate portals in the US — which also makes it a high-friction scraping target.

In this guide we’ll build a practical Python scraper that:

  • visits a Realtor.com search results page
  • extracts listing URLs + core fields (price, beds, baths, address)
  • paginates through multiple result pages
  • exports to CSV
  • uses a ProxiesAPI-backed fetch function so you can scale more reliably

Realtor.com results page we’ll scrape

Make Realtor.com scraping more reliable with ProxiesAPI

Real estate sites tend to rate-limit and fingerprint aggressively. ProxiesAPI gives you a stable network layer (rotating IPs + retries) so your scraper spends less time failing and more time collecting listings.


What we’re scraping (and what can break)

On Realtor.com, the results UI can change and it may be partially client-rendered. That means:

  • selectors can shift (class names change, fields move)
  • some data may be missing in HTML depending on geo/cookies
  • rate limits / bot protections can trigger (timeouts, 403s, interstitials)

So the goal here is not “one magical selector”. The goal is a workflow:

  1. fetch reliably (timeouts + retries + proxy)
  2. detect what the page contains
  3. extract what’s available, gracefully
  4. iterate on selectors when the UI shifts

Prereqs

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for parsing
  • csv from the standard library for export

Step 1: A safe fetch() with retries (and ProxiesAPI)

You’ll reuse this pattern everywhere.

Below is a drop-in fetch layer:

  • sets realistic timeouts
  • adds a browser-ish User-Agent
  • retries transient failures
  • optionally routes requests through ProxiesAPI

Note: ProxiesAPI integration depends on the exact endpoint/key format in your account. The code below is written to be explicit and easy to adapt: you only need to adjust PROXIESAPI_URL / parameters to match your ProxiesAPI docs.

import os
import time
import random
from urllib.parse import urlencode

import requests

TIMEOUT = (10, 35)  # connect, read
MAX_RETRIES = 5

session = requests.Session()

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
}


def build_proxiesapi_url(target_url: str) -> str:
    """Return a ProxiesAPI-wrapped URL for a target.

    Adapt this to your ProxiesAPI account format.
    Common patterns are either:
      - https://api.proxiesapi.com/?auth_key=...&url=<encoded>
      - https://proxy.proxiesapi.com/?api_key=...&url=<encoded>
    """

    api_key = os.environ.get("PROXIESAPI_KEY")
    if not api_key:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")

    base = os.environ.get("PROXIESAPI_URL", "https://api.proxiesapi.com")

    qs = urlencode({
        "api_key": api_key,
        "url": target_url,
    })

    return f"{base}/?{qs}"


def fetch(url: str, *, use_proxiesapi: bool = True) -> str:
    attempt = 0

    while True:
        attempt += 1
        try:
            final_url = build_proxiesapi_url(url) if use_proxiesapi else url

            r = session.get(final_url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)

            # Some anti-bot flows return 200 with an interstitial.
            # We still raise for typical HTTP errors.
            r.raise_for_status()

            text = r.text or ""

            if "unusual traffic" in text.lower() or "our systems have detected" in text.lower():
                raise RuntimeError("Blocked by interstitial (detected unusual traffic)")

            return text

        except Exception as e:
            if attempt >= MAX_RETRIES:
                raise

            # exponential backoff + jitter
            sleep_s = min(20, (2 ** (attempt - 1))) + random.uniform(0, 0.5)
            print(f"fetch failed (attempt {attempt}/{MAX_RETRIES}): {e} — sleeping {sleep_s:.1f}s")
            time.sleep(sleep_s)

If you want to debug selectors without proxies, just call:

html = fetch(url, use_proxiesapi=False)
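While iterating on selectors, it also helps to save a fetched page to disk and open it in a browser. A small optional helper (the filename is arbitrary):

```python
from pathlib import Path


def save_debug_html(html: str, name: str = "debug_page.html") -> Path:
    """Write fetched HTML to disk so selectors can be inspected
    offline in a browser or editor."""
    path = Path(name)
    path.write_text(html, encoding="utf-8")
    return path
```

Saving one sample per problem page also gives you a fixture to test against when Realtor.com's markup shifts.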

Step 2: Find a stable entry point (a search URL)

Realtor.com search URLs are typically state/city/zip-based. Example pattern:

  • https://www.realtor.com/realestateandhomes-search/San-Francisco_CA

Pick one location as your baseline and don’t change it while building selectors.

SEARCH_URL = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"
html = fetch(SEARCH_URL)
print(len(html))
print(html[:200])
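Before writing selectors, it's worth a cheap sanity check that you got a real results page and not an interstitial. This helper just looks for the detail-link substring we'll use in Step 3 (update it if Realtor changes its URL scheme):

```python
def looks_like_results_page(html: str) -> bool:
    """Heuristic: a genuine search results page should contain at
    least one property detail link. If it doesn't, you likely got
    a block page, a redirect, or an empty result set."""
    return "/realestateandhomes-detail/" in html
```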

Step 3: Parse listing cards (defensive selectors)

Instead of betting on one brittle class name, we:

  • look for anchors that resemble property detail links
  • try multiple ways to locate price / beds / baths / address
  • keep raw HTML snippets around during dev (optional)

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.realtor.com"


def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())


def parse_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []
    seen = set()

    # Heuristic: property detail links often contain "/realestateandhomes-detail/".
    # If Realtor changes this, update the substring.
    for a in soup.select("a[href]"):
        href = a.get("href") or ""
        if "/realestateandhomes-detail/" not in href:
            continue

        url = href if href.startswith("http") else urljoin(BASE, href)
        if url in seen:
            continue

        # Walk up to a likely card container
        card = a
        for _ in range(6):
            if not card:
                break
            if card.name in ("li", "div", "article"):
                # stop early when we hit a container with enough text
                if len(clean_text(card.get_text(" ", strip=True))) > 40:
                    break
            card = card.parent

        container = card if card else a.parent
        text_blob = clean_text(container.get_text(" ", strip=True) if container else a.get_text(" ", strip=True))

        # Price heuristic: "$" followed by digits/commas
        price = None
        m = re.search(r"\$\s?([0-9,]+)", text_blob)
        if m:
            price = "$" + m.group(1)

        # Beds/baths heuristic (e.g., "3 bed", "2.5 bath")
        beds = None
        baths = None
        mb = re.search(r"(\d+(?:\.\d+)?)\s*(?:bd|bed)s?", text_blob, re.I)
        if mb:
            beds = mb.group(1)
        mba = re.search(r"(\d+(?:\.\d+)?)\s*(?:ba|bath)s?", text_blob, re.I)
        if mba:
            baths = mba.group(1)

        # Address heuristic: look for something that resembles street + city
        # This is intentionally loose. You can tighten based on your target.
        address = None
        # Try aria-label first
        if a.get("aria-label"):
            address = clean_text(a.get("aria-label"))
        else:
            # fallback: first ~80 chars of container text
            address = text_blob[:80] if text_blob else None

        out.append({
            "url": url,
            "price": price,
            "beds": beds,
            "baths": baths,
            "address": address,
        })
        seen.add(url)

    return out


listings = parse_listings(html)
print("listings:", len(listings))
print(listings[:2])

Why this approach?

Real estate result pages frequently shuffle their DOM. Anchors to detail pages are often the most stable “spine” — if you can find detail links, you can usually back into the card.


Step 4: Pagination

Realtor’s pagination can vary; sometimes it’s an explicit pg-2 style path, sometimes query params, sometimes JS.

So we’ll implement two strategies:

  1. try to find a “Next” link in HTML
  2. if not found, try a best-effort URL pattern and stop when results stop changing

from bs4 import BeautifulSoup


def find_next_url(html: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Try common patterns: rel=next or anchor containing "Next"
    a = soup.select_one('a[rel="next"][href]')
    if a and a.get("href"):
        href = a.get("href")
        return href if href.startswith("http") else urljoin(BASE, href)

    for cand in soup.select("a[href]"):
        t = (cand.get_text(" ", strip=True) or "").lower()
        if "next" in t:
            href = cand.get("href")
            if href:
                return href if href.startswith("http") else urljoin(BASE, href)

    return None


def crawl_search(start_url: str, pages: int = 5) -> list[dict]:
    all_rows = []
    seen_urls = set()

    url = start_url
    for i in range(1, pages + 1):
        html = fetch(url)
        batch = parse_listings(html)

        new_count = 0
        for row in batch:
            if row["url"] in seen_urls:
                continue
            seen_urls.add(row["url"])
            all_rows.append(row)
            new_count += 1

        print(f"page {i}: batch={len(batch)} new={new_count} total={len(all_rows)}")

        next_url = find_next_url(html)
        if not next_url:
            print("no next link found — stopping")
            break

        url = next_url

    return all_rows


rows = crawl_search(SEARCH_URL, pages=3)
print("total unique listings:", len(rows))
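crawl_search() covers strategy 1 (follow the "Next" link). If no link is found, strategy 2 is a best-effort URL pattern. The pg-N path suffix below is an assumption based on commonly seen Realtor.com URLs; verify it against your own search URLs before relying on it:

```python
def page_url(base_url: str, page: int) -> str:
    """Best-effort pagination URL builder. Assumes a /pg-N path
    suffix (e.g. .../San-Francisco_CA/pg-2); adjust if your
    search URLs paginate via query parameters instead."""
    if page <= 1:
        return base_url
    return base_url.rstrip("/") + f"/pg-{page}"
```

Combine it with the dedupe logic already in crawl_search(): request page_url(start_url, i) for i = 1, 2, ... and stop as soon as a page yields zero new listing URLs.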

Step 5: Export to CSV

import csv


def write_csv(path: str, rows: list[dict]) -> None:
    fields = ["url", "price", "beds", "baths", "address"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


write_csv("realtor_listings.csv", rows)
print("wrote realtor_listings.csv", len(rows))

Practical anti-block checklist

  • Use timeouts and retries (don’t hammer indefinitely)
  • Back off when you see interstitials or repeated failures
  • Keep the page count small (pages=2 or 3) during development
  • Store HTML samples when selectors break (so you can adjust quickly)

Where ProxiesAPI fits (honestly)

Realtor.com is not a “toy” site. You can often fetch a few pages directly, but at higher volumes you’ll hit friction.

Use ProxiesAPI when:

  • you need consistent success rates across many locations
  • you need to run your scraper as a scheduled job
  • you’re crawling detail pages in addition to search pages

QA checklist

  • You can fetch your search URL consistently
  • parse_listings() returns non-zero listings
  • URLs are unique and look like property pages
  • Pagination stops naturally when no “Next” is found
  • CSV exports correctly

Next upgrades

  • fetch each listing detail page for richer fields (sqft, year built, agent, etc.)
  • store data in SQLite for incremental crawls
  • implement per-city job queue + concurrency with rate limits
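For the SQLite upgrade, the core idea is keying rows by URL so repeated crawls only insert new properties. A minimal sketch (the table name and schema are illustrative):

```python
import sqlite3


def upsert_listings(db_path: str, rows: list[dict]) -> int:
    """Insert listings keyed by URL, ignoring ones already stored.
    Returns the number of newly inserted rows."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               url TEXT PRIMARY KEY,
               price TEXT,
               beds TEXT,
               baths TEXT,
               address TEXT
           )"""
    )
    new = 0
    for r in rows:
        cur = con.execute(
            "INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?)",
            (r["url"], r.get("price"), r.get("beds"), r.get("baths"), r.get("address")),
        )
        new += cur.rowcount
    con.commit()
    con.close()
    return new
```

Running the crawl on a schedule then becomes: crawl, upsert, and log how many rows were new.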

Related guides

Scrape GitHub Repository Data (Stars, Releases, Issues) with Python + ProxiesAPI
Scrape GitHub repo pages as HTML (not just the API): stars, forks, open issues/PRs, latest release, and recent issues. Includes defensive selectors, CSV export, and a screenshot.

How to Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)
Pull Craigslist listings for a chosen city + category, normalize fields, follow listing pages for details, and export clean CSV with retries and anti-block tips.

How to Scrape Apartment Listings from Apartments.com (Python + ProxiesAPI)
Scrape Apartments.com listing cards and detail-page fields with Python. Includes pagination, resilient parsing, retries, and clean JSON/CSV exports.