How to Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)

Craigslist is one of the most useful “small HTML” targets on the internet:

  • pages are mostly server-rendered (no heavy JS)
  • listing cards are consistent
  • the site is split by city subdomains (e.g. sfbay.craigslist.org, newyork.craigslist.org)
  • categories have stable paths (e.g. /search/sss for “for sale”, /search/jjj for jobs)

In this tutorial we’ll build a production-grade Python scraper that:

  • searches a city + category
  • paginates through results
  • extracts listing data from the results page
  • optionally fetches each listing detail page for richer fields
  • exports a clean CSV
  • uses retries, timeouts, and a network layer you can route through ProxiesAPI

Craigslist search results (we’ll scrape the result rows + follow detail links)

Keep Craigslist scrapes stable with ProxiesAPI

Craigslist is lightweight, but large crawls still hit rate limits and occasional blocks. ProxiesAPI helps you run consistent requests with retries and IP rotation when you scale across cities and categories.


What we’re scraping (Craigslist URL structure)

Craigslist has a few concepts worth understanding before writing selectors.

City subdomains

Each region is its own host:

  • San Francisco Bay Area: https://sfbay.craigslist.org
  • New York City: https://newyork.craigslist.org
  • Los Angeles: https://losangeles.craigslist.org

Category paths

Craigslist uses short codes:

  • sss = for sale
  • hhh = housing
  • jjj = jobs

Search pages look like:

https://sfbay.craigslist.org/search/sss

…and take query parameters like:

  • query= free-text keyword
  • min_price= / max_price=
  • purveyor=owner (owner-only)
  • bundleDuplicates=1 (often helps reduce duplicates)
  • s= offset for pagination

Example:

https://sfbay.craigslist.org/search/sss?query=standing%20desk&min_price=50&max_price=300
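Under the URL rules above, composing a search URL is pure string work. Here is a small standalone sketch (build_search_url is our own helper name, not part of Craigslist or the scraper that follows, which passes params to requests instead):

```python
from urllib.parse import urlencode


def build_search_url(city: str, category: str, **filters) -> str:
    """Build a Craigslist search URL from a city subdomain and category code.

    Keyword filters that are None are dropped; the rest become query
    parameters (query, min_price, max_price, s, ...).
    """
    base = f"https://{city}.craigslist.org/search/{category}"
    params = {k: v for k, v in filters.items() if v is not None}
    return f"{base}?{urlencode(params)}" if params else base
```

For example, `build_search_url("sfbay", "sss", query="standing desk", min_price=50, max_price=300)` reproduces the URL shown above (with the space encoded as `+` rather than `%20`; Craigslist accepts both).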

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for parsing
  • tenacity for retry logic

Step 1: A solid fetch() with retries and timeouts

Craigslist pages are usually fast, but you still want:

  • connect/read timeouts (avoid hanging)
  • retry on transient network errors and 429/5xx
  • a real User-Agent

Below is a clean baseline.

from __future__ import annotations

import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from requests import Response
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type


TIMEOUT = (10, 30)  # connect, read


@dataclass
class HttpConfig:
    base_url: str
    proxiesapi_url: Optional[str] = None
    user_agent: str = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    )


class HttpClient:
    def __init__(self, cfg: HttpConfig):
        self.cfg = cfg
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": cfg.user_agent,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        })

    def _build_url(self, url_or_path: str) -> str:
        if url_or_path.startswith("http://") or url_or_path.startswith("https://"):
            return url_or_path
        return self.cfg.base_url.rstrip("/") + "/" + url_or_path.lstrip("/")

    def _via_proxiesapi(self, target_url: str) -> str:
        """Wrap a URL through ProxiesAPI if configured.

        IMPORTANT: Adjust this function to match your ProxiesAPI endpoint format.
        Common patterns are either:
        - https://proxiesapi.example/fetch?url=<ENCODED>
        - https://proxiesapi.example/?url=<ENCODED>

        Keep it explicit so you don't overclaim the API shape.
        """
        if not self.cfg.proxiesapi_url:
            return target_url

        from urllib.parse import urlencode

        q = urlencode({"url": target_url})
        return self.cfg.proxiesapi_url.rstrip("/") + "?" + q

    @retry(
        reraise=True,
        stop=stop_after_attempt(5),
        wait=wait_exponential_jitter(initial=1, max=20),
        retry=retry_if_exception_type(requests.RequestException),
    )
    def get(self, url_or_path: str, *, params: dict | None = None) -> Response:
        url = self._build_url(url_or_path)
        if params:
            # Encode params into the target URL before any proxy wrapping;
            # appending them afterwards would attach them to the ProxiesAPI
            # endpoint instead of the Craigslist URL.
            from urllib.parse import urlencode

            url += ("&" if "?" in url else "?") + urlencode(params)
        fetch_url = self._via_proxiesapi(url)

        r = self.session.get(fetch_url, timeout=TIMEOUT)

        # If ProxiesAPI returns the upstream status in headers, you can inspect it here.
        # We'll keep it simple and retry on common transient statuses.
        if r.status_code in (429, 500, 502, 503, 504):
            raise requests.RequestException(f"Transient status {r.status_code} for {url}")

        r.raise_for_status()
        return r


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> None:
    time.sleep(random.uniform(min_s, max_s))

Configure city + (optional) ProxiesAPI

cfg = HttpConfig(
    base_url="https://sfbay.craigslist.org",  # change city here
    proxiesapi_url=None,  # e.g. "https://YOUR_PROXIESAPI_ENDPOINT/fetch"
)
http = HttpClient(cfg)

Step 2: Fetch a search page and confirm HTML

Start with a manual curl to sanity-check the response.

curl -s "https://sfbay.craigslist.org/search/sss?query=standing%20desk" | head -n 8

You should see a normal HTML document.


Step 3: Parse search results (real selectors)

Craigslist search results usually have rows like:

  • each card/row has a link to the listing
  • a title
  • a price (sometimes missing)
  • a neighborhood / location hint
  • a date / time

In practice, the most reliable approach is:

  1. select rows by CSS that Craigslist consistently uses (li.result-row)
  2. extract the a.result-title link (href + title)
  3. extract span.result-price (optional)
  4. extract span.result-hood (optional)

Craigslist does revise its markup from time to time, so verify these selectors in your browser's devtools before a large run.

from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_search_results(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out: list[dict] = []
    for row in soup.select("li.result-row"):
        a = row.select_one("a.result-title")
        if not a:
            continue

        title = a.get_text(" ", strip=True)
        href = a.get("href")
        url = urljoin(base_url, href) if href else None

        price_el = row.select_one("span.result-price")
        price = price_el.get_text(strip=True) if price_el else None

        hood_el = row.select_one("span.result-hood")
        hood = hood_el.get_text(" ", strip=True).strip(" ()") if hood_el else None

        time_el = row.select_one("time.result-date")
        posted_datetime = time_el.get("datetime") if time_el else None

        out.append({
            "title": title,
            "url": url,
            "price": price,
            "hood": hood,
            "posted_datetime": posted_datetime,
        })

    return out

Step 4: Pagination (offset via s=)

Craigslist uses s= as an offset (often 0, 120, 240...).

We’ll crawl pages until:

  • we hit a max page limit, or
  • we stop seeing new URLs


def crawl_search(
    http: HttpClient,
    category: str = "sss",
    query: str = "standing desk",
    min_price: int | None = None,
    max_price: int | None = None,
    limit_pages: int = 5,
    page_size: int = 120,
) -> list[dict]:
    all_rows: list[dict] = []
    seen_urls: set[str] = set()

    for page in range(limit_pages):
        offset = page * page_size

        params: dict = {
            "query": query,
            "bundleDuplicates": 1,
            "s": offset,
        }
        if min_price is not None:
            params["min_price"] = min_price
        if max_price is not None:
            params["max_price"] = max_price

        r = http.get(f"/search/{category}", params=params)
        html = r.text

        batch = parse_search_results(html, http.cfg.base_url)

        new_in_batch = 0
        for row in batch:
            u = row.get("url")
            if not u or u in seen_urls:
                continue
            seen_urls.add(u)
            all_rows.append(row)
            new_in_batch += 1

        print(f"page={page+1} offset={offset} rows={len(batch)} new={new_in_batch} total={len(all_rows)}")

        if new_in_batch == 0:
            break

        polite_sleep()

    return all_rows


rows = crawl_search(
    http,
    category="sss",
    query="standing desk",
    min_price=50,
    max_price=300,
    limit_pages=3,
)
print("total", len(rows))
if rows:
    print(rows[0])

Step 5: Follow each listing page for richer fields

Search rows are great for discovery, but you often want detail fields like:

  • description text
  • image URLs
  • attributes (condition, size, etc.)
  • exact location (sometimes)

On a listing page, Craigslist commonly uses:

  • title: span#titletextonly
  • price: span.price
  • description: section#postingbody
  • images: img tags inside div.swipe-wrap or figure.iw

Because Craigslist templates vary slightly by category, we’ll implement a tolerant parser.

import re


def clean_posting_body(text: str) -> str:
    # Craigslist often prefixes "QR Code Link to This Post"
    text = re.sub(r"\bQR Code Link to This Post\b", "", text, flags=re.I).strip()
    return text


def parse_listing_detail(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title_el = soup.select_one("span#titletextonly")
    title = title_el.get_text(" ", strip=True) if title_el else None

    price_el = soup.select_one("span.price")
    price = price_el.get_text(strip=True) if price_el else None

    body_el = soup.select_one("section#postingbody")
    body = clean_posting_body(body_el.get_text("\n", strip=True)) if body_el else None

    # Attributes are in p.attrgroup spans
    attrs = {}
    for span in soup.select("p.attrgroup span"):
        t = span.get_text(" ", strip=True)
        if ":" in t:
            k, v = t.split(":", 1)
            attrs[k.strip()] = v.strip()
        else:
            # standalone flags like "delivery available"
            attrs[t] = True

    # Image URLs: take any image in the gallery
    images = []
    for img in soup.select("img"):
        src = img.get("src") or img.get("data-src")
        if src and "craigslist" in src and src not in images:
            images.append(src)

    return {
        "url": url,
        "title": title,
        "price": price,
        "body": body,
        "attributes": attrs,
        "images": images,
    }


def enrich_with_details(http: HttpClient, rows: list[dict], max_details: int = 50) -> list[dict]:
    out = []

    for i, row in enumerate(rows[:max_details], start=1):
        url = row.get("url")
        if not url:
            continue

        r = http.get(url)
        detail = parse_listing_detail(r.text, url)

        merged = {**row, **detail}
        out.append(merged)

        print(f"detail {i}/{min(max_details, len(rows))} fetched")
        polite_sleep(0.6, 1.6)

    return out
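The attrgroup handling above is the piece most likely to need per-category tweaks, so it is worth isolating. The same split logic on plain strings (parse_attr_spans is a hypothetical helper for illustration, not used elsewhere in this tutorial):

```python
def parse_attr_spans(texts: list[str]) -> dict:
    """Mirror the attrgroup handling: "key: value" strings split into
    entries, bare flags like "delivery available" map to True."""
    attrs: dict = {}
    for t in texts:
        if ":" in t:
            k, v = t.split(":", 1)
            attrs[k.strip()] = v.strip()
        else:
            attrs[t] = True
    return attrs
```

Factoring it out this way also makes it trivial to unit-test the parsing rules without any HTML.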

Step 6: Export to CSV (properly)

CSV gets messy if you dump nested objects. We’ll:

  • keep attributes and images as JSON strings
  • ensure UTF-8

import csv
import json


def to_csv(rows: list[dict], path: str) -> None:
    if not rows:
        raise ValueError("No rows to write")

    # normalize keys
    fieldnames = sorted({k for r in rows for k in r.keys()})

    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            rr = dict(r)
            if isinstance(rr.get("attributes"), dict):
                rr["attributes"] = json.dumps(rr["attributes"], ensure_ascii=False)
            if isinstance(rr.get("images"), list):
                rr["images"] = json.dumps(rr["images"], ensure_ascii=False)
            w.writerow(rr)


rows = crawl_search(http, category="sss", query="standing desk", min_price=50, max_price=300, limit_pages=3)
detailed = enrich_with_details(http, rows, max_details=30)

to_csv(detailed, "craigslist_listings.csv")
print("wrote craigslist_listings.csv", len(detailed))
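To sanity-check the JSON-in-CSV choice without touching disk, you can round-trip a sample row through an in-memory buffer (a standalone sketch with made-up data):

```python
import csv
import io
import json

# One row with nested fields, as produced by the enrichment step.
row = {
    "title": "Standing desk",
    "attributes": {"condition": "good"},
    "images": ["https://images.craigslist.org/abc_600x450.jpg"],
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=sorted(row))
writer.writeheader()
writer.writerow({
    **row,
    "attributes": json.dumps(row["attributes"], ensure_ascii=False),
    "images": json.dumps(row["images"], ensure_ascii=False),
})

# Read it back and confirm the nested fields survive intact.
buf.seek(0)
back = next(csv.DictReader(buf))
assert json.loads(back["attributes"]) == row["attributes"]
assert json.loads(back["images"]) == row["images"]
```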

Anti-block tips (Craigslist-specific)

Craigslist is generally tolerant, but you can still get throttled if you:

  • hammer one city with many requests per second
  • fetch details for thousands of listings in one run
  • use a default Python User-Agent

Practical mitigations:

  • sleep between requests (random jitter)
  • limit detail fetching (max_details) and run incrementally
  • cache listing pages locally (or in SQLite) and only re-fetch new URLs
  • spread across time (cron) rather than trying to do everything in one blast
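The caching idea can be as simple as one SQLite table keyed by URL. A minimal sketch (open_cache and cached_get are our own hypothetical helpers, not defined elsewhere in this tutorial):

```python
import sqlite3
from typing import Callable


def open_cache(path: str = "craigslist_cache.db") -> sqlite3.Connection:
    """Open (or create) a one-table page cache keyed by URL."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    return conn


def cached_get(conn: sqlite3.Connection, url: str, fetch: Callable[[str], str]) -> str:
    """Return cached HTML for url, calling fetch(url) only on a miss."""
    found = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if found:
        return found[0]
    html = fetch(url)
    conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()
    return html
```

Wired into the scraper, `cached_get(conn, url, lambda u: http.get(u).text)` means re-runs only pay for URLs you have not seen before.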

Where ProxiesAPI fits (honestly)

You can scrape small Craigslist batches without proxies.

But when you scale to multiple cities + categories + detail pages, failures become noisy:

  • intermittent 429s
  • occasional captchas or blocked IPs
  • unstable throughput

ProxiesAPI is most useful as a consistent network layer: route requests through it, keep retries centralized, and rotate IPs when needed.


QA checklist

  • Your search URL returns HTML (not an error page)
  • Parsed rows contain a title + URL
  • Pagination adds new results
  • Detail pages parse body + attributes for at least a few items
  • CSV opens cleanly in Google Sheets/Excel

Next upgrades

  • store rows in SQLite for incremental crawls
  • add deduping by Craigslist post id (present in URL)
  • add structured geocoding if you need lat/lon (when available)
  • add concurrency carefully (threads) with strict rate limits
