How to Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)
Craigslist is one of the most useful “small HTML” targets on the internet:
- pages are mostly server-rendered (no heavy JS)
- listing cards are consistent
- the site is split by city subdomains (e.g. sfbay.craigslist.org, newyork.craigslist.org)
- categories have stable paths (e.g. /search/sss for “for sale”, /search/jjj for jobs)
In this tutorial we’ll build a production-grade Python scraper that:
- searches a city + category
- paginates through results
- extracts listing data from the results page
- optionally fetches each listing detail page for richer fields
- exports a clean CSV
- uses retries, timeouts, and a network layer you can route through ProxiesAPI

Craigslist is lightweight, but large crawls still hit rate limits and occasional blocks. ProxiesAPI helps you run consistent requests with retries and IP rotation when you scale across cities and categories.
What we’re scraping (Craigslist URL structure)
Craigslist has a few concepts worth understanding before writing selectors.
City subdomains
Each region is its own host:
- San Francisco Bay Area: https://sfbay.craigslist.org
- New York City: https://newyork.craigslist.org
- Los Angeles: https://losangeles.craigslist.org
Category paths
Craigslist uses short codes:
- sss = for sale
- hhh = housing
- jjj = jobs
Search pages look like:
https://sfbay.craigslist.org/search/sss
…and take query parameters like:
- query= free-text keyword
- min_price= / max_price=
- purveyor=owner (owner-only)
- bundleDuplicates=1 (often helps reduce duplicates)
- s= offset for pagination
Example:
https://sfbay.craigslist.org/search/sss?query=standing%20desk&min_price=50&max_price=300
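To keep URL building consistent across cities and categories, a small helper can compose these pieces with the standard library. This `build_search_url` function is an illustrative sketch (the name is ours, not a Craigslist API), assuming the subdomain + category + query-string shape shown above:

```python
from urllib.parse import urlencode

def build_search_url(city: str, category: str, **params) -> str:
    """Compose a Craigslist search URL from a city subdomain and a category code."""
    base = f"https://{city}.craigslist.org/search/{category}"
    # Drop parameters the caller left as None so they don't appear in the URL
    qs = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}?{qs}" if qs else base

print(build_search_url("sfbay", "sss", query="standing desk", min_price=50, max_price=300))
# https://sfbay.craigslist.org/search/sss?query=standing+desk&min_price=50&max_price=300
```

Note that `urlencode` escapes the space as `+`, which Craigslist accepts interchangeably with `%20`.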
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- tenacity for retry logic
Step 1: A solid fetch() with retries and timeouts
Craigslist pages are usually fast, but you still want:
- connect/read timeouts (avoid hanging)
- retry on transient network errors and 429/5xx
- a real User-Agent
Below is a clean baseline.
```python
from __future__ import annotations

import random
import time
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

import requests
from requests import Response
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

TIMEOUT = (10, 30)  # connect, read


@dataclass
class HttpConfig:
    base_url: str
    proxiesapi_url: Optional[str] = None
    user_agent: str = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    )


class HttpClient:
    def __init__(self, cfg: HttpConfig):
        self.cfg = cfg
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": cfg.user_agent,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        })

    def _build_url(self, url_or_path: str) -> str:
        if url_or_path.startswith(("http://", "https://")):
            return url_or_path
        return self.cfg.base_url.rstrip("/") + "/" + url_or_path.lstrip("/")

    def _via_proxiesapi(self, target_url: str) -> str:
        """Wrap a URL through ProxiesAPI if configured.

        IMPORTANT: Adjust this function to match your ProxiesAPI endpoint format.
        Common patterns are either:
        - https://proxiesapi.example/fetch?url=<ENCODED>
        - https://proxiesapi.example/?url=<ENCODED>
        Keep it explicit so you don't overclaim the API shape.
        """
        if not self.cfg.proxiesapi_url:
            return target_url
        return self.cfg.proxiesapi_url.rstrip("/") + "?" + urlencode({"url": target_url})

    @retry(
        reraise=True,
        stop=stop_after_attempt(5),
        wait=wait_exponential_jitter(initial=1, max=20),
        retry=retry_if_exception_type(requests.RequestException),
    )
    def get(self, url_or_path: str, *, params: dict | None = None) -> Response:
        url = self._build_url(url_or_path)
        # Encode params into the *target* URL before wrapping, so the proxy
        # endpoint doesn't mistake them for its own query parameters.
        if params:
            url += ("&" if "?" in url else "?") + urlencode(params)
        fetch_url = self._via_proxiesapi(url)
        r = self.session.get(fetch_url, timeout=TIMEOUT)
        # If ProxiesAPI returns the upstream status in headers, you can inspect it here.
        # We'll keep it simple and retry on common transient statuses.
        if r.status_code in (429, 500, 502, 503, 504):
            raise requests.RequestException(f"Transient status {r.status_code} for {url}")
        r.raise_for_status()
        return r


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> None:
    time.sleep(random.uniform(min_s, max_s))
```
Configure city + (optional) ProxiesAPI
```python
cfg = HttpConfig(
    base_url="https://sfbay.craigslist.org",  # change city here
    proxiesapi_url=None,  # e.g. "https://YOUR_PROXIESAPI_ENDPOINT/fetch"
)
http = HttpClient(cfg)
```
Step 2: Fetch a search page and confirm HTML
Start with a manual curl to sanity-check the response.
```bash
curl -s "https://sfbay.craigslist.org/search/sss?query=standing%20desk" | head -n 8
```
You should see a normal HTML document.
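You can apply the same sanity check programmatically before parsing. This `looks_like_html` helper is an illustrative sketch (the name and heuristic are ours): blocked or rate-limited responses are often JSON, plain text, or empty rather than a full HTML document.

```python
def looks_like_html(text: str) -> bool:
    """Cheap heuristic: does the response start like an HTML document?"""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")

print(looks_like_html("<!DOCTYPE html><html><body>ok</body></html>"))  # True
print(looks_like_html('{"error": "blocked"}'))                         # False
```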
Step 3: Parse search results (real selectors)
Craigslist search results usually have rows like:
- each card/row has a link to the listing
- a title
- a price (sometimes missing)
- a neighborhood / location hint
- a date / time
In practice, the most reliable approach is:
- select rows by the CSS Craigslist consistently uses (li.result-row)
- extract the a.result-title link (href + title)
- extract span.result-price (optional)
- extract span.result-hood (optional)
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_search_results(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out: list[dict] = []
    for row in soup.select("li.result-row"):
        a = row.select_one("a.result-title")
        if not a:
            continue
        title = a.get_text(" ", strip=True)
        href = a.get("href")
        url = urljoin(base_url, href) if href else None

        price_el = row.select_one("span.result-price")
        price = price_el.get_text(strip=True) if price_el else None

        hood_el = row.select_one("span.result-hood")
        hood = hood_el.get_text(" ", strip=True).strip(" ()") if hood_el else None

        time_el = row.select_one("time.result-date")
        posted_datetime = time_el.get("datetime") if time_el else None

        out.append({
            "title": title,
            "url": url,
            "price": price,
            "hood": hood,
            "posted_datetime": posted_datetime,
        })
    return out
```
Step 4: Pagination (offset via s=)
Craigslist uses s= as an offset (often 0, 120, 240...).
We’ll crawl pages until:
- we hit a max page limit, or
- we stop seeing new URLs
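To make the offset arithmetic concrete, here's a tiny illustrative helper (`page_offsets` is our name; the 120-results-per-page figure is the commonly observed default, not a guarantee):

```python
def page_offsets(limit_pages: int, page_size: int = 120) -> list[int]:
    """Craigslist's s= parameter counts results, not pages."""
    return [page * page_size for page in range(limit_pages)]

print(page_offsets(3))  # [0, 120, 240]
```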
```python
def crawl_search(
    http: HttpClient,
    category: str = "sss",
    query: str = "standing desk",
    min_price: int | None = None,
    max_price: int | None = None,
    limit_pages: int = 5,
    page_size: int = 120,
) -> list[dict]:
    all_rows: list[dict] = []
    seen_urls: set[str] = set()
    for page in range(limit_pages):
        offset = page * page_size
        params: dict = {
            "query": query,
            "bundleDuplicates": 1,
            "s": offset,
        }
        if min_price is not None:
            params["min_price"] = min_price
        if max_price is not None:
            params["max_price"] = max_price

        r = http.get(f"/search/{category}", params=params)
        batch = parse_search_results(r.text, http.cfg.base_url)

        new_in_batch = 0
        for row in batch:
            u = row.get("url")
            if not u or u in seen_urls:
                continue
            seen_urls.add(u)
            all_rows.append(row)
            new_in_batch += 1

        print(f"page={page+1} offset={offset} rows={len(batch)} new={new_in_batch} total={len(all_rows)}")
        if new_in_batch == 0:
            break
        polite_sleep()
    return all_rows
```
```python
rows = crawl_search(
    http,
    category="sss",
    query="standing desk",
    min_price=50,
    max_price=300,
    limit_pages=3,
)
print("total", len(rows))
print(rows[0])
```
Step 5: Follow each listing page for richer fields
Search rows are great for discovery, but you often want detail fields like:
- description text
- image URLs
- attributes (condition, size, etc.)
- exact location (sometimes)
On a listing page, Craigslist commonly uses:
- title: span#titletextonly
- price: span.price
- description: section#postingbody
- images: img tags inside div.swipe-wrap or figure.iw
Because Craigslist templates vary slightly by category, we’ll implement a tolerant parser.
```python
import re


def clean_posting_body(text: str) -> str:
    # Craigslist often prefixes "QR Code Link to This Post"
    text = re.sub(r"\bQR Code Link to This Post\b", "", text, flags=re.I).strip()
    return text


def parse_listing_detail(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title_el = soup.select_one("span#titletextonly")
    title = title_el.get_text(" ", strip=True) if title_el else None

    price_el = soup.select_one("span.price")
    price = price_el.get_text(strip=True) if price_el else None

    body_el = soup.select_one("section#postingbody")
    body = clean_posting_body(body_el.get_text("\n", strip=True)) if body_el else None

    # Attributes are in p.attrgroup spans
    attrs = {}
    for span in soup.select("p.attrgroup span"):
        t = span.get_text(" ", strip=True)
        if ":" in t:
            k, v = t.split(":", 1)
            attrs[k.strip()] = v.strip()
        else:
            # standalone flags like "delivery available"
            attrs[t] = True

    # Image URLs: take any image in the gallery
    images = []
    for img in soup.select("img"):
        src = img.get("src") or img.get("data-src")
        if src and "craigslist" in src and src not in images:
            images.append(src)

    return {
        "url": url,
        "title": title,
        "price": price,
        "body": body,
        "attributes": attrs,
        "images": images,
    }
```
```python
def enrich_with_details(http: HttpClient, rows: list[dict], max_details: int = 50) -> list[dict]:
    out = []
    for i, row in enumerate(rows[:max_details], start=1):
        url = row.get("url")
        if not url:
            continue
        r = http.get(url)
        detail = parse_listing_detail(r.text, url)
        out.append({**row, **detail})
        print(f"detail {i}/{min(max_details, len(rows))} fetched")
        polite_sleep(0.6, 1.6)
    return out
```
Step 6: Export to CSV (properly)
CSV gets messy if you dump nested objects. We’ll:
- keep attributes and images as JSON strings
- ensure UTF-8
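Here's that round-trip idea in miniature, as a standalone sketch: a nested list is JSON-encoded into one CSV cell on write and decoded with json.loads on read.

```python
import csv
import io
import json

row = {"title": "Standing desk", "images": ["a.jpg", "b.jpg"]}
# Encode the nested list as a JSON string so it fits in one CSV cell
flat = {**row, "images": json.dumps(row["images"])}

buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=["title", "images"])
w.writeheader()
w.writerow(flat)

# Reading it back, json.loads restores the original list
rr = next(csv.DictReader(io.StringIO(buf.getvalue())))
print(json.loads(rr["images"]))  # ['a.jpg', 'b.jpg']
```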
```python
import csv
import json


def to_csv(rows: list[dict], path: str) -> None:
    if not rows:
        raise ValueError("No rows to write")
    # normalize keys
    fieldnames = sorted({k for r in rows for k in r.keys()})
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            rr = dict(r)
            if isinstance(rr.get("attributes"), dict):
                rr["attributes"] = json.dumps(rr["attributes"], ensure_ascii=False)
            if isinstance(rr.get("images"), list):
                rr["images"] = json.dumps(rr["images"], ensure_ascii=False)
            w.writerow(rr)
```
```python
rows = crawl_search(http, category="sss", query="standing desk", min_price=50, max_price=300, limit_pages=3)
detailed = enrich_with_details(http, rows, max_details=30)
to_csv(detailed, "craigslist_listings.csv")
print("wrote craigslist_listings.csv", len(detailed))
```
Anti-block tips (Craigslist-specific)
Craigslist is generally tolerant, but you can still get throttled if you:
- hammer one city with many requests per second
- fetch details for thousands of listings in one run
- use a default Python User-Agent
Practical mitigations:
- sleep between requests (random jitter)
- limit detail fetching (max_details) and run incrementally
- cache listing pages locally (or in SQLite) and only re-fetch new URLs
- spread across time (cron) rather than trying to do everything in one blast
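The SQLite caching idea above can be sketched with just the standard library. The names `open_cache` and `cached_fetch` are ours, and the schema is a minimal assumption; adapt both to your crawl:

```python
import sqlite3
import time

def open_cache(path: str = "cl_cache.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT, fetched_at REAL)"
    )
    return conn

def cached_fetch(conn: sqlite3.Connection, url: str, fetch) -> str:
    """Return cached HTML for a URL; otherwise call fetch(url) and store the result."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return row[0]
    html = fetch(url)
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html, fetched_at) VALUES (?, ?, ?)",
        (url, html, time.time()),
    )
    conn.commit()
    return html

# Demo with an in-memory database and a stub fetcher
conn = open_cache(":memory:")
calls = []
def fake_fetch(u):
    calls.append(u)
    return "<html>listing</html>"

cached_fetch(conn, "https://example.org/1.html", fake_fetch)
cached_fetch(conn, "https://example.org/1.html", fake_fetch)
print(len(calls))  # 1 — the second call was served from cache
```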
Where ProxiesAPI fits (honestly)
You can scrape small Craigslist batches without proxies.
But when you scale to multiple cities + categories + detail pages, failures become noisy:
- intermittent 429s
- occasional captchas or blocked IPs
- unstable throughput
ProxiesAPI is most useful as a consistent network layer: route requests through it, keep retries centralized, and rotate IPs when needed.
QA checklist
- Your search URL returns HTML (not an error page)
- Parsed rows contain a title + URL
- Pagination adds new results
- Detail pages parse body + attributes for at least a few items
- CSV opens cleanly in Google Sheets/Excel
Next upgrades
- store rows in SQLite for incremental crawls
- add deduping by Craigslist post id (present in URL)
- add structured geocoding if you need lat/lon (when available)
- add concurrency carefully (threads) with strict rate limits