How to Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)
Craigslist is one of the most useful “small HTML” targets on the internet:
- pages are mostly server-rendered (no heavy JS)
- listing cards are consistent
- the site is split by city subdomains (e.g. sfbay.craigslist.org, newyork.craigslist.org)
- categories have stable paths (e.g. /search/sss for “for sale”, /search/jjj for jobs)
In this tutorial we’ll build a production-grade Python scraper that:
- searches a city + category
- paginates through results
- extracts listing data from the results page
- optionally fetches each listing detail page for richer fields
- exports a clean CSV
- uses retries, timeouts, and a network layer you can route through ProxiesAPI

Craigslist is lightweight, but large crawls still hit rate limits and occasional blocks. ProxiesAPI helps you run consistent requests with retries and IP rotation when you scale across cities and categories.
What we’re scraping (Craigslist URL structure)
Craigslist has a few concepts worth understanding before writing selectors.
City subdomains
Each region is its own host:
- San Francisco Bay Area: https://sfbay.craigslist.org
- New York City: https://newyork.craigslist.org
- Los Angeles: https://losangeles.craigslist.org
Category paths
Craigslist uses short codes:
- sss = for sale
- hhh = housing
- jjj = jobs
Search pages look like:
https://sfbay.craigslist.org/search/sss
…and take query parameters like:
- query= free-text keyword
- min_price= / max_price=
- purveyor=owner (owner-only)
- bundleDuplicates=1 (often helps reduce duplicates)
- s= offset for pagination
Example:
https://sfbay.craigslist.org/search/sss?query=standing%20desk&min_price=50&max_price=300
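To keep URL building consistent across cities and categories, a small helper can compose these pieces with the standard library. This `build_search_url` function is an illustrative sketch (the name is ours, not a Craigslist API), assuming the subdomain + category + query-string shape shown above:

```python
from urllib.parse import urlencode

def build_search_url(city: str, category: str, **params) -> str:
    """Compose a Craigslist search URL from a city subdomain and a category code."""
    base = f"https://{city}.craigslist.org/search/{category}"
    # Drop parameters the caller left as None so they don't appear in the URL
    qs = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}?{qs}" if qs else base

print(build_search_url("sfbay", "sss", query="standing desk", min_price=50, max_price=300))
# https://sfbay.craigslist.org/search/sss?query=standing+desk&min_price=50&max_price=300
```

Note that `urlencode` escapes the space as `+`, which Craigslist accepts interchangeably with `%20`.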
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- tenacity for retry logic
Step 1: A solid fetch() with retries and timeouts
Craigslist pages are usually fast, but you still want:
- connect/read timeouts (avoid hanging)
- retry on transient network errors and 429/5xx
- a real User-Agent
Below is a clean baseline.
```python
from __future__ import annotations

import random
import time
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

import requests
from requests import Response
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

TIMEOUT = (10, 30)  # connect, read


@dataclass
class HttpConfig:
    base_url: str
    proxiesapi_url: Optional[str] = None
    user_agent: str = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    )


class HttpClient:
    def __init__(self, cfg: HttpConfig):
        self.cfg = cfg
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": cfg.user_agent,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        })

    def _build_url(self, url_or_path: str) -> str:
        if url_or_path.startswith(("http://", "https://")):
            return url_or_path
        return self.cfg.base_url.rstrip("/") + "/" + url_or_path.lstrip("/")

    def _via_proxiesapi(self, target_url: str) -> str:
        """Wrap a URL through ProxiesAPI if configured.

        IMPORTANT: Adjust this function to match your ProxiesAPI endpoint format.
        Common patterns are either:
        - https://proxiesapi.example/fetch?url=<ENCODED>
        - https://proxiesapi.example/?url=<ENCODED>
        Keep it explicit so you don't overclaim the API shape.
        """
        if not self.cfg.proxiesapi_url:
            return target_url
        return self.cfg.proxiesapi_url.rstrip("/") + "?" + urlencode({"url": target_url})

    @retry(
        reraise=True,
        stop=stop_after_attempt(5),
        wait=wait_exponential_jitter(initial=1, max=20),
        retry=retry_if_exception_type(requests.RequestException),
    )
    def get(self, url_or_path: str, *, params: dict | None = None) -> Response:
        url = self._build_url(url_or_path)
        # Encode params into the *target* URL before wrapping, so the proxy
        # endpoint doesn't mistake them for its own query parameters.
        if params:
            url += ("&" if "?" in url else "?") + urlencode(params)
        fetch_url = self._via_proxiesapi(url)
        r = self.session.get(fetch_url, timeout=TIMEOUT)
        # If ProxiesAPI returns the upstream status in headers, you can inspect it here.
        # We'll keep it simple and retry on common transient statuses.
        if r.status_code in (429, 500, 502, 503, 504):
            raise requests.RequestException(f"Transient status {r.status_code} for {url}")
        r.raise_for_status()
        return r


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> None:
    time.sleep(random.uniform(min_s, max_s))
```
Configure city + (optional) ProxiesAPI
```python
cfg = HttpConfig(
    base_url="https://sfbay.craigslist.org",  # change city here
    proxiesapi_url=None,  # e.g. "https://YOUR_PROXIESAPI_ENDPOINT/fetch"
)
http = HttpClient(cfg)
```
Step 2: Fetch a search page and confirm HTML
Start with a manual curl to sanity-check the response.
```bash
curl -s "https://sfbay.craigslist.org/search/sss?query=standing%20desk" | head -n 8
```
You should see a normal HTML document.
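You can apply the same sanity check programmatically before parsing. This `looks_like_html` helper is an illustrative sketch (the name and heuristic are ours): blocked or rate-limited responses are often JSON, plain text, or empty rather than a full HTML document.

```python
def looks_like_html(text: str) -> bool:
    """Cheap heuristic: does the response start like an HTML document?"""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")

print(looks_like_html("<!DOCTYPE html><html><body>ok</body></html>"))  # True
print(looks_like_html('{"error": "blocked"}'))                         # False
```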
Step 3: Parse search results (real selectors)
Craigslist search results usually have rows like:
- each card/row has a link to the listing
- a title
- a price (sometimes missing)
- a neighborhood / location hint
- a date / time
In practice, the most reliable approach is:
- select rows by the CSS Craigslist consistently uses (li.result-row)
- extract the a.result-title link (href + title)
- extract span.result-price (optional)
- extract span.result-hood (optional)
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_search_results(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out: list[dict] = []
    for row in soup.select("li.result-row"):
        a = row.select_one("a.result-title")
        if not a:
            continue
        title = a.get_text(" ", strip=True)
        href = a.get("href")
        url = urljoin(base_url, href) if href else None

        price_el = row.select_one("span.result-price")
        price = price_el.get_text(strip=True) if price_el else None

        hood_el = row.select_one("span.result-hood")
        hood = hood_el.get_text(" ", strip=True).strip(" ()") if hood_el else None

        time_el = row.select_one("time.result-date")
        posted_datetime = time_el.get("datetime") if time_el else None

        out.append({
            "title": title,
            "url": url,
            "price": price,
            "hood": hood,
            "posted_datetime": posted_datetime,
        })
    return out
```
Step 4: Pagination (offset via s=)
Craigslist uses s= as an offset (often 0, 120, 240...).
We’ll crawl pages until:
- we hit a max page limit, or
- we stop seeing new URLs
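To make the offset arithmetic concrete, here's a tiny illustrative helper (`page_offsets` is our name; the 120-results-per-page figure is the commonly observed default, not a guarantee):

```python
def page_offsets(limit_pages: int, page_size: int = 120) -> list[int]:
    """Craigslist's s= parameter counts results, not pages."""
    return [page * page_size for page in range(limit_pages)]

print(page_offsets(3))  # [0, 120, 240]
```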
```python
def crawl_search(
    http: HttpClient,
    category: str = "sss",
    query: str = "standing desk",
    min_price: int | None = None,
    max_price: int | None = None,
    limit_pages: int = 5,
    page_size: int = 120,
) -> list[dict]:
    all_rows: list[dict] = []
    seen_urls: set[str] = set()
    for page in range(limit_pages):
        offset = page * page_size
        params: dict = {
            "query": query,
            "bundleDuplicates": 1,
            "s": offset,
        }
        if min_price is not None:
            params["min_price"] = min_price
        if max_price is not None:
            params["max_price"] = max_price

        r = http.get(f"/search/{category}", params=params)
        batch = parse_search_results(r.text, http.cfg.base_url)

        new_in_batch = 0
        for row in batch:
            u = row.get("url")
            if not u or u in seen_urls:
                continue
            seen_urls.add(u)
            all_rows.append(row)
            new_in_batch += 1

        print(f"page={page+1} offset={offset} rows={len(batch)} new={new_in_batch} total={len(all_rows)}")
        if new_in_batch == 0:
            break
        polite_sleep()
    return all_rows
```
```python
rows = crawl_search(
    http,
    category="sss",
    query="standing desk",
    min_price=50,
    max_price=300,
    limit_pages=3,
)
print("total", len(rows))
print(rows[0])
```
Step 5: Follow each listing page for richer fields
Search rows are great for discovery, but you often want detail fields like:
- description text
- image URLs
- attributes (condition, size, etc.)
- exact location (sometimes)
On a listing page, Craigslist commonly uses:
- title: span#titletextonly
- price: span.price
- description: section#postingbody
- images: img tags inside div.swipe-wrap or figure.iw
Because Craigslist templates vary slightly by category, we’ll implement a tolerant parser.
```python
import re


def clean_posting_body(text: str) -> str:
    # Craigslist often prefixes "QR Code Link to This Post"
    text = re.sub(r"\bQR Code Link to This Post\b", "", text, flags=re.I).strip()
    return text


def parse_listing_detail(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title_el = soup.select_one("span#titletextonly")
    title = title_el.get_text(" ", strip=True) if title_el else None

    price_el = soup.select_one("span.price")
    price = price_el.get_text(strip=True) if price_el else None

    body_el = soup.select_one("section#postingbody")
    body = clean_posting_body(body_el.get_text("\n", strip=True)) if body_el else None

    # Attributes are in p.attrgroup spans
    attrs = {}
    for span in soup.select("p.attrgroup span"):
        t = span.get_text(" ", strip=True)
        if ":" in t:
            k, v = t.split(":", 1)
            attrs[k.strip()] = v.strip()
        else:
            # standalone flags like "delivery available"
            attrs[t] = True

    # Image URLs: take any image in the gallery
    images = []
    for img in soup.select("img"):
        src = img.get("src") or img.get("data-src")
        if src and "craigslist" in src and src not in images:
            images.append(src)

    return {
        "url": url,
        "title": title,
        "price": price,
        "body": body,
        "attributes": attrs,
        "images": images,
    }
```
```python
def enrich_with_details(http: HttpClient, rows: list[dict], max_details: int = 50) -> list[dict]:
    out = []
    for i, row in enumerate(rows[:max_details], start=1):
        url = row.get("url")
        if not url:
            continue
        r = http.get(url)
        detail = parse_listing_detail(r.text, url)
        out.append({**row, **detail})
        print(f"detail {i}/{min(max_details, len(rows))} fetched")
        polite_sleep(0.6, 1.6)
    return out
```
Step 6: Export to CSV (properly)
CSV gets messy if you dump nested objects. We’ll:
- keep attributes and images as JSON strings
- ensure UTF-8
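Here's that round-trip idea in miniature, as a standalone sketch: a nested list is JSON-encoded into one CSV cell on write and decoded with json.loads on read.

```python
import csv
import io
import json

row = {"title": "Standing desk", "images": ["a.jpg", "b.jpg"]}
# Encode the nested list as a JSON string so it fits in one CSV cell
flat = {**row, "images": json.dumps(row["images"])}

buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=["title", "images"])
w.writeheader()
w.writerow(flat)

# Reading it back, json.loads restores the original list
rr = next(csv.DictReader(io.StringIO(buf.getvalue())))
print(json.loads(rr["images"]))  # ['a.jpg', 'b.jpg']
```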
```python
import csv
import json


def to_csv(rows: list[dict], path: str) -> None:
    if not rows:
        raise ValueError("No rows to write")
    # normalize keys
    fieldnames = sorted({k for r in rows for k in r.keys()})
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            rr = dict(r)
            if isinstance(rr.get("attributes"), dict):
                rr["attributes"] = json.dumps(rr["attributes"], ensure_ascii=False)
            if isinstance(rr.get("images"), list):
                rr["images"] = json.dumps(rr["images"], ensure_ascii=False)
            w.writerow(rr)
```
```python
rows = crawl_search(http, category="sss", query="standing desk", min_price=50, max_price=300, limit_pages=3)
detailed = enrich_with_details(http, rows, max_details=30)
to_csv(detailed, "craigslist_listings.csv")
print("wrote craigslist_listings.csv", len(detailed))
```
Anti-block tips (Craigslist-specific)
Craigslist is generally tolerant, but you can still get throttled if you:
- hammer one city with many requests per second
- fetch details for thousands of listings in one run
- use a default Python User-Agent
Practical mitigations:
- sleep between requests (random jitter)
- limit detail fetching (max_details) and run incrementally
- cache listing pages locally (or in SQLite) and only re-fetch new URLs
- spread across time (cron) rather than trying to do everything in one blast
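The SQLite caching idea above can be sketched with just the standard library. The names `open_cache` and `cached_fetch` are ours, and the schema is a minimal assumption; adapt both to your crawl:

```python
import sqlite3
import time

def open_cache(path: str = "cl_cache.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT, fetched_at REAL)"
    )
    return conn

def cached_fetch(conn: sqlite3.Connection, url: str, fetch) -> str:
    """Return cached HTML for a URL; otherwise call fetch(url) and store the result."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return row[0]
    html = fetch(url)
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html, fetched_at) VALUES (?, ?, ?)",
        (url, html, time.time()),
    )
    conn.commit()
    return html

# Demo with an in-memory database and a stub fetcher
conn = open_cache(":memory:")
calls = []
def fake_fetch(u):
    calls.append(u)
    return "<html>listing</html>"

cached_fetch(conn, "https://example.org/1.html", fake_fetch)
cached_fetch(conn, "https://example.org/1.html", fake_fetch)
print(len(calls))  # 1 — the second call was served from cache
```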
Where ProxiesAPI fits (honestly)
You can scrape small Craigslist batches without proxies.
But when you scale to multiple cities + categories + detail pages, failures become noisy:
- intermittent 429s
- occasional captchas or blocked IPs
- unstable throughput
ProxiesAPI is most useful as a consistent network layer: route requests through it, keep retries centralized, and rotate IPs when needed.
QA checklist
- Your search URL returns HTML (not an error page)
- Parsed rows contain a title + URL
- Pagination adds new results
- Detail pages parse body + attributes for at least a few items
- CSV opens cleanly in Google Sheets/Excel
Next upgrades
- store rows in SQLite for incremental crawls
- add deduping by Craigslist post id (present in URL)
- add structured geocoding if you need lat/lon (when available)
- add concurrency carefully (threads) with strict rate limits