Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)
Craigslist is one of the best “real-world” scraping targets because it’s mostly server-rendered HTML and the URL structure is predictable.
In this guide, you’ll build a production-style scraper that:
- targets a city + category (e.g., SF Bay Area → for sale → bicycles)
- crawls pagination
- extracts clean fields (title, price, location, url, post id, date)
- dedupes results across pages
- exports to CSV
We’ll also show where ProxiesAPI fits into the network layer when you scale up.

Craigslist is usually straightforward, but bigger crawls get noisy (timeouts, throttling, IP-based blocks). ProxiesAPI helps keep your fetch layer stable while you focus on parsing + dedupe + exports.
What we’re scraping (Craigslist structure)
Craigslist is split into city subdomains, for example:
- San Francisco Bay Area: https://sfbay.craigslist.org/
- New York: https://newyork.craigslist.org/
Within a city, categories have short slugs. Example for bicycles for sale:
https://sfbay.craigslist.org/search/bia
A search results page contains a list of <li class="cl-static-search-result"> ... items (newer layout) or <li class="result-row"> ... items (older layout). Craigslist has been migrating layouts, so we’ll support both.
Pagination is typically driven by a query parameter, for example:
?s=120
where s is the result offset (here, "skip the first 120 results").
We’ll implement pagination by following the “next” link if present, and fall back to s= offsets when needed.
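For reference, building a search URL from a city subdomain, category slug, and offset is just string formatting. A small sketch based on the URL patterns above (the search_url helper is ours, for illustration only):

def search_url(city: str, category: str, offset: int = 0) -> str:
    """Build a Craigslist search URL, e.g. search_url("sfbay", "bia", 120)."""
    base = f"https://{city}.craigslist.org/search/{category}"
    return f"{base}?s={offset}" if offset else base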
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (with the lxml parser) for HTML parsing
Step 1: Build a fetcher (Requests) + ProxiesAPI hook
First, write a fetch function with real timeouts and a realistic User-Agent.
You have two common approaches:
- Direct requests (works for small, polite crawls)
- Requests routed through ProxiesAPI (helps when you’re crawling more pages, more categories, or more cities)
Below is a simple pattern that supports both.
import os
import time
from urllib.parse import urljoin

import requests

TIMEOUT = (10, 30)  # (connect, read) seconds
UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

session = requests.Session()
session.headers.update({
    "User-Agent": UA,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "")


def fetch(url: str) -> str:
    """Fetch HTML, optionally via ProxiesAPI.

    Note: ProxiesAPI changes the network path; it does not magically
    bypass every block.
    """
    # Option A: direct
    if not PROXIESAPI_KEY:
        r = session.get(url, timeout=TIMEOUT)
        r.raise_for_status()
        return r.text

    # Option B: via ProxiesAPI (example style)
    # Adjust parameter names to match your ProxiesAPI account docs.
    proxy_url = "https://api.proxiesapi.com"
    params = {
        "api_key": PROXIESAPI_KEY,
        "url": url,
        # Common optional knobs (names vary by provider):
        # "render": "false",
        # "country": "US",
        # "session": "cl_1",
    }
    r = session.get(proxy_url, params=params, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def polite_sleep(i: int) -> None:
    # keep it simple: a little jitter reduces burstiness
    time.sleep(1.0 + (i % 3) * 0.3)
If you don’t set PROXIESAPI_KEY, the code runs directly (good for local tests).
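For larger crawls, transient timeouts and 5xx responses are worth retrying. Here is a minimal sketch of a backoff wrapper around fetch (fetch_with_retries is an addition for illustration; tune the attempt count and delays to your crawl):

def fetch_with_retries(url: str, attempts: int = 3, backoff: float = 2.0) -> str:
    """Call fetch() with simple exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(backoff ** attempt)  # 1s, 2s, 4s, ...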
Step 2: Parse listings from a results page
We want these fields:
- post_id
- title
- price
- location (if shown)
- url
- posted_at (if available)
Craigslist listing URLs usually contain a numeric id, e.g.:
https://sfbay.craigslist.org/sfc/bia/d/san-francisco-something/1234567890.html
We’ll extract the id from the URL.
import re
from bs4 import BeautifulSoup

ID_RE = re.compile(r"/(\d+)\.html")


def extract_post_id(href: str) -> str | None:
    if not href:
        return None
    m = ID_RE.search(href)
    return m.group(1) if m else None


def parse_results(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out: list[dict] = []

    # Newer static layout
    items = soup.select("li.cl-static-search-result")
    if items:
        for li in items:
            a = li.select_one("a")
            href = a.get("href") if a else None
            url = urljoin(base_url, href) if href else None
            title = a.get_text(" ", strip=True) if a else None

            price_el = li.select_one("span.price")
            price = price_el.get_text(" ", strip=True) if price_el else None

            loc_el = li.select_one("div.location")
            location = loc_el.get_text(" ", strip=True) if loc_el else None

            time_el = li.select_one("time")
            posted_at = time_el.get("datetime") if time_el else None

            out.append({
                "post_id": extract_post_id(url or ""),
                "title": title,
                "price": price,
                "location": location,
                "posted_at": posted_at,
                "url": url,
            })
        return out

    # Older layout fallback
    for row in soup.select("li.result-row"):
        a = row.select_one("a.result-title")
        href = a.get("href") if a else None
        url = urljoin(base_url, href) if href else None
        title = a.get_text(" ", strip=True) if a else None

        price_el = row.select_one("span.result-price")
        price = price_el.get_text(" ", strip=True) if price_el else None

        hood_el = row.select_one("span.result-hood")
        location = hood_el.get_text(" ", strip=True).strip(" ()") if hood_el else None

        time_el = row.select_one("time.result-date")
        posted_at = time_el.get("datetime") if time_el else None

        out.append({
            "post_id": extract_post_id(url or ""),
            "title": title,
            "price": price,
            "location": location,
            "posted_at": posted_at,
            "url": url,
        })
    return out
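To sanity-check the selectors without hitting the site, you can run parse_results on a small inline snippet. The HTML below is a simplified, made-up fragment that mimics the newer layout; real pages carry more markup:

sample_html = """
<ul>
  <li class="cl-static-search-result">
    <a href="https://sfbay.craigslist.org/sfc/bia/d/san-francisco-road-bike/1234567890.html">Road bike</a>
    <span class="price">$250</span>
    <div class="location">san francisco</div>
  </li>
</ul>
"""
rows = parse_results(sample_html, base_url="https://sfbay.craigslist.org/search/bia")
print(rows[0])
# expected: post_id "1234567890", title "Road bike", price "$250", location "san francisco"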
Step 3: Pagination (follow “next”)
Craigslist pagination changes over time. The most robust approach is:
- Parse the page
- Try to locate a “next” link
- Crawl until no next link
def find_next_url(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Common pattern: a.next
    a = soup.select_one("a.next")
    if a and a.get("href"):
        return urljoin(base_url, a.get("href"))

    # Alternate pattern: link rel=next
    link = soup.select_one("link[rel='next']")
    if link and link.get("href"):
        return urljoin(base_url, link.get("href"))

    return None
def crawl_search(start_url: str, max_pages: int = 5) -> list[dict]:
    all_rows: list[dict] = []
    seen_ids: set[str] = set()
    url = start_url

    for i in range(max_pages):
        html = fetch(url)
        rows = parse_results(html, base_url=url)

        for r in rows:
            pid = r.get("post_id")
            if not pid:
                # no id → keep but don't dedupe strongly
                all_rows.append(r)
                continue
            if pid in seen_ids:
                continue
            seen_ids.add(pid)
            all_rows.append(r)

        next_url = find_next_url(html, base_url=url)
        if not next_url:
            break
        url = next_url
        polite_sleep(i)

    return all_rows
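If a page has no "next" link at all (it happens during layout transitions), you can fall back to stepping the s= offset directly, as mentioned earlier. A sketch, assuming roughly 120 results per page (verify the page size for your category):

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def crawl_search_by_offset(start_url: str, max_pages: int = 5, page_size: int = 120) -> list[dict]:
    """Fallback pagination: request s=0, 120, 240, ... instead of following "next"."""
    all_rows: list[dict] = []
    seen_ids: set[str] = set()

    for i in range(max_pages):
        parts = urlparse(start_url)
        query = parse_qs(parts.query)
        query["s"] = [str(i * page_size)]
        url = urlunparse(parts._replace(query=urlencode(query, doseq=True)))

        html = fetch(url)
        rows = parse_results(html, base_url=url)
        if not rows:
            break  # ran past the last page

        new_ids = 0
        for r in rows:
            pid = r.get("post_id")
            if pid and pid in seen_ids:
                continue
            if pid:
                seen_ids.add(pid)
                new_ids += 1
            all_rows.append(r)

        if new_ids == 0:
            break  # page only repeated listings we already have
        polite_sleep(i)

    return all_rows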
Step 4: Export to CSV
import csv


def write_csv(rows: list[dict], path: str) -> None:
    fields = ["post_id", "title", "price", "location", "posted_at", "url"]
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


if __name__ == "__main__":
    # Example: SF Bay Area → bicycles (bia)
    start = "https://sfbay.craigslist.org/search/bia"
    rows = crawl_search(start, max_pages=5)
    print("rows:", len(rows))
    print("sample:", rows[0] if rows else None)
    write_csv(rows, "craigslist_bia_sfbay.csv")
    print("wrote craigslist_bia_sfbay.csv")
Selector rationale + troubleshooting
1) Why support both layouts?
Craigslist has multiple HTML layouts in the wild. Supporting both li.cl-static-search-result (newer) and li.result-row (older) makes your scraper survive transitions.
2) Missing price / location
Not all listings include location or a structured price. Your output should tolerate None.
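If you want numeric prices downstream, a small normalizer helps. This is a sketch (parse_price is our own helper, not part of the scraper above); it simply passes None through:

def parse_price(price: str | None) -> int | None:
    """Turn "$1,250" into 1250; return None for missing or unparseable prices."""
    if not price:
        return None
    digits = re.sub(r"[^\d]", "", price)
    return int(digits) if digits else None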
3) Getting blocked / rate-limited
Be realistic:
- start slow (a few pages)
- add jitter (polite_sleep)
- avoid fetching listing detail pages unless you need them
When your crawl grows (multiple categories × multiple cities), ProxiesAPI can help by stabilizing the fetch layer.
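As a sketch of what that larger crawl might look like (the city subdomains and category slugs below are just the examples used in this guide):

CITIES = ["sfbay", "newyork"]   # Craigslist city subdomains
CATEGORIES = ["bia"]            # search slugs, e.g. bicycles for sale


def crawl_many(max_pages: int = 3) -> list[dict]:
    all_rows: list[dict] = []
    for city in CITIES:
        for cat in CATEGORIES:
            start = f"https://{city}.craigslist.org/search/{cat}"
            all_rows.extend(crawl_search(start, max_pages=max_pages))
            time.sleep(5)  # extra pause between city/category crawls
    return all_rows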
Where ProxiesAPI fits (honestly)
Craigslist often works without proxies for small crawls.
But scrapers fail in production due to:
- request bursts (pagination across many categories)
- regional routing differences
- IP-based throttling
- transient network errors
A proxy API like ProxiesAPI helps you make the network layer more resilient so your code spends less time on retries.
QA checklist
- Scraper returns non-zero rows for a known category
- URLs are absolute and include the numeric post id
- Dedupe keeps only unique post_id values
- CSV opens cleanly in Excel/Google Sheets
- Crawl stops when there’s no next page
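These checks are easy to encode as a small smoke test (a sketch; run it against a short crawl before scaling up):

def smoke_test(rows: list[dict]) -> None:
    assert rows, "expected non-zero rows for a known category"
    ids = [r["post_id"] for r in rows if r.get("post_id")]
    assert len(ids) == len(set(ids)), "dedupe should keep only unique post_ids"
    assert all((r.get("url") or "").startswith("https://") for r in rows), "URLs should be absolute"


rows = crawl_search("https://sfbay.craigslist.org/search/bia", max_pages=2)
smoke_test(rows)
print("smoke test passed:", len(rows), "rows")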