Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)

Craigslist is one of the best “real-world” scraping targets because it’s mostly server-rendered HTML and the URL structure is predictable.

In this guide, you’ll build a production-style scraper that:

  • targets a city + category (e.g., SF Bay Area → for sale → bicycles)
  • crawls pagination
  • extracts clean fields (title, price, location, url, post id, date)
  • dedupes results across pages
  • exports to CSV

We’ll also show where ProxiesAPI fits into the network layer when you scale up.

[Screenshot: Craigslist results page (we’ll scrape listing rows + pagination)]

Make Craigslist scrapes more reliable with ProxiesAPI

Craigslist is usually straightforward, but bigger crawls get noisy (timeouts, throttling, IP-based blocks). ProxiesAPI helps keep your fetch layer stable while you focus on parsing + dedupe + exports.


What we’re scraping (Craigslist structure)

Craigslist is split into city subdomains, for example:

  • San Francisco Bay Area: https://sfbay.craigslist.org/
  • New York: https://newyork.craigslist.org/

Within a city, categories have short slugs. Example for bicycles for sale:

  • https://sfbay.craigslist.org/search/bia

A search results page contains a list of <li class="cl-static-search-result"> ... items (newer layout) or <li class="result-row"> ... items (older layout). Craigslist has been migrating layouts, so we’ll support both.

Pagination is typically via a query param like:

  • ?s=120 (offset)

We’ll implement pagination by following the “next” link if present, and fall back to s= offsets when needed.
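
To make the URL scheme concrete, here’s a small helper you can use once the code below is set up. The sfbay and bia slugs come from the examples above; treat any other city or category slug as an assumption to verify against the live site.

from urllib.parse import urlencode


def build_search_url(city: str, category: str, offset: int = 0) -> str:
    """Build a search URL, e.g. build_search_url("sfbay", "bia", 120)."""
    base = f"https://{city}.craigslist.org/search/{category}"
    return f"{base}?{urlencode({'s': offset})}" if offset else base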


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup with the lxml parser for HTML parsing

Note: the type hints below use the X | Y syntax, so you’ll need Python 3.10+.

Step 1: Build a fetcher (Requests) + ProxiesAPI hook

First: write a fetch function with real timeouts and a decent User-Agent.

You have two common approaches:

  1. Direct requests (works for small, polite crawls)
  2. Requests routed through ProxiesAPI (helps when you’re crawling more pages, more categories, or more cities)

Below is a simple pattern that supports both.

import os
import time
from urllib.parse import urljoin

import requests

TIMEOUT = (10, 30)  # connect, read
UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

session = requests.Session()
session.headers.update({
    "User-Agent": UA,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "")


def fetch(url: str) -> str:
    """Fetch HTML, optionally via ProxiesAPI.

    IMPORTANT: Keep claims honest. ProxiesAPI changes the network path; it does not
    magically bypass every block.
    """
    # Option A: direct
    if not PROXIESAPI_KEY:
        r = session.get(url, timeout=TIMEOUT)
        r.raise_for_status()
        return r.text

    # Option B: via ProxiesAPI (example style)
    # Adjust parameter names to match your ProxiesAPI account docs.
    proxy_url = "https://api.proxiesapi.com"
    params = {
        "api_key": PROXIESAPI_KEY,
        "url": url,
        # Common optional knobs (names vary by provider):
        # "render": "false",
        # "country": "US",
        # "session": "cl_1",
    }

    r = session.get(proxy_url, params=params, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def polite_sleep(i: int) -> None:
    # a small, staggered delay reduces burstiness; swap in random.uniform()
    # if you want true jitter
    time.sleep(1.0 + (i % 3) * 0.3)

If you don’t set PROXIESAPI_KEY, the code runs directly (good for local tests).
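
A quick smoke test of the fetcher (direct mode, assuming network access):

html = fetch("https://sfbay.craigslist.org/search/bia")
print(len(html), "characters of HTML")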


Step 2: Parse listings from a results page

We want these fields:

  • post_id
  • title
  • price
  • location (if shown)
  • url
  • posted_at (if available)

Craigslist listing URLs usually contain a numeric id, e.g.:

https://sfbay.craigslist.org/sfc/bia/d/san-francisco-something/1234567890.html

We’ll extract the id from the URL.

import re
from bs4 import BeautifulSoup

ID_RE = re.compile(r"/(\d+)\.html")


def extract_post_id(href: str) -> str | None:
    if not href:
        return None
    m = ID_RE.search(href)
    return m.group(1) if m else None


def parse_results(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out: list[dict] = []

    # Newer static layout
    items = soup.select("li.cl-static-search-result")
    if items:
        for li in items:
            a = li.select_one("a")
            href = a.get("href") if a else None
            url = urljoin(base_url, href) if href else None

            # the anchor often wraps the whole card, so prefer the title div
            # and fall back to the anchor text
            title_el = li.select_one("div.title") or a
            title = title_el.get_text(" ", strip=True) if title_el else None

            # price markup varies (span.price vs div.price); match either
            price_el = li.select_one(".price")
            price = price_el.get_text(" ", strip=True) if price_el else None

            loc_el = li.select_one("div.location")
            location = loc_el.get_text(" ", strip=True) if loc_el else None

            time_el = li.select_one("time")
            posted_at = time_el.get("datetime") if time_el else None

            out.append({
                "post_id": extract_post_id(url or ""),
                "title": title,
                "price": price,
                "location": location,
                "posted_at": posted_at,
                "url": url,
            })
        return out

    # Older layout fallback
    for row in soup.select("li.result-row"):
        a = row.select_one("a.result-title")
        href = a.get("href") if a else None
        url = urljoin(base_url, href) if href else None

        title = a.get_text(" ", strip=True) if a else None

        price_el = row.select_one("span.result-price")
        price = price_el.get_text(" ", strip=True) if price_el else None

        hood_el = row.select_one("span.result-hood")
        location = hood_el.get_text(" ", strip=True).strip(" ()") if hood_el else None

        time_el = row.select_one("time.result-date")
        posted_at = time_el.get("datetime") if time_el else None

        out.append({
            "post_id": extract_post_id(url or ""),
            "title": title,
            "price": price,
            "location": location,
            "posted_at": posted_at,
            "url": url,
        })

    return out

Step 3: Pagination (follow “next”)

Craigslist pagination changes over time. The most robust approach is:

  1. Parse the page
  2. Try to locate a “next” link
  3. Crawl until no next link


def find_next_url(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Common pattern: a.next
    a = soup.select_one("a.next")
    if a and a.get("href"):
        return urljoin(base_url, a.get("href"))

    # Alternate pattern: link rel=next
    link = soup.select_one("link[rel='next']")
    if link and link.get("href"):
        return urljoin(base_url, link.get("href"))

    return None


def crawl_search(start_url: str, max_pages: int = 5) -> list[dict]:
    all_rows: list[dict] = []
    seen_ids: set[str] = set()

    url = start_url

    for i in range(max_pages):
        html = fetch(url)
        rows = parse_results(html, base_url=url)

        for r in rows:
            pid = r.get("post_id")
            if not pid:
                # no id → keep but don’t dedupe strongly
                all_rows.append(r)
                continue
            if pid in seen_ids:
                continue
            seen_ids.add(pid)
            all_rows.append(r)

        next_url = find_next_url(html, base_url=url)
        if not next_url:
            break

        url = next_url
        polite_sleep(i)

    return all_rows

Step 4: Export to CSV

import csv


def write_csv(rows: list[dict], path: str) -> None:
    fields = ["post_id", "title", "price", "location", "posted_at", "url"]

    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


if __name__ == "__main__":
    # Example: SF Bay Area → bicycles (bia)
    start = "https://sfbay.craigslist.org/search/bia"
    rows = crawl_search(start, max_pages=5)

    print("rows:", len(rows))
    print("sample:", rows[0] if rows else None)

    write_csv(rows, "craigslist_bia_sfbay.csv")
    print("wrote craigslist_bia_sfbay.csv")

Selector rationale + troubleshooting

1) Why support both layouts?

Craigslist has multiple HTML layouts in the wild. Supporting both li.cl-static-search-result (newer) and li.result-row (older) makes your scraper survive transitions.

2) Missing price / location

Not all listings include location or a structured price. Your output should tolerate None.
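
If you want numeric prices downstream, normalize the raw strings with a helper that tolerates None. Here’s a minimal sketch; parse_price is a hypothetical helper, and the "$1,200" format is an assumption about typical listing text:

import re


def parse_price(price: str | None) -> int | None:
    # "$1,200" -> 1200; None or unparseable -> None
    if not price:
        return None
    digits = re.sub(r"[^\d]", "", price)
    return int(digits) if digits else None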

3) Getting blocked / rate-limited

Be realistic:

  • start slow (few pages)
  • add jitter (polite_sleep)
  • avoid fetching listing detail pages unless you need them

When your crawl grows (multiple categories × multiple cities), ProxiesAPI can help by stabilizing the fetch layer.
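
For transient failures (timeouts, 5xx responses), a thin retry wrapper around fetch() from Step 1 keeps one flaky request from ending the crawl. A minimal sketch with exponential backoff; fetch_with_retries is a hypothetical helper, not part of any library:

def fetch_with_retries(url: str, attempts: int = 3) -> str:
    last_err: Exception | None = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except requests.RequestException as e:
            last_err = e
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
    raise last_err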


Where ProxiesAPI fits (honestly)

Craigslist often works without proxies for small crawls.

But scrapers fail in production due to:

  • request bursts (pagination across many categories)
  • regional routing differences
  • IP-based throttling
  • transient network errors

A proxy API like ProxiesAPI helps you make the network layer more resilient so your code spends less time on retries.


QA checklist

  • Scraper returns non-zero rows for a known category
  • URLs are absolute and include the numeric post id
  • Dedupe keeps only unique post_id
  • CSV opens cleanly in Excel/Google Sheets
  • Crawl stops when there’s no next page
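
Most of that checklist can run as a quick assertion pass over the crawled rows; a minimal sketch (qa_check is a hypothetical helper):

def qa_check(rows: list[dict]) -> None:
    assert rows, "expected non-zero rows"
    ids = [r["post_id"] for r in rows if r["post_id"]]
    assert len(ids) == len(set(ids)), "duplicate post_id slipped through"
    assert all(r["url"] and r["url"].startswith("https://") for r in rows), "found a non-absolute URL"
    print("QA passed for", len(rows), "rows")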
