Scrape Craigslist Listings by Category and City (Python + ProxiesAPI)
Craigslist is one of the best “real-world” scraping targets because it’s mostly server-rendered HTML and the URL structure is predictable.
In this guide, you’ll build a production-style scraper that:
- targets a city + category (e.g., SF Bay Area → for sale → bicycles)
- crawls pagination
- extracts clean fields (title, price, location, url, post id, date)
- dedupes results across pages
- exports to CSV
We’ll also show where ProxiesAPI fits into the network layer when you scale up.

Craigslist is usually straightforward, but bigger crawls get noisy (timeouts, throttling, IP-based blocks). ProxiesAPI helps keep your fetch layer stable while you focus on parsing + dedupe + exports.
What we’re scraping (Craigslist structure)
Craigslist is split into city subdomains, for example:
- San Francisco Bay Area: https://sfbay.craigslist.org/
- New York: https://newyork.craigslist.org/
Within a city, categories have short slugs. Example for bicycles for sale:
https://sfbay.craigslist.org/search/bia
A search results page contains a list of <li class="cl-static-search-result"> ... items (newer layout) or <li class="result-row"> ... items (older layout). Craigslist has been migrating layouts, so we’ll support both.
Pagination is typically driven by a query parameter, for example:
?s=120
where s is the result offset (here, "skip the first 120 results").
We’ll implement pagination by following the “next” link if present, and fall back to s= offsets when needed.
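For reference, building a search URL from a city subdomain, category slug, and offset is just string formatting. A small sketch based on the URL patterns above (the search_url helper is ours, for illustration only):

def search_url(city: str, category: str, offset: int = 0) -> str:
    """Build a Craigslist search URL, e.g. search_url("sfbay", "bia", 120)."""
    base = f"https://{city}.craigslist.org/search/{category}"
    return f"{base}?s={offset}" if offset else base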
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (with the lxml parser) for HTML parsing
Step 1: Build a fetcher (Requests) + ProxiesAPI hook
First, write a fetch function with real timeouts and a realistic User-Agent.
You have two common approaches:
- Direct requests (works for small, polite crawls)
- Requests routed through ProxiesAPI (helps when you’re crawling more pages, more categories, or more cities)
Below is a simple pattern that supports both.
import os
import time
from urllib.parse import urljoin

import requests

TIMEOUT = (10, 30)  # (connect, read) seconds
UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

session = requests.Session()
session.headers.update({
    "User-Agent": UA,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "")


def fetch(url: str) -> str:
    """Fetch HTML, optionally via ProxiesAPI.

    Note: ProxiesAPI changes the network path; it does not magically
    bypass every block.
    """
    # Option A: direct
    if not PROXIESAPI_KEY:
        r = session.get(url, timeout=TIMEOUT)
        r.raise_for_status()
        return r.text

    # Option B: via ProxiesAPI (example style)
    # Adjust parameter names to match your ProxiesAPI account docs.
    proxy_url = "https://api.proxiesapi.com"
    params = {
        "api_key": PROXIESAPI_KEY,
        "url": url,
        # Common optional knobs (names vary by provider):
        # "render": "false",
        # "country": "US",
        # "session": "cl_1",
    }
    r = session.get(proxy_url, params=params, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def polite_sleep(i: int) -> None:
    # keep it simple: a little jitter reduces burstiness
    time.sleep(1.0 + (i % 3) * 0.3)
If you don’t set PROXIESAPI_KEY, the code runs directly (good for local tests).
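For larger crawls, transient timeouts and 5xx responses are worth retrying. Here is a minimal sketch of a backoff wrapper around fetch (fetch_with_retries is an addition for illustration; tune the attempt count and delays to your crawl):

def fetch_with_retries(url: str, attempts: int = 3, backoff: float = 2.0) -> str:
    """Call fetch() with simple exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(backoff ** attempt)  # 1s, 2s, 4s, ...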
Step 2: Parse listings from a results page
We want these fields:
- post_id
- title
- price
- location (if shown)
- url
- posted_at (if available)
Craigslist listing URLs usually contain a numeric id, e.g.:
https://sfbay.craigslist.org/sfc/bia/d/san-francisco-something/1234567890.html
We’ll extract the id from the URL.
import re
from bs4 import BeautifulSoup

ID_RE = re.compile(r"/(\d+)\.html")


def extract_post_id(href: str) -> str | None:
    if not href:
        return None
    m = ID_RE.search(href)
    return m.group(1) if m else None


def parse_results(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out: list[dict] = []

    # Newer static layout
    items = soup.select("li.cl-static-search-result")
    if items:
        for li in items:
            a = li.select_one("a")
            href = a.get("href") if a else None
            url = urljoin(base_url, href) if href else None
            title = a.get_text(" ", strip=True) if a else None

            price_el = li.select_one("span.price")
            price = price_el.get_text(" ", strip=True) if price_el else None

            loc_el = li.select_one("div.location")
            location = loc_el.get_text(" ", strip=True) if loc_el else None

            time_el = li.select_one("time")
            posted_at = time_el.get("datetime") if time_el else None

            out.append({
                "post_id": extract_post_id(url or ""),
                "title": title,
                "price": price,
                "location": location,
                "posted_at": posted_at,
                "url": url,
            })
        return out

    # Older layout fallback
    for row in soup.select("li.result-row"):
        a = row.select_one("a.result-title")
        href = a.get("href") if a else None
        url = urljoin(base_url, href) if href else None
        title = a.get_text(" ", strip=True) if a else None

        price_el = row.select_one("span.result-price")
        price = price_el.get_text(" ", strip=True) if price_el else None

        hood_el = row.select_one("span.result-hood")
        location = hood_el.get_text(" ", strip=True).strip(" ()") if hood_el else None

        time_el = row.select_one("time.result-date")
        posted_at = time_el.get("datetime") if time_el else None

        out.append({
            "post_id": extract_post_id(url or ""),
            "title": title,
            "price": price,
            "location": location,
            "posted_at": posted_at,
            "url": url,
        })
    return out
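To sanity-check the selectors without hitting the site, you can run parse_results on a small inline snippet. The HTML below is a simplified, made-up fragment that mimics the newer layout; real pages carry more markup:

sample_html = """
<ul>
  <li class="cl-static-search-result">
    <a href="https://sfbay.craigslist.org/sfc/bia/d/san-francisco-road-bike/1234567890.html">Road bike</a>
    <span class="price">$250</span>
    <div class="location">san francisco</div>
  </li>
</ul>
"""
rows = parse_results(sample_html, base_url="https://sfbay.craigslist.org/search/bia")
print(rows[0])
# expected: post_id "1234567890", title "Road bike", price "$250", location "san francisco"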
Step 3: Pagination (follow “next”)
Craigslist pagination changes over time. The most robust approach is:
- Parse the page
- Try to locate a “next” link
- Crawl until no next link
def find_next_url(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Common pattern: a.next
    a = soup.select_one("a.next")
    if a and a.get("href"):
        return urljoin(base_url, a.get("href"))

    # Alternate pattern: link rel=next
    link = soup.select_one("link[rel='next']")
    if link and link.get("href"):
        return urljoin(base_url, link.get("href"))

    return None
def crawl_search(start_url: str, max_pages: int = 5) -> list[dict]:
    all_rows: list[dict] = []
    seen_ids: set[str] = set()
    url = start_url

    for i in range(max_pages):
        html = fetch(url)
        rows = parse_results(html, base_url=url)

        for r in rows:
            pid = r.get("post_id")
            if not pid:
                # no id → keep but don't dedupe strongly
                all_rows.append(r)
                continue
            if pid in seen_ids:
                continue
            seen_ids.add(pid)
            all_rows.append(r)

        next_url = find_next_url(html, base_url=url)
        if not next_url:
            break
        url = next_url
        polite_sleep(i)

    return all_rows
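If a page has no "next" link at all (it happens during layout transitions), you can fall back to stepping the s= offset directly, as mentioned earlier. A sketch, assuming roughly 120 results per page (verify the page size for your category):

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def crawl_search_by_offset(start_url: str, max_pages: int = 5, page_size: int = 120) -> list[dict]:
    """Fallback pagination: request s=0, 120, 240, ... instead of following "next"."""
    all_rows: list[dict] = []
    seen_ids: set[str] = set()

    for i in range(max_pages):
        parts = urlparse(start_url)
        query = parse_qs(parts.query)
        query["s"] = [str(i * page_size)]
        url = urlunparse(parts._replace(query=urlencode(query, doseq=True)))

        html = fetch(url)
        rows = parse_results(html, base_url=url)
        if not rows:
            break  # ran past the last page

        new_ids = 0
        for r in rows:
            pid = r.get("post_id")
            if pid and pid in seen_ids:
                continue
            if pid:
                seen_ids.add(pid)
                new_ids += 1
            all_rows.append(r)

        if new_ids == 0:
            break  # page only repeated listings we already have
        polite_sleep(i)

    return all_rows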
Step 4: Export to CSV
import csv


def write_csv(rows: list[dict], path: str) -> None:
    fields = ["post_id", "title", "price", "location", "posted_at", "url"]
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


if __name__ == "__main__":
    # Example: SF Bay Area → bicycles (bia)
    start = "https://sfbay.craigslist.org/search/bia"
    rows = crawl_search(start, max_pages=5)
    print("rows:", len(rows))
    print("sample:", rows[0] if rows else None)
    write_csv(rows, "craigslist_bia_sfbay.csv")
    print("wrote craigslist_bia_sfbay.csv")
Selector rationale + troubleshooting
1) Why support both layouts?
Craigslist has multiple HTML layouts in the wild. Supporting both li.cl-static-search-result (newer) and li.result-row (older) makes your scraper survive transitions.
2) Missing price / location
Not all listings include location or a structured price. Your output should tolerate None.
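If you want numeric prices downstream, a small normalizer helps. This is a sketch (parse_price is our own helper, not part of the scraper above); it simply passes None through:

def parse_price(price: str | None) -> int | None:
    """Turn "$1,250" into 1250; return None for missing or unparseable prices."""
    if not price:
        return None
    digits = re.sub(r"[^\d]", "", price)
    return int(digits) if digits else None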
3) Getting blocked / rate-limited
Be realistic:
- start slow (a few pages)
- add jitter (polite_sleep)
- avoid fetching listing detail pages unless you need them
When your crawl grows (multiple categories × multiple cities), ProxiesAPI can help by stabilizing the fetch layer.
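As a sketch of what that larger crawl might look like (the city subdomains and category slugs below are just the examples used in this guide):

CITIES = ["sfbay", "newyork"]   # Craigslist city subdomains
CATEGORIES = ["bia"]            # search slugs, e.g. bicycles for sale


def crawl_many(max_pages: int = 3) -> list[dict]:
    all_rows: list[dict] = []
    for city in CITIES:
        for cat in CATEGORIES:
            start = f"https://{city}.craigslist.org/search/{cat}"
            all_rows.extend(crawl_search(start, max_pages=max_pages))
            time.sleep(5)  # extra pause between city/category crawls
    return all_rows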
Where ProxiesAPI fits (honestly)
Craigslist often works without proxies for small crawls.
But scrapers fail in production due to:
- request bursts (pagination across many categories)
- regional routing differences
- IP-based throttling
- transient network errors
A proxy API like ProxiesAPI helps you make the network layer more resilient so your code spends less time on retries.
QA checklist
- Scraper returns non-zero rows for a known category
- URLs are absolute and include the numeric post id
- Dedupe keeps only unique post_id values
- CSV opens cleanly in Excel/Google Sheets
- Crawl stops when there’s no next page
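These checks are easy to encode as a small smoke test (a sketch; run it against a short crawl before scaling up):

def smoke_test(rows: list[dict]) -> None:
    assert rows, "expected non-zero rows for a known category"
    ids = [r["post_id"] for r in rows if r.get("post_id")]
    assert len(ids) == len(set(ids)), "dedupe should keep only unique post_ids"
    assert all((r.get("url") or "").startswith("https://") for r in rows), "URLs should be absolute"


rows = crawl_search("https://sfbay.craigslist.org/search/bia", max_pages=2)
smoke_test(rows)
print("smoke test passed:", len(rows), "rows")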