How to Scrape Business Reviews from Yelp (Python + ProxiesAPI)

Yelp is a goldmine of real-world signals: reviews, ratings, categories, and business metadata.

In this tutorial you’ll build a practical Yelp scraper in Python that:

  • crawls a Yelp search page (e.g. “pizza in San Francisco”)
  • extracts business URLs from the results
  • visits each business page and collects review snippets + key fields
  • writes outputs to JSON + CSV

(Screenshot: Yelp search results page, showing the business cards and business URLs we'll scrape)
Make review crawls more reliable with ProxiesAPI

Yelp pages are heavier than hobby targets. When you add pagination, detail pages, and retries, ProxiesAPI helps stabilize your fetch layer so your scraper can keep moving.


Important note (read this once)

Yelp actively defends against automated scraping. The goal of this guide is to show clean parsing + resilient crawling patterns (timeouts, retries, pacing, and stable selectors).

Be respectful:

  • follow the site’s Terms of Service
  • keep request volume low
  • add delays and caching
  • don’t scrape personal data

What we’re scraping (URLs and structure)

We’ll use two page types:

  1. Search results

Example:

  • https://www.yelp.com/search?find_desc=pizza&find_loc=San%20Francisco%2C%20CA

On this page, the results contain links that look like:

  • /biz/tonys-pizza-napoletana-san-francisco?osq=pizza

  2. Business pages

Example:

  • https://www.yelp.com/biz/tonys-pizza-napoletana-san-francisco

On business pages we’ll extract:

  • business name
  • rating (when present)
  • review count (when present)
  • a few recent review snippets (text + rating + date when available)

Because Yelp can change its markup, we’ll use multiple fallbacks in our selectors and always sanity-check the output.
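One way to encode “multiple fallbacks” is a small helper that tries selectors in priority order. This is a sketch: `select_first` and the `data-testid` selector are illustrative, not real Yelp markup, and `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup


def select_first(soup, selectors):
    """Return the first element matched by any selector, trying them in order."""
    for sel in selectors:
        el = soup.select_one(sel)
        if el is not None:
            return el
    return None


soup = BeautifulSoup("<div><h1>Tonys Pizza Napoletana</h1></div>", "html.parser")

# Prefer a (hypothetical) specific hook, then fall back to the bare <h1>
el = select_first(soup, ['h1[data-testid="biz-name"]', "h1"])
print(el.get_text())  # Tonys Pizza Napoletana
```

If Yelp renames a class or attribute, the parse degrades to the broader selector instead of returning nothing.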


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for robust parsing

Step 1: Fetching HTML with retries + timeouts

A “real” scraper does not hang forever or crash on one transient 503.

import random
import time
from typing import Optional

import requests

TIMEOUT = (10, 30)  # connect, read

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    )
})


def fetch_html(url: str, *, max_retries: int = 4, backoff_base: float = 1.5) -> str:
    """Fetch HTML with retries and jitter.

    Raises requests.HTTPError for non-retriable errors.
    """
    last_err: Optional[Exception] = None

    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, timeout=TIMEOUT)

            # Retry a few common transient status codes
            if r.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"{r.status_code} transient", response=r)

            r.raise_for_status()
            return r.text

        except requests.HTTPError as e:
            # Fail fast on non-transient HTTP errors (e.g. 404) instead of retrying
            status = e.response.status_code if e.response is not None else None
            if status not in (429, 500, 502, 503, 504):
                raise
            last_err = e
        except requests.RequestException as e:
            last_err = e

        # exponential backoff with jitter (no sleep after the final attempt)
        if attempt < max_retries:
            time.sleep((backoff_base ** attempt) + random.random())

    raise RuntimeError(f"Failed to fetch after {max_retries} attempts: {url} ({last_err})")
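The same retry-plus-backoff skeleton works for any flaky callable. Here is a generic version (my own helper, not part of requests) with an injectable `sleep` so it can be exercised without waiting:

```python
import random
import time


def retry(fn, *, max_retries=4, backoff_base=1.5, sleep=time.sleep):
    """Retry a zero-argument callable with exponential backoff + jitter.

    `sleep` is injectable so tests can run instantly.
    """
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as e:
            last_err = e
            if attempt < max_retries:  # no point sleeping after the final failure
                sleep((backoff_base ** attempt) + random.random())
    raise RuntimeError(f"failed after {max_retries} attempts") from last_err


# Succeeds on the third call:
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(retry(flaky, sleep=lambda s: None))  # ok
```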

Step 2: Fetch via ProxiesAPI (drop-in)

ProxiesAPI lets you request a URL through their proxy layer.

cURL example

API_KEY="YOUR_API_KEY"
TARGET="https://www.yelp.com/search?find_desc=pizza&find_loc=San%20Francisco%2C%20CA"

curl "http://api.proxiesapi.com/?key=${API_KEY}&url=${TARGET}" | head -n 20

Python helper

import os
import urllib.parse

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "YOUR_API_KEY")


def proxiesapi_url(target_url: str) -> str:
    return "http://api.proxiesapi.com/?" + urllib.parse.urlencode({
        "key": PROXIESAPI_KEY,
        "url": target_url,
    })


def fetch_html_via_proxiesapi(target_url: str) -> str:
    gateway = proxiesapi_url(target_url)
    return fetch_html(gateway)

From this point onward, you can swap fetch_html(url) for fetch_html_via_proxiesapi(url) anywhere in the scraper.
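If you want the swap to be a one-line config change rather than an edit, a small factory works. The `USE_PROXIESAPI` env var and `make_fetcher` helper are my own convention, not a ProxiesAPI feature:

```python
import os


def make_fetcher(direct_fetch, proxied_fetch):
    """Pick the fetch layer once at startup.

    USE_PROXIESAPI=1 routes through the proxy; anything else goes direct.
    """
    return proxied_fetch if os.getenv("USE_PROXIESAPI") == "1" else direct_fetch


# Stand-in fetchers so this runs offline:
fetch = make_fetcher(lambda u: "direct:" + u, lambda u: "proxied:" + u)
print(fetch("https://example.com/"))
```

In the real scraper you would pass `fetch_html` and `fetch_html_via_proxiesapi` and call `fetch(url)` everywhere.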


Step 3: Parse business URLs from a Yelp search page

On the results page, we want to capture /biz/... links and dedupe them.

from bs4 import BeautifulSoup

BASE = "https://www.yelp.com"


def parse_search_business_urls(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")

    urls: list[str] = []
    seen = set()

    # Most organic results include /biz/ links.
    # We avoid /adredir and other redirectors.
    for a in soup.select('a[href^="/biz/"]'):
        href = a.get("href")
        if not href:
            continue

        # Strip querystring to normalize
        biz_path = href.split("?")[0]
        full = BASE + biz_path

        if full in seen:
            continue
        seen.add(full)
        urls.append(full)

    return urls

Quick sanity check

search_url = "https://www.yelp.com/search?find_desc=pizza&find_loc=San%20Francisco%2C%20CA"
html = fetch_html_via_proxiesapi(search_url)

biz_urls = parse_search_business_urls(html)
print("businesses:", len(biz_urls))
print(biz_urls[:5])

Typical output:

businesses: 10
['https://www.yelp.com/biz/tonys-pizza-napoletana-san-francisco', ...]
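You can also sanity-check the parser completely offline against canned HTML. The sample markup below is fabricated, and `html.parser` stands in for lxml so the snippet has no extra dependency; the logic mirrors parse_search_business_urls above:

```python
from bs4 import BeautifulSoup

BASE = "https://www.yelp.com"


def parse_search_business_urls(html):
    soup = BeautifulSoup(html, "html.parser")
    urls, seen = [], set()
    for a in soup.select('a[href^="/biz/"]'):
        href = a.get("href")
        if not href:
            continue
        full = BASE + href.split("?")[0]  # strip querystring to normalize
        if full not in seen:
            seen.add(full)
            urls.append(full)
    return urls


sample = """
<a href="/biz/tonys-pizza-napoletana-san-francisco?osq=pizza">Tonys</a>
<a href="/biz/tonys-pizza-napoletana-san-francisco">Tonys again</a>
<a href="/adredir?foo=bar">Ad</a>
<a href="/biz/golden-boy-pizza-san-francisco">Golden Boy</a>
"""
print(parse_search_business_urls(sample))
# ['https://www.yelp.com/biz/tonys-pizza-napoletana-san-francisco',
#  'https://www.yelp.com/biz/golden-boy-pizza-san-francisco']
```

Note that the duplicate link collapses to one URL and the /adredir link is never matched.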

Step 4: Parse business details + review snippets

Yelp markup can vary, so we’ll prioritize:

  • the canonical business name (from the page <h1>)
  • rating + review count when present
  • a handful of review snippets (best-effort)

import re


def text_or_none(el):
    return el.get_text(" ", strip=True) if el else None


def parse_float(text: str):
    m = re.search(r"(\d+(?:\.\d+)?)", text or "")
    return float(m.group(1)) if m else None


def parse_int(text: str):
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def parse_business_page(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    name = text_or_none(soup.select_one("h1"))

    # Rating is often embedded in aria-label text like "4.2 star rating"
    rating = None
    rating_el = soup.select_one('[aria-label$="star rating"], [aria-label*="star rating"]')
    if rating_el:
        rating = parse_float(rating_el.get("aria-label", ""))

    # Review count is often near "reviews" text
    review_count = None
    for el in soup.select("span, a"):
        t = el.get_text(" ", strip=True)
        if t and "review" in t.lower():
            # Matches "8.6k reviews" or "459 reviews"
            if re.search(r"\d", t):
                # handle k shorthand
                t2 = t.lower().replace(",", "")
                mk = re.search(r"(\d+(?:\.\d+)?)\s*k\s+reviews", t2)
                if mk:
                    review_count = int(float(mk.group(1)) * 1000)
                    break

                mi = re.search(r"(\d+)\s+reviews", t2)
                if mi:
                    review_count = int(mi.group(1))
                    break

    # Review snippets: Yelp can render these in several ways.
    # We'll look for common containers that include readable text.
    snippets: list[dict] = []

    # Try common review text spans
    for block in soup.select("span[lang]"):
        txt = block.get_text(" ", strip=True)
        if not txt:
            continue
        # Heuristic: review snippets are usually longer than navigation labels.
        if len(txt) < 60:
            continue
        snippets.append({"text": txt})
        if len(snippets) >= 5:
            break

    return {
        "url": url,
        "name": name,
        "rating": rating,
        "review_count": review_count,
        "review_snippets": snippets,
    }

Terminal run (typical)

u = biz_urls[0]
page = fetch_html_via_proxiesapi(u)
data = parse_business_page(page, u)
print(data["name"], data["rating"], data["review_count"])
print("snippets:", len(data["review_snippets"]))
print(data["review_snippets"][0]["text"][:120])

Example output:

Tony’s Pizza Napoletana 4.2 8600
snippets: 5
The pizza was so good that I would definitely wait and be happy just to have the delicious pizza...
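The review-count shorthand logic is worth pulling out into a testable helper. This is a refactor of the inline regexes above into one function:

```python
import re


def parse_review_count(text):
    """Parse '459 reviews' or '8.6k reviews' into an int; None if no match."""
    t = (text or "").lower().replace(",", "")
    mk = re.search(r"(\d+(?:\.\d+)?)\s*k\s+reviews", t)
    if mk:
        return int(float(mk.group(1)) * 1000)  # expand the 'k' shorthand
    mi = re.search(r"(\d+)\s+reviews", t)
    if mi:
        return int(mi.group(1))
    return None


print(parse_review_count("8.6k reviews"))   # 8600
print(parse_review_count("1,459 reviews"))  # 1459
print(parse_review_count("No reviews"))     # None
```

Extracting it this way lets you cover the edge cases (comma separators, the "k" suffix, no digits at all) without fetching any pages.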

Step 5: Add pagination to your search crawl

Yelp search uses a start= query parameter.

Example:

  • page 1: .../search?...
  • page 2: .../search?...&start=10
  • page 3: .../search?...&start=20

We’ll crawl N pages and collect business URLs.

from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def with_start_param(url: str, start: int) -> str:
    u = urlparse(url)
    q = parse_qs(u.query)
    q["start"] = [str(start)]
    new_query = urlencode(q, doseq=True)
    return urlunparse((u.scheme, u.netloc, u.path, u.params, new_query, u.fragment))


def crawl_search(search_url: str, pages: int = 3) -> list[str]:
    all_urls: list[str] = []
    seen = set()

    for i in range(pages):
        start = i * 10
        url = with_start_param(search_url, start)
        html = fetch_html_via_proxiesapi(url)
        batch = parse_search_business_urls(html)

        for b in batch:
            if b in seen:
                continue
            seen.add(b)
            all_urls.append(b)

        print("page", i + 1, "start", start, "batch", len(batch), "total", len(all_urls))

        # polite delay
        time.sleep(2 + random.random())

    return all_urls
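with_start_param can be checked offline too. One caveat: urlencode re-encodes the query, so %20 becomes + (the URL is equivalent, just encoded differently), and an existing start value is replaced rather than appended:

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def with_start_param(url: str, start: int) -> str:
    u = urlparse(url)
    q = parse_qs(u.query)
    q["start"] = [str(start)]  # overwrite, never duplicate
    return urlunparse((u.scheme, u.netloc, u.path, u.params,
                       urlencode(q, doseq=True), u.fragment))


url = "https://www.yelp.com/search?find_desc=pizza&find_loc=San%20Francisco%2C%20CA"
print(with_start_param(url, 10))
# https://www.yelp.com/search?find_desc=pizza&find_loc=San+Francisco%2C+CA&start=10

print(with_start_param(with_start_param(url, 10), 20))  # start replaced, not appended
```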

Step 6: Full scraper (search → businesses → export)

import csv
import json


def scrape_yelp(search_url: str, pages: int = 2, max_businesses: int = 20) -> list[dict]:
    biz_urls = crawl_search(search_url, pages=pages)
    biz_urls = biz_urls[:max_businesses]

    out: list[dict] = []

    for idx, url in enumerate(biz_urls, start=1):
        print(f"[{idx}/{len(biz_urls)}] business", url)
        html = fetch_html_via_proxiesapi(url)
        out.append(parse_business_page(html, url))

        time.sleep(2 + random.random())

    return out


def export_json(path: str, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def export_csv(path: str, rows: list[dict]) -> None:
    # Flatten snippets into a single field for CSV
    flat = []
    for r in rows:
        snippets = " | ".join([s.get("text", "") for s in r.get("review_snippets", [])])
        flat.append({
            "url": r.get("url"),
            "name": r.get("name"),
            "rating": r.get("rating"),
            "review_count": r.get("review_count"),
            "review_snippets": snippets,
        })

    fieldnames = ["url", "name", "rating", "review_count", "review_snippets"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(flat)


if __name__ == "__main__":
    search_url = "https://www.yelp.com/search?find_desc=pizza&find_loc=San%20Francisco%2C%20CA"
    rows = scrape_yelp(search_url, pages=2, max_businesses=15)

    export_json("yelp_businesses.json", rows)
    export_csv("yelp_businesses.csv", rows)

    print("done", len(rows))

QA checklist

  • Search parsing returns a realistic number of /biz/ URLs
  • Business page parsing yields a non-empty name for most rows
  • Review snippet heuristic returns at least 1 long snippet for some pages
  • Export files open cleanly
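Parts of this checklist can be automated. A best-effort sketch (qa_check and its thresholds are my own, tune them to your run):

```python
import csv
import json


def qa_check(json_path: str, csv_path: str) -> None:
    """Best-effort automated pass over the exported files."""
    with open(json_path, encoding="utf-8") as f:
        rows = json.load(f)
    assert rows, "no rows scraped"

    named = sum(1 for r in rows if r.get("name"))
    assert named >= len(rows) / 2, "too many rows missing a business name"

    with_snippets = sum(1 for r in rows if r.get("review_snippets"))

    # The CSV should open cleanly and mirror the JSON row count
    with open(csv_path, newline="", encoding="utf-8") as f:
        csv_rows = list(csv.DictReader(f))
    assert len(csv_rows) == len(rows), "CSV/JSON row count mismatch"

    print(f"QA OK: {len(rows)} rows, {named} named, {with_snippets} with snippets")
```

Run it right after the exports, e.g. `qa_check("yelp_businesses.json", "yelp_businesses.csv")`.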

Where ProxiesAPI fits (honestly)

Yelp can be inconsistent for direct scraping: transient blocks, heavy markup, and frequent changes.

ProxiesAPI doesn’t “magically” solve everything — but it helps your crawler keep moving by stabilizing the request layer as you add:

  • pagination
  • more business pages
  • retries
  • long-running jobs

If you want to go further, the next upgrades are caching (to avoid refetching), a URL queue, and storing results in SQLite for incremental refreshes.


Related guides

  • How to Scrape Apartment Listings from Apartments.com (Python + ProxiesAPI): scrape Apartments.com listing cards and detail-page fields with Python, with pagination, resilient parsing, retries, and clean JSON/CSV exports.
  • Build a Job Board with Data from Indeed (Python scraper tutorial): scrape Indeed job listings (title, company, location, salary, summary) with requests + BeautifulSoup, then save a clean dataset you can render as a simple job board.
  • Scrape IMDb Top 250 Movies into a Dataset: pull rank, title, year, rating, and votes into clean CSV/JSON for analysis with working Python code.
  • How to Scrape Trustpilot Reviews for Any Company: pull ratings, dates, reviewer names, and review text into a clean CSV for reputation monitoring.