Scrape TripAdvisor Hotel Reviews with Python (Pagination + Rate Limits)

TripAdvisor is one of the best (and trickiest) sources of hotel reviews:

  • lots of structured fields (rating, date, reviewer, trip type)
  • consistent URL patterns for review pages
  • aggressive anti-bot behavior if you hammer it from one IP

In this tutorial we’ll build a practical Python scraper that:

  1. fetches a hotel’s review pages using a safe network layer (timeouts, retries)
  2. parses real review cards into structured data
  3. paginates across multiple review pages
  4. exports clean JSON
  5. shows where ProxiesAPI fits (without overclaiming)

TripAdvisor hotel reviews page (we’ll scrape review cards + rating + date)
Make review pagination more reliable with ProxiesAPI

TripAdvisor pages can rate-limit repeated requests from a single IP. ProxiesAPI gives you a proxy-backed fetch URL so your crawler can retry and paginate with fewer sudden blocks.


Important note (structure changes + access)

TripAdvisor is a heavily defended site. Expect changes:

  • CSS classes can shift
  • some content may be rendered via JS depending on locale/AB tests
  • responses can include bot checks or consent flows

That’s why we’ll:

  • avoid brittle selectors when possible
  • parse by semantic anchors (ARIA labels, stable attributes)
  • add detection for “blocked” HTML

This guide focuses on HTML parsing (not browser automation). If your target hotel pages are JS-only in your region, you’ll need a headless browser pipeline.


What we’re scraping (TripAdvisor review fields)

On a typical hotel review page, each review card contains:

  • reviewer name
  • review title
  • review text
  • bubble rating (1–5)
  • published date
  • optional metadata: trip type, room tip, helpful votes, etc.

We’ll extract a normalized subset:

  • review_id
  • reviewer
  • rating
  • date
  • title
  • text
  • url
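
As a concrete target, a single exported record will look roughly like this (every value below is invented for illustration, not real TripAdvisor data):

```python
# Illustrative record shape; the values are made up, only the keys matter.
sample_review = {
    "review_id": "123456789",
    "reviewer": "traveler_jane",
    "rating": 4,
    "date": "2024-03-15",
    "title": "Great location, small rooms",
    "text": "We stayed three nights and loved the neighborhood...",
    "url": "https://www.tripadvisor.com/Hotel_Review-g60763-d93359-Reviews-Hotel_Name-New_York_City_New_York.html",
}

print(sorted(sample_review.keys()))
```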

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: A resilient fetch layer (with optional ProxiesAPI)

Two rules for scraping defended sites:

  1. Always use timeouts.
  2. Retry deliberately (backoff + jitter) and treat blocks as data.

Below is a minimal fetch_html() that supports:

  • normal direct requests
  • ProxiesAPI-backed requests (by turning the target URL into a proxy “fetch URL”)

Configure ProxiesAPI

Set an environment variable with your ProxiesAPI API key:

export PROXIESAPI_KEY="YOUR_KEY"

Fetch code

import os
import random
import time
import urllib.parse

import requests

TIMEOUT = (10, 35)  # connect, read


def build_proxiesapi_url(target_url: str) -> str:
    """Build a ProxiesAPI fetch URL.

    Note: Parameter names can vary by provider plan.
    If your ProxiesAPI account uses different params, adjust here.
    """
    key = os.environ.get("PROXIESAPI_KEY")
    if not key:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")

    # Common pattern: https://api.proxiesapi.com/?auth_key=...&url=...
    return "https://api.proxiesapi.com/?" + urllib.parse.urlencode({
        "auth_key": key,
        "url": target_url,
    })


def is_likely_blocked(html: str) -> bool:
    h = (html or "").lower()
    return any(s in h for s in [
        "captcha",
        "are you a human",
        "robot",
        "access denied",
        "unusual traffic",
        "verify you are",
    ])


def fetch_html(url: str, *, use_proxiesapi: bool = True, session: requests.Session | None = None) -> str:
    s = session or requests.Session()

    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }

    fetch_url = build_proxiesapi_url(url) if use_proxiesapi else url

    last_err = None
    for attempt in range(1, 6):
        try:
            r = s.get(fetch_url, headers=headers, timeout=TIMEOUT)
            r.raise_for_status()
            html = r.text

            if is_likely_blocked(html):
                raise RuntimeError("Blocked page detected (captcha/bot page)")

            return html
        except Exception as e:
            last_err = e
            # exponential backoff with jitter
            sleep_s = min(2 ** attempt, 20) + random.random()
            time.sleep(sleep_s)

    raise RuntimeError(f"Failed to fetch after retries: {last_err}")

Why this matters: when you paginate reviews, you’ll do many requests. Without timeouts/retries, a single slow response can hang your whole run.
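
You can inspect the backoff schedule from fetch_html in isolation; this prints the sleep each attempt would incur, with the random seed fixed so the jitter is reproducible:

```python
import random

random.seed(42)  # fixed seed so the jitter values are reproducible

sleeps = []
for attempt in range(1, 6):
    # same formula as fetch_html: capped exponential backoff plus jitter
    sleep_s = min(2 ** attempt, 20) + random.random()
    sleeps.append(sleep_s)
    print(f"attempt {attempt}: would sleep {sleep_s:.2f}s")
```

Note the cap: without min(..., 20), attempt 5 would wait 32+ seconds, which is usually longer than you want on a defended site.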


Step 2: Find the hotel review URL pattern

TripAdvisor hotel pages often look like:

https://www.tripadvisor.com/Hotel_Review-g60763-d93359-Reviews-Hotel_Name-New_York_City_New_York.html

Review pagination is typically encoded in the path, commonly with an or{offset} segment. For example:

  • page 1: ...-Reviews-...html
  • page 2 (offset 5/10): ...-Reviews-or5-...html or ...-Reviews-or10-...html

The exact offset step depends on how many reviews the page shows.

We’ll implement pagination by:

  • scraping the first page
  • generating the next-page URL by inserting or{offset} into the path

Step 3: Parse review cards (selectors that survive)

TripAdvisor markup changes, so prefer:

  • data-* attributes when present
  • aria-label patterns for rating
  • avoiding long chains of classes

Here’s a parser that looks for “review containers” and extracts a stable subset.

import re
from bs4 import BeautifulSoup


def clean_text(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())


def parse_rating_from_aria(el) -> int | None:
    if not el:
        return None
    aria = (el.get("aria-label") or "").lower()
    # patterns like "5 of 5 bubbles" or "4.0 of 5 bubbles"
    m = re.search(r"(\d+(?:\.\d+)?)\s+of\s+5", aria)
    if not m:
        return None
    return int(float(m.group(1)))


def parse_reviews(html: str, page_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Strategy: find review cards by locating elements that carry a rating aria-label.
    cards = []
    for rating in soup.select('[aria-label*="of 5 bubbles"], [aria-label*="of 5 bubble"]'):
        # The bubble rating sits inside the review card; walk up from it until we
        # reach a container that also holds a title block. (Accepting the rating
        # element's own nearest div would give us a tiny fragment, not the card.)
        card = rating.parent
        for _ in range(6):
            if card is None:
                break
            if card.name in ("div", "article") and card.find(["h2", "h3"]):
                cards.append(card)
                break
            card = card.parent

    # de-duplicate by object id
    uniq = []
    seen = set()
    for c in cards:
        k = id(c)
        if k in seen:
            continue
        seen.add(k)
        uniq.append(c)

    out = []
    for card in uniq:
        rating_el = card.select_one('[aria-label*="of 5 bubbles"], [aria-label*="of 5 bubble"]')
        rating = parse_rating_from_aria(rating_el)

        # title and text: try a few common patterns
        title_el = card.find(["h3", "h2"])
        title = clean_text(title_el.get_text(" ", strip=True)) if title_el else None

        # review text: prefer a <q> quote block; otherwise fall back to the longest
        # <span>, since the first span in the card is usually the rating widget
        text_el = card.find("q")
        if text_el is None:
            text_el = max(card.find_all("span"),
                          key=lambda el: len(el.get_text(strip=True)),
                          default=None)
        text = clean_text(text_el.get_text(" ", strip=True)) if text_el else ""

        # date: look for a <time> or date-like text
        time_el = card.find("time")
        date = clean_text(time_el.get("datetime") or time_el.get_text(" ", strip=True)) if time_el else None

        reviewer = None
        # reviewer names are usually profile links near the top of the card; a bare
        # "span" fallback would match the rating widget first, so we avoid it
        reviewer_el = card.select_one("a[href*='Profile'], a[href*='member']")
        if reviewer_el:
            reviewer = clean_text(reviewer_el.get_text(" ", strip=True))

        # Try to get a review id if present
        review_id = None
        for attr in ("data-reviewid", "data-review-id", "id"):
            if card.has_attr(attr):
                review_id = str(card.get(attr))
                break

        # basic sanity: skip tiny/garbage
        if rating is None and len(text) < 40:
            continue

        out.append({
            "review_id": review_id,
            "reviewer": reviewer,
            "rating": rating,
            "date": date,
            "title": title,
            "text": text,
            "url": page_url,
        })

    return out

Selector note: This parser is deliberately heuristic. On defended sites, one perfect selector is a myth; you want a strategy that fails gracefully and is easy to tweak.


Step 4: Paginate reviews

Let’s implement TripAdvisor-style offsets. Many listings show 5–10 reviews per page. We’ll make the step configurable.

import random
import re
import time

import requests

# fetch_html (Step 1) and parse_reviews (Step 3) are assumed to be in scope


def insert_offset(url: str, offset: int) -> str:
    # Insert -Reviews-or{offset}- after "-Reviews-" if not present.
    if "-Reviews-" not in url:
        return url

    # If URL already has -Reviews-orNN-
    if re.search(r"-Reviews-or\d+-", url):
        return re.sub(r"-Reviews-or\d+-", f"-Reviews-or{offset}-", url)

    return url.replace("-Reviews-", f"-Reviews-or{offset}-", 1)


def crawl_reviews(start_url: str, pages: int = 3, page_step: int = 10, use_proxiesapi: bool = True) -> list[dict]:
    s = requests.Session()
    all_reviews: list[dict] = []

    for i in range(pages):
        offset = i * page_step
        url = start_url if offset == 0 else insert_offset(start_url, offset)

        html = fetch_html(url, use_proxiesapi=use_proxiesapi, session=s)
        batch = parse_reviews(html, url)

        print(f"page {i+1}/{pages} -> {len(batch)} reviews")
        all_reviews.extend(batch)

        # polite delay (especially when not using proxies)
        time.sleep(1.0 + random.random())

    return all_reviews
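
A quick sanity check of the offset logic (the function is restated here so the snippet runs standalone; the hotel URL is a placeholder):

```python
import re

def insert_offset(url: str, offset: int) -> str:
    # same logic as Step 4's insert_offset
    if "-Reviews-" not in url:
        return url
    if re.search(r"-Reviews-or\d+-", url):
        return re.sub(r"-Reviews-or\d+-", f"-Reviews-or{offset}-", url)
    return url.replace("-Reviews-", f"-Reviews-or{offset}-", 1)

u = "https://www.tripadvisor.com/Hotel_Review-g1-d2-Reviews-Some_Hotel-Some_City.html"
page2 = insert_offset(u, 10)
page3 = insert_offset(page2, 20)  # an existing orNN segment is replaced, not stacked
print(page2)
print(page3)
```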

Run it

import json

START = "https://www.tripadvisor.com/Hotel_Review-REPLACE_WITH_REAL_HOTEL_URL.html"

reviews = crawl_reviews(START, pages=5, page_step=10, use_proxiesapi=True)
print("total", len(reviews))

with open("tripadvisor_reviews.json", "w", encoding="utf-8") as f:
    json.dump(reviews, f, ensure_ascii=False, indent=2)

print("wrote tripadvisor_reviews.json")

Troubleshooting (what actually breaks)

1) You get a bot page / captcha

  • Reduce request rate (increase delays)
  • Add retries (already included)
  • Use ProxiesAPI (proxy-backed fetch)
  • Rotate user agents cautiously (don’t create an obvious “UA roulette”)
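
If you save failing responses to disk, you can run them through the same block heuristic used in Step 1 (restated here so the snippet runs standalone):

```python
def is_likely_blocked(html: str) -> bool:
    # same heuristic as Step 1's is_likely_blocked
    h = (html or "").lower()
    return any(s in h for s in [
        "captcha", "are you a human", "robot",
        "access denied", "unusual traffic", "verify you are",
    ])

print(is_likely_blocked("<html><title>Access Denied</title></html>"))       # True
print(is_likely_blocked("<html><div>Lovely stay, great staff.</div></html>"))  # False
```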

2) You get empty reviews

  • Print html[:500] and confirm you got the actual page
  • Inspect the page HTML for a stable hook (e.g., aria-label, data-test-target)
  • Adjust parse_reviews() selectors to your current markup

3) Pagination doesn’t change content

  • Confirm the URL offset pattern for your specific hotel page
  • Some pages require a consistent locale/currency; keep the Accept-Language header (set in Step 1) and any locale query params stable across requests

Where ProxiesAPI fits (honestly)

TripAdvisor can block repeated requests from one IP.

ProxiesAPI does not “solve scraping” — you still need:

  • good parsing logic
  • polite request pacing
  • retries + timeouts

But it does give you a simpler way to:

  • route requests through proxies
  • reduce “one-IP” rate limiting failures
  • keep long pagination crawls running

QA checklist

  • You can fetch page 1 HTML consistently
  • You can extract at least 5–10 reviews from page 1
  • Pagination changes results (offset pages are different)
  • Exported JSON has rating, date, text
  • You detect blocks and retry instead of silently writing empty data
