Scrape TripAdvisor Hotel Reviews with Python (Pagination + Rate Limits)
TripAdvisor is one of the best (and trickiest) sources of hotel reviews:
- lots of structured fields (rating, date, reviewer, trip type)
- consistent URL patterns for review pages
- aggressive anti-bot behavior if you hammer it from one IP
In this tutorial we’ll build a practical Python scraper that:
- fetches a hotel’s review pages using a safe network layer (timeouts, retries)
- parses real review cards into structured data
- paginates across multiple review pages
- exports clean JSON
- shows where ProxiesAPI fits (without overclaiming)
TripAdvisor pages can rate-limit repeated requests from a single IP. ProxiesAPI gives you a proxy-backed fetch URL so your crawler can retry and paginate with fewer sudden blocks.
Important note (structure changes + access)
TripAdvisor is a heavily defended site. Expect changes:
- CSS classes can shift
- some content may be rendered via JS depending on locale/AB tests
- responses can include bot checks or consent flows
That’s why we’ll:
- avoid brittle selectors when possible
- parse by semantic anchors (ARIA labels, stable attributes)
- add detection for “blocked” HTML
This guide focuses on HTML parsing (not browser automation). If your target hotel pages are JS-only in your region, you’ll need a headless browser pipeline.
What we’re scraping (TripAdvisor review fields)
On a typical hotel review page, each review card contains:
- reviewer name
- review title
- review text
- bubble rating (1–5)
- published date
- optional metadata: trip type, room tip, helpful votes, etc.
We’ll extract a normalized subset:
`review_id`, `reviewer`, `rating`, `date`, `title`, `text`, `url`
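For reference, here is what one normalized record could look like. This is an illustrative sketch only; every value below is made up:

```python
# One normalized review record (all values are illustrative).
example_review = {
    "review_id": "rev123456",   # site-provided id when present, else None
    "reviewer": "traveler_jane",
    "rating": 4,                # bubble rating, 1-5, or None
    "date": "2024-03-18",
    "title": "Great location, small rooms",
    "text": "We stayed three nights and would come back...",
    "url": "https://www.tripadvisor.com/Hotel_Review-...-Reviews-....html",
}
```

Keeping the schema this flat makes the JSON export trivial to load into pandas or a database later.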
Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
Step 1: A resilient fetch layer (with optional ProxiesAPI)
Two rules for scraping defended sites:
- Always use timeouts.
- Retry deliberately (backoff + jitter) and treat blocks as data.
Below is a minimal fetch_html() that supports:
- normal direct requests
- ProxiesAPI-backed requests (by turning the target URL into a proxy “fetch URL”)
Configure ProxiesAPI
Set an environment variable with your ProxiesAPI API key:
```bash
export PROXIESAPI_KEY="YOUR_KEY"
```
Fetch code
```python
import os
import random
import time
import urllib.parse

import requests

TIMEOUT = (10, 35)  # connect, read


def build_proxiesapi_url(target_url: str) -> str:
    """Build a ProxiesAPI fetch URL.

    Note: Parameter names can vary by provider plan.
    If your ProxiesAPI account uses different params, adjust here.
    """
    key = os.environ.get("PROXIESAPI_KEY")
    if not key:
        raise RuntimeError("Missing PROXIESAPI_KEY env var")
    # Common pattern: https://api.proxiesapi.com/?auth_key=...&url=...
    return "https://api.proxiesapi.com/?" + urllib.parse.urlencode({
        "auth_key": key,
        "url": target_url,
    })


def is_likely_blocked(html: str) -> bool:
    h = (html or "").lower()
    return any(s in h for s in [
        "captcha",
        "are you a human",
        "robot",
        "access denied",
        "unusual traffic",
        "verify you are",
    ])


def fetch_html(url: str, *, use_proxiesapi: bool = True, session: requests.Session | None = None) -> str:
    s = session or requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    fetch_url = build_proxiesapi_url(url) if use_proxiesapi else url
    last_err = None
    for attempt in range(1, 6):
        try:
            r = s.get(fetch_url, headers=headers, timeout=TIMEOUT)
            r.raise_for_status()
            html = r.text
            if is_likely_blocked(html):
                raise RuntimeError("Blocked page detected (captcha/bot page)")
            return html
        except Exception as e:
            last_err = e
            # exponential backoff with jitter
            sleep_s = min(2 ** attempt, 20) + random.random()
            time.sleep(sleep_s)
    raise RuntimeError(f"Failed to fetch after retries: {last_err}")
```
Why this matters: when you paginate reviews, you’ll do many requests. Without timeouts/retries, a single slow response can hang your whole run.
Step 2: Find the hotel review URL pattern
TripAdvisor hotel pages often look like:
https://www.tripadvisor.com/Hotel_Review-g60763-d93359-Reviews-Hotel_Name-New_York_City_New_York.html
Review pagination is typically encoded in the path, commonly with an `or{offset}` segment. For example:
- page 1: `...-Reviews-...html`
- page 2 (offset 5 or 10): `...-Reviews-or5-...html` or `...-Reviews-or10-...html`
The exact offset step depends on how many reviews the page shows.
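One way to confirm the step for your listing is to fetch a couple of candidate offset pages and compare their review IDs (or texts) against page 1: the step that shares the fewest reviews with page 1 is most likely correct. Here is a sketch of just the comparison logic; `detect_page_step` is a hypothetical helper, and the network fetching is left to `fetch_html`:

```python
def detect_page_step(page1_ids: set[str], candidates: dict[int, set[str]]) -> int:
    """Pick the candidate step whose page overlaps page 1 the least."""
    best_step, best_overlap = None, None
    for step, ids in candidates.items():
        if not ids:
            continue  # empty page: offset past the end, or wrong URL pattern
        overlap = len(page1_ids & ids)
        if best_overlap is None or overlap < best_overlap:
            best_step, best_overlap = step, overlap
    if best_step is None:
        raise ValueError("no candidate offset produced reviews")
    return best_step

# Example: offset 10 yields entirely new reviews; offset 5 half-overlaps page 1.
page1 = {"r1", "r2", "r3", "r4", "r5"}
candidates = {5: {"r4", "r5", "r6"}, 10: {"r6", "r7", "r8"}}
print(detect_page_step(page1, candidates))  # -> 10
```

In practice you would build `page1_ids` and each candidate set from `parse_reviews()` output before committing to a full crawl.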
We’ll implement pagination by:
- scraping the first page
- generating the next-page URL by inserting `or{offset}` into the path
Step 3: Parse review cards (selectors that survive)
TripAdvisor markup changes, so prefer:
- `data-*` attributes when present
- `aria-label` patterns for rating
- avoiding long chains of classes
Here’s a parser that looks for “review containers” and extracts a stable subset.
```python
import re

from bs4 import BeautifulSoup


def clean_text(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())


def parse_rating_from_aria(el) -> int | None:
    if not el:
        return None
    aria = (el.get("aria-label") or "").lower()
    # patterns like "5 of 5 bubbles" or "4.0 of 5 bubbles"
    m = re.search(r"(\d+(?:\.\d+)?)\s+of\s+5", aria)
    if not m:
        return None
    return int(float(m.group(1)))


def parse_reviews(html: str, page_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    # Strategy: find review cards by looking for an element that contains a rating aria-label.
    cards = []
    for rating in soup.select('[aria-label*="of 5 bubbles"], [aria-label*="of 5 bubble"]'):
        # bubble rating is inside a review card; walk up a bit.
        card = rating
        for _ in range(6):
            if not card:
                break
            # heuristic: review cards often contain a "Read more" or title/text blocks
            if card.name in ("div", "article"):
                cards.append(card)
                break
            card = card.parent
    # de-duplicate by object id
    uniq = []
    seen = set()
    for c in cards:
        k = id(c)
        if k in seen:
            continue
        seen.add(k)
        uniq.append(c)
    out = []
    for card in uniq:
        rating_el = card.select_one('[aria-label*="of 5 bubbles"], [aria-label*="of 5 bubble"]')
        rating = parse_rating_from_aria(rating_el)
        # title and text: try a few common patterns
        title_el = card.find(["h3", "h2"])
        title = clean_text(title_el.get_text(" ", strip=True)) if title_el else None
        text_el = card.select_one("span, q, div")
        text = clean_text(text_el.get_text(" ", strip=True)) if text_el else ""
        # date: look for a <time> or date-like text
        time_el = card.find("time")
        date = clean_text(time_el.get("datetime") or time_el.get_text(" ", strip=True)) if time_el else None
        reviewer = None
        # reviewer names are often links near the top of the card
        reviewer_el = card.select_one("a[href*='Profile'], a[href*='member'], span")
        if reviewer_el:
            reviewer = clean_text(reviewer_el.get_text(" ", strip=True))
        # Try to get a review id if present
        review_id = None
        for attr in ("data-reviewid", "data-review-id", "id"):
            if card.has_attr(attr):
                review_id = str(card.get(attr))
                break
        # basic sanity: skip tiny/garbage cards
        if rating is None and len(text) < 40:
            continue
        out.append({
            "review_id": review_id,
            "reviewer": reviewer,
            "rating": rating,
            "date": date,
            "title": title,
            "text": text,
            "url": page_url,
        })
    return out
```
Selector note: This parser is deliberately heuristic. On defended sites, one perfect selector is a myth; you want a strategy that fails gracefully and is easy to tweak.
Step 4: Paginate reviews
Let’s implement TripAdvisor-style offsets. Many listings show 5–10 reviews per page. We’ll make the step configurable.
```python
import random
import re
import time

import requests


def insert_offset(url: str, offset: int) -> str:
    # Insert -Reviews-or{offset}- after "-Reviews-" if not present.
    if "-Reviews-" not in url:
        return url
    # If the URL already has -Reviews-orNN-, replace the existing offset.
    if re.search(r"-Reviews-or\d+-", url):
        return re.sub(r"-Reviews-or\d+-", f"-Reviews-or{offset}-", url)
    return url.replace("-Reviews-", f"-Reviews-or{offset}-", 1)


def crawl_reviews(start_url: str, pages: int = 3, page_step: int = 10, use_proxiesapi: bool = True) -> list[dict]:
    s = requests.Session()
    all_reviews: list[dict] = []
    for i in range(pages):
        offset = i * page_step
        url = start_url if offset == 0 else insert_offset(start_url, offset)
        html = fetch_html(url, use_proxiesapi=use_proxiesapi, session=s)
        batch = parse_reviews(html, url)
        print(f"page {i+1}/{pages} -> {len(batch)} reviews")
        all_reviews.extend(batch)
        # polite delay (especially when not using proxies)
        time.sleep(1.0 + random.random())
    return all_reviews
```
Run it
```python
import json

START = "https://www.tripadvisor.com/Hotel_Review-REPLACE_WITH_REAL_HOTEL_URL.html"

reviews = crawl_reviews(START, pages=5, page_step=10, use_proxiesapi=True)
print("total", len(reviews))

with open("tripadvisor_reviews.json", "w", encoding="utf-8") as f:
    json.dump(reviews, f, ensure_ascii=False, indent=2)
print("wrote tripadvisor_reviews.json")
```
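Long crawls can revisit overlapping offsets, so it's worth a small dedupe pass before writing. A sketch, keying on `review_id` when present and falling back to the review text (`dedupe_reviews` is a helper we're introducing here, not part of the crawler above):

```python
def dedupe_reviews(reviews: list[dict]) -> list[dict]:
    """Drop duplicate reviews, keeping first occurrence."""
    seen, out = set(), []
    for r in reviews:
        # Prefer the site-provided id; fall back to the text as a weak key.
        key = r.get("review_id") or r.get("text")
        if key and key in seen:
            continue
        if key:
            seen.add(key)
        out.append(r)
    return out

rows = [
    {"review_id": "a", "text": "great stay"},
    {"review_id": "a", "text": "great stay"},     # duplicate id, dropped
    {"review_id": None, "text": "nice rooms"},
    {"review_id": None, "text": "nice rooms"},    # duplicate text, dropped
]
print(len(dedupe_reviews(rows)))  # -> 2
```

Run it on `reviews` just before `json.dump` if your offset pages overlap.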
Troubleshooting (what actually breaks)
1) You get a bot page / captcha
- Reduce request rate (increase delays)
- Add retries (already included)
- Use ProxiesAPI (proxy-backed fetch)
- Rotate user agents cautiously (don’t create an obvious “UA roulette”)
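One cautious pattern is to pick a single user agent per crawl session from a small, realistic pool, rather than rotating on every request. A sketch (the pool entries are examples; keep them current and few):

```python
import random

# Small pool of realistic desktop UAs (examples only).
UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
]

def pick_session_ua() -> str:
    # Choose once per crawl session so headers stay consistent across pages.
    return random.choice(UA_POOL)

session_ua = pick_session_ua()
```

Pass `session_ua` into your headers once at session creation; per-request rotation pairs a new UA with the same cookies and IP, which is an obvious bot signature.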
2) You get empty reviews
- Print `html[:500]` and confirm you got the actual page
- Inspect the page HTML for a stable hook (e.g., `aria-label`, `data-test-target`)
- Adjust `parse_reviews()` selectors to your current markup
3) Pagination doesn’t change content
- Confirm the URL offset pattern for your specific hotel page
- Some pages require a consistent locale/currency; add query params or accept-language
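Adding query parameters is easy to get wrong with naive string concatenation. A small stdlib helper that merges parameters cleanly (note: `currency` here is a placeholder; the actual parameter names that pin locale/currency depend on the site and region):

```python
import urllib.parse

def with_query_params(url: str, extra: dict[str, str]) -> str:
    """Merge extra query parameters into a URL, preserving any existing ones."""
    parts = urllib.parse.urlsplit(url)
    query = dict(urllib.parse.parse_qsl(parts.query))
    query.update(extra)
    return urllib.parse.urlunsplit(parts._replace(query=urllib.parse.urlencode(query)))

url = with_query_params(
    "https://www.tripadvisor.com/Hotel_Review-...-Reviews-....html",
    {"currency": "USD"},  # placeholder parameter name
)
print(url)  # query string now contains currency=USD
```

Pair this with a fixed `Accept-Language` header (already set in `fetch_html`) so paginated pages render in one consistent locale.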
Where ProxiesAPI fits (honestly)
TripAdvisor can block repeated requests from one IP.
ProxiesAPI does not “solve scraping” — you still need:
- good parsing logic
- polite request pacing
- retries + timeouts
But it does give you a simpler way to:
- route requests through proxies
- reduce “one-IP” rate limiting failures
- keep long pagination crawls running
QA checklist
- You can fetch page 1 HTML consistently
- You can extract at least 5–10 reviews from page 1
- Pagination changes results (offset pages are different)
- Exported JSON has `rating`, `date`, `text`
- You detect blocks and retry instead of silently writing empty data