Scrape Trustpilot Category Rankings (Top Companies + Ratings) with ProxiesAPI

Trustpilot categories are a goldmine when you’re building:

  • lead lists (top companies in a niche)
  • competitive research (who ranks + how ratings shift)
  • review monitoring pipelines (discover new companies automatically)

In this guide we’ll scrape a Trustpilot category ranking page and extract:

  • company name
  • Trustpilot “review URL” (the /review/… page)
  • company website domain (when shown)
  • rating
  • review count
  • category context (category slug + page number)

Then we’ll export everything to CSV.

Scale category crawls reliably with ProxiesAPI

Trustpilot is quick to rate-limit and block bot-like traffic. ProxiesAPI gives you a stable proxy layer so your category crawls stay consistent as you paginate and expand to multiple categories.


What we’re scraping (example URL + pagination)

Trustpilot category pages look like:

  • Category: https://www.trustpilot.com/categories/{category_slug}
  • Pagination: https://www.trustpilot.com/categories/{category_slug}?page=2

Example used in this tutorial:

  • https://www.trustpilot.com/categories/travel_insurance_company
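
The URL pattern above can be wrapped in a small helper. This is only a sketch: category_url is a hypothetical name, and it assumes ?page=N is the only query parameter you need.

```python
from urllib.parse import urlencode

BASE = "https://www.trustpilot.com"


def category_url(category_slug: str, page: int = 1) -> str:
    """Build a category URL; page 1 is the bare category URL."""
    url = f"{BASE}/categories/{category_slug}"
    if page > 1:
        # Later pages use the ?page=N query parameter.
        url += "?" + urlencode({"page": page})
    return url


print(category_url("travel_insurance_company", page=2))
```

In the crawler itself we follow the "Next page" link instead of counting pages, so this helper is mainly useful for spot-checking a specific page.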

Why this target needs a proxy layer

If you fetch Trustpilot too aggressively from a single IP, you’ll often get a JSON error response describing a bot block.

The goal here isn’t “blast Trustpilot.” The goal is:

  • fetch like a normal browser (headers + timeouts)
  • paginate slowly
  • retry with backoff so transient blocks don’t kill your job

That’s exactly where a proxy layer helps.


Setup

python3 -m venv .venv
source .venv/bin/activate

pip install requests beautifulsoup4 lxml python-dotenv

ProxiesAPI request pattern (requests + proxy URL)

Set your proxy URL (ProxiesAPI will give you the exact credentials/host/port):

export PROXIESAPI_PROXY_URL="http://YOUR_USERNAME:YOUR_PASSWORD@gw.proxiesapi.com:8080"

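We’ll read this with os.getenv. If the variable is unset, the fetcher falls back to direct requests without a proxy. (python-dotenv is installed so you can keep the value in a .env file instead; in that case call load_dotenv() before the variable is read.)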


Step 1: Fetch category HTML with realistic headers

import os
import time
import random
import requests

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
TIMEOUT = (10, 30)  # connect, read

session = requests.Session()


def fetch(url: str) -> str:
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    proxies = None
    if PROXY_URL:
        proxies = {"http": PROXY_URL, "https": PROXY_URL}

    r = session.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def polite_sleep(min_s: float = 1.0, max_s: float = 2.2) -> None:
    time.sleep(random.uniform(min_s, max_s))

Two production notes:

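  • Always set timeouts. requests has no default timeout, so one stalled response can hang the whole crawl.
  • Keep a session. It reuses connections across pages and carries cookies between requests, which is closer to how a browser behaves.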
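
The fetch() above raises on the first failure. One way to keep a transient block from ending the run is a retry wrapper with exponential backoff. The sketch below is an assumption on my part, not something ProxiesAPI prescribes: fetch_with_retries, attempts, and base_delay are hypothetical names, and the wrapper takes the fetch function as an argument so it can be tried without touching the network.

```python
import random
import time
from typing import Callable


def fetch_with_retries(
    fetch_fn: Callable[[str], str],
    url: str,
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_fn(url), retrying failures with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")

    for attempt in range(1, attempts + 1):
        try:
            return fetch_fn(url)
        except Exception:
            # With the tutorial's fetch() you would narrow this to
            # requests.RequestException (raise_for_status raises a subclass of it).
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ... plus up to 1s of jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))

    raise AssertionError("unreachable")
```

Usage would look like html = fetch_with_retries(fetch, url) inside the crawl loop. The jitter keeps several workers from retrying in lockstep.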

Step 2: Find stable anchors (don’t guess classes)

Trustpilot’s CSS classes are often hashed, so don’t anchor on them.

Instead, two stable anchors on category pages are:

  1. Company links go to /review/{domain}
  2. Pagination uses nav[aria-label="Pagination"]

So we’ll scrape cards by finding anchors that match /review/… and then walking up to the “card container”.
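
Because the link shape is /review/{domain}, the href itself is a handy source for the company domain. A minimal sketch, assuming the path segment after /review/ really is the domain (domain_from_review_href is a hypothetical helper name):

```python
from __future__ import annotations

from urllib.parse import urlparse


def domain_from_review_href(href: str) -> str | None:
    """Return the {domain} part of a /review/{domain} link, or None."""
    path = urlparse(href).path  # drops any ?query or #fragment
    prefix = "/review/"
    if not path.startswith(prefix):
        return None
    slug = path[len(prefix):].strip("/")
    return slug.lower() or None


print(domain_from_review_href("/review/example.com?utm_source=x"))
```

This works for both relative hrefs and absolute review URLs, and it gives you a fallback when no domain is visible in the card text.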


Step 3: Parse company cards (name, rating, review count)

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://www.trustpilot.com"

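# Assumption: cards label the score as "TrustScore 4.8". Prefer that label and
# fall back to the first number in the text if the label is missing.
TRUSTSCORE_RE = re.compile(r"TrustScore\s+(\d+(?:\.\d+)?)", re.IGNORECASE)
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)")
# Require a leading digit so a stray comma before "reviews" can't match.
REVIEWS_RE = re.compile(r"(\d[\d,]*)\s+reviews?\b", re.IGNORECASE)


def parse_rating_and_reviews(text: str) -> tuple[float | None, int | None]:
    """Pull (rating, review_count) out of a card's flattened text."""
    rating = None
    reviews = None

    if text:
        m = TRUSTSCORE_RE.search(text) or RATING_RE.search(text)
        if m:
            rating = float(m.group(1))

        m = REVIEWS_RE.search(text)
        if m:
            reviews = int(m.group(1).replace(",", ""))

    return rating, reviews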


def parse_category_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    rows = []

    # Trustpilot category pages contain many /review/ links.
    # We dedupe by href and only keep the first occurrence.
    seen = set()

    for a in soup.select('a[href^="/review/"]'):
        href = a.get("href")
        if not href or href in seen:
            continue
        seen.add(href)

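        # Card-ish container: climb while the parent still holds only this
        # company's /review/ link, so we stop before the list wrapping every card.
        card = a
        for _ in range(6):
            parent = card.parent
            if parent is None:
                break
            hrefs = {x.get("href") for x in parent.select('a[href^="/review/"]')}
            if len(hrefs) > 1:
                break
            card = parent

        card_text = " ".join(card.get_text(" ", strip=True).split())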

        rating, reviews = parse_rating_and_reviews(card_text)

        name = a.get_text(" ", strip=True)
        review_url = urljoin(BASE, href)

        # The website domain usually appears as plain text near the company name.
        dm = re.search(r"\b([a-z0-9-]+\.[a-z]{2,})\b", card_text, re.IGNORECASE)
        domain = dm.group(1).lower() if dm else None

        rows.append(
            {
                "name": name,
                "review_url": review_url,
                "website": domain,
                "rating": rating,
                "review_count": reviews,
            }
        )

    return rows
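
Since the card parsing is heuristic, it can help to drop rows that are obviously mis-parsed before exporting. The filter below is a sketch under two assumptions: ratings sit on a 0 to 5 scale, and a usable row needs at least a name and a review URL. is_plausible_row is a hypothetical helper, not part of the scraper above.

```python
def is_plausible_row(row: dict) -> bool:
    """Reject rows with missing identity fields or an out-of-range rating."""
    if not row.get("name") or not row.get("review_url"):
        return False

    rating = row.get("rating")
    # A missing rating is allowed (new companies may have none); a rating
    # outside 0..5 means the regex grabbed some other number.
    if rating is not None and not (0.0 <= rating <= 5.0):
        return False

    return True


sample = {"name": "Acme", "review_url": "https://www.trustpilot.com/review/acme.com", "rating": 4.6}
print(is_plausible_row(sample))
```

You could apply it as rows = [r for r in rows if is_plausible_row(r)] right after parse_category_page, and log what gets dropped so selector drift shows up early.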

Step 4: Paginate with nav[aria-label="Pagination"]

The next-page link is usually present as:

  • nav[aria-label="Pagination"] a[aria-label="Next page"]

We’ll follow it until it disappears.

from urllib.parse import urljoin


def find_next_page(html: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    nav = soup.select_one('nav[aria-label="Pagination"]')
    if not nav:
        return None

    a = nav.select_one('a[aria-label="Next page"], a[rel="next"]')
    if not a:
        return None

    href = a.get("href")
    if not href:
        return None

    return urljoin(BASE, href)


def crawl_category(category_slug: str, max_pages: int = 5) -> list[dict]:
    url = f"{BASE}/categories/{category_slug}"

    all_rows: list[dict] = []
    page = 1

    while url and page <= max_pages:
        html = fetch(url)

        rows = parse_category_page(html)
        for r in rows:
            r["category_slug"] = category_slug
            r["page"] = page
        all_rows.extend(rows)

        url = find_next_page(html)
        page += 1
        if url and page <= max_pages:
            polite_sleep()  # pause only when another page will be fetched

    return all_rows

Step 5: Export to CSV

import csv


def write_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        raise ValueError("no rows")

    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)


if __name__ == "__main__":
    category = "travel_insurance_company"
    rows = crawl_category(category_slug=category, max_pages=3)
    write_csv("trustpilot_category_rankings.csv", rows)
    print(f"wrote {len(rows)} rows")

Practical tips (what breaks first)

1) You’ll sometimes get a “bot block” response

If the response isn’t HTML, it’s usually a block.

Detect it with a simple guard:

  • if the body starts with { and contains "blocked" → treat as blocked and retry later
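
That guard could look like the sketch below. The check mirrors the rule above (body starts with { and mentions "blocked"); the function name looks_blocked and the case-insensitive match are my own assumptions.

```python
def looks_blocked(body: str) -> bool:
    """Heuristic: a JSON-looking body that mentions 'blocked' instead of HTML."""
    text = body.lstrip()
    if not text.startswith("{"):
        return False
    return "blocked" in text.lower()


print(looks_blocked('{"error": "Request blocked"}'))
```

Call it on the response text before parsing, and treat a positive result like any other transient failure: wait, then retry later.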

2) Don’t scrape every category every day

Trustpilot has many categories, and crawling all of them daily multiplies your request volume (and your block rate).

Instead:

  • keep a category allowlist
  • crawl a few categories per run
  • store and diff results
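
For the "store and diff" step, here is one possible shape: compare two crawls of the same category, keyed by review_url. It is a sketch that assumes rows look like the dicts produced by parse_category_page; diff_snapshots is a hypothetical name.

```python
def diff_snapshots(old_rows: list, new_rows: list) -> dict:
    """Compare two crawls of one category, keyed by review_url."""
    old = {r["review_url"]: r for r in old_rows}
    new = {r["review_url"]: r for r in new_rows}

    return {
        # Companies that appeared or disappeared between runs.
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # (old_rating, new_rating) for companies present in both runs.
        "rating_changes": {
            url: (old[url].get("rating"), new[url].get("rating"))
            for url in sorted(old.keys() & new.keys())
            if old[url].get("rating") != new[url].get("rating")
        },
    }


yesterday = [{"review_url": "u/a", "rating": 4.5}, {"review_url": "u/b", "rating": 3.9}]
today = [{"review_url": "u/a", "rating": 4.6}, {"review_url": "u/c", "rating": 4.1}]
print(diff_snapshots(yesterday, today))
```

Loading the previous run from its CSV with csv.DictReader works too, but remember that values come back as strings, so convert rating to float before comparing.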

3) Keep selectors simple

The most stable signal here is the /review/ link. CSS classes will change.


Next steps

  • Add a “top N only” mode (stop after 50 companies)
  • Crawl multiple categories in one run
  • Store results in a database and alert on rank/rating changes
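
As a starting point for the "top N only" mode, the sketch below consumes pages lazily and stops as soon as N rows are collected, so later pages are never fetched. It assumes page order reflects the category ranking; top_n and the rank field are hypothetical additions.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator


def top_n(pages: Iterable[list[dict]], n: int = 50) -> list[dict]:
    """Take the first n rows across pages, adding a 1-based rank to each."""

    def ranked() -> Iterator[dict]:
        rank = 0
        for rows in pages:  # a generator here means pages are fetched on demand
            for row in rows:
                rank += 1
                yield {**row, "rank": rank}

    # islice stops pulling from the generator once n rows have been yielded.
    return list(islice(ranked(), n))
```

To use it, turn the crawl loop into a generator that yields one parsed page at a time and pass that generator as pages.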