Scrape Trustpilot Category Rankings (Top Companies + Ratings) with ProxiesAPI

Trustpilot categories are a goldmine when you’re building:

  • lead lists (top companies in a niche)
  • competitive research (who ranks + how ratings shift)
  • review monitoring pipelines (discover new companies automatically)

In this guide we’ll scrape a Trustpilot category ranking page and extract:

  • company name
  • Trustpilot “review URL” (the /review/… page)
  • company website domain (when shown)
  • rating
  • review count
  • category context (category slug and page number)

Then we’ll export everything to CSV.

Trustpilot category page screenshot (proof target)
Scale category crawls reliably with ProxiesAPI

Trustpilot is quick to rate-limit and block bot-like traffic. ProxiesAPI gives you a stable proxy layer so your category crawls stay consistent as you paginate and expand to multiple categories.


What we’re scraping (example URL + pagination)

Trustpilot category pages look like:

  • Category: https://www.trustpilot.com/categories/{category_slug}
  • Pagination: https://www.trustpilot.com/categories/{category_slug}?page=2

Example used in this tutorial:

  • https://www.trustpilot.com/categories/travel_insurance_company
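
The URL pattern above can be captured in a small helper. This is only a sketch: category_url is a hypothetical name, and it assumes page 1 is the bare category URL with no query string, as in the examples above.

```python
from urllib.parse import urlencode

BASE = "https://www.trustpilot.com"


def category_url(category_slug: str, page: int = 1) -> str:
    """Build a category URL; page 1 is the bare URL, later pages add ?page=N."""
    if page < 1:
        raise ValueError("page must be >= 1")
    url = f"{BASE}/categories/{category_slug}"
    if page > 1:
        url += "?" + urlencode({"page": page})
    return url


print(category_url("travel_insurance_company"))
print(category_url("travel_insurance_company", page=2))
```

The crawler later in this guide follows the "Next page" link instead of counting pages, so a helper like this is mainly useful for resuming a crawl at a known page.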

Why this target needs a proxy layer

If you fetch Trustpilot too aggressively from a single IP, you’ll often get a JSON error response describing a bot block.

The goal here isn’t “blast Trustpilot.” The goal is:

  • fetch like a normal browser (headers + timeouts)
  • paginate slowly
  • retry with backoff so transient blocks don’t kill your job

That’s exactly where a proxy layer helps.


Setup

python3 -m venv .venv
source .venv/bin/activate

pip install requests beautifulsoup4 lxml python-dotenv

ProxiesAPI request pattern (requests + proxy URL)

Set your proxy URL (ProxiesAPI will give you the exact credentials/host/port):

export PROXIESAPI_PROXY_URL="http://YOUR_USERNAME:YOUR_PASSWORD@gw.proxiesapi.com:8080"

We’ll read this with os.getenv. If the variable is unset, the code below sends requests directly, without a proxy.


Step 1: Fetch category HTML with realistic headers

import os
import time
import random
import requests

PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
TIMEOUT = (10, 30)  # connect, read

session = requests.Session()


def fetch(url: str) -> str:
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    proxies = None
    if PROXY_URL:
        proxies = {"http": PROXY_URL, "https": PROXY_URL}

    r = session.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text


def polite_sleep(min_s: float = 1.0, max_s: float = 2.2) -> None:
    time.sleep(random.uniform(min_s, max_s))

Two production notes:

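  • Always set timeouts. requests has no default timeout, so a stalled response would otherwise hang the crawl indefinitely.
  • Keep a session. It reuses connections (less overhead) and carries cookies between requests, which looks more browser-like.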
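
The fetch() above raises on the first failure. One way to keep a transient block from ending the whole run is to wrap it in retries with exponential backoff. The sketch below is an assumption rather than part of the original code: fetch_with_retries, attempts, and base_delay are illustrative names, and the fetch function is passed in so the wrapper can be tried without touching the network.

```python
import random
import time
from typing import Callable


def fetch_with_retries(
    fetch_fn: Callable[[str], str],
    url: str,
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_fn(url), retrying failures with exponential backoff plus jitter."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch_fn(url)
        except Exception as exc:  # in real use, narrow this to requests.RequestException
            last_error = exc
            if attempt == attempts - 1:
                break
            # Wait base_delay * 1, 2, 4, ... seconds, plus up to 1s of jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

With this in place, a call such as fetch_with_retries(fetch, url) can stand in for fetch(url) wherever a page is requested.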

Step 2: Find stable anchors (don’t guess classes)

Trustpilot’s CSS classes are often hashed, so don’t anchor on them.

Instead, two stable anchors on category pages are:

  1. Company links go to /review/{domain}
  2. Pagination uses nav[aria-label="Pagination"]

So we’ll scrape cards by finding anchors that match /review/… and then walking up to the “card container”.
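
Before parsing in earnest, it can help to confirm that both anchors are still present in a fetched page. The probe below is a dependency-free sketch built on the standard library's html.parser; AnchorProbe and probe_anchors are hypothetical names, and SAMPLE is made-up markup that only imitates the two anchors, not real Trustpilot HTML.

```python
from html.parser import HTMLParser


class AnchorProbe(HTMLParser):
    """Collect distinct /review/ hrefs and note whether a Pagination nav exists."""

    def __init__(self) -> None:
        super().__init__()
        self.review_hrefs = []
        self.has_pagination_nav = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href") or ""
        if tag == "a" and href.startswith("/review/"):
            if href not in self.review_hrefs:
                self.review_hrefs.append(href)
        elif tag == "nav" and attr.get("aria-label") == "Pagination":
            self.has_pagination_nav = True


def probe_anchors(html: str) -> dict:
    """Return how many distinct /review/ links and whether pagination was found."""
    probe = AnchorProbe()
    probe.feed(html)
    return {
        "review_links": len(probe.review_hrefs),
        "pagination": probe.has_pagination_nav,
    }


# Made-up markup for illustration only.
SAMPLE = """
<div><a href="/review/example-one.com">Example One</a></div>
<div><a href="/review/example-two.com">Example Two</a></div>
<nav aria-label="Pagination"><a href="/categories/x?page=2" aria-label="Next page">Next</a></nav>
"""

print(probe_anchors(SAMPLE))  # {'review_links': 2, 'pagination': True}
```

If a live page reports zero review links, the layout has probably changed or the response is a block page, and there is little point in running the full parser on it.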


Step 3: Parse company cards (name, rating, review count)

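import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://www.trustpilot.com"

# A standalone number from 1 to 5 (e.g. "4.8"). The lookarounds skip digits
# inside names, larger numbers such as "12,345", and counts such as "5 reviews".
RATING_RE = re.compile(r"(?<![\w.,])([1-5](?:\.\d+)?)(?![\w.,]|\s+reviews)", re.IGNORECASE)
REVIEWS_RE = re.compile(r"([\d,]+)\s+reviews", re.IGNORECASE)
DOMAIN_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.IGNORECASE)


def parse_rating_and_reviews(text: str) -> tuple[float | None, int | None]:
    rating = None
    reviews = None

    if text:
        m = RATING_RE.search(text)
        if m:
            rating = float(m.group(1))

        m = REVIEWS_RE.search(text)
        if m:
            reviews = int(m.group(1).replace(",", ""))

    return rating, reviews


def find_card(a):
    """Climb from a /review/ link to the largest ancestor that holds only that company."""
    card = a
    for _ in range(6):
        parent = card.parent
        if parent is None or parent.name in ("body", "html", "[document]"):
            break
        # Stop before an ancestor that also contains other companies' links;
        # otherwise every row would read the same (first) rating on the page.
        hrefs = {x.get("href") for x in parent.select('a[href^="/review/"]')}
        if len(hrefs) > 1:
            break
        card = parent
    return card


def parse_category_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    rows = []

    # Trustpilot category pages contain many /review/ links.
    # We dedupe by href and only keep the first occurrence.
    seen = set()

    for a in soup.select('a[href^="/review/"]'):
        href = a.get("href")
        if not href or href in seen:
            continue
        seen.add(href)

        card = find_card(a)
        card_text = " ".join(card.get_text(" ", strip=True).split())

        rating, reviews = parse_rating_and_reviews(card_text)

        # If the link wraps the whole card, this text may include more than the name.
        name = a.get_text(" ", strip=True)
        review_url = urljoin(BASE, href)

        # Prefer the /review/{domain} slug; fall back to the first
        # domain-like token in the card text.
        slug = href[len("/review/"):].split("?", 1)[0].strip("/")
        dm = DOMAIN_RE.fullmatch(slug) or DOMAIN_RE.search(card_text)
        domain = dm.group(1).lower() if dm else None

        rows.append(
            {
                "name": name,
                "review_url": review_url,
                "website": domain,
                "rating": rating,
                "review_count": reviews,
            }
        )

    return rows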

Step 4: Paginate with nav[aria-label="Pagination"]

The next-page link is usually present as:

  • nav[aria-label="Pagination"] a[aria-label="Next page"]

We’ll follow it until it disappears or we reach max_pages, whichever comes first.

from urllib.parse import urljoin


def find_next_page(html: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    nav = soup.select_one('nav[aria-label="Pagination"]')
    if not nav:
        return None

    a = nav.select_one('a[aria-label="Next page"], a[rel="next"]')
    if not a:
        return None

    href = a.get("href")
    if not href:
        return None

    return urljoin(BASE, href)


def crawl_category(category_slug: str, max_pages: int = 5) -> list[dict]:
    url = f"{BASE}/categories/{category_slug}"

    all_rows: list[dict] = []
    page = 1

    while url and page <= max_pages:
        html = fetch(url)

        rows = parse_category_page(html)
        for r in rows:
            r["category_slug"] = category_slug
            r["page"] = page
        all_rows.extend(rows)

        url = find_next_page(html)
        page += 1
        polite_sleep()

    return all_rows

Step 5: Export to CSV

import csv


def write_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        raise ValueError("no rows")

    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)


if __name__ == "__main__":
    category = "travel_insurance_company"
    rows = crawl_category(category_slug=category, max_pages=3)
    write_csv("trustpilot_category_rankings.csv", rows)
    print(f"wrote {len(rows)} rows")

Practical tips (what breaks first)

1) You’ll sometimes get a “bot block” response

If the response isn’t HTML, it’s usually a block.

Detect it with a simple guard:

  • if the body starts with { and contains "blocked" → treat as blocked and retry later
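
The guard described above fits in a few lines. This is a heuristic sketch, not a documented Trustpilot contract: looks_blocked is a hypothetical name, and the JSON body in the example is invented to show the shape of the check.

```python
def looks_blocked(body: str) -> bool:
    """Heuristic: a JSON-looking body that mentions "blocked" is treated as a bot block."""
    text = body.lstrip()
    return text.startswith("{") and "blocked" in text.lower()


# Invented response bodies, for illustration only.
print(looks_blocked('{"error": "Request blocked"}'))
print(looks_blocked("<!DOCTYPE html><html><body>...</body></html>"))
```

A natural place to call it is right after fetching and before parsing: if it returns True, raise an exception (or sleep and retry) instead of writing empty rows to the CSV.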

2) Don’t scrape every category every day

Trustpilot has many categories.

Instead:

  • keep a category allowlist
  • crawl a few categories per run
  • store and diff results
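
As a rough illustration of "store and diff", the sketch below compares two runs keyed by review_url. diff_runs is a hypothetical helper and the rows are made-up data; it assumes rows shaped like the dicts produced by parse_category_page. Rows read back from CSV hold ratings as strings, so convert them to float before comparing.

```python
def diff_runs(previous: list, current: list) -> dict:
    """Compare two crawls of the same category, keyed by review_url."""
    prev = {r["review_url"]: r for r in previous}
    curr = {r["review_url"]: r for r in current}

    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    rating_changes = {
        url: (prev[url]["rating"], curr[url]["rating"])
        for url in sorted(curr.keys() & prev.keys())
        if prev[url]["rating"] != curr[url]["rating"]
    }
    return {"added": added, "removed": removed, "rating_changes": rating_changes}


# Made-up rows for illustration only.
yesterday = [
    {"review_url": "https://www.trustpilot.com/review/a.example", "rating": 4.5},
    {"review_url": "https://www.trustpilot.com/review/b.example", "rating": 4.0},
]
today = [
    {"review_url": "https://www.trustpilot.com/review/a.example", "rating": 4.6},
    {"review_url": "https://www.trustpilot.com/review/c.example", "rating": 3.9},
]
print(diff_runs(yesterday, today))
```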

3) Keep selectors simple

The most stable signal here is the /review/ link. CSS classes will change.


Next steps

  • Add a “top N only” mode (stop after 50 companies)
  • Crawl multiple categories in one run
  • Store results in a database and alert on rank/rating changes
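
For the "top N only" idea, one possible shape is to treat pages as a lazy iterable and stop consuming it once enough rows are collected, so later pages are never fetched. take_top_n and the generator-of-pages interface are assumptions for this sketch; the crawler in this guide returns a flat list and would need a small refactor to yield one page at a time.

```python
from typing import Iterable


def take_top_n(pages: Iterable[list], limit: int = 50) -> list:
    """Consume pages of rows lazily and stop as soon as `limit` rows are collected."""
    top = []
    if limit <= 0:
        return top
    for rows in pages:
        for row in rows:
            top.append(row)
            if len(top) >= limit:
                # Returning here means the generator never produces the next page.
                return top
    return top
```

Because the iterable is only advanced on demand, a generator that fetches one category page per step would stop issuing requests as soon as the limit is reached.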