Scrape Glassdoor Salaries and Reviews (Python + ProxiesAPI)

Glassdoor is one of those sites that looks easy until you try to run a crawl for more than a few minutes.

You’ll typically hit some combination of:

  • sessions/cookies that matter (requests without cookies behave differently)
  • rate limits and soft blocks
  • HTML that changes slightly by locale
  • occasional interstitials and “are you a human?” style pages

In this tutorial we’ll build a production-shaped scraper in Python that:

  1. Locates a company page (you can provide the URL or search by name)
  2. Scrapes reviews with pagination
  3. Scrapes salary ranges where they’re visible
  4. Uses timeouts + retries + session cookies
  5. Adds ProxiesAPI at the network layer so blocks are less likely to kill the run
  6. Exports clean JSON/JSONL

Important note: always review a site’s Terms of Service and ensure you have the right to collect and use the data you’re scraping.

Keep Glassdoor crawls stable with ProxiesAPI

Scrapers fail in the network layer first: timeouts, throttling, and blocks. ProxiesAPI gives you clean IP rotation + a consistent proxy endpoint so your Glassdoor crawl can keep moving.


What we’re scraping (and what to expect)

Glassdoor content is split across multiple URL types. Depending on the company and your region, you’ll see pages like:

  • Company overview
  • Reviews list (often paginated)
  • Salaries (often a separate page)

The exact URL patterns can vary, but the scraper shape stays the same:

  • fetch a page
  • detect if you got real content vs a block/interstitial
  • parse the HTML with selectors that are easy to debug
  • paginate until you hit the end

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for HTML parsing
  • tenacity for retries
  • python-dotenv to load a ProxiesAPI key from .env

Create a .env file:

PROXIESAPI_KEY="YOUR_KEY_HERE"

Step 1: Build a robust fetcher (sessions + retries + proxy)

The goal is to centralize everything that makes scraping reliable:

  • Session: cookies persist across requests
  • Headers: realistic UA + accept-language
  • Timeouts: never hang
  • Retries: transient failures are normal
  • Proxy: one switch to turn ProxiesAPI on/off

import os
import random
import time
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

load_dotenv()

TIMEOUT = (10, 30)  # connect, read

USER_AGENTS = [
    # Keep a small rotation; don’t overdo it.
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


class GlassdoorClient:
    def __init__(self, use_proxiesapi: bool = True):
        self.session = requests.Session()
        self.use_proxiesapi = use_proxiesapi

        self.session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
                "Connection": "keep-alive",
            }
        )

    def _proxies(self):
        if not self.use_proxiesapi:
            return None
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY in environment")

        # ProxiesAPI typically provides an authenticated proxy endpoint.
        # If your ProxiesAPI account provides a different format, adapt here.
        proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
        return {"http": proxy, "https": proxy}

    @retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))
    def fetch(self, url: str) -> FetchResult:
        # Rotate UA per request (lightweight)
        self.session.headers["User-Agent"] = random.choice(USER_AGENTS)

        r = self.session.get(url, timeout=TIMEOUT, proxies=self._proxies())

        # Some blocks return 200 with “robot” page; we detect later.
        return FetchResult(url=url, status_code=r.status_code, text=r.text)


def soupify(html: str) -> BeautifulSoup:
    return BeautifulSoup(html, "lxml")

Block/interstitial detection (pragmatic)

Rather than guessing perfectly, we look for a few high-signal indicators:

  • page contains “captcha” / “robot” keywords
  • very short HTML
  • missing a main content container repeatedly

def looks_blocked(html: str) -> bool:
    h = (html or "").lower()
    if len(h) < 2000:
        return True
    needles = ["captcha", "are you a human", "robot", "unusual traffic"]
    return any(n in h for n in needles)
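A quick sanity check of the detector on synthetic inputs (the function is repeated here so the snippet runs on its own):

```python
# looks_blocked repeated from above so this check runs standalone.
def looks_blocked(html: str) -> bool:
    h = (html or "").lower()
    if len(h) < 2000:
        return True
    needles = ["captcha", "are you a human", "robot", "unusual traffic"]
    return any(n in h for n in needles)

# A real page is long and lacks block keywords; an interstitial is usually short.
real_page = "<html><body>" + "review content " * 300 + "</body></html>"
print(looks_blocked(real_page))               # False
print(looks_blocked("<html>captcha</html>"))  # True
```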

Step 2: Find the company page URL

Glassdoor URLs can be fickle. The most reliable approach is:

  • manually find the company page once
  • feed that URL into your scraper

If you need discovery by name, do it via a search engine and then validate the resulting URL.

For the scraping part, we’ll assume you have a URL like:

  • https://www.glassdoor.com/Overview/...
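If you do go the discovery route, a light validator filters obvious junk before you crawl. This is a sketch: the helper name and path prefixes are assumptions based on the URL shapes above, not an official list.

```python
from urllib.parse import urlparse


def is_glassdoor_company_url(url: str) -> bool:
    """Cheap sanity check before feeding a discovered URL to the crawler."""
    p = urlparse(url)
    if p.scheme not in ("http", "https"):
        return False
    if "glassdoor." not in p.netloc:
        return False
    # Company pages tend to live under these path prefixes (assumption).
    return p.path.startswith(("/Overview/", "/Reviews/", "/Salary/"))


print(is_glassdoor_company_url("https://www.glassdoor.com/Overview/Acme-EI_IE12345.htm"))  # True
print(is_glassdoor_company_url("https://example.com/Overview/anything"))                   # False
```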

Step 3: Scrape reviews (selectors + pagination)

Glassdoor review pages often include cards with:

  • rating
  • review title
  • author/job title (sometimes)
  • date
  • pros/cons

The exact CSS classes can change, so our strategy is:

  • select semantic-ish containers first
  • fall back to text heuristics
  • always log a sample if parsing yields zero

import json
import re
from urllib.parse import urljoin


def clean_text(el) -> str:
    if not el:
        return ""
    return re.sub(r"\s+", " ", el.get_text(" ", strip=True)).strip()


def parse_reviews(html: str) -> list[dict]:
    soup = soupify(html)

    reviews = []

    # Try the specific card selector first; fall back to article elements.
    # (Selecting every <div> would also match nested ancestors and duplicate
    # cards. You may need to update selectors over time.)
    cards = soup.select("[data-test='review-card']") or soup.select("article")

    for c in cards:
        txt = clean_text(c)
        if not txt:
            continue

        # Heuristic: review cards usually contain “Pros” or “Cons” labels.
        if "pros" not in txt.lower() and "cons" not in txt.lower():
            continue

        rating = None
        rating_el = c.select_one("[aria-label*='rating']")
        if rating_el:
            m = re.search(r"([0-9]\.?[0-9]?)", rating_el.get("aria-label", ""))
            rating = float(m.group(1)) if m else None

        title_el = c.select_one("[data-test='review-title'], a, h2, h3")
        title = clean_text(title_el)

        # Extract pros/cons blocks if labeled
        pros = ""
        cons = ""
        for label in c.select("span, div, p"):
            lt = clean_text(label).lower()
            if lt in ("pros", "pro"):
                # next sibling text
                nxt = label.find_next()
                pros = clean_text(nxt)
            if lt in ("cons", "con"):
                nxt = label.find_next()
                cons = clean_text(nxt)

        reviews.append(
            {
                "title": title,
                "rating": rating,
                "pros": pros,
                "cons": cons,
                "raw_snippet": txt[:400],
            }
        )

    return reviews
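Two of the small pieces above are worth checking in isolation. Here `clean_text_str` and `extract_rating` are standalone copies of the whitespace-normalization and rating-regex logic, so the snippet runs on its own:

```python
import re


def clean_text_str(s: str) -> str:
    # Same whitespace normalization clean_text applies to element text.
    return re.sub(r"\s+", " ", s).strip()


def extract_rating(aria_label: str):
    # Same pattern used against the rating aria-label above.
    m = re.search(r"([0-9]\.?[0-9]?)", aria_label or "")
    return float(m.group(1)) if m else None


print(clean_text_str("  Great\n team,\tlong   hours "))  # Great team, long hours
print(extract_rating("4.0 star rating"))                 # 4.0
print(extract_rating("no digits here"))                  # None
```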

Pagination loop

Many review lists paginate via a query param or path segment. Since the pattern changes, we’ll do something simple and resilient:

  • start from a URL you provide (first page)
  • after parsing, look for a “next” link
  • stop if no next link or if it loops

def find_next_page(html: str, base_url: str) -> str | None:
    soup = soupify(html)

    # Common patterns: rel=next or “Next” text
    a = soup.select_one("a[rel='next']")
    if not a:
        for cand in soup.select("a"):
            if clean_text(cand).lower() in ("next", "next page"):
                a = cand
                break

    if not a:
        return None

    href = a.get("href")
    if not href:
        return None

    return urljoin(base_url, href)
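This is why find_next_page resolves hrefs with urljoin against the current URL: “next” links show up both as absolute paths and as relative filenames, and urljoin handles both the same way. (The E12345 URLs here are made-up examples.)

```python
from urllib.parse import urljoin

base = "https://www.glassdoor.com/Reviews/Acme-Reviews-E12345.htm"

# Absolute-path href: replaces the path entirely.
print(urljoin(base, "/Reviews/Acme-Reviews-E12345_P2.htm"))
# Relative href: resolved against the current page's directory.
print(urljoin(base, "Acme-Reviews-E12345_P2.htm"))
# Both print https://www.glassdoor.com/Reviews/Acme-Reviews-E12345_P2.htm
```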


def crawl_reviews(client: GlassdoorClient, start_url: str, max_pages: int = 10) -> list[dict]:
    out = []
    seen_urls = set()

    url = start_url
    for _ in range(max_pages):
        if url in seen_urls:
            break
        seen_urls.add(url)

        res = client.fetch(url)
        if res.status_code >= 400 or looks_blocked(res.text):
            # tenacity retries transport errors inside fetch(); a block page
            # often returns 200, so slow down and re-fetch once ourselves.
            time.sleep(2)
            res = client.fetch(url)
            if res.status_code >= 400 or looks_blocked(res.text):
                break  # still blocked; stop instead of parsing a block page
        batch = parse_reviews(res.text)
        if batch:
            out.extend(batch)

        next_url = find_next_page(res.text, url)
        if not next_url:
            break

        # polite pacing
        time.sleep(1.0)
        url = next_url

    return out
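The intro also promised JSONL. For long crawls, appending one record per line is friendlier than one big JSON file: a crash mid-crawl loses at most one page of data. A minimal sketch (write_jsonl is a hypothetical helper, not used in main below):

```python
import json


def write_jsonl(path: str, records: list[dict]) -> int:
    """Append records as one JSON object per line; returns the count written."""
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)

# e.g. call write_jsonl("reviews.jsonl", batch) after each parsed page.
```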

Step 4: Scrape salary ranges (where visible)

Salary pages may show:

  • job title
  • base pay range
  • location
  • data source count

Again: the DOM changes. Treat salary scraping as best-effort, and always export what you got.


def parse_salaries(html: str) -> list[dict]:
    soup = soupify(html)
    rows = []

    # Look for rows that include currency symbols
    for el in soup.select("tr, li, div"):
        t = clean_text(el)
        if not t:
            continue
        if "$" not in t and "₹" not in t and "£" not in t and "€" not in t:
            continue

        # crude parse: capture a range like “$80K - $120K”
        m = re.search(r"([$€£₹]\s?\d[\d,.]*\s?[Kk]?)\s?[-–]\s?([$€£₹]\s?\d[\d,.]*\s?[Kk]?)", t)
        if not m:
            continue

        rows.append({"raw": t, "min": m.group(1), "max": m.group(2)})

    # de-dup
    uniq = []
    seen = set()
    for r in rows:
        key = r["raw"]
        if key in seen:
            continue
        seen.add(key)
        uniq.append(r)
    return uniq
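To see what the range regex does and doesn’t catch, here it is on two sample strings (the pattern is copied verbatim from parse_salaries above; the job-row text is made up):

```python
import re

# The same range pattern used in parse_salaries above.
RANGE = re.compile(r"([$€£₹]\s?\d[\d,.]*\s?[Kk]?)\s?[-–]\s?([$€£₹]\s?\d[\d,.]*\s?[Kk]?)")

m = RANGE.search("Software Engineer $80K - $120K San Francisco, CA")
print(m.group(1), m.group(2))  # $80K $120K

# Rows without an explicit currency range are skipped entirely.
print(RANGE.search("Software Engineer, competitive pay"))  # None
```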

Step 5: Put it together (end-to-end run)

You’ll provide:

  • a reviews URL (first page)
  • a salaries URL (optional)

import json


def main():
    reviews_url = "https://www.glassdoor.com/Reviews/COMPANY-Reviews-E12345.htm"
    salaries_url = "https://www.glassdoor.com/Salary/COMPANY-Salaries-E12345.htm"

    client = GlassdoorClient(use_proxiesapi=True)

    print("crawling reviews...")
    reviews = crawl_reviews(client, reviews_url, max_pages=8)
    print("reviews:", len(reviews))

    print("crawling salaries...")
    res = client.fetch(salaries_url)
    salaries = parse_salaries(res.text) if not looks_blocked(res.text) else []
    print("salary rows:", len(salaries))

    out = {
        "reviews_url": reviews_url,
        "salaries_url": salaries_url,
        "reviews": reviews,
        "salaries": salaries,
    }

    with open("glassdoor_company.json", "w", encoding="utf-8") as f:
        json.dump(out, f, ensure_ascii=False, indent=2)

    print("wrote glassdoor_company.json")


if __name__ == "__main__":
    main()

Practical reliability tips (what actually helps)

  1. Start with a small run (1–2 pages) and confirm your selectors
  2. Log HTML samples when you parse zero records (most bugs are “different page type”)
  3. Keep sessions sticky: a single requests.Session() per crawl
  4. Slow down: 0.8–2.0 seconds between pages is often worth it
  5. Rotate IPs when blocked: that’s exactly where ProxiesAPI helps
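Tip 4 in code form, a hypothetical helper you could swap in for the fixed time.sleep(1.0) in the crawl loop (tune the bounds to what the site tolerates):

```python
import random
import time


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> float:
    """Sleep for a random, jittered interval between page fetches."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Jitter avoids a perfectly regular request rhythm, which is easy to fingerprint.
```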

QA checklist

  • Fetcher uses timeouts and retries
  • Block detection triggers on interstitial pages
  • Review parsing returns non-zero results on a known-good page
  • Pagination stops cleanly (no loops)
  • JSON export writes and is readable

Where ProxiesAPI fits (honestly)

Proxies don’t magically bypass everything—but they make scrapers far more resilient to the normal failure modes: throttling, temporary blocks, and uneven success rates across IPs.

Use ProxiesAPI to keep the networking layer predictable while you focus on the part that actually changes: the DOM.

