Scrape Glassdoor Salaries and Reviews (Python + ProxiesAPI)
Glassdoor is one of those sites that looks easy until you try to run a crawl for more than a few minutes.
You’ll typically hit some combination of:
- sessions/cookies that matter (requests without cookies behave differently)
- rate limits and soft blocks
- HTML that changes slightly by locale
- occasional interstitials and “are you a human?” style pages
In this tutorial we’ll build a production-shaped scraper in Python that:
- Locates a company page (you can provide the URL or search by name)
- Scrapes reviews with pagination
- Scrapes salary ranges where they’re visible
- Uses timeouts + retries + session cookies
- Adds ProxiesAPI at the network layer so blocks are less likely to kill the run
- Exports clean JSON/JSONL
Important note: always review a site’s Terms of Service and ensure you have the right to collect and use the data you’re scraping.
Scrapers fail in the network layer first: timeouts, throttling, and blocks. ProxiesAPI gives you clean IP rotation + a consistent proxy endpoint so your Glassdoor crawl can keep moving.
What we’re scraping (and what to expect)
Glassdoor content is split across multiple URL types. Depending on the company and your region, you’ll see pages like:
- Company overview
- Reviews list (often paginated)
- Salaries (often a separate page)
The exact URL patterns can vary, but the scraper shape stays the same:
- fetch a page
- detect if you got real content vs a block/interstitial
- parse the HTML with selectors that are easy to debug
- paginate until you hit the end
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv
```
We’ll use:
- requests for HTTP
- BeautifulSoup (with the lxml parser) for parsing
- tenacity for retries
- python-dotenv to load a ProxiesAPI key from .env
Create a .env file:
PROXIESAPI_KEY="YOUR_KEY_HERE"
Step 1: Build a robust fetcher (sessions + retries + proxy)
The goal is to centralize everything that makes scraping reliable:
- Session: cookies persist across requests
- Headers: realistic UA + accept-language
- Timeouts: never hang
- Retries: transient failures are normal
- Proxy: one switch to turn ProxiesAPI on/off
```python
import os
import random
import time
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

load_dotenv()

TIMEOUT = (10, 30)  # connect, read

USER_AGENTS = [
    # Keep a small rotation; don't overdo it.
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


class GlassdoorClient:
    def __init__(self, use_proxiesapi: bool = True):
        self.session = requests.Session()
        self.use_proxiesapi = use_proxiesapi
        self.session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
                "Connection": "keep-alive",
            }
        )

    def _proxies(self):
        if not self.use_proxiesapi:
            return None
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY in environment")
        # ProxiesAPI typically provides an authenticated proxy endpoint.
        # If your ProxiesAPI account provides a different format, adapt here.
        proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
        return {"http": proxy, "https": proxy}

    @retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))
    def fetch(self, url: str) -> FetchResult:
        # Rotate UA per request (lightweight)
        self.session.headers["User-Agent"] = random.choice(USER_AGENTS)
        r = self.session.get(url, timeout=TIMEOUT, proxies=self._proxies())
        # Some blocks return 200 with a "robot" page; we detect that later.
        return FetchResult(url=url, status_code=r.status_code, text=r.text)


def soupify(html: str) -> BeautifulSoup:
    return BeautifulSoup(html, "lxml")
```
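For reference, the proxies argument that requests expects is just a two-key dict mapping scheme to endpoint. This stand-alone snippet shows the shape; the endpoint format itself is an assumption from the code above, so confirm the exact host/port in your ProxiesAPI dashboard:

```python
import os

# Fall back to a placeholder so the snippet runs without a real key
key = os.getenv("PROXIESAPI_KEY", "YOUR_KEY_HERE")

# Assumed endpoint format (username = API key, empty password)
proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
proxies = {"http": proxy, "https": proxy}

print(proxies["https"])
```

requests applies the same mapping whether you pass it per call (as `fetch` does) or set it once on `session.proxies`.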
Block/interstitial detection (pragmatic)
Rather than guessing perfectly, we look for a few high-signal indicators:
- page contains “captcha” / “robot” keywords
- very short HTML
- missing a main content container repeatedly
```python
def looks_blocked(html: str) -> bool:
    h = (html or "").lower()
    if len(h) < 2000:
        return True
    needles = ["captcha", "are you a human", "robot", "unusual traffic"]
    return any(n in h for n in needles)
```
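To sanity-check the heuristic, here is a self-contained run (it repeats the function so it executes on its own; both payloads are made up):

```python
def looks_blocked(html: str) -> bool:
    h = (html or "").lower()
    if len(h) < 2000:
        return True
    needles = ["captcha", "are you a human", "robot", "unusual traffic"]
    return any(n in h for n in needles)

# Hypothetical payloads: a short interstitial vs. a long, normal-looking page
interstitial = "<html><body>Please verify you are a human.</body></html>"
real_page = "<html><body>" + "review content " * 300 + "</body></html>"

print(looks_blocked(interstitial))  # True (short page + keyword)
print(looks_blocked(real_page))     # False
```

Tune the 2000-character threshold against pages you have actually fetched; some locales serve leaner HTML than others.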
Step 2: Provide a company URL (recommended)
Glassdoor URLs can be fickle. The most reliable approach is:
- manually find the company page once
- feed that URL into your scraper
If you need discovery by name, do it via a search engine and then validate the resulting URL.
For the scraping part, we’ll assume you have a URL like:
https://www.glassdoor.com/Overview/...
Step 3: Scrape reviews (selectors + pagination)
Glassdoor review pages often include cards with:
- rating
- review title
- author/job title (sometimes)
- date
- pros/cons
The exact CSS classes can change, so our strategy is:
- select semantic-ish containers first
- fall back to text heuristics
- always log a sample if parsing yields zero
```python
import re
from urllib.parse import urljoin


def clean_text(el) -> str:
    if not el:
        return ""
    return re.sub(r"\s+", " ", el.get_text(" ", strip=True)).strip()


def parse_reviews(html: str) -> list[dict]:
    soup = soupify(html)
    reviews = []
    # Try common "review card" patterns. You may need to update selectors over time.
    cards = soup.select("[data-test='review-card'], article, div")
    for c in cards:
        txt = clean_text(c)
        if not txt:
            continue
        # Heuristic: review cards usually contain "Pros" or "Cons" labels.
        if "pros" not in txt.lower() and "cons" not in txt.lower():
            continue
        rating = None
        rating_el = c.select_one("[aria-label*='rating'], span[aria-label*='rating']")
        if rating_el:
            m = re.search(r"([0-9]\.?[0-9]?)", rating_el.get("aria-label", ""))
            rating = float(m.group(1)) if m else None
        title_el = c.select_one("[data-test='review-title'], a, h2, h3")
        title = clean_text(title_el)
        # Extract pros/cons blocks if labeled
        pros = ""
        cons = ""
        for label in c.select("span, div, p"):
            lt = clean_text(label).lower()
            if lt in ("pros", "pro"):
                # next sibling text
                nxt = label.find_next()
                pros = clean_text(nxt)
            if lt in ("cons", "con"):
                nxt = label.find_next()
                cons = clean_text(nxt)
        reviews.append(
            {
                "title": title,
                "rating": rating,
                "pros": pros,
                "cons": cons,
                "raw_snippet": txt[:400],
            }
        )
    return reviews
```
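The rating extraction above leans on a small regex against the aria-label text. A quick stdlib-only check with some made-up aria-label strings (real markup varies by page and locale):

```python
import re

# Hypothetical aria-label values, not actual Glassdoor markup
labels = ["4.0 rating", "Rating 3.5 of 5", "no number here"]

ratings = []
for aria in labels:
    m = re.search(r"([0-9]\.?[0-9]?)", aria)
    ratings.append(float(m.group(1)) if m else None)

print(ratings)  # [4.0, 3.5, None]
```

Note the regex grabs the first number it sees, so a label like "5 star rating, 3 reviews" would yield 5.0; keep the selector tight enough that only the rating element reaches it.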
Pagination loop
Many review lists paginate via a query param or path segment. Since the pattern changes, we’ll do something simple and resilient:
- start from a URL you provide (first page)
- after parsing, look for a “next” link
- stop if no next link or if it loops
```python
def find_next_page(html: str, base_url: str) -> str | None:
    soup = soupify(html)
    # Common patterns: rel=next or "Next" text
    a = soup.select_one("a[rel='next']")
    if not a:
        for cand in soup.select("a"):
            if clean_text(cand).lower() in ("next", "next page"):
                a = cand
                break
    if not a:
        return None
    href = a.get("href")
    if not href:
        return None
    return urljoin(base_url, href)
```
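find_next_page returns whatever href the page carries, relative or absolute, and urljoin normalizes both cases against the current page. The _P2 path below is a made-up example, not a guaranteed Glassdoor pattern:

```python
from urllib.parse import urljoin

base = "https://www.glassdoor.com/Reviews/COMPANY-Reviews-E12345.htm"

# Relative href, as a "next" link often carries it
print(urljoin(base, "/Reviews/COMPANY-Reviews-E12345_P2.htm"))
# https://www.glassdoor.com/Reviews/COMPANY-Reviews-E12345_P2.htm

# Absolute hrefs pass through unchanged
print(urljoin(base, "https://www.glassdoor.com/Reviews/other.htm"))
# https://www.glassdoor.com/Reviews/other.htm
```

Because the output is always absolute, the seen_urls set in the crawl loop can deduplicate reliably.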
```python
def crawl_reviews(client: GlassdoorClient, start_url: str, max_pages: int = 10) -> list[dict]:
    out = []
    seen_urls = set()
    url = start_url
    for _ in range(max_pages):
        if url in seen_urls:
            break
        seen_urls.add(url)
        res = client.fetch(url)
        if res.status_code >= 400 or looks_blocked(res.text):
            # tenacity already retries transport errors inside fetch(); for a
            # block page we back off once, refetch, and stop if it persists
            # rather than parsing interstitial HTML.
            time.sleep(2)
            res = client.fetch(url)
            if res.status_code >= 400 or looks_blocked(res.text):
                break
        batch = parse_reviews(res.text)
        if batch:
            out.extend(batch)
        next_url = find_next_page(res.text, url)
        if not next_url:
            break
        # polite pacing
        time.sleep(1.0)
        url = next_url
    return out
```
Step 4: Scrape salary ranges (where visible)
Salary pages may show:
- job title
- base pay range
- location
- data source count
Again: the DOM changes. Treat salary scraping as best-effort, and always export what you got.
```python
def parse_salaries(html: str) -> list[dict]:
    soup = soupify(html)
    rows = []
    # Look for rows that include currency symbols
    for el in soup.select("tr, li, div"):
        t = clean_text(el)
        if not t:
            continue
        if "$" not in t and "₹" not in t and "£" not in t and "€" not in t:
            continue
        # crude parse: capture a range like "$80K - $120K"
        m = re.search(r"([$€£₹]\s?\d[\d,.]*\s?[Kk]?)\s?[-–]\s?([$€£₹]\s?\d[\d,.]*\s?[Kk]?)", t)
        if not m:
            continue
        rows.append({"raw": t, "min": m.group(1), "max": m.group(2)})
    # de-dup
    uniq = []
    seen = set()
    for r in rows:
        key = r["raw"]
        if key in seen:
            continue
        seen.add(key)
        uniq.append(r)
    return uniq
```
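The range regex is the load-bearing part of parse_salaries, so it helps to exercise it in isolation. The row texts below are invented; note the first capture group can keep a trailing space before an en dash, so strip the groups before storing them:

```python
import re

RANGE_RE = re.compile(r"([$€£₹]\s?\d[\d,.]*\s?[Kk]?)\s?[-–]\s?([$€£₹]\s?\d[\d,.]*\s?[Kk]?)")

# Hypothetical row texts a salaries page might flatten to
rows = [
    "Software Engineer $80K - $120K per year (123 salaries)",
    "Data Analyst €45,000 – €60,000",
    "No range in this row",
]

for t in rows:
    m = RANGE_RE.search(t)
    print(tuple(g.strip() for g in m.groups()) if m else None)
# ('$80K', '$120K')
# ('€45,000', '€60,000')
# None
```

The regex deliberately ignores unpaired figures ("Average: $95K"); widen it only once you have real pages to test against.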
Step 5: Put it together (end-to-end run)
You’ll provide:
- a reviews URL (first page)
- a salaries URL (optional)
```python
import json


def main():
    reviews_url = "https://www.glassdoor.com/Reviews/COMPANY-Reviews-E12345.htm"
    salaries_url = "https://www.glassdoor.com/Salary/COMPANY-Salaries-E12345.htm"

    client = GlassdoorClient(use_proxiesapi=True)

    print("crawling reviews...")
    reviews = crawl_reviews(client, reviews_url, max_pages=8)
    print("reviews:", len(reviews))

    print("crawling salaries...")
    res = client.fetch(salaries_url)
    salaries = parse_salaries(res.text) if not looks_blocked(res.text) else []
    print("salary rows:", len(salaries))

    out = {
        "reviews_url": reviews_url,
        "salaries_url": salaries_url,
        "reviews": reviews,
        "salaries": salaries,
    }
    with open("glassdoor_company.json", "w", encoding="utf-8") as f:
        json.dump(out, f, ensure_ascii=False, indent=2)
    print("wrote glassdoor_company.json")


if __name__ == "__main__":
    main()
```
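The intro also promised JSONL, which is handy when you want to append records as a long crawl progresses instead of writing one big document at the end. A minimal sketch with dummy records standing in for parsed reviews:

```python
import json

# Dummy records in the same shape parse_reviews produces
reviews = [
    {"title": "Great team", "rating": 4.0},
    {"title": "Long hours", "rating": 3.0},
]

# One JSON object per line; append mode ("a") also works mid-crawl
with open("reviews.jsonl", "w", encoding="utf-8") as f:
    for r in reviews:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Reading back is line-by-line json.loads
with open("reviews.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```

Unlike a single JSON document, a JSONL file stays valid even if the run dies halfway through.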
Practical reliability tips (what actually helps)
- Start with a small run (1–2 pages) and confirm your selectors
- Log HTML samples when you parse zero records (most bugs are “different page type”)
- Keep sessions sticky: use a single requests.Session() per crawl
- Slow down: 0.8–2.0 seconds between pages is often worth it
- Rotate IPs when blocked: that’s exactly where ProxiesAPI helps
QA checklist
- Fetcher uses timeouts and retries
- Block detection triggers on interstitial pages
- Review parsing returns non-zero results on a known-good page
- Pagination stops cleanly (no loops)
- JSON export writes and is readable
Where ProxiesAPI fits (honestly)
Proxies don’t magically bypass everything—but they make scrapers far more resilient to the normal failure modes: throttling, temporary blocks, and uneven success rates across IPs.
Use ProxiesAPI to keep the networking layer predictable while you focus on the part that actually changes: the DOM.