Scrape Glassdoor Salaries and Reviews (Python + ProxiesAPI)
Glassdoor is one of those sites that looks easy until you try to run a crawl for more than a few minutes.
You’ll typically hit some combination of:
- sessions/cookies that matter (requests without cookies behave differently)
- rate limits and soft blocks
- HTML that changes slightly by locale
- occasional interstitials and “are you a human?” style pages
In this tutorial we’ll build a production-shaped scraper in Python that:
- Locates a company page (you can provide the URL or search by name)
- Scrapes reviews with pagination
- Scrapes salary ranges where they’re visible
- Uses timeouts + retries + session cookies
- Adds ProxiesAPI at the network layer so blocks are less likely to kill the run
- Exports clean JSON/JSONL
Important note: always review a site’s Terms of Service and ensure you have the right to collect and use the data you’re scraping.
Scrapers fail in the network layer first: timeouts, throttling, and blocks. ProxiesAPI gives you clean IP rotation + a consistent proxy endpoint so your Glassdoor crawl can keep moving.
What we’re scraping (and what to expect)
Glassdoor content is split across multiple URL types. Depending on the company and your region, you’ll see pages like:
- Company overview
- Reviews list (often paginated)
- Salaries (often a separate page)
The exact URL patterns can vary, but the scraper shape stays the same:
- fetch a page
- detect if you got real content vs a block/interstitial
- parse the HTML with selectors that are easy to debug
- paginate until you hit the end
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv
```
We’ll use:
- requests for HTTP
- BeautifulSoup (with the lxml parser) for parsing
- tenacity for retries
- python-dotenv to load a ProxiesAPI key from .env
Create a .env file:
PROXIESAPI_KEY="YOUR_KEY_HERE"
Step 1: Build a robust fetcher (sessions + retries + proxy)
The goal is to centralize everything that makes scraping reliable:
- Session: cookies persist across requests
- Headers: realistic UA + accept-language
- Timeouts: never hang
- Retries: transient failures are normal
- Proxy: one switch to turn ProxiesAPI on/off
```python
import os
import random
import time
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

load_dotenv()

TIMEOUT = (10, 30)  # connect, read

USER_AGENTS = [
    # Keep a small rotation; don't overdo it.
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


class GlassdoorClient:
    def __init__(self, use_proxiesapi: bool = True):
        self.session = requests.Session()
        self.use_proxiesapi = use_proxiesapi
        self.session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
                "Connection": "keep-alive",
            }
        )

    def _proxies(self):
        if not self.use_proxiesapi:
            return None
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY in environment")
        # ProxiesAPI typically provides an authenticated proxy endpoint.
        # If your ProxiesAPI account provides a different format, adapt here.
        proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
        return {"http": proxy, "https": proxy}

    @retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))
    def fetch(self, url: str) -> FetchResult:
        # Rotate UA per request (lightweight)
        self.session.headers["User-Agent"] = random.choice(USER_AGENTS)
        r = self.session.get(url, timeout=TIMEOUT, proxies=self._proxies())
        # Some blocks return 200 with a "robot" page; we detect that later.
        return FetchResult(url=url, status_code=r.status_code, text=r.text)


def soupify(html: str) -> BeautifulSoup:
    return BeautifulSoup(html, "lxml")
```
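For reference, the proxies argument that requests expects is just a two-key dict mapping scheme to endpoint. This stand-alone snippet shows the shape; the endpoint format itself is an assumption from the code above, so confirm the exact host/port in your ProxiesAPI dashboard:

```python
import os

# Fall back to a placeholder so the snippet runs without a real key
key = os.getenv("PROXIESAPI_KEY", "YOUR_KEY_HERE")

# Assumed endpoint format (username = API key, empty password)
proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
proxies = {"http": proxy, "https": proxy}

print(proxies["https"])
```

requests applies the same mapping whether you pass it per call (as `fetch` does) or set it once on `session.proxies`.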
Block/interstitial detection (pragmatic)
Rather than guessing perfectly, we look for a few high-signal indicators:
- page contains “captcha” / “robot” keywords
- very short HTML
- missing a main content container repeatedly
```python
def looks_blocked(html: str) -> bool:
    h = (html or "").lower()
    if len(h) < 2000:
        return True
    needles = ["captcha", "are you a human", "robot", "unusual traffic"]
    return any(n in h for n in needles)
```
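To sanity-check the heuristic, here is a self-contained run (it repeats the function so it executes on its own; both payloads are made up):

```python
def looks_blocked(html: str) -> bool:
    h = (html or "").lower()
    if len(h) < 2000:
        return True
    needles = ["captcha", "are you a human", "robot", "unusual traffic"]
    return any(n in h for n in needles)

# Hypothetical payloads: a short interstitial vs. a long, normal-looking page
interstitial = "<html><body>Please verify you are a human.</body></html>"
real_page = "<html><body>" + "review content " * 300 + "</body></html>"

print(looks_blocked(interstitial))  # True (short page + keyword)
print(looks_blocked(real_page))     # False
```

Tune the 2000-character threshold against pages you have actually fetched; some locales serve leaner HTML than others.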
Step 2: Provide a company URL (recommended)
Glassdoor URLs can be fickle. The most reliable approach is:
- manually find the company page once
- feed that URL into your scraper
If you need discovery by name, do it via a search engine and then validate the resulting URL.
For the scraping part, we’ll assume you have a URL like:
https://www.glassdoor.com/Overview/...
Step 3: Scrape reviews (selectors + pagination)
Glassdoor review pages often include cards with:
- rating
- review title
- author/job title (sometimes)
- date
- pros/cons
The exact CSS classes can change, so our strategy is:
- select semantic-ish containers first
- fall back to text heuristics
- always log a sample if parsing yields zero
```python
import re
from urllib.parse import urljoin


def clean_text(el) -> str:
    if not el:
        return ""
    return re.sub(r"\s+", " ", el.get_text(" ", strip=True)).strip()


def parse_reviews(html: str) -> list[dict]:
    soup = soupify(html)
    reviews = []
    # Try common "review card" patterns. You may need to update selectors over time.
    cards = soup.select("[data-test='review-card'], article, div")
    for c in cards:
        txt = clean_text(c)
        if not txt:
            continue
        # Heuristic: review cards usually contain "Pros" or "Cons" labels.
        if "pros" not in txt.lower() and "cons" not in txt.lower():
            continue
        rating = None
        rating_el = c.select_one("[aria-label*='rating'], span[aria-label*='rating']")
        if rating_el:
            m = re.search(r"([0-9]\.?[0-9]?)", rating_el.get("aria-label", ""))
            rating = float(m.group(1)) if m else None
        title_el = c.select_one("[data-test='review-title'], a, h2, h3")
        title = clean_text(title_el)
        # Extract pros/cons blocks if labeled
        pros = ""
        cons = ""
        for label in c.select("span, div, p"):
            lt = clean_text(label).lower()
            if lt in ("pros", "pro"):
                # next sibling text
                nxt = label.find_next()
                pros = clean_text(nxt)
            if lt in ("cons", "con"):
                nxt = label.find_next()
                cons = clean_text(nxt)
        reviews.append(
            {
                "title": title,
                "rating": rating,
                "pros": pros,
                "cons": cons,
                "raw_snippet": txt[:400],
            }
        )
    return reviews
```
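The rating extraction above leans on a small regex against the aria-label text. A quick stdlib-only check with some made-up aria-label strings (real markup varies by page and locale):

```python
import re

# Hypothetical aria-label values, not actual Glassdoor markup
labels = ["4.0 rating", "Rating 3.5 of 5", "no number here"]

ratings = []
for aria in labels:
    m = re.search(r"([0-9]\.?[0-9]?)", aria)
    ratings.append(float(m.group(1)) if m else None)

print(ratings)  # [4.0, 3.5, None]
```

Note the regex grabs the first number it sees, so a label like "5 star rating, 3 reviews" would yield 5.0; keep the selector tight enough that only the rating element reaches it.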
Pagination loop
Many review lists paginate via a query param or path segment. Since the pattern changes, we’ll do something simple and resilient:
- start from a URL you provide (first page)
- after parsing, look for a “next” link
- stop if no next link or if it loops
```python
def find_next_page(html: str, base_url: str) -> str | None:
    soup = soupify(html)
    # Common patterns: rel=next or "Next" text
    a = soup.select_one("a[rel='next']")
    if not a:
        for cand in soup.select("a"):
            if clean_text(cand).lower() in ("next", "next page"):
                a = cand
                break
    if not a:
        return None
    href = a.get("href")
    if not href:
        return None
    return urljoin(base_url, href)
```
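find_next_page returns whatever href the page carries, relative or absolute, and urljoin normalizes both cases against the current page. The _P2 path below is a made-up example, not a guaranteed Glassdoor pattern:

```python
from urllib.parse import urljoin

base = "https://www.glassdoor.com/Reviews/COMPANY-Reviews-E12345.htm"

# Relative href, as a "next" link often carries it
print(urljoin(base, "/Reviews/COMPANY-Reviews-E12345_P2.htm"))
# https://www.glassdoor.com/Reviews/COMPANY-Reviews-E12345_P2.htm

# Absolute hrefs pass through unchanged
print(urljoin(base, "https://www.glassdoor.com/Reviews/other.htm"))
# https://www.glassdoor.com/Reviews/other.htm
```

Because the output is always absolute, the seen_urls set in the crawl loop can deduplicate reliably.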
```python
def crawl_reviews(client: GlassdoorClient, start_url: str, max_pages: int = 10) -> list[dict]:
    out = []
    seen_urls = set()
    url = start_url
    for _ in range(max_pages):
        if url in seen_urls:
            break
        seen_urls.add(url)
        res = client.fetch(url)
        if res.status_code >= 400 or looks_blocked(res.text):
            # tenacity already retries transport errors inside fetch(); for a
            # block page we back off once, refetch, and stop if it persists
            # rather than parsing interstitial HTML.
            time.sleep(2)
            res = client.fetch(url)
            if res.status_code >= 400 or looks_blocked(res.text):
                break
        batch = parse_reviews(res.text)
        if batch:
            out.extend(batch)
        next_url = find_next_page(res.text, url)
        if not next_url:
            break
        # polite pacing
        time.sleep(1.0)
        url = next_url
    return out
```
Step 4: Scrape salary ranges (where visible)
Salary pages may show:
- job title
- base pay range
- location
- data source count
Again: the DOM changes. Treat salary scraping as best-effort, and always export what you got.
```python
def parse_salaries(html: str) -> list[dict]:
    soup = soupify(html)
    rows = []
    # Look for rows that include currency symbols
    for el in soup.select("tr, li, div"):
        t = clean_text(el)
        if not t:
            continue
        if "$" not in t and "₹" not in t and "£" not in t and "€" not in t:
            continue
        # crude parse: capture a range like "$80K - $120K"
        m = re.search(r"([$€£₹]\s?\d[\d,.]*\s?[Kk]?)\s?[-–]\s?([$€£₹]\s?\d[\d,.]*\s?[Kk]?)", t)
        if not m:
            continue
        rows.append({"raw": t, "min": m.group(1), "max": m.group(2)})
    # de-dup
    uniq = []
    seen = set()
    for r in rows:
        key = r["raw"]
        if key in seen:
            continue
        seen.add(key)
        uniq.append(r)
    return uniq
```
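The range regex is the load-bearing part of parse_salaries, so it helps to exercise it in isolation. The row texts below are invented; note the first capture group can keep a trailing space before an en dash, so strip the groups before storing them:

```python
import re

RANGE_RE = re.compile(r"([$€£₹]\s?\d[\d,.]*\s?[Kk]?)\s?[-–]\s?([$€£₹]\s?\d[\d,.]*\s?[Kk]?)")

# Hypothetical row texts a salaries page might flatten to
rows = [
    "Software Engineer $80K - $120K per year (123 salaries)",
    "Data Analyst €45,000 – €60,000",
    "No range in this row",
]

for t in rows:
    m = RANGE_RE.search(t)
    print(tuple(g.strip() for g in m.groups()) if m else None)
# ('$80K', '$120K')
# ('€45,000', '€60,000')
# None
```

The regex deliberately ignores unpaired figures ("Average: $95K"); widen it only once you have real pages to test against.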
Step 5: Put it together (end-to-end run)
You’ll provide:
- a reviews URL (first page)
- a salaries URL (optional)
```python
import json


def main():
    reviews_url = "https://www.glassdoor.com/Reviews/COMPANY-Reviews-E12345.htm"
    salaries_url = "https://www.glassdoor.com/Salary/COMPANY-Salaries-E12345.htm"

    client = GlassdoorClient(use_proxiesapi=True)

    print("crawling reviews...")
    reviews = crawl_reviews(client, reviews_url, max_pages=8)
    print("reviews:", len(reviews))

    print("crawling salaries...")
    res = client.fetch(salaries_url)
    salaries = parse_salaries(res.text) if not looks_blocked(res.text) else []
    print("salary rows:", len(salaries))

    out = {
        "reviews_url": reviews_url,
        "salaries_url": salaries_url,
        "reviews": reviews,
        "salaries": salaries,
    }
    with open("glassdoor_company.json", "w", encoding="utf-8") as f:
        json.dump(out, f, ensure_ascii=False, indent=2)
    print("wrote glassdoor_company.json")


if __name__ == "__main__":
    main()
```
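The intro also promised JSONL, which is handy when you want to append records as a long crawl progresses instead of writing one big document at the end. A minimal sketch with dummy records standing in for parsed reviews:

```python
import json

# Dummy records in the same shape parse_reviews produces
reviews = [
    {"title": "Great team", "rating": 4.0},
    {"title": "Long hours", "rating": 3.0},
]

# One JSON object per line; append mode ("a") also works mid-crawl
with open("reviews.jsonl", "w", encoding="utf-8") as f:
    for r in reviews:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Reading back is line-by-line json.loads
with open("reviews.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```

Unlike a single JSON document, a JSONL file stays valid even if the run dies halfway through.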
Practical reliability tips (what actually helps)
- Start with a small run (1–2 pages) and confirm your selectors
- Log HTML samples when you parse zero records (most bugs are “different page type”)
- Keep sessions sticky: use a single requests.Session() per crawl
- Slow down: 0.8–2.0 seconds between pages is often worth it
- Rotate IPs when blocked: that’s exactly where ProxiesAPI helps
QA checklist
- Fetcher uses timeouts and retries
- Block detection triggers on interstitial pages
- Review parsing returns non-zero results on a known-good page
- Pagination stops cleanly (no loops)
- JSON export writes and is readable
Where ProxiesAPI fits (honestly)
Proxies don’t magically bypass everything—but they make scrapers far more resilient to the normal failure modes: throttling, temporary blocks, and uneven success rates across IPs.
Use ProxiesAPI to keep the networking layer predictable while you focus on the part that actually changes: the DOM.