How to Scrape G2 Software Reviews (Ratings, Pros/Cons) with Python + ProxiesAPI
G2 reviews are one of the most useful public datasets for building:
- lead lists ("who uses what")
- competitive intel (feature gaps, pricing complaints)
- product research dashboards
- AI summarization pipelines (topic clustering, sentiment)
In this guide we’ll build a real Python scraper that extracts software reviews from G2, including:
- overall rating
- review title + body
- pros and cons sections
- reviewer metadata (role, company size where present)
- review date
- pagination across review pages
We’ll also show where ProxiesAPI fits into the fetch layer so the scraper stays reliable when you scale.

G2 is a high-signal dataset, but it’s also a high-friction target. ProxiesAPI helps you keep pagination, retries, and IP rotation consistent as your URL list grows.
What we’re scraping (G2 URL patterns)
G2 product pages usually look like:
- Product overview: `https://www.g2.com/products/<slug>/reviews`
- With pagination: query params change over time, but G2 typically supports a page parameter in the URL or query string.
Because G2’s front-end evolves, the most robust approach is:
- Fetch the reviews page HTML.
- Extract review cards from the DOM.
- For pagination, discover the next page URL from the page itself.
That avoids hard-coding a fragile ?page= convention.
A quick sanity check (HTTP works)
```bash
curl -s "https://www.g2.com/products/slack/reviews" | head -n 20
```
If you see an HTML page with review content, you can proceed.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (with `lxml`) for parsing
- `tenacity` for robust retries
Step 1: A fetch layer you can trust (with ProxiesAPI)
Scrapers fail in boring ways: timeouts, TLS hiccups, occasional 403/429, and inconsistent responses.
So we’ll start with a fetch function that has:
- connect/read timeouts
- retry with exponential backoff
- a consistent User-Agent
- optional ProxiesAPI proxy configuration
Configure ProxiesAPI
ProxiesAPI typically gives you a proxy endpoint + credentials.
Set these env vars:
```bash
export PROXIESAPI_PROXY_URL="http://USERNAME:PASSWORD@gateway.proxiesapi.com:PORT"
```
If you don’t have your proxy URL handy, you can still run the scraper without it.
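To confirm the variable is actually visible to your Python process before you start a long crawl, a tiny check helps. `proxy_mode` is a name we made up purely for illustration:

```python
import os

def proxy_mode() -> str:
    # Illustrative helper: report whether requests would route through
    # the ProxiesAPI gateway or go out directly.
    return "proxy" if os.getenv("PROXIESAPI_PROXY_URL") else "direct"

print(f"Running in {proxy_mode()} mode")
```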
Fetch code
```python
import os
import random
import time
from urllib.parse import urljoin

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

BASE = "https://www.g2.com"
TIMEOUT = (10, 40)  # connect, read

UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

session = requests.Session()

def build_proxies():
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    if not proxy_url:
        return None
    return {
        "http": proxy_url,
        "https": proxy_url,
    }

@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=1, max=20))
def fetch(url: str) -> str:
    headers = {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    }
    r = session.get(
        url,
        headers=headers,
        timeout=TIMEOUT,
        proxies=build_proxies(),
        allow_redirects=True,
    )
    # Handle occasional anti-bot / rate limiting gracefully.
    if r.status_code in (403, 429, 503):
        raise RuntimeError(f"Blocked or rate limited: {r.status_code}")
    r.raise_for_status()
    return r.text

def abs_url(href: str) -> str:
    return href if href.startswith("http") else urljoin(BASE, href)
```
Note the honest reality: ProxiesAPI doesn’t magically bypass all defenses. What it gives you is a more reliable network layer and IP rotation patterns that reduce failure rates when you’re crawling lots of pages.
Step 2: Inspect the HTML (don’t guess selectors)
G2 is a modern web app. You may see:
- review cards in HTML
- embedded JSON data in `<script>` tags
We’ll support both:
- Try to parse structured JSON if present.
- Fall back to extracting from review card HTML.
This “two-lane” approach keeps your scraper alive when front-end markup shifts.
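Lane 1 deserves a sketch. Review sites often embed schema.org `Product` objects (with nested `Review` entries) in `application/ld+json` script tags; whether G2 exposes them, and which fields, is not guaranteed, so treat this as best effort and fall back to HTML parsing whenever it returns nothing:

```python
import json
from bs4 import BeautifulSoup

def parse_reviews_jsonld(html: str) -> list[dict]:
    """Best-effort lane 1: pull schema.org Review objects out of
    <script type="application/ld+json"> blocks, if present."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for tag in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        # The payload may be a single object or a list; normalize both.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            revs = item.get("review") or []
            if isinstance(revs, dict):
                revs = [revs]
            for rev in revs:
                out.append({
                    "title": rev.get("name"),
                    "body": rev.get("reviewBody"),
                    "rating": (rev.get("reviewRating") or {}).get("ratingValue"),
                    "date": rev.get("datePublished"),
                })
    return out
```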
Step 3: Parse reviews from a page
HTML extraction (fallback)
The exact class names on G2 change often, so we focus on stable cues:
- text labels like "Pros" / "Cons"
- star rating blocks (often with an `aria-label`)
- `<time>` tags for dates
Here’s a pragmatic parser that tries multiple selectors.
```python
import re

from bs4 import BeautifulSoup

def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())

def first_text(el) -> str | None:
    if not el:
        return None
    return clean_text(el.get_text(" ", strip=True))

def parse_rating(soup) -> float | None:
    # Common pattern: aria-label="4.5 out of 5"
    star = soup.select_one('[aria-label$="out of 5"]')
    if star:
        m = re.search(r"([0-9.]+)\s+out of\s+5", star.get("aria-label", ""))
        if m:
            return float(m.group(1))
    # Fallback: text like "4.5"
    txt = first_text(soup.select_one("[data-testid*='rating'], .rating, .stars"))
    if txt:
        m = re.search(r"([0-9.]+)", txt)
        if m:
            return float(m.group(1))
    return None

def extract_section(card, label: str) -> str | None:
    # Find a heading whose text matches the label, then grab the
    # adjacent text block. Works across many card layouts.
    heading = None
    for h in card.select("h1,h2,h3,h4,span,div"):
        t = h.get_text(" ", strip=True)
        if t and t.strip().lower() == label.lower():
            heading = h
            break
    if not heading:
        return None
    # Next element after the heading holds the section text.
    nxt = heading.find_next()
    return first_text(nxt) if nxt else None

def parse_reviews_html(html: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")
    # Review cards often share a repeating container; try a few candidates.
    cards = soup.select("[data-testid*='review']")
    if not cards:
        cards = soup.select("article")
    reviews = []
    for card in cards[:50]:
        title = first_text(card.select_one("h3, h4"))
        body = first_text(card.select_one("p"))
        pros = extract_section(card, "Pros")
        cons = extract_section(card, "Cons")
        rating = parse_rating(card)
        time_el = card.select_one("time")
        if time_el and time_el.get("datetime"):
            date = time_el.get("datetime")
        else:
            date = first_text(time_el)
        reviews.append({
            "title": title,
            "body": body,
            "pros": pros,
            "cons": cons,
            "rating": rating,
            "date": date,
        })
    # Discover the next page link
    next_url = None
    next_a = soup.select_one("a[rel='next'], a[aria-label*='Next']")
    if next_a and next_a.get("href"):
        next_url = abs_url(next_a.get("href"))
    return reviews, next_url
```
This isn’t pretty, but it’s resilient: it looks for meaning (Pros/Cons, ratings) rather than brittle CSS class names.
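Before pointing this at live pages, it's worth sanity-checking the label-based idea offline. The snippet below reimplements the `extract_section` logic against invented markup, so no G2 request is involved:

```python
import re
from bs4 import BeautifulSoup

def clean_text(x):
    return re.sub(r"\s+", " ", (x or "").strip())

def extract_section(card, label):
    # Same idea as the parser: find the heading whose text matches
    # the label, then take the next element's text.
    for h in card.select("h1,h2,h3,h4,span,div"):
        if clean_text(h.get_text(" ", strip=True)).lower() == label.lower():
            nxt = h.find_next()
            return clean_text(nxt.get_text(" ", strip=True)) if nxt else None
    return None

# Invented review-card markup, purely for an offline test.
sample = """
<article>
  <h4>Pros</h4><p>Fast onboarding</p>
  <h4>Cons</h4><p>Pricing tiers</p>
</article>
"""
card = BeautifulSoup(sample, "html.parser").article
print(extract_section(card, "Pros"))  # Fast onboarding
print(extract_section(card, "Cons"))  # Pricing tiers
```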
Step 4: Pagination (crawl multiple review pages)
Now we'll crawl pages until:
- we hit `max_pages`
- there's no `next_url`
- the page repeats (loop protection)
```python
def crawl_reviews(start_url: str, max_pages: int = 10, sleep_s: float = 1.5) -> list[dict]:
    url = start_url
    out: list[dict] = []
    seen_urls = set()
    for page in range(1, max_pages + 1):
        if url in seen_urls:
            break
        seen_urls.add(url)
        html = fetch(url)
        reviews, next_url = parse_reviews_html(html)
        for r in reviews:
            r["source_url"] = url
            r["page"] = page
            out.append(r)
        print(f"page {page}: +{len(reviews)} reviews (total {len(out)})")
        if not next_url:
            break
        url = next_url
        time.sleep(sleep_s)
    return out

if __name__ == "__main__":
    product_reviews_url = "https://www.g2.com/products/slack/reviews"
    data = crawl_reviews(product_reviews_url, max_pages=5)
    print("total:", len(data))
```
Step 5: Export clean JSONL (best for pipelines)
JSONL is perfect for large datasets and streaming to data warehouses.
```python
import json

def to_jsonl(path: str, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Example
# rows = crawl_reviews("https://www.g2.com/products/slack/reviews", max_pages=5)
# to_jsonl("g2_reviews.jsonl", rows)
```
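When you consume the file later (loading into pandas, feeding a summarization pipeline), a matching reader is worth having. A minimal sketch; skipping blank lines keeps a trailing newline from raising `JSONDecodeError`:

```python
import json

def from_jsonl(path: str) -> list[dict]:
    # Stream line by line; skip blanks so a trailing newline or an
    # accidental empty line doesn't break the load.
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows
```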
Practical notes (what will break first)
- G2 can change markup frequently. Keep the parser modular and add more selector fallbacks.
- Reviews are not always fully in HTML. Sometimes only partial text is rendered.
- Rate limits happen. Keep a delay, retry on 403/429/503, and rotate IPs.
- Respect ToS and robots policies. Only scrape what you’re allowed to.
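One concrete failure mode worth guarding against: while you crawl, new reviews can push older ones onto the next page, so the same card may appear twice across pages. A small post-crawl dedupe pass catches this. The `(title, body, date)` key is our own heuristic, not something G2 guarantees to be unique:

```python
def dedupe_reviews(rows: list[dict]) -> list[dict]:
    # Drop exact repeats; adjust the key if your export carries
    # reviewer metadata that makes a better identity.
    seen = set()
    out = []
    for r in rows:
        key = (r.get("title"), r.get("body"), r.get("date"))
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out
```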
Where ProxiesAPI fits (honestly)
When you:
- crawl many product slugs
- paginate deep into review history
- run daily refreshes
…your request volume becomes the problem.
ProxiesAPI helps you:
- rotate exit IPs (reduce repetitive request patterns)
- standardize proxy configuration across environments
- keep retries from turning into total failure
It doesn’t replace good engineering: timeouts, backoff, caching, and structured exports.
QA checklist
- Page 1 extracts non-empty titles/bodies
- Pros/Cons appear for at least some reviews
- Pagination advances and doesn’t loop
- Exported JSONL loads cleanly
- You can rerun without hanging (timeouts + retries)