How to Scrape G2 Software Reviews (Ratings, Pros/Cons) with Python + ProxiesAPI

G2 reviews are one of the most useful public datasets for building:

  • lead lists ("who uses what")
  • competitive intel (feature gaps, pricing complaints)
  • product research dashboards
  • AI summarization pipelines (topic clustering, sentiment)

In this guide we’ll build a real Python scraper that extracts software reviews from G2, including:

  • overall rating
  • review title + body
  • pros and cons sections
  • reviewer metadata (role, company size where present)
  • review date
  • pagination across review pages

We’ll also show where ProxiesAPI fits into the fetch layer so the scraper stays reliable when you scale.

G2 reviews page (we’ll scrape rating + pros/cons + review body)

Make review scraping stable with ProxiesAPI

G2 is a high-signal dataset, but it’s also a high-friction target. ProxiesAPI helps you keep pagination, retries, and IP rotation consistent as your URL list grows.


What we’re scraping (G2 URL patterns)

G2 product pages usually look like:

  • Product overview: https://www.g2.com/products/<slug>/reviews
  • With pagination: the exact query parameters change over time, but G2 typically exposes a page parameter in the URL path or query string.

Because G2’s front-end evolves, the most robust approach is:

  1. Fetch the reviews page HTML.
  2. Extract review cards from the DOM.
  3. For pagination, discover the next page URL from the page itself.

That avoids hard-coding a fragile ?page= convention.

A quick sanity check (HTTP works)

curl -s "https://www.g2.com/products/slack/reviews" | head -n 20

If you see an HTML page with review content, you can proceed. If you get a 403 or a challenge page instead, that's expected friction; the retry and proxy layer below is built for exactly that.
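
You can script the same check in Python. The looks_like_reviews_page heuristic and its marker list below are our own invention, purely for a quick smoke test:

```python
def looks_like_reviews_page(html: str) -> bool:
    """Cheap heuristic: does the HTML contain review-like markers?"""
    lowered = html.lower()
    return any(marker in lowered for marker in ("review", "pros", "cons"))

# Live check (requires network; may 403 if bot protection kicks in):
# import requests
# r = requests.get("https://www.g2.com/products/slack/reviews", timeout=20)
# print(r.status_code, looks_like_reviews_page(r.text))
```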


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup (with the lxml parser) for parsing
  • tenacity for robust retries
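
Tenacity will handle the retry policy for us below. If you're curious what it's doing, here's the same behavior sketched by hand; this is a simplified stand-in for illustration, not tenacity's actual implementation:

```python
import random
import time


def retry_with_backoff(fn, attempts=6, base=1.0, cap=20.0, jitter=1.0):
    """Roughly what tenacity's retry + wait_exponential_jitter gives us:
    retry on any exception, sleep base * 2**n plus random jitter
    (capped at `cap` seconds), and re-raise after the last attempt."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise
            time.sleep(min(cap, base * (2 ** n)) + random.uniform(0, jitter))
```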

Step 1: A fetch layer you can trust (with ProxiesAPI)

Scrapers fail in boring ways: timeouts, TLS hiccups, occasional 403/429, and inconsistent responses.

So we’ll start with a fetch function that has:

  • connect/read timeouts
  • retry with exponential backoff
  • a consistent User-Agent
  • optional ProxiesAPI proxy configuration

Configure ProxiesAPI

ProxiesAPI typically gives you a proxy endpoint + credentials.

Set these env vars:

export PROXIESAPI_PROXY_URL="http://USERNAME:PASSWORD@gateway.proxiesapi.com:PORT"

If you don’t have your proxy URL handy, you can still run the scraper without it.

Fetch code

import os
import random
import time
from urllib.parse import urljoin

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

BASE = "https://www.g2.com"
TIMEOUT = (10, 40)  # connect, read

UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

session = requests.Session()


def build_proxies():
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    if not proxy_url:
        return None
    return {
        "http": proxy_url,
        "https": proxy_url,
    }


@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=1, max=20))
def fetch(url: str) -> str:
    headers = {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    }

    r = session.get(
        url,
        headers=headers,
        timeout=TIMEOUT,
        proxies=build_proxies(),
        allow_redirects=True,
    )

    # Handle occasional anti-bot / rate limiting gracefully.
    if r.status_code in (403, 429, 503):
        raise RuntimeError(f"Blocked or rate limited: {r.status_code}")

    r.raise_for_status()
    return r.text


def abs_url(href: str) -> str:
    return href if href.startswith("http") else urljoin(BASE, href)

Note the honest reality: ProxiesAPI doesn’t magically bypass all defenses. What it gives you is a more reliable network layer and IP rotation patterns that reduce failure rates when you’re crawling lots of pages.


Step 2: Inspect the HTML (don’t guess selectors)

G2 is a modern web app. You may see:

  • review cards in HTML
  • embedded JSON data in <script> tags

We’ll support both:

  1. Try to parse structured JSON if present.
  2. Fall back to extracting from review card HTML.

This “two-lane” approach keeps your scraper alive when front-end markup shifts.
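
Here's a sketch of the first lane: pulling reviews out of JSON-LD <script> blocks with stdlib tools only. The field names follow schema.org's Review type; whether G2 actually emits this structure (and in what exact shape) can change, so treat it as opportunistic:

```python
import json
import re

LDJSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def parse_reviews_jsonld(html: str) -> list[dict]:
    """Lane 1: opportunistically extract reviews from JSON-LD blocks.
    Best-effort: returns an empty list if the structure isn't there."""
    reviews = []
    for blob in LDJSON_RE.findall(html):
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue
        # schema.org Product pages nest reviews under "review"
        items = data.get("review", []) if isinstance(data, dict) else []
        if isinstance(items, dict):
            items = [items]
        for item in items:
            rating = (item.get("reviewRating") or {}).get("ratingValue")
            reviews.append({
                "title": item.get("name"),
                "body": item.get("reviewBody"),
                "rating": float(rating) if rating is not None else None,
                "date": item.get("datePublished"),
            })
    return reviews
```

If this lane returns an empty list, fall through to the HTML parser in Step 3.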


Step 3: Parse reviews from a page

HTML extraction (fallback)

The exact class names on G2 can change. So we focus on stable cues:

  • text labels like "Pros" / "Cons"
  • star rating blocks (often with aria-label)
  • time tags for dates

Here’s a pragmatic parser that tries multiple selectors.

import re
from bs4 import BeautifulSoup


def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())


def first_text(el) -> str | None:
    if not el:
        return None
    return clean_text(el.get_text(" ", strip=True))


def parse_rating(soup: BeautifulSoup) -> float | None:
    # Common pattern: aria-label="4.5 out of 5"
    star = soup.select_one('[aria-label$="out of 5"]')
    if star:
        m = re.search(r"([0-9.]+)\s+out of\s+5", star.get("aria-label", ""))
        if m:
            return float(m.group(1))

    # Fallback: text like "4.5"
    txt = first_text(soup.select_one("[data-testid*='rating'], .rating, .stars"))
    if txt:
        m = re.search(r"([0-9.]+)", txt)
        if m:
            return float(m.group(1))

    return None


def extract_section(card: BeautifulSoup, label: str) -> str | None:
    # Find heading containing label, then grab adjacent text.
    # Works across many card layouts.
    heading = None
    for h in card.select("h1,h2,h3,h4,span,div"):
        t = h.get_text(" ", strip=True)
        if t and t.strip().lower() == label.lower():
            heading = h
            break

    if not heading:
        return None

    # Grab the text block that follows the label. find_next_sibling()
    # skips the heading's own children; if the label is wrapped one
    # level deep, fall back to the parent's next sibling.
    nxt = heading.find_next_sibling()
    if nxt is None and heading.parent is not None:
        nxt = heading.parent.find_next_sibling()

    return first_text(nxt)


def parse_reviews_html(html: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")

    # Review cards often have a repeating container; we try a few candidates.
    cards = soup.select("[data-testid*='review']")
    if not cards:
        cards = soup.select("article")

    reviews = []
    for card in cards[:50]:
        title = first_text(card.select_one("h3, h4"))
        body = first_text(card.select_one("p"))

        pros = extract_section(card, "Pros")
        cons = extract_section(card, "Cons")

        rating = parse_rating(card)

        date = None
        time_el = card.select_one("time")
        if time_el:
            date = time_el.get("datetime") or first_text(time_el)

        reviews.append({
            "title": title,
            "body": body,
            "pros": pros,
            "cons": cons,
            "rating": rating,
            "date": date,
        })

    # Discover next page link
    next_url = None
    next_a = soup.select_one("a[rel='next'], a[aria-label*='Next']")
    if next_a and next_a.get("href"):
        next_url = abs_url(next_a.get("href"))

    return reviews, next_url

This isn’t pretty, but it’s resilient: it looks for meaning (Pros/Cons, ratings) rather than brittle CSS class names.
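
Before pointing the parser at live pages, it's worth exercising the heading-label strategy on a synthetic card. This is a condensed restatement of extract_section (same idea, trimmed for the test):

```python
import re

from bs4 import BeautifulSoup


def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())


def extract_section(card, label: str):
    # Same strategy as above: find the element whose text IS the label,
    # then take the text of its next sibling.
    for h in card.select("h1,h2,h3,h4,span,div"):
        if clean_text(h.get_text(" ", strip=True)).lower() == label.lower():
            sib = h.find_next_sibling()
            return clean_text(sib.get_text(" ", strip=True)) if sib else None
    return None


card = BeautifulSoup(
    "<article><div>Pros</div><p>Fast and easy to use.</p>"
    "<div>Cons</div><p>Search could be better.</p></article>",
    "html.parser",
).article

print(extract_section(card, "Pros"))  # Fast and easy to use.
print(extract_section(card, "Cons"))  # Search could be better.
```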


Step 4: Pagination (crawl multiple review pages)

Now we’ll crawl pages until:

  • we hit max_pages
  • there’s no next_url
  • the page repeats (loop protection)


def crawl_reviews(start_url: str, max_pages: int = 10, sleep_s: float = 1.5) -> list[dict]:
    url = start_url
    out: list[dict] = []
    seen_urls = set()

    for page in range(1, max_pages + 1):
        if url in seen_urls:
            break
        seen_urls.add(url)

        html = fetch(url)
        reviews, next_url = parse_reviews_html(html)

        for r in reviews:
            r["source_url"] = url
            r["page"] = page
            out.append(r)

        print(f"page {page}: +{len(reviews)} reviews (total {len(out)})")

        if not next_url:
            break

        url = next_url
        time.sleep(sleep_s)

    return out


if __name__ == "__main__":
    product_reviews_url = "https://www.g2.com/products/slack/reviews"
    data = crawl_reviews(product_reviews_url, max_pages=5)
    print("total:", len(data))

Step 5: Export clean JSONL (best for pipelines)

JSONL is perfect for large datasets and streaming to data warehouses.

import json


def to_jsonl(path: str, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Example
# rows = crawl_reviews("https://www.g2.com/products/slack/reviews", max_pages=5)
# to_jsonl("g2_reviews.jsonl", rows)
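
For the QA checklist later, a round-trip loader helps verify the export parses cleanly. read_jsonl is our own small helper, mirroring to_jsonl above:

```python
import json


def read_jsonl(path: str) -> list[dict]:
    """Load a JSONL export back into memory, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```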

Practical notes (what will break first)

  1. G2 can change markup frequently. Keep the parser modular and add more selector fallbacks.
  2. Reviews are not always fully in HTML. Sometimes only partial text is rendered.
  3. Rate limits happen. Keep a delay, retry on 403/429/503, and rotate IPs.
  4. Respect ToS and robots policies. Only scrape what you’re allowed to.
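
On point 1: a small helper keeps selector fallbacks declarative, so adding one is a one-line change instead of another if/else branch. select_first is our own convenience wrapper, not part of BeautifulSoup:

```python
def select_first(card, selectors):
    """Try CSS selectors in order; return the first matching element."""
    for sel in selectors:
        el = card.select_one(sel)
        if el is not None:
            return el
    return None

# e.g. in the parser:
# title = first_text(select_first(card, ["h3", "h4", "[itemprop='name']"]))
```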

Where ProxiesAPI fits (honestly)

When you:

  • crawl many product slugs
  • paginate deep into review history
  • run daily refreshes

…your request volume becomes the problem.

ProxiesAPI helps you:

  • rotate exit IPs (reduce repetitive request patterns)
  • standardize proxy configuration across environments
  • keep retries from turning into total failure

It doesn’t replace good engineering: timeouts, backoff, caching, and structured exports.


QA checklist

  • Page 1 extracts non-empty titles/bodies
  • Pros/Cons appear for at least some reviews
  • Pagination advances and doesn’t loop
  • Exported JSONL loads cleanly
  • You can rerun without hanging (timeouts + retries)

Related guides

Scrape Product Comparisons from CNET (Python + ProxiesAPI)
Collect CNET comparison tables and spec blocks, normalize the data into a clean dataset, and keep the crawl stable with retries + ProxiesAPI. Includes screenshot workflow.
Scrape Glassdoor Salaries and Reviews (Python + ProxiesAPI)
Extract Glassdoor company reviews and salary ranges more reliably: discover URLs, handle pagination, keep sessions consistent, rotate proxies when blocked, and export clean JSON.
Scrape Restaurant Data from TripAdvisor (Reviews, Ratings, and Locations)
Build a practical TripAdvisor scraper in Python: discover restaurant listing URLs, extract name/rating/review count/address, and export clean CSV/JSON with ProxiesAPI in the fetch layer.
How to Scrape Walmart Grocery Prices with Python (Search + Product Pages)
Build a practical Walmart grocery price scraper: search for items, follow product links, extract price/size/availability, and export clean JSON. Includes ProxiesAPI integration, retries, and selector fallbacks.