How to Scrape G2 Software Reviews (Ratings, Pros/Cons) with Python + ProxiesAPI
G2 reviews are one of the most useful public datasets for building:
- lead lists ("who uses what")
- competitive intel (feature gaps, pricing complaints)
- product research dashboards
- AI summarization pipelines (topic clustering, sentiment)
In this guide we’ll build a real Python scraper that extracts software reviews from G2, including:
- overall rating
- review title + body
- pros and cons sections
- reviewer metadata (role, company size where present)
- review date
- pagination across review pages
We’ll also show where ProxiesAPI fits into the fetch layer so the scraper stays reliable when you scale.

G2 is a high-signal dataset, but it’s also a high-friction target. ProxiesAPI helps you keep pagination, retries, and IP rotation consistent as your URL list grows.
What we’re scraping (G2 URL patterns)
G2 product pages usually look like:
- Product overview: `https://www.g2.com/products/<slug>/reviews`
- With pagination: query params change over time, but G2 typically supports a page parameter in the URL or query string.
Because G2’s front-end evolves, the most robust approach is:
- Fetch the reviews page HTML.
- Extract review cards from the DOM.
- For pagination, discover the next page URL from the page itself.
That avoids hard-coding a fragile ?page= convention.
A quick sanity check (HTTP works)
```bash
curl -s "https://www.g2.com/products/slack/reviews" | head -n 20
```
If you see an HTML page with review content, you can proceed.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (with `lxml`) for parsing
- `tenacity` for robust retries
Step 1: A fetch layer you can trust (with ProxiesAPI)
Scrapers fail in boring ways: timeouts, TLS hiccups, occasional 403/429, and inconsistent responses.
So we’ll start with a fetch function that has:
- connect/read timeouts
- retry with exponential backoff
- a consistent User-Agent
- optional ProxiesAPI proxy configuration
Configure ProxiesAPI
ProxiesAPI typically gives you a proxy endpoint + credentials.
Set these env vars:
```bash
export PROXIESAPI_PROXY_URL="http://USERNAME:PASSWORD@gateway.proxiesapi.com:PORT"
```
If you don’t have your proxy URL handy, you can still run the scraper without it.
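To confirm the variable is actually visible to your Python process before you start a long crawl, a tiny check helps. `proxy_mode` is a name we made up purely for illustration:

```python
import os

def proxy_mode() -> str:
    # Illustrative helper: report whether requests would route through
    # the ProxiesAPI gateway or go out directly.
    return "proxy" if os.getenv("PROXIESAPI_PROXY_URL") else "direct"

print(f"Running in {proxy_mode()} mode")
```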
Fetch code
```python
import os
import random
import time
from urllib.parse import urljoin

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

BASE = "https://www.g2.com"
TIMEOUT = (10, 40)  # connect, read

UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

session = requests.Session()

def build_proxies():
    proxy_url = os.getenv("PROXIESAPI_PROXY_URL")
    if not proxy_url:
        return None
    return {
        "http": proxy_url,
        "https": proxy_url,
    }

@retry(stop=stop_after_attempt(6), wait=wait_exponential_jitter(initial=1, max=20))
def fetch(url: str) -> str:
    headers = {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    }
    r = session.get(
        url,
        headers=headers,
        timeout=TIMEOUT,
        proxies=build_proxies(),
        allow_redirects=True,
    )
    # Handle occasional anti-bot / rate limiting gracefully.
    if r.status_code in (403, 429, 503):
        raise RuntimeError(f"Blocked or rate limited: {r.status_code}")
    r.raise_for_status()
    return r.text

def abs_url(href: str) -> str:
    return href if href.startswith("http") else urljoin(BASE, href)
```
Note the honest reality: ProxiesAPI doesn’t magically bypass all defenses. What it gives you is a more reliable network layer and IP rotation patterns that reduce failure rates when you’re crawling lots of pages.
Step 2: Inspect the HTML (don’t guess selectors)
G2 is a modern web app. You may see:
- review cards in HTML
- embedded JSON data in `<script>` tags
We’ll support both:
- Try to parse structured JSON if present.
- Fall back to extracting from review card HTML.
This “two-lane” approach keeps your scraper alive when front-end markup shifts.
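Lane 1 deserves a sketch. Review sites often embed schema.org `Product` objects (with nested `Review` entries) in `application/ld+json` script tags; whether G2 exposes them, and which fields, is not guaranteed, so treat this as best effort and fall back to HTML parsing whenever it returns nothing:

```python
import json
from bs4 import BeautifulSoup

def parse_reviews_jsonld(html: str) -> list[dict]:
    """Best-effort lane 1: pull schema.org Review objects out of
    <script type="application/ld+json"> blocks, if present."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for tag in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        # The payload may be a single object or a list; normalize both.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            revs = item.get("review") or []
            if isinstance(revs, dict):
                revs = [revs]
            for rev in revs:
                out.append({
                    "title": rev.get("name"),
                    "body": rev.get("reviewBody"),
                    "rating": (rev.get("reviewRating") or {}).get("ratingValue"),
                    "date": rev.get("datePublished"),
                })
    return out
```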
Step 3: Parse reviews from a page
HTML extraction (fallback)
The exact class names on G2 change often, so we focus on stable cues:
- text labels like "Pros" / "Cons"
- star rating blocks (often with an `aria-label`)
- `<time>` tags for dates
Here’s a pragmatic parser that tries multiple selectors.
```python
import re

from bs4 import BeautifulSoup

def clean_text(x: str) -> str:
    return re.sub(r"\s+", " ", (x or "").strip())

def first_text(el) -> str | None:
    if not el:
        return None
    return clean_text(el.get_text(" ", strip=True))

def parse_rating(soup) -> float | None:
    # Common pattern: aria-label="4.5 out of 5"
    star = soup.select_one('[aria-label$="out of 5"]')
    if star:
        m = re.search(r"([0-9.]+)\s+out of\s+5", star.get("aria-label", ""))
        if m:
            return float(m.group(1))
    # Fallback: text like "4.5"
    txt = first_text(soup.select_one("[data-testid*='rating'], .rating, .stars"))
    if txt:
        m = re.search(r"([0-9.]+)", txt)
        if m:
            return float(m.group(1))
    return None

def extract_section(card, label: str) -> str | None:
    # Find a heading whose text matches the label, then grab the
    # adjacent text block. Works across many card layouts.
    heading = None
    for h in card.select("h1,h2,h3,h4,span,div"):
        t = h.get_text(" ", strip=True)
        if t and t.strip().lower() == label.lower():
            heading = h
            break
    if not heading:
        return None
    # Next element after the heading holds the section text.
    nxt = heading.find_next()
    return first_text(nxt) if nxt else None

def parse_reviews_html(html: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")
    # Review cards often share a repeating container; try a few candidates.
    cards = soup.select("[data-testid*='review']")
    if not cards:
        cards = soup.select("article")
    reviews = []
    for card in cards[:50]:
        title = first_text(card.select_one("h3, h4"))
        body = first_text(card.select_one("p"))
        pros = extract_section(card, "Pros")
        cons = extract_section(card, "Cons")
        rating = parse_rating(card)
        time_el = card.select_one("time")
        if time_el and time_el.get("datetime"):
            date = time_el.get("datetime")
        else:
            date = first_text(time_el)
        reviews.append({
            "title": title,
            "body": body,
            "pros": pros,
            "cons": cons,
            "rating": rating,
            "date": date,
        })
    # Discover the next page link
    next_url = None
    next_a = soup.select_one("a[rel='next'], a[aria-label*='Next']")
    if next_a and next_a.get("href"):
        next_url = abs_url(next_a.get("href"))
    return reviews, next_url
```
This isn’t pretty, but it’s resilient: it looks for meaning (Pros/Cons, ratings) rather than brittle CSS class names.
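Before pointing this at live pages, it's worth sanity-checking the label-based idea offline. The snippet below reimplements the `extract_section` logic against invented markup, so no G2 request is involved:

```python
import re
from bs4 import BeautifulSoup

def clean_text(x):
    return re.sub(r"\s+", " ", (x or "").strip())

def extract_section(card, label):
    # Same idea as the parser: find the heading whose text matches
    # the label, then take the next element's text.
    for h in card.select("h1,h2,h3,h4,span,div"):
        if clean_text(h.get_text(" ", strip=True)).lower() == label.lower():
            nxt = h.find_next()
            return clean_text(nxt.get_text(" ", strip=True)) if nxt else None
    return None

# Invented review-card markup, purely for an offline test.
sample = """
<article>
  <h4>Pros</h4><p>Fast onboarding</p>
  <h4>Cons</h4><p>Pricing tiers</p>
</article>
"""
card = BeautifulSoup(sample, "html.parser").article
print(extract_section(card, "Pros"))  # Fast onboarding
print(extract_section(card, "Cons"))  # Pricing tiers
```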
Step 4: Pagination (crawl multiple review pages)
Now we'll crawl pages until:
- we hit `max_pages`
- there's no `next_url`
- the page repeats (loop protection)
```python
def crawl_reviews(start_url: str, max_pages: int = 10, sleep_s: float = 1.5) -> list[dict]:
    url = start_url
    out: list[dict] = []
    seen_urls = set()
    for page in range(1, max_pages + 1):
        if url in seen_urls:
            break
        seen_urls.add(url)
        html = fetch(url)
        reviews, next_url = parse_reviews_html(html)
        for r in reviews:
            r["source_url"] = url
            r["page"] = page
            out.append(r)
        print(f"page {page}: +{len(reviews)} reviews (total {len(out)})")
        if not next_url:
            break
        url = next_url
        time.sleep(sleep_s)
    return out

if __name__ == "__main__":
    product_reviews_url = "https://www.g2.com/products/slack/reviews"
    data = crawl_reviews(product_reviews_url, max_pages=5)
    print("total:", len(data))
```
Step 5: Export clean JSONL (best for pipelines)
JSONL is perfect for large datasets and streaming to data warehouses.
```python
import json

def to_jsonl(path: str, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Example
# rows = crawl_reviews("https://www.g2.com/products/slack/reviews", max_pages=5)
# to_jsonl("g2_reviews.jsonl", rows)
```
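When you consume the file later (loading into pandas, feeding a summarization pipeline), a matching reader is worth having. A minimal sketch; skipping blank lines keeps a trailing newline from raising `JSONDecodeError`:

```python
import json

def from_jsonl(path: str) -> list[dict]:
    # Stream line by line; skip blanks so a trailing newline or an
    # accidental empty line doesn't break the load.
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows
```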
Practical notes (what will break first)
- G2 can change markup frequently. Keep the parser modular and add more selector fallbacks.
- Reviews are not always fully in HTML. Sometimes only partial text is rendered.
- Rate limits happen. Keep a delay, retry on 403/429/503, and rotate IPs.
- Respect ToS and robots policies. Only scrape what you’re allowed to.
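One concrete failure mode worth guarding against: while you crawl, new reviews can push older ones onto the next page, so the same card may appear twice across pages. A small post-crawl dedupe pass catches this. The `(title, body, date)` key is our own heuristic, not something G2 guarantees to be unique:

```python
def dedupe_reviews(rows: list[dict]) -> list[dict]:
    # Drop exact repeats; adjust the key if your export carries
    # reviewer metadata that makes a better identity.
    seen = set()
    out = []
    for r in rows:
        key = (r.get("title"), r.get("body"), r.get("date"))
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out
```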
Where ProxiesAPI fits (honestly)
When you:
- crawl many product slugs
- paginate deep into review history
- run daily refreshes
…your request volume becomes the problem.
ProxiesAPI helps you:
- rotate exit IPs (reduce repetitive request patterns)
- standardize proxy configuration across environments
- keep retries from turning into total failure
It doesn’t replace good engineering: timeouts, backoff, caching, and structured exports.
QA checklist
- Page 1 extracts non-empty titles/bodies
- Pros/Cons appear for at least some reviews
- Pagination advances and doesn’t loop
- Exported JSONL loads cleanly
- You can rerun without hanging (timeouts + retries)