Scrape Product Comparisons from CNET (Python + ProxiesAPI)

CNET comparison pages are a goldmine for structured data:

  • product names and variants
  • key specs (screen size, battery, CPU, etc.)
  • pros/cons summaries
  • editorial scoring (sometimes)

The great part: much of this content is already structured as tables or repeated blocks.

The hard part: doing it reliably across multiple comparison pages, categories, and updates.

In this guide we’ll build a scraper that:

  1. Fetches a CNET comparison page
  2. Extracts a normalized “products” list
  3. Extracts comparison tables/spec blocks into a consistent schema
  4. Handles pagination / discovery (optional)
  5. Adds retries + timeouts + ProxiesAPI rotation
  6. Exports JSON/CSV-ready output

We’ll also include a screenshot step so your guide (or internal runbook) stays verifiable.

Make comparison-table scraping reliable with ProxiesAPI

Comparison pages look simple, but production crawls fail on blocks, throttling, and flaky responses. ProxiesAPI helps keep your fetch layer stable so your table parser can do its job.


What we’re scraping

CNET has multiple page templates. A “comparison” page typically contains:

  • a header describing the comparison
  • product tiles/cards
  • one or more spec tables (sometimes multiple sections)

Because templates change, we’ll write parsers that:

  • try a few common selectors
  • fall back to extracting HTML tables generically
  • always return a dataset you can inspect
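The try-a-few-selectors-then-fall-back idea can be sketched independently of any particular selector. The parser names in the usage note are placeholders, not real CNET selectors:

```python
def first_non_empty(parsers, html):
    """Run each parser in order and return the first non-empty result.

    A broken selector (raised exception) or an empty result just moves
    us to the next parser, so the scraper degrades instead of crashing.
    """
    for parse in parsers:
        try:
            result = parse(html)
        except Exception:
            continue  # a broken selector shouldn't kill the run
        if result:
            return result
    return []  # always return something you can inspect
```

Pass parsers in order of specificity, e.g. `first_non_empty([parse_spec_cards, parse_generic_tables], html)` (both names hypothetical).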

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv pandas
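The client in Step 1 reads a PROXIESAPI_KEY environment variable, loaded via python-dotenv. A minimal .env sketch (the key value is a placeholder):

```shell
# Create a .env next to your script; load_dotenv() picks it up at startup.
# "your_key_here" is a placeholder -- substitute your real ProxiesAPI key.
echo 'PROXIESAPI_KEY=your_key_here' > .env
```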

Step 1: HTTP client with ProxiesAPI support

import os
import random
from dataclasses import dataclass

import requests
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

load_dotenv()

TIMEOUT = (10, 30)

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


class HttpClient:
    def __init__(self, use_proxiesapi: bool = True):
        self.session = requests.Session()
        self.use_proxiesapi = use_proxiesapi
        self.session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
                "Connection": "keep-alive",
            }
        )

    def _proxies(self):
        if not self.use_proxiesapi:
            return None
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY")
        proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
        return {"http": proxy, "https": proxy}

    @retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))
    def get(self, url: str) -> FetchResult:
        self.session.headers["User-Agent"] = random.choice(USER_AGENTS)
        r = self.session.get(url, timeout=TIMEOUT, proxies=self._proxies())
        # Raise on retryable statuses so tenacity actually retries them;
        # everything else is returned for the caller to inspect.
        if r.status_code in (403, 429) or r.status_code >= 500:
            r.raise_for_status()
        return FetchResult(url=url, status_code=r.status_code, text=r.text)

Step 2: Parse comparison content (generic + resilient)

We’ll combine two approaches:

  1. Generic HTML table parsing (works when CNET renders a <table>)
  2. Key/value spec block parsing (works when specs are cards/divs)

import re
from bs4 import BeautifulSoup


def soupify(html: str) -> BeautifulSoup:
    return BeautifulSoup(html, "lxml")


def clean_text(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())


def html_table_to_rows(table) -> list[list[str]]:
    rows = []
    for tr in table.select("tr"):
        cells = [clean_text(td.get_text(" ", strip=True)) for td in tr.select("th, td")]
        if any(cells):
            rows.append(cells)
    return rows


def extract_tables(html: str) -> list[dict]:
    soup = soupify(html)
    out = []

    for i, table in enumerate(soup.select("table")):
        rows = html_table_to_rows(table)
        if len(rows) < 2:
            continue

        out.append({
            "index": i,
            "rows": rows,
        })

    return out


def extract_product_names(html: str) -> list[str]:
    soup = soupify(html)

    # Try a few generic headings within product cards.
    names = []
    for el in soup.select("h2, h3, a"):
        t = clean_text(el.get_text(" ", strip=True))
        if not t:
            continue
        # Heuristic: skip nav/footer noise
        if len(t) < 4 or len(t) > 80:
            continue
        # Comparison product names usually have brand/model patterns.
        if any(x in t.lower() for x in ["sign in", "newsletter", "privacy", "terms"]):
            continue
        names.append(t)

    # de-dupe while preserving order
    seen = set()
    uniq = []
    for n in names:
        if n in seen:
            continue
        seen.add(n)
        uniq.append(n)

    return uniq[:20]

Because templates vary, this won’t perfectly isolate the products, but it gives you a baseline dataset you can refine with more specific selectors as you encounter them.
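One cheap refinement is to keep only strings that look like brand-plus-model names. The regex below is a generic heuristic, not a CNET-specific rule:

```python
import re

# Heuristic: keep strings that start with a capitalized brand-like token
# followed by something containing a digit or another capital (model hints).
MODEL_HINT = re.compile(r"^[A-Z][\w-]+ .*(\d|[A-Z])")


def likely_products(names):
    """Filter a name list down to strings that resemble product names."""
    return [n for n in names if MODEL_HINT.search(n)]
```

Spot-check the output and tighten the pattern for the categories you actually crawl; a phone-focused crawl can afford a stricter regex than a general one.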


Step 3: Normalize into a dataset you can actually use

A practical schema is:

  • source_url
  • products[] (names + optional identifiers)
  • tables[] (each with rows)

Later, you can transform tables[].rows into a tidy DataFrame.

import json


def scrape_cnet_comparison(url: str, use_proxiesapi: bool = True) -> dict:
    client = HttpClient(use_proxiesapi=use_proxiesapi)
    res = client.get(url)

    data = {
        "source_url": url,
        "status_code": res.status_code,
        "products": extract_product_names(res.text),
        "tables": extract_tables(res.text),
    }

    return data


def main():
    url = "https://www.cnet.com/"  # replace with a real comparison URL

    data = scrape_cnet_comparison(url, use_proxiesapi=True)

    with open("cnet_comparison.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    print("status:", data["status_code"])
    print("products:", len(data["products"]))
    print("tables:", len(data["tables"]))


if __name__ == "__main__":
    main()

Turning table rows into a DataFrame (example)

If you have at least one table, you can do something like:

import json

import pandas as pd

with open("cnet_comparison.json", "r", encoding="utf-8") as f:
    data = json.load(f)

if data["tables"]:
    rows = data["tables"][0]["rows"]
    header = rows[0]
    body = rows[1:]

    df = pd.DataFrame(body, columns=header)
    print(df.head())

    df.to_csv("cnet_table_0.csv", index=False)
    print("wrote cnet_table_0.csv")
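Spec tables usually put products in the columns and spec names in the first column. For analysis, a long (“tidy”) shape — one record per (product, spec, value) — is often easier to work with. A stdlib-only sketch, assuming that row layout:

```python
def rows_to_long(rows):
    """Convert [header, *body] table rows into (product, spec, value) records.

    Assumes the first header cell labels the spec column and the remaining
    header cells are product names.
    """
    header, body = rows[0], rows[1:]
    products = header[1:]
    records = []
    for row in body:
        spec, values = row[0], row[1:]
        for product, value in zip(products, values):
            records.append({"product": product, "spec": spec, "value": value})
    return records
```

The resulting list of dicts loads directly into `pd.DataFrame(records)` if you want to pivot or group later.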

Screenshot workflow (optional but recommended)

For tutorial posts, screenshots remove ambiguity. The simplest workflow is:

  1. Open the URL in a browser
  2. Take a screenshot of the comparison header + product cards
  3. Save under:

public/images/posts/{slug}/cnet-comparison.jpg

You can automate this with headless-browser tooling (Playwright, Puppeteer, and similar) and save the image alongside your post.


Where ProxiesAPI fits (honestly)

CNET comparison pages aren’t the hardest targets on the web, but crawls break for boring reasons:

  • inconsistent responses (timeouts)
  • throttling during bursts
  • periodic 403/429s

ProxiesAPI doesn’t replace good engineering—it complements it by:

  • reducing the chance that one IP gets rate-limited mid-run
  • keeping your request success rate high during pagination
  • making scheduled crawls more stable

QA checklist

  • HTML fetch uses timeouts + retries
  • At least one table parsed into rows[]
  • products[] list looks plausible (spot-check)
  • Screenshot saved under /public/images/posts/{slug}/
  • Export JSON is valid and readable
