Scrape Product Comparisons from CNET (Python + ProxiesAPI)
CNET comparison pages are a goldmine for structured data:
- product names and variants
- key specs (screen size, battery, CPU, etc.)
- pros/cons summaries
- editorial scoring (sometimes)
The great part: much of this content is already structured as tables or repeated blocks.
The hard part: doing it reliably across multiple comparison pages, categories, and updates.
In this guide we’ll build a scraper that:
- Fetches a CNET comparison page
- Extracts a normalized “products” list
- Extracts comparison tables/spec blocks into a consistent schema
- Handles pagination / discovery (optional)
- Adds retries + timeouts + ProxiesAPI rotation
- Exports JSON/CSV-ready output
We’ll also include a screenshot step so your guide (or internal runbooks) stays verifiable.
Comparison pages look simple, but production crawls fail on blocks, throttling, and flaky responses. ProxiesAPI helps keep your fetch layer stable so your table parser can do its job.
What we’re scraping
CNET has multiple page templates. A “comparison” page typically contains:
- a header describing the comparison
- product tiles/cards
- one or more spec tables (sometimes multiple sections)
Because templates change, we’ll write parsers that:
- try a few common selectors
- fall back to extracting HTML tables generically
- always return a dataset you can inspect
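The fallback idea is simple: try extractors in priority order and keep the first non-empty result. Here is a minimal sketch of that pattern, with the actual selector calls stubbed out as lambdas so it runs standalone:

```python
def first_nonempty(candidates):
    """Try extractors in priority order; return the first non-empty result."""
    for extract in candidates:
        result = extract()
        if result:
            return result
    return []  # every strategy failed: caller still gets an inspectable (empty) list

# Stubbed example: the template-specific selector misses, the generic fallback hits.
names = first_nonempty([
    lambda: [],                      # e.g. a CNET-specific CSS selector that broke
    lambda: ["Phone A", "Phone B"],  # e.g. the generic <table> fallback
])
print(names)  # ['Phone A', 'Phone B']
```

In the real scraper, each lambda would wrap a `soup.select(...)` call for one template variant.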
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv pandas
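The fetch layer below reads your ProxiesAPI key from the environment; the simplest setup is a `.env` file in the project root, which python-dotenv loads at startup (replace the placeholder with your own key):

```
PROXIESAPI_KEY=your_key_here
```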
Step 1: HTTP client with ProxiesAPI support
import os
import random
from dataclasses import dataclass

import requests
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

load_dotenv()

TIMEOUT = (10, 30)  # (connect, read) seconds

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]

@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str

class HttpClient:
    def __init__(self, use_proxiesapi: bool = True):
        self.session = requests.Session()
        self.use_proxiesapi = use_proxiesapi
        self.session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
                "Connection": "keep-alive",
            }
        )

    def _proxies(self):
        if not self.use_proxiesapi:
            return None
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY")
        proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
        return {"http": proxy, "https": proxy}

    @retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))
    def get(self, url: str) -> FetchResult:
        # Rotate the User-Agent on every attempt, including retries.
        self.session.headers["User-Agent"] = random.choice(USER_AGENTS)
        r = self.session.get(url, timeout=TIMEOUT, proxies=self._proxies())
        # Treat throttling/server errors as retryable so tenacity backs off and retries;
        # without this, only network exceptions would trigger a retry.
        if r.status_code in (403, 429) or r.status_code >= 500:
            r.raise_for_status()
        return FetchResult(url=url, status_code=r.status_code, text=r.text)
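For intuition, the tenacity decorator behaves roughly like this hand-rolled loop (a simplified sketch, not tenacity's actual implementation; the demo uses tiny delays so it runs instantly):

```python
import random
import time

def get_with_retries(fetch, attempts=5, base=1.0, cap=20.0):
    """Retry fetch() with capped exponential backoff plus jitter."""
    for i in range(attempts):
        try:
            return fetch()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base * 2^i), capped, with random jitter.
            time.sleep(min(cap, base * (2 ** i)) * random.random())

calls = {"n": 0}

def flaky_fetch():
    """Simulates a fetch that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return "<html>ok</html>"

print(get_with_retries(flaky_fetch, base=0.01, cap=0.05))  # succeeds on the third attempt
```

The real client gets the same behavior from `@retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))`.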
Step 2: Parse comparison content (generic + resilient)
We’ll combine two approaches:
- Generic HTML table parsing (works when CNET renders a <table>)
- Key/value spec block parsing (works when specs are cards/divs)
import re

from bs4 import BeautifulSoup

def soupify(html: str) -> BeautifulSoup:
    return BeautifulSoup(html, "lxml")

def clean_text(s: str) -> str:
    # Collapse runs of whitespace so cell text compares cleanly.
    return re.sub(r"\s+", " ", (s or "").strip())

def html_table_to_rows(table) -> list[list[str]]:
    rows = []
    for tr in table.select("tr"):
        cells = [clean_text(td.get_text(" ", strip=True)) for td in tr.select("th, td")]
        if any(cells):
            rows.append(cells)
    return rows

def extract_tables(html: str) -> list[dict]:
    soup = soupify(html)
    out = []
    for i, table in enumerate(soup.select("table")):
        rows = html_table_to_rows(table)
        if len(rows) < 2:  # skip empty/layout tables
            continue
        out.append({
            "index": i,
            "rows": rows,
        })
    return out

def extract_product_names(html: str) -> list[str]:
    soup = soupify(html)
    # Try a few generic headings within product cards.
    names = []
    for el in soup.select("h2, h3, a"):
        t = clean_text(el.get_text(" ", strip=True))
        if not t:
            continue
        # Heuristic: skip nav/footer noise.
        if len(t) < 4 or len(t) > 80:
            continue
        # Comparison product names usually have brand/model patterns.
        if any(x in t.lower() for x in ["sign in", "newsletter", "privacy", "terms"]):
            continue
        names.append(t)
    # De-dupe while preserving order.
    seen = set()
    uniq = []
    for n in names:
        if n in seen:
            continue
        seen.add(n)
        uniq.append(n)
    return uniq[:20]
This won’t perfectly isolate only the products (because templates vary), but it gives you a baseline dataset that you can refine with more specific selectors as you encounter them.
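If you want a zero-dependency sanity check of the generic-table fallback, the same row-extraction idea can be sketched with the stdlib `HTMLParser` (a teaching sketch; the scraper itself uses BeautifulSoup as above):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect th/td text into rows, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = []
        self._cell = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
            # Collapse whitespace, mirroring clean_text() above.
            self._row.append(" ".join("".join(self._cell).split()))
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

parser = TableRows()
parser.feed(
    "<table><tr><th>Spec</th><th>Phone A</th></tr>"
    "<tr><td>Battery</td><td>5000 mAh</td></tr></table>"
)
print(parser.rows)  # [['Spec', 'Phone A'], ['Battery', '5000 mAh']]
```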
Step 3: Normalize into a dataset you can actually use
A practical schema is:
- source_url
- products[] (names + optional identifiers)
- tables[] (each with rows)
Later, you can transform tables[].rows into a tidy DataFrame.
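Concretely, the exported JSON will look like this (values illustrative):

```json
{
  "source_url": "https://www.cnet.com/...",
  "status_code": 200,
  "products": ["Phone A", "Phone B"],
  "tables": [
    {
      "index": 0,
      "rows": [
        ["Spec", "Phone A", "Phone B"],
        ["Battery", "5000 mAh", "4500 mAh"]
      ]
    }
  ]
}
```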
import json

def scrape_cnet_comparison(url: str, use_proxiesapi: bool = True) -> dict:
    client = HttpClient(use_proxiesapi=use_proxiesapi)
    res = client.get(url)
    data = {
        "source_url": url,
        "status_code": res.status_code,
        "products": extract_product_names(res.text),
        "tables": extract_tables(res.text),
    }
    return data

def main():
    url = "https://www.cnet.com/"  # replace with a real comparison URL
    data = scrape_cnet_comparison(url, use_proxiesapi=True)
    with open("cnet_comparison.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print("status:", data["status_code"])
    print("products:", len(data["products"]))
    print("tables:", len(data["tables"]))

if __name__ == "__main__":
    main()
Turning table rows into a DataFrame (example)
If you have at least one table, you can do something like:
import json

import pandas as pd

with open("cnet_comparison.json", "r", encoding="utf-8") as f:
    data = json.load(f)

if data["tables"]:
    rows = data["tables"][0]["rows"]
    header = rows[0]  # first row doubles as the column header
    body = rows[1:]   # note: assumes every row has as many cells as the header
    df = pd.DataFrame(body, columns=header)
    print(df.head())
    df.to_csv("cnet_table_0.csv", index=False)
    print("wrote cnet_table_0.csv")
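If you'd rather not pull in pandas for a quick reshape, the same rows can be flattened into long-form records (one record per product/spec pair) in plain Python:

```python
def rows_to_long(rows):
    """Reshape a spec table (header row + spec rows) into long-form records."""
    header = rows[0]
    records = []
    for row in rows[1:]:
        spec = row[0]
        # Pair each remaining cell with its product column from the header.
        for product, value in zip(header[1:], row[1:]):
            records.append({"product": product, "spec": spec, "value": value})
    return records

rows = [
    ["Spec", "Phone A", "Phone B"],
    ["Battery", "5000 mAh", "4500 mAh"],
]
print(rows_to_long(rows))
# [{'product': 'Phone A', 'spec': 'Battery', 'value': '5000 mAh'},
#  {'product': 'Phone B', 'spec': 'Battery', 'value': '4500 mAh'}]
```

This long format is also what you'd feed `pd.DataFrame(records)` for grouping or pivoting later.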
Screenshot workflow (mandatory for A-track)
For tutorial posts, screenshots remove ambiguity. The simplest workflow is:
- Open the URL in a browser
- Take a screenshot of the comparison header + product cards
- Save under:
public/images/posts/{slug}/cnet-comparison.jpg
In the next section we’ll do this with the project’s browser tooling and save it inside the repo.
Where ProxiesAPI fits (honestly)
CNET comparison pages aren’t the hardest targets on the web, but crawls break for boring reasons:
- inconsistent responses (timeouts)
- throttling during bursts
- periodic 403/429s
ProxiesAPI doesn’t replace good engineering; it complements it by:
- reducing the chance that one IP gets rate-limited mid-run
- keeping your request success rate high during pagination
- making scheduled crawls more stable
QA checklist
- HTML fetch uses timeouts + retries
- At least one table parsed into rows[]
- products[] list looks plausible (spot-check)
- Screenshot saved under /public/images/posts/{slug}/
- Export JSON is valid and readable