Scrape Product Comparisons from CNET (Python + ProxiesAPI)
CNET comparison pages are a goldmine for structured data:
- product names and variants
- key specs (screen size, battery, CPU, etc.)
- pros/cons summaries
- editorial scoring (sometimes)
The great part: much of this content is already structured as tables or repeated blocks.
The hard part: doing it reliably across multiple comparison pages, categories, and updates.
In this guide we’ll build a scraper that:
- Fetches a CNET comparison page
- Extracts a normalized “products” list
- Extracts comparison tables/spec blocks into a consistent schema
- Handles pagination / discovery (optional)
- Adds retries + timeouts + ProxiesAPI rotation
- Exports JSON/CSV-ready output
We’ll also include a screenshot step so your guide (or internal runbooks) stays verifiable.
Comparison pages look simple, but production crawls fail on blocks, throttling, and flaky responses. ProxiesAPI helps keep your fetch layer stable so your table parser can do its job.
What we’re scraping
CNET has multiple page templates. A “comparison” page typically contains:
- a header describing the comparison
- product tiles/cards
- one or more spec tables (sometimes multiple sections)
Because templates change, we’ll write parsers that:
- try a few common selectors
- fall back to extracting HTML tables generically
- always return a dataset you can inspect
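The fallback idea is simple: try extractors in priority order and keep the first non-empty result. Here is a minimal sketch of that pattern, with the actual selector calls stubbed out as lambdas so it runs standalone:

```python
def first_nonempty(candidates):
    """Try extractors in priority order; return the first non-empty result."""
    for extract in candidates:
        result = extract()
        if result:
            return result
    return []  # every strategy failed: caller still gets an inspectable (empty) list

# Stubbed example: the template-specific selector misses, the generic fallback hits.
names = first_nonempty([
    lambda: [],                      # e.g. a CNET-specific CSS selector that broke
    lambda: ["Phone A", "Phone B"],  # e.g. the generic <table> fallback
])
print(names)  # ['Phone A', 'Phone B']
```

In the real scraper, each lambda would wrap a `soup.select(...)` call for one template variant.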
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv pandas
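The fetch layer below reads your ProxiesAPI key from the environment; the simplest setup is a `.env` file in the project root, which python-dotenv loads at startup (replace the placeholder with your own key):

```
PROXIESAPI_KEY=your_key_here
```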
Step 1: HTTP client with ProxiesAPI support
import os
import random
from dataclasses import dataclass

import requests
from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

load_dotenv()

TIMEOUT = (10, 30)  # (connect, read) seconds

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]

@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str

class HttpClient:
    def __init__(self, use_proxiesapi: bool = True):
        self.session = requests.Session()
        self.use_proxiesapi = use_proxiesapi
        self.session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
                "Connection": "keep-alive",
            }
        )

    def _proxies(self):
        if not self.use_proxiesapi:
            return None
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY")
        proxy = f"http://{key}:@proxy.proxiesapi.com:10000"
        return {"http": proxy, "https": proxy}

    @retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))
    def get(self, url: str) -> FetchResult:
        # Rotate the User-Agent on every attempt, including retries.
        self.session.headers["User-Agent"] = random.choice(USER_AGENTS)
        r = self.session.get(url, timeout=TIMEOUT, proxies=self._proxies())
        # Treat throttling/server errors as retryable so tenacity backs off and retries;
        # without this, only network exceptions would trigger a retry.
        if r.status_code in (403, 429) or r.status_code >= 500:
            r.raise_for_status()
        return FetchResult(url=url, status_code=r.status_code, text=r.text)
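For intuition, the tenacity decorator behaves roughly like this hand-rolled loop (a simplified sketch, not tenacity's actual implementation; the demo uses tiny delays so it runs instantly):

```python
import random
import time

def get_with_retries(fetch, attempts=5, base=1.0, cap=20.0):
    """Retry fetch() with capped exponential backoff plus jitter."""
    for i in range(attempts):
        try:
            return fetch()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base * 2^i), capped, with random jitter.
            time.sleep(min(cap, base * (2 ** i)) * random.random())

calls = {"n": 0}

def flaky_fetch():
    """Simulates a fetch that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return "<html>ok</html>"

print(get_with_retries(flaky_fetch, base=0.01, cap=0.05))  # succeeds on the third attempt
```

The real client gets the same behavior from `@retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=20))`.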
Step 2: Parse comparison content (generic + resilient)
We’ll combine two approaches:
- Generic HTML table parsing (works when CNET renders a <table>)
- Key/value spec block parsing (works when specs are cards/divs)
import re

from bs4 import BeautifulSoup

def soupify(html: str) -> BeautifulSoup:
    return BeautifulSoup(html, "lxml")

def clean_text(s: str) -> str:
    # Collapse runs of whitespace so cell text compares cleanly.
    return re.sub(r"\s+", " ", (s or "").strip())

def html_table_to_rows(table) -> list[list[str]]:
    rows = []
    for tr in table.select("tr"):
        cells = [clean_text(td.get_text(" ", strip=True)) for td in tr.select("th, td")]
        if any(cells):
            rows.append(cells)
    return rows

def extract_tables(html: str) -> list[dict]:
    soup = soupify(html)
    out = []
    for i, table in enumerate(soup.select("table")):
        rows = html_table_to_rows(table)
        if len(rows) < 2:  # skip empty/layout tables
            continue
        out.append({
            "index": i,
            "rows": rows,
        })
    return out

def extract_product_names(html: str) -> list[str]:
    soup = soupify(html)
    # Try a few generic headings within product cards.
    names = []
    for el in soup.select("h2, h3, a"):
        t = clean_text(el.get_text(" ", strip=True))
        if not t:
            continue
        # Heuristic: skip nav/footer noise.
        if len(t) < 4 or len(t) > 80:
            continue
        # Comparison product names usually have brand/model patterns.
        if any(x in t.lower() for x in ["sign in", "newsletter", "privacy", "terms"]):
            continue
        names.append(t)
    # De-dupe while preserving order.
    seen = set()
    uniq = []
    for n in names:
        if n in seen:
            continue
        seen.add(n)
        uniq.append(n)
    return uniq[:20]
This won’t perfectly isolate only the products (because templates vary), but it gives you a baseline dataset that you can refine with more specific selectors as you encounter them.
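If you want a zero-dependency sanity check of the generic-table fallback, the same row-extraction idea can be sketched with the stdlib `HTMLParser` (a teaching sketch; the scraper itself uses BeautifulSoup as above):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect th/td text into rows, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = []
        self._cell = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
            # Collapse whitespace, mirroring clean_text() above.
            self._row.append(" ".join("".join(self._cell).split()))
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

parser = TableRows()
parser.feed(
    "<table><tr><th>Spec</th><th>Phone A</th></tr>"
    "<tr><td>Battery</td><td>5000 mAh</td></tr></table>"
)
print(parser.rows)  # [['Spec', 'Phone A'], ['Battery', '5000 mAh']]
```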
Step 3: Normalize into a dataset you can actually use
A practical schema is:
- source_url
- products[] (names + optional identifiers)
- tables[] (each with rows)
Later, you can transform tables[].rows into a tidy DataFrame.
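Concretely, the exported JSON will look like this (values illustrative):

```json
{
  "source_url": "https://www.cnet.com/...",
  "status_code": 200,
  "products": ["Phone A", "Phone B"],
  "tables": [
    {
      "index": 0,
      "rows": [
        ["Spec", "Phone A", "Phone B"],
        ["Battery", "5000 mAh", "4500 mAh"]
      ]
    }
  ]
}
```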
import json

def scrape_cnet_comparison(url: str, use_proxiesapi: bool = True) -> dict:
    client = HttpClient(use_proxiesapi=use_proxiesapi)
    res = client.get(url)
    data = {
        "source_url": url,
        "status_code": res.status_code,
        "products": extract_product_names(res.text),
        "tables": extract_tables(res.text),
    }
    return data

def main():
    url = "https://www.cnet.com/"  # replace with a real comparison URL
    data = scrape_cnet_comparison(url, use_proxiesapi=True)
    with open("cnet_comparison.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print("status:", data["status_code"])
    print("products:", len(data["products"]))
    print("tables:", len(data["tables"]))

if __name__ == "__main__":
    main()
Turning table rows into a DataFrame (example)
If you have at least one table, you can do something like:
import json

import pandas as pd

with open("cnet_comparison.json", "r", encoding="utf-8") as f:
    data = json.load(f)

if data["tables"]:
    rows = data["tables"][0]["rows"]
    header = rows[0]  # first row doubles as the column header
    body = rows[1:]   # note: assumes every row has as many cells as the header
    df = pd.DataFrame(body, columns=header)
    print(df.head())
    df.to_csv("cnet_table_0.csv", index=False)
    print("wrote cnet_table_0.csv")
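If you'd rather not pull in pandas for a quick reshape, the same rows can be flattened into long-form records (one record per product/spec pair) in plain Python:

```python
def rows_to_long(rows):
    """Reshape a spec table (header row + spec rows) into long-form records."""
    header = rows[0]
    records = []
    for row in rows[1:]:
        spec = row[0]
        # Pair each remaining cell with its product column from the header.
        for product, value in zip(header[1:], row[1:]):
            records.append({"product": product, "spec": spec, "value": value})
    return records

rows = [
    ["Spec", "Phone A", "Phone B"],
    ["Battery", "5000 mAh", "4500 mAh"],
]
print(rows_to_long(rows))
# [{'product': 'Phone A', 'spec': 'Battery', 'value': '5000 mAh'},
#  {'product': 'Phone B', 'spec': 'Battery', 'value': '4500 mAh'}]
```

This long format is also what you'd feed `pd.DataFrame(records)` for grouping or pivoting later.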
Screenshot workflow (mandatory for A-track)
For tutorial posts, screenshots remove ambiguity. The simplest workflow is:
- Open the URL in a browser
- Take a screenshot of the comparison header + product cards
- Save under:
public/images/posts/{slug}/cnet-comparison.jpg
In the next section we’ll do this with the project’s browser tooling and save it inside the repo.
Where ProxiesAPI fits (honestly)
CNET comparison pages aren’t the hardest targets on the web, but crawls break for boring reasons:
- inconsistent responses (timeouts)
- throttling during bursts
- periodic 403/429s
ProxiesAPI doesn’t replace good engineering; it complements it by:
- reducing the chance that one IP gets rate-limited mid-run
- keeping your request success rate high during pagination
- making scheduled crawls more stable
QA checklist
- HTML fetch uses timeouts + retries
- At least one table parsed into rows[]
- products[] list looks plausible (spot-check)
- Screenshot saved under /public/images/posts/{slug}/
- Export JSON is valid and readable