Web Scraping with TypeScript in 2026: Playwright + Cheerio End-to-End Guide
If you’re scraping in TypeScript, a combination that holds up well in 2026 is:
- Playwright for navigation + rendering (handles JS-heavy pages)
- Cheerio for parsing HTML fast (jQuery-like selectors, no browser required)
This guide gives you an end-to-end blueprint you can reuse across sites:
- define a URL queue
- fetch pages (rendered or plain)
- parse with Cheerio selectors
- normalize records
- export JSON/CSV
- add guardrails (retries, backoff, dedupe)
Most scraping failures are network failures. Keep extraction logic clean, and make reliability a fetch-layer concern (retries, timeouts, and optional ProxiesAPI).
When to use Playwright vs Cheerio
Use Playwright when:
- the page is JS-rendered (content missing from raw HTML)
- pagination needs clicks / XHR
- you need to scroll / wait for content
Use Cheerio-only when:
- the site is server-rendered (fast!)
- you’re processing thousands of pages where browser rendering is too slow
The workflow here uses both: Playwright fetches a fully rendered HTML snapshot, then Cheerio parses it.
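One quick way to make this call for a new site is to fetch the raw HTML once and check whether the text you want is already in it. The sketch below assumes you know a few marker strings that should appear on the finished page; `pickStrategy` and the sample HTML are illustrative names, not part of Playwright or Cheerio.

```typescript
type Strategy = "cheerio-only" | "playwright";

// Hypothetical helper: decides how to fetch a site from one raw HTML sample.
// `markers` are strings you expect on the finished page (a product name, a CSS class, ...).
function pickStrategy(rawHtml: string, markers: string[]): Strategy {
  const found = markers.every((m) => rawHtml.includes(m));
  // If the content is already in the server response, a browser is unnecessary.
  return found ? "cheerio-only" : "playwright";
}

// Server-rendered sample: the listing text is present in the raw HTML.
const serverRendered = '<ul><li class="card">Blue Widget</li></ul>';
// Client-rendered sample: only an empty mount point and a script tag.
const clientRendered = '<div id="root"></div><script src="/app.js"></script>';

console.log(pickStrategy(serverRendered, ["Blue Widget"])); // cheerio-only
console.log(pickStrategy(clientRendered, ["Blue Widget"])); // playwright
```

A substring check is crude, but it is usually enough to tell you whether a browser is needed before you write any selectors.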
Project setup
mkdir ts-scraper && cd ts-scraper
npm init -y
npm i playwright cheerio p-limit csv-stringify
npm i -D typescript tsx @types/node
npx playwright install
We’ll use:
- playwright for rendering
- cheerio for parsing
- p-limit for concurrency limits
- csv-stringify for CSV output
Step 1: Define your “fetch” layer (rendered HTML snapshot)
Keep a single function responsible for:
- timeouts
- retries
- (optional) proxy / ProxiesAPI usage
import { chromium } from "playwright";

type FetchOptions = {
  timeoutMs?: number;
  maxRetries?: number;
};

export async function fetchRenderedHtml(url: string, opts: FetchOptions = {}): Promise<string> {
  const timeoutMs = opts.timeoutMs ?? 45_000;
  const maxRetries = opts.maxRetries ?? 3;
  let lastErr: unknown = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    // A fresh browser per attempt keeps a retry isolated from a crashed or stuck instance.
    const browser = await chromium.launch({ headless: true });
    try {
      const page = await browser.newPage({
        userAgent:
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        viewport: { width: 1280, height: 720 },
      });
      await page.goto(url, { timeout: timeoutMs, waitUntil: "domcontentloaded" });
      // Minimal wait helps on lazy-loaded content without turning into “sleep 10s”.
      await page.waitForTimeout(500);
      const html = await page.content();
      if (!html || html.length < 2000) throw new Error("Suspiciously small HTML");
      return html;
    } catch (e) {
      lastErr = e;
    } finally {
      // Runs on success and on failure; closing the browser also closes its pages.
      await browser.close().catch(() => {});
    }

    // Exponential backoff with jitter, skipped once no attempts remain.
    if (attempt < maxRetries) {
      const backoffMs = Math.min(8000, 2 ** (attempt - 1) * 600) + Math.random() * 250;
      await new Promise((r) => setTimeout(r, backoffMs));
    }
  }

  throw new Error(`Failed to fetch after ${maxRetries} attempts: ${String(lastErr)}`);
}
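To see what the backoff formula in the retry loop produces, here is the same arithmetic pulled into a small function. This is only a worked example: `backoffMs` as a standalone function is not part of the fetch code, and jitter is passed in as a parameter (default 0) so the numbers are reproducible.

```typescript
// Same formula as the retry loop, with jitter passed in so the result is deterministic.
function backoffMs(attempt: number, jitterMs = 0): number {
  const baseMs = 600;
  const capMs = 8000;
  return Math.min(capMs, 2 ** (attempt - 1) * baseMs) + jitterMs;
}

// Delay that follows each failed attempt, ignoring jitter.
const schedule = [1, 2, 3, 4, 5, 6].map((attempt) => backoffMs(attempt));
console.log(schedule); // [600, 1200, 2400, 4800, 8000, 8000]
```

With the default of three attempts only the first entries matter. The 8-second cap starts to apply from the fifth attempt, so raising maxRetries does not make waits grow without bound.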
Where proxies fit
If you need proxies, put them here (fetch-layer):
- Playwright supports per-context proxies
- or you can fetch through a proxy-backed service and still parse the returned HTML the same way
Keep parsing pure and boring.
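As a sketch of the first option: Playwright's proxy option takes an object with a server string and optional username and password. The helper below builds that object from a single proxy URL, which is a common way to keep credentials in one environment variable. `proxyFromUrl` and the example host are hypothetical.

```typescript
// Shape of Playwright's `proxy` option (server is required; credentials are optional).
type ProxySettings = { server: string; username?: string; password?: string };

// Hypothetical helper: turns "http://user:pass@host:port" into proxy settings.
function proxyFromUrl(proxyUrl: string): ProxySettings {
  const u = new URL(proxyUrl);
  const settings: ProxySettings = { server: `${u.protocol}//${u.host}` };
  // Credentials in a URL are percent-encoded, so decode them before use.
  if (u.username) settings.username = decodeURIComponent(u.username);
  if (u.password) settings.password = decodeURIComponent(u.password);
  return settings;
}

const proxy = proxyFromUrl("http://scraper:s3cret@proxy.example.com:8080");
console.log(proxy.server); // http://proxy.example.com:8080

// Inside the fetch layer this could be passed as:
//   const context = await browser.newContext({ proxy });
//   const page = await context.newPage();
```

Because the proxy settings are built and applied entirely inside the fetch layer, the parser never learns whether a proxy was involved.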
Step 2: Parse with Cheerio (fast selectors)
import * as cheerio from "cheerio";

export type Item = {
  title: string;
  url: string;
  price?: string | null;
};

export function parseListing(html: string, baseUrl: string): Item[] {
  const $ = cheerio.load(html);
  const items: Item[] = [];

  // Replace selectors with the target site’s structure.
  $("a").each((_, el) => {
    const href = $(el).attr("href");
    const title = $(el).text().trim();
    if (!href || !title) return;

    // Resolve relative links against the page URL; skip hrefs that fail to parse.
    let abs: string;
    try {
      abs = new URL(href, baseUrl).toString();
    } catch {
      return;
    }
    items.push({ title, url: abs });
  });

  return items;
}
This is intentionally generic. Your real scraper should use site-specific selectors:
- “card” containers
- title link selector
- price selector
- pagination selector
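A price selector usually returns display text rather than a number, so the "normalize records" step from the blueprint needs a small conversion. A minimal sketch, assuming US-style formatting (comma as thousands separator, dot as decimal point); `parsePrice` is a hypothetical helper and other locales would need different rules.

```typescript
// Hypothetical normalizer: "$1,299.00" -> 1299, "Free" -> null.
// Assumes "," separates thousands and "." marks decimals.
function parsePrice(raw: string | null | undefined): number | null {
  if (!raw) return null;
  // Keep digits and the decimal point; drop currency symbols, commas, and whitespace.
  const cleaned = raw.replace(/[^0-9.]/g, "");
  if (cleaned === "" || cleaned === ".") return null;
  const value = Number(cleaned);
  // Strings such as "1.2.3" do not parse to a finite number and are rejected.
  return Number.isFinite(value) ? value : null;
}

console.log(parsePrice("$1,299.00")); // 1299
console.log(parsePrice(" USD 45 ")); // 45
console.log(parsePrice("Free")); // null
```

Returning null instead of 0 for unparseable text keeps "no price found" distinguishable from a genuinely free item.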
Step 3: Queue design (dedupe + concurrency)
You want three guarantees:
- dedupe URLs
- limit concurrency (avoid bans)
- isolate failures (one bad URL doesn’t kill the run)
import pLimit from "p-limit";
import { fetchRenderedHtml } from "./fetch";
import { parseListing, type Item } from "./parse";

const limit = pLimit(3); // start low

export async function crawl(urls: string[]): Promise<Item[]> {
  // Dedupe up front so repeated URLs never reach the fetch layer.
  const unique = Array.from(new Set(urls));
  const out: Item[] = [];

  const tasks = unique.map((url) =>
    limit(async () => {
      const html = await fetchRenderedHtml(url, { maxRetries: 3 });
      out.push(...parseListing(html, url));
    })
  );

  // allSettled isolates failures: one rejected task does not abort the others.
  const results = await Promise.allSettled(tasks);
  results.forEach((r, i) => {
    if (r.status === "rejected") console.warn(`Failed: ${unique[i]}: ${String(r.reason)}`);
  });

  return out;
}
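The crawl function above works on a fixed list of URLs. If pages reveal more URLs as you go (pagination links, for example), a small queue with a seen-set keeps the dedupe guarantee while the list grows. This is a sketch, not part of the code above; `UrlFrontier` is a hypothetical name.

```typescript
// Hypothetical in-memory queue: each URL is handed out at most once.
class UrlFrontier {
  private seen = new Set<string>();
  private pending: string[] = [];

  // Returns true if the URL was new and got queued, false if it was seen before.
  add(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Next URL in first-in, first-out order, or undefined when the queue is drained.
  next(): string | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

const frontier = new UrlFrontier();
frontier.add("https://example.com/listing-page-1");
frontier.add("https://example.com/listing-page-2");
frontier.add("https://example.com/listing-page-1"); // duplicate, ignored
console.log(frontier.size); // 2
```

A URL stays in the seen-set after it has been handed out, so a page that links back to an earlier page cannot re-queue it.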
Step 4: Export JSON and CSV
import { stringify } from "csv-stringify/sync";
import { writeFileSync } from "node:fs";

export function exportData(rows: any[], slug: string) {
  writeFileSync(`${slug}.json`, JSON.stringify(rows, null, 2), "utf-8");
  const csv = stringify(rows, { header: true });
  writeFileSync(`${slug}.csv`, csv, "utf-8");
}
A practical “starter” main script
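import { crawl } from "./crawl";
import { exportData } from "./export";

const URLS = [
  "https://example.com/listing-page-1",
  "https://example.com/listing-page-2",
];

// Wrapped in a function so the script also runs in a CommonJS package
// (the npm init -y default), where top-level await may be unavailable.
async function main() {
  const rows = await crawl(URLS);
  console.log("rows:", rows.length);
  exportData(rows, "ts_scrape_out");
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});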
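Save the modules above as src/fetch.ts, src/parse.ts, src/crawl.ts, src/export.ts, and src/main.ts so the relative imports resolve, then run:
npx tsx src/main.ts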
Common failure modes (and fixes)
| Symptom | Likely cause | Fix |
|---|---|---|
| HTML too small | JS not loaded / blocked | wait for a selector, lower concurrency, retry in the fetch layer |
| Random timeouts | flaky network | retries + backoff in fetch layer |
| Getting blocked | too many requests | lower concurrency, add delays, rotate proxies |
| Duplicate rows | URL variants | normalize URLs + dedupe by canonical key |
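For the last row, "normalize URLs + dedupe by canonical key" can be sketched with the built-in URL class. The list of tracking parameters and the trailing-slash rule are assumptions to adjust per site; `canonicalUrl` and `dedupeByUrl` are hypothetical helper names.

```typescript
// Hypothetical canonicalizer. The tracking-parameter list is an assumption; adjust per site.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "ref"];

function canonicalUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = ""; // fragments are never sent to the server
  for (const p of TRACKING_PARAMS) u.searchParams.delete(p);
  u.searchParams.sort(); // ?b=2&a=1 and ?a=1&b=2 become the same key
  // Drop a trailing slash except on the root path.
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}

// Keeps the first row seen for each canonical URL.
function dedupeByUrl<T extends { url: string }>(rows: T[]): T[] {
  const seen = new Set<string>();
  return rows.filter((row) => {
    const key = canonicalUrl(row.url);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const sampleRows = [
  { title: "Blue Widget", url: "https://example.com/item/1?utm_source=mail" },
  { title: "Blue Widget", url: "https://example.com/item/1/#reviews" },
  { title: "Red Widget", url: "https://example.com/item/2?b=2&a=1" },
];
console.log(dedupeByUrl(sampleRows).length); // 2
```

Some sites treat a trailing slash or a query parameter as meaningful, so check a few real URLs before applying rules like these to a whole crawl.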
Where ProxiesAPI fits (honestly)
If your scraper is cleanly structured as:
fetch → parse → normalize → export
…then ProxiesAPI is just a fetch-layer swap for harder targets.
Don’t tie your parser to your proxy provider. Keep it boring, testable, and easy to evolve.