Web Scraping with TypeScript in 2026: Playwright + Cheerio End-to-End Guide

If you’re scraping in TypeScript, the winning combo in 2026 is:

  • Playwright for navigation + rendering (handles JS-heavy pages)
  • Cheerio for parsing HTML fast (jQuery-like selectors, no browser required)

This guide gives you an end-to-end blueprint you can reuse across sites:

  1. define a URL queue
  2. fetch pages (rendered or plain)
  3. parse with Cheerio selectors
  4. normalize records
  5. export JSON/CSV
  6. add guardrails (retries, backoff, dedupe)

When the crawl gets flaky, move stability into the fetch layer

Most scraping failures are network failures. Keep extraction logic clean, and make reliability a fetch-layer concern (retries, timeouts, and optional ProxiesAPI).


When to use Playwright vs Cheerio

Use Playwright when:

  • the page is JS-rendered (content missing from raw HTML)
  • pagination needs clicks / XHR
  • you need to scroll / wait for content

Use Cheerio-only when:

  • the site is server-rendered (fast!)
  • you’re processing thousands of pages where browser rendering is too slow

The workflow here uses both: Playwright fetches a fully rendered HTML snapshot, then Cheerio parses it.
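
One way to make the choice per site is to fetch the raw HTML once and check whether the content you care about is already in it. The sketch below is a rough heuristic, not a definitive test: `needsRendering` and the `marker` argument are hypothetical names, and the marker is whatever string you expect to see on a fully loaded page.

```typescript
// Heuristic only: reports whether the raw HTML lacks the expected content.
// `marker` is a hypothetical string that appears on a fully loaded page
// (a known product name, a class attribute used by listing cards, ...).
function needsRendering(rawHtml: string, marker: string): boolean {
  // Drop <script> and <style> bodies so a marker that only appears inside
  // bundled JavaScript or CSS does not count as rendered content.
  const visible = rawHtml
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, "");
  return !visible.includes(marker);
}
```

If this returns false for a handful of sample pages, a Cheerio-only pipeline is probably enough for that site; if it returns true, route the site through Playwright.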


Project setup

mkdir ts-scraper && cd ts-scraper
npm init -y
npm i playwright cheerio p-limit csv-stringify
npm i -D typescript tsx @types/node
npx playwright install

We’ll use:

  • playwright for rendering
  • cheerio for parsing
  • p-limit for concurrency limits
  • csv-stringify for CSV output

Step 1: Define your “fetch” layer (rendered HTML snapshot)

Keep a single function responsible for:

  • timeouts
  • retries
  • (optional) proxy / ProxiesAPI usage

import { chromium } from "playwright";

type FetchOptions = {
  timeoutMs?: number;
  maxRetries?: number;
};

export async function fetchRenderedHtml(url: string, opts: FetchOptions = {}): Promise<string> {
  const timeoutMs = opts.timeoutMs ?? 45_000;
  const maxRetries = opts.maxRetries ?? 3;

  let lastErr: unknown = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const browser = await chromium.launch({ headless: true });
    const page = await browser.newPage({
      userAgent:
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
      viewport: { width: 1280, height: 720 },
    });

    try {
      await page.goto(url, { timeout: timeoutMs, waitUntil: "domcontentloaded" });
      // Minimal wait helps on lazy-loaded content without turning into “sleep 10s”.
      await page.waitForTimeout(500);

      const html = await page.content();
      if (!html || html.length < 2000) throw new Error("Suspiciously small HTML");
      return html;
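    } catch (e) {
      lastErr = e;
    } finally {
      // Always release the browser, on success and on failure.
      await browser.close().catch(() => {});
    }

    // Back off before the next attempt (exponential, capped, with jitter).
    // Skipped after the final attempt, where there is nothing left to wait for.
    if (attempt < maxRetries) {
      const backoffMs = Math.min(8000, 2 ** (attempt - 1) * 600) + Math.random() * 250;
      await new Promise((r) => setTimeout(r, backoffMs));
    }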
  }

  throw new Error(`Failed to fetch after ${maxRetries} attempts: ${String(lastErr)}`);
}
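
To see what the retry loop above actually waits, it can help to print the backoff schedule without the random jitter. This is a small illustration of the same formula; `baseBackoffMs` is a hypothetical helper name, and the 600 ms base and 8000 ms cap are the values used in the loop.

```typescript
// Same formula as the retry loop, minus the random jitter,
// so the schedule is deterministic and easy to inspect.
function baseBackoffMs(attempt: number, baseMs = 600, capMs = 8000): number {
  return Math.min(capMs, 2 ** (attempt - 1) * baseMs);
}

const schedule = [1, 2, 3, 4, 5].map((attempt) => baseBackoffMs(attempt));
console.log(schedule); // [600, 1200, 2400, 4800, 8000]
```

The fifth value would be 9600 ms uncapped, so the cap only starts to matter if you raise maxRetries above the default of 3.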

Where proxies fit

If you need proxies, put them here (fetch-layer):

  • Playwright supports per-context proxies
  • or you can fetch through a proxy-backed service and still parse the returned HTML the same way

Keep parsing pure and boring.
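
As a sketch of the per-context option: Playwright accepts a `proxy` object with `server`, `username`, and `password` fields on `browser.newContext()`. The helper below builds that object from a single proxy URL. The name `proxyFromUrl` and the `PROXY_URL` environment variable are assumptions made for illustration, not part of Playwright.

```typescript
// Shape of the `proxy` option that Playwright accepts on browser.newContext().
type ProxyOptions = { server: string; username?: string; password?: string };

// Input is a hypothetical proxy URL such as "http://user:pass@proxy.example.com:8080".
function proxyFromUrl(proxyUrl: string | undefined): ProxyOptions | undefined {
  if (!proxyUrl) return undefined; // no proxy configured: connect directly
  const u = new URL(proxyUrl);
  const proxy: ProxyOptions = { server: `${u.protocol}//${u.host}` };
  // Credentials in a URL are percent-encoded, so decode them before use.
  if (u.username) proxy.username = decodeURIComponent(u.username);
  if (u.password) proxy.password = decodeURIComponent(u.password);
  return proxy;
}

// Usage inside the fetch layer (not executed here):
//   const context = await browser.newContext({ proxy: proxyFromUrl(process.env.PROXY_URL) });
//   const page = await context.newPage();
```

Because the proxy is resolved in one place, the parser never learns whether a page came through a proxy or not.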


Step 2: Parse with Cheerio (fast selectors)

import * as cheerio from "cheerio";

export type Item = {
  title: string;
  url: string;
  price?: string | null;
};

export function parseListing(html: string, baseUrl: string): Item[] {
  const $ = cheerio.load(html);

  const items: Item[] = [];

  // Replace selectors with the target site’s structure.
  $("a").each((_, el) => {
    const href = $(el).attr("href");
    const title = $(el).text().trim();
    if (!href || !title) return;

    // new URL() resolves relative hrefs against baseUrl and leaves absolute ones unchanged.
    const abs = new URL(href, baseUrl).toString();
    items.push({ title, url: abs });
  });

  return items;
}

This is intentionally generic. Your real scraper should use site-specific selectors:

  • “card” containers
  • title link selector
  • price selector
  • pagination selector
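
A site-specific version might look like the sketch below. Every selector in it (`.card`, `a.title`, `.price`, `a[rel='next']`) is a placeholder that assumes a particular page structure; inspect the target site and substitute its real ones.

```typescript
import * as cheerio from "cheerio";

type Item = { title: string; url: string; price?: string | null };

// All selectors below are hypothetical placeholders for a card-based listing page.
function parseCards(html: string, baseUrl: string): Item[] {
  const $ = cheerio.load(html);
  const items: Item[] = [];

  $(".card").each((_, el) => {
    const link = $(el).find("a.title").first();
    const href = link.attr("href");
    const title = link.text().trim();
    if (!href || !title) return; // skip cards without a usable title link

    const priceText = $(el).find(".price").first().text().trim();
    items.push({
      title,
      url: new URL(href, baseUrl).toString(),
      price: priceText || null, // null when the card shows no price
    });
  });

  return items;
}

// Pagination: absolute URL of the next page, or null on the last page.
function nextPageUrl(html: string, baseUrl: string): string | null {
  const href = cheerio.load(html)("a[rel='next']").attr("href");
  return href ? new URL(href, baseUrl).toString() : null;
}
```

Scoping each lookup to its card keeps a title from being paired with a neighbouring card's price, which is the usual failure of page-wide selectors.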

Step 3: Queue design (dedupe + concurrency)

You want three guarantees:

  1. dedupe URLs
  2. limit concurrency (avoid bans)
  3. isolate failures (one bad URL doesn’t kill the run)

import pLimit from "p-limit";
import { fetchRenderedHtml } from "./fetch";
import { parseListing, type Item } from "./parse";

const limit = pLimit(3); // start low

export async function crawl(urls: string[]): Promise<Item[]> {
  const seen = new Set<string>();
  const out: Item[] = [];

  const tasks = urls.map((url) =>
    limit(async () => {
      if (seen.has(url)) return;
      seen.add(url);

      const html = await fetchRenderedHtml(url, { maxRetries: 3 });
      const items = parseListing(html, url);
      out.push(...items);
    })
  );

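  // allSettled never rejects, so log failures instead of dropping them silently.
  const results = await Promise.allSettled(tasks);
  results.forEach((r, i) => {
    if (r.status === "rejected") console.error(`failed: ${urls[i]}`, r.reason);
  });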
  return out;
}

Step 4: Export JSON and CSV

import { stringify } from "csv-stringify/sync";
import { writeFileSync } from "node:fs";

export function exportData(rows: any[], slug: string) {
  writeFileSync(`${slug}.json`, JSON.stringify(rows, null, 2), "utf-8");

  const csv = stringify(rows, { header: true });
  writeFileSync(`${slug}.csv`, csv, "utf-8");
}
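
The export step writes whatever it is given, so the “normalize records” step from the blueprint belongs just before it. Below is a minimal sketch of what that could look like for the Item shape used in this guide. `normalizeItem` and `parsePrice` are hypothetical helpers, and the price parsing assumes "." as the decimal separator and "," as a thousands separator.

```typescript
type RawItem = { title: string; url: string; price?: string | null };
type CleanItem = { title: string; url: string; price: number | null };

// Parses strings like "$1,299.00" into 1299. Assumes "." is the decimal
// separator and "," a thousands separator; adjust for other locales.
function parsePrice(raw: string | null | undefined): number | null {
  if (!raw) return null;
  const cleaned = raw.replace(/[^0-9.]/g, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

function normalizeItem(item: RawItem): CleanItem {
  return {
    title: item.title.replace(/\s+/g, " ").trim(), // collapse internal whitespace
    url: item.url,
    price: parsePrice(item.price), // always present, so every row has the same keys
  };
}
```

Giving every record the same keys also helps the CSV output, since a header row derived from the data then covers all rows.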

A practical “starter” main script

import { crawl } from "./crawl";
import { exportData } from "./export";

const URLS = [
  "https://example.com/listing-page-1",
  "https://example.com/listing-page-2",
];

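// Wrapped in main() so the script also runs in a CommonJS package
// (the default after `npm init -y`), where top-level await is not available.
async function main() {
  const rows = await crawl(URLS);
  console.log("rows:", rows.length);
  exportData(rows, "ts_scrape_out");
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});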

Run it with:

npx tsx src/main.ts

Common failure modes (and fixes)

Symptom         | Likely cause            | Fix
HTML too small  | JS not loaded / blocked | wait for a selector, lower concurrency, fetch-layer stability
Random timeouts | flaky network           | retries + backoff in fetch layer
Getting blocked | too many requests       | lower concurrency, add delays, rotate proxies
Duplicate rows  | URL variants            | normalize URLs + dedupe by canonical key
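
For the duplicate-rows case, a canonical key can be derived with the standard URL class. The sketch below is one possible set of rules, not a universal one: the tracking-parameter list and the trailing-slash handling are assumptions, and `canonicalUrl` and `dedupeByUrl` are hypothetical helper names.

```typescript
// Query parameters treated as tracking noise. This list is an assumption.
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "fbclid", "gclid",
];

function canonicalUrl(raw: string): string {
  const u = new URL(raw); // also lowercases the scheme and host
  u.hash = ""; // fragments never reach the server
  for (const p of TRACKING_PARAMS) u.searchParams.delete(p);
  u.searchParams.sort(); // ?b=2&a=1 and ?a=1&b=2 become the same key
  // Treat /path and /path/ as the same page (an assumption; some sites differ).
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}

// Keeps the first row seen for each canonical URL.
function dedupeByUrl<T extends { url: string }>(rows: T[]): T[] {
  const seen = new Set<string>();
  return rows.filter((row) => {
    const key = canonicalUrl(row.url);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

The same canonicalUrl function can be applied to the input queue, so URL variants are caught before they are fetched rather than after.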

Where ProxiesAPI fits (honestly)

If your scraper is cleanly structured as:

fetch → parse → normalize → export

…then ProxiesAPI is just a fetch-layer swap for harder targets.

Don’t tie your parser to your proxy provider. Keep it boring, testable, and easy to evolve.
