Web Scraping with TypeScript in 2026: Playwright + Cheerio End-to-End Guide

May 24, 2026 · guide · #typescript, #nodejs, #playwright, #cheerio, #web-scraping, #csv, #json

If you’re scraping in TypeScript, the winning combo in 2026 is:

Playwright for navigation + rendering (handles JS-heavy pages)
Cheerio for parsing HTML fast (jQuery-like selectors, no browser required)

This guide gives you an end-to-end blueprint you can reuse across sites:

define a URL queue
fetch pages (rendered or plain)
parse with Cheerio selectors
normalize records
export JSON/CSV
add guardrails (retries, backoff, dedupe)

When the crawl gets flaky, move stability into the fetch layer

Most scraping failures are network failures. Keep extraction logic clean, and make reliability a fetch-layer concern (retries, timeouts, and optional ProxiesAPI).


When to use Playwright vs Cheerio

Use Playwright when:

the page is JS-rendered (content missing from raw HTML)
pagination needs clicks / XHR
you need to scroll / wait for content

Use Cheerio-only when:

the site is server-rendered (fast!)
you’re processing thousands of pages where browser rendering is too slow

The workflow here uses both: Playwright fetches a fully rendered HTML snapshot, then Cheerio parses it.
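
One rough way to decide per site is to fetch the raw HTML once and check whether text you expect to scrape is already there. This is a heuristic sketch, not part of the workflow above: `rawHtmlHasContent`, `needsBrowser`, and the marker strings are hypothetical names, and it assumes Node 18+ for the global `fetch`.

```typescript
// Heuristic: if text you expect to scrape is already in the raw HTML,
// the page is probably server-rendered and Cheerio alone may be enough.
export function rawHtmlHasContent(rawHtml: string, markers: string[]): boolean {
  const haystack = rawHtml.toLowerCase();
  // An empty marker list gives no evidence either way, so report false.
  return markers.length > 0 && markers.every((m) => haystack.includes(m.toLowerCase()));
}

// Fetches the page without a browser and reports whether rendering looks necessary.
export async function needsBrowser(url: string, markers: string[]): Promise<boolean> {
  const res = await fetch(url, { redirect: "follow" });
  // Assumption: on a blocked or failed request, fall through to the browser path.
  if (!res.ok) return true;
  return !rawHtmlHasContent(await res.text(), markers);
}
```

For example, `await needsBrowser("https://example.com/listing-page-1", ["Add to cart"])` returning `true` suggests routing that site through Playwright. Pick markers that only appear once the real content has loaded.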

Project setup

mkdir ts-scraper && cd ts-scraper
npm init -y
npm i playwright cheerio p-limit csv-stringify
npm i -D typescript tsx @types/node
npx playwright install

We’ll use:

playwright for rendering
cheerio for parsing
p-limit for concurrency limits
csv-stringify for CSV output

Step 1: Define your “fetch” layer (rendered HTML snapshot)

Keep a single function responsible for:

timeouts
retries
(optional) proxy or ProxiesAPI usage

import { chromium } from "playwright";

type FetchOptions = {
  timeoutMs?: number;
  maxRetries?: number;
};

export async function fetchRenderedHtml(url: string, opts: FetchOptions = {}): Promise<string> {
  const timeoutMs = opts.timeoutMs ?? 45_000;
  const maxRetries = opts.maxRetries ?? 3;

  let lastErr: unknown = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    // A fresh browser per attempt is slow but isolates crashes.
    const browser = await chromium.launch({ headless: true });

    try {
      const page = await browser.newPage({
        userAgent:
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123 Safari/537.36",
        viewport: { width: 1280, height: 720 },
      });

      await page.goto(url, { timeout: timeoutMs, waitUntil: "domcontentloaded" });
      // Short fixed wait for lazy-loaded content; prefer waiting on a known selector when you have one.
      await page.waitForTimeout(500);

      const html = await page.content();
      // Heuristic threshold: a tiny document usually means a block page or an empty shell.
      if (html.length < 2000) throw new Error("Suspiciously small HTML");
      return html;
    } catch (e) {
      lastErr = e;
    } finally {
      // Runs on success and on failure; closing the browser also closes its pages.
      await browser.close().catch(() => {});
    }

    // Exponential backoff with jitter, skipped after the final attempt.
    if (attempt < maxRetries) {
      const backoffMs = Math.min(8000, 2 ** (attempt - 1) * 600) + Math.random() * 250;
      await new Promise((r) => setTimeout(r, backoffMs));
    }
  }

  throw new Error(`Failed to fetch after ${maxRetries} attempts: ${String(lastErr)}`);
}
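
The retry policy in that function (600 ms doubling per attempt, capped at 8 s, plus up to 250 ms of jitter) can also be pulled out into a small helper so other fetchers can share it. The sketch below is one possible shape; `backoffDelay`, `withRetries`, and the option names are hypothetical, and the random source is injectable only to make the delay testable.

```typescript
type BackoffOptions = {
  baseMs?: number;   // delay before the first retry
  capMs?: number;    // upper bound on the exponential part
  jitterMs?: number; // maximum random extra delay
};

// Delay to wait after a failed attempt (attempt is 1-based).
export function backoffDelay(
  attempt: number,
  opts: BackoffOptions = {},
  random: () => number = Math.random
): number {
  const { baseMs = 600, capMs = 8000, jitterMs = 250 } = opts;
  const exponential = Math.min(capMs, 2 ** (attempt - 1) * baseMs);
  return exponential + random() * jitterMs;
}

// Runs `fn` up to `maxRetries` times, sleeping between failed attempts.
export async function withRetries<T>(
  fn: (attempt: number) => Promise<T>,
  maxRetries = 3
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (e) {
      lastErr = e;
      // No point sleeping after the final attempt.
      if (attempt < maxRetries) {
        await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
      }
    }
  }
  throw new Error(`Failed after ${maxRetries} attempts: ${String(lastErr)}`);
}
```

The jitter matters once you run several workers: without it, requests that failed together retry together.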

Where proxies fit

If you need proxies, put them here (fetch-layer):

Playwright supports per-context proxies
or you can fetch through a proxy-backed service and still parse the returned HTML the same way

Keep parsing pure and boring.
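
As a sketch of the per-context option: Playwright's `browser.newContext()` accepts a `proxy` object with a `server` and optional `username` and `password`. The helper below only builds that object from a proxy URL; `proxyFromUrl` and the `SCRAPER_PROXY_URL` environment variable are hypothetical names, and the proxy host is a placeholder.

```typescript
// Shape of the `proxy` option passed to Playwright (server is required; credentials are optional).
type ProxySettings = { server: string; username?: string; password?: string };

// Builds proxy settings from a URL such as "http://user:pass@proxy.example:8080".
// Returns undefined when no proxy is configured, so the result can be passed straight through.
export function proxyFromUrl(proxyUrl: string | undefined): ProxySettings | undefined {
  if (!proxyUrl) return undefined;
  const u = new URL(proxyUrl);
  const settings: ProxySettings = { server: `${u.protocol}//${u.host}` };
  // Credentials in a URL are percent-encoded, so decode them before use.
  if (u.username) settings.username = decodeURIComponent(u.username);
  if (u.password) settings.password = decodeURIComponent(u.password);
  return settings;
}

// Usage inside the fetch layer (SCRAPER_PROXY_URL is a hypothetical variable name):
//   const context = await browser.newContext({ proxy: proxyFromUrl(process.env.SCRAPER_PROXY_URL) });
//   const page = await context.newPage();
```

Because the proxy is resolved in one place, switching providers or turning proxies off changes configuration only, not the parser.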

Step 2: Parse with Cheerio (fast selectors)

import * as cheerio from "cheerio";

export type Item = {
  title: string;
  url: string;
  price?: string | null;
};

export function parseListing(html: string, baseUrl: string): Item[] {
  const $ = cheerio.load(html);

  const items: Item[] = [];

  // Replace selectors with the target site’s structure.
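  $("a").each((_, el) => {
    const href = $(el).attr("href");
    const title = $(el).text().trim();
    if (!href || !title) return;

    try {
      // new URL() resolves relative hrefs against baseUrl and leaves absolute ones unchanged.
      const abs = new URL(href, baseUrl);
      // Skip mailto:, javascript:, tel: and similar non-page links.
      if (abs.protocol !== "http:" && abs.protocol !== "https:") return;
      items.push({ title, url: abs.toString() });
    } catch {
      // Malformed href: skip it rather than abort the whole page.
    }
  });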

  return items;
}

This is intentionally generic. Your real scraper should use site-specific selectors:

“card” containers
title link selector
price selector
pagination selector
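
Whatever selectors you settle on, the strings they return usually need normalizing before export, and price is the typical case. The sketch below is one way to do it; `parsePrice` and `ParsedPrice` are hypothetical names, and it assumes "," is a thousands separator, "." is the decimal point, and "$" means USD, none of which holds for every site.

```typescript
export type ParsedPrice = { amount: number; currency: string | null };

// Assumption: symbol-to-currency mapping is a guess; "$" is ambiguous across countries.
const CURRENCY_SYMBOLS: Record<string, string> = { "$": "USD", "€": "EUR", "£": "GBP" };

// Parses strings like "$1,299.00" or "£45". Returns null when no number is found.
export function parsePrice(raw: string | null | undefined): ParsedPrice | null {
  if (!raw) return null;
  const text = raw.trim();
  // First run of digits, optionally with thousands commas and a decimal part.
  const match = text.match(/\d[\d,]*(?:\.\d+)?/);
  if (!match) return null;
  const amount = Number(match[0].replace(/,/g, ""));
  if (!Number.isFinite(amount)) return null;
  const symbol = Object.keys(CURRENCY_SYMBOLS).find((s) => text.includes(s));
  return { amount, currency: symbol ? (CURRENCY_SYMBOLS[symbol] ?? null) : null };
}
```

Keeping the raw string alongside the parsed value in your record makes it easier to re-parse later if the assumptions turn out wrong for a given site.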

Step 3: Queue design (dedupe + concurrency)

You want three guarantees:

dedupe URLs
limit concurrency (avoid bans)
isolate failures (one bad URL doesn’t kill the run)

import pLimit from "p-limit";
import { fetchRenderedHtml } from "./fetch";
import { parseListing, type Item } from "./parse";

const limit = pLimit(3); // start low

export async function crawl(urls: string[]): Promise<Item[]> {
  const seen = new Set<string>();
  const out: Item[] = [];

  const tasks = urls.map((url) =>
    limit(async () => {
      if (seen.has(url)) return;
      seen.add(url);

      const html = await fetchRenderedHtml(url, { maxRetries: 3 });
      const items = parseListing(html, url);
      out.push(...items);
    })
  );

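  const results = await Promise.allSettled(tasks);
  // Report failed URLs instead of dropping them silently; the run still completes.
  results.forEach((r, i) => {
    if (r.status === "rejected") console.warn(`Failed: ${urls[i]}: ${String(r.reason)}`);
  });
  return out;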
}
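
Exact string comparison misses URL variants that point to the same page, such as a trailing slash, a fragment, or tracking parameters. A minimal sketch of deduping by a canonical key follows; `canonicalUrl` and `dedupeUrls` are hypothetical names, and the rules (drop fragments and `utm_*` parameters, sort the query, strip a trailing slash) are assumptions that can merge genuinely distinct pages on some sites.

```typescript
// Reduces a URL to a canonical form suitable as a dedupe key.
export function canonicalUrl(raw: string): string {
  const u = new URL(raw); // also lowercases the host
  u.hash = "";
  // Copy the keys first so deleting does not disturb iteration.
  for (const key of Array.from(u.searchParams.keys())) {
    if (key.startsWith("utm_")) u.searchParams.delete(key);
  }
  u.searchParams.sort();
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}

// Returns canonical URLs in first-seen order, skipping strings that do not parse.
export function dedupeUrls(urls: string[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const raw of urls) {
    let key: string;
    try {
      key = canonicalUrl(raw);
    } catch {
      continue; // unparseable URL
    }
    if (!seen.has(key)) {
      seen.add(key);
      out.push(key);
    }
  }
  return out;
}
```

Calling `crawl(dedupeUrls(urls))` keeps the crawl function itself unchanged.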

Step 4: Export JSON and CSV

import { stringify } from "csv-stringify/sync";
import { writeFileSync } from "node:fs";

export function exportData(rows: Record<string, unknown>[], slug: string) {
  writeFileSync(`${slug}.json`, JSON.stringify(rows, null, 2), "utf-8");

  const csv = stringify(rows, { header: true });
  writeFileSync(`${slug}.csv`, csv, "utf-8");
}

A practical “starter” main script

import { crawl } from "./crawl";
import { exportData } from "./export";

const URLS = [
  "https://example.com/listing-page-1",
  "https://example.com/listing-page-2",
];

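// Wrapped in a function because top-level await fails when the project runs as CommonJS,
// which is the default after `npm init -y`.
async function main() {
  const rows = await crawl(URLS);
  console.log("rows:", rows.length);
  exportData(rows, "ts_scrape_out");
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});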

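Save the snippets above as src/fetch.ts, src/parse.ts, src/crawl.ts, src/export.ts, and src/main.ts (the names the imports expect), then run: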

npx tsx src/main.ts

Common failure modes (and fixes)

Symptom	Likely cause	Fix
HTML too small	JS not rendered yet, or request blocked	wait for a specific selector, lower concurrency, retries in the fetch layer
Random timeouts	flaky network	retries + backoff in fetch layer
Getting blocked	too many requests	lower concurrency, add delays, rotate proxies
Duplicate rows	URL variants	normalize URLs + dedupe by canonical key
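
For the "add delays" fix, a concurrency cap alone does not space requests out: three workers can still hit one host in the same instant. Below is a minimal sketch of a per-host minimum gap; `HostThrottle` and `reserve` are hypothetical names, and the clock is passed in so the behavior is easy to check.

```typescript
// Tracks the earliest time each host may be requested again.
export class HostThrottle {
  private readonly minGapMs: number;
  private readonly nextAllowedAt = new Map<string, number>();

  constructor(minGapMs: number) {
    this.minGapMs = minGapMs;
  }

  // Reserves the next slot for the URL's host and returns how long to wait (ms) before fetching.
  reserve(url: string, now: number = Date.now()): number {
    const host = new URL(url).host;
    const slot = Math.max(now, this.nextAllowedAt.get(host) ?? 0);
    this.nextAllowedAt.set(host, slot + this.minGapMs);
    return slot - now;
  }
}
```

In a crawl task you would sleep for `throttle.reserve(url)` milliseconds before calling the fetch function, for example with `await new Promise((r) => setTimeout(r, throttle.reserve(url)))`.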

Where ProxiesAPI fits (honestly)

If your scraper is cleanly structured as:

fetch → parse → normalize → export

…then ProxiesAPI is just a fetch-layer swap for harder targets.

Don’t tie your parser to your proxy provider. Keep it boring, testable, and easy to evolve.
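
One way to keep that separation explicit is to give the fetch layer a single function type and pass it in. The names below (`HtmlFetcher`, `httpFetcher`, `makeScraper`) are illustrative, not from any library; a Playwright-backed or proxy-backed fetcher would simply be another function with the same signature. Assumes Node 18+ for the global `fetch`.

```typescript
// Any function that turns a URL into HTML can serve as the fetch layer.
export type HtmlFetcher = (url: string) => Promise<string>;

// Plain HTTP fetcher for server-rendered pages.
export const httpFetcher: HtmlFetcher = async (url) => {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
};

// Binds a fetcher to a parser; the parser never learns where the HTML came from.
export function makeScraper<T>(
  fetchHtml: HtmlFetcher,
  parse: (html: string, baseUrl: string) => T
) {
  return async (url: string): Promise<T> => parse(await fetchHtml(url), url);
}
```

In tests you can pass a stub fetcher that returns saved HTML, which lets you exercise the parser without any network access.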
