Node.js Web Scraping with Cheerio: Quick Start Guide
If you’re scraping in Node.js, Cheerio is the fastest way to parse server-rendered HTML.
It gives you a jQuery-like API ($('selector')) without the overhead of launching a browser.
This quick start guide shows how to go from “I can parse one page” to “I can run a real crawl”:
- fetch a page with timeouts + retries
- load HTML into Cheerio
- extract fields with real selectors
- paginate safely
- export JSONL (stream-friendly)
- plug in ProxiesAPI so your requests don’t fall over at scale
Cheerio is fast—but production scrapers fail in the network layer first. ProxiesAPI helps keep your requests stable as you add pagination, concurrency, and long-running crawls.
When Cheerio is the right tool (and when it isn’t)
Use Cheerio when:
- the page is mostly server-rendered HTML
- the data you want is in the initial response
- you want speed and low cost
Avoid Cheerio (or combine it with a browser) when:
- content loads only after JS runs
- you need to click, scroll, or solve an interactive flow
A common hybrid architecture is:
- Cheerio for 80% of pages (fast)
- browser automation for the hard 20%
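Here is a minimal sketch of that split. It assumes the fetchHtml and parseCards helpers built later in this guide live in ./fetch.js and ./parse.js (file names are my choice), with a placeholder where a Playwright-based renderer would plug in:
// Hybrid sketch: try the cheap Cheerio path first, fall back to a browser.
import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';

async function renderWithBrowser(url) {
  // Placeholder: implement with Playwright (page.goto + page.content()) when you hit JS-only pages.
  throw new Error(`browser rendering not implemented for ${url}`);
}

async function scrapePage(url) {
  const html = await fetchHtml(url); // fast path: plain HTTP + Cheerio
  const items = parseCards(html);
  if (items.length > 0) return items;
  // An empty result often means the content is rendered client-side.
  return parseCards(await renderWithBrowser(url));
}

export { scrapePage };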
Project setup
mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
npm i cheerio undici p-retry p-limit dotenv
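One setup detail worth calling out: npm init -y generates a CommonJS package.json, while every example below uses ESM import syntax and top-level await, so add "type": "module" (or use .mjs file extensions). A rough package.json sketch, with illustrative version ranges rather than pinned recommendations:
{
  "name": "cheerio-scraper",
  "type": "module",
  "dependencies": {
    "cheerio": "^1.0.0",
    "undici": "^6.0.0",
    "p-retry": "^6.0.0",
    "p-limit": "^5.0.0",
    "dotenv": "^16.0.0"
  }
}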
Create .env:
PROXIESAPI_KEY=your_api_key_here
We’ll use:
- undici: Node’s modern HTTP client
- cheerio: HTML parsing
- p-retry: retries with backoff
- p-limit: concurrency limits
ProxiesAPI request helper (with retries)
Scrapers fail in the network layer first.
This helper:
- routes requests through ProxiesAPI
- uses a realistic timeout
- retries transient HTTP failures
import 'dotenv/config';
import { request } from 'undici';
import pRetry, { AbortError } from 'p-retry';

// Build the ProxiesAPI gateway URL for a target page.
function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');
  return `https://api.proxiesapi.com/?auth_key=${key}&url=${encodeURIComponent(targetUrl)}`;
}

async function fetchHtml(url) {
  return pRetry(async () => {
    const gateway = proxiesApiUrl(url);
    const { statusCode, body } = await request(gateway, {
      method: 'GET',
      headers: {
        'user-agent':
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36',
        accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'accept-language': 'en-US,en;q=0.9'
      },
      bodyTimeout: 40_000,
      headersTimeout: 10_000
    });
    // Transient failures: throw so p-retry tries again with backoff.
    if ([403, 408, 429, 500, 502, 503, 504].includes(statusCode)) {
      throw new Error(`Transient HTTP ${statusCode}`);
    }
    // Anything else non-2xx (e.g. 404) is permanent: abort instead of retrying.
    if (statusCode < 200 || statusCode >= 300) {
      throw new AbortError(`HTTP ${statusCode}`);
    }
    return await body.text();
  }, {
    retries: 5,
    minTimeout: 800,
    maxTimeout: 10_000
  });
}

export { fetchHtml };
Parse HTML with Cheerio (real selectors)
Cheerio uses CSS selectors.
A good pattern is:
- fetch HTML
- load it with cheerio.load(html)
- extract a list of items with a stable container selector
- extract fields relative to each item
Here’s a toy example for a blog-like page:
import * as cheerio from 'cheerio';

function parseCards(html) {
  const $ = cheerio.load(html);
  const cards = [];
  // Broad container selector: tighten this to the site's real card/list markup.
  $('.card, article, .post, li').each((_, el) => {
    const title = $(el).find('h1,h2,h3').first().text().trim() || null;
    const link = $(el).find('a[href]').first().attr('href') || null;
    if (!title || !link) return;
    cards.push({ title, link });
  });
  return cards;
}

export { parseCards };
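To see fetching and parsing together, a quick check; it assumes the fetch helper is saved as fetch.js and the parser above as parse.js (both file names are my choice), and uses a placeholder URL:
import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';

// Placeholder URL: point this at a page you are allowed to scrape.
const html = await fetchHtml('https://example.com/blog');
console.table(parseCards(html).slice(0, 5));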
Selector sanity check
Before you write selectors, confirm your Node setup runs at all:
node -e "console.log('ok')"
Then inspect HTML in DevTools:
- right click → Inspect
- find a stable parent container
- prefer semantic tags (article, h2) over hashed class names
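If you want to iterate on selectors without re-fetching the page, save one response to disk and test selectors against the snapshot. This sketch assumes a snapshot.html file you saved earlier (the file name is arbitrary):
import fs from 'node:fs';
import * as cheerio from 'cheerio';

// Load a saved snapshot and probe candidate selectors offline.
const html = fs.readFileSync('snapshot.html', 'utf8');
const $ = cheerio.load(html);
console.log('articles:', $('article').length);
console.log('first h2:', $('article h2').first().text().trim());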
Pagination patterns (what you’ll see in the wild)
Most pagination falls into one of these:
- ?page=2
- /page/2/
- ?offset=20
You don’t need a fancy crawler to handle this.
Start with:
- a function that builds the next URL
- a max pages cap
- a seen set to avoid loops
function nextPageUrl(baseUrl, page) {
  const u = new URL(baseUrl);
  u.searchParams.set('page', String(page));
  return u.toString();
}

export { nextPageUrl };
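For the path-style pattern (/page/2/) listed above, a small variant helper works. This is a sketch only, and the path segment will need adjusting to whatever the target site actually uses:
// Path-style pagination: /blog -> /blog/page/2/
function nextPageUrlByPath(baseUrl, page) {
  const u = new URL(baseUrl);
  // Strip any existing /page/N/ segment and trailing slash, then append the new one.
  const cleanPath = u.pathname.replace(/\/page\/\d+\/?$/, '').replace(/\/$/, '');
  u.pathname = `${cleanPath}/page/${page}/`;
  return u.toString();
}

export { nextPageUrlByPath };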
A complete quick start scraper (JSONL export)
This script:
- crawls N pages
- parses cards from each page
- de-dupes by URL
- writes JSONL so you can stream results
import fs from 'node:fs';
import * as cheerio from 'cheerio';
import { fetchHtml } from './fetch.js';

function parseCards(html, baseUrl) {
  const $ = cheerio.load(html);
  const out = [];
  $('a[href]').each((_, a) => {
    const href = $(a).attr('href');
    const text = $(a).text().trim();
    if (!href || !text) return;
    // avoid nav/footer junk
    if (text.length < 8) return;
    let abs;
    try {
      // resolve relative links against the current page URL
      abs = new URL(href, baseUrl).toString();
    } catch {
      return;
    }
    out.push({ title: text, url: abs });
  });
  return out;
}

function nextPageUrl(baseUrl, page) {
  const u = new URL(baseUrl);
  u.searchParams.set('page', String(page));
  return u.toString();
}

async function run({ startUrl, pages = 3 }) {
  const seen = new Set();
  const out = fs.createWriteStream('results.jsonl', { flags: 'w' });
  for (let p = 1; p <= pages; p++) {
    const url = p === 1 ? startUrl : nextPageUrl(startUrl, p);
    const html = await fetchHtml(url);
    // use the current page URL (not startUrl) as the base for relative links
    const items = parseCards(html, url);
    console.log('page', p, 'items', items.length);
    for (const it of items) {
      if (seen.has(it.url)) continue;
      seen.add(it.url);
      out.write(JSON.stringify(it) + '\n');
    }
  }
  out.end();
  console.log('unique items', seen.size);
}

await run({
  startUrl: 'https://example.com/blog',
  pages: 5
});
Make it production-ish
- Add p-limit to cap concurrency when fetching detail pages (see the sketch after this list)
- Persist crawl state (SQLite)
- Record HTTP status + error strings for debugging
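A minimal concurrency sketch with p-limit, assuming detailUrls is an array of item URLs you collected from the listing pages (the variable name and the limit of 5 are my choices):
import pLimit from 'p-limit';
import { fetchHtml } from './fetch.js';

const limit = pLimit(5); // at most 5 requests in flight at once

async function fetchDetails(detailUrls) {
  const tasks = detailUrls.map((url) =>
    limit(async () => {
      const html = await fetchHtml(url);
      return { url, bytes: html.length };
    })
  );
  return Promise.all(tasks);
}

export { fetchDetails };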
Comparison: Cheerio vs Playwright
| Feature | Cheerio | Playwright |
|---|---|---|
| Cost per page | Low | Higher |
| Handles JS-rendered sites | No | Yes |
| Speed | Very fast | Slower |
| Best for | HTML pages, feeds | Interactive flows |
The winning approach for most products is: Cheerio first, browser only when needed.
Common scraping mistakes in Node.js
- No timeouts → your job hangs.
- No retries → transient 429/503 kills your run.
- No dedupe → you store the same item 10 times.
- Selectors too brittle → one redesign breaks everything.
The fix is boring engineering:
- timeouts
- retries
- backoff
- dedupe
- debug snapshots (see the sketch after this list)
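Here is one way to do debug snapshots: when a page parses to zero items, write the raw HTML to disk so you can inspect what actually came back. The file name pattern and the zero-items trigger are my choices, not a fixed convention:
import fs from 'node:fs';

// Save a snapshot when parsing looks wrong, so you can open the HTML
// in an editor or DevTools and see what the server actually returned.
function maybeSnapshot(html, items, page) {
  if (items.length === 0) {
    fs.writeFileSync(`debug-page-${page}.html`, html);
    console.warn(`page ${page}: 0 items parsed, wrote debug-page-${page}.html`);
  }
}

export { maybeSnapshot };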
Where ProxiesAPI fits (honestly)
Cheerio is parsing.
But reliability comes from your fetch layer:
- if you’re crawling page lists, ProxiesAPI can reduce random 403/429 spikes
- if you’re scraping across multiple domains, you get a consistent interface
- if you’re pulling thousands of pages, stability matters more than micro-optimizations
Start with the helper above, keep concurrency modest, and you’ll have a scraper you can actually run nightly.