Node.js Web Scraping with Cheerio: Quick Start Guide

If you’re scraping in Node.js, Cheerio is the fastest way to parse server-rendered HTML.

It gives you a jQuery-like API ($('selector')) without the overhead of launching a browser.

This quick start guide shows how to go from “I can parse one page” to “I can run a real crawl”:

  • fetch a page with timeouts + retries
  • load HTML into Cheerio
  • extract fields with real selectors
  • paginate safely
  • export JSONL (stream-friendly)
  • plug in ProxiesAPI so your requests don’t fall over at scale

Make your Node.js scraper resilient with ProxiesAPI

Cheerio is fast—but production scrapers fail in the network layer first. ProxiesAPI helps keep your requests stable as you add pagination, concurrency, and long-running crawls.


When Cheerio is the right tool (and when it isn’t)

Use Cheerio when:

  • the page is mostly server-rendered HTML
  • the data you want is in the initial response
  • you want speed and low cost

Avoid Cheerio (or combine it with a browser) when:

  • content loads only after JS runs
  • you need to click, scroll, or solve an interactive flow

A common hybrid architecture is:

  • Cheerio for 80% of pages (fast)
  • browser automation for the hard 20%
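A cheap way to route pages between the two is a heuristic on the raw HTML. Here's a sketch — the 200-character threshold is an assumption you'd tune per site:

```javascript
// Heuristic sketch: route a page to Cheerio or a browser based on how much
// visible text the initial HTML actually contains. JS-shell pages that
// hydrate client-side tend to ship almost no text outside <script> tags.
function needsBrowser(html) {
  const withoutScripts = html.replace(/<script[\s\S]*?<\/script>/gi, '');
  const visibleText = withoutScripts
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return visibleText.length < 200; // assumption: tune per target site
}
```

If `needsBrowser` fires, queue the URL for your browser-automation path instead of failing silently with zero items.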

Project setup

mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
npm pkg set type=module
npm i cheerio undici p-retry p-limit dotenv

Create .env:

PROXIESAPI_KEY=your_api_key_here

We’ll use:

  • undici: Node’s modern HTTP client
  • cheerio: HTML parsing
  • p-retry: retries with backoff
  • p-limit: concurrency limits

ProxiesAPI request helper (with retries)

Scrapers fail in the network layer first.

This helper:

  • routes requests through ProxiesAPI
  • uses a realistic timeout
  • retries transient HTTP failures

import 'dotenv/config';
import { request } from 'undici';
import pRetry, { AbortError } from 'p-retry';

function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');
  return `https://api.proxiesapi.com/?auth_key=${key}&url=${encodeURIComponent(targetUrl)}`;
}

async function fetchHtml(url) {
  return pRetry(async () => {
    const gateway = proxiesApiUrl(url);

    const { statusCode, body } = await request(gateway, {
      method: 'GET',
      headers: {
        'user-agent':
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36',
        accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'accept-language': 'en-US,en;q=0.9'
      },
      bodyTimeout: 40_000,
      headersTimeout: 10_000
    });

    // Transient failures: throw a normal error so p-retry tries again.
    if ([403, 408, 429, 500, 502, 503, 504].includes(statusCode)) {
      throw new Error(`Transient HTTP ${statusCode}`);
    }

    // Permanent failures (e.g. 404): abort so p-retry stops immediately
    // instead of hammering a URL that will never succeed.
    if (statusCode < 200 || statusCode >= 300) {
      throw new AbortError(`HTTP ${statusCode}`);
    }

    return await body.text();
  }, {
    retries: 5,
    minTimeout: 800,
    maxTimeout: 10_000
  });
}

export { fetchHtml };
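For intuition, p-retry gives exponential backoff between attempts (its `factor` defaults to 2, plus randomization we ignore here). A sketch of the approximate delay schedule for the options above:

```javascript
// Approximate p-retry delay schedule: minTimeout * factor^attempt,
// capped at maxTimeout. Real delays also include random jitter.
function backoffDelays({ retries, minTimeout, maxTimeout, factor = 2 }) {
  const delays = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(Math.min(minTimeout * factor ** attempt, maxTimeout));
  }
  return delays;
}
```

With `retries: 5`, `minTimeout: 800`, `maxTimeout: 10_000` that's roughly 800, 1600, 3200, 6400, 10000 ms — about 22 seconds of waiting in the worst case, which is why you want a max-pages cap too.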

Parse HTML with Cheerio (real selectors)

Cheerio uses CSS selectors.

A good pattern is:

  1. fetch HTML
  2. cheerio.load(html)
  3. extract a list of items with a stable container selector
  4. extract fields relative to each item

Here’s a toy example for a blog-like page:

import * as cheerio from 'cheerio';

function parseCards(html) {
  const $ = cheerio.load(html);

  const cards = [];
  $('.card, article, .post, li').each((_, el) => {
    const title = $(el).find('h1,h2,h3').first().text().trim() || null;
    const link = $(el).find('a[href]').first().attr('href') || null;

    if (!title || !link) return;
    cards.push({ title, link });
  });

  return cards;
}

export { parseCards };

Selector sanity check

Before writing any selectors, confirm Node itself runs:

node -e "console.log('ok')"

Then inspect HTML in DevTools:

  • right click → Inspect
  • find a stable parent container
  • prefer semantic tags (article, h2) over hashed classnames

Pagination patterns (what you’ll see in the wild)

Most pagination falls into one of these:

  1. ?page=2
  2. /page/2/
  3. ?offset=20

You don’t need a fancy crawler to handle this.

Start with:

  • a function that builds the next URL
  • a max pages cap
  • a seen set to avoid loops

function nextPageUrl(baseUrl, page) {
  const u = new URL(baseUrl);
  u.searchParams.set('page', String(page));
  return u.toString();
}

export { nextPageUrl };
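The same approach covers the other two shapes. These are sketches — the `/page/N/` layout and the page size of 20 are assumptions about the target site:

```javascript
// Pattern 2: /page/2/ — rewrite the path, stripping any existing /page/N/.
function nextPagePath(baseUrl, page) {
  const u = new URL(baseUrl);
  u.pathname = u.pathname.replace(/\/(page\/\d+\/?)?$/, '') + `/page/${page}/`;
  return u.toString();
}

// Pattern 3: ?offset=20 — offsets are (page - 1) * pageSize.
function nextPageOffset(baseUrl, page, pageSize = 20) {
  const u = new URL(baseUrl);
  u.searchParams.set('offset', String((page - 1) * pageSize));
  return u.toString();
}
```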

A complete quick start scraper (JSONL export)

This script:

  • crawls N pages
  • parses cards from each page
  • de-dupes by URL
  • writes JSONL so you can stream results

import fs from 'node:fs';
import * as cheerio from 'cheerio';
import { fetchHtml } from './fetch.js';

function parseCards(html, baseUrl) {
  const $ = cheerio.load(html);
  const out = [];

  $('a[href]').each((_, a) => {
    const href = $(a).attr('href');
    const text = $(a).text().trim();
    if (!href || !text) return;

    // avoid nav/footer junk
    if (text.length < 8) return;

    let abs;
    try {
      abs = new URL(href, baseUrl).toString();
    } catch {
      return;
    }

    out.push({ title: text, url: abs });
  });

  return out;
}

function nextPageUrl(baseUrl, page) {
  const u = new URL(baseUrl);
  u.searchParams.set('page', String(page));
  return u.toString();
}

async function run({ startUrl, pages = 3 }) {
  const seen = new Set();
  const out = fs.createWriteStream('results.jsonl', { flags: 'w' });

  for (let p = 1; p <= pages; p++) {
    const url = p === 1 ? startUrl : nextPageUrl(startUrl, p);
    const html = await fetchHtml(url);

    const items = parseCards(html, url);
    console.log('page', p, 'items', items.length);

    for (const it of items) {
      if (seen.has(it.url)) continue;
      seen.add(it.url);
      out.write(JSON.stringify(it) + '\n');
    }
  }

  out.end();
  console.log('unique items', seen.size);
}

await run({
  startUrl: 'https://example.com/blog',
  pages: 5
});
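The payoff of JSONL is that downstream steps can stream it back without loading the whole file into memory — a sketch using only Node built-ins:

```javascript
import fs from 'node:fs';
import readline from 'node:readline';

// Stream records out of a JSONL file one line at a time.
async function* readJsonl(file) {
  const rl = readline.createInterface({
    input: fs.createReadStream(file),
    crlfDelay: Infinity
  });
  for await (const line of rl) {
    if (line.trim()) yield JSON.parse(line);
  }
}
```

Usage: `for await (const item of readJsonl('results.jsonl')) { /* … */ }`.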

Make it production-ish

  • Add p-limit to cap concurrency when fetching detail pages
  • Persist crawl state (SQLite)
  • Record HTTP status + error strings for debugging
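p-limit is the easy path for the first of these, but the idea is simple enough to sketch with built-ins — a fixed pool of workers pulling from a shared index:

```javascript
// Minimal concurrency cap (what p-limit does for you, sketched by hand).
// `limit` workers run fn concurrently; results keep input order.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  const worker = async () => {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i], i);
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

In the real scraper, prefer p-limit — it handles error propagation and queueing more carefully than this sketch.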

Comparison: Cheerio vs Playwright

Feature                   | Cheerio           | Playwright
Cost per page             | Low               | Higher
Handles JS-rendered sites | No                | Yes
Speed                     | Very fast         | Slower
Best for                  | HTML pages, feeds | Interactive flows

The winning approach for most products is: Cheerio first, browser only when needed.


Common scraping mistakes in Node.js

  1. No timeouts → your job hangs.
  2. No retries → transient 429/503 kills your run.
  3. No dedupe → you store the same item 10 times.
  4. Selectors too brittle → one redesign breaks everything.

Your solution is boring engineering:

  • timeouts
  • retries
  • backoff
  • dedupe
  • debug snapshots
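Debug snapshots are the cheapest of these to add: when a page parses to zero items, write the raw HTML to disk so you can see what the server actually sent. A sketch — the snapshots directory name is an assumption:

```javascript
import fs from 'node:fs';
import path from 'node:path';

// Save the raw HTML of a suspicious page for offline inspection.
function saveSnapshot(url, html, dir = 'snapshots') {
  fs.mkdirSync(dir, { recursive: true });
  // Turn the URL into a safe filename.
  const name = url.replace(/[^a-z0-9]+/gi, '_').slice(0, 80) + '.html';
  const file = path.join(dir, name);
  fs.writeFileSync(file, html);
  return file;
}
```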

Where ProxiesAPI fits (honestly)

Cheerio is parsing.

But reliability comes from your fetch layer:

  • if you’re crawling page lists, ProxiesAPI can reduce random 403/429 spikes
  • if you’re scraping across multiple domains, you get a consistent interface
  • if you’re pulling thousands of pages, stability matters more than micro-optimizations

Start with the helper above, keep concurrency modest, and you’ll have a scraper you can actually run nightly.
