Node.js Web Scraping with Cheerio: Quick Start Guide (Requests + Proxies + Pagination)

If you’re building web scrapers in Node.js, Cheerio is the workhorse:

  • fast, lightweight HTML parsing
  • jQuery-style selectors ($('a.some-class'))
  • perfect for server-rendered pages and “fetch → parse → export” pipelines

This guide is a practical quick start for web scraping in Node.js with Cheerio.

We’ll build a reusable scraper that:

  1. fetches HTML with timeouts + retries
  2. parses a real page structure with Cheerio
  3. paginates across list pages
  4. exports JSON
  5. optionally routes requests through ProxiesAPI when you need more stability

Make your Node.js scrapers more reliable with ProxiesAPI

Once you paginate or crawl lots of pages, IP-based throttling becomes a real failure mode. ProxiesAPI gives you a proxy-backed fetch URL so your Cheerio scrapers can retry with fewer sudden blocks.


When Cheerio is a good fit (and when it isn’t)

Cheerio works great when the page content exists in the raw HTML.

Use Cheerio for:

  • blogs, docs sites, directories
  • e-commerce category pages (sometimes)
  • news sites
  • community sites with server-rendered HTML

Don’t use Cheerio (alone) when:

  • the content is loaded via XHR after page load
  • the HTML is mostly a JS app shell

In those cases you either:

  • hit the site’s JSON endpoints directly (often best)
  • or use a headless browser (Playwright/Puppeteer)
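
If the site does expose a JSON endpoint, the first option can look roughly like the loop below, which follows a cursor until the server stops returning one. This is only a sketch: the response fields `items` and `nextCursor`, the `cursor` query parameter, and the injectable `fetchJson` option are assumptions for illustration, so check the real requests in your browser's network tab first.

```javascript
// Sketch: collect every page from a cursor-paginated JSON endpoint.
// Assumed (hypothetical) response shape: { items: [...], nextCursor: "abc" | null }.
async function collectJsonPages(endpoint, { fetchJson, maxPages = 50 } = {}) {
  // Default loader uses the global fetch (Node 18+); pass fetchJson to swap it out.
  const load = fetchJson ?? (async (url) => {
    const res = await fetch(url, { headers: { accept: 'application/json' } });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return res.json();
  });

  const all = [];
  let cursor = null;

  for (let page = 1; page <= maxPages; page++) {
    const u = new URL(endpoint);
    if (cursor) u.searchParams.set('cursor', cursor);

    const data = await load(u.toString());
    all.push(...(data.items ?? []));

    cursor = data.nextCursor ?? null;
    if (!cursor) break; // no cursor means this was the last page
  }

  return all;
}
```

The `maxPages` cap is there so a misbehaving endpoint that keeps returning cursors can't loop forever.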

Setup

mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
npm i cheerio
npm i undici

We’ll use:

  • undici for HTTP (fast; the built-in global fetch in Node 18+ also works)
  • cheerio for parsing

Step 1: A robust fetch() wrapper (timeouts + retries)

Scrapers fail at the network layer first.

A good fetch wrapper should:

  • set a timeout
  • retry with backoff
  • optionally detect “blocked” pages

import { fetch } from 'undici';

function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

function looksBlocked(html) {
  const h = (html || '').toLowerCase();
  return [
    'captcha',
    'access denied',
    'unusual traffic',
    'verify you are',
  ].some((s) => h.includes(s));
}

export async function fetchHtml(url, { timeoutMs = 30000, retries = 4 } = {}) {
  let lastErr;

  for (let attempt = 1; attempt <= retries + 1; attempt++) {
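    // One controller per attempt; the timer covers both the request and the body read.
    const controller = new AbortController();
    const t = setTimeout(() => controller.abort(), timeoutMs);

    try {
      const res = await fetch(url, {
        signal: controller.signal,
        headers: {
          'user-agent': 'Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
          'accept-language': 'en-US,en;q=0.9',
        },
      });

      if (!res.ok) {
        throw new Error(`HTTP ${res.status} for ${url}`);
      }

      const html = await res.text();
      if (looksBlocked(html)) {
        throw new Error('Blocked page detected (captcha/bot page)');
      }

      return html;
    } catch (e) {
      lastErr = e;
    } finally {
      // Always clear the timer, whether the attempt succeeded or failed.
      clearTimeout(t);
    }

    // Reached only after a failed attempt; skip the sleep after the final one.
    if (attempt <= retries) {
      // Exponential backoff (2s, 4s, 8s, ...) capped at 15s, plus up to 500ms of jitter.
      const backoff = Math.min(2000 * 2 ** (attempt - 1), 15000);
      const jitter = Math.floor(Math.random() * 500);
      await sleep(backoff + jitter);
    }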
  }

  throw lastErr;
}
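
The wrapper above treats every failure the same way. One possible refinement, sketched here under assumptions, is to retry only statuses that tend to be transient and to honor a `Retry-After` header when the server sends one. The helper names `isRetryableStatus` and `retryDelayMs` and the `maxWaitMs` ceiling are hypothetical, not part of the wrapper.

```javascript
// Statuses that are usually worth retrying; a 403 or 404 rarely fixes itself.
function isRetryableStatus(status) {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Delay before the next attempt. Retry-After may be a number of seconds or an
// HTTP date; when it is absent or unparsable, fall back to exponential backoff.
function retryDelayMs(attempt, retryAfter, {
  baseMs = 2000,
  capMs = 15000,      // cap for our own backoff
  maxWaitMs = 60000,  // ceiling for server-requested waits (an arbitrary choice)
  now = Date.now(),
} = {}) {
  if (retryAfter) {
    const secs = Number(retryAfter);
    if (Number.isFinite(secs)) return Math.min(Math.max(secs, 0) * 1000, maxWaitMs);

    const at = Date.parse(retryAfter);
    if (!Number.isNaN(at)) return Math.min(Math.max(at - now, 0), maxWaitMs);
  }
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}
```

Inside fetchHtml you could then read `res.headers.get('retry-after')` before throwing, and give up early when `isRetryableStatus(res.status)` is false.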

Step 2: Parse with Cheerio (real selectors)

Cheerio gives you a $ function like jQuery.

Example: parse a list page containing “cards” with:

  • title link
  • short description

import * as cheerio from 'cheerio';

export function parseCards(html, { baseUrl }) {
  const $ = cheerio.load(html);

  const items = [];

  $('.card').each((_, el) => {
    const title = $(el).find('a.card__title').text().trim();
    const href = $(el).find('a.card__title').attr('href');
    const url = href ? new URL(href, baseUrl).toString() : null;

    const desc = $(el).find('.card__desc').text().trim();

    if (!url) return;

    items.push({ title: title || null, desc: desc || null, url });
  });

  return items;
}

Tip: Always view the site HTML (right-click → “View page source”) before you assume the content is there.


Step 3: Pagination (the pattern you’ll reuse everywhere)

Most list pages use either:

  • ?page=2
  • /page/2/
  • cursor-based pagination in JSON endpoints

Here’s a reusable helper for “query param page” pagination:

export function setPage(url, page) {
  const u = new URL(url);
  u.searchParams.set('page', String(page));
  return u.toString();
}
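
For sites that use the /page/2/ style instead, a similar helper can rewrite the path. This is a sketch with a hypothetical name (`setPathPage`), and it assumes that page 1 lives at the bare list URL and that the site uses trailing slashes, which is common but not universal.

```javascript
// Sketch for "/page/N/" pagination: replace an existing /page/N segment or append one.
function setPathPage(url, page) {
  const u = new URL(url);
  // Drop any existing "/page/<n>" suffix, then any trailing slashes.
  const base = u.pathname.replace(/\/page\/\d+\/?$/, '').replace(/\/+$/, '');
  // Assumption: page 1 is the bare list URL; later pages get /page/N/.
  u.pathname = page <= 1 ? `${base}/` : `${base}/page/${page}/`;
  return u.toString();
}
```

Query strings are preserved, so a filtered list such as /blog/?tag=js keeps its filter across pages.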

And the crawler:

import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';
import { setPage } from './paginate.js';

export async function crawlList(startUrl, { pages = 5 } = {}) {
  const all = [];
  const seen = new Set();

  for (let p = 1; p <= pages; p++) {
    const url = p === 1 ? startUrl : setPage(startUrl, p);

    const html = await fetchHtml(url);
    const batch = parseCards(html, { baseUrl: startUrl });

    for (const it of batch) {
      if (seen.has(it.url)) continue;
      seen.add(it.url);
      all.push(it);
    }

    console.log(`page ${p}/${pages} -> ${batch.length} items (total ${all.length})`);
  }

  return all;
}
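
The last step on the list, exporting JSON, needs very little code. A minimal sketch, assuming you want one pretty-printed file on disk (the `exportJson` name and the default `items.json` path are placeholders):

```javascript
import { writeFile } from 'node:fs/promises';

// Write scraped items as pretty-printed JSON and return how many were written.
async function exportJson(items, path = 'items.json') {
  const json = JSON.stringify(items, null, 2);
  await writeFile(path, `${json}\n`, 'utf8');
  return items.length;
}
```

Typical usage would be something like `await exportJson(await crawlList('https://example.com/list', { pages: 3 }))`.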

Add ProxiesAPI (proxy-backed fetch URL)

When you scale up (more pages, more keywords, more targets), you’ll start to see:

  • 429 / rate limits
  • random timeouts
  • occasional bot pages

That’s where ProxiesAPI can help at the network layer.

A common pattern is to transform:

  • target URL: https://example.com/list?page=1

into a ProxiesAPI fetch URL:

  • https://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https%3A%2F%2Fexample.com%2Flist%3Fpage%3D1

Here’s a helper:

export function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');

  const u = new URL('https://api.proxiesapi.com/');
  u.searchParams.set('auth_key', key);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}

Then in fetchHtml, call fetch(proxiesApiUrl(url)) instead of fetch(url).

A clean implementation (toggle)

import { fetch } from 'undici';
import { proxiesApiUrl } from './proxiesapi.js';

export async function fetchHtml(url, { useProxiesApi = false, timeoutMs = 30000, retries = 4 } = {}) {
  const finalUrl = useProxiesApi ? proxiesApiUrl(url) : url;
  // ... same fetch/retry logic as before
}
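
For reference, here is one way the toggle and the retry loop could fit together in a single function. Treat it as a sketch rather than the canonical version: the `apiKey`, `baseDelayMs`, and `fetchImpl` options are hypothetical additions that make it easy to test without a network, and the custom headers and blocked-page check from Step 1 are left out to keep it short.

```javascript
function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

// Sketch: fetchHtml with a ProxiesAPI toggle and the retry loop spelled out.
async function fetchHtml(url, {
  useProxiesApi = false,
  apiKey = process.env.PROXIESAPI_KEY, // hypothetical option; defaults to the env var
  timeoutMs = 30000,
  retries = 4,
  baseDelayMs = 2000,
  fetchImpl = fetch, // global fetch in Node 18+; injectable for tests
} = {}) {
  let finalUrl = url;
  if (useProxiesApi) {
    if (!apiKey) throw new Error('Missing PROXIESAPI_KEY');
    // Same endpoint and parameters as the proxiesApiUrl helper.
    const u = new URL('https://api.proxiesapi.com/');
    u.searchParams.set('auth_key', apiKey);
    u.searchParams.set('url', url);
    finalUrl = u.toString();
  }

  let lastErr;
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetchImpl(finalUrl, { signal: controller.signal });
      if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
      return await res.text();
    } catch (e) {
      lastErr = e;
    } finally {
      clearTimeout(timer);
    }
    // Back off before the next attempt, but not after the last one.
    if (attempt <= retries) {
      await sleep(Math.min(baseDelayMs * 2 ** (attempt - 1), 15000));
    }
  }
  throw lastErr;
}
```

Error messages report the target URL rather than the proxy URL, so the API key stays out of your logs.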

Comparison: Cheerio vs alternatives

A quick practical comparison:

Tool       | Best for                                             | Not great for
Cheerio    | HTML parsing, fast extraction, server-rendered pages | JS-rendered SPAs
Playwright | JS-heavy sites, user flows, complex DOM              | heavier + slower
Puppeteer  | similar to Playwright                                | similar tradeoffs
JSDOM      | DOM APIs, small scripts                              | slower than Cheerio

Cheerio is often the right default for server-rendered HTML pages.


Practical advice (what keeps scrapers alive)

  • Cache HTML for debugging (save page-3.html when parse fails)
  • Use a seen set to avoid duplicates
  • Treat “blocked” as a first-class error and retry/back off
  • Keep parsing logic separate from crawl logic (easier to fix selectors)
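
The first tip can be as simple as writing the raw HTML to disk whenever a page yields no items. A rough sketch, where the helper names (`debugFileName`, `saveIfEmpty`) and the `debug-html` directory are made up for illustration:

```javascript
import { mkdir, writeFile } from 'node:fs/promises';
import { join } from 'node:path';

// Turn a URL into a safe, readable file name,
// e.g. https://example.com/list?page=3 -> "example.com_list_page_3.html".
function debugFileName(url) {
  const u = new URL(url);
  const raw = `${u.hostname}${u.pathname}${u.search}`;
  return raw.replace(/[^a-z0-9.-]+/gi, '_').replace(/^_+|_+$/g, '') + '.html';
}

// Save the raw HTML when a page yields nothing, so selectors can be fixed offline.
// Returns the file path, or null when there was nothing to save.
async function saveIfEmpty(url, html, items, dir = 'debug-html') {
  if (items.length > 0) return null;
  await mkdir(dir, { recursive: true });
  const file = join(dir, debugFileName(url));
  await writeFile(file, html, 'utf8');
  return file;
}
```

In crawlList, a call like `await saveIfEmpty(url, html, batch)` right after parsing would capture the pages worth inspecting.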

Where ProxiesAPI fits (honestly)

ProxiesAPI doesn’t replace good engineering. It helps with:

  • proxying requests when you can’t rely on a single IP
  • reducing failure rate on long paginations

You still need:

  • correct selectors
  • respectful request pacing
  • error handling and logging
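
Request pacing in particular is easy to bolt on. The sketch below runs a task for each input one at a time and waits between calls; the `mapPaced` name and the default delays are illustrative assumptions, not recommendations for any specific site.

```javascript
function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

// Run an async task for each input sequentially, pausing between calls.
// The pause is minDelayMs plus a random extra of up to jitterMs.
async function mapPaced(inputs, task, { minDelayMs = 1000, jitterMs = 500 } = {}) {
  const results = [];
  for (let i = 0; i < inputs.length; i++) {
    if (i > 0) {
      await sleep(minDelayMs + Math.floor(Math.random() * jitterMs));
    }
    results.push(await task(inputs[i], i));
  }
  return results;
}
```

For example, `await mapPaced(pageUrls, (u) => fetchHtml(u))` fetches a list of page URLs with roughly a second between requests.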

QA checklist

  • Your scraper fetches HTML with a timeout
  • Parsing returns expected title/url/desc
  • Pagination increases unique item count
  • JSON export is valid
  • You can toggle ProxiesAPI on/off