Node.js Web Scraping with Cheerio: Quick Start Guide (Requests + Proxies + Pagination)
If you’re building web scrapers in Node.js, Cheerio is the workhorse:
- fast, lightweight HTML parsing
- jQuery-style selectors (`$('a.some-class')`)
- perfect for server-rendered pages and “fetch → parse → export” pipelines
This guide is a practical quick start to Node.js web scraping with Cheerio.
We’ll build a reusable scraper that:
- fetches HTML with timeouts + retries
- parses a real page structure with Cheerio
- paginates across list pages
- exports JSON
- optionally routes requests through ProxiesAPI when you need more stability
Once you paginate or crawl lots of pages, IP-based throttling becomes a real failure mode. ProxiesAPI gives you a proxy-backed fetch URL so your Cheerio scrapers can retry with fewer sudden blocks.
When Cheerio is a good fit (and when it isn’t)
Cheerio works great when the page content exists in the raw HTML.
Use Cheerio for:
- blogs, docs sites, directories
- e-commerce category pages (sometimes)
- news sites
- community sites with server-rendered HTML
Don’t use Cheerio (alone) when:
- the content is loaded via XHR after page load
- the HTML is mostly a JS app shell
In those cases you either:
- hit the site’s JSON endpoints directly (often best)
- or use a headless browser (Playwright/Puppeteer)
Setup
mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
npm i cheerio
npm i undici
We’ll use:
- `undici` for HTTP (fast; the built-in fetch in modern Node is also fine)
- `cheerio` for parsing
Step 1: A robust fetch() wrapper (timeouts + retries)
Scrapers fail at the network layer first.
A good fetch wrapper should:
- set a timeout
- retry with backoff
- optionally detect “blocked” pages
import { fetch } from 'undici';
function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

function looksBlocked(html) {
  const h = (html || '').toLowerCase();
  return [
    'captcha',
    'access denied',
    'unusual traffic',
    'verify you are',
  ].some((s) => h.includes(s));
}
export async function fetchHtml(url, { timeoutMs = 30000, retries = 4 } = {}) {
  let lastErr;
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    const controller = new AbortController();
    const t = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(url, {
        signal: controller.signal,
        headers: {
          'user-agent': 'Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
          'accept-language': 'en-US,en;q=0.9',
        },
      });
      if (!res.ok) {
        throw new Error(`HTTP ${res.status} for ${url}`);
      }
      const html = await res.text();
      if (looksBlocked(html)) {
        throw new Error('Blocked page detected (captcha/bot page)');
      }
      return html;
    } catch (e) {
      lastErr = e;
      if (attempt > retries) break; // out of retries — don't sleep before throwing
      const backoff = Math.min(2000 * 2 ** (attempt - 1), 15000);
      const jitter = Math.floor(Math.random() * 500);
      await sleep(backoff + jitter);
    } finally {
      clearTimeout(t); // always clear the abort timer, even on error
    }
  }
  throw lastErr;
}
Step 2: Parse with Cheerio (real selectors)
Cheerio gives you a $ function like jQuery.
Example: parse a list page containing “cards” with:
- title link
- short description
import * as cheerio from 'cheerio';
export function parseCards(html, { baseUrl }) {
  const $ = cheerio.load(html);
  const items = [];
  $('.card').each((_, el) => {
    const link = $(el).find('a.card__title');
    const title = link.text().trim();
    const href = link.attr('href');
    const url = href ? new URL(href, baseUrl).toString() : null;
    const desc = $(el).find('.card__desc').text().trim();
    if (!url) return;
    items.push({ title: title || null, desc: desc || null, url });
  });
  return items;
}
Tip: Always view the site HTML (right-click → “View page source”) before you assume the content is there.
Step 3: Pagination (the pattern you’ll reuse everywhere)
Most list pages use one of:
- `?page=2` query params
- `/page/2/` path segments
- cursor-based pagination in JSON endpoints
Here’s a reusable “query param page” pagination:
export function setPage(url, page) {
  const u = new URL(url);
  u.searchParams.set('page', String(page));
  return u.toString();
}
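Because setPage goes through the URL API, existing query params survive. A quick standalone check (the helper is repeated here so the snippet runs on its own):

```javascript
// Same helper as above, repeated so this snippet runs standalone.
function setPage(url, page) {
  const u = new URL(url);
  u.searchParams.set('page', String(page));
  return u.toString();
}

// Existing params are preserved; only `page` is replaced in place.
console.log(setPage('https://example.com/list?page=1&sort=new', 3));
// → https://example.com/list?page=3&sort=new

// If `page` is absent, it gets appended.
console.log(setPage('https://example.com/list', 2));
// → https://example.com/list?page=2
```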
And the crawler:
import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';
import { setPage } from './paginate.js';
export async function crawlList(startUrl, { pages = 5 } = {}) {
  const all = [];
  const seen = new Set();
  for (let p = 1; p <= pages; p++) {
    const url = p === 1 ? startUrl : setPage(startUrl, p);
    const html = await fetchHtml(url);
    const batch = parseCards(html, { baseUrl: startUrl });
    for (const it of batch) {
      if (seen.has(it.url)) continue;
      seen.add(it.url);
      all.push(it);
    }
    console.log(`page ${p}/${pages} -> ${batch.length} items (total ${all.length})`);
  }
  return all;
}
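The intro promised JSON export, and that last step is just fs. A minimal sketch (the items array and output path here are placeholders; in practice the array comes from crawlList):

```javascript
import { writeFileSync, readFileSync } from 'node:fs';

// Placeholder results; in practice this comes from crawlList().
const items = [
  { title: 'First card', desc: 'Hello', url: 'https://example.com/a' },
  { title: 'Second card', desc: null, url: 'https://example.com/b' },
];

// Pretty-printed JSON so diffs stay readable between runs.
const outPath = 'results.json';
writeFileSync(outPath, JSON.stringify(items, null, 2) + '\n', 'utf8');

// Sanity check: the file round-trips back to the same objects.
const roundTripped = JSON.parse(readFileSync(outPath, 'utf8'));
console.log(`wrote ${roundTripped.length} items to ${outPath}`);
```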
Add ProxiesAPI (proxy-backed fetch URL)
When you scale up (more pages, more keywords, more targets), you’ll start to see:
- 429 / rate limits
- random timeouts
- occasional bot pages
That’s where ProxiesAPI can help at the network layer.
A common pattern is to transform the target URL:
`https://example.com/list?page=1`
into a ProxiesAPI fetch URL:
`https://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https%3A%2F%2Fexample.com%2Flist%3Fpage%3D1`
Here’s a helper:
export function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');
  const u = new URL('https://api.proxiesapi.com/');
  u.searchParams.set('auth_key', key);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}
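Note that searchParams.set handles the percent-encoding of the target URL for you. A standalone check (the helper is copied here, and 'demo_key' is a placeholder, not a real credential):

```javascript
// 'demo_key' is a placeholder key for this demo, not a real credential.
process.env.PROXIESAPI_KEY = process.env.PROXIESAPI_KEY || 'demo_key';

// Standalone copy of the helper from above.
function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');
  const u = new URL('https://api.proxiesapi.com/');
  u.searchParams.set('auth_key', key);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}

// The target URL is percent-encoded into the `url` param
// (`:` → %3A, `/` → %2F, `?` → %3F, `=` → %3D).
console.log(proxiesApiUrl('https://example.com/list?page=1'));
```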
Then in fetchHtml, call fetch(proxiesApiUrl(url)) instead of fetch(url).
A clean implementation (toggle)
import { fetch } from 'undici';
import { proxiesApiUrl } from './proxiesapi.js';
export async function fetchHtml(url, { useProxiesApi = false, timeoutMs = 30000 } = {}) {
  const finalUrl = useProxiesApi ? proxiesApiUrl(url) : url;
  // ... same fetch/retry logic as before
}
Comparison: Cheerio vs alternatives
A quick practical comparison:
| Tool | Best for | Not great for |
|---|---|---|
| Cheerio | HTML parsing, fast extraction, server-rendered pages | JS-rendered SPAs |
| Playwright | JS-heavy sites, user flows, complex DOM | heavier + slower |
| Puppeteer | similar to Playwright | similar tradeoffs |
| JSDOM | DOM APIs, small scripts | slower than Cheerio |
Cheerio is often the right default for SEO/HTML pages.
Practical advice (what keeps scrapers alive)
- Cache HTML for debugging (save `page-3.html` when a parse fails)
- Use a seen set to avoid duplicates
- Treat “blocked” as a first-class error and retry/back off
- Keep parsing logic separate from crawl logic (easier to fix selectors)
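The first bullet (caching HTML on parse failure) can be sketched like this; parseCards here is stubbed to always throw, purely to demonstrate the failure path:

```javascript
import { writeFileSync, existsSync } from 'node:fs';

// Stub that simulates a parser whose selectors stopped matching.
function parseCards() {
  throw new Error('no .card elements found');
}

function parseOrDump(html, page) {
  try {
    return parseCards(html);
  } catch (e) {
    // Save the exact HTML that broke parsing, then re-throw.
    const dumpPath = `page-${page}.html`;
    writeFileSync(dumpPath, html, 'utf8');
    console.error(`parse failed on page ${page}; HTML saved to ${dumpPath}`);
    throw e;
  }
}

try {
  parseOrDump('<html><body>unexpected layout</body></html>', 3);
} catch {
  // Expected in this demo: the dump file now exists for inspection.
}
console.log(existsSync('page-3.html')); // → true
```

Now when a selector silently breaks on page 3 of 50, you have the exact HTML to debug against instead of re-fetching.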
Where ProxiesAPI fits (honestly)
ProxiesAPI doesn’t replace good engineering. It helps with:
- proxying requests when you can’t rely on a single IP
- reducing failure rate on long paginations
You still need:
- correct selectors
- respectful request pacing
- error handling and logging
QA checklist
- Your scraper fetches HTML with a timeout
- Parsing returns the expected `title`/`url`/`desc`
- Pagination increases the unique item count
- JSON export is valid
- You can toggle ProxiesAPI on/off