Node.js Web Scraping with Cheerio: Quick Start Guide (Requests + Proxies + Pagination)

If you’re building web scrapers in Node.js, Cheerio is the workhorse:

  • fast, lightweight HTML parsing
  • jQuery-style selectors ($('a.some-class'))
  • perfect for server-rendered pages and “fetch → parse → export” pipelines

This guide is a practical quick start for web scraping in Node.js with Cheerio.

We’ll build a reusable scraper that:

  1. fetches HTML with timeouts + retries
  2. parses a real page structure with Cheerio
  3. paginates across list pages
  4. exports JSON
  5. optionally routes requests through ProxiesAPI when you need more stability

Make your Node.js scrapers more reliable with ProxiesAPI

Once you paginate or crawl lots of pages, IP-based throttling becomes a real failure mode. ProxiesAPI gives you a proxy-backed fetch URL so your Cheerio scrapers can retry with fewer sudden blocks.


When Cheerio is a good fit (and when it isn’t)

Cheerio works great when the page content exists in the raw HTML.

Use Cheerio for:

  • blogs, docs sites, directories
  • e-commerce category pages (sometimes)
  • news sites
  • community sites with server-rendered HTML

Don’t use Cheerio (alone) when:

  • the content is loaded via XHR after page load
  • the HTML is mostly a JS app shell

In those cases you either:

  • hit the site’s JSON endpoints directly (often best)
  • or use a headless browser (Playwright/Puppeteer)

Setup

mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
npm i cheerio
npm i undici

We’ll use:

  • undici for HTTP (fast; the fetch built into modern Node also works)
  • cheerio for parsing

Step 1: A robust fetch() wrapper (timeouts + retries)

Scrapers fail at the network layer first.

A good fetch wrapper should:

  • set a timeout
  • retry with backoff
  • optionally detect “blocked” pages

import { fetch } from 'undici';

function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

function looksBlocked(html) {
  const h = (html || '').toLowerCase();
  return [
    'captcha',
    'access denied',
    'unusual traffic',
    'verify you are',
  ].some((s) => h.includes(s));
}

export async function fetchHtml(url, { timeoutMs = 30000, retries = 4 } = {}) {
  let lastErr;

  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    const controller = new AbortController();
    const t = setTimeout(() => controller.abort(), timeoutMs);

    try {
      const res = await fetch(url, {
        signal: controller.signal,
        headers: {
          'user-agent': 'Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
          'accept-language': 'en-US,en;q=0.9',
        },
      });

      if (!res.ok) {
        throw new Error(`HTTP ${res.status} for ${url}`);
      }

      const html = await res.text();
      if (looksBlocked(html)) {
        throw new Error('Blocked page detected (captcha/bot page)');
      }

      return html;
    } catch (e) {
      lastErr = e;
      if (attempt > retries) break; // no point sleeping after the final attempt
      const backoff = Math.min(2000 * 2 ** (attempt - 1), 15000);
      const jitter = Math.floor(Math.random() * 500);
      await sleep(backoff + jitter);
    } finally {
      clearTimeout(t); // clear the abort timer even when fetch throws
    }
  }

  throw lastErr;
}
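With the defaults above (retries = 4, a 2-second base, a 15-second cap), the backoff schedule before jitter works out like this:

```javascript
// Reproduce the backoff formula from fetchHtml for the default 4 retries
const schedule = [1, 2, 3, 4].map(
  (attempt) => Math.min(2000 * 2 ** (attempt - 1), 15000)
);

console.log(schedule); // [ 2000, 4000, 8000, 15000 ]
```

The up-to-500 ms of random jitter is added on top so that parallel workers don't all retry in lockstep.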

Step 2: Parse with Cheerio (real selectors)

Cheerio gives you a $ function like jQuery.

Example: parse a list page containing “cards” with:

  • title link
  • short description

import * as cheerio from 'cheerio';

export function parseCards(html, { baseUrl }) {
  const $ = cheerio.load(html);

  const items = [];

  $('.card').each((_, el) => {
    const title = $(el).find('a.card__title').text().trim();
    const href = $(el).find('a.card__title').attr('href');
    const url = href ? new URL(href, baseUrl).toString() : null;

    const desc = $(el).find('.card__desc').text().trim();

    if (!url) return;

    items.push({ title: title || null, desc: desc || null, url });
  });

  return items;
}

Tip: Always view the site HTML (right-click → “View page source”) before you assume the content is there.


Step 3: Pagination (the pattern you’ll reuse everywhere)

Most list pages use either:

  • ?page=2
  • /page/2/
  • cursor-based pagination in JSON endpoints

Here’s a reusable “query param page” pagination:

export function setPage(url, page) {
  const u = new URL(url);
  u.searchParams.set('page', String(page));
  return u.toString();
}
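Because setPage leans on the WHATWG URL API, it overwrites an existing page param instead of appending a duplicate. A quick check (export dropped so it runs standalone):

```javascript
function setPage(url, page) {
  const u = new URL(url);
  u.searchParams.set('page', String(page));
  return u.toString();
}

console.log(setPage('https://example.com/list', 3));
// https://example.com/list?page=3
console.log(setPage('https://example.com/list?page=1&sort=new', 3));
// https://example.com/list?page=3&sort=new
```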

And the crawler:

import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';
import { setPage } from './paginate.js';

export async function crawlList(startUrl, { pages = 5 } = {}) {
  const all = [];
  const seen = new Set();

  for (let p = 1; p <= pages; p++) {
    const url = p === 1 ? startUrl : setPage(startUrl, p);

    const html = await fetchHtml(url);
    const batch = parseCards(html, { baseUrl: startUrl });

    for (const it of batch) {
      if (seen.has(it.url)) continue;
      seen.add(it.url);
      all.push(it);
    }

    console.log(`page ${p}/${pages} -> ${batch.length} items (total ${all.length})`);
  }

  return all;
}
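Step 4 from the intro, exporting JSON, is small enough to inline at the end of your run script. A sketch, where the stub array stands in for a real crawlList result:

```javascript
import { writeFileSync } from 'node:fs';

// Stub data standing in for the output of crawlList()
const all = [
  { title: 'First post', desc: 'A short description.', url: 'https://example.com/posts/1' },
];

// Pretty-print so diffs between runs stay readable
writeFileSync('items.json', JSON.stringify(all, null, 2));
console.log(`wrote ${all.length} items to items.json`);
```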

Add ProxiesAPI (proxy-backed fetch URL)

When you scale up (more pages, more keywords, more targets), you’ll start to see:

  • 429 / rate limits
  • random timeouts
  • occasional bot pages

That’s where ProxiesAPI can help at the network layer.

A common pattern is to transform:

  • target URL: https://example.com/list?page=1

into a ProxiesAPI fetch URL:

  • https://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https%3A%2F%2Fexample.com%2Flist%3Fpage%3D1

Here’s a helper:

export function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');

  const u = new URL('https://api.proxiesapi.com/');
  u.searchParams.set('auth_key', key);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}
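Because searchParams.set percent-encodes its value, the target URL survives the wrapping intact, query string and all. With a dummy key (export dropped so the sketch runs standalone):

```javascript
function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');

  const u = new URL('https://api.proxiesapi.com/');
  u.searchParams.set('auth_key', key);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}

process.env.PROXIESAPI_KEY = 'YOUR_KEY'; // dummy value for the demo
console.log(proxiesApiUrl('https://example.com/list?page=1'));
// https://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https%3A%2F%2Fexample.com%2Flist%3Fpage%3D1
```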

Then in fetchHtml, call fetch(proxiesApiUrl(url)) instead of fetch(url).

A clean implementation (toggle)

import { fetch } from 'undici';
import { proxiesApiUrl } from './proxiesapi.js';

export async function fetchHtml(url, { useProxiesApi = false, timeoutMs = 30000 } = {}) {
  const finalUrl = useProxiesApi ? proxiesApiUrl(url) : url;
  // ... same fetch/retry logic as before
}

Comparison: Cheerio vs alternatives

A quick practical comparison:

Tool         Best for                                              Not great for
Cheerio      HTML parsing, fast extraction, server-rendered pages  JS-rendered SPAs
Playwright   JS-heavy sites, user flows, complex DOM               heavier + slower
Puppeteer    similar to Playwright                                 similar tradeoffs
JSDOM        DOM APIs, small scripts                               slower than Cheerio

Cheerio is often the right default for SEO/HTML pages.


Practical advice (what keeps scrapers alive)

  • Cache HTML for debugging (save page-3.html when parse fails)
  • Use a seen set to avoid duplicates
  • Treat “blocked” as a first-class error and retry/back off
  • Keep parsing logic separate from crawl logic (easier to fix selectors)

Where ProxiesAPI fits (honestly)

ProxiesAPI doesn’t replace good engineering. It helps with:

  • proxying requests when you can’t rely on a single IP
  • reducing failure rate on long paginations

You still need:

  • correct selectors
  • respectful request pacing
  • error handling and logging

QA checklist

  • Your scraper fetches HTML with a timeout
  • Parsing returns expected title/url/desc
  • Pagination increases unique item count
  • JSON export is valid
  • You can toggle ProxiesAPI on/off

Related guides

  • Scrape TripAdvisor Hotel Reviews with Python (Pagination + Rate Limits)
    Extract TripAdvisor hotel review text, ratings, dates, and reviewer metadata with a resilient Python scraper (pagination, retries, and a proxy-backed fetch layer via ProxiesAPI).
  • Scrape Vinted Listings with Python: Search, Prices, Images, and Pagination
    Build a dataset from Vinted search results (title, price, size, condition, seller, images) with a production-minded Python scraper + a proxy-backed fetch layer via ProxiesAPI.
  • How to Scrape Etsy Product Listings with Python (ProxiesAPI + Pagination)
    Extract title, price, rating, and shop info from Etsy search pages reliably with rotating proxies, retries, and pagination.
  • Web Scraping with Ruby: Nokogiri + HTTParty Tutorial (2026)
    A practical Ruby scraping guide: fetch pages with HTTParty, parse HTML with Nokogiri, handle pagination, add retries, and rotate proxies responsibly.