Node.js Web Scraping with Cheerio: Quick Start Guide (Requests + Proxies + Pagination)
If you’re building web scrapers in Node.js, Cheerio is the workhorse:
- fast, lightweight HTML parsing
- jQuery-style selectors (`$('a.some-class')`)
- perfect for server-rendered pages and “fetch → parse → export” pipelines
This guide is a practical quick start to Node.js web scraping with Cheerio.
We’ll build a reusable scraper that:
- fetches HTML with timeouts + retries
- parses a real page structure with Cheerio
- paginates across list pages
- exports JSON
- optionally routes requests through ProxiesAPI when you need more stability
Once you paginate or crawl lots of pages, IP-based throttling becomes a real failure mode. ProxiesAPI gives you a proxy-backed fetch URL so your Cheerio scrapers can retry with fewer sudden blocks.
When Cheerio is a good fit (and when it isn’t)
Cheerio works great when the page content exists in the raw HTML.
Use Cheerio for:
- blogs, docs sites, directories
- e-commerce category pages (sometimes)
- news sites
- community sites with server-rendered HTML
Don’t use Cheerio (alone) when:
- the content is loaded via XHR after page load
- the HTML is mostly a JS app shell
In those cases you either:
- hit the site’s JSON endpoints directly (often best)
- or use a headless browser (Playwright/Puppeteer)
Setup
mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
npm i cheerio
npm i undici
We’ll use:
- `undici` for HTTP (fast; the built-in fetch in modern Node is also fine)
- `cheerio` for parsing
Step 1: A robust fetch() wrapper (timeouts + retries)
Scrapers fail at the network layer first.
A good fetch wrapper should:
- set a timeout
- retry with backoff
- optionally detect “blocked” pages
import { fetch } from 'undici';
function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

function looksBlocked(html) {
  const h = (html || '').toLowerCase();
  return [
    'captcha',
    'access denied',
    'unusual traffic',
    'verify you are',
  ].some((s) => h.includes(s));
}
export async function fetchHtml(url, { timeoutMs = 30000, retries = 4 } = {}) {
  let lastErr;
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    const controller = new AbortController();
    const t = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(url, {
        signal: controller.signal,
        headers: {
          'user-agent': 'Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
          'accept-language': 'en-US,en;q=0.9',
        },
      });
      if (!res.ok) {
        throw new Error(`HTTP ${res.status} for ${url}`);
      }
      const html = await res.text();
      if (looksBlocked(html)) {
        throw new Error('Blocked page detected (captcha/bot page)');
      }
      return html;
    } catch (e) {
      lastErr = e;
      if (attempt > retries) break; // out of retries — don't sleep before throwing
      const backoff = Math.min(2000 * 2 ** (attempt - 1), 15000);
      const jitter = Math.floor(Math.random() * 500);
      await sleep(backoff + jitter);
    } finally {
      clearTimeout(t); // always clear the abort timer, even on error
    }
  }
  throw lastErr;
}
Step 2: Parse with Cheerio (real selectors)
Cheerio gives you a $ function like jQuery.
Example: parse a list page containing “cards” with:
- title link
- short description
import * as cheerio from 'cheerio';
export function parseCards(html, { baseUrl }) {
  const $ = cheerio.load(html);
  const items = [];
  $('.card').each((_, el) => {
    const link = $(el).find('a.card__title');
    const title = link.text().trim();
    const href = link.attr('href');
    const url = href ? new URL(href, baseUrl).toString() : null;
    const desc = $(el).find('.card__desc').text().trim();
    if (!url) return;
    items.push({ title: title || null, desc: desc || null, url });
  });
  return items;
}
Tip: Always view the site HTML (right-click → “View page source”) before you assume the content is there.
Step 3: Pagination (the pattern you’ll reuse everywhere)
Most list pages use one of:
- `?page=2` query params
- `/page/2/` path segments
- cursor-based pagination in JSON endpoints
Here’s a reusable “query param page” pagination:
export function setPage(url, page) {
  const u = new URL(url);
  u.searchParams.set('page', String(page));
  return u.toString();
}
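Because setPage goes through the URL API, existing query params survive. A quick standalone check (the helper is repeated here so the snippet runs on its own):

```javascript
// Same helper as above, repeated so this snippet runs standalone.
function setPage(url, page) {
  const u = new URL(url);
  u.searchParams.set('page', String(page));
  return u.toString();
}

// Existing params are preserved; only `page` is replaced in place.
console.log(setPage('https://example.com/list?page=1&sort=new', 3));
// → https://example.com/list?page=3&sort=new

// If `page` is absent, it gets appended.
console.log(setPage('https://example.com/list', 2));
// → https://example.com/list?page=2
```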
And the crawler:
import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';
import { setPage } from './paginate.js';
export async function crawlList(startUrl, { pages = 5 } = {}) {
  const all = [];
  const seen = new Set();
  for (let p = 1; p <= pages; p++) {
    const url = p === 1 ? startUrl : setPage(startUrl, p);
    const html = await fetchHtml(url);
    const batch = parseCards(html, { baseUrl: startUrl });
    for (const it of batch) {
      if (seen.has(it.url)) continue;
      seen.add(it.url);
      all.push(it);
    }
    console.log(`page ${p}/${pages} -> ${batch.length} items (total ${all.length})`);
  }
  return all;
}
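The intro promised JSON export, and that last step is just fs. A minimal sketch (the items array and output path here are placeholders; in practice the array comes from crawlList):

```javascript
import { writeFileSync, readFileSync } from 'node:fs';

// Placeholder results; in practice this comes from crawlList().
const items = [
  { title: 'First card', desc: 'Hello', url: 'https://example.com/a' },
  { title: 'Second card', desc: null, url: 'https://example.com/b' },
];

// Pretty-printed JSON so diffs stay readable between runs.
const outPath = 'results.json';
writeFileSync(outPath, JSON.stringify(items, null, 2) + '\n', 'utf8');

// Sanity check: the file round-trips back to the same objects.
const roundTripped = JSON.parse(readFileSync(outPath, 'utf8'));
console.log(`wrote ${roundTripped.length} items to ${outPath}`);
```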
Add ProxiesAPI (proxy-backed fetch URL)
When you scale up (more pages, more keywords, more targets), you’ll start to see:
- 429 / rate limits
- random timeouts
- occasional bot pages
That’s where ProxiesAPI can help at the network layer.
A common pattern is to transform the target URL:
`https://example.com/list?page=1`
into a ProxiesAPI fetch URL:
`https://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https%3A%2F%2Fexample.com%2Flist%3Fpage%3D1`
Here’s a helper:
export function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');
  const u = new URL('https://api.proxiesapi.com/');
  u.searchParams.set('auth_key', key);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}
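Note that searchParams.set handles the percent-encoding of the target URL for you. A standalone check (the helper is copied here, and 'demo_key' is a placeholder, not a real credential):

```javascript
// 'demo_key' is a placeholder key for this demo, not a real credential.
process.env.PROXIESAPI_KEY = process.env.PROXIESAPI_KEY || 'demo_key';

// Standalone copy of the helper from above.
function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');
  const u = new URL('https://api.proxiesapi.com/');
  u.searchParams.set('auth_key', key);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}

// The target URL is percent-encoded into the `url` param
// (`:` → %3A, `/` → %2F, `?` → %3F, `=` → %3D).
console.log(proxiesApiUrl('https://example.com/list?page=1'));
```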
Then in fetchHtml, call fetch(proxiesApiUrl(url)) instead of fetch(url).
A clean implementation (toggle)
import { fetch } from 'undici';
import { proxiesApiUrl } from './proxiesapi.js';
export async function fetchHtml(url, { useProxiesApi = false, timeoutMs = 30000 } = {}) {
  const finalUrl = useProxiesApi ? proxiesApiUrl(url) : url;
  // ... same fetch/retry logic as before
}
Comparison: Cheerio vs alternatives
A quick practical comparison:
| Tool | Best for | Not great for |
|---|---|---|
| Cheerio | HTML parsing, fast extraction, server-rendered pages | JS-rendered SPAs |
| Playwright | JS-heavy sites, user flows, complex DOM | heavier + slower |
| Puppeteer | similar to Playwright | similar tradeoffs |
| JSDOM | DOM APIs, small scripts | slower than Cheerio |
Cheerio is often the right default for SEO/HTML pages.
Practical advice (what keeps scrapers alive)
- Cache HTML for debugging (save `page-3.html` when a parse fails)
- Use a seen set to avoid duplicates
- Treat “blocked” as a first-class error and retry/back off
- Keep parsing logic separate from crawl logic (easier to fix selectors)
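The first bullet (caching HTML on parse failure) can be sketched like this; parseCards here is stubbed to always throw, purely to demonstrate the failure path:

```javascript
import { writeFileSync, existsSync } from 'node:fs';

// Stub that simulates a parser whose selectors stopped matching.
function parseCards() {
  throw new Error('no .card elements found');
}

function parseOrDump(html, page) {
  try {
    return parseCards(html);
  } catch (e) {
    // Save the exact HTML that broke parsing, then re-throw.
    const dumpPath = `page-${page}.html`;
    writeFileSync(dumpPath, html, 'utf8');
    console.error(`parse failed on page ${page}; HTML saved to ${dumpPath}`);
    throw e;
  }
}

try {
  parseOrDump('<html><body>unexpected layout</body></html>', 3);
} catch {
  // Expected in this demo: the dump file now exists for inspection.
}
console.log(existsSync('page-3.html')); // → true
```

Now when a selector silently breaks on page 3 of 50, you have the exact HTML to debug against instead of re-fetching.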
Where ProxiesAPI fits (honestly)
ProxiesAPI doesn’t replace good engineering. It helps with:
- proxying requests when you can’t rely on a single IP
- reducing failure rate on long paginations
You still need:
- correct selectors
- respectful request pacing
- error handling and logging
QA checklist
- Your scraper fetches HTML with a timeout
- Parsing returns the expected `title`/`url`/`desc`
- Pagination increases the unique item count
- JSON export is valid
- You can toggle ProxiesAPI on/off