Node.js Web Scraping with Cheerio: Quick Start Guide
If you’re scraping in Node.js, Cheerio is the fastest way to parse server-rendered HTML.
It gives you a jQuery-like API ($('selector')) without the overhead of launching a browser.
This quick start guide shows how to go from “I can parse one page” to “I can run a real crawl”:
- fetch a page with timeouts + retries
- load HTML into Cheerio
- extract fields with real selectors
- paginate safely
- export JSONL (stream-friendly)
- plug in ProxiesAPI so your requests don’t fall over at scale
Cheerio is fast—but production scrapers fail in the network layer first. ProxiesAPI helps keep your requests stable as you add pagination, concurrency, and long-running crawls.
When Cheerio is the right tool (and when it isn’t)
Use Cheerio when:
- the page is mostly server-rendered HTML
- the data you want is in the initial response
- you want speed and low cost
Avoid Cheerio (or combine it with a browser) when:
- content loads only after JS runs
- you need to click, scroll, or solve an interactive flow
A common hybrid architecture is:
- Cheerio for 80% of pages (fast)
- browser automation for the hard 20%
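Here is a minimal sketch of that split. It assumes the fetchHtml and parseCards helpers built later in this guide live in ./fetch.js and ./parse.js (file names are my choice), with a placeholder where a Playwright-based renderer would plug in:
// Hybrid sketch: try the cheap Cheerio path first, fall back to a browser.
import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';

async function renderWithBrowser(url) {
  // Placeholder: implement with Playwright (page.goto + page.content()) when you hit JS-only pages.
  throw new Error(`browser rendering not implemented for ${url}`);
}

async function scrapePage(url) {
  const html = await fetchHtml(url); // fast path: plain HTTP + Cheerio
  const items = parseCards(html);
  if (items.length > 0) return items;
  // An empty result often means the content is rendered client-side.
  return parseCards(await renderWithBrowser(url));
}

export { scrapePage };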
Project setup
mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
npm i cheerio undici p-retry p-limit dotenv
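One setup detail worth calling out: npm init -y generates a CommonJS package.json, while every example below uses ESM import syntax and top-level await, so add "type": "module" (or use .mjs file extensions). A rough package.json sketch, with illustrative version ranges rather than pinned recommendations:
{
  "name": "cheerio-scraper",
  "type": "module",
  "dependencies": {
    "cheerio": "^1.0.0",
    "undici": "^6.0.0",
    "p-retry": "^6.0.0",
    "p-limit": "^5.0.0",
    "dotenv": "^16.0.0"
  }
}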
Create .env:
PROXIESAPI_KEY=your_api_key_here
We’ll use:
- undici: Node’s modern HTTP client
- cheerio: HTML parsing
- p-retry: retries with backoff
- p-limit: concurrency limits
ProxiesAPI request helper (with retries)
Scrapers fail in the network layer first.
This helper:
- routes requests through ProxiesAPI
- uses a realistic timeout
- retries transient HTTP failures
import 'dotenv/config';
import { request } from 'undici';
import pRetry, { AbortError } from 'p-retry';

// Build the ProxiesAPI gateway URL for a target page.
function proxiesApiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  if (!key) throw new Error('Missing PROXIESAPI_KEY');
  return `https://api.proxiesapi.com/?auth_key=${key}&url=${encodeURIComponent(targetUrl)}`;
}

async function fetchHtml(url) {
  return pRetry(async () => {
    const gateway = proxiesApiUrl(url);
    const { statusCode, body } = await request(gateway, {
      method: 'GET',
      headers: {
        'user-agent':
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36',
        accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'accept-language': 'en-US,en;q=0.9'
      },
      bodyTimeout: 40_000,
      headersTimeout: 10_000
    });
    // Transient failures: throw so p-retry tries again with backoff.
    if ([403, 408, 429, 500, 502, 503, 504].includes(statusCode)) {
      throw new Error(`Transient HTTP ${statusCode}`);
    }
    // Anything else non-2xx (e.g. 404) is permanent: abort instead of retrying.
    if (statusCode < 200 || statusCode >= 300) {
      throw new AbortError(`HTTP ${statusCode}`);
    }
    return await body.text();
  }, {
    retries: 5,
    minTimeout: 800,
    maxTimeout: 10_000
  });
}

export { fetchHtml };
Parse HTML with Cheerio (real selectors)
Cheerio uses CSS selectors.
A good pattern is:
- fetch HTML
- load it with cheerio.load(html)
- extract a list of items with a stable container selector
- extract fields relative to each item
Here’s a toy example for a blog-like page:
import * as cheerio from 'cheerio';

function parseCards(html) {
  const $ = cheerio.load(html);
  const cards = [];
  // Broad container selector: tighten this to the site's real card/list markup.
  $('.card, article, .post, li').each((_, el) => {
    const title = $(el).find('h1,h2,h3').first().text().trim() || null;
    const link = $(el).find('a[href]').first().attr('href') || null;
    if (!title || !link) return;
    cards.push({ title, link });
  });
  return cards;
}

export { parseCards };
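To see fetching and parsing together, a quick check; it assumes the fetch helper is saved as fetch.js and the parser above as parse.js (both file names are my choice), and uses a placeholder URL:
import { fetchHtml } from './fetch.js';
import { parseCards } from './parse.js';

// Placeholder URL: point this at a page you are allowed to scrape.
const html = await fetchHtml('https://example.com/blog');
console.table(parseCards(html).slice(0, 5));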
Selector sanity check
Before you write selectors, confirm your Node setup runs at all:
node -e "console.log('ok')"
Then inspect HTML in DevTools:
- right click → Inspect
- find a stable parent container
- prefer semantic tags (article, h2) over hashed class names
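If you want to iterate on selectors without re-fetching the page, save one response to disk and test selectors against the snapshot. This sketch assumes a snapshot.html file you saved earlier (the file name is arbitrary):
import fs from 'node:fs';
import * as cheerio from 'cheerio';

// Load a saved snapshot and probe candidate selectors offline.
const html = fs.readFileSync('snapshot.html', 'utf8');
const $ = cheerio.load(html);
console.log('articles:', $('article').length);
console.log('first h2:', $('article h2').first().text().trim());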
Pagination patterns (what you’ll see in the wild)
Most pagination falls into one of these:
- ?page=2
- /page/2/
- ?offset=20
You don’t need a fancy crawler to handle this.
Start with:
- a function that builds the next URL
- a max pages cap
- a seen set to avoid loops
function nextPageUrl(baseUrl, page) {
  const u = new URL(baseUrl);
  u.searchParams.set('page', String(page));
  return u.toString();
}

export { nextPageUrl };
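For the path-style pattern (/page/2/) listed above, a small variant helper works. This is a sketch only, and the path segment will need adjusting to whatever the target site actually uses:
// Path-style pagination: /blog -> /blog/page/2/
function nextPageUrlByPath(baseUrl, page) {
  const u = new URL(baseUrl);
  // Strip any existing /page/N/ segment and trailing slash, then append the new one.
  const cleanPath = u.pathname.replace(/\/page\/\d+\/?$/, '').replace(/\/$/, '');
  u.pathname = `${cleanPath}/page/${page}/`;
  return u.toString();
}

export { nextPageUrlByPath };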
A complete quick start scraper (JSONL export)
This script:
- crawls N pages
- parses cards from each page
- de-dupes by URL
- writes JSONL so you can stream results
import fs from 'node:fs';
import * as cheerio from 'cheerio';
import { fetchHtml } from './fetch.js';

function parseCards(html, baseUrl) {
  const $ = cheerio.load(html);
  const out = [];
  $('a[href]').each((_, a) => {
    const href = $(a).attr('href');
    const text = $(a).text().trim();
    if (!href || !text) return;
    // avoid nav/footer junk
    if (text.length < 8) return;
    let abs;
    try {
      // resolve relative links against the current page URL
      abs = new URL(href, baseUrl).toString();
    } catch {
      return;
    }
    out.push({ title: text, url: abs });
  });
  return out;
}

function nextPageUrl(baseUrl, page) {
  const u = new URL(baseUrl);
  u.searchParams.set('page', String(page));
  return u.toString();
}

async function run({ startUrl, pages = 3 }) {
  const seen = new Set();
  const out = fs.createWriteStream('results.jsonl', { flags: 'w' });
  for (let p = 1; p <= pages; p++) {
    const url = p === 1 ? startUrl : nextPageUrl(startUrl, p);
    const html = await fetchHtml(url);
    // use the current page URL (not startUrl) as the base for relative links
    const items = parseCards(html, url);
    console.log('page', p, 'items', items.length);
    for (const it of items) {
      if (seen.has(it.url)) continue;
      seen.add(it.url);
      out.write(JSON.stringify(it) + '\n');
    }
  }
  out.end();
  console.log('unique items', seen.size);
}

await run({
  startUrl: 'https://example.com/blog',
  pages: 5
});
Make it production-ish
- Add p-limit to cap concurrency when fetching detail pages (see the sketch after this list)
- Persist crawl state (SQLite)
- Record HTTP status + error strings for debugging
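A minimal concurrency sketch with p-limit, assuming detailUrls is an array of item URLs you collected from the listing pages (the variable name and the limit of 5 are my choices):
import pLimit from 'p-limit';
import { fetchHtml } from './fetch.js';

const limit = pLimit(5); // at most 5 requests in flight at once

async function fetchDetails(detailUrls) {
  const tasks = detailUrls.map((url) =>
    limit(async () => {
      const html = await fetchHtml(url);
      return { url, bytes: html.length };
    })
  );
  return Promise.all(tasks);
}

export { fetchDetails };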
Comparison: Cheerio vs Playwright
| Feature | Cheerio | Playwright |
|---|---|---|
| Cost per page | Low | Higher |
| Handles JS-rendered sites | No | Yes |
| Speed | Very fast | Slower |
| Best for | HTML pages, feeds | Interactive flows |
The winning approach for most products is: Cheerio first, browser only when needed.
Common scraping mistakes in Node.js
- No timeouts → your job hangs.
- No retries → transient 429/503 kills your run.
- No dedupe → you store the same item 10 times.
- Selectors too brittle → one redesign breaks everything.
The fix is boring engineering:
- timeouts
- retries
- backoff
- dedupe
- debug snapshots (see the sketch after this list)
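Here is one way to do debug snapshots: when a page parses to zero items, write the raw HTML to disk so you can inspect what actually came back. The file name pattern and the zero-items trigger are my choices, not a fixed convention:
import fs from 'node:fs';

// Save a snapshot when parsing looks wrong, so you can open the HTML
// in an editor or DevTools and see what the server actually returned.
function maybeSnapshot(html, items, page) {
  if (items.length === 0) {
    fs.writeFileSync(`debug-page-${page}.html`, html);
    console.warn(`page ${page}: 0 items parsed, wrote debug-page-${page}.html`);
  }
}

export { maybeSnapshot };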
Where ProxiesAPI fits (honestly)
Cheerio is parsing.
But reliability comes from your fetch layer:
- if you’re crawling page lists, ProxiesAPI can reduce random 403/429 spikes
- if you’re scraping across multiple domains, you get a consistent interface
- if you’re pulling thousands of pages, stability matters more than micro-optimizations
Start with the helper above, keep concurrency modest, and you’ll have a scraper you can actually run nightly.