Web Scraping with JavaScript and Node.js: A Complete Practical Tutorial (2026)
This is a hands-on guide to web scraping with JavaScript + Node.js.
You’ll learn two scraping modes:
- Fast mode (HTML): fetch + cheerio, best for server-rendered pages
- Fallback mode (browser): playwright, best for JS-heavy pages
And we’ll wrap both in a production-friendly structure:
- timeouts
- retries with exponential backoff
- concurrency control
- rate limiting
- proxy support (including ProxiesAPI)
If your goal is: “I want data reliably, at scale” — this is the stack.
When your Node scraper goes from 20 URLs to 20,000, you’ll see more timeouts, blocks, and flaky responses. ProxiesAPI gives you a proxy layer (rotation + reputation) so you can keep the crawl stable without hand-rolling infra.
When Node.js is a great choice for scraping
Node shines when:
- you already build with JS/TS
- you want to share parsing code with a front-end
- you need a great async I/O model
- you want easy access to Playwright/Puppeteer
Python still has a massive ecosystem, but Node is absolutely production-grade for scraping.
Project setup
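Create a folder and initialize it. The examples below use ES module syntax (import/export and top-level await), so mark the package as a module:
mkdir node-scraper
cd node-scraper
npm init -y
npm pkg set type=module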
Install dependencies:
npm install cheerio p-limit
Node 18+ includes fetch built-in.
If you want the browser fallback:
npm install playwright
npx playwright install
Part 1: Fast HTML scraping with fetch + Cheerio
1) Fetch a page with a real timeout
Node’s built-in fetch doesn’t include a timeout by default. Use AbortController:
// fetch-with-timeout.js
export async function fetchWithTimeout(url, { timeoutMs = 20000, ...opts } = {}) {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      ...opts,
      signal: controller.signal,
      headers: {
        "user-agent":
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        ...(opts.headers || {}),
      },
    });
    if (!res.ok) {
      throw new Error(`HTTP ${res.status} for ${url}`);
    }
    const text = await res.text();
    return text;
  } finally {
    clearTimeout(id);
  }
}
2) Parse HTML with Cheerio
Cheerio gives you a jQuery-like API.
Example: scrape a blog list page into {title, url}.
// parse-list.js
import * as cheerio from "cheerio";

export function parseList(html, baseUrl) {
  const $ = cheerio.load(html);
  const items = [];
  $("article a").each((_, el) => {
    const href = $(el).attr("href");
    const title = $(el).text().trim();
    if (!href || !title) return;
    try {
      // new URL() resolves relative hrefs against baseUrl and keeps absolute ones
      items.push({ title, url: new URL(href, baseUrl).toString() });
    } catch {
      // skip hrefs that cannot be parsed as a URL
    }
  });
  return items;
}
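To make the href handling concrete, here is a small sketch of how the built-in URL constructor resolves the kinds of href you meet on real pages. The resolveHref helper, the base URL, and the sample hrefs are made up for illustration.

```javascript
// Sketch: how new URL(href, base) resolves different kinds of href.
// resolveHref, the base URL, and the samples are illustrative assumptions.
function resolveHref(href, baseUrl) {
  try {
    return new URL(href, baseUrl).toString();
  } catch {
    return null; // href or base could not be parsed as a URL
  }
}

const base = "https://example.com/blog/";
const samples = [
  "/post/1",                 // root-relative
  "post/2",                  // relative to the current directory
  "https://other.example/x", // already absolute
  "//cdn.example.com/a.js",  // protocol-relative
  "#top",                    // fragment only
];
for (const href of samples) {
  console.log(href, "->", resolveHref(href, base));
}
```

Because the constructor already handles absolute URLs, there is no need to special-case hrefs that start with "http".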
3) Run it
// index.js
import { fetchWithTimeout } from "./fetch-with-timeout.js";
import { parseList } from "./parse-list.js";

// Named START_URL so it does not shadow the global URL class
const START_URL = "https://example.com";
const html = await fetchWithTimeout(START_URL, { timeoutMs: 20000 });
const items = parseList(html, START_URL);
console.log("items:", items.length);
console.log(items.slice(0, 5));
Run:
node index.js
Part 2: Make it production-friendly (retries + concurrency)
Retries with backoff
You can roll a tiny retry helper:
// retry.js
export async function retry(fn, { attempts = 5, minDelayMs = 500 } = {}) {
  let lastErr;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i === attempts) break; // no point sleeping after the final attempt
      // exponential backoff: 500 ms, 1 s, 2 s, ... capped at 20 s
      const delay = Math.min(20000, minDelayMs * 2 ** (i - 1));
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastErr;
}
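The delay schedule is easier to reason about when it is pulled out into its own function. The sketch below is one possible shape, not part of any library: backoffDelay, maxDelayMs, jitter, and rng are hypothetical names. It also shows optional "full jitter", which randomizes each delay so parallel workers do not all retry at the same instant.

```javascript
// Sketch: exponential backoff schedule with optional full jitter.
// backoffDelay and its options are hypothetical helpers for illustration.
function backoffDelay(
  attempt,
  { minDelayMs = 500, maxDelayMs = 20000, jitter = false, rng = Math.random } = {}
) {
  // attempt is 1-based: 1 -> minDelayMs, 2 -> 2x, 3 -> 4x, ... capped at maxDelayMs
  const ceiling = Math.min(maxDelayMs, minDelayMs * 2 ** (attempt - 1));
  // Full jitter picks a random delay in [0, ceiling)
  return jitter ? Math.floor(rng() * ceiling) : ceiling;
}

for (let attempt = 1; attempt <= 7; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelay(attempt)} ms`);
}
```

Retrying only pays off for transient failures such as timeouts, HTTP 429, and 5xx responses. A 404 will usually fail the same way on every attempt, so it is worth deciding which errors to retry before wiring this in.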
Concurrency limiting
When you have 10,000 URLs, you don’t want 10,000 parallel requests.
Use p-limit:
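import pLimit from "p-limit";
import { fetchWithTimeout } from "./fetch-with-timeout.js";

const urls = ["https://example.com/a", "https://example.com/b"]; // placeholder list
const limit = pLimit(5); // at most 5 tasks in flight at once
const pages = await Promise.all(
  urls.map((u) =>
    limit(async () => {
      const html = await fetchWithTimeout(u, { timeoutMs: 20000 });
      return { url: u, htmlLen: html.length };
    })
  )
);
console.log("done", pages.length);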
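A concurrency cap limits how many requests are in flight, not how many start per second, so fast responses can still produce a burst. Rate limiting addresses that separately. Below is a minimal sketch that spaces out request starts by a fixed interval; createRateLimiter and waitTurn are hypothetical names, and the 200 ms interval is an arbitrary example value.

```javascript
// Sketch: space out request starts by a minimum interval.
// createRateLimiter is a hypothetical helper, not a library API.
function createRateLimiter(minIntervalMs) {
  let nextSlot = 0; // earliest timestamp (ms) at which the next request may start
  return async function waitTurn() {
    const now = Date.now();
    const startAt = Math.max(now, nextSlot);
    nextSlot = startAt + minIntervalMs; // reserve the slot before sleeping
    if (startAt > now) {
      await new Promise((resolve) => setTimeout(resolve, startAt - now));
    }
  };
}

// Demo: three tasks launched together start roughly 200 ms apart.
const waitTurn = createRateLimiter(200);
const t0 = Date.now();
const demoDone = Promise.all(
  ["a", "b", "c"].map(async (name) => {
    await waitTurn();
    console.log(name, "started after", Date.now() - t0, "ms");
  })
);
```

In a scraper, a call to await waitTurn() would go at the top of each limited task, just before the fetch, so the concurrency cap and the rate limit apply together.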
Part 3: Proxy support (including ProxiesAPI)
In Node, proxying depends on whether you:
- use fetch (needs a proxy agent)
- use Playwright (proxy option is built-in)
Option A: Use Playwright with a proxy (simplest)
import { chromium } from "playwright";

const browser = await chromium.launch({
  headless: true,
  proxy: {
    server: "http://YOUR_PROXIESAPI_PROXY_HOST:PORT",
    username: "YOUR_USERNAME",
    password: "YOUR_PASSWORD",
  },
});
try {
  const page = await browser.newPage();
  await page.goto("https://example.com", { waitUntil: "domcontentloaded" });
  const html = await page.content();
  console.log("html length:", html.length);
} finally {
  // Always release the browser, even if navigation fails
  await browser.close();
}
This is a clean way to integrate ProxiesAPI for JS-heavy targets.
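If your proxy credentials live in a single URL string such as the PROXY_URL environment variable, a small converter can produce the object shape that Playwright’s proxy option takes. This is a sketch under that assumption: toPlaywrightProxy is a hypothetical helper, and the host and credentials are placeholders.

```javascript
// Sketch: convert "http://user:pass@host:port" into { server, username, password }.
// toPlaywrightProxy is a hypothetical helper; the sample URL is a placeholder.
function toPlaywrightProxy(proxyUrl) {
  const u = new URL(proxyUrl);
  const proxy = { server: `${u.protocol}//${u.host}` };
  // URL keeps credentials percent-encoded, so decode them before passing them on
  if (u.username) proxy.username = decodeURIComponent(u.username);
  if (u.password) proxy.password = decodeURIComponent(u.password);
  return proxy;
}

console.log(toPlaywrightProxy("http://user:p%40ss@proxy.example.com:8080"));
```

The result can be passed directly as the proxy value in chromium.launch(), which keeps credentials out of the source file.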
Option B: fetch through an HTTP proxy
Node’s built-in fetch is backed by undici and ignores the agent option that node-fetch and the http/https modules accept, so handing it an https-proxy-agent instance has no effect. With undici, route requests through its ProxyAgent using the dispatcher option:
npm install undici
import { fetch, ProxyAgent } from "undici";

const proxyUrl = process.env.PROXY_URL; // e.g. http://user:pass@host:port
const dispatcher = new ProxyAgent(proxyUrl);
const res = await fetch("https://example.com", { dispatcher });
(If you use node-fetch rather than the built-in fetch, https-proxy-agent with the agent option is the usual route. Exact options can vary by Node and undici version.)
Part 4: Browser fallback for JS-heavy sites (Playwright)
When HTML scraping fails because content is rendered client-side, use Playwright:
import { chromium } from "playwright";

export async function fetchRenderedHtml(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle", timeout: 60000 });
    // If the site lazy-loads, a small scroll can help
    await page.mouse.wheel(0, 1500);
    await page.waitForTimeout(800);
    return await page.content();
  } finally {
    // Close the browser even when navigation throws, so processes don't leak
    await browser.close();
  }
}
This “rendered snapshot” is also great for debugging selectors.
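One way to combine the two modes is to try the fast path first and render in a browser only when the HTML looks empty. The sketch below injects both fetchers so it stays independent of any library. fetchHtmlWithFallback, fetchFast, fetchRendered, and looksEmpty are hypothetical names, and the 200-character threshold is an arbitrary assumption to tune per site.

```javascript
// Sketch: fast HTML fetch first, browser render only when needed.
// All function names and the 200-character threshold are assumptions.
async function fetchHtmlWithFallback(url, { fetchFast, fetchRendered, looksEmpty }) {
  try {
    const html = await fetchFast(url);
    if (!looksEmpty(html)) return { html, mode: "fast" };
  } catch (err) {
    console.warn(`fast fetch failed for ${url}: ${err.message}`);
  }
  // The request failed or the HTML had no usable content: render it in a browser
  const html = await fetchRendered(url);
  return { html, mode: "browser" };
}

// Example heuristic: almost no visible text remains once scripts, styles,
// and tags are stripped out (typical of a client-rendered app shell).
function looksEmpty(html) {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return text.length < 200;
}
```

In this tutorial’s terms, fetchFast would be fetchWithTimeout and fetchRendered would be fetchRenderedHtml. A selector-based check (for example, "did the parser find any items?") is often a more reliable trigger than text length.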
Comparison: Cheerio vs Playwright
| Feature | Cheerio (HTML) | Playwright (Browser) |
|---|---|---|
| Speed | Very fast | Slower |
| Cost | Low | Higher |
| JS-heavy sites | Fails | Works |
| Anti-bot exposure | Lower | Often higher |
| Selector stability | Depends on markup | Often better |
Rule of thumb:
- Start with Cheerio.
- Add Playwright only when needed.
Common scraping mistakes (and fixes)
- No timeouts → requests hang forever
- Too much concurrency → you hammer the target site, exhaust your own sockets, and get blocked
- No retries → flaky pages break your pipeline
- No dedupe → you scrape the same page repeatedly
- No proxy plan → your IP gets burned at scale
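For the dedupe point, a common first step is to normalize URLs so that trivially different forms map to one key. The sketch below is a minimal version: normalizeUrl and shouldVisit are hypothetical helpers, and which differences are safe to ignore (fragments, query order, tracking parameters) depends on the site.

```javascript
// Sketch: normalize URLs so equivalent forms dedupe to a single entry.
// normalizeUrl and shouldVisit are hypothetical helpers.
function normalizeUrl(raw) {
  const u = new URL(raw);   // also lowercases the hostname
  u.hash = "";              // fragments are never sent to the server
  u.searchParams.sort();    // ?b=2&a=1 and ?a=1&b=2 become identical
  return u.toString();
}

const seen = new Set();
function shouldVisit(raw) {
  const key = normalizeUrl(raw);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}

console.log(shouldVisit("https://example.com/a?b=2&a=1#intro")); // true
console.log(shouldVisit("https://EXAMPLE.com/a?a=1&b=2"));       // false
```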
A tiny “real” scraper template (put it all together)
import pLimit from "p-limit";
import * as cheerio from "cheerio";
import { fetchWithTimeout } from "./fetch-with-timeout.js";
import { retry } from "./retry.js";

const startUrl = "https://example.com";
const limit = pLimit(5);

function parseLinks(html, baseUrl) {
  const $ = cheerio.load(html);
  const links = new Set();
  $("a").each((_, el) => {
    const href = $(el).attr("href");
    if (!href) return;
    try {
      const u = new URL(href, baseUrl);
      if (u.protocol.startsWith("http")) links.add(u.toString());
    } catch {}
  });
  return [...links];
}

const html = await retry(() => fetchWithTimeout(startUrl, { timeoutMs: 20000 }));
const urls = parseLinks(html, startUrl).slice(0, 50);
const results = await Promise.all(
  urls.map((u) =>
    limit(() =>
      retry(async () => {
        const h = await fetchWithTimeout(u, { timeoutMs: 20000 });
        return { url: u, bytes: h.length };
      })
    )
  )
);
console.log("scraped", results.length);
console.log(results.slice(0, 5));
FAQ
Is scraping with JavaScript “worse” than Python?
No. The language matters less than:
- your fetch/retry/rate-limit strategy
- whether you can handle JS-heavy pages
- your proxy approach
Do I always need a browser?
No. Browsers are powerful but expensive. Prefer HTML scraping first.
Next steps
- add a URL frontier (queue) + visited set
- store results in SQLite or Postgres
- add ProxiesAPI rotation when the crawl becomes flaky
- add a Playwright mode for the hard targets
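As a starting point for the first item, here is a minimal sketch of a breadth-first crawl loop with a frontier queue and a visited set. It is sequential for clarity and takes the link extractor as a parameter, so the demo runs against an in-memory graph instead of real HTTP. crawl, getLinks, and maxPages are hypothetical names.

```javascript
// Sketch: breadth-first crawl with a frontier queue and a visited set.
// crawl and getLinks are hypothetical; getLinks(url) resolves to an array of URLs.
async function crawl(startUrl, getLinks, { maxPages = 100 } = {}) {
  const frontier = [startUrl];
  const visited = new Set([startUrl]); // URLs already queued or scraped
  const order = [];

  while (frontier.length > 0 && order.length < maxPages) {
    const url = frontier.shift();
    order.push(url);
    let links = [];
    try {
      links = await getLinks(url);
    } catch (err) {
      console.warn(`skipping ${url}: ${err.message}`);
    }
    for (const link of links) {
      if (visited.has(link)) continue;
      visited.add(link);
      frontier.push(link);
    }
  }
  return order;
}

// Demo against an in-memory link graph (made-up URLs).
const graph = {
  "https://example.com/": ["https://example.com/a", "https://example.com/b"],
  "https://example.com/a": ["https://example.com/b", "https://example.com/"],
  "https://example.com/b": ["https://example.com/c"],
  "https://example.com/c": [],
};
crawl("https://example.com/", async (u) => graph[u] ?? []).then((order) =>
  console.log(order)
);
```

In a real crawler, getLinks would fetch the page and extract its links, and the loop would hand URLs to a concurrency-limited worker pool rather than awaiting them one at a time.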