Web Scraping with JavaScript and Node.js: Full Tutorial (2026)
The phrase “web scraping with JavaScript” is popular for a reason: Node.js is a great fit for scraping pipelines.
- You can write scrapers, ETL, and APIs in one language
- You have excellent tooling for concurrency and retries
- HTML parsing (Cheerio) feels like jQuery
In this tutorial you’ll build a production-style scraper in Node.js that:
- fetches HTML pages reliably (timeouts + retries + backoff)
- parses real DOM structure with Cheerio
- crawls pagination
- rotates proxies using ProxiesAPI
- exports a clean dataset (JSON/JSONL)
To keep the tutorial concrete, we’ll use a “blog-like listing → detail pages” pattern that maps to many sites:
- listing page: many items + “next page”
- detail page: each item has content you want
Scrapers fail in the network layer first: timeouts, 429s, and blocks. ProxiesAPI gives you a clean way to rotate IPs and keep retries from cascading into downtime.
0) Before you scrape: basic rules that save you hours
- Prefer server-rendered HTML targets when possible.
- Start with one page → then add pagination → then add detail pages.
- Keep concurrency low at first. Reliability beats speed.
- Put every network request behind a function with:
- timeouts
- retries
- jitter/backoff
1) Project setup
mkdir node-scraper
cd node-scraper
npm init -y
npm install cheerio p-limit dotenv
Node 18+ includes fetch(). If you’re on an older Node version, install undici or node-fetch. Also add "type": "module" to your package.json, because the code below uses ES module import syntax and top-level await.
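If you are stuck on Node 16/17, here is a minimal sketch of a polyfill module (the file name polyfill-fetch.js is an assumption; node-fetch v3 is ESM-only, which matches the import syntax used here):
// polyfill-fetch.js: only needed on Node < 18, where fetch() is not a global.
// Import this once (e.g. at the top of index.js) and the rest of the code
// can keep calling the global fetch().
import fetch from "node-fetch";
if (typeof globalThis.fetch === "undefined") {
  globalThis.fetch = fetch;
}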
Create .env:
PROXIESAPI_KEY="YOUR_KEY"
PROXIESAPI_ENDPOINT="https://proxiesapi.com" # example; use your real endpoint
2) A resilient fetch() wrapper (retries + backoff)
Create src/http.js:
import "dotenv/config";
const TIMEOUT_MS = 30_000;
function sleep(ms) {
return new Promise((r) => setTimeout(r, ms));
}
function withTimeout(signal, ms) {
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), ms);
// If a signal is passed in, forward abort
if (signal) {
signal.addEventListener("abort", () => controller.abort(), { once: true });
}
return { signal: controller.signal, cancel: () => clearTimeout(timeout) };
}
export function proxiesapiUrl(targetUrl) {
const key = process.env.PROXIESAPI_KEY;
const endpoint = process.env.PROXIESAPI_ENDPOINT;
if (!key || !endpoint) throw new Error("Missing PROXIESAPI_KEY/PROXIESAPI_ENDPOINT");
const u = new URL(endpoint);
// Many proxy APIs use ?api_key=...&url=...
u.searchParams.set("api_key", key);
u.searchParams.set("url", targetUrl);
return u.toString();
}
export async function fetchHtml(targetUrl, { headers = {}, maxRetries = 4 } = {}) {
const ua =
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
"AppleWebKit/537.36 (KHTML, like Gecko) " +
"Chrome/123.0.0.0 Safari/537.36";
const finalHeaders = {
"user-agent": ua,
"accept-language": "en-US,en;q=0.9",
...headers,
};
let attempt = 0;
while (true) {
attempt += 1;
const jitter = 250 + Math.floor(Math.random() * 800);
await sleep(jitter);
const { signal, cancel } = withTimeout(undefined, TIMEOUT_MS);
try {
const res = await fetch(proxiesapiUrl(targetUrl), {
method: "GET",
headers: finalHeaders,
signal,
});
if (!res.ok) {
// retry on transient errors
const retryable = [429, 500, 502, 503, 504].includes(res.status);
const body = await res.text().catch(() => "");
if (retryable && attempt < maxRetries) {
const backoff = Math.min(20_000, 500 * 2 ** (attempt - 1));
await sleep(backoff);
continue;
}
throw new Error(`HTTP ${res.status} ${res.statusText} :: ${body.slice(0, 200)}`);
}
return await res.text();
} finally {
cancel();
}
}
}
Notes:
- We deliberately keep the logic simple.
- We retry on 429/5xx, not on everything.
- We add jitter to avoid request bursts.
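As a quick smoke test of the wrapper (assuming your .env is filled in; the file name and target URL below are placeholders):
// smoke-test.js: run with `node smoke-test.js`
import { fetchHtml } from "./src/http.js";
const html = await fetchHtml("https://example.com/");
console.log("fetched", html.length, "characters of HTML");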
3) Parse a listing page with Cheerio
Create src/parse.js:
import * as cheerio from "cheerio";
export function parseListing(html, { baseUrl }) {
const $ = cheerio.load(html);
// Example: grab all links. In real targets, restrict selectors.
const links = new Set();
$("a[href]").each((_, a) => {
const href = $(a).attr("href");
if (!href) return;
// Normalize relative → absolute
try {
const u = new URL(href, baseUrl);
links.add(u.toString());
} catch {
// ignore invalid URLs
}
});
return Array.from(links);
}
export function parseTitle(html) {
const $ = cheerio.load(html);
return $("title").text().trim();
}
This is a generic parser. On your real target, you’ll do something like:
$(".product-card a")$("article h2 a")$("a.question-hyperlink")
The concept is the same.
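For instance, here is a sketch of a target-specific parser you could add to src/parse.js (it reuses the cheerio import already at the top of that file; the selectors assume each item is an article element with an h2 > a title link, so adjust them to your real target):
// Hypothetical markup: <article><h2><a href="/post/...">Title</a></h2></article>
export function parseArticleCards(html, { baseUrl }) {
  const $ = cheerio.load(html);
  const items = [];
  $("article h2 a").each((_, a) => {
    const href = $(a).attr("href");
    const title = $(a).text().trim();
    if (!href || !title) return;
    // Normalize relative → absolute, same as parseListing
    items.push({ title, url: new URL(href, baseUrl).toString() });
  });
  return items;
}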
4) Crawl pagination (the pattern that works everywhere)
Here’s the “crawl N pages” loop:
- fetch listing page
- parse item links
- enqueue detail pages
- follow “next page” (or build ?page=N URLs)
Create src/crawl.js:
import fs from "node:fs";
import path from "node:path";
import pLimit from "p-limit";
import { fetchHtml } from "./http.js";
import { parseTitle, parseListing } from "./parse.js";
const limit = pLimit(3); // keep concurrency modest
export async function crawl({ startUrl, pages = 3 }) {
const base = new URL(startUrl).origin;
const listingUrls = [];
for (let i = 1; i <= pages; i++) {
// If your target uses ?page=N, build it here.
// Otherwise, you can parse a “next” link from HTML.
listingUrls.push(startUrl.replace("{page}", String(i)));
}
const detailUrls = new Set();
for (const u of listingUrls) {
const html = await fetchHtml(u);
const links = parseListing(html, { baseUrl: base });
// Heuristic: keep only same-origin links
for (const link of links) {
if (link.startsWith(base)) detailUrls.add(link);
}
console.log("listing", u, "=> links", links.length, "detail set", detailUrls.size);
}
// Fetch details (concurrently)
const results = await Promise.all(
Array.from(detailUrls).slice(0, 50).map((u) =>
limit(async () => {
const html = await fetchHtml(u);
return { url: u, title: parseTitle(html) };
})
)
);
return results;
}
export function writeJson(outPath, data) {
fs.mkdirSync(path.dirname(outPath), { recursive: true });
fs.writeFileSync(outPath, JSON.stringify(data, null, 2), "utf-8");
}
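The intro mentioned JSON/JSONL exports; alongside writeJson, here is a minimal JSONL writer sketch for the same file (one JSON object per line, which is easy to append to and to stream; it reuses the fs and path imports already in src/crawl.js):
export function writeJsonl(outPath, rows) {
  fs.mkdirSync(path.dirname(outPath), { recursive: true });
  // One JSON object per line, trailing newline included
  const lines = rows.map((row) => JSON.stringify(row)).join("\n") + "\n";
  fs.writeFileSync(outPath, lines, "utf-8");
}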
5) Run it: a complete working script
Create index.js:
import { crawl, writeJson } from "./src/crawl.js";
// Example pattern: a listing with ?page=1..N
// Replace with your real target.
const START_URL = "https://example.com/list?page={page}";
const data = await crawl({ startUrl: START_URL, pages: 3 });
writeJson("out/results.json", data);
console.log("wrote", data.length, "items");
console.log(data.slice(0, 3));
Run:
node index.js
Comparison: HTTP requests vs. a headless browser (Node.js)
Some sites are easy with HTTP + HTML parsing. Others are JS-rendered.
Here’s how to decide:
| Situation | Use | Why |
|---|---|---|
| Server-rendered pages | fetch + Cheerio | Fast, cheap, reliable |
| Needs JS to render data | Playwright/Puppeteer | You’ll otherwise parse empty HTML |
| Heavily blocked / fingerprinted | Browser + proxies + pacing | You need realistic behavior |
| Bulk scraping (many URLs) | HTTP + proxies | Cost-effective |
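For the JS-rendered case, here is a minimal Playwright sketch (assuming npm install playwright and npx playwright install chromium, and that the script sits next to index.js) that grabs the rendered HTML and hands it to the same Cheerio parsers:
import { chromium } from "playwright";
import { parseTitle } from "./src/parse.js";
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com/", { waitUntil: "networkidle" });
const html = await page.content(); // the fully rendered DOM
await browser.close();
console.log(parseTitle(html));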
Practical anti-blocking tips (Node.js)
- Use timeouts and retry only on transient failures.
- Add jitter between requests.
- Keep concurrency modest (p-limit).
- Rotate IPs for scale (ProxiesAPI).
- Don’t run an “infinite crawl” without (see the sketch after this list):
- dedupe
- max pages
- max depth
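A minimal sketch combining those three guards (a seen set, a page budget, and a depth cap), reusing fetchHtml and parseListing from earlier; it is assumed to live next to index.js:
// Bounded crawl frontier: dedupe + max pages + max depth.
import { fetchHtml } from "./src/http.js";
import { parseListing } from "./src/parse.js";
export async function boundedCrawl(startUrl, { maxPages = 50, maxDepth = 2 } = {}) {
  const base = new URL(startUrl).origin;
  const seen = new Set();                       // dedupe: never fetch the same URL twice
  const queue = [{ url: startUrl, depth: 0 }];
  const pages = [];
  while (queue.length > 0 && pages.length < maxPages) {   // max pages
    const { url, depth } = queue.shift();
    if (seen.has(url)) continue;
    seen.add(url);
    const html = await fetchHtml(url);
    pages.push({ url, html });
    if (depth >= maxDepth) continue;                       // max depth
    for (const link of parseListing(html, { baseUrl: url })) {
      if (link.startsWith(base) && !seen.has(link)) {
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return pages;
}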
Where ProxiesAPI fits (honestly)
ProxiesAPI isn’t a “scrape anything instantly” button.
It’s a pragmatic tool to make the boring part of scraping reliable:
- rotating IPs
- re-trying failures cleanly
- reducing downtime from blocks
Once your fetch layer is stable, you can focus on the parts that actually create value:
- parsers
- data model
- exports
- alerts
Next upgrades
- Save HTML snapshots for debugging (out/html/...).
- Add a URL queue and persist progress to SQLite.
- Add a robots.txt + compliance check step.
- Implement per-domain rate limiting (see the sketch below).
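As a starting point for that last upgrade, here is a minimal per-domain rate limiter sketch: it remembers the last request time per hostname and sleeps until a minimum gap has passed (the 1500 ms default is an arbitrary assumption, and it is not strict under high concurrency):
// Per-domain pacing: wait until at least `minGapMs` has passed since the
// previous request to the same hostname, then record the new hit.
const lastHit = new Map();
export async function rateLimit(url, minGapMs = 1500) {
  const host = new URL(url).hostname;
  const waitMs = Math.max(0, (lastHit.get(host) ?? 0) + minGapMs - Date.now());
  if (waitMs > 0) await new Promise((r) => setTimeout(r, waitMs));
  lastHit.set(host, Date.now());
}
Call it right before each fetchHtml(url); for stricter pacing under concurrency, serialize requests per host instead.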