Web Scraping with JavaScript and Node.js: Full Tutorial (2026)
If you already know JavaScript, Node.js is one of the fastest ways to build a web scraper:
- great HTTP tooling
- strong HTML parsers (Cheerio)
- easy concurrency control
- perfect for pipelines (scrape → transform → store)
In this 2026 tutorial, you’ll build a complete Node scraper that handles the stuff that actually matters:
- fetching pages with timeouts
- parsing HTML using Cheerio selectors
- crawling pagination safely
- retrying failures with backoff
- exporting data to CSV
- routing requests through ProxiesAPI when you need more reliability
Most scrapers fail in the network layer: timeouts, throttling, and blocks. ProxiesAPI gives your Node.js scraper a simple proxy route so your crawling stays more reliable as you add more targets and more URLs.
When Node.js is a good fit for scraping
Node is especially good for:
- scraping server-rendered HTML pages
- building “ETL-style” scrapers
- running lots of small jobs (cron, queues)
- building internal dashboards that refresh regularly
Node is not a magic bullet for:
- heavily client-rendered apps (React apps that fetch everything via XHR)
- pages that require solving complex bot challenges
For those cases, you usually move to a browser automation stack (Playwright) or a first-party API.
Setup
Create a new project:
mkdir node-scraper
cd node-scraper
npm init -y
Install dependencies:
npm install cheerio dotenv
Node 18+ already includes fetch. (If you’re on an older Node, install node-fetch.)
Create a .env:
PROXIESAPI_KEY="YOUR_KEY"
Part 1 — Build a robust fetch() with retries
Scrapers die in the network layer. So we’ll start there.
We want:
- timeouts (never hang)
- retries for transient failures
- a stable User-Agent
// scraper.js
import * as cheerio from "cheerio";
import "dotenv/config";

const PROXIESAPI_KEY = process.env.PROXIESAPI_KEY;
const PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/";

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function proxiesApiUrl(targetUrl) {
  if (!PROXIESAPI_KEY) throw new Error("Missing PROXIESAPI_KEY");
  const u = new URL(PROXIESAPI_ENDPOINT);
  u.searchParams.set("auth_key", PROXIESAPI_KEY);
  u.searchParams.set("url", targetUrl);
  return u.toString();
}

async function fetchHtml(url, { useProxiesApi = false, timeoutMs = 45000, retries = 4 } = {}) {
  let lastErr;
  for (let attempt = 1; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const t = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const finalUrl = useProxiesApi ? proxiesApiUrl(url) : url;
      const res = await fetch(finalUrl, {
        signal: controller.signal,
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
          Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
          "Accept-Language": "en-US,en;q=0.9",
        },
      });
      if (!res.ok) {
        throw new Error(`HTTP ${res.status} ${res.statusText}`);
      }
      const html = await res.text();
      // Heuristic: suspiciously small responses are often block or error pages.
      // Tune (or remove) this threshold for targets with genuinely small pages.
      if (html.length < 2000) {
        throw new Error(`Response too small (${html.length} bytes)`);
      }
      return html;
    } catch (e) {
      lastErr = e;
      if (attempt === retries) break;
      // Exponential backoff with jitter: 2s, 4s, 8s... plus up to 400ms of noise.
      const backoff = (2 ** attempt) * 1000 + Math.floor(Math.random() * 400);
      console.log(`attempt ${attempt} failed: ${e}. sleeping ${backoff}ms`);
      await sleep(backoff);
    } finally {
      clearTimeout(t);
    }
  }
  throw new Error(`Failed to fetch ${url}: ${lastErr}`);
}
Use ProxiesAPI or not?
- For friendly sites: set `useProxiesApi: false`.
- For targets that start throttling or denying requests: set `useProxiesApi: true`.
In real pipelines, you often start without proxies, then switch on proxies for specific hosts.
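A third option is a fallback wrapper: try the direct route first, and spend proxy requests only when it fails. Here is a sketch; `fetchFn` is a hypothetical injection point standing in for the `fetchHtml()` above, which also keeps the wrapper easy to test without a network.

```javascript
// Fallback routing sketch: direct first, proxy only on failure.
// `fetchFn(url, opts)` stands in for fetchHtml() from this tutorial.
async function fetchWithFallback(url, fetchFn) {
  try {
    // Cheap path: no proxy, fewer retries.
    return await fetchFn(url, { useProxiesApi: false, retries: 2 });
  } catch (err) {
    // Direct route failed (timeout, 403, 429...): retry through ProxiesAPI.
    return fetchFn(url, { useProxiesApi: true, retries: 3 });
  }
}
```

Called as `fetchWithFallback(url, fetchHtml)`, it behaves like `fetchHtml` but only pays for proxied requests on hosts that actually reject you.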
Part 2 — Parse HTML with Cheerio (selectors that make sense)
Cheerio gives you jQuery-like selectors.
Here’s a generic example for parsing a “listing card” page:
function parseListingPage(html, { baseUrl }) {
  const $ = cheerio.load(html);
  const out = [];

  $("article, .card, .product, .result").each((_, el) => {
    // Try to find a reasonable link + title inside a card.
    const a = $(el).find("a[href]").first();
    const href = a.attr("href");
    const title = a.text().trim();
    if (!href || title.length < 4) return;

    const url = new URL(href, baseUrl).toString();

    // Best-effort price extraction.
    const textBlob = $(el).text().replace(/\s+/g, " ").trim();
    const priceMatch = textBlob.match(/(\$\s?\d[\d\.,]*|€\s?\d[\d\.,]*|£\s?\d[\d\.,]*)/);

    out.push({
      title,
      url,
      price: priceMatch ? priceMatch[1] : null,
    });
  });

  return out;
}
This pattern is useful because it’s resilient across sites: you don’t overfit to one class name.
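You can sanity-check the best-effort price regex on its own before pointing it at real pages. This is the same pattern used in `parseListingPage`, run against a sample string:

```javascript
// The best-effort price regex from parseListingPage, tested on sample text.
const priceRe = /(\$\s?\d[\d\.,]*|€\s?\d[\d\.,]*|£\s?\d[\d\.,]*)/;

const sample = "Deluxe Widget $1,299.00 free shipping";
const m = sample.match(priceRe);
console.log(m ? m[1] : null); // "$1,299.00"
```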
For high accuracy on a single target, you should:
- inspect the HTML in DevTools
- target specific containers (e.g. `div[data-testid='product-card']`)
- write selectors based on attributes instead of CSS classes
Part 3 — Crawl pagination (without guessing)
The cleanest pagination approach is: follow the Next link.
We’ll implement:
- `rel="next"` if present
- a fallback selector for anchors containing “Next”
function findNextPageUrl(html, currentUrl) {
  const $ = cheerio.load(html);

  const relNext = $("link[rel='next']").attr("href");
  if (relNext) return new URL(relNext, currentUrl).toString();

  // Fallback (English sites): anchor text contains "Next"
  let nextHref = null;
  $("a[href]").each((_, a) => {
    const text = $(a).text().trim().toLowerCase();
    if (text === "next" || text.includes("next ") || text.includes(" next")) {
      nextHref = $(a).attr("href");
      return false;
    }
  });

  return nextHref ? new URL(nextHref, currentUrl).toString() : null;
}

async function crawl(startUrl, { maxPages = 5, useProxiesApi = false } = {}) {
  let url = startUrl;
  const all = [];
  const seen = new Set();

  for (let page = 1; page <= maxPages; page++) {
    const html = await fetchHtml(url, { useProxiesApi });
    const rows = parseListingPage(html, { baseUrl: url });

    for (const r of rows) {
      if (seen.has(r.url)) continue;
      seen.add(r.url);
      all.push(r);
    }

    console.log(`page ${page}: rows=${rows.length} total_unique=${all.length}`);

    const nextUrl = findNextPageUrl(html, url);
    if (!nextUrl) break;
    url = nextUrl;
  }

  return all;
}
Part 4 — Export to CSV (no dependencies)
import fs from "node:fs";

function toCsv(rows) {
  const headers = Object.keys(rows[0]);

  const escape = (v) => {
    if (v === null || v === undefined) return "";
    const s = String(v);
    if (s.includes(",") || s.includes("\n") || s.includes('"')) {
      return '"' + s.replace(/"/g, '""') + '"';
    }
    return s;
  };

  const lines = [headers.join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => escape(row[h])).join(","));
  }
  return lines.join("\n") + "\n";
}

async function main() {
  const startUrl = "https://news.ycombinator.com/"; // replace with your target
  const rows = await crawl(startUrl, { maxPages: 3, useProxiesApi: true });
  if (!rows.length) throw new Error("No rows scraped");

  const csv = toCsv(rows);
  fs.writeFileSync("scraped.csv", csv, "utf-8");
  console.log("wrote scraped.csv rows:", rows.length);
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});
Practical advice (what experienced scrapers do)
1) Start with a “selector budget”
Don’t start by writing 25 selectors.
Start with:
- a URL selector
- a title selector
- a price selector
Ship a dataset. Then iterate.
2) Prefer attributes over CSS classes
Classes change. Attributes like data-testid, aria-label, and semantic tags change less.
3) Treat every scrape as an ETL job
A good scraper has three stages:
- Extract: fetch HTML
- Transform: parse into normalized rows
- Load: write to CSV/DB/API
Keeping these separate makes your code maintainable.
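The split can be made literal with a tiny job runner whose stages are plain injected functions. This is a sketch with illustrative names (`runJob` is not part of the code above); the payoff is that Transform can be unit-tested against a saved HTML fixture with no network at all.

```javascript
// Sketch: each ETL stage is an injected function.
async function runJob({ extract, transform, load }) {
  const raw = await extract();   // e.g. () => fetchHtml(startUrl)
  const rows = transform(raw);   // e.g. (html) => parseListingPage(html, { baseUrl })
  await load(rows);              // e.g. (rows) => fs.writeFileSync("out.csv", toCsv(rows))
  return rows.length;
}
```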
4) Expect failures
Plan for:
- timeouts
- 429 rate limits
- intermittent HTML differences
Retries + backoff + logging are not optional.
ProxiesAPI integration patterns (without overpromising)
You have three common options:
- Always on: every request goes through ProxiesAPI
- Host-based: only “difficult” hosts go through ProxiesAPI
- Fallback: try direct first, then retry with ProxiesAPI
If you run a lot of scraping jobs, host-based routing is usually the sweet spot.
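A host-based router can be as small as a Set lookup. The host names below are made-up examples:

```javascript
// Sketch: route only known-difficult hosts through ProxiesAPI.
const PROXIED_HOSTS = new Set(["shop.example.com", "news.example.org"]);

function shouldUseProxiesApi(targetUrl) {
  // URL is a global in Node 18+; hostname excludes the port.
  return PROXIED_HOSTS.has(new URL(targetUrl).hostname);
}
```

Then call `fetchHtml(url, { useProxiesApi: shouldUseProxiesApi(url) })` and grow the set as hosts start throttling you.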
Comparison: Node.js scraping libraries (2026)
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| fetch + Cheerio | HTML pages | Fast, simple, cheap | No JS rendering |
| Axios + Cheerio | HTML pages | Mature ecosystem | Extra dependency |
| Playwright | JS-heavy sites | Accurate, renders pages | Slower, heavier |
| Puppeteer | JS-heavy sites | Popular | Similar tradeoffs |
FAQ
Is web scraping legal?
It depends on what you scrape, how you use it, and the jurisdiction.
Common best practices:
- scrape public pages
- respect rate limits
- don’t collect sensitive personal data
- check the target’s terms and local law
What about robots.txt?
robots.txt is not a law, but it’s a strong signal of what the site expects.
If you’re doing something commercial, treat it as part of your compliance process.
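If you want to honor it programmatically, a deliberately simplified check for the `User-agent: *` group looks like this. Real robots.txt matching also involves `Allow` rules, wildcards, and longest-match precedence, so use a proper parser library for production:

```javascript
// Simplified sketch: is `path` disallowed for "User-agent: *"?
// Only handles plain Disallow prefixes inside the * group.
function isDisallowed(robotsTxt, path) {
  let inStarGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    if (/^user-agent:/i.test(line)) {
      inStarGroup = line.slice(line.indexOf(":") + 1).trim() === "*";
    } else if (inStarGroup && /^disallow:/i.test(line)) {
      const rule = line.slice(line.indexOf(":") + 1).trim();
      if (rule && path.startsWith(rule)) return true;
    }
  }
  return false;
}
```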
QA checklist
- Fetch HTML with timeouts (no hanging)
- Parse a page and log the first 3 extracted rows
- Pagination increases unique URLs
- CSV opens cleanly in Excel/Sheets
- Retries/backoff are visible in logs