Web Scraping with JavaScript and Node.js: Full Tutorial (2026)
If you’re building scrapers in 2026, JavaScript + Node.js is a surprisingly strong default:
- same language as the browser (easy DOM mental model)
- best-in-class tooling for JS-rendered sites (Playwright)
- good performance for I/O-heavy crawlers
This tutorial is a practical, end-to-end “starter kit” for scraping with Node:
- HTTP + HTML parsing (fast path)
- Playwright rendering (JS-heavy path)
- retries, backoff, and caching
- proxy integration with ProxiesAPI
Along the way, we’ll show the tradeoffs and give you copy-pasteable code.
As your Node scraper scales (more URLs, more targets), IP-based throttling becomes the #1 failure mode. ProxiesAPI gives you a stable proxy layer so retries actually work and crawls don’t die on a single blocked IP.
If you only take one thing away: in Node you typically use either:
- Cheerio (parse HTML strings like jQuery) for server-rendered pages
- Playwright (real headless browser) for JS-rendered pages
…and you should decide which one based on the target site’s rendering model.
Quick comparison table (what to use when)
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| fetch/axios + Cheerio | server-rendered HTML | very fast, cheap, easy to deploy | fails on JS apps, fragile selectors |
| Playwright | JS-heavy sites, dynamic UI | accurate DOM, can click/scroll/login | slower, heavier, more detectable |
| Hybrid (HTTP list → browser detail) | catalogs, pagination | cheaper than full-browser crawl | more code complexity |
Part 1: Scrape a server-rendered page with Cheerio (fast path)
Setup
mkdir node-scraper
cd node-scraper
npm init -y
npm i axios cheerio p-limit
We’ll scrape a simple HTML page and extract titles + links.
Code: fetch + parse
// scrape-cheerio.js
import axios from "axios";
import * as cheerio from "cheerio";
const URL = "https://news.ycombinator.com/";
async function fetchHtml(url) {
const res = await axios.get(url, {
timeout: 30000,
headers: {
"User-Agent":
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
Accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
},
});
return res.data;
}
function parseHn(html) {
const $ = cheerio.load(html);
const out = [];
$("tr.athing").each((_, el) => {
const id = $(el).attr("id");
const a = $(el).find("span.titleline > a").first();
out.push({
id,
title: a.text().trim(),
url: a.attr("href"),
});
});
return out;
}
const html = await fetchHtml(URL);
const items = parseHn(html);
console.log("items", items.length);
console.log(items.slice(0, 3));
Run (the examples use ES-module imports and top-level await, so add "type": "module" to package.json or rename the file to scrape-cheerio.mjs):
node scrape-cheerio.js
Part 2: Add retries + backoff (so your crawler doesn’t crumble)
Scrapers fail for boring reasons:
- intermittent 502/503
- TCP timeouts
- temporary throttling
If you don’t retry correctly, you’ll get random holes in your data.
npm i p-retry
import axios from "axios";
import pRetry, { AbortError } from "p-retry";
async function fetchHtml(url) {
return pRetry(
async () => {
const res = await axios.get(url, {
timeout: 30000,
// axios rejects non-2xx responses by default, so a 5xx status check
// would never run; accept all statuses and classify them ourselves
validateStatus: () => true,
});
if (res.status >= 500) throw new Error(`server error ${res.status}`);
// 4xx won't improve on retry: abort instead of burning attempts
if (res.status >= 400) throw new AbortError(`client error ${res.status}`);
return res.data;
},
{
retries: 3,
onFailedAttempt: (err) => {
console.log(
`fetch failed: attempt ${err.attemptNumber} / ${err.retriesLeft + err.attemptNumber}`,
err.message
);
},
}
);
}
Add jitter between requests:
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
await sleep(500 + Math.random() * 1500);
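p-retry applies exponential backoff between attempts by default. If you want to control the delay yourself, a common pattern is "full jitter": grow the window exponentially, cap it, then pick a random point inside it. A minimal sketch (the function name and base/cap values are illustrative, not part of p-retry):

```javascript
// Full-jitter exponential backoff: the window grows as base * 2^attempt,
// capped at `cap`, and the actual delay is a random point in [0, window).
// Randomizing the whole window keeps retrying clients from synchronizing.
function backoffDelay(attempt, base = 500, cap = 15000) {
  const window = Math.min(cap, base * 2 ** attempt);
  return Math.floor(Math.random() * window);
}

// The window for attempts 0..3: 500, 1000, 2000, 4000 ms
for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`attempt ${attempt}: delay up to ${Math.min(15000, 500 * 2 ** attempt)}ms`);
}
```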
Part 3: Scrape JS-heavy sites with Playwright (browser path)
Setup
npm i playwright
npx playwright install --with-deps chromium
Code: render and extract DOM
// scrape-playwright.js
import { chromium } from "playwright";
const URL = "https://example.com"; // replace with your target
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
userAgent:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
});
await page.goto(URL, { waitUntil: "domcontentloaded", timeout: 60000 });
// Prefer waiting for a selector that signals the content loaded,
// e.g. await page.waitForSelector(".results") (selector is illustrative);
// a fixed pause is a blunt fallback
await page.waitForTimeout(1500);
// Example extraction (replace selectors)
const items = await page.$$eval("a", (as) =>
as.slice(0, 10).map((a) => ({ text: a.textContent?.trim(), href: a.href }))
);
console.log(items);
await browser.close();
If you don’t know whether you need Playwright:
- view page source (curl -s URL | head): if it’s mostly empty divs and script tags, you need a browser
- open DevTools → Network and check whether the data arrives via XHR/GraphQL requests
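The manual check above can be roughly automated. This is a heuristic sketch, not a guarantee: server-rendered pages tend to carry substantial visible text outside script tags, while SPA shells are mostly scripts and empty divs. The function name and the 200-character threshold are illustrative:

```javascript
// Strip <script> blocks and remaining tags, then measure what's left.
// Lots of residual text suggests server-rendered HTML (Cheerio is enough);
// near-empty output suggests a JS app (reach for Playwright).
function looksServerRendered(html) {
  const withoutScripts = html.replace(/<script[\s\S]*?<\/script>/gi, "");
  const text = withoutScripts.replace(/<[^>]+>/g, "").trim();
  return text.length > 200; // threshold is arbitrary; tune per target
}

const spaShell =
  '<html><body><div id="root"></div><script src="app.js"></script></body></html>';
console.log(looksServerRendered(spaShell)); // false: almost no visible text
```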
Part 4: Integrate ProxiesAPI (Node)
The most common reason scrapers fail at scale is IP reputation + rate limits.
ProxiesAPI fits as the network layer that:
- routes requests through a proxy endpoint
- can rotate IPs between requests
4.1 ProxiesAPI with Axios
Set an environment variable:
export PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.proxiesapi.com:PORT"
Then configure Axios to use an HTTP proxy agent.
npm i https-proxy-agent
import axios from "axios";
import { HttpsProxyAgent } from "https-proxy-agent";
const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
const agent = proxyUrl ? new HttpsProxyAgent(proxyUrl) : undefined;
const res = await axios.get("https://httpbin.org/ip", {
timeout: 30000,
httpsAgent: agent,
// Disable axios's built-in proxy handling so it doesn't conflict with
// the agent; set httpAgent similarly if you fetch plain-http URLs
proxy: false,
});
console.log(res.data);
4.2 ProxiesAPI with Playwright
For Playwright, you can set the proxy at the browser or context level:
import { chromium } from "playwright";
const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
// Playwright takes proxy credentials as separate fields, not inline in the URL
const u = proxyUrl ? new URL(proxyUrl) : null;
const browser = await chromium.launch({
headless: true,
proxy: u
? {
server: `${u.protocol}//${u.host}`,
username: decodeURIComponent(u.username),
password: decodeURIComponent(u.password),
}
: undefined,
});
const page = await browser.newPage();
await page.goto("https://httpbin.org/ip");
console.log(await page.textContent("body"));
await browser.close();
Note: exact proxy format depends on the endpoint ProxiesAPI gives you. Use ProxiesAPI’s docs for the correct server string and authentication style.
Part 5: Concurrency control (don’t DDoS your own success)
Even if you can run 100 concurrent requests, you usually shouldn’t.
In Node, a good default is p-limit:
import pLimit from "p-limit";
const limit = pLimit(5);
const results = await Promise.all(
urls.map((url) =>
limit(async () => {
const html = await fetchHtml(url);
return parse(html);
})
)
);
The goal is stable completion, not maximum speed.
Practical anti-block playbook (Node edition)
- Keep concurrency low (3–10)
- Add jittery sleeps
- Rotate IPs (ProxiesAPI) when scale increases
- Cache successful responses
- Detect block pages and stop, don’t hammer
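The last bullet deserves code: a crawler that can't tell a block page from a content page will retry its way into a harder ban. A hedged sketch of a detector; the status codes and phrases below are common patterns, not an exhaustive or authoritative list:

```javascript
// Common block signals: hard statuses (403/429) or challenge wording
// in an otherwise-200 response body. Extend the list per target site.
const BLOCK_PHRASES = ["captcha", "access denied", "unusual traffic", "rate limit"];

function looksBlocked(status, html) {
  if (status === 403 || status === 429) return true;
  const lower = html.toLowerCase();
  return BLOCK_PHRASES.some((phrase) => lower.includes(phrase));
}

// Usage idea: on a block, pause the whole crawl instead of retrying hot
if (looksBlocked(200, "<h1>Please solve this CAPTCHA</h1>")) {
  console.log("blocked: cooling down before continuing");
}
```

Wire this into your retry logic as a non-retryable condition: a block page is a signal to slow down or rotate IPs, not to hammer the same endpoint again.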
FAQ: web scraping with JavaScript
Is web scraping legal?
Depends on the site and jurisdiction. Don’t scrape private data, and respect robots/ToS where applicable.
Cheerio vs Playwright?
Cheerio for HTML you can fetch; Playwright when the content requires JS.
Do I always need proxies?
No. But as your URL count grows, proxies become the simplest way to prevent one IP from being throttled.
Next steps
If you want to go from “works on my machine” to production:
- store results in a database (SQLite/Postgres)
- re-crawl incrementally
- build a failure dashboard (URLs failing by reason)
- move Playwright scrapes to worker queues
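The first two bullets can start much smaller than Postgres. A minimal sketch of incremental-crawl bookkeeping using only the Node standard library: append results as NDJSON, and on re-runs skip URLs already on disk. The file name and record shape are illustrative; swap in SQLite/Postgres once volume grows:

```javascript
// NDJSON store: one JSON object per line, append-only. Crash-safe enough
// for a starter crawler, and trivially greppable.
import fs from "node:fs";

const STORE = "results.ndjson";

// Rebuild the set of already-crawled URLs from disk
function loadSeen(path = STORE) {
  if (!fs.existsSync(path)) return new Set();
  return new Set(
    fs
      .readFileSync(path, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line).url)
  );
}

function saveResult(result, path = STORE) {
  fs.appendFileSync(path, JSON.stringify(result) + "\n");
}

const seen = loadSeen();
for (const url of ["https://example.com/a", "https://example.com/b"]) {
  if (seen.has(url)) continue; // already crawled: skip on re-runs
  saveResult({ url, scrapedAt: new Date().toISOString() });
}
```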
ProxiesAPI slots in as the network reliability layer when your crawler starts hitting IP limits.