Web Scraping with JavaScript and Node.js: A Full 2026 Tutorial
If you’re building scrapers in 2026, Node.js is a fantastic choice — especially when you care about:
- high concurrency (many URLs in parallel)
- a strong ecosystem (Cheerio, Playwright, p-queue)
- shipping scrapers as production services (Docker, serverless, queues)
This tutorial is a practical, end-to-end guide to web scraping with JavaScript.
You’ll learn:
- when simple HTTP + HTML parsing is enough
- how to scrape with fetch/axios + Cheerio
- how to handle pagination and detail pages
- how to avoid getting blocked (pacing, headers, retries)
- when to switch to Playwright (dynamic sites)
- where ProxiesAPI fits in
As your Node scrapers scale from 50 URLs to 50,000, transient failures and IP-based throttling become the bottleneck. ProxiesAPI helps stabilize your fetch layer with proxy rotation and predictable connectivity.
The mental model: 3 layers of scraping
Almost every scraper has the same shape:
- Fetch a page (HTTP request or headless browser)
- Parse it (HTML → structured data)
- Persist it (JSON/CSV/DB)
If you make each layer clean, you can swap components:
- axios → fetch
- Cheerio → JSDOM
- HTTP → Playwright
- local JSON → PostgreSQL
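The layering above can be sketched as three injected functions. The names (`fetchFn`, `parseFn`, `persistFn`) are illustrative, not a library API:

```javascript
// The three layers as swappable functions. fetchFn / parseFn / persistFn
// are illustrative names, not a library API.
async function scrape(urls, { fetchFn, parseFn, persistFn }) {
  const records = [];
  for (const url of urls) {
    const html = await fetchFn(url); // layer 1: fetch
    records.push(...parseFn(html)); // layer 2: parse
  }
  await persistFn(records); // layer 3: persist
  return records;
}
```

Swapping axios for Playwright then means replacing only `fetchFn`; the parse and persist layers never notice.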
Option 1: Scrape static HTML with Node + Cheerio
If the site is server-rendered, you can scrape it without a browser.
Setup
```shell
mkdir node-scraper
cd node-scraper
npm init -y
npm i axios cheerio p-queue
```
We’ll use:
- axios for HTTP
- cheerio for DOM parsing (jQuery-like selectors)
- p-queue to cap concurrency (a huge anti-block lever)
Build a reliable fetch layer
```javascript
// fetch.js
import axios from "axios";

export async function fetchHtml(url, { timeoutMs = 30000 } = {}) {
  const res = await axios.get(url, {
    timeout: timeoutMs,
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      "Accept-Language": "en-US,en;q=0.9",
    },
    // If needed, you can pass proxy settings here (see ProxiesAPI section)
    validateStatus: (s) => s >= 200 && s < 400,
  });
  return res.data;
}
```
Parse with Cheerio
Let’s scrape a simple list page (example target: quotes.toscrape.com).
```javascript
// scrape-quotes.js
import * as cheerio from "cheerio";
import { fetchHtml } from "./fetch.js";

const BASE = "https://quotes.toscrape.com";

function parseQuotes(html) {
  const $ = cheerio.load(html);
  const out = [];
  $(".quote").each((_, el) => {
    const text = $(el).find(".text").text().trim();
    const author = $(el).find(".author").text().trim();
    const tags = $(el)
      .find(".tags a.tag")
      .map((_, a) => $(a).text().trim())
      .get();
    out.push({ text, author, tags });
  });
  const nextHref = $("li.next a").attr("href");
  const nextUrl = nextHref ? new URL(nextHref, BASE).toString() : null;
  return { out, nextUrl };
}

async function run(pages = 3) {
  let url = BASE;
  const all = [];
  for (let i = 0; i < pages; i++) {
    const html = await fetchHtml(url);
    const { out, nextUrl } = parseQuotes(html);
    all.push(...out);
    if (!nextUrl) break;
    url = nextUrl;
  }
  console.log("quotes:", all.length);
  console.log(all[0]);
}

run();
```
This is web scraping with JavaScript at its cleanest: one HTTP request per page, parse selectors, and paginate.
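One detail worth calling out: `li.next a` usually holds a relative href, so it must be resolved against the page's base URL before the next fetch. A quick sketch of that step (`resolveNext` is my name for it):

```javascript
// new URL(href, base) resolves root-relative, relative, and absolute hrefs alike.
const BASE = "https://quotes.toscrape.com";

function resolveNext(href, base = BASE) {
  return href ? new URL(href, base).toString() : null;
}

console.log(resolveNext("/page/2/")); // https://quotes.toscrape.com/page/2/
console.log(resolveNext(undefined)); // null (no next page)
```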
Option 2: Add concurrency (without getting blocked)
A common mistake is Promise.all(urls.map(fetch)) on thousands of URLs.
Instead, use a queue with controlled concurrency.
```javascript
import PQueue from "p-queue";
import { fetchHtml } from "./fetch.js";

const queue = new PQueue({ concurrency: 3, interval: 1000, intervalCap: 3 });

export async function fetchMany(urls) {
  const results = [];
  for (const url of urls) {
    results.push(
      queue.add(async () => {
        try {
          const html = await fetchHtml(url);
          return { url, ok: true, htmlLen: html.length };
        } catch (e) {
          return { url, ok: false, error: String(e) };
        }
      })
    );
  }
  return Promise.all(results);
}
```
This does two critical anti-block things:
- caps concurrency (you control burstiness)
- caps rate (interval + intervalCap)
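If you'd rather not add p-queue as a dependency, a concurrency cap alone is only a few lines. This is a sketch, not p-queue's API; it limits parallelism but not rate:

```javascript
// Run an array of () => Promise tasks with at most `limit` in flight.
// Results come back in input order.
async function runLimited(tasks, limit = 3) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next index synchronously (no race)
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```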
Option 3: Scrape dynamic sites with Playwright
When a site is JS-rendered (React/Vue/Next with client-side fetch), Cheerio won’t see the data.
Use Playwright.
Setup
```shell
npm i playwright
npx playwright install
```
Example: render and extract text
```javascript
import { chromium } from "playwright";

async function scrapeDynamic(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage({ viewport: { width: 1400, height: 900 } });
  await page.goto(url, { waitUntil: "domcontentloaded", timeout: 60000 });
  // Prefer page.waitForSelector on a meaningful element over a fixed timeout
  await page.waitForTimeout(2000);
  const title = await page.title();
  await browser.close();
  return { title };
}

scrapeDynamic("https://www.google.com/travel/flights?hl=en").then(console.log);
```
You can also extract structured data via:
- page.locator("...").innerText()
- page.$$eval("...", els => els.map(...))
Proxies, anti-blocking, and retries (the practical stack)
If you remember nothing else from this guide, remember this:
- most blocks are caused by behavior (too fast, too many requests) more than tooling
- reliability comes from timeouts + retries + pacing + observability
Headers
Send a real UA and language headers. For some sites, also send:
- Accept: text/html,...
- Referer
Timeouts
Never let requests hang:
- connect timeout ~ 10s
- read timeout ~ 30s
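A minimal sketch of a deadline wrapper that enforces this; with native fetch you can instead pass `AbortSignal.timeout(ms)` as the `signal` option:

```javascript
// Wrap any promise with a hard deadline so a hung socket can't stall the queue.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```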
Retries with backoff
Retry only on transient issues:
- 429 (rate limited)
- 503 (server overloaded)
- network errors
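A sketch of that retry policy, with the transport injected so it stays testable (`fetchWithRetry` and its parameters are illustrative names, not a library API):

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Retry only transient failures: 429, 503, and network errors.
const RETRYABLE = new Set([429, 503]);

async function fetchWithRetry(url, doFetch, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await doFetch(url); // doFetch resolves to { status, ... }
      if (RETRYABLE.has(res.status) && attempt < retries) {
        await sleep(baseDelayMs * 2 ** attempt); // exponential backoff: 500ms, 1s, 2s...
        continue;
      }
      return res; // success, non-retryable status, or retry budget spent
    } catch (err) {
      if (attempt >= retries) throw err; // network error and no budget left
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```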
Cookies and sessions
Keep a session cookie jar (axios does not maintain one by default). Use tough-cookie + axios-cookiejar-support when needed.
CAPTCHA and interstitial detection
In Node, you can detect blocks by searching response HTML for:
- “captcha”
- “unusual traffic”
- “verify you are a human”
Don’t keep hammering if you detect those.
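That check can be a few lines; the marker list below is illustrative and should be tuned per target site:

```javascript
// Cheap block/interstitial detection on the response body.
const BLOCK_MARKERS = ["captcha", "unusual traffic", "verify you are a human"];

function looksBlocked(html) {
  const lower = html.toLowerCase();
  return BLOCK_MARKERS.some((marker) => lower.includes(marker));
}
```

Wire it into your fetch layer so a detected block pauses the queue instead of burning more requests.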
Where ProxiesAPI fits in Node.js scrapers
ProxiesAPI typically fits in two places:
- HTTP scraping (axios/fetch through a proxy)
- Browser scraping (Playwright through a proxy)
1) Axios through a proxy
Axios can use an HTTP proxy agent.
```shell
npm i https-proxy-agent
```
```javascript
import axios from "axios";
import { HttpsProxyAgent } from "https-proxy-agent";

const PROXY_URL = "http://USER:PASS@PROXY_HOST:PORT"; // ProxiesAPI-provided
const agent = new HttpsProxyAgent(PROXY_URL);

export async function fetchHtmlViaProxy(url) {
  const res = await axios.get(url, {
    httpsAgent: agent,
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    },
    timeout: 30000,
  });
  return res.data;
}
```
This is usually the simplest way to wire ProxiesAPI into a Node scraper.
2) Playwright through a proxy
```javascript
import { chromium } from "playwright";

const PROXY = {
  server: "http://PROXY_HOST:PORT",
  username: "USER",
  password: "PASS",
};

const browser = await chromium.launch({ headless: true, proxy: PROXY });
```
Comparison: Cheerio vs Playwright
| Use case | Cheerio (HTTP) | Playwright (Browser) |
|---|---|---|
| Speed | Fast | Slower |
| Cost | Low | Higher |
| Works on JS-heavy sites | No | Yes |
| Easy to scale concurrency | Yes | Harder |
| Bot-detection risk | Lower | Higher |
| Best for | blogs, listings, docs | flights, ecommerce, dashboards |
Practical advice:
- start with Cheerio
- only upgrade to Playwright when data isn’t in HTML
A minimal production checklist
- Put URLs in a queue (Redis/SQS) not a for-loop
- Cap concurrency and rate
- Log status codes and response sizes
- Store raw HTML for failed pages (debug later)
- Implement “block detected → pause”
- Add proxies only when you’re sure pacing alone isn’t enough
Closing thoughts
Web scraping with JavaScript is less about clever selectors and more about boring engineering:
- stable network layer
- conservative concurrency
- predictable retries
If you build those right, you can swap targets without rewriting everything.