Web Scraping with JavaScript and Node.js: Full Tutorial (Puppeteer/Playwright + ProxiesAPI)
If you’re building scrapers in 2026, Node.js is a killer choice:
- fast iteration
- a huge ecosystem
- excellent browser automation (Playwright/Puppeteer)
But the biggest mistake people make is jumping straight to headless browsers for everything.
A production setup is two-tier:
- HTTP-first (cheap + fast): fetch HTML and parse it
- Browser fallback (expensive + powerful): only for pages that truly need JS
This tutorial gives you a complete, copy-pasteable stack for web scraping with JavaScript and Node.js.
As soon as your crawler hits real-world scale (more URLs, more concurrency, more blocks), the proxy layer becomes the difference between a toy script and a reliable pipeline.
The 80/20 architecture (HTTP-first → browser fallback)
Here’s the pattern you want:
- a URL queue
- an HTTP fetcher (with retries, timeouts, and proxies)
- an HTML parser (Cheerio)
- an escalation rule:
  - if content is missing,
  - or you get soft-blocked,
  - then render with Playwright
This keeps costs down and throughput up.
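In code, the whole decision fits in a few lines. A quick sketch of the flow (the production versions of fetchHtml, needsBrowser, and renderHtml are built in the steps below):

// Sketch of the two-tier flow; full implementations follow in Steps 1-4
let html = await fetchHtml(url); // cheap HTTP path first
if (needsBrowser(html)) {
  html = await renderHtml(url); // escalate to a real browser only when needed
}
const data = parse(html); // your Cheerio-based extraction (Step 1)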
Project setup
mkdir node-scraper
cd node-scraper
npm init -y
npm pkg set type=module
npm i axios cheerio p-limit
npm i -D playwright
(The examples use ES-module import syntax, which is why the setup sets "type": "module". You can use Puppeteer instead of Playwright; I’ll show both patterns. I prefer Playwright for reliability and multi-browser support.)
Step 1: HTTP scraping with Axios + Cheerio
A production-grade fetcher needs:
- timeouts
- retry with backoff (simple and effective)
- proxy support via env vars (ProxiesAPI)
// fetch.js
import axios from "axios";
const TIMEOUT_MS = 30_000;
function sleep(ms) {
return new Promise((r) => setTimeout(r, ms));
}
function buildProxyFromEnv() {
// Prefer a single URL like: http://USER:PASS@proxy.proxiesapi.com:1234
const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
if (!proxyUrl) return null;
// Axios's `proxy` option is host/port based; for a full proxy URL the
// dependency-light route is the HTTP(S)_PROXY environment variables,
// which Axios's Node adapter (and many other HTTP stacks) respect.
// For strict per-request proxying in Axios, use `https-proxy-agent` (sketched after this file).
return proxyUrl;
}
export async function fetchHtml(url) {
const proxyUrl = buildProxyFromEnv();
const headers = {
"User-Agent":
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
};
// If your environment supports it, this is the easiest way to route traffic.
// Many tools respect HTTP(S)_PROXY.
if (proxyUrl) {
process.env.HTTP_PROXY = proxyUrl;
process.env.HTTPS_PROXY = proxyUrl;
}
let lastErr;
for (let attempt = 1; attempt <= 4; attempt++) {
try {
const res = await axios.get(url, {
timeout: TIMEOUT_MS,
headers,
maxRedirects: 5,
validateStatus: () => true,
});
if (res.status === 403 || res.status === 429) {
throw new Error(`blocked/throttled: HTTP ${res.status}`);
}
if (res.status >= 500) {
throw new Error(`server error: HTTP ${res.status}`);
}
if (res.status < 200 || res.status >= 300) {
throw new Error(`unexpected status: HTTP ${res.status}`);
}
const html = res.data;
if (!html || typeof html !== "string" || html.length < 500) {
throw new Error("HTML too small (possible block or JS-only page)");
}
return html;
} catch (e) {
  lastErr = e;
  if (attempt < 4) {
    // back off before retrying; no point sleeping after the final attempt
    const backoff = Math.min(1500 * 2 ** (attempt - 1), 10_000);
    await sleep(backoff);
  }
}
}
throw lastErr;
}
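The comments above mention https-proxy-agent for strict per-request proxying in Axios. Here’s a minimal sketch of that route; it assumes the package’s v7 named export and that you’ve run npm i https-proxy-agent:

// Strict per-request proxying (alternative to the HTTP(S)_PROXY env vars)
import { HttpsProxyAgent } from "https-proxy-agent";

const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
const httpsAgent = proxyUrl ? new HttpsProxyAgent(proxyUrl) : undefined;

const res = await axios.get(url, {
  timeout: TIMEOUT_MS,
  headers,
  httpsAgent, // tunnels HTTPS requests through the proxy via CONNECT
  proxy: false, // turn off Axios's built-in proxy handling so the agent is used
});
// For plain-HTTP targets, the companion http-proxy-agent package works the same way.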
Parse with Cheerio
// parse.js
import * as cheerio from "cheerio";
export function parseHackerNewsLike(html) {
const $ = cheerio.load(html);
// Example extraction: all links
const links = [];
$("a").each((_, el) => {
const href = $(el).attr("href");
const text = $(el).text().trim();
if (href) links.push({ href, text });
});
return links.slice(0, 50);
}
(Replace parsing logic with your site’s selectors. Cheerio is basically jQuery for server-side HTML.)
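As a concrete example of site-specific selectors, here’s a sketch for Hacker News front-page stories. The tr.athing / .titleline class names are assumptions about HN’s current markup, so verify them in devtools before relying on this:

// Add to parse.js (reuses the cheerio import above); selectors are assumptions about HN markup
export function parseHackerNewsStories(html) {
  const $ = cheerio.load(html);
  const stories = [];
  $("tr.athing").each((_, row) => {
    const link = $(row).find(".titleline > a").first();
    stories.push({
      id: $(row).attr("id"),
      title: link.text().trim(),
      url: link.attr("href"),
    });
  });
  return stories;
}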
Step 2: Detect when you need a browser
A good escalation rule is:
- your key selector returns 0 matches (a Cheerio check for this is sketched after the example below)
- or the page contains a known block signature
- or HTML is suspiciously small
Example:
// needsBrowser.js
export function needsBrowser(html) {
const lower = html.toLowerCase();
return (
lower.includes("captcha") ||
lower.includes("unusual traffic") ||
lower.includes("/sorry/") ||
html.length < 5_000
);
}
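The first rule in the list ("your key selector returns 0 matches") is easiest to implement with Cheerio. A sketch, where the selector is a placeholder for whatever element your parser actually depends on:

// Could live next to needsBrowser: escalate when the parser's key element is missing
import * as cheerio from "cheerio";

export function missingKeyContent(html, keySelector = "tr.athing") {
  const $ = cheerio.load(html);
  return $(keySelector).length === 0;
}

Escalate to the browser if either this or needsBrowser fires.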
Step 3: Browser scraping with Playwright (with proxy support)
Playwright is your “break glass” option. Use it selectively.
// render.js
import { chromium } from "playwright";
export async function renderHtml(url) {
  // e.g. PROXIESAPI_PROXY_URL=http://user:pass@host:port
  const proxyUrl = process.env.PROXIESAPI_PROXY_URL;

  // Playwright expects proxy credentials as separate fields, not embedded in the server URL.
  let proxy;
  if (proxyUrl) {
    const u = new URL(proxyUrl);
    proxy = { server: `${u.protocol}//${u.host}` };
    if (u.username) {
      proxy.username = decodeURIComponent(u.username);
      proxy.password = decodeURIComponent(u.password);
    }
  }

  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext({
      ...(proxy ? { proxy } : {}),
      userAgent:
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    });
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 45_000 });
    // If your target renders late, use networkidle, but it can hang on long-polling sites.
    // await page.goto(url, { waitUntil: "networkidle", timeout: 45_000 });
    const html = await page.content();
    return html;
  } finally {
    // Close even when goto/content throws, so failed renders don't leak browser processes.
    await browser.close();
  }
}
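If the data you need only appears after client-side rendering, waiting for a specific element is usually more reliable than networkidle. A sketch (the selector is a placeholder for whatever your parser needs):

// Inside renderHtml, after page.goto(...):
await page.waitForSelector(".results-list", { timeout: 15_000 });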
Puppeteer alternative
If you prefer Puppeteer:
import puppeteer from "puppeteer";
const browser = await puppeteer.launch({ headless: true }); // recent Puppeteer defaults to the new headless mode
const page = await browser.newPage();
await page.goto(url, { waitUntil: "domcontentloaded" });
const html = await page.content();
await browser.close();
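Puppeteer configures proxies at launch rather than per context. A minimal sketch replacing the launch lines above, assuming the same PROXIESAPI_PROXY_URL format:

// Proxy with Puppeteer: host:port goes in a launch arg, credentials via page.authenticate()
const proxyUrl = process.env.PROXIESAPI_PROXY_URL; // e.g. http://user:pass@host:port
const u = proxyUrl ? new URL(proxyUrl) : null;

const browser = await puppeteer.launch({
  headless: true,
  args: u ? [`--proxy-server=${u.protocol}//${u.host}`] : [],
});
const page = await browser.newPage();
if (u && u.username) {
  await page.authenticate({
    username: decodeURIComponent(u.username),
    password: decodeURIComponent(u.password),
  });
}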
Step 4: Put it together (queue + concurrency)
// index.js
import pLimit from "p-limit";
import { fetchHtml } from "./fetch.js";
import { renderHtml } from "./render.js";
import { needsBrowser } from "./needsBrowser.js";
import { parseHackerNewsLike } from "./parse.js";
const limit = pLimit(3);
async function scrapeUrl(url) {
let html = await fetchHtml(url);
if (needsBrowser(html)) {
html = await renderHtml(url);
}
return parseHackerNewsLike(html);
}
const urls = ["https://news.ycombinator.com/"];
const results = await Promise.all(
urls.map((u) => limit(() => scrapeUrl(u)))
);
console.log(results[0].slice(0, 5));
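Run it with node index.js; the top-level await and import statements are why the setup step adds "type": "module" to package.json.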
This structure scales cleanly:
- increase concurrency gradually
- add per-domain rate limits (see the sketch below)
- add caching
- add database writes
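For the per-domain rate limits, one lightweight approach is a separate limiter per hostname plus a polite delay. A sketch using the p-limit package already installed above (the concurrency and delay values are assumptions to tune per target):

// rateLimit.js — one concurrency limiter per hostname, plus a fixed delay between requests
import pLimit from "p-limit";

const perDomain = new Map();
const PER_DOMAIN_CONCURRENCY = 1;
const PER_DOMAIN_DELAY_MS = 1_000;

function limiterFor(url) {
  const host = new URL(url).hostname;
  if (!perDomain.has(host)) perDomain.set(host, pLimit(PER_DOMAIN_CONCURRENCY));
  return perDomain.get(host);
}

export function politeScrape(url, scrapeFn) {
  return limiterFor(url)(async () => {
    const result = await scrapeFn(url);
    await new Promise((r) => setTimeout(r, PER_DOMAIN_DELAY_MS));
    return result;
  });
}

In index.js you would wrap the call: limit(() => politeScrape(u, scrapeUrl)).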
Proxies + blocks: the practical checklist
When a Node scraper fails in production, it’s usually one of these:
- no timeouts (requests hang forever)
- no retries/backoff (transient errors kill the run)
- no block detection (you “successfully” parse a CAPTCHA page)
- too much browser automation (slow + expensive)
A simple checklist:
- timeouts everywhere
- retries with exponential backoff
- block detection (403/429 + HTML signatures)
- HTTP-first; browser only when needed
- low steady request rate per domain
- proxies configured via env vars (ProxiesAPI)
Comparison table: Cheerio vs Playwright vs Puppeteer
| Tool | Best for | Pros | Cons |
|---|---|---|---|
| Cheerio | server-rendered HTML | fast, cheap, simple | can’t execute JS |
| Playwright | JS-rendered pages | reliable, modern, multi-browser | slower, higher cost |
| Puppeteer | Chrome automation | big ecosystem | fewer cross-browser features |
Where ProxiesAPI fits (honestly)
Proxies don’t replace good scraping hygiene — they complement it.
Use ProxiesAPI to:
- rotate outbound IPs when throttling starts
- isolate domains (different sessions / IP pools)
- keep high-volume crawls stable
And keep your scraper disciplined:
- HTTP-first
- browser only when needed
- store results so you can resume instead of restarting
Once your crawler hits real-world scale (more URLs, more concurrency, more blocks), that proxy layer is what separates a toy script from a reliable pipeline.