Web Scraping with JavaScript and Node.js: A Complete Practical Tutorial (2026)

This is a hands-on guide to web scraping with JavaScript + Node.js.

You’ll learn two scraping modes:

  1. Fast mode (HTML): fetch + cheerio — best for server-rendered pages
  2. Fallback mode (browser): playwright — best for JS-heavy pages

And we’ll wrap both in a production-friendly structure:

  • timeouts
  • retries with exponential backoff
  • concurrency control
  • rate limiting
  • proxy support (including ProxiesAPI)

If your goal is: “I want data reliably, at scale” — this is the stack.

Make your Node scrapers reliable with ProxiesAPI

When your Node scraper goes from 20 URLs to 20,000, you’ll see more timeouts, blocks, and flaky responses. ProxiesAPI gives you a proxy layer (rotation + reputation) so you can keep the crawl stable without hand-rolling infra.


When Node.js is a great choice for scraping

Node shines when:

  • you already build with JS/TS
  • you want to share parsing code with a front-end
  • you need a great async I/O model
  • you want easy access to Playwright/Puppeteer

Python still has a massive ecosystem, but Node is absolutely production-grade for scraping.


Project setup

Create a folder and initialize:

mkdir node-scraper
cd node-scraper
npm init -y

Install dependencies:

npm install cheerio p-limit

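Node 18+ includes fetch built-in.

The examples below use ES modules (import/export and top-level await), so mark the project as ESM:

npm pkg set type=module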

If you want the browser fallback:

npm install playwright
npx playwright install

Part 1: Fast HTML scraping with fetch + Cheerio

1) Fetch a page with a real timeout

Node’s built-in fetch doesn’t include a timeout by default. Use AbortController:

// fetch-with-timeout.js
export async function fetchWithTimeout(url, { timeoutMs = 20000, ...opts } = {}) {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const res = await fetch(url, {
      ...opts,
      signal: controller.signal,
      headers: {
        "user-agent":
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        ...(opts.headers || {}),
      },
    });

    if (!res.ok) {
      throw new Error(`HTTP ${res.status} for ${url}`);
    }

    const text = await res.text();
    return text;
  } finally {
    clearTimeout(id);
  }
}

2) Parse HTML with Cheerio

Cheerio gives you a jQuery-like API.

Example: scrape a blog list page into {title, url}.

// parse-list.js
import * as cheerio from "cheerio";

export function parseList(html, baseUrl) {
  const $ = cheerio.load(html);

  const items = [];
  $("article a").each((_, el) => {
    const href = $(el).attr("href");
    const title = $(el).text().trim();
    if (!href || !title) return;

    const url = href.startsWith("http") ? href : new URL(href, baseUrl).toString();
    items.push({ title, url });
  });

  return items;
}

3) Run it

// index.js
import { fetchWithTimeout } from "./fetch-with-timeout.js";
import { parseList } from "./parse-list.js";

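// Avoid naming this constant `URL`: that would shadow the global URL class in this module.
const startUrl = "https://example.com";

const html = await fetchWithTimeout(startUrl, { timeoutMs: 20000 });
const items = parseList(html, startUrl);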

console.log("items:", items.length);
console.log(items.slice(0, 5));

Run:

node index.js

Part 2: Make it production-friendly (retries + concurrency)

Retries with backoff

You can roll a tiny retry helper:

// retry.js
export async function retry(fn, { attempts = 5, minDelayMs = 500 } = {}) {
  let lastErr;

  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i === attempts) break; // no point sleeping after the final attempt
      // Exponential backoff: 500 ms, 1 s, 2 s, ... capped at 20 s
      const delay = Math.min(20000, minDelayMs * 2 ** (i - 1));
      await new Promise((r) => setTimeout(r, delay));
    }
  }

  throw lastErr;
}
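
The helper above retries every failure, including ones that will never succeed, such as a 404. Here is a minimal sketch of a more selective variant. It assumes errors carry a numeric status property; HttpError, isRetryable, and retryIf are hypothetical names for illustration, not part of the helper above or of any library. It also adds random jitter so that many workers do not retry in lockstep.

```javascript
// Hypothetical error type: assumes your fetch wrapper throws this for non-2xx responses.
class HttpError extends Error {
  constructor(status, url) {
    super(`HTTP ${status} for ${url}`);
    this.status = status;
  }
}

// Treat rate limiting (429) and server errors (5xx) as transient.
// Other HTTP errors (404, 403, ...) are assumed to be permanent.
function isRetryable(err) {
  if (err instanceof HttpError) {
    return err.status === 429 || err.status >= 500;
  }
  return true; // network errors and timeouts
}

async function retryIf(
  fn,
  { attempts = 5, minDelayMs = 500, maxDelayMs = 20000, shouldRetry = isRetryable } = {}
) {
  for (let i = 1; ; i++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt or on errors that will not go away.
      if (i >= attempts || !shouldRetry(err)) throw err;
      const cap = Math.min(maxDelayMs, minDelayMs * 2 ** (i - 1));
      // Jitter: wait a random time between 0 and the capped backoff.
      const delay = Math.random() * cap;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

To use it with the fetch wrapper from Part 1, you would throw new HttpError(res.status, url) there instead of a plain Error.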

Concurrency limiting

When you have 10,000 URLs, you don’t want 10,000 parallel requests.

Use p-limit:

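import pLimit from "p-limit";
import { fetchWithTimeout } from "./fetch-with-timeout.js";

// Placeholder list: replace with your own URLs
const urls = ["https://example.com/a", "https://example.com/b"];

const limit = pLimit(5); // at most 5 requests in flight at once

// Note: Promise.all rejects on the first failure.
// Use Promise.allSettled if you want to keep partial results.
const pages = await Promise.all(
  urls.map((u) =>
    limit(async () => {
      const html = await fetchWithTimeout(u, { timeoutMs: 20000 });
      return { url: u, htmlLen: html.length };
    })
  )
);

console.log("done", pages.length);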
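
Concurrency limiting caps how many requests are in flight, but it does not cap how many start per second. Here is a minimal rate-limiting sketch that spaces request starts by a fixed interval. createRateLimiter is a hypothetical helper written for this example; it is not part of p-limit or Node.

```javascript
// Returns a function that runs tasks no more often than once per minIntervalMs.
function createRateLimiter(minIntervalMs) {
  let nextSlot = 0; // earliest time (ms since epoch) at which the next task may start

  return async function schedule(task) {
    const now = Date.now();
    const startAt = Math.max(now, nextSlot);
    nextSlot = startAt + minIntervalMs; // reserve the slot before waiting
    const wait = startAt - now;
    if (wait > 0) {
      await new Promise((r) => setTimeout(r, wait));
    }
    return task();
  };
}

// Example: at most one request start every 200 ms (about 5 per second).
const throttle = createRateLimiter(200);
```

The two controls compose: wrapping a throttled call in the concurrency limiter, as in limit(() => throttle(() => fetchWithTimeout(u))), bounds both the number of open requests and the request rate.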

Part 3: Proxy support (including ProxiesAPI)

In Node, proxying depends on whether you:

  • use fetch (needs a proxy agent)
  • use Playwright (proxy option is built-in)

Option A: Use Playwright with a proxy (simplest)

import { chromium } from "playwright";

const browser = await chromium.launch({
  headless: true,
  proxy: {
    server: "http://YOUR_PROXIESAPI_PROXY_HOST:PORT",
    username: "YOUR_USERNAME",
    password: "YOUR_PASSWORD",
  },
});

const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "domcontentloaded" });
const html = await page.content();
await browser.close();

This is a clean way to integrate ProxiesAPI for JS-heavy targets.

Option B: fetch through an HTTP proxy

Node's built-in fetch is implemented on undici and ignores the agent option, so agent libraries such as https-proxy-agent have no effect on it (they work with node-fetch and the http/https modules). With undici, pass a ProxyAgent as the dispatcher instead:

npm install undici

import { fetch, ProxyAgent } from "undici";

const proxyUrl = process.env.PROXY_URL; // e.g. http://user:pass@host:port
const dispatcher = new ProxyAgent(proxyUrl);

const res = await fetch("https://example.com", { dispatcher });

(Exact options can vary by Node and undici version. If you use node-fetch rather than the built-in fetch, https-proxy-agent with the agent option is the usual approach.)


Part 4: Browser fallback for JS-heavy sites (Playwright)

When HTML scraping fails because content is rendered client-side, use Playwright:

import { chromium } from "playwright";

export async function fetchRenderedHtml(url) {
  const browser = await chromium.launch({ headless: true });

  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle", timeout: 60000 });

    // If the site lazy-loads, a small scroll can help
    await page.mouse.wheel(0, 1500);
    await page.waitForTimeout(800);

    return await page.content();
  } finally {
    // Always close the browser, even if navigation throws, so no process is leaked
    await browser.close();
  }
}

This “rendered snapshot” is also great for debugging selectors.
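
One way to combine the two modes is to try the fast path first and launch the browser only when the HTML looks like an empty app shell. The sketch below is a rough heuristic, not a definitive detector: looksUnrendered and scrapeWithFallback are hypothetical names, the 200-character threshold is an arbitrary assumption, and the two fetchers are passed in so you can plug in your own fast-mode and browser-mode functions.

```javascript
// Heuristic: a page with very little visible text was probably rendered client-side.
function looksUnrendered(html, { minTextLength = 200 } = {}) {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline scripts
    .replace(/<style[\s\S]*?<\/style>/gi, "") // drop inline styles
    .replace(/<[^>]+>/g, " ") // drop remaining tags
    .replace(/\s+/g, " ")
    .trim();
  return text.length < minTextLength;
}

// fetchFast and fetchRendered are caller-supplied async functions: (url) => html string.
async function scrapeWithFallback(
  url,
  { fetchFast, fetchRendered, needsBrowser = looksUnrendered }
) {
  const fastHtml = await fetchFast(url);
  if (!needsBrowser(fastHtml)) {
    return { mode: "html", html: fastHtml };
  }
  // Fast mode returned a near-empty shell: pay for a real browser.
  const renderedHtml = await fetchRendered(url);
  return { mode: "browser", html: renderedHtml };
}
```

For real targets, a site-specific check (for example, "does the expected selector match anything?") is usually more reliable than a text-length threshold.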


Comparison: Cheerio vs Playwright

Feature              Cheerio (HTML)       Playwright (Browser)
Speed                Very fast            Slower
Cost                 Low                  Higher
JS-heavy sites       Fails                Works
Anti-bot exposure    Lower                Often higher
Selector stability   Depends on markup    Often better

Rule of thumb:

  • Start with Cheerio.
  • Add Playwright only when needed.

Common scraping mistakes (and fixes)

  1. No timeouts → requests hang forever
  2. Too much concurrency → you overload the target site (and get blocked)
  3. No retries → flaky pages break your pipeline
  4. No dedupe → you scrape the same page repeatedly
  5. No proxy plan → your IP gets burned at scale
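
For mistake 4, a small normalization step before fetching goes a long way, because the same page often appears under several spellings. Here is a minimal sketch using only the built-in URL class. normalizeUrl and dedupeUrls are hypothetical helpers, and the normalization rules (drop the fragment, drop utm_* parameters, sort the query string) are assumptions that you should adapt per site.

```javascript
// Reduce a URL to a canonical form so equivalent links compare equal.
function normalizeUrl(rawUrl) {
  const u = new URL(rawUrl); // throws on invalid input; also lowercases the host
  u.hash = ""; // fragments never reach the server
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_")) u.searchParams.delete(key); // tracking parameters
  }
  u.searchParams.sort(); // make parameter order irrelevant
  return u.toString();
}

// Keep the first occurrence of each normalized URL; skip anything unparseable.
function dedupeUrls(urls) {
  const seen = new Set();
  const unique = [];
  for (const raw of urls) {
    let key;
    try {
      key = normalizeUrl(raw);
    } catch {
      continue;
    }
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(key);
  }
  return unique;
}

const unique = dedupeUrls([
  "https://example.com/a?b=2&a=1#top",
  "https://EXAMPLE.com/a?a=1&b=2&utm_source=x",
  "https://example.com/b",
]);
console.log(unique); // two entries: /a?a=1&b=2 and /b
```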

A tiny “real” scraper template (put it all together)

import pLimit from "p-limit";
import * as cheerio from "cheerio";
import { fetchWithTimeout } from "./fetch-with-timeout.js";
import { retry } from "./retry.js";

const startUrl = "https://example.com";
const limit = pLimit(5);

function parseLinks(html, baseUrl) {
  const $ = cheerio.load(html);
  const links = new Set();

  $("a").each((_, el) => {
    const href = $(el).attr("href");
    if (!href) return;
    try {
      const u = new URL(href, baseUrl);
      if (u.protocol.startsWith("http")) links.add(u.toString());
    } catch {}
  });

  return [...links];
}

const html = await retry(() => fetchWithTimeout(startUrl, { timeoutMs: 20000 }));
const urls = parseLinks(html, startUrl).slice(0, 50);

const results = await Promise.all(
  urls.map((u) =>
    limit(() =>
      retry(async () => {
        const h = await fetchWithTimeout(u, { timeoutMs: 20000 });
        return { url: u, bytes: h.length };
      })
    )
  )
);

console.log("scraped", results.length);
console.log(results.slice(0, 5));

FAQ

Is scraping with JavaScript “worse” than Python?

No. The language matters less than:

  • your fetch/retry/rate-limit strategy
  • whether you can handle JS-heavy pages
  • your proxy approach

Do I always need a browser?

No. Browsers are powerful but expensive. Prefer HTML scraping first.


Next steps

  • add a URL frontier (queue) + visited set
  • store results in SQLite or Postgres
  • add ProxiesAPI rotation when the crawl becomes flaky
  • add a Playwright mode for the hard targets
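
As a starting point for the first item, here is a minimal sketch of a URL frontier with a visited set. It assumes a caller-supplied fetchLinks(url) function that resolves to an array of absolute URLs (for instance, a fetch step followed by link extraction). crawl and sameOrigin are hypothetical names. The sketch is sequential and stays on the start URL's origin to keep the idea clear; a real crawler would add the concurrency, retry, and rate-limit pieces from earlier sections.

```javascript
// True if `link` parses as a URL on the given origin.
function sameOrigin(link, origin) {
  try {
    return new URL(link).origin === origin;
  } catch {
    return false; // unparseable links are skipped
  }
}

// Breadth-first crawl: returns the URLs that were fetched, in order.
async function crawl(startUrl, { fetchLinks, maxPages = 100 }) {
  const origin = new URL(startUrl).origin;
  const frontier = [startUrl]; // FIFO queue of URLs still to fetch
  const visited = new Set([startUrl]); // everything ever queued
  const fetched = [];

  while (frontier.length > 0 && fetched.length < maxPages) {
    const url = frontier.shift();
    let links;
    try {
      links = await fetchLinks(url);
    } catch {
      continue; // skip pages that fail; add retries here if needed
    }
    fetched.push(url);

    for (const link of links) {
      if (visited.has(link) || !sameOrigin(link, origin)) continue;
      visited.add(link); // mark when queued so a URL is never queued twice
      frontier.push(link);
    }
  }

  return fetched;
}
```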
