Web Scraping with JavaScript and Node.js: Full Tutorial (2026)
If you already know JavaScript, Node.js is one of the fastest ways to build a web scraper:
- great HTTP tooling
- strong HTML parsers (Cheerio)
- easy concurrency control
- perfect for pipelines (scrape → transform → store)
In this 2026 tutorial, you’ll build a complete Node scraper that handles the stuff that actually matters:
- fetching pages with timeouts
- parsing HTML using Cheerio selectors
- crawling pagination safely
- retrying failures with backoff
- exporting data to CSV
- routing requests through ProxiesAPI when you need more reliability
Most scrapers fail in the network layer: timeouts, throttling, and blocks. ProxiesAPI gives your Node.js scraper a simple proxy route so your crawling stays more reliable as you add more targets and more URLs.
When Node.js is a good fit for scraping
Node is especially good for:
- scraping server-rendered HTML pages
- building “ETL-style” scrapers
- running lots of small jobs (cron, queues)
- building internal dashboards that refresh regularly
Node is not a magic bullet for:
- heavily client-rendered apps (React apps that fetch everything via XHR)
- pages that require solving complex bot challenges
For those cases, you usually move to a browser automation stack (Playwright) or a first-party API.
Setup
Create a new project:
mkdir node-scraper
cd node-scraper
npm init -y
Install dependencies:
npm install cheerio dotenv
Node 18+ already includes fetch. (If you’re on an older Node, install node-fetch.)
Create a .env:
PROXIESAPI_KEY="YOUR_KEY"
Part 1 — Build a robust fetch() with retries
Scrapers die in the network layer. So we’ll start there.
We want:
- timeouts (never hang)
- retries for transient failures
- a stable User-Agent
// scraper.js
import * as cheerio from "cheerio";
import "dotenv/config";

const PROXIESAPI_KEY = process.env.PROXIESAPI_KEY;
const PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/";

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function proxiesApiUrl(targetUrl) {
  if (!PROXIESAPI_KEY) throw new Error("Missing PROXIESAPI_KEY");
  const u = new URL(PROXIESAPI_ENDPOINT);
  u.searchParams.set("auth_key", PROXIESAPI_KEY);
  u.searchParams.set("url", targetUrl);
  return u.toString();
}

async function fetchHtml(url, { useProxiesApi = false, timeoutMs = 45000, retries = 4 } = {}) {
  let lastErr;
  for (let attempt = 1; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const t = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const finalUrl = useProxiesApi ? proxiesApiUrl(url) : url;
      const res = await fetch(finalUrl, {
        signal: controller.signal,
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
          Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
          "Accept-Language": "en-US,en;q=0.9",
        },
      });
      if (!res.ok) {
        throw new Error(`HTTP ${res.status} ${res.statusText}`);
      }
      const html = await res.text();
      // Heuristic: suspiciously small responses are often block or error pages.
      // Tune (or remove) this threshold for targets with genuinely small pages.
      if (html.length < 2000) {
        throw new Error(`Response too small (${html.length} bytes)`);
      }
      return html;
    } catch (e) {
      lastErr = e;
      if (attempt === retries) break;
      // Exponential backoff with jitter: 2s, 4s, 8s... plus up to 400ms of noise.
      const backoff = (2 ** attempt) * 1000 + Math.floor(Math.random() * 400);
      console.log(`attempt ${attempt} failed: ${e}. sleeping ${backoff}ms`);
      await sleep(backoff);
    } finally {
      clearTimeout(t);
    }
  }
  throw new Error(`Failed to fetch ${url}: ${lastErr}`);
}
Use ProxiesAPI or not?
- For friendly sites: set `useProxiesApi: false`.
- For targets that start throttling or denying requests: set `useProxiesApi: true`.
In real pipelines, you often start without proxies, then switch on proxies for specific hosts.
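A third option is a fallback wrapper: try the direct route first, and spend proxy requests only when it fails. Here is a sketch; `fetchFn` is a hypothetical injection point standing in for the `fetchHtml()` above, which also keeps the wrapper easy to test without a network.

```javascript
// Fallback routing sketch: direct first, proxy only on failure.
// `fetchFn(url, opts)` stands in for fetchHtml() from this tutorial.
async function fetchWithFallback(url, fetchFn) {
  try {
    // Cheap path: no proxy, fewer retries.
    return await fetchFn(url, { useProxiesApi: false, retries: 2 });
  } catch (err) {
    // Direct route failed (timeout, 403, 429...): retry through ProxiesAPI.
    return fetchFn(url, { useProxiesApi: true, retries: 3 });
  }
}
```

Called as `fetchWithFallback(url, fetchHtml)`, it behaves like `fetchHtml` but only pays for proxied requests on hosts that actually reject you.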
Part 2 — Parse HTML with Cheerio (selectors that make sense)
Cheerio gives you jQuery-like selectors.
Here’s a generic example for parsing a “listing card” page:
function parseListingPage(html, { baseUrl }) {
  const $ = cheerio.load(html);
  const out = [];

  $("article, .card, .product, .result").each((_, el) => {
    // Try to find a reasonable link + title inside a card.
    const a = $(el).find("a[href]").first();
    const href = a.attr("href");
    const title = a.text().trim();
    if (!href || title.length < 4) return;

    const url = new URL(href, baseUrl).toString();

    // Best-effort price extraction.
    const textBlob = $(el).text().replace(/\s+/g, " ").trim();
    const priceMatch = textBlob.match(/(\$\s?\d[\d\.,]*|€\s?\d[\d\.,]*|£\s?\d[\d\.,]*)/);

    out.push({
      title,
      url,
      price: priceMatch ? priceMatch[1] : null,
    });
  });

  return out;
}
This pattern is useful because it’s resilient across sites: you don’t overfit to one class name.
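You can sanity-check the best-effort price regex on its own before pointing it at real pages. This is the same pattern used in `parseListingPage`, run against a sample string:

```javascript
// The best-effort price regex from parseListingPage, tested on sample text.
const priceRe = /(\$\s?\d[\d\.,]*|€\s?\d[\d\.,]*|£\s?\d[\d\.,]*)/;

const sample = "Deluxe Widget $1,299.00 free shipping";
const m = sample.match(priceRe);
console.log(m ? m[1] : null); // "$1,299.00"
```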
For high accuracy on a single target, you should:
- inspect the HTML in DevTools
- target specific containers (e.g. `div[data-testid='product-card']`)
- write selectors based on attributes instead of CSS classes
Part 3 — Crawl pagination (without guessing)
The cleanest pagination approach is: follow the Next link.
We’ll implement:
- `rel="next"` if present
- a fallback selector for anchors containing “Next”
function findNextPageUrl(html, currentUrl) {
  const $ = cheerio.load(html);

  const relNext = $("link[rel='next']").attr("href");
  if (relNext) return new URL(relNext, currentUrl).toString();

  // Fallback (English sites): anchor text contains "Next"
  let nextHref = null;
  $("a[href]").each((_, a) => {
    const text = $(a).text().trim().toLowerCase();
    if (text === "next" || text.includes("next ") || text.includes(" next")) {
      nextHref = $(a).attr("href");
      return false;
    }
  });

  return nextHref ? new URL(nextHref, currentUrl).toString() : null;
}

async function crawl(startUrl, { maxPages = 5, useProxiesApi = false } = {}) {
  let url = startUrl;
  const all = [];
  const seen = new Set();

  for (let page = 1; page <= maxPages; page++) {
    const html = await fetchHtml(url, { useProxiesApi });
    const rows = parseListingPage(html, { baseUrl: url });

    for (const r of rows) {
      if (seen.has(r.url)) continue;
      seen.add(r.url);
      all.push(r);
    }

    console.log(`page ${page}: rows=${rows.length} total_unique=${all.length}`);

    const nextUrl = findNextPageUrl(html, url);
    if (!nextUrl) break;
    url = nextUrl;
  }

  return all;
}
Part 4 — Export to CSV (no dependencies)
import fs from "node:fs";

function toCsv(rows) {
  const headers = Object.keys(rows[0]);

  const escape = (v) => {
    if (v === null || v === undefined) return "";
    const s = String(v);
    if (s.includes(",") || s.includes("\n") || s.includes('"')) {
      return '"' + s.replace(/"/g, '""') + '"';
    }
    return s;
  };

  const lines = [headers.join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => escape(row[h])).join(","));
  }
  return lines.join("\n") + "\n";
}

async function main() {
  const startUrl = "https://news.ycombinator.com/"; // replace with your target
  const rows = await crawl(startUrl, { maxPages: 3, useProxiesApi: true });
  if (!rows.length) throw new Error("No rows scraped");

  const csv = toCsv(rows);
  fs.writeFileSync("scraped.csv", csv, "utf-8");
  console.log("wrote scraped.csv rows:", rows.length);
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});
Practical advice (what experienced scrapers do)
1) Start with a “selector budget”
Don’t start by writing 25 selectors.
Start with:
- a URL selector
- a title selector
- a price selector
Ship a dataset. Then iterate.
2) Prefer attributes over CSS classes
Classes change. Attributes like data-testid, aria-label, and semantic tags change less.
3) Treat every scrape as an ETL job
A good scraper has three stages:
- Extract: fetch HTML
- Transform: parse into normalized rows
- Load: write to CSV/DB/API
Keeping these separate makes your code maintainable.
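The split can be made literal with a tiny job runner whose stages are plain injected functions. This is a sketch with illustrative names (`runJob` is not part of the code above); the payoff is that Transform can be unit-tested against a saved HTML fixture with no network at all.

```javascript
// Sketch: each ETL stage is an injected function.
async function runJob({ extract, transform, load }) {
  const raw = await extract();   // e.g. () => fetchHtml(startUrl)
  const rows = transform(raw);   // e.g. (html) => parseListingPage(html, { baseUrl })
  await load(rows);              // e.g. (rows) => fs.writeFileSync("out.csv", toCsv(rows))
  return rows.length;
}
```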
4) Expect failures
Plan for:
- timeouts
- 429 rate limits
- intermittent HTML differences
Retries + backoff + logging are not optional.
ProxiesAPI integration patterns (without overpromising)
You have three common options:
- Always on: every request goes through ProxiesAPI
- Host-based: only “difficult” hosts go through ProxiesAPI
- Fallback: try direct first, then retry with ProxiesAPI
If you run a lot of scraping jobs, host-based routing is usually the sweet spot.
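A host-based router can be as small as a Set lookup. The host names below are made-up examples:

```javascript
// Sketch: route only known-difficult hosts through ProxiesAPI.
const PROXIED_HOSTS = new Set(["shop.example.com", "news.example.org"]);

function shouldUseProxiesApi(targetUrl) {
  // URL is a global in Node 18+; hostname excludes the port.
  return PROXIED_HOSTS.has(new URL(targetUrl).hostname);
}
```

Then call `fetchHtml(url, { useProxiesApi: shouldUseProxiesApi(url) })` and grow the set as hosts start throttling you.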
Comparison: Node.js scraping libraries (2026)
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| fetch + Cheerio | HTML pages | Fast, simple, cheap | No JS rendering |
| Axios + Cheerio | HTML pages | Mature ecosystem | Extra dependency |
| Playwright | JS-heavy sites | Accurate, renders pages | Slower, heavier |
| Puppeteer | JS-heavy sites | Popular | Similar tradeoffs |
FAQ
Is web scraping legal?
It depends on what you scrape, how you use it, and the jurisdiction.
Common best practices:
- scrape public pages
- respect rate limits
- don’t collect sensitive personal data
- check the target’s terms and local law
What about robots.txt?
robots.txt is not a law, but it’s a strong signal of what the site expects.
If you’re doing something commercial, treat it as part of your compliance process.
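If you want to honor it programmatically, a deliberately simplified check for the `User-agent: *` group looks like this. Real robots.txt matching also involves `Allow` rules, wildcards, and longest-match precedence, so use a proper parser library for production:

```javascript
// Simplified sketch: is `path` disallowed for "User-agent: *"?
// Only handles plain Disallow prefixes inside the * group.
function isDisallowed(robotsTxt, path) {
  let inStarGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    if (/^user-agent:/i.test(line)) {
      inStarGroup = line.slice(line.indexOf(":") + 1).trim() === "*";
    } else if (inStarGroup && /^disallow:/i.test(line)) {
      const rule = line.slice(line.indexOf(":") + 1).trim();
      if (rule && path.startsWith(rule)) return true;
    }
  }
  return false;
}
```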
QA checklist
- Fetch HTML with timeouts (no hanging)
- Parse a page and log the first 3 extracted rows
- Pagination increases unique URLs
- CSV opens cleanly in Excel/Sheets
- Retries/backoff are visible in logs