Web Scraping with JavaScript and Node.js: Full Tutorial (2026)
If you’re building scrapers in 2026, JavaScript + Node.js is a surprisingly strong default:
- same language as the browser (easy DOM mental model)
- best-in-class tooling for JS-rendered sites (Playwright)
- good performance for I/O-heavy crawlers
This tutorial is a practical, end-to-end “starter kit” for scraping with Node:
- HTTP + HTML parsing (fast path)
- Playwright rendering (JS-heavy path)
- retries, backoff, and caching
- proxy integration with ProxiesAPI
Along the way, we’ll show the tradeoffs and give you copy-pasteable code.
As your Node scraper scales (more URLs, more targets), IP-based throttling becomes the #1 failure mode. ProxiesAPI gives you a stable proxy layer so retries actually work and crawls don’t die on a single blocked IP.
If you only take one thing away: in Node you typically use either:
- Cheerio (parse HTML strings like jQuery) for server-rendered pages
- Playwright (real headless browser) for JS-rendered pages
…and you should decide which one based on the target site’s rendering model.
Quick comparison table (what to use when)
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| fetch/axios + Cheerio | server-rendered HTML | very fast, cheap, easy to deploy | fails on JS apps, fragile selectors |
| Playwright | JS-heavy sites, dynamic UI | accurate DOM, can click/scroll/login | slower, heavier, more detectable |
| Hybrid (HTTP list → browser detail) | catalogs, pagination | cheaper than full-browser crawl | more code complexity |
Part 1: Scrape a server-rendered page with Cheerio (fast path)
Setup
mkdir node-scraper
cd node-scraper
npm init -y
npm i axios cheerio p-limit
We’ll scrape a simple HTML page and extract titles + links.
Code: fetch + parse
// scrape-cheerio.js
import axios from "axios";
import * as cheerio from "cheerio";
const URL = "https://news.ycombinator.com/";
async function fetchHtml(url) {
const res = await axios.get(url, {
timeout: 30000,
headers: {
"User-Agent":
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
Accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
},
});
return res.data;
}
function parseHn(html) {
const $ = cheerio.load(html);
const out = [];
$("tr.athing").each((_, el) => {
const id = $(el).attr("id");
const a = $(el).find("span.titleline > a").first();
out.push({
id,
title: a.text().trim(),
url: a.attr("href"),
});
});
return out;
}
const html = await fetchHtml(URL);
const items = parseHn(html);
console.log("items", items.length);
console.log(items.slice(0, 3));
Run (the examples use ES-module imports and top-level await, so add "type": "module" to package.json or rename the file to scrape-cheerio.mjs):
node scrape-cheerio.js
Part 2: Add retries + backoff (so your crawler doesn’t crumble)
Scrapers fail for boring reasons:
- intermittent 502/503
- TCP timeouts
- temporary throttling
If you don’t retry correctly, you’ll get random holes in your data.
npm i p-retry
import axios from "axios";
import pRetry, { AbortError } from "p-retry";
async function fetchHtml(url) {
return pRetry(
async () => {
const res = await axios.get(url, {
timeout: 30000,
// axios rejects non-2xx responses by default, so a 5xx status check
// would never run; accept all statuses and classify them ourselves
validateStatus: () => true,
});
if (res.status >= 500) throw new Error(`server error ${res.status}`);
// 4xx won't improve on retry: abort instead of burning attempts
if (res.status >= 400) throw new AbortError(`client error ${res.status}`);
return res.data;
},
{
retries: 3,
onFailedAttempt: (err) => {
console.log(
`fetch failed: attempt ${err.attemptNumber} / ${err.retriesLeft + err.attemptNumber}`,
err.message
);
},
}
);
}
Add jitter between requests:
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
await sleep(500 + Math.random() * 1500);
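p-retry applies exponential backoff between attempts by default. If you want to control the delay yourself, a common pattern is "full jitter": grow the window exponentially, cap it, then pick a random point inside it. A minimal sketch (the function name and base/cap values are illustrative, not part of p-retry):

```javascript
// Full-jitter exponential backoff: the window grows as base * 2^attempt,
// capped at `cap`, and the actual delay is a random point in [0, window).
// Randomizing the whole window keeps retrying clients from synchronizing.
function backoffDelay(attempt, base = 500, cap = 15000) {
  const window = Math.min(cap, base * 2 ** attempt);
  return Math.floor(Math.random() * window);
}

// The window for attempts 0..3: 500, 1000, 2000, 4000 ms
for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`attempt ${attempt}: delay up to ${Math.min(15000, 500 * 2 ** attempt)}ms`);
}
```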
Part 3: Scrape JS-heavy sites with Playwright (browser path)
Setup
npm i playwright
npx playwright install --with-deps chromium
Code: render and extract DOM
// scrape-playwright.js
import { chromium } from "playwright";
const URL = "https://example.com"; // replace with your target
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
userAgent:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
});
await page.goto(URL, { waitUntil: "domcontentloaded", timeout: 60000 });
// Prefer waiting for a selector that signals the content loaded,
// e.g. await page.waitForSelector(".results") (selector is illustrative);
// a fixed pause is a blunt fallback
await page.waitForTimeout(1500);
// Example extraction (replace selectors)
const items = await page.$$eval("a", (as) =>
as.slice(0, 10).map((a) => ({ text: a.textContent?.trim(), href: a.href }))
);
console.log(items);
await browser.close();
If you don’t know whether you need Playwright:
- view page source (curl -s URL | head): if it’s mostly empty divs and script tags, you need a browser
- open DevTools → Network and check whether the data arrives via XHR/GraphQL requests
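The manual check above can be roughly automated. This is a heuristic sketch, not a guarantee: server-rendered pages tend to carry substantial visible text outside script tags, while SPA shells are mostly scripts and empty divs. The function name and the 200-character threshold are illustrative:

```javascript
// Strip <script> blocks and remaining tags, then measure what's left.
// Lots of residual text suggests server-rendered HTML (Cheerio is enough);
// near-empty output suggests a JS app (reach for Playwright).
function looksServerRendered(html) {
  const withoutScripts = html.replace(/<script[\s\S]*?<\/script>/gi, "");
  const text = withoutScripts.replace(/<[^>]+>/g, "").trim();
  return text.length > 200; // threshold is arbitrary; tune per target
}

const spaShell =
  '<html><body><div id="root"></div><script src="app.js"></script></body></html>';
console.log(looksServerRendered(spaShell)); // false: almost no visible text
```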
Part 4: Integrate ProxiesAPI (Node)
The most common reason scrapers fail at scale is IP reputation + rate limits.
ProxiesAPI fits as the network layer that:
- routes requests through a proxy endpoint
- can rotate IPs between requests
4.1 ProxiesAPI with Axios
Set an environment variable:
export PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.proxiesapi.com:PORT"
Then configure Axios to use an HTTP proxy agent.
npm i https-proxy-agent
import axios from "axios";
import { HttpsProxyAgent } from "https-proxy-agent";
const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
const agent = proxyUrl ? new HttpsProxyAgent(proxyUrl) : undefined;
const res = await axios.get("https://httpbin.org/ip", {
timeout: 30000,
httpsAgent: agent,
// Disable axios's built-in proxy handling so it doesn't conflict with
// the agent; set httpAgent similarly if you fetch plain-http URLs
proxy: false,
});
console.log(res.data);
4.2 ProxiesAPI with Playwright
For Playwright, you can set the proxy at the browser or context level:
import { chromium } from "playwright";
const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
// Playwright takes proxy credentials as separate fields, not inline in the URL
const u = proxyUrl ? new URL(proxyUrl) : null;
const browser = await chromium.launch({
headless: true,
proxy: u
? {
server: `${u.protocol}//${u.host}`,
username: decodeURIComponent(u.username),
password: decodeURIComponent(u.password),
}
: undefined,
});
const page = await browser.newPage();
await page.goto("https://httpbin.org/ip");
console.log(await page.textContent("body"));
await browser.close();
Note: exact proxy format depends on the endpoint ProxiesAPI gives you. Use ProxiesAPI’s docs for the correct server string and authentication style.
Part 5: Concurrency control (don’t DDoS your own success)
Even if you can run 100 concurrent requests, you usually shouldn’t.
In Node, a good default is p-limit:
import pLimit from "p-limit";
const limit = pLimit(5);
const results = await Promise.all(
urls.map((url) =>
limit(async () => {
const html = await fetchHtml(url);
return parse(html);
})
)
);
The goal is stable completion, not maximum speed.
Practical anti-block playbook (Node edition)
- Keep concurrency low (3–10)
- Add jittery sleeps
- Rotate IPs (ProxiesAPI) when scale increases
- Cache successful responses
- Detect block pages and stop, don’t hammer
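The last bullet deserves code: a crawler that can't tell a block page from a content page will retry its way into a harder ban. A hedged sketch of a detector; the status codes and phrases below are common patterns, not an exhaustive or authoritative list:

```javascript
// Common block signals: hard statuses (403/429) or challenge wording
// in an otherwise-200 response body. Extend the list per target site.
const BLOCK_PHRASES = ["captcha", "access denied", "unusual traffic", "rate limit"];

function looksBlocked(status, html) {
  if (status === 403 || status === 429) return true;
  const lower = html.toLowerCase();
  return BLOCK_PHRASES.some((phrase) => lower.includes(phrase));
}

// Usage idea: on a block, pause the whole crawl instead of retrying hot
if (looksBlocked(200, "<h1>Please solve this CAPTCHA</h1>")) {
  console.log("blocked: cooling down before continuing");
}
```

Wire this into your retry logic as a non-retryable condition: a block page is a signal to slow down or rotate IPs, not to hammer the same endpoint again.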
FAQ: web scraping with JavaScript
Is web scraping legal?
Depends on the site and jurisdiction. Don’t scrape private data, and respect robots/ToS where applicable.
Cheerio vs Playwright?
Cheerio for HTML you can fetch; Playwright when the content requires JS.
Do I always need proxies?
No. But as your URL count grows, proxies become the simplest way to prevent one IP from being throttled.
Next steps
If you want to go from “works on my machine” to production:
- store results in a database (SQLite/Postgres)
- re-crawl incrementally
- build a failure dashboard (URLs failing by reason)
- move Playwright scrapes to worker queues
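The first two bullets can start much smaller than Postgres. A minimal sketch of incremental-crawl bookkeeping using only the Node standard library: append results as NDJSON, and on re-runs skip URLs already on disk. The file name and record shape are illustrative; swap in SQLite/Postgres once volume grows:

```javascript
// NDJSON store: one JSON object per line, append-only. Crash-safe enough
// for a starter crawler, and trivially greppable.
import fs from "node:fs";

const STORE = "results.ndjson";

// Rebuild the set of already-crawled URLs from disk
function loadSeen(path = STORE) {
  if (!fs.existsSync(path)) return new Set();
  return new Set(
    fs
      .readFileSync(path, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line).url)
  );
}

function saveResult(result, path = STORE) {
  fs.appendFileSync(path, JSON.stringify(result) + "\n");
}

const seen = loadSeen();
for (const url of ["https://example.com/a", "https://example.com/b"]) {
  if (seen.has(url)) continue; // already crawled: skip on re-runs
  saveResult({ url, scrapedAt: new Date().toISOString() });
}
```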
ProxiesAPI slots in as the network reliability layer when your crawler starts hitting IP limits.