Web Scraping with JavaScript and Node.js: A Full 2026 Tutorial
If you’re building scrapers in 2026, Node.js is a fantastic choice — especially when you care about:
- high concurrency (many URLs in parallel)
- a strong ecosystem (Cheerio, Playwright, p-queue)
- shipping scrapers as production services (Docker, serverless, queues)
This tutorial is a practical, end-to-end guide to web scraping with JavaScript.
You’ll learn:
- when simple HTTP + HTML parsing is enough
- how to scrape with fetch/axios + Cheerio
- how to handle pagination and detail pages
- how to avoid getting blocked (pacing, headers, retries)
- when to switch to Playwright (dynamic sites)
- where ProxiesAPI fits in
As your Node scrapers scale from 50 URLs to 50,000, transient failures and IP-based throttling become the bottleneck. ProxiesAPI helps stabilize your fetch layer with proxy rotation and predictable connectivity.
The mental model: 3 layers of scraping
Almost every scraper has the same shape:
- Fetch a page (HTTP request or headless browser)
- Parse it (HTML → structured data)
- Persist it (JSON/CSV/DB)
If you make each layer clean, you can swap components:
- axios → fetch
- Cheerio → JSDOM
- HTTP → Playwright
- local JSON → PostgreSQL
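The layering above can be sketched as three injected functions. The names (`fetchFn`, `parseFn`, `persistFn`) are illustrative, not a library API:

```javascript
// The three layers as swappable functions. fetchFn / parseFn / persistFn
// are illustrative names, not a library API.
async function scrape(urls, { fetchFn, parseFn, persistFn }) {
  const records = [];
  for (const url of urls) {
    const html = await fetchFn(url); // layer 1: fetch
    records.push(...parseFn(html)); // layer 2: parse
  }
  await persistFn(records); // layer 3: persist
  return records;
}
```

Swapping axios for Playwright then means replacing only `fetchFn`; the parse and persist layers never notice.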
Option 1: Scrape static HTML with Node + Cheerio
If the site is server-rendered, you can scrape it without a browser.
Setup
```shell
mkdir node-scraper
cd node-scraper
npm init -y
npm i axios cheerio p-queue
```
We’ll use:
- axios for HTTP
- cheerio for DOM parsing (jQuery-like selectors)
- p-queue to cap concurrency (a huge anti-block lever)
Build a reliable fetch layer
```javascript
// fetch.js
import axios from "axios";

export async function fetchHtml(url, { timeoutMs = 30000 } = {}) {
  const res = await axios.get(url, {
    timeout: timeoutMs,
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      "Accept-Language": "en-US,en;q=0.9",
    },
    // If needed, you can pass proxy settings here (see ProxiesAPI section)
    validateStatus: (s) => s >= 200 && s < 400,
  });
  return res.data;
}
```
Parse with Cheerio
Let’s scrape a simple list page (example target: quotes.toscrape.com).
```javascript
// scrape-quotes.js
import * as cheerio from "cheerio";
import { fetchHtml } from "./fetch.js";

const BASE = "https://quotes.toscrape.com";

function parseQuotes(html) {
  const $ = cheerio.load(html);
  const out = [];
  $(".quote").each((_, el) => {
    const text = $(el).find(".text").text().trim();
    const author = $(el).find(".author").text().trim();
    const tags = $(el)
      .find(".tags a.tag")
      .map((_, a) => $(a).text().trim())
      .get();
    out.push({ text, author, tags });
  });
  const nextHref = $("li.next a").attr("href");
  const nextUrl = nextHref ? new URL(nextHref, BASE).toString() : null;
  return { out, nextUrl };
}

async function run(pages = 3) {
  let url = BASE;
  const all = [];
  for (let i = 0; i < pages; i++) {
    const html = await fetchHtml(url);
    const { out, nextUrl } = parseQuotes(html);
    all.push(...out);
    if (!nextUrl) break;
    url = nextUrl;
  }
  console.log("quotes:", all.length);
  console.log(all[0]);
}

run();
```
This is web scraping with JavaScript at its cleanest: one HTTP request per page, parse selectors, and paginate.
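One detail worth calling out: `li.next a` usually holds a relative href, so it must be resolved against the page's base URL before the next fetch. A quick sketch of that step (`resolveNext` is my name for it):

```javascript
// new URL(href, base) resolves root-relative, relative, and absolute hrefs alike.
const BASE = "https://quotes.toscrape.com";

function resolveNext(href, base = BASE) {
  return href ? new URL(href, base).toString() : null;
}

console.log(resolveNext("/page/2/")); // https://quotes.toscrape.com/page/2/
console.log(resolveNext(undefined)); // null (no next page)
```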
Option 2: Add concurrency (without getting blocked)
A common mistake is Promise.all(urls.map(fetch)) on thousands of URLs.
Instead, use a queue with controlled concurrency.
```javascript
import PQueue from "p-queue";
import { fetchHtml } from "./fetch.js";

const queue = new PQueue({ concurrency: 3, interval: 1000, intervalCap: 3 });

export async function fetchMany(urls) {
  const results = [];
  for (const url of urls) {
    results.push(
      queue.add(async () => {
        try {
          const html = await fetchHtml(url);
          return { url, ok: true, htmlLen: html.length };
        } catch (e) {
          return { url, ok: false, error: String(e) };
        }
      })
    );
  }
  return Promise.all(results);
}
```
This does two critical anti-block things:
- caps concurrency (you control burstiness)
- caps rate (interval + intervalCap)
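If you'd rather not add p-queue as a dependency, a concurrency cap alone is only a few lines. This is a sketch, not p-queue's API; it limits parallelism but not rate:

```javascript
// Run an array of () => Promise tasks with at most `limit` in flight.
// Results come back in input order.
async function runLimited(tasks, limit = 3) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next index synchronously (no race)
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```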
Option 3: Scrape dynamic sites with Playwright
When a site is JS-rendered (React/Vue/Next with client-side fetch), Cheerio won’t see the data.
Use Playwright.
Setup
```shell
npm i playwright
npx playwright install
```
Example: render and extract text
```javascript
import { chromium } from "playwright";

async function scrapeDynamic(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage({ viewport: { width: 1400, height: 900 } });
  await page.goto(url, { waitUntil: "domcontentloaded", timeout: 60000 });
  // Prefer page.waitForSelector on a meaningful element over a fixed timeout
  await page.waitForTimeout(2000);
  const title = await page.title();
  await browser.close();
  return { title };
}

scrapeDynamic("https://www.google.com/travel/flights?hl=en").then(console.log);
```
You can also extract structured data via:
- page.locator("...").innerText()
- page.$$eval("...", els => els.map(...))
Proxies, anti-blocking, and retries (the practical stack)
If you remember nothing else from this guide, remember this:
- most blocks are caused by behavior (too fast, too many requests) more than tooling
- reliability comes from timeouts + retries + pacing + observability
Headers
Send a real UA and language headers. For some sites, also send:
- Accept: text/html,...
- Referer
Timeouts
Never let requests hang:
- connect timeout ~ 10s
- read timeout ~ 30s
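A minimal sketch of a deadline wrapper that enforces this; with native fetch you can instead pass `AbortSignal.timeout(ms)` as the `signal` option:

```javascript
// Wrap any promise with a hard deadline so a hung socket can't stall the queue.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```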
Retries with backoff
Retry only on transient issues:
- 429 (rate limited)
- 503 (server overloaded)
- network errors
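A sketch of that retry policy, with the transport injected so it stays testable (`fetchWithRetry` and its parameters are illustrative names, not a library API):

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Retry only transient failures: 429, 503, and network errors.
const RETRYABLE = new Set([429, 503]);

async function fetchWithRetry(url, doFetch, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await doFetch(url); // doFetch resolves to { status, ... }
      if (RETRYABLE.has(res.status) && attempt < retries) {
        await sleep(baseDelayMs * 2 ** attempt); // exponential backoff: 500ms, 1s, 2s...
        continue;
      }
      return res; // success, non-retryable status, or retry budget spent
    } catch (err) {
      if (attempt >= retries) throw err; // network error and no budget left
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```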
Cookies and sessions
Keep a session cookie jar (axios does not maintain one by default). Use tough-cookie + axios-cookiejar-support when needed.
CAPTCHA and interstitial detection
In Node, you can detect blocks by searching response HTML for:
- “captcha”
- “unusual traffic”
- “verify you are a human”
Don’t keep hammering if you detect those.
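That check can be a few lines; the marker list below is illustrative and should be tuned per target site:

```javascript
// Cheap block/interstitial detection on the response body.
const BLOCK_MARKERS = ["captcha", "unusual traffic", "verify you are a human"];

function looksBlocked(html) {
  const lower = html.toLowerCase();
  return BLOCK_MARKERS.some((marker) => lower.includes(marker));
}
```

Wire it into your fetch layer so a detected block pauses the queue instead of burning more requests.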
Where ProxiesAPI fits in Node.js scrapers
ProxiesAPI typically fits in two places:
- HTTP scraping (axios/fetch through a proxy)
- Browser scraping (Playwright through a proxy)
1) Axios through a proxy
Axios can use an HTTP proxy agent.
```shell
npm i https-proxy-agent
```
```javascript
import axios from "axios";
import { HttpsProxyAgent } from "https-proxy-agent";

const PROXY_URL = "http://USER:PASS@PROXY_HOST:PORT"; // ProxiesAPI-provided
const agent = new HttpsProxyAgent(PROXY_URL);

export async function fetchHtmlViaProxy(url) {
  const res = await axios.get(url, {
    httpsAgent: agent,
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    },
    timeout: 30000,
  });
  return res.data;
}
```
This is usually the simplest way to wire ProxiesAPI into a Node scraper.
2) Playwright through a proxy
```javascript
import { chromium } from "playwright";

const PROXY = {
  server: "http://PROXY_HOST:PORT",
  username: "USER",
  password: "PASS",
};

const browser = await chromium.launch({ headless: true, proxy: PROXY });
```
Comparison: Cheerio vs Playwright
| Use case | Cheerio (HTTP) | Playwright (Browser) |
|---|---|---|
| Speed | Fast | Slower |
| Cost | Low | Higher |
| Works on JS-heavy sites | No | Yes |
| Easy to scale concurrency | Yes | Harder |
| Bot-detection risk | Lower | Higher |
| Best for | blogs, listings, docs | flights, ecommerce, dashboards |
Practical advice:
- start with Cheerio
- only upgrade to Playwright when data isn’t in HTML
A minimal production checklist
- Put URLs in a queue (Redis/SQS) not a for-loop
- Cap concurrency and rate
- Log status codes and response sizes
- Store raw HTML for failed pages (debug later)
- Implement “block detected → pause”
- Add proxies only when you’re sure pacing alone isn’t enough
Closing thoughts
Web scraping with JavaScript is less about clever selectors and more about boring engineering:
- stable network layer
- conservative concurrency
- predictable retries
If you build those right, you can swap targets without rewriting everything.