Web Scraping with JavaScript and Node.js: A Full 2026 Tutorial

If you’re building scrapers in 2026, Node.js is a fantastic choice — especially when you care about:

  • high concurrency (many URLs in parallel)
  • a strong ecosystem (Cheerio, Playwright, p-queue)
  • shipping scrapers as production services (Docker, serverless, queues)

This tutorial is a practical, end-to-end guide to web scraping with JavaScript.

You’ll learn:

  • when simple HTTP + HTML parsing is enough
  • how to scrape with fetch/axios + Cheerio
  • how to handle pagination and detail pages
  • how to avoid getting blocked (pacing, headers, retries)
  • when to switch to Playwright (dynamic sites)
  • where ProxiesAPI fits in


Make Node.js scrapers more reliable with ProxiesAPI

As your Node scrapers scale from 50 URLs to 50,000, transient failures and IP-based throttling become the bottleneck. ProxiesAPI helps stabilize your fetch layer with proxy rotation and predictable connectivity.


The mental model: 3 layers of scraping

Almost every scraper has the same shape:

  1. Fetch a page (HTTP request or headless browser)
  2. Parse it (HTML → structured data)
  3. Persist it (JSON/CSV/DB)

If you make each layer clean, you can swap components:

  • axios → fetch
  • Cheerio → JSDOM
  • HTTP → Playwright
  • local JSON → PostgreSQL
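To make that swap-friendly shape concrete, here is a minimal sketch (runPipeline and the layer names are illustrative, not files from this tutorial) in which each layer is a plain function you can replace independently:

```javascript
// A sketch of the 3-layer shape: fetch → parse → persist.
// Each layer is injected, so swapping axios for fetch (or local JSON
// for Postgres) only changes the function you pass in.
export async function runPipeline(url, { fetchPage, parsePage, persist }) {
  const html = await fetchPage(url);  // 1. fetch
  const records = parsePage(html);    // 2. parse
  await persist(records);             // 3. persist
  return records.length;
}
```

With this shape, upgrading a target from static HTML to a headless browser means replacing the functions you inject, not rewriting the scraper.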

Option 1: Scrape static HTML with Node + Cheerio

If the site is server-rendered, you can scrape it without a browser.

Setup

mkdir node-scraper
cd node-scraper
npm init -y
npm i axios cheerio p-queue

We’ll use:

  • axios for HTTP
  • cheerio for DOM parsing (jQuery-like selectors)
  • p-queue to cap concurrency (a huge anti-block lever)

Build a reliable fetch helper

// fetch.js
import axios from "axios";

export async function fetchHtml(url, { timeoutMs = 30000 } = {}) {
  const res = await axios.get(url, {
    timeout: timeoutMs,
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      "Accept-Language": "en-US,en;q=0.9",
    },
    // If needed, you can pass proxy settings here (see ProxiesAPI section)
    validateStatus: (s) => s >= 200 && s < 400,
  });

  return res.data;
}

Parse with Cheerio

Let’s scrape a simple list page (example target: quotes.toscrape.com).

// scrape-quotes.js
import * as cheerio from "cheerio";
import { fetchHtml } from "./fetch.js";

const BASE = "https://quotes.toscrape.com";

function parseQuotes(html) {
  const $ = cheerio.load(html);
  const out = [];

  $(".quote").each((_, el) => {
    const text = $(el).find(".text").text().trim();
    const author = $(el).find(".author").text().trim();
    const tags = $(el)
      .find(".tags a.tag")
      .map((_, a) => $(a).text().trim())
      .get();

    out.push({ text, author, tags });
  });

  const nextHref = $("li.next a").attr("href");
  const nextUrl = nextHref ? new URL(nextHref, BASE).toString() : null;

  return { out, nextUrl };
}

async function run(pages = 3) {
  let url = BASE;
  const all = [];

  for (let i = 0; i < pages; i++) {
    const html = await fetchHtml(url);
    const { out, nextUrl } = parseQuotes(html);
    all.push(...out);
    if (!nextUrl) break;
    url = nextUrl;
  }

  console.log("quotes:", all.length);
  console.log(all[0]);
}

run();

This is web scraping with JavaScript at its cleanest: one HTTP request per page, parse selectors, and paginate.
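run() only prints to the console; the third layer, persistence, can start as a small helper like this (saveJson is a hypothetical name, not part of the tutorial's files):

```javascript
// save.js (sketch): persist scraped records as pretty-printed JSON.
import { writeFile } from "node:fs/promises";

export async function saveJson(path, records) {
  await writeFile(path, JSON.stringify(records, null, 2), "utf8");
  return records.length;
}
```

Later you can swap this for a CSV writer or a database insert without touching the fetch or parse layers.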


Option 2: Add concurrency (without getting blocked)

A common mistake is Promise.all(urls.map(fetch)) on thousands of URLs.

Instead, use a queue with controlled concurrency.

import PQueue from "p-queue";
import { fetchHtml } from "./fetch.js";

const queue = new PQueue({ concurrency: 3, interval: 1000, intervalCap: 3 });

export async function fetchMany(urls) {
  const results = [];

  for (const url of urls) {
    results.push(
      queue.add(async () => {
        try {
          const html = await fetchHtml(url);
          return { url, ok: true, htmlLen: html.length };
        } catch (e) {
          return { url, ok: false, error: String(e) };
        }
      })
    );
  }

  return Promise.all(results);
}

This does two critical anti-block things:

  • caps concurrency (you control burstiness)
  • caps rate (interval + intervalCap)
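If you want to see what the concurrency cap is doing under the hood, here is a dependency-free sketch of the core idea (p-queue adds rate limiting, priorities, and pause/resume on top of this):

```javascript
// A minimal worker-pool sketch: at most `limit` calls to fn run at once.
export async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    while (next < items.length) {
      const i = next++; // claim the next index, then await its work
      results[i] = await fn(items[i], i);
    }
  }

  // Spin up `limit` workers that drain the shared index together.
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```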

Option 3: Scrape dynamic sites with Playwright

When a site is JS-rendered (React/Vue/Next with client-side fetch), Cheerio won’t see the data.

Use Playwright.

Setup

npm i playwright
npx playwright install

Example: render and extract text

import { chromium } from "playwright";

async function scrapeDynamic(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage({ viewport: { width: 1400, height: 900 } });

  await page.goto(url, { waitUntil: "domcontentloaded", timeout: 60000 });
  // Prefer waiting for a selector that signals the page is ready,
  // e.g. await page.waitForSelector(".results"); a fixed delay is a crude fallback.
  await page.waitForTimeout(2000);

  const title = await page.title();

  await browser.close();
  return { title };
}

scrapeDynamic("https://www.google.com/travel/flights?hl=en").then(console.log);

You can also extract structured data via:

  • page.locator("...").innerText()
  • page.$$eval("...", els => els.map(...))

Proxies, anti-blocking, and retries (the practical stack)

If you remember nothing else from this guide, remember this:

  • most blocks are caused by behavior (too fast, too many requests), not by your tooling
  • reliability comes from timeouts + retries + pacing + observability

Headers

Send a real UA and language headers. For some sites, also send:

  • Accept: text/html,...
  • Referer
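As a concrete example, a browser-like header set might look like this (values are illustrative; keep them consistent with the browser you claim in your User-Agent):

```javascript
// Illustrative extra headers for stricter sites. Accept and
// Accept-Language should match the User-Agent you send.
export const browserLikeHeaders = {
  Accept:
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.9",
  Referer: "https://quotes.toscrape.com/",
};
```

Spread these into the headers option of your fetch helper for sites that check them.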

Timeouts

Never let requests hang:

  • connect timeout ~ 10s
  • read timeout ~ 30s
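If you are not relying on axios's timeout option, you can enforce a deadline with a generic wrapper (a sketch; note that this rejects your await but does not cancel the underlying socket, so pair it with an AbortController when you need real cancellation):

```javascript
// Reject any promise that takes longer than `ms` milliseconds.
export function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    ),
  ]);
}
```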

Retries with backoff

Retry only on transient issues:

  • 429 (rate limited)
  • 503 (server overloaded)
  • network errors
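A minimal retry helper along these lines might look like this (withRetry is a hypothetical name; the status check assumes axios-style errors that carry err.response.status):

```javascript
// Retry transient failures with exponential backoff plus a little jitter.
const RETRYABLE = new Set([429, 503]);

export async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = err?.response?.status;
      // No status at all usually means a network error: retry those too.
      const transient = status === undefined || RETRYABLE.has(status);
      if (!transient || attempt >= retries) throw err;
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Wrap calls like withRetry(() => fetchHtml(url)) so a single 429 doesn't fail the whole run, while 404s and other permanent errors still surface immediately.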

Cookies and sessions

Keep a session cookie jar (axios doesn't maintain one by default). Use tough-cookie + axios-cookiejar-support when needed.

CAPTCHA and interstitial detection

In Node, you can detect blocks by searching response HTML for:

  • “captcha”
  • “unusual traffic”
  • “verify you are a human”

Don’t keep hammering if you detect those.
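That check can be a tiny helper (the marker list is illustrative, not exhaustive; extend it with whatever the target site's block page actually says):

```javascript
// Heuristic block detection: scan the response body for tell-tale phrases.
const BLOCK_MARKERS = ["captcha", "unusual traffic", "verify you are a human"];

export function looksBlocked(html) {
  const lower = html.toLowerCase();
  return BLOCK_MARKERS.some((marker) => lower.includes(marker));
}
```

On a hit, pause the queue and back off instead of burning more requests (and your IP's reputation).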


Where ProxiesAPI fits in Node.js scrapers

ProxiesAPI typically fits in two places:

  1. HTTP scraping (axios/fetch through a proxy)
  2. Browser scraping (Playwright through a proxy)

1) Axios through a proxy

Axios can use an HTTP proxy agent.

npm i https-proxy-agent

import axios from "axios";
import { HttpsProxyAgent } from "https-proxy-agent";

const PROXY_URL = "http://USER:PASS@PROXY_HOST:PORT"; // ProxiesAPI-provided
const agent = new HttpsProxyAgent(PROXY_URL);

export async function fetchHtmlViaProxy(url) {
  const res = await axios.get(url, {
    httpsAgent: agent,
    proxy: false, // let the agent handle proxying; bypasses axios's built-in proxy logic
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    },
    timeout: 30000,
  });
  return res.data;
}

This is usually the simplest way to wire ProxiesAPI into a Node scraper.

2) Playwright through a proxy

import { chromium } from "playwright";

const PROXY = {
  server: "http://PROXY_HOST:PORT",
  username: "USER",
  password: "PASS",
};

const browser = await chromium.launch({ headless: true, proxy: PROXY });

Comparison: Cheerio vs Playwright

| Use case | Cheerio (HTTP) | Playwright (Browser) |
| --- | --- | --- |
| Speed | Fast | Slower |
| Cost | Low | Higher |
| Works on JS-heavy sites | No | Yes |
| Easy to scale concurrency | Yes | Harder |
| Bot detection risk | Lower | Higher |
| Best for | blogs, listings, docs | flights, ecommerce, dashboards |

Practical advice:

  • start with Cheerio
  • only upgrade to Playwright when data isn’t in HTML

A minimal production checklist

  • Put URLs in a queue (Redis/SQS), not a for-loop
  • Cap concurrency and rate
  • Log status codes and response sizes
  • Store raw HTML for failed pages (debug later)
  • Implement “block detected → pause”
  • Add proxies only when you’re sure pacing alone isn’t enough

Closing thoughts

Web scraping with JavaScript is less about clever selectors and more about boring engineering:

  • stable network layer
  • conservative concurrency
  • predictable retries

If you build those right, you can swap targets without rewriting everything.


Related guides

Node.js Web Scraping with Cheerio: Quick Start Guide
A practical Cheerio + HTTP quick start: fetch with retries, parse real HTML selectors, paginate, and scale reliably with ProxiesAPI.
Node.js Web Scraping with Cheerio: Quick Start Guide (Requests + Proxies + Pagination)
Learn Cheerio by building a reusable Node.js scraper: robust fetch layer (timeouts, retries), parsing patterns, pagination, and where ProxiesAPI fits for stability.
How to Scrape Data Without Getting Blocked: A Practical Playbook
The anti-block basics: headers, cookies, pacing, fingerprints, detecting blocks, and when to switch to headless + proxies.
Scrape Flight Prices from Google Flights (Python + ProxiesAPI)
Pull routes + dates, parse price cards reliably, and export a clean dataset with retries + proxy rotation.