Web Scraping with JavaScript and Node.js: Full Tutorial (2026)

People search for “web scraping with JavaScript” for a reason: Node.js is a great fit for scraping pipelines.

  • You can write scrapers, ETL, and APIs in one language
  • You have excellent tooling for concurrency and retries
  • HTML parsing (Cheerio) feels like jQuery

In this tutorial you’ll build a production-style scraper in Node.js that:

  1. fetches HTML pages reliably (timeouts + retries + backoff)
  2. parses real DOM structure with Cheerio
  3. crawls pagination
  4. rotates proxies using ProxiesAPI
  5. exports a clean dataset (JSON/JSONL)

To keep the tutorial concrete, we’ll use a “blog-like listing → detail pages” pattern that maps to many sites:

  • listing page: many items + “next page”
  • detail page: each item has content you want

0) Before you scrape: basic rules that save you hours

  • Prefer server-rendered HTML targets when possible.
  • Start with one page → then add pagination → then add detail pages.
  • Keep concurrency low at first. Reliability beats speed.
  • Put every network request behind a function with:
    • timeouts
    • retries
    • jitter/backoff

1) Project setup

mkdir node-scraper
cd node-scraper
npm init -y
npm install cheerio p-limit dotenv
npm pkg set type=module

The npm pkg set type=module step marks the project as an ES module, so the import syntax used in the files below works in plain .js files. Node 18+ also includes a global fetch(); if you’re on an older Node version, install undici or node-fetch.
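
If you do take that fallback route, undici ships a compatible fetch. A minimal sketch, assuming you picked undici (everything else in this tutorial stays the same):

// Only needed on Node < 18, where there is no global fetch()
// npm install undici
import { fetch } from "undici";

const res = await fetch("https://example.com/");
console.log(res.status, (await res.text()).length);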

Create .env:

PROXIESAPI_KEY="YOUR_KEY"
PROXIESAPI_ENDPOINT="https://proxiesapi.com"  # example; use your real endpoint

2) A resilient fetch() wrapper (retries + backoff)

Create src/http.js:

import "dotenv/config";

const TIMEOUT_MS = 30_000;

function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

function withTimeout(signal, ms) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), ms);

  // If a signal is passed in, forward abort
  if (signal) {
    signal.addEventListener("abort", () => controller.abort(), { once: true });
  }

  return { signal: controller.signal, cancel: () => clearTimeout(timeout) };
}

export function proxiesapiUrl(targetUrl) {
  const key = process.env.PROXIESAPI_KEY;
  const endpoint = process.env.PROXIESAPI_ENDPOINT;
  if (!key || !endpoint) throw new Error("Missing PROXIESAPI_KEY/PROXIESAPI_ENDPOINT");

  const u = new URL(endpoint);
  // Many proxy APIs use ?api_key=...&url=...
  u.searchParams.set("api_key", key);
  u.searchParams.set("url", targetUrl);
  return u.toString();
}

export async function fetchHtml(targetUrl, { headers = {}, maxRetries = 4 } = {}) {
  const ua =
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
    "AppleWebKit/537.36 (KHTML, like Gecko) " +
    "Chrome/123.0.0.0 Safari/537.36";

  const finalHeaders = {
    "user-agent": ua,
    "accept-language": "en-US,en;q=0.9",
    ...headers,
  };

  let attempt = 0;
  while (true) {
    attempt += 1;
    const jitter = 250 + Math.floor(Math.random() * 800);
    await sleep(jitter);

    const { signal, cancel } = withTimeout(undefined, TIMEOUT_MS);

    try {
      const res = await fetch(proxiesapiUrl(targetUrl), {
        method: "GET",
        headers: finalHeaders,
        signal,
      });

      if (!res.ok) {
        // retry on transient errors
        const retryable = [429, 500, 502, 503, 504].includes(res.status);
        const body = await res.text().catch(() => "");
        if (retryable && attempt < maxRetries) {
          const backoff = Math.min(20_000, 500 * 2 ** (attempt - 1));
          await sleep(backoff);
          continue;
        }
        throw new Error(`HTTP ${res.status} ${res.statusText} :: ${body.slice(0, 200)}`);
      }

      return await res.text();
    } finally {
      cancel();
    }
  }
}

Notes:

  • We deliberately keep the logic simple.
  • We retry on 429/5xx, not on everything.
  • We add jitter to avoid request bursts.
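
To sanity-check the wrapper before wiring up the crawler, a throwaway script like this works (the filename and target URL are placeholders, not part of the project above):

// smoke-test.js (hypothetical filename) — assumes src/http.js and a valid .env
import { fetchHtml } from "./src/http.js";

const html = await fetchHtml("https://example.com/");
console.log("fetched", html.length, "characters of HTML");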

3) Parse a listing page with Cheerio

Create src/parse.js:

import * as cheerio from "cheerio";

export function parseListing(html, { baseUrl }) {
  const $ = cheerio.load(html);

  // Example: grab all links. In real targets, restrict selectors.
  const links = new Set();

  $("a[href]").each((_, a) => {
    const href = $(a).attr("href");
    if (!href) return;

    // Normalize relative → absolute
    try {
      const u = new URL(href, baseUrl);
      links.add(u.toString());
    } catch {
      // ignore invalid URLs
    }
  });

  return Array.from(links);
}

export function parseTitle(html) {
  const $ = cheerio.load(html);
  return $("title").text().trim();
}

This is a generic parser. On your real target, you’ll do something like:

  • $(".product-card a")
  • $("article h2 a")
  • $("a.question-hyperlink")

The concept is the same.
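
You can also exercise both helpers without touching the network by feeding them an HTML string (the snippet below is made up, not a real target):

import { parseListing, parseTitle } from "./src/parse.js";

const html = `
  <html>
    <head><title>Demo listing</title></head>
    <body>
      <a href="/item/1">Item 1</a>
      <a href="/item/2">Item 2</a>
      <a href="https://other.example/off-site">Off-site</a>
    </body>
  </html>
`;

console.log(parseTitle(html));
// "Demo listing"

console.log(parseListing(html, { baseUrl: "https://example.com" }));
// [ "https://example.com/item/1", "https://example.com/item/2", "https://other.example/off-site" ]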


4) Crawl pagination (the pattern that works everywhere)

Here’s the “crawl N pages” loop:

  • fetch listing page
  • parse item links
  • enqueue detail pages
  • follow “next page” (or build ?page=N URLs)

Create src/crawl.js:

import fs from "node:fs";
import path from "node:path";
import pLimit from "p-limit";

import { fetchHtml } from "./http.js";
import { parseTitle, parseListing } from "./parse.js";

const limit = pLimit(3); // keep concurrency modest

export async function crawl({ startUrl, pages = 3 }) {
  const base = new URL(startUrl).origin;

  const listingUrls = [];
  for (let i = 1; i <= pages; i++) {
    // If your target uses ?page=N, build it here.
    // Otherwise, you can parse a “next” link from HTML.
    listingUrls.push(startUrl.replace("{page}", String(i)));
  }

  const detailUrls = new Set();

  for (const u of listingUrls) {
    const html = await fetchHtml(u);
    const links = parseListing(html, { baseUrl: base });

    // Heuristic: keep only same-origin links
    for (const link of links) {
      if (link.startsWith(base)) detailUrls.add(link);
    }

    console.log("listing", u, "=> links", links.length, "detail set", detailUrls.size);
  }

  // Fetch details (concurrently)
  const results = await Promise.all(
    Array.from(detailUrls).slice(0, 50).map((u) =>
      limit(async () => {
        const html = await fetchHtml(u);
        return { url: u, title: parseTitle(html) };
      })
    )
  );

  return results;
}

export function writeJson(outPath, data) {
  fs.mkdirSync(path.dirname(outPath), { recursive: true });
  fs.writeFileSync(outPath, JSON.stringify(data, null, 2), "utf-8");
}
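
The comment inside crawl() mentions the other common pagination style: following a “next page” link instead of building ?page=N URLs. Here is a minimal sketch of that variant — the rel="next" and .next selectors are assumptions you would swap for whatever your target actually renders:

import * as cheerio from "cheerio";

// Returns the absolute URL of the next listing page, or null when there isn't one.
export function findNextPage(html, { baseUrl }) {
  const $ = cheerio.load(html);
  const href =
    $('a[rel="next"]').attr("href") ||
    $("a.next, a.pagination-next").first().attr("href");
  if (!href) return null;
  return new URL(href, baseUrl).toString();
}

// Usage inside crawl(): follow "next" links instead of precomputing listingUrls.
// let url = startUrl;
// for (let i = 0; i < pages && url; i++) {
//   const html = await fetchHtml(url);
//   // ...collect detail links from html as before...
//   url = findNextPage(html, { baseUrl: base });
// }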

5) Run it: a complete working script

Create index.js:

import { crawl, writeJson } from "./src/crawl.js";

// Example pattern: a listing with ?page=1..N
// Replace with your real target.
const START_URL = "https://example.com/list?page={page}";

const data = await crawl({ startUrl: START_URL, pages: 3 });
writeJson("out/results.json", data);

console.log("wrote", data.length, "items");
console.log(data.slice(0, 3));

Run:

node index.js
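
Step 5 only writes JSON, but the intro also promised JSONL. A small sketch of a writeJsonl helper you could drop next to writeJson — one object per line, which is easier to append to and to stream through other tools:

import fs from "node:fs";
import path from "node:path";

export function writeJsonl(outPath, rows) {
  fs.mkdirSync(path.dirname(outPath), { recursive: true });
  const lines = rows.map((row) => JSON.stringify(row)).join("\n") + "\n";
  fs.writeFileSync(outPath, lines, "utf-8");
}

// writeJsonl("out/results.jsonl", data);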

Comparison: Requests vs Browser (Node.js)

Some sites are easy with HTTP + HTML parsing. Others are JS-rendered.

Here’s how to decide:

Situation                       | Use                         | Why
--------------------------------|-----------------------------|----------------------------------
Server-rendered pages           | fetch + Cheerio             | Fast, cheap, reliable
Needs JS to render data         | Playwright/Puppeteer        | You’ll otherwise parse empty HTML
Heavily blocked / fingerprinted | Browser + proxies + pacing  | You need realistic behavior
Bulk scraping (many URLs)       | HTTP + proxies              | Cost-effective
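
For the “needs JS to render data” case, a minimal Playwright sketch looks like the following (assumes npm install playwright plus npx playwright install chromium; the URL is a placeholder):

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();

// Replace with your JS-rendered target.
await page.goto("https://example.com/app", { waitUntil: "networkidle" });

// content() returns the rendered DOM — you can hand it to Cheerio exactly as before.
const html = await page.content();
await browser.close();

console.log("rendered HTML length:", html.length);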

Practical anti-blocking tips (Node.js)

  • Use timeouts and retry only on transient failures.
  • Add jitter between requests.
  • Keep concurrency modest (p-limit).
  • Rotate IPs for scale (ProxiesAPI).
  • Don’t run “infinite crawl” without:
    • dedupe
    • max pages
    • max depth
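
That last bullet is worth a concrete sketch: a visited set plus hard caps keeps a crawl bounded. The start URL and the caps below are arbitrary examples:

import { fetchHtml } from "./src/http.js";
import { parseListing } from "./src/parse.js";

const MAX_PAGES = 50;  // hard cap on total fetches
const MAX_DEPTH = 2;   // how many links away from the start URL we follow

const visited = new Set();
const queue = [{ url: "https://example.com/", depth: 0 }];

while (queue.length > 0 && visited.size < MAX_PAGES) {
  const { url, depth } = queue.shift();
  if (visited.has(url) || depth > MAX_DEPTH) continue;
  visited.add(url);

  const html = await fetchHtml(url);
  for (const link of parseListing(html, { baseUrl: new URL(url).origin })) {
    queue.push({ url: link, depth: depth + 1 });
  }
}

console.log("visited", visited.size, "pages");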

Where ProxiesAPI fits (honestly)

ProxiesAPI isn’t a “scrape anything instantly” button.

It’s a pragmatic tool to make the boring part of scraping reliable:

  • rotating IPs
  • retrying failures cleanly
  • reducing downtime from blocks

Once your fetch layer is stable, you can focus on the parts that actually create value:

  • parsers
  • data model
  • exports
  • alerts

Next upgrades

  • Save HTML snapshots for debugging (out/html/...).
  • Add a URL queue and persist progress to SQLite.
  • Add a robots.txt + compliance check step.
  • Implement per-domain rate limiting.
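
A rough sketch of that last item, per-domain rate limiting: keep a map of “earliest next request time” per hostname and delay before calling fetchHtml. The one-second gap is an arbitrary example:

import { fetchHtml } from "./src/http.js";

const MIN_GAP_MS = 1000;       // example: at most one request per domain per second
const nextSlot = new Map();    // hostname → earliest timestamp the next request may start

export async function politeFetch(url) {
  const host = new URL(url).hostname;
  const now = Date.now();
  const start = Math.max(now, nextSlot.get(host) ?? 0);
  nextSlot.set(host, start + MIN_GAP_MS);

  if (start > now) await new Promise((r) => setTimeout(r, start - now));
  return fetchHtml(url);
}

// Usage: const html = await politeFetch(targetUrl);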