Web Scraping with JavaScript and Node.js: Full Tutorial (2026)
The phrase “web scraping with JavaScript” is popular for a reason: Node.js is a great fit for scraping pipelines.
- You can write scrapers, ETL, and APIs in one language
- You have excellent tooling for concurrency and retries
- HTML parsing (Cheerio) feels like jQuery
In this tutorial you’ll build a production-style scraper in Node.js that:
- fetches HTML pages reliably (timeouts + retries + backoff)
- parses real DOM structure with Cheerio
- crawls pagination
- rotates proxies using ProxiesAPI
- exports a clean dataset (JSON/JSONL)
To keep the tutorial concrete, we’ll use a “blog-like listing → detail pages” pattern that maps to many sites:
- listing page: many items + “next page”
- detail page: each item has content you want
Scrapers fail in the network layer first: timeouts, 429s, and blocks. ProxiesAPI gives you a clean way to rotate IPs and keep retries from cascading into downtime.
0) Before you scrape: basic rules that save you hours
- Prefer server-rendered HTML targets when possible.
- Start with one page → then add pagination → then add detail pages.
- Keep concurrency low at first. Reliability beats speed.
- Put every network request behind a function with:
- timeouts
- retries
- jitter/backoff
1) Project setup
mkdir node-scraper
cd node-scraper
npm init -y
npm install cheerio p-limit dotenv
Node 18+ includes fetch(). If you’re on an older Node version, install undici or node-fetch. Also add "type": "module" to your package.json, because the code below uses ES module import syntax and top-level await.
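If you are stuck on Node 16/17, here is a minimal sketch of a polyfill module (the file name polyfill-fetch.js is an assumption; node-fetch v3 is ESM-only, which matches the import syntax used here):
// polyfill-fetch.js: only needed on Node < 18, where fetch() is not a global.
// Import this once (e.g. at the top of index.js) and the rest of the code
// can keep calling the global fetch().
import fetch from "node-fetch";
if (typeof globalThis.fetch === "undefined") {
  globalThis.fetch = fetch;
}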
Create .env:
PROXIESAPI_KEY="YOUR_KEY"
PROXIESAPI_ENDPOINT="https://proxiesapi.com" # example; use your real endpoint
2) A resilient fetch() wrapper (retries + backoff)
Create src/http.js:
import "dotenv/config";
const TIMEOUT_MS = 30_000;
function sleep(ms) {
return new Promise((r) => setTimeout(r, ms));
}
function withTimeout(signal, ms) {
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), ms);
// If a signal is passed in, forward abort
if (signal) {
signal.addEventListener("abort", () => controller.abort(), { once: true });
}
return { signal: controller.signal, cancel: () => clearTimeout(timeout) };
}
export function proxiesapiUrl(targetUrl) {
const key = process.env.PROXIESAPI_KEY;
const endpoint = process.env.PROXIESAPI_ENDPOINT;
if (!key || !endpoint) throw new Error("Missing PROXIESAPI_KEY/PROXIESAPI_ENDPOINT");
const u = new URL(endpoint);
// Many proxy APIs use ?api_key=...&url=...
u.searchParams.set("api_key", key);
u.searchParams.set("url", targetUrl);
return u.toString();
}
export async function fetchHtml(targetUrl, { headers = {}, maxRetries = 4 } = {}) {
const ua =
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
"AppleWebKit/537.36 (KHTML, like Gecko) " +
"Chrome/123.0.0.0 Safari/537.36";
const finalHeaders = {
"user-agent": ua,
"accept-language": "en-US,en;q=0.9",
...headers,
};
let attempt = 0;
while (true) {
attempt += 1;
const jitter = 250 + Math.floor(Math.random() * 800);
await sleep(jitter);
const { signal, cancel } = withTimeout(undefined, TIMEOUT_MS);
try {
const res = await fetch(proxiesapiUrl(targetUrl), {
method: "GET",
headers: finalHeaders,
signal,
});
if (!res.ok) {
// retry on transient errors
const retryable = [429, 500, 502, 503, 504].includes(res.status);
const body = await res.text().catch(() => "");
if (retryable && attempt < maxRetries) {
const backoff = Math.min(20_000, 500 * 2 ** (attempt - 1));
await sleep(backoff);
continue;
}
throw new Error(`HTTP ${res.status} ${res.statusText} :: ${body.slice(0, 200)}`);
}
return await res.text();
} finally {
cancel();
}
}
}
Notes:
- We deliberately keep the logic simple.
- We retry on 429/5xx, not on everything.
- We add jitter to avoid request bursts.
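As a quick smoke test of the wrapper (assuming your .env is filled in; the file name and target URL below are placeholders):
// smoke-test.js: run with `node smoke-test.js`
import { fetchHtml } from "./src/http.js";
const html = await fetchHtml("https://example.com/");
console.log("fetched", html.length, "characters of HTML");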
3) Parse a listing page with Cheerio
Create src/parse.js:
import * as cheerio from "cheerio";
export function parseListing(html, { baseUrl }) {
const $ = cheerio.load(html);
// Example: grab all links. In real targets, restrict selectors.
const links = new Set();
$("a[href]").each((_, a) => {
const href = $(a).attr("href");
if (!href) return;
// Normalize relative → absolute
try {
const u = new URL(href, baseUrl);
links.add(u.toString());
} catch {
// ignore invalid URLs
}
});
return Array.from(links);
}
export function parseTitle(html) {
const $ = cheerio.load(html);
return $("title").text().trim();
}
This is a generic parser. On your real target, you’ll do something like:
$(".product-card a")$("article h2 a")$("a.question-hyperlink")
The concept is the same.
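For instance, here is a sketch of a target-specific parser you could add to src/parse.js (it reuses the cheerio import already at the top of that file; the selectors assume each item is an article element with an h2 > a title link, so adjust them to your real target):
// Hypothetical markup: <article><h2><a href="/post/...">Title</a></h2></article>
export function parseArticleCards(html, { baseUrl }) {
  const $ = cheerio.load(html);
  const items = [];
  $("article h2 a").each((_, a) => {
    const href = $(a).attr("href");
    const title = $(a).text().trim();
    if (!href || !title) return;
    // Normalize relative → absolute, same as parseListing
    items.push({ title, url: new URL(href, baseUrl).toString() });
  });
  return items;
}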
4) Crawl pagination (the pattern that works everywhere)
Here’s the “crawl N pages” loop:
- fetch listing page
- parse item links
- enqueue detail pages
- follow “next page” (or build ?page=N URLs)
Create src/crawl.js:
import fs from "node:fs";
import path from "node:path";
import pLimit from "p-limit";
import { fetchHtml } from "./http.js";
import { parseTitle, parseListing } from "./parse.js";
const limit = pLimit(3); // keep concurrency modest
export async function crawl({ startUrl, pages = 3 }) {
const base = new URL(startUrl).origin;
const listingUrls = [];
for (let i = 1; i <= pages; i++) {
// If your target uses ?page=N, build it here.
// Otherwise, you can parse a “next” link from HTML.
listingUrls.push(startUrl.replace("{page}", String(i)));
}
const detailUrls = new Set();
for (const u of listingUrls) {
const html = await fetchHtml(u);
const links = parseListing(html, { baseUrl: base });
// Heuristic: keep only same-origin links
for (const link of links) {
if (link.startsWith(base)) detailUrls.add(link);
}
console.log("listing", u, "=> links", links.length, "detail set", detailUrls.size);
}
// Fetch details (concurrently)
const results = await Promise.all(
Array.from(detailUrls).slice(0, 50).map((u) =>
limit(async () => {
const html = await fetchHtml(u);
return { url: u, title: parseTitle(html) };
})
)
);
return results;
}
export function writeJson(outPath, data) {
fs.mkdirSync(path.dirname(outPath), { recursive: true });
fs.writeFileSync(outPath, JSON.stringify(data, null, 2), "utf-8");
}
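The intro mentioned JSON/JSONL exports; alongside writeJson, here is a minimal JSONL writer sketch for the same file (one JSON object per line, which is easy to append to and to stream; it reuses the fs and path imports already in src/crawl.js):
export function writeJsonl(outPath, rows) {
  fs.mkdirSync(path.dirname(outPath), { recursive: true });
  // One JSON object per line, trailing newline included
  const lines = rows.map((row) => JSON.stringify(row)).join("\n") + "\n";
  fs.writeFileSync(outPath, lines, "utf-8");
}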
5) Run it: a complete working script
Create index.js:
import { crawl, writeJson } from "./src/crawl.js";
// Example pattern: a listing with ?page=1..N
// Replace with your real target.
const START_URL = "https://example.com/list?page={page}";
const data = await crawl({ startUrl: START_URL, pages: 3 });
writeJson("out/results.json", data);
console.log("wrote", data.length, "items");
console.log(data.slice(0, 3));
Run:
node index.js
Comparison: HTTP requests vs. a headless browser (Node.js)
Some sites are easy with HTTP + HTML parsing. Others are JS-rendered.
Here’s how to decide:
| Situation | Use | Why |
|---|---|---|
| Server-rendered pages | fetch + Cheerio | Fast, cheap, reliable |
| Needs JS to render data | Playwright/Puppeteer | You’ll otherwise parse empty HTML |
| Heavily blocked / fingerprinted | Browser + proxies + pacing | You need realistic behavior |
| Bulk scraping (many URLs) | HTTP + proxies | Cost-effective |
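For the JS-rendered case, here is a minimal Playwright sketch (assuming npm install playwright and npx playwright install chromium, and that the script sits next to index.js) that grabs the rendered HTML and hands it to the same Cheerio parsers:
import { chromium } from "playwright";
import { parseTitle } from "./src/parse.js";
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com/", { waitUntil: "networkidle" });
const html = await page.content(); // the fully rendered DOM
await browser.close();
console.log(parseTitle(html));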
Practical anti-blocking tips (Node.js)
- Use timeouts and retry only on transient failures.
- Add jitter between requests.
- Keep concurrency modest (p-limit).
- Rotate IPs for scale (ProxiesAPI).
- Don’t run an “infinite crawl” without (see the sketch after this list):
- dedupe
- max pages
- max depth
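A minimal sketch combining those three guards (a seen set, a page budget, and a depth cap), reusing fetchHtml and parseListing from earlier; it is assumed to live next to index.js:
// Bounded crawl frontier: dedupe + max pages + max depth.
import { fetchHtml } from "./src/http.js";
import { parseListing } from "./src/parse.js";
export async function boundedCrawl(startUrl, { maxPages = 50, maxDepth = 2 } = {}) {
  const base = new URL(startUrl).origin;
  const seen = new Set();                       // dedupe: never fetch the same URL twice
  const queue = [{ url: startUrl, depth: 0 }];
  const pages = [];
  while (queue.length > 0 && pages.length < maxPages) {   // max pages
    const { url, depth } = queue.shift();
    if (seen.has(url)) continue;
    seen.add(url);
    const html = await fetchHtml(url);
    pages.push({ url, html });
    if (depth >= maxDepth) continue;                       // max depth
    for (const link of parseListing(html, { baseUrl: url })) {
      if (link.startsWith(base) && !seen.has(link)) {
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return pages;
}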
Where ProxiesAPI fits (honestly)
ProxiesAPI isn’t a “scrape anything instantly” button.
It’s a pragmatic tool to make the boring part of scraping reliable:
- rotating IPs
- re-trying failures cleanly
- reducing downtime from blocks
Once your fetch layer is stable, you can focus on the parts that actually create value:
- parsers
- data model
- exports
- alerts
Next upgrades
- Save HTML snapshots for debugging (out/html/...).
- Add a URL queue and persist progress to SQLite.
- Add a robots.txt + compliance check step.
- Implement per-domain rate limiting (see the sketch below).
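As a starting point for that last upgrade, here is a minimal per-domain rate limiter sketch: it remembers the last request time per hostname and sleeps until a minimum gap has passed (the 1500 ms default is an arbitrary assumption, and it is not strict under high concurrency):
// Per-domain pacing: wait until at least `minGapMs` has passed since the
// previous request to the same hostname, then record the new hit.
const lastHit = new Map();
export async function rateLimit(url, minGapMs = 1500) {
  const host = new URL(url).hostname;
  const waitMs = Math.max(0, (lastHit.get(host) ?? 0) + minGapMs - Date.now());
  if (waitMs > 0) await new Promise((r) => setTimeout(r, waitMs));
  lastHit.set(host, Date.now());
}
Call it right before each fetchHtml(url); for stricter pacing under concurrency, serialize requests per host instead.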