Web Scraping with JavaScript and Node.js: Full Tutorial (2026)

If you already know JavaScript, Node.js is one of the fastest ways to build a web scraper:

  • great HTTP tooling
  • strong HTML parsers (Cheerio)
  • easy concurrency control
  • perfect for pipelines (scrape → transform → store)

In this 2026 tutorial, you’ll build a complete Node scraper that handles the stuff that actually matters:

  • fetching pages with timeouts
  • parsing HTML using Cheerio selectors
  • crawling pagination safely
  • retrying failures with backoff
  • exporting data to CSV
  • routing requests through ProxiesAPI when you need more reliability

When your Node scraper scales, ProxiesAPI keeps it steadier

Most scrapers fail in the network layer: timeouts, throttling, and blocks. ProxiesAPI gives your Node.js scraper a simple proxy route so your crawling stays more reliable as you add more targets and more URLs.


When Node.js is a good fit for scraping

Node is especially good for:

  • scraping server-rendered HTML pages
  • building “ETL-style” scrapers
  • running lots of small jobs (cron, queues)
  • building internal dashboards that refresh regularly

Node is not a magic bullet for:

  • heavily client-rendered apps (React apps that fetch everything via XHR)
  • pages that require solving complex bot challenges

For those cases, you usually move to a browser automation stack (Playwright) or a first-party API.


Setup

Create a new project:

mkdir node-scraper
cd node-scraper
npm init -y

Install dependencies:

npm install cheerio dotenv

Node 18+ already includes fetch. (If you’re on an older Node, install node-fetch.)

Create a .env:

PROXIESAPI_KEY="YOUR_KEY"

Part 1 — Build a robust fetch() with retries

Scrapers die in the network layer. So we’ll start there.

We want:

  • timeouts (never hang)
  • retries for transient failures
  • a stable User-Agent

// scraper.js
import * as cheerio from "cheerio";
import "dotenv/config";

const PROXIESAPI_KEY = process.env.PROXIESAPI_KEY;
const PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/";

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function proxiesApiUrl(targetUrl) {
  if (!PROXIESAPI_KEY) throw new Error("Missing PROXIESAPI_KEY");
  const u = new URL(PROXIESAPI_ENDPOINT);
  u.searchParams.set("auth_key", PROXIESAPI_KEY);
  u.searchParams.set("url", targetUrl);
  return u.toString();
}

async function fetchHtml(url, { useProxiesApi = false, timeoutMs = 45000, retries = 4 } = {}) {
  let lastErr;

  for (let attempt = 1; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const t = setTimeout(() => controller.abort(), timeoutMs);

    try {
      const finalUrl = useProxiesApi ? proxiesApiUrl(url) : url;

      const res = await fetch(finalUrl, {
        signal: controller.signal,
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
          Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
          "Accept-Language": "en-US,en;q=0.9",
        },
      });

      if (!res.ok) {
        throw new Error(`HTTP ${res.status} ${res.statusText}`);
      }

      const html = await res.text();
      // Heuristic: tiny responses are usually block/error pages, not real content.
      // Tune (or remove) this threshold per target.
      if (html.length < 2000) {
        throw new Error(`Response too small (${html.length} bytes)`);
      }

      return html;
    } catch (e) {
      lastErr = e;
      if (attempt === retries) break;

      const backoff = (2 ** attempt) * 1000 + Math.floor(Math.random() * 400);
      console.log(`attempt ${attempt} failed: ${e}. sleeping ${backoff}ms`);
      await sleep(backoff);
    } finally {
      clearTimeout(t);
    }
  }

  throw new Error(`Failed to fetch ${url}: ${lastErr}`);
}

Use ProxiesAPI or not?

  • For friendly sites: set useProxiesApi: false
  • For targets that start throttling/denying: set useProxiesApi: true

In real pipelines, you often start without proxies, then switch on proxies for specific hosts.
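That host-based switch can be a one-line lookup. Here's a minimal sketch (the host list and helper name are hypothetical, not part of ProxiesAPI):

```javascript
// Hypothetical host-based routing: proxy only the hosts that have
// started throttling you; fetch everything else directly.
const PROXIED_HOSTS = new Set(["shop.example.com", "deals.example.net"]);

function shouldUseProxiesApi(targetUrl) {
  return PROXIED_HOSTS.has(new URL(targetUrl).hostname);
}

// Usage with the fetchHtml() from Part 1:
// const html = await fetchHtml(url, { useProxiesApi: shouldUseProxiesApi(url) });

console.log(shouldUseProxiesApi("https://shop.example.com/page/2")); // true
console.log(shouldUseProxiesApi("https://news.ycombinator.com/"));   // false
```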


Part 2 — Parse HTML with Cheerio (selectors that make sense)

Cheerio gives you jQuery-like selectors.

Here’s a generic example for parsing a “listing card” page:

function parseListingPage(html, { baseUrl }) {
  const $ = cheerio.load(html);

  const out = [];

  $("article, .card, .product, .result").each((_, el) => {
    // Try to find a reasonable link + title inside a card.
    const a = $(el).find("a[href]").first();
    const href = a.attr("href");
    const title = a.text().trim();

    if (!href || title.length < 4) return;

    const url = new URL(href, baseUrl).toString();

    // Best-effort price extraction.
    const textBlob = $(el).text().replace(/\s+/g, " ").trim();
    const priceMatch = textBlob.match(/(\$\s?\d[\d\.,]*|€\s?\d[\d\.,]*|£\s?\d[\d\.,]*)/);

    out.push({
      title,
      url,
      price: priceMatch ? priceMatch[1] : null,
    });
  });

  return out;
}

This pattern is useful because it’s resilient across sites: you don’t overfit to one class name.
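The parser leaves `price` as a raw string like `"$1,299.00"`. If you need numbers downstream, a small normalizer (hypothetical helper, part of your Transform stage) could look like this:

```javascript
// Hypothetical helper: turn a scraped price string into a number.
// Assumes US-style formatting (comma thousands, dot decimals);
// EU-style "49,95" needs different handling.
function normalizePrice(priceStr) {
  if (!priceStr) return null;
  const digits = priceStr.replace(/[^0-9.,]/g, "").replace(/,/g, "");
  const n = Number.parseFloat(digits);
  return Number.isNaN(n) ? null : n;
}

console.log(normalizePrice("$1,299.00")); // 1299
console.log(normalizePrice(null));        // null
```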

For high accuracy on a single target, you should:

  • inspect the HTML in DevTools
  • target specific containers (e.g. div[data-testid='product-card'])
  • write selectors based on attributes instead of CSS classes

Part 3 — Crawl pagination (without guessing)

The cleanest pagination approach is: follow the Next link.

We’ll implement:

  • rel="next" if present
  • a fallback selector for anchors containing “Next”

function findNextPageUrl(html, currentUrl) {
  const $ = cheerio.load(html);

  const relNext = $("link[rel='next']").attr("href");
  if (relNext) return new URL(relNext, currentUrl).toString();

  // Fallback (English sites): anchor text contains "Next"
  let nextHref = null;
  $("a[href]").each((_, a) => {
    const text = $(a).text().trim().toLowerCase();
    if (text === "next" || text.includes("next ") || text.includes(" next")) {
      nextHref = $(a).attr("href");
      return false;
    }
  });

  return nextHref ? new URL(nextHref, currentUrl).toString() : null;
}

async function crawl(startUrl, { maxPages = 5, useProxiesApi = false } = {}) {
  let url = startUrl;
  const all = [];
  const seen = new Set();

  for (let page = 1; page <= maxPages; page++) {
    const html = await fetchHtml(url, { useProxiesApi });
    const rows = parseListingPage(html, { baseUrl: url });

    for (const r of rows) {
      if (seen.has(r.url)) continue;
      seen.add(r.url);
      all.push(r);
    }

    console.log(`page ${page}: rows=${rows.length} total_unique=${all.length}`);

    const nextUrl = findNextPageUrl(html, url);
    if (!nextUrl) break;
    url = nextUrl;
  }

  return all;
}
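Pagination is sequential by nature, but once `crawl()` returns rows you often want to fetch each row's detail page. A small concurrency limiter keeps that polite; here's a dependency-free sketch (the helper name is my own, not a library API):

```javascript
// A minimal concurrency limiter: run at most `limit` async tasks at once.
// Useful for fetching each row's detail page after crawl() finishes.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the list is exhausted.
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Usage sketch: const pages = await mapWithConcurrency(rows, 4, (r) => fetchHtml(r.url));
```

Keeping the limit low (3-5) is usually enough to speed things up without hammering the target.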

Part 4 — Export to CSV (no dependencies)

import fs from "node:fs";

function toCsv(rows) {
  if (!rows.length) return "";
  const headers = Object.keys(rows[0]);
  const escape = (v) => {
    if (v === null || v === undefined) return "";
    const s = String(v);
    if (s.includes(",") || s.includes("\n") || s.includes('"')) {
      return '"' + s.replace(/"/g, '""') + '"';
    }
    return s;
  };

  const lines = [headers.join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => escape(row[h])).join(","));
  }

  return lines.join("\n") + "\n";
}

async function main() {
  const startUrl = "https://news.ycombinator.com/"; // replace with your target

  const rows = await crawl(startUrl, { maxPages: 3, useProxiesApi: true });
  if (!rows.length) throw new Error("No rows scraped");

  const csv = toCsv(rows);
  fs.writeFileSync("scraped.csv", csv, "utf-8");
  console.log("wrote scraped.csv rows:", rows.length);
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});
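To sanity-check the escaping rule before trusting the export, here it is exercised standalone (the escape logic is copied from `toCsv()` above so this snippet runs on its own):

```javascript
// Standalone check of the CSV escaping rule used in toCsv() above:
// quote fields containing commas, newlines, or quotes; double inner quotes.
function escapeCsv(v) {
  if (v === null || v === undefined) return "";
  const s = String(v);
  if (s.includes(",") || s.includes("\n") || s.includes('"')) {
    return '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}

console.log(escapeCsv('He said "hi", then left')); // "He said ""hi"", then left"
console.log(escapeCsv(null)); // (empty string)
```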

Practical advice (what experienced scrapers do)

1) Start with a “selector budget”

Don’t start by writing 25 selectors.

Start with:

  • a URL selector
  • a title selector
  • a price selector

Ship a dataset. Then iterate.

2) Prefer attributes over CSS classes

Classes change. Attributes like data-testid, aria-label, and semantic tags change less.

3) Treat every scrape as an ETL job

A good scraper has three stages:

  1. Extract: fetch HTML
  2. Transform: parse into normalized rows
  3. Load: write to CSV/DB/API

Keeping these separate makes your code maintainable.
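The three stages above can be sketched as separate, independently testable functions (names here are illustrative, not from a library; Extract is the `fetchHtml()` + `parseListingPage()` pair from Parts 1-2):

```javascript
// Transform: normalize raw parsed rows before loading anywhere.
function transform(rawRows) {
  return rawRows
    .filter((r) => r.url)                       // drop rows without a URL
    .map((r) => ({ ...r, title: r.title.trim() })); // tidy titles
}

// Load: in this tutorial it's toCsv() + fs.writeFileSync; stubbed here
// to return the row count so the sketch stays self-contained.
function load(rows) {
  return rows.length;
}

const raw = [
  { title: "  Widget  ", url: "https://example.com/w" },
  { title: "orphan", url: "" },
];
console.log(load(transform(raw))); // 1
```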

4) Expect failures

Plan for:

  • timeouts
  • 429 rate limits
  • intermittent HTML differences

Retries + backoff + logging are not optional.


ProxiesAPI integration patterns (without overpromising)

You have three common options:

  1. Always on: every request goes through ProxiesAPI
  2. Host-based: only “difficult” hosts go through ProxiesAPI
  3. Fallback: try direct first, then retry with ProxiesAPI

If you run a lot of scraping jobs, host-based routing is usually the sweet spot.
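Option 3 (fallback) is also easy to express as a wrapper. In this sketch `fetcher` stands in for the `fetchHtml()` from Part 1, injected as a parameter so the example is self-contained:

```javascript
// Fallback routing: try a direct fetch first, and only retry through
// ProxiesAPI if the direct attempt throws.
async function fetchWithFallback(url, fetcher) {
  try {
    return await fetcher(url, { useProxiesApi: false });
  } catch (directErr) {
    console.log(`direct fetch failed (${directErr.message}), retrying via proxy`);
    return fetcher(url, { useProxiesApi: true });
  }
}

// Usage in the real scraper: const html = await fetchWithFallback(url, fetchHtml);
```

Note that with the retrying `fetchHtml()` from Part 1, this means up to `retries` direct attempts before the proxy route kicks in; lower `retries` for the direct pass if that's too slow.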


Comparison: Node.js scraping libraries (2026)

| Approach        | Best for       | Pros                    | Cons               |
|-----------------|----------------|-------------------------|--------------------|
| fetch + Cheerio | HTML pages     | Fast, simple, cheap     | No JS rendering    |
| Axios + Cheerio | HTML pages     | Mature ecosystem        | Extra dependency   |
| Playwright      | JS-heavy sites | Accurate, renders pages | Slower, heavier    |
| Puppeteer       | JS-heavy sites | Popular                 | Similar tradeoffs  |

FAQ

Is web scraping legal?

It depends on what you scrape, how you use it, and the jurisdiction.

Common best practices:

  • scrape public pages
  • respect rate limits
  • don’t collect sensitive personal data
  • check the target’s terms and local law

What about robots.txt?

robots.txt is not a law, but it’s a strong signal of what the site expects.

If you’re doing something commercial, treat it as part of your compliance process.


QA checklist

  • Fetch HTML with timeouts (no hanging)
  • Parse a page and log the first 3 extracted rows
  • Pagination increases unique URLs
  • CSV opens cleanly in Excel/Sheets
  • Retries/backoff are visible in logs
