Web Scraping with JavaScript and Node.js: Full Tutorial (Puppeteer/Playwright + ProxiesAPI)

If you’re building scrapers in 2026, Node.js is a killer choice:

  • fast iteration
  • a huge ecosystem
  • excellent browser automation (Playwright/Puppeteer)

But the biggest mistake people make is jumping straight to headless browsers for everything.

A production setup is two-tier:

  1. HTTP-first (cheap + fast): fetch HTML and parse it
  2. Browser fallback (expensive + powerful): only for pages that truly need JS

This tutorial gives you a complete, copy-pasteable stack for web scraping with JavaScript.



The 80/20 architecture (HTTP-first → browser fallback)

Here’s the pattern you want:

  • a URL queue
  • an HTTP fetcher (with retries, timeouts, and proxies)
  • an HTML parser (Cheerio)
  • an escalation rule:
    • if content is missing
    • or you get soft-blocked
    • then render with Playwright

This keeps costs down and throughput up.


Project setup

mkdir node-scraper
cd node-scraper
npm init -y
npm i axios cheerio p-limit
npm i -D playwright

(You can use Puppeteer instead of Playwright — I’ll show both patterns. I prefer Playwright for reliability and multi-browser support.)
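Two quick gotchas before writing code: the snippets below use ES-module syntax (import plus top-level await), so package.json needs "type": "module"; and if Playwright's browser binaries didn't come down during install, fetch Chromium explicitly:

npm pkg set type=module
npx playwright install chromium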


Step 1: HTTP scraping with Axios + Cheerio

A production-grade fetcher

  • timeouts
  • retry with backoff (simple and effective)
  • proxy support via env vars (ProxiesAPI)

// fetch.js
import axios from "axios";

const TIMEOUT_MS = 30_000;

function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

function buildProxyFromEnv() {
  // Prefer a single URL like: http://USER:PASS@proxy.proxiesapi.com:1234
  const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
  if (!proxyUrl) return null;

  // Axios proxy config is host/port based; for URL proxies,
  // the simplest approach is to use an HTTPS proxy agent.
  // To keep this tutorial dependency-light, we’ll pass proxies via HTTP(S)_PROXY
  // environment variables that many HTTP stacks respect.
  // If you want strict proxying in Axios, add `https-proxy-agent`.
  return proxyUrl;
}

export async function fetchHtml(url) {
  const proxyUrl = buildProxyFromEnv();

  const headers = {
    "User-Agent":
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
  };

  // If your environment supports it, this is the easiest way to route traffic.
  // Many tools respect HTTP(S)_PROXY.
  if (proxyUrl) {
    process.env.HTTP_PROXY = proxyUrl;
    process.env.HTTPS_PROXY = proxyUrl;
  }

  let lastErr;
  for (let attempt = 1; attempt <= 4; attempt++) {
    try {
      const res = await axios.get(url, {
        timeout: TIMEOUT_MS,
        headers,
        maxRedirects: 5,
        validateStatus: () => true,
      });

      if (res.status === 403 || res.status === 429) {
        throw new Error(`blocked/throttled: HTTP ${res.status}`);
      }
      if (res.status >= 500) {
        throw new Error(`server error: HTTP ${res.status}`);
      }
      if (res.status < 200 || res.status >= 300) {
        throw new Error(`unexpected status: HTTP ${res.status}`);
      }

      const html = res.data;
      if (!html || typeof html !== "string" || html.length < 500) {
        throw new Error("HTML too small (possible block or JS-only page)");
      }

      return html;
    } catch (e) {
      lastErr = e;
      if (attempt === 4) break; // no point sleeping after the final attempt
      const backoff = Math.min(1500 * 2 ** (attempt - 1), 10_000);
      await sleep(backoff);
    }
  }

  throw lastErr;
}
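If you want strict, per-request proxying in Axios instead of relying on env vars (Axios's handling of HTTPS_PROXY has historically been inconsistent for HTTPS targets), here's a minimal sketch using the https-proxy-agent package. fetchViaAgent is my name, and it assumes you've run npm i https-proxy-agent:

// fetchViaAgent.js (sketch; requires `npm i https-proxy-agent`)
import axios from "axios";
import { HttpsProxyAgent } from "https-proxy-agent";

export async function fetchViaAgent(url, proxyUrl) {
  const res = await axios.get(url, {
    // Tunnels HTTPS requests through the proxy. For plain-HTTP targets,
    // use HttpProxyAgent from the sibling http-proxy-agent package.
    httpsAgent: new HttpsProxyAgent(proxyUrl),
    proxy: false, // disable Axios's own env-based proxy handling so the agent wins
    timeout: 30_000,
  });
  return res.data;
}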

Parse with Cheerio

// parse.js
import * as cheerio from "cheerio";

export function parseHackerNewsLike(html) {
  const $ = cheerio.load(html);

  // Example extraction: all links
  const links = [];
  $("a").each((_, el) => {
    const href = $(el).attr("href");
    const text = $(el).text().trim();
    if (href) links.push({ href, text });
  });

  return links.slice(0, 50);
}

(Replace parsing logic with your site’s selectors. Cheerio is basically jQuery for server-side HTML.)
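For instance, if your target really is Hacker News, front-page stories sit in rows with stable class names. Selectors below reflect the markup at the time of writing; verify against the live page before depending on them:

import * as cheerio from "cheerio";

export function parseHackerNews(html) {
  const $ = cheerio.load(html);
  const stories = [];
  // Each story is a <tr class="athing">; the title link lives in span.titleline.
  $("tr.athing").each((_, row) => {
    const link = $(row).find("span.titleline > a").first();
    stories.push({
      id: $(row).attr("id"),
      title: link.text().trim(),
      url: link.attr("href"),
    });
  });
  return stories;
}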


Step 2: Detect when you need a browser

A good escalation rule is:

  • your key selector returns 0 matches
  • or the page contains a known block signature
  • or HTML is suspiciously small

Example:

// needsBrowser.js
export function needsBrowser(html) {
  const lower = html.toLowerCase();
  return (
    lower.includes("captcha") ||
    lower.includes("unusual traffic") ||
    lower.includes("/sorry/") ||
    html.length < 5_000
  );
}
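That covers the block signatures and the size heuristic. To wire in the first bullet (your key selector returns 0 matches), you can layer a Cheerio check on top. A small sketch, where needsBrowserFor and the selector argument are my additions:

import * as cheerio from "cheerio";
import { needsBrowser } from "./needsBrowser.js";

// Escalate when the block heuristics fire OR the data we came for is absent.
export function needsBrowserFor(html, keySelector) {
  if (needsBrowser(html)) return true;
  const $ = cheerio.load(html);
  return $(keySelector).length === 0;
}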

Step 3: Browser scraping with Playwright (with proxy support)

Playwright is your “break glass” option. Use it selectively.

// render.js
import { chromium } from "playwright";

export async function renderHtml(url) {
  const proxyUrl = process.env.PROXIESAPI_PROXY_URL; // e.g. http://user:pass@host:port

  // Playwright takes proxy credentials as separate fields, not embedded in the URL.
  let proxy;
  if (proxyUrl) {
    const u = new URL(proxyUrl);
    proxy = {
      server: `${u.protocol}//${u.host}`,
      username: decodeURIComponent(u.username),
      password: decodeURIComponent(u.password),
    };
  }

  const browser = await chromium.launch({ headless: true });

  try {
    const context = await browser.newContext({
      ...(proxy ? { proxy } : {}),
      userAgent:
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    });

    const page = await context.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 45_000 });

    // If your target renders late, use networkidle, but it can hang on long-polling sites.
    // await page.goto(url, { waitUntil: "networkidle", timeout: 45_000 });

    return await page.content();
  } finally {
    // Close even when goto() throws, or you leak headless Chromium processes.
    await browser.close();
  }
}
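One design note: launching a fresh Chromium per URL is the most expensive part of this function. If you escalate more than a handful of pages per run, keep one browser warm and open a context per page instead. A rough sketch (module-level state, helper names are mine, and the lazy launch isn't guarded against concurrent first calls):

import { chromium } from "playwright";

let shared; // lazily launched, reused across renders

async function getBrowser() {
  shared ??= await chromium.launch({ headless: true });
  return shared;
}

export async function renderWithSharedBrowser(url) {
  const context = await (await getBrowser()).newContext();
  try {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 45_000 });
    return await page.content();
  } finally {
    await context.close(); // contexts are cheap to recreate; the browser stays warm
  }
}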

Puppeteer alternative

If you prefer Puppeteer:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto(url, { waitUntil: "domcontentloaded" });
const html = await page.content();
await browser.close();
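Puppeteer takes the proxy as a Chromium launch flag, with credentials supplied separately per page. A sketch assuming the same PROXIESAPI_PROXY_URL format as above:

const proxy = new URL(process.env.PROXIESAPI_PROXY_URL);

const browser = await puppeteer.launch({
  headless: true,
  args: [`--proxy-server=${proxy.protocol}//${proxy.host}`],
});

const page = await browser.newPage();
// Chromium ignores credentials embedded in --proxy-server; authenticate here instead.
await page.authenticate({
  username: decodeURIComponent(proxy.username),
  password: decodeURIComponent(proxy.password),
});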

Step 4: Put it together (queue + concurrency)

// index.js
import pLimit from "p-limit";
import { fetchHtml } from "./fetch.js";
import { renderHtml } from "./render.js";
import { needsBrowser } from "./needsBrowser.js";
import { parseHackerNewsLike } from "./parse.js";

const limit = pLimit(3);

async function scrapeUrl(url) {
  let html;
  try {
    html = await fetchHtml(url);
  } catch {
    // fetchHtml gave up (hard block or repeated errors), so escalate straight to the browser.
    return parseHackerNewsLike(await renderHtml(url));
  }
  if (needsBrowser(html)) {
    html = await renderHtml(url);
  }
  return parseHackerNewsLike(html);
}

const urls = ["https://news.ycombinator.com/"];

const results = await Promise.all(
  urls.map((u) => limit(() => scrapeUrl(u)))
);

console.log(results[0].slice(0, 5));
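Run it with the proxy URL in the environment (the host below is the placeholder format from fetch.js, not a real endpoint):

PROXIESAPI_PROXY_URL="http://USER:PASS@proxy.proxiesapi.com:1234" node index.js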

This structure scales cleanly:

  • increase concurrency gradually
  • add per-domain rate limits (sketch below)
  • add caching
  • add database writes
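For the per-domain rate limit, a minimal sketch that serializes same-host requests and spaces them out (withDomainLimit and GAP_MS are my names):

// domainLimit.js
import pLimit from "p-limit";

const perHost = new Map();
const GAP_MS = 1_000; // steady, polite gap between same-domain requests

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

export function withDomainLimit(url, task) {
  const host = new URL(url).hostname;
  if (!perHost.has(host)) perHost.set(host, pLimit(1));
  return perHost.get(host)(async () => {
    const result = await task();
    await sleep(GAP_MS); // hold the per-host slot briefly before releasing it
    return result;
  });
}

Usage: withDomainLimit(url, () => scrapeUrl(url)) composes cleanly with the global pLimit(3) from index.js.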

Proxies + blocks: the practical checklist

When a Node scraper fails in production, it’s usually one of these:

  • no timeouts (requests hang forever)
  • no retries/backoff (transient errors kill the run)
  • no block detection (you “successfully” parse a CAPTCHA page)
  • too much browser automation (slow + expensive)

A simple checklist:

  • timeouts everywhere
  • retries with exponential backoff
  • block detection (403/429 + HTML signatures)
  • HTTP-first; browser only when needed
  • low steady request rate per domain
  • proxies configured via env vars (ProxiesAPI)

Comparison table: Cheerio vs Playwright vs Puppeteer

Tool         Best for               Pros                              Cons
Cheerio      server-rendered HTML   fast, cheap, simple               can’t execute JS
Playwright   JS-rendered pages      reliable, modern, multi-browser   slower, higher cost
Puppeteer    Chrome automation      big ecosystem                     fewer cross-browser features

Where ProxiesAPI fits (honestly)

Proxies don’t replace good scraping hygiene — they complement it.

Use ProxiesAPI to:

  • rotate outbound IPs when throttling starts
  • isolate domains (different sessions / IP pools)
  • keep high-volume crawls stable

And keep your scraper disciplined:

  • HTTP-first
  • browser only when needed
  • store results so you can resume instead of restarting

Make your Node scrapers resilient with ProxiesAPI

As soon as your crawler hits real-world scale (more URLs, more concurrency, more blocks), the proxy layer becomes the difference between a toy script and a reliable pipeline.
