Web Scraping with JavaScript and Node.js: Full Tutorial (Puppeteer/Playwright + ProxiesAPI)
If you’re building scrapers in 2026, Node.js is a killer choice:
- fast iteration
- a huge ecosystem
- excellent browser automation (Playwright/Puppeteer)
But the biggest mistake people make is jumping straight to headless browsers for everything.
A production setup is two-tier:
- HTTP-first (cheap + fast): fetch HTML and parse it
- Browser fallback (expensive + powerful): only for pages that truly need JS
This tutorial gives you a complete, copy-pasteable stack for web scraping with JavaScript and Node.js.
As soon as your crawler hits real-world scale (more URLs, more concurrency, more blocks), the proxy layer becomes the difference between a toy script and a reliable pipeline.
The 80/20 architecture (HTTP-first → browser fallback)
Here’s the pattern you want:
- a URL queue
- an HTTP fetcher (with retries, timeouts, and proxies)
- an HTML parser (Cheerio)
- an escalation rule:
  - if content is missing,
  - or you get soft-blocked,
  - then render with Playwright
This keeps costs down and throughput up.
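In code, the whole decision fits in a few lines. A quick sketch of the flow (the production versions of fetchHtml, needsBrowser, and renderHtml are built in the steps below):

// Sketch of the two-tier flow; full implementations follow in Steps 1-4
let html = await fetchHtml(url); // cheap HTTP path first
if (needsBrowser(html)) {
  html = await renderHtml(url); // escalate to a real browser only when needed
}
const data = parse(html); // your Cheerio-based extraction (Step 1)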
Project setup
mkdir node-scraper
cd node-scraper
npm init -y
npm pkg set type=module
npm i axios cheerio p-limit
npm i -D playwright
(The examples use ES-module import syntax, which is why the setup sets "type": "module". You can use Puppeteer instead of Playwright; I’ll show both patterns. I prefer Playwright for reliability and multi-browser support.)
Step 1: HTTP scraping with Axios + Cheerio
A production-grade fetcher needs:
- timeouts
- retry with backoff (simple and effective)
- proxy support via env vars (ProxiesAPI)
// fetch.js
import axios from "axios";
const TIMEOUT_MS = 30_000;
function sleep(ms) {
return new Promise((r) => setTimeout(r, ms));
}
function buildProxyFromEnv() {
// Prefer a single URL like: http://USER:PASS@proxy.proxiesapi.com:1234
const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
if (!proxyUrl) return null;
// Axios's `proxy` option is host/port based; for a full proxy URL the
// dependency-light route is the HTTP(S)_PROXY environment variables,
// which Axios's Node adapter (and many other HTTP stacks) respect.
// For strict per-request proxying in Axios, use `https-proxy-agent` (sketched after this file).
return proxyUrl;
}
export async function fetchHtml(url) {
const proxyUrl = buildProxyFromEnv();
const headers = {
"User-Agent":
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
};
// If your environment supports it, this is the easiest way to route traffic.
// Many tools respect HTTP(S)_PROXY.
if (proxyUrl) {
process.env.HTTP_PROXY = proxyUrl;
process.env.HTTPS_PROXY = proxyUrl;
}
let lastErr;
for (let attempt = 1; attempt <= 4; attempt++) {
try {
const res = await axios.get(url, {
timeout: TIMEOUT_MS,
headers,
maxRedirects: 5,
validateStatus: () => true,
});
if (res.status === 403 || res.status === 429) {
throw new Error(`blocked/throttled: HTTP ${res.status}`);
}
if (res.status >= 500) {
throw new Error(`server error: HTTP ${res.status}`);
}
if (res.status < 200 || res.status >= 300) {
throw new Error(`unexpected status: HTTP ${res.status}`);
}
const html = res.data;
if (!html || typeof html !== "string" || html.length < 500) {
throw new Error("HTML too small (possible block or JS-only page)");
}
return html;
} catch (e) {
  lastErr = e;
  if (attempt < 4) {
    // back off before retrying; no point sleeping after the final attempt
    const backoff = Math.min(1500 * 2 ** (attempt - 1), 10_000);
    await sleep(backoff);
  }
}
}
throw lastErr;
}
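The comments above mention https-proxy-agent for strict per-request proxying in Axios. Here’s a minimal sketch of that route; it assumes the package’s v7 named export and that you’ve run npm i https-proxy-agent:

// Strict per-request proxying (alternative to the HTTP(S)_PROXY env vars)
import { HttpsProxyAgent } from "https-proxy-agent";

const proxyUrl = process.env.PROXIESAPI_PROXY_URL;
const httpsAgent = proxyUrl ? new HttpsProxyAgent(proxyUrl) : undefined;

const res = await axios.get(url, {
  timeout: TIMEOUT_MS,
  headers,
  httpsAgent, // tunnels HTTPS requests through the proxy via CONNECT
  proxy: false, // turn off Axios's built-in proxy handling so the agent is used
});
// For plain-HTTP targets, the companion http-proxy-agent package works the same way.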
Parse with Cheerio
// parse.js
import * as cheerio from "cheerio";
export function parseHackerNewsLike(html) {
const $ = cheerio.load(html);
// Example extraction: all links
const links = [];
$("a").each((_, el) => {
const href = $(el).attr("href");
const text = $(el).text().trim();
if (href) links.push({ href, text });
});
return links.slice(0, 50);
}
(Replace parsing logic with your site’s selectors. Cheerio is basically jQuery for server-side HTML.)
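As a concrete example of site-specific selectors, here’s a sketch for Hacker News front-page stories. The tr.athing / .titleline class names are assumptions about HN’s current markup, so verify them in devtools before relying on this:

// Add to parse.js (reuses the cheerio import above); selectors are assumptions about HN markup
export function parseHackerNewsStories(html) {
  const $ = cheerio.load(html);
  const stories = [];
  $("tr.athing").each((_, row) => {
    const link = $(row).find(".titleline > a").first();
    stories.push({
      id: $(row).attr("id"),
      title: link.text().trim(),
      url: link.attr("href"),
    });
  });
  return stories;
}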
Step 2: Detect when you need a browser
A good escalation rule is:
- your key selector returns 0 matches (a Cheerio check for this is sketched after the example below)
- or the page contains a known block signature
- or HTML is suspiciously small
Example:
// needsBrowser.js
export function needsBrowser(html) {
const lower = html.toLowerCase();
return (
lower.includes("captcha") ||
lower.includes("unusual traffic") ||
lower.includes("/sorry/") ||
html.length < 5_000
);
}
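The first rule in the list ("your key selector returns 0 matches") is easiest to implement with Cheerio. A sketch, where the selector is a placeholder for whatever element your parser actually depends on:

// Could live next to needsBrowser: escalate when the parser's key element is missing
import * as cheerio from "cheerio";

export function missingKeyContent(html, keySelector = "tr.athing") {
  const $ = cheerio.load(html);
  return $(keySelector).length === 0;
}

Escalate to the browser if either this or needsBrowser fires.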
Step 3: Browser scraping with Playwright (with proxy support)
Playwright is your “break glass” option. Use it selectively.
// render.js
import { chromium } from "playwright";
export async function renderHtml(url) {
  // e.g. PROXIESAPI_PROXY_URL=http://user:pass@host:port
  const proxyUrl = process.env.PROXIESAPI_PROXY_URL;

  // Playwright expects proxy credentials as separate fields, not embedded in the server URL.
  let proxy;
  if (proxyUrl) {
    const u = new URL(proxyUrl);
    proxy = { server: `${u.protocol}//${u.host}` };
    if (u.username) {
      proxy.username = decodeURIComponent(u.username);
      proxy.password = decodeURIComponent(u.password);
    }
  }

  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext({
      ...(proxy ? { proxy } : {}),
      userAgent:
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    });
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 45_000 });
    // If your target renders late, use networkidle, but it can hang on long-polling sites.
    // await page.goto(url, { waitUntil: "networkidle", timeout: 45_000 });
    const html = await page.content();
    return html;
  } finally {
    // Close even when goto/content throws, so failed renders don't leak browser processes.
    await browser.close();
  }
}
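If the data you need only appears after client-side rendering, waiting for a specific element is usually more reliable than networkidle. A sketch (the selector is a placeholder for whatever your parser needs):

// Inside renderHtml, after page.goto(...):
await page.waitForSelector(".results-list", { timeout: 15_000 });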
Puppeteer alternative
If you prefer Puppeteer:
import puppeteer from "puppeteer";
const browser = await puppeteer.launch({ headless: true }); // recent Puppeteer defaults to the new headless mode
const page = await browser.newPage();
await page.goto(url, { waitUntil: "domcontentloaded" });
const html = await page.content();
await browser.close();
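Puppeteer configures proxies at launch rather than per context. A minimal sketch replacing the launch lines above, assuming the same PROXIESAPI_PROXY_URL format:

// Proxy with Puppeteer: host:port goes in a launch arg, credentials via page.authenticate()
const proxyUrl = process.env.PROXIESAPI_PROXY_URL; // e.g. http://user:pass@host:port
const u = proxyUrl ? new URL(proxyUrl) : null;

const browser = await puppeteer.launch({
  headless: true,
  args: u ? [`--proxy-server=${u.protocol}//${u.host}`] : [],
});
const page = await browser.newPage();
if (u && u.username) {
  await page.authenticate({
    username: decodeURIComponent(u.username),
    password: decodeURIComponent(u.password),
  });
}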
Step 4: Put it together (queue + concurrency)
// index.js
import pLimit from "p-limit";
import { fetchHtml } from "./fetch.js";
import { renderHtml } from "./render.js";
import { needsBrowser } from "./needsBrowser.js";
import { parseHackerNewsLike } from "./parse.js";
const limit = pLimit(3);
async function scrapeUrl(url) {
let html = await fetchHtml(url);
if (needsBrowser(html)) {
html = await renderHtml(url);
}
return parseHackerNewsLike(html);
}
const urls = ["https://news.ycombinator.com/"];
const results = await Promise.all(
urls.map((u) => limit(() => scrapeUrl(u)))
);
console.log(results[0].slice(0, 5));
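Run it with node index.js; the top-level await and import statements are why the setup step adds "type": "module" to package.json.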
This structure scales cleanly:
- increase concurrency gradually
- add per-domain rate limits (see the sketch below)
- add caching
- add database writes
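For the per-domain rate limits, one lightweight approach is a separate limiter per hostname plus a polite delay. A sketch using the p-limit package already installed above (the concurrency and delay values are assumptions to tune per target):

// rateLimit.js — one concurrency limiter per hostname, plus a fixed delay between requests
import pLimit from "p-limit";

const perDomain = new Map();
const PER_DOMAIN_CONCURRENCY = 1;
const PER_DOMAIN_DELAY_MS = 1_000;

function limiterFor(url) {
  const host = new URL(url).hostname;
  if (!perDomain.has(host)) perDomain.set(host, pLimit(PER_DOMAIN_CONCURRENCY));
  return perDomain.get(host);
}

export function politeScrape(url, scrapeFn) {
  return limiterFor(url)(async () => {
    const result = await scrapeFn(url);
    await new Promise((r) => setTimeout(r, PER_DOMAIN_DELAY_MS));
    return result;
  });
}

In index.js you would wrap the call: limit(() => politeScrape(u, scrapeUrl)).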
Proxies + blocks: the practical checklist
When a Node scraper fails in production, it’s usually one of these:
- no timeouts (requests hang forever)
- no retries/backoff (transient errors kill the run)
- no block detection (you “successfully” parse a CAPTCHA page)
- too much browser automation (slow + expensive)
A simple checklist:
- timeouts everywhere
- retries with exponential backoff
- block detection (403/429 + HTML signatures)
- HTTP-first; browser only when needed
- low steady request rate per domain
- proxies configured via env vars (ProxiesAPI)
Comparison table: Cheerio vs Playwright vs Puppeteer
| Tool | Best for | Pros | Cons |
|---|---|---|---|
| Cheerio | server-rendered HTML | fast, cheap, simple | can’t execute JS |
| Playwright | JS-rendered pages | reliable, modern, multi-browser | slower, higher cost |
| Puppeteer | Chrome automation | big ecosystem | fewer cross-browser features |
Where ProxiesAPI fits (honestly)
Proxies don’t replace good scraping hygiene — they complement it.
Use ProxiesAPI to:
- rotate outbound IPs when throttling starts
- isolate domains (different sessions / IP pools)
- keep high-volume crawls stable
And keep your scraper disciplined:
- HTTP-first
- browser only when needed
- store results so you can resume instead of restarting
Once your crawler hits real-world scale (more URLs, more concurrency, more blocks), that proxy layer is what separates a toy script from a reliable pipeline.