Web Scraping with Ruby: Nokogiri + HTTParty Tutorial (2026)
If you search for “web scraping with ruby”, you’ll find a lot of tiny examples… and not many end-to-end tutorials you can actually reuse.
This guide is different: we’ll build a small but production-shaped Ruby scraper using two battle-tested gems:
- HTTParty for HTTP requests
- Nokogiri for HTML parsing
By the end, you’ll know how to:
- fetch pages with proper headers + timeouts
- parse real HTML with stable selectors
- follow pagination safely
- add retries/backoff
- (optionally) route requests through a proxy
We’ll use a simple target site structure in examples so you can copy/paste and adapt.
Once you go beyond a few pages, the hardest part is often reliability: timeouts, blocks, and inconsistent responses. A proxy layer (like ProxiesAPI) can help keep your fetch step predictable.
Prerequisites
- Ruby 3.x
- Bundler
Create a new folder:
mkdir ruby-scraper
cd ruby-scraper
bundle init
Add gems to your Gemfile:
# Gemfile
source "https://rubygems.org"
gem "httparty"
gem "nokogiri"
gem "addressable"
Install:
bundle install
A minimal, correct fetch function (headers + timeouts)
A lot of scraping issues come from a “naive” HTTP client:
- no timeouts → scripts hang forever
- no headers → you get bot-ish responses
- no retry → any transient failure kills the run
Start here.
require "httparty"

class Fetcher
  DEFAULT_HEADERS = {
    "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" => "en-US,en;q=0.9"
  }.freeze

  def initialize(proxy: nil)
    @proxy = proxy
  end

  def get(url, timeout: 20)
    options = {
      headers: DEFAULT_HEADERS,
      timeout: timeout
    }

    # HTTParty takes the proxy as separate host/port/user/pass options
    if @proxy
      options[:http_proxyaddr] = @proxy[:host]
      options[:http_proxyport] = @proxy[:port]
      options[:http_proxyuser] = @proxy[:username] if @proxy[:username]
      options[:http_proxypass] = @proxy[:password] if @proxy[:password]
    end

    resp = HTTParty.get(url, options)
    raise "HTTP #{resp.code} for #{url}" if resp.code >= 400

    resp.body
  end
end
Proxy configuration
If your proxy provider gives you a URL like:
http://user:pass@gw.proxiesapi.com:8080
You can split it into a hash:
proxy = {
  host: "gw.proxiesapi.com",
  port: 8080,
  username: "user",
  password: "pass"
}

fetcher = Fetcher.new(proxy: proxy)
Keep proxy usage optional. Many sites don’t need it at small scale.
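If you'd rather not split the URL by hand, Ruby's standard `URI` library can do it. The helper name `proxy_from_url` below is just an illustration, not part of any gem:

```ruby
require "uri"

# Split a proxy URL like "http://user:pass@gw.proxiesapi.com:8080"
# into the hash shape our Fetcher expects.
def proxy_from_url(url)
  u = URI.parse(url)
  {
    host: u.host,
    port: u.port,
    username: u.user,
    password: u.password
  }
end

proxy = proxy_from_url("http://user:pass@gw.proxiesapi.com:8080")
# => { host: "gw.proxiesapi.com", port: 8080, username: "user", password: "pass" }
```

This also makes it trivial to read the proxy from an environment variable instead of hard-coding credentials.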
Parse HTML with Nokogiri
Nokogiri shines when you:
- use CSS selectors (simple and readable)
- keep parsing in small functions
- return plain hashes
Example:
require "nokogiri"

def parse_listing_page(html)
  doc = Nokogiri::HTML(html)
  products = []

  doc.css(".product-card").each do |card|
    title = card.at_css(".title")&.text&.strip
    price = card.at_css(".price")&.text&.strip
    url   = card.at_css("a")&.[]("href")
    next if title.nil? || url.nil?

    products << {
      title: title,
      price: price,
      url: url
    }
  end

  next_href = doc.at_css("a.next")&.[]("href")
  [products, next_href]
end
The selectors (.product-card, .title, .price, a.next) are placeholders. On a real site you'd inspect the HTML and choose stable patterns:
- IDs or data-* attributes
- semantic tags inside a known container
- avoid brittle "utility class soup"
Handle pagination safely
Most scrapers break pagination in one of three ways:
- follow “Next” links forever
- accidentally revisit pages (loops)
- hammer pages too fast
Here’s a safe crawler loop:
require "addressable/uri"

def absolutize(base_url, href)
  return nil if href.nil?
  Addressable::URI.join(base_url, href).to_s
end

base_url = "https://example.com/catalog"
url = base_url
fetcher = Fetcher.new # or Fetcher.new(proxy: proxy)

seen = {}
all = []
max_pages = 10

(1..max_pages).each do |page|
  break if url.nil?
  break if seen[url] # loop protection: never revisit a page
  seen[url] = true

  html = fetcher.get(url)
  items, next_href = parse_listing_page(html)
  puts "page=#{page} items=#{items.size} url=#{url}"
  all.concat(items)

  # polite delay
  sleep(rand(0.6..1.4))
  url = absolutize(base_url, next_href)
end

puts "total items: #{all.size}"
Add retries (with backoff)
Real scraping involves transient failures:
- timeouts
- 429 rate limits
- random 5xx
A simple retry wrapper makes your scraper much more resilient.
def with_retries(max_attempts: 4, base_sleep: 1)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue => e
    raise e if attempt >= max_attempts
    sleep_time = base_sleep * (2**(attempt - 1)) + rand * 0.25
    warn "retry attempt=#{attempt} error=#{e.class}: #{e.message} sleep=#{sleep_time.round(2)}s"
    sleep(sleep_time)
    retry
  end
end
Use it around fetch calls:
html = with_retries { fetcher.get(url, timeout: 20) }
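One refinement worth considering: not every error deserves a retry. A 404 will never succeed no matter how long you wait, while 429s and 5xx errors usually will. A sketch using a hypothetical `HttpError` class (you'd raise it from `Fetcher#get` instead of a bare string):

```ruby
# Hypothetical error class carrying the HTTP status, so the retry
# logic can decide whether retrying makes sense.
class HttpError < StandardError
  attr_reader :code

  def initialize(code, url)
    @code = code
    super("HTTP #{code} for #{url}")
  end
end

RETRYABLE_CODES = [429, 500, 502, 503, 504].freeze

def retryable?(error)
  # Timeouts and other non-HTTP errors are transient: retry them.
  return true unless error.is_a?(HttpError)
  RETRYABLE_CODES.include?(error.code)
end

begin
  raise HttpError.new(404, "https://example.com/missing")
rescue => e
  puts retryable?(e) ? "retry" : "give up" # prints "give up"
end
```

You would then check `retryable?(e)` inside the `rescue` branch of `with_retries` and re-raise immediately when it returns false.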
Export results (JSONL is the easiest)
Ruby is great for writing JSONL—one JSON object per line.
require "json"

def export_jsonl(rows, path)
  File.open(path, "w") do |f|
    rows.each do |r|
      f.puts(JSON.generate(r))
    end
  end
end

export_jsonl(all, "out.jsonl")
puts "wrote out.jsonl"
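Reading JSONL back is just as simple, which is what makes the format handy for incremental runs (the `read_jsonl` helper below is a sketch, not a stdlib function):

```ruby
require "json"

# Read a JSONL file back into an array of hashes with symbol keys.
# Streams line by line, so large files don't need to fit in memory twice.
def read_jsonl(path)
  File.foreach(path).map { |line| JSON.parse(line, symbolize_names: true) }
end

File.write("demo.jsonl", %({"title":"Widget","price":"$9.99"}\n))
rows = read_jsonl("demo.jsonl")
puts rows.first[:title] # "Widget"
```

On a later run you can load the previous output, build a set of already-seen URLs, and skip re-scraping them.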
Proxy rotation patterns (practical advice)
If you’re scraping at meaningful scale, you’ll need to think about IP reputation and request distribution.
Two common patterns:
- Single stable proxy endpoint
  - easiest integration
  - your provider rotates behind the scenes (if supported)
- Pool rotation
  - you keep a list of proxy endpoints
  - pick one per request
A minimal pool rotator:
PROXIES = [
  { host: "gw1.example.com", port: 8080, username: "u", password: "p" },
  { host: "gw2.example.com", port: 8080, username: "u", password: "p" }
].freeze

def pick_proxy
  PROXIES.sample
end
fetcher = Fetcher.new(proxy: pick_proxy)
Don’t rotate “because you can”. Rotate when you see:
- rising 403 rates
- repeated captchas
- sudden response-body changes
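To act on those signals, you need to track them per proxy. Here is a minimal sketch of a failure tracker; the threshold and the reset-on-success policy are illustrative assumptions, not provider recommendations:

```ruby
# Tracks consecutive 403s per proxy host so the scraper can stop
# using a host that appears to be blocked.
class ProxyHealth
  def initialize
    @failures = Hash.new(0)
  end

  def record(proxy, code)
    host = proxy[:host]
    if code == 403
      @failures[host] += 1
    else
      @failures[host] = 0 # any success resets the counter
    end
  end

  def healthy?(proxy, threshold: 3)
    @failures[proxy[:host]] < threshold
  end
end

health = ProxyHealth.new
proxy = { host: "gw1.example.com", port: 8080 }
3.times { health.record(proxy, 403) }
puts health.healthy?(proxy) # prints "false"
```

Combine this with `pick_proxy` by sampling only from the proxies that are still healthy.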
Ethics + legality (quick reality check)
Before you scrape:
- respect robots.txt where appropriate for your use case
- don’t collect personal data you don’t need
- throttle requests (be polite)
- read the site’s Terms (especially for commercial use)
Summary
If you remember nothing else:
- timeouts + headers prevent a ton of failures
- Nokogiri selectors should be stable and meaning-based
- pagination needs loop protection (a seen set)
- retries turn flaky scripts into pipelines
From here, the next step is to adapt selectors to your target site and add a storage layer (SQLite/Postgres) for incremental crawls.