Web Scraping with Ruby: Nokogiri + HTTParty Tutorial (2026)

If you search for “web scraping with ruby”, you’ll find a lot of tiny examples… and not many end-to-end tutorials you can actually reuse.

This guide is different: we’ll build a small but production-shaped Ruby scraper using two battle-tested gems:

  • HTTParty for HTTP requests
  • Nokogiri for HTML parsing

By the end, you’ll know how to:

  • fetch pages with proper headers + timeouts
  • parse real HTML with stable selectors
  • follow pagination safely
  • add retries/backoff
  • (optionally) route requests through a proxy

We’ll use a simple target site structure in examples so you can copy/paste and adapt.

When your Ruby scraper scales, keep the network layer stable

Once you go beyond a few pages, the hardest part is often reliability: timeouts, blocks, and inconsistent responses. A proxy layer (like ProxiesAPI) can help keep your fetch step predictable.


Prerequisites

  • Ruby 3.x
  • Bundler

Create a new folder:

mkdir ruby-scraper
cd ruby-scraper
bundle init

Add gems to your Gemfile:

# Gemfile
source "https://rubygems.org"

gem "httparty"
gem "nokogiri"
gem "addressable"

Install:

bundle install

A minimal, correct fetch function (headers + timeouts)

A lot of scraping issues come from a “naive” HTTP client:

  • no timeouts → scripts hang forever
  • no headers → you get bot-ish responses
  • no retry → any transient failure kills the run

Start here.

require "httparty"

class Fetcher
  DEFAULT_HEADERS = {
    "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" => "en-US,en;q=0.9"
  }.freeze

  def initialize(proxy: nil)
    @proxy = proxy
  end

  def get(url, timeout: 20)
    options = {
      headers: DEFAULT_HEADERS,
      timeout: timeout
    }

    # HTTParty takes proxy settings as separate options: address, port, user, password
    if @proxy
      options[:http_proxyaddr] = @proxy[:host]
      options[:http_proxyport] = @proxy[:port]
      options[:http_proxyuser] = @proxy[:username] if @proxy[:username]
      options[:http_proxypass] = @proxy[:password] if @proxy[:password]
    end

    resp = HTTParty.get(url, options)

    if resp.code >= 400
      raise "HTTP #{resp.code} for #{url}"
    end

    resp.body
  end
end

Proxy configuration

If your proxy provider gives you a URL like:

http://user:pass@gw.proxiesapi.com:8080

You can split it into a hash:

proxy = {
  host: "gw.proxiesapi.com",
  port: 8080,
  username: "user",
  password: "pass"
}

fetcher = Fetcher.new(proxy: proxy)

Keep proxy usage optional. Many sites don’t need it at small scale.
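Rather than splitting the URL by hand, you can let Ruby's stdlib URI do it. A small sketch (proxy_from_url is just a helper name we're choosing, not part of any gem):

```ruby
require "uri"

# Split a proxy URL like "http://user:pass@gw.proxiesapi.com:8080"
# into the hash shape Fetcher expects.
def proxy_from_url(proxy_url)
  uri = URI.parse(proxy_url)
  {
    host: uri.host,
    port: uri.port,
    username: uri.user,
    password: uri.password
  }
end
```

This keeps your proxy credentials in a single environment variable or config value instead of four.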


Parse HTML with Nokogiri

Nokogiri shines when you:

  • use CSS selectors (simple and readable)
  • keep parsing in small functions
  • return plain hashes

Example:

require "nokogiri"

def parse_listing_page(html)
  doc = Nokogiri::HTML(html)

  products = []

  doc.css(".product-card").each do |card|
    title = card.at_css(".title")&.text&.strip
    price = card.at_css(".price")&.text&.strip
    url = card.at_css("a")&.[]("href")

    next if title.nil? || url.nil?

    products << {
      title: title,
      price: price,
      url: url
    }
  end

  next_href = doc.at_css("a.next")&.[]("href")

  [products, next_href]
end

The selectors (.product-card, .title, .price, a.next) are placeholders. On a real site you’d inspect the HTML and choose stable patterns:

  • IDs or data-* attributes
  • semantic tags inside a known container
  • avoid brittle “utility class soup”

Handle pagination safely

Most scrapers break pagination in one of three ways:

  1. follow “Next” links forever
  2. accidentally revisit pages (loops)
  3. hammer pages too fast

Here’s a safe crawler loop:

require "addressable/uri"

def absolutize(base_url, href)
  return nil if href.nil?
  Addressable::URI.join(base_url, href).to_s
end

base_url = "https://example.com/catalog"
url = base_url

fetcher = Fetcher.new # or Fetcher.new(proxy: proxy)

seen = {}
all = []
max_pages = 10

(1..max_pages).each do |page|
  break if url.nil?
  break if seen[url]
  seen[url] = true

  html = fetcher.get(url)
  items, next_href = parse_listing_page(html)

  puts "page=#{page} items=#{items.size} url=#{url}"

  all.concat(items)

  # polite delay
  sleep(rand(0.6..1.4))

  url = absolutize(base_url, next_href)
end

puts "total items: #{all.size}"

Add retries (with backoff)

Real scraping involves transient failures:

  • timeouts
  • 429 rate limits
  • random 5xx

A simple retry wrapper makes your scraper much more resilient.

def with_retries(max_attempts: 4, base_sleep: 1)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue => e
    raise e if attempt >= max_attempts

    sleep_time = base_sleep * (2 ** (attempt - 1)) + rand * 0.25
    warn "retry attempt=#{attempt} error=#{e.class}: #{e.message} sleep=#{sleep_time.round(2)}s"
    sleep(sleep_time)
    retry
  end
end

Use it around fetch calls:

html = with_retries { fetcher.get(url, timeout: 20) }
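One refinement worth considering: the wrapper above retries every StandardError, including permanent failures like a 404 that will never succeed. A variant that retries only errors you've explicitly marked as transient could look like this (RetryableError and with_selective_retries are names we're inventing for the sketch):

```ruby
# Marker class for errors worth retrying (timeouts, 429s, 5xx).
class RetryableError < StandardError; end

def with_selective_retries(max_attempts: 4, base_sleep: 1)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue RetryableError => e
    # Only transient errors reach here; anything else propagates immediately.
    raise e if attempt >= max_attempts
    sleep(base_sleep * (2 ** (attempt - 1)) + rand * 0.25)
    retry
  end
end
```

You would then raise RetryableError from your fetch layer for 429/5xx responses and let 4xx client errors fail fast.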

Export results (JSONL is the easiest)

Ruby is great for writing JSONL—one JSON object per line.

require "json"

def export_jsonl(rows, path)
  File.open(path, "w") do |f|
    rows.each do |r|
      f.puts(JSON.generate(r))
    end
  end
end

export_jsonl(all, "out.jsonl")
puts "wrote out.jsonl"
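For incremental crawls you'll usually want to read the file back too; JSONL makes that one parse per line (import_jsonl is just a name we're choosing):

```ruby
require "json"

# Read a JSONL file back into an array of hashes, one object per line.
def import_jsonl(path)
  File.foreach(path).map { |line| JSON.parse(line, symbolize_names: true) }
end
```

Because each row is independent, a partially-written file from a crashed run is still mostly readable.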

Proxy rotation patterns (practical advice)

If you’re scraping at meaningful scale, you’ll need to think about IP reputation and request distribution.

Two common patterns:

  1. Single stable proxy endpoint

    • easiest integration
    • your provider rotates behind the scenes (if supported)
  2. Pool rotation

    • you keep a list of proxy endpoints
    • pick one per request

A minimal pool rotator:

PROXIES = [
  { host: "gw1.example.com", port: 8080, username: "u", password: "p" },
  { host: "gw2.example.com", port: 8080, username: "u", password: "p" }
]

def pick_proxy
  PROXIES.sample
end

# Note: this picks one proxy for the Fetcher's lifetime; to rotate per
# request, build a new Fetcher (or pass a proxy per call).
fetcher = Fetcher.new(proxy: pick_proxy)
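PROXIES.sample can also hit the same endpoint several times in a row. If you want an even spread, a small round-robin pool works (ProxyPool is our own class, not part of HTTParty):

```ruby
# Round-robin rotation: deterministic, even distribution across the pool.
class ProxyPool
  def initialize(proxies)
    @proxies = proxies
    @index = 0
    @mutex = Mutex.new # safe if you later fetch from multiple threads
  end

  def next_proxy
    @mutex.synchronize do
      proxy = @proxies[@index % @proxies.size]
      @index += 1
      proxy
    end
  end
end
```

Usage: build one pool at startup and call pool.next_proxy before each request.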

Don’t rotate “because you can”. Rotate when you see:

  • rising 403 rates
  • repeated captchas
  • sudden response-body changes

Ethics + legality (quick reality check)

Before you scrape:

  • respect robots.txt where appropriate for your use case
  • don’t collect personal data you don’t need
  • throttle requests (be polite)
  • read the site’s Terms (especially for commercial use)
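If you want a programmatic guard, even a rough robots.txt check beats nothing. This sketch only handles bare Disallow: prefix rules; a real parser also handles user-agent groups, Allow rules, and wildcards, so treat it as a starting point:

```ruby
# Very rough robots.txt check: does any "Disallow:" rule prefix-match
# this path? Ignores user-agent groups, Allow rules, and wildcards.
def disallowed?(robots_txt, path)
  robots_txt.each_line.any? do |line|
    rule = line[/\ADisallow:\s*(\S+)/, 1]
    rule && path.start_with?(rule)
  end
end
```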

Summary

If you remember nothing else:

  • timeouts + headers prevent a ton of failures
  • Nokogiri selectors should be stable and meaning-based
  • pagination needs loop protection (seen set)
  • retries turn flaky scripts into pipelines

From here, the next step is to adapt selectors to your target site and add a storage layer (SQLite/Postgres) for incremental crawls.


Related guides

How to Scrape Shopify Stores: Products, Prices, and Inventory (2026)
Practical Shopify scraping patterns: discover product JSON endpoints, paginate collections, extract variants + availability, and reduce blocks while staying ethical.
How to Scrape E-Commerce Websites: A Practical Guide
A practical playbook for ecommerce scraping: category discovery, pagination patterns, product detail extraction, variants, rate limits, retries, and proxy-backed fetching with ProxiesAPI.
How to Scrape Google Search Results with Python (Without Getting Blocked)
A practical SERP scraping workflow in Python: handle consent/interstitials, parse organic results defensively, rotate IPs, backoff on blocks, and export clean results. Includes a ProxiesAPI-backed fetch layer.
Web Scraping with Java: JSoup + HttpClient Guide (2026)
A practical end-to-end Java web scraping tutorial using Java 21+: HttpClient for requests, JSoup for parsing, pagination loops, retries/backoff, and proxy rotation patterns.