Web Scraping with Ruby: Nokogiri + HTTParty Tutorial (2026)
If you search for “web scraping with ruby”, you’ll find a lot of tiny examples… and not many end-to-end tutorials you can actually reuse.
This guide is different: we’ll build a small but production-shaped Ruby scraper using two battle-tested gems:
- HTTParty for HTTP requests
- Nokogiri for HTML parsing
By the end, you’ll know how to:
- fetch pages with proper headers + timeouts
- parse real HTML with stable selectors
- follow pagination safely
- add retries/backoff
- (optionally) route requests through a proxy
We’ll use a simple target site structure in examples so you can copy/paste and adapt.
Once you go beyond a few pages, the hardest part is often reliability: timeouts, blocks, and inconsistent responses. A proxy layer (like ProxiesAPI) can help keep your fetch step predictable.
Prerequisites
- Ruby 3.x
- Bundler
Create a new folder:
mkdir ruby-scraper
cd ruby-scraper
bundle init
Add gems to your Gemfile:
# Gemfile
source "https://rubygems.org"
gem "httparty"
gem "nokogiri"
gem "addressable"
Install:
bundle install
A minimal, correct fetch function (headers + timeouts)
A lot of scraping issues come from a “naive” HTTP client:
- no timeouts → scripts hang forever
- no headers → you get bot-ish responses
- no retry → any transient failure kills the run
Start here.
require "httparty"

class Fetcher
  DEFAULT_HEADERS = {
    "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" => "en-US,en;q=0.9"
  }.freeze

  def initialize(proxy: nil)
    @proxy = proxy
  end

  def get(url, timeout: 20)
    options = {
      headers: DEFAULT_HEADERS,
      timeout: timeout
    }

    # HTTParty takes the proxy as separate host/port/user/pass options
    if @proxy
      options[:http_proxyaddr] = @proxy[:host]
      options[:http_proxyport] = @proxy[:port]
      options[:http_proxyuser] = @proxy[:username] if @proxy[:username]
      options[:http_proxypass] = @proxy[:password] if @proxy[:password]
    end

    resp = HTTParty.get(url, options)
    raise "HTTP #{resp.code} for #{url}" if resp.code >= 400

    resp.body
  end
end
Proxy configuration
If your proxy provider gives you a URL like:
http://user:pass@gw.proxiesapi.com:8080
You can split it into a hash:
proxy = {
  host: "gw.proxiesapi.com",
  port: 8080,
  username: "user",
  password: "pass"
}

fetcher = Fetcher.new(proxy: proxy)
Keep proxy usage optional. Many sites don’t need it at small scale.
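If you'd rather not split the URL by hand, Ruby's standard `URI` library can do it. The helper name `proxy_from_url` below is just an illustration, not part of any gem:

```ruby
require "uri"

# Split a proxy URL like "http://user:pass@gw.proxiesapi.com:8080"
# into the hash shape our Fetcher expects.
def proxy_from_url(url)
  u = URI.parse(url)
  {
    host: u.host,
    port: u.port,
    username: u.user,
    password: u.password
  }
end

proxy = proxy_from_url("http://user:pass@gw.proxiesapi.com:8080")
# => { host: "gw.proxiesapi.com", port: 8080, username: "user", password: "pass" }
```

This also makes it trivial to read the proxy from an environment variable instead of hard-coding credentials.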
Parse HTML with Nokogiri
Nokogiri shines when you:
- use CSS selectors (simple and readable)
- keep parsing in small functions
- return plain hashes
Example:
require "nokogiri"

def parse_listing_page(html)
  doc = Nokogiri::HTML(html)
  products = []

  doc.css(".product-card").each do |card|
    title = card.at_css(".title")&.text&.strip
    price = card.at_css(".price")&.text&.strip
    url   = card.at_css("a")&.[]("href")
    next if title.nil? || url.nil?

    products << {
      title: title,
      price: price,
      url: url
    }
  end

  next_href = doc.at_css("a.next")&.[]("href")
  [products, next_href]
end
The selectors (.product-card, .title, .price, a.next) are placeholders. On a real site you'd inspect the HTML and choose stable patterns:
- IDs or data-* attributes
- semantic tags inside a known container
- avoid brittle "utility class soup"
Handle pagination safely
Most scrapers break pagination in one of three ways:
- follow “Next” links forever
- accidentally revisit pages (loops)
- hammer pages too fast
Here’s a safe crawler loop:
require "addressable/uri"

def absolutize(base_url, href)
  return nil if href.nil?
  Addressable::URI.join(base_url, href).to_s
end

base_url = "https://example.com/catalog"
url = base_url
fetcher = Fetcher.new # or Fetcher.new(proxy: proxy)

seen = {}
all = []
max_pages = 10

(1..max_pages).each do |page|
  break if url.nil?
  break if seen[url] # loop protection: never revisit a page
  seen[url] = true

  html = fetcher.get(url)
  items, next_href = parse_listing_page(html)
  puts "page=#{page} items=#{items.size} url=#{url}"
  all.concat(items)

  # polite delay
  sleep(rand(0.6..1.4))
  url = absolutize(base_url, next_href)
end

puts "total items: #{all.size}"
Add retries (with backoff)
Real scraping involves transient failures:
- timeouts
- 429 rate limits
- random 5xx
A simple retry wrapper makes your scraper much more resilient.
def with_retries(max_attempts: 4, base_sleep: 1)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue => e
    raise e if attempt >= max_attempts
    sleep_time = base_sleep * (2**(attempt - 1)) + rand * 0.25
    warn "retry attempt=#{attempt} error=#{e.class}: #{e.message} sleep=#{sleep_time.round(2)}s"
    sleep(sleep_time)
    retry
  end
end
Use it around fetch calls:
html = with_retries { fetcher.get(url, timeout: 20) }
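One refinement worth considering: not every error deserves a retry. A 404 will never succeed no matter how long you wait, while 429s and 5xx errors usually will. A sketch using a hypothetical `HttpError` class (you'd raise it from `Fetcher#get` instead of a bare string):

```ruby
# Hypothetical error class carrying the HTTP status, so the retry
# logic can decide whether retrying makes sense.
class HttpError < StandardError
  attr_reader :code

  def initialize(code, url)
    @code = code
    super("HTTP #{code} for #{url}")
  end
end

RETRYABLE_CODES = [429, 500, 502, 503, 504].freeze

def retryable?(error)
  # Timeouts and other non-HTTP errors are transient: retry them.
  return true unless error.is_a?(HttpError)
  RETRYABLE_CODES.include?(error.code)
end

begin
  raise HttpError.new(404, "https://example.com/missing")
rescue => e
  puts retryable?(e) ? "retry" : "give up" # prints "give up"
end
```

You would then check `retryable?(e)` inside the `rescue` branch of `with_retries` and re-raise immediately when it returns false.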
Export results (JSONL is the easiest)
Ruby is great for writing JSONL—one JSON object per line.
require "json"

def export_jsonl(rows, path)
  File.open(path, "w") do |f|
    rows.each do |r|
      f.puts(JSON.generate(r))
    end
  end
end

export_jsonl(all, "out.jsonl")
puts "wrote out.jsonl"
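Reading JSONL back is just as simple, which is what makes the format handy for incremental runs (the `read_jsonl` helper below is a sketch, not a stdlib function):

```ruby
require "json"

# Read a JSONL file back into an array of hashes with symbol keys.
# Streams line by line, so large files don't need to fit in memory twice.
def read_jsonl(path)
  File.foreach(path).map { |line| JSON.parse(line, symbolize_names: true) }
end

File.write("demo.jsonl", %({"title":"Widget","price":"$9.99"}\n))
rows = read_jsonl("demo.jsonl")
puts rows.first[:title] # "Widget"
```

On a later run you can load the previous output, build a set of already-seen URLs, and skip re-scraping them.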
Proxy rotation patterns (practical advice)
If you’re scraping at meaningful scale, you’ll need to think about IP reputation and request distribution.
Two common patterns:
- Single stable proxy endpoint
  - easiest integration
  - your provider rotates behind the scenes (if supported)
- Pool rotation
  - you keep a list of proxy endpoints
  - pick one per request
A minimal pool rotator:
PROXIES = [
  { host: "gw1.example.com", port: 8080, username: "u", password: "p" },
  { host: "gw2.example.com", port: 8080, username: "u", password: "p" }
].freeze

def pick_proxy
  PROXIES.sample
end
fetcher = Fetcher.new(proxy: pick_proxy)
Don’t rotate “because you can”. Rotate when you see:
- rising 403 rates
- repeated captchas
- sudden response-body changes
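To act on those signals, you need to track them per proxy. Here is a minimal sketch of a failure tracker; the threshold and the reset-on-success policy are illustrative assumptions, not provider recommendations:

```ruby
# Tracks consecutive 403s per proxy host so the scraper can stop
# using a host that appears to be blocked.
class ProxyHealth
  def initialize
    @failures = Hash.new(0)
  end

  def record(proxy, code)
    host = proxy[:host]
    if code == 403
      @failures[host] += 1
    else
      @failures[host] = 0 # any success resets the counter
    end
  end

  def healthy?(proxy, threshold: 3)
    @failures[proxy[:host]] < threshold
  end
end

health = ProxyHealth.new
proxy = { host: "gw1.example.com", port: 8080 }
3.times { health.record(proxy, 403) }
puts health.healthy?(proxy) # prints "false"
```

Combine this with `pick_proxy` by sampling only from the proxies that are still healthy.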
Ethics + legality (quick reality check)
Before you scrape:
- respect robots.txt where appropriate for your use case
- don’t collect personal data you don’t need
- throttle requests (be polite)
- read the site’s Terms (especially for commercial use)
Summary
If you remember nothing else:
- timeouts + headers prevent a ton of failures
- Nokogiri selectors should be stable and meaning-based
- pagination needs loop protection (a seen set)
- retries turn flaky scripts into pipelines
From here, the next step is to adapt selectors to your target site and add a storage layer (SQLite/Postgres) for incremental crawls.