Web Scraping with Java: JSoup + HttpClient Guide (2026)

If you’re building a scraper that needs to run for months, integrate into back-end services, and handle high throughput, Java is a great choice.

This guide is a practical, end-to-end walkthrough of web scraping with Java:

  • Use Java HttpClient (built-in) for HTTP
  • Use JSoup for HTML parsing
  • Add timeouts, retries, and backoff
  • Implement pagination
  • Add a proxy rotation abstraction (and where ProxiesAPI fits)

By the end, you’ll have a minimal but production-shaped scraper you can adapt to most sites.


What we’re building

We’ll implement a scraper with:

  1. Fetcher (HTTP client with headers + timeouts)
  2. Retry utility (backoff + jitter)
  3. Parser (JSoup selectors)
  4. Crawler (pagination loop + dedupe)
  5. Exporter (CSV)

Even if your target site changes, this structure holds.


Dependencies

If you use Maven:

<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
  </dependency>
</dependencies>

JSoup does HTML parsing and also supports simple requests, but for control (timeouts, proxies, headers) we’ll use Java’s HttpClient.
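If you use Gradle instead, the equivalent dependency declaration (same artifact and version as the Maven snippet above) looks like:

```
dependencies {
    implementation("org.jsoup:jsoup:1.17.2")
}
```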


Step 1: Create an HttpClient fetcher

Java 11+ ships with a modern HTTP client.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Fetcher {
  private final HttpClient client;

  public Fetcher() {
    this.client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .followRedirects(HttpClient.Redirect.NORMAL)
        .build();
  }

  public String get(String url) throws IOException, InterruptedException {
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create(url))
        .timeout(Duration.ofSeconds(30))
        .header("User-Agent", "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)")
        .header("Accept", "text/html,application/xhtml+xml")
        .GET()
        .build();

    HttpResponse<String> res = client.send(req, HttpResponse.BodyHandlers.ofString());

    int code = res.statusCode();
    if (code >= 400) {
      throw new IOException("HTTP " + code + " for " + url);
    }

    return res.body();
  }
}

Why User-Agent matters

Many sites treat the default Java UA as suspicious. A boring UA reduces friction.


Step 2: Retries with exponential backoff + jitter

Scrapers fail for boring reasons:

  • transient 503s
  • timeouts
  • occasional 429 throttling

Backoff is the difference between “works once” and “works nightly”.

import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {
  public static <T> T withBackoff(Callable<T> fn, int maxAttempts) throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fn.call();
      } catch (Exception e) {
        if (attempt == maxAttempts) throw e;

        long baseMs = (long) Math.min(20_000, Math.pow(2, attempt - 1) * 1000);
        long jitter = ThreadLocalRandom.current().nextLong(0, 400);
        long sleepMs = baseMs + jitter;

        System.out.println("attempt " + attempt + " failed: " + e.getMessage() + " — sleep " + sleepMs + "ms");
        Thread.sleep(sleepMs);
      }
    }
    throw new IllegalStateException("unreachable");
  }
}

Step 3: Parse HTML with JSoup selectors

JSoup uses the same CSS selector syntax you'd use in browser devtools.

Example: parse a list page with cards like:

<article class="card">
  <a class="title" href="/item/123">Item title</a>
  <span class="price">$19.99</span>
</article>

Parser:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class Parser {

  public record Item(String title, String url, String price) {}

  public static List<Item> parseItems(String html, String baseUrl) {
    Document doc = Jsoup.parse(html, baseUrl);

    List<Item> out = new ArrayList<>();
    for (Element card : doc.select("article.card")) {
      Element a = card.selectFirst("a.title");
      if (a == null) continue;

      String title = a.text();
      String url = a.absUrl("href");

      Element priceEl = card.selectFirst("span.price");
      String price = priceEl != null ? priceEl.text() : null;

      out.add(new Item(title, url, price));
    }
    return out;
  }

  public static String findNextPage(String html, String baseUrl) {
    Document doc = Jsoup.parse(html, baseUrl);

    // common patterns
    Element relNext = doc.selectFirst("a[rel=next]");
    if (relNext != null) return relNext.absUrl("href");

    // fallback: link text contains 'Next'
    for (Element a : doc.select("a[href]")) {
      if (a.text().toLowerCase().contains("next")) {
        return a.absUrl("href");
      }
    }

    return null;
  }
}

Step 4: Pagination loop + dedupe

Pagination is almost always “fetch → parse → next link”.

Key rules:

  • cap pages while developing
  • dedupe by canonical URL
  • stop if next URL is null

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Crawler {

  public static void main(String[] args) throws Exception {
    String startUrl = args.length > 0 ? args[0] : "https://example.com/list";

    Fetcher fetcher = new Fetcher();

    Set<String> seen = new HashSet<>();

    String url = startUrl;
    int page = 0;

    while (url != null && page < 5) {
      page++;

      // 'url' is reassigned each iteration, so copy it to an effectively
      // final local before capturing it in the lambda — otherwise this
      // won't compile
      final String pageUrl = url;
      String html = Retry.withBackoff(() -> fetcher.get(pageUrl), 5);
      List<Parser.Item> items = Parser.parseItems(html, pageUrl);

      int newCount = 0;
      for (Parser.Item it : items) {
        if (seen.add(it.url())) {
          newCount++;
          System.out.println(it.title() + " | " + it.price() + " | " + it.url());
        }
      }

      System.out.println("page " + page + ": items=" + items.size() + " new=" + newCount + " total=" + seen.size());

      url = Parser.findNextPage(html, url);
    }
  }
}

Step 5: Export to CSV

For quick exports, write CSV manually:

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class CsvExport {
  public static void write(String path, List<Parser.Item> items) throws IOException {
    try (FileWriter w = new FileWriter(path)) {
      w.write("title,url,price\n");
      for (Parser.Item it : items) {
        String title = it.title().replace("\"", "\"\"");
        String url = it.url().replace("\"", "\"\"");
        String price = it.price() != null ? it.price().replace("\"", "\"\"") : "";

        w.write("\"" + title + "\",\"" + url + "\",\"" + price + "\"\n");
      }
    }
  }
}

Proxy rotation patterns (and where ProxiesAPI fits)

For many sites, you’ll eventually need some form of proxy strategy.

A clean way is to abstract it:

  • ProxyProvider returns the next proxy endpoint
  • Fetcher uses that proxy for a request

Basic structure

public interface ProxyProvider {
  String nextProxy(); // e.g., http://user:pass@host:port
}
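A minimal implementation could rotate through a fixed list round-robin. A sketch (the interface is repeated so the snippet compiles on its own; the proxy endpoints you'd pass in are placeholders):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

interface ProxyProvider {
  String nextProxy();
}

// Round-robin rotation over a fixed list; thread-safe via AtomicInteger.
class RoundRobinProxyProvider implements ProxyProvider {
  private final List<String> proxies;
  private final AtomicInteger idx = new AtomicInteger();

  RoundRobinProxyProvider(List<String> proxies) {
    if (proxies.isEmpty()) throw new IllegalArgumentException("no proxies");
    this.proxies = List.copyOf(proxies);
  }

  @Override
  public String nextProxy() {
    // floorMod keeps the index valid even after the counter overflows
    return proxies.get(Math.floorMod(idx.getAndIncrement(), proxies.size()));
  }
}
```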

In Java HttpClient, proxies are configured via ProxySelector on the client builder.
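A sketch of that wiring (host and port are placeholders; note that HttpClient ignores `user:pass` embedded in a proxy URL, so authenticated proxies typically need an `Authenticator` on the builder as well):

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;
import java.time.Duration;

public class ProxiedClientFactory {
  // Build a client that routes all its requests through one proxy endpoint.
  public static HttpClient through(String host, int port) {
    return HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
        .build();
  }
}
```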

If you use ProxiesAPI, you often don’t need to manage a list of proxies yourself — you send requests through ProxiesAPI and it handles rotation upstream.

The exact integration depends on your ProxiesAPI plan/endpoint, but the architectural point is:

  • keep proxy logic out of parsing
  • keep it in the fetch layer
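One way to keep that separation is a tiny helper in the fetch layer that wraps the target URL for a rotation-API-style endpoint. The base URL and parameter names below are illustrative placeholders, not ProxiesAPI's documented API — check your dashboard for the real ones:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ApiRouting {
  // Wrap a target URL so the request is routed through the rotation endpoint.
  // apiBase and the query parameter names are placeholders.
  public static String wrap(String apiBase, String key, String targetUrl) {
    return apiBase + "?auth_key=" + key
        + "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
  }
}
```

The rest of the pipeline (parser, crawler, exporter) never sees the wrapping — only the fetch layer changes.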

Practical advice: what usually breaks first

  • No timeouts → your jobs hang forever
  • No backoff → you intensify blocks and get 403/429 storms
  • Parsing too much → keep list pages light; fetch details only when needed
  • No caching → you re-download the same pages nightly
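The caching point is the cheapest win. A minimal disk-cache sketch — assuming storing raw HTML keyed by a hash of the URL is acceptable for your use case; the fetch is passed in as a `Callable` so the cache stays independent of the fetcher:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.concurrent.Callable;

public class HtmlCache {
  private final Path dir;

  public HtmlCache(Path dir) throws Exception {
    this.dir = dir;
    Files.createDirectories(dir);
  }

  // Hash the URL so arbitrary URLs map to safe, fixed-length filenames.
  private Path keyPath(String url) throws Exception {
    byte[] h = MessageDigest.getInstance("SHA-256")
        .digest(url.getBytes(StandardCharsets.UTF_8));
    return dir.resolve(
        Base64.getUrlEncoder().withoutPadding().encodeToString(h) + ".html");
  }

  // Return cached HTML if present; otherwise fetch, store, and return it.
  public String getOrFetch(String url, Callable<String> fetch) throws Exception {
    Path p = keyPath(url);
    if (Files.exists(p)) return Files.readString(p);
    String html = fetch.call();
    Files.writeString(p, html);
    return html;
  }
}
```

Clear the cache directory (or add an age check) when you need fresh data.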

QA checklist

  • your fetcher sets connect+read timeouts
  • you back off on failures
  • selectors are tight (avoid matching the whole page)
  • pagination stops correctly
  • you dedupe

Next upgrades

  • store results in SQLite (resume after crash)
  • concurrency with a fixed worker pool (but keep rate limits)
  • HTML snapshots for selector regression tests
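For the worker-pool upgrade, a sketch of the shape it takes — the `"fetched:"` result here is a stand-in for a real `fetcher.get(url)` call, and you'd still want per-host rate limiting before raising the thread count:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolCrawl {
  // Fetch a batch of URLs with a fixed number of worker threads,
  // preserving input order in the results.
  public static List<String> runAll(List<String> urls, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String url : urls) {
        // placeholder work; swap in Retry.withBackoff(() -> fetcher.get(url), 5)
        futures.add(pool.submit(() -> "fetched:" + url));
      }
      List<String> out = new ArrayList<>();
      for (Future<String> f : futures) out.add(f.get());
      return out;
    } finally {
      pool.shutdown();
    }
  }
}
```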

When Java scrapers scale, ProxiesAPI keeps them stable

Java is excellent for long-running crawlers — but once you increase request volume, blocks and throttling show up. ProxiesAPI adds a resilient proxy + retry layer so your Java scraper can run reliably in production.

Related guides

How to Scrape E-Commerce Websites: A Practical Guide
A practical playbook for ecommerce scraping: category discovery, pagination patterns, product detail extraction, variants, rate limits, retries, and proxy-backed fetching with ProxiesAPI.
How to Scrape Google Search Results with Python (Without Getting Blocked)
A practical SERP scraping workflow in Python: handle consent/interstitials, parse organic results defensively, rotate IPs, backoff on blocks, and export clean results. Includes a ProxiesAPI-backed fetch layer.
Web Scraping with PHP: cURL + DOMDocument Tutorial (2026)
A practical web scraping php starter: fetch HTML with cURL, parse with DOMDocument/XPath, and scale safely with retries and ProxiesAPI.
Web Scraping with Python: The Complete 2026 Tutorial
A from-scratch, production-minded guide to web scraping in Python: requests + BeautifulSoup, pagination, retries, caching, proxies, and a reusable scraper template.