Web Scraping with Java: JSoup + HttpClient Guide (2026)

If you’re building a scraper that needs to run for months, integrate into back-end services, and handle high throughput, Java is a great choice.

This guide is a practical, end-to-end walkthrough of web scraping with Java:

  • Use Java HttpClient (built-in) for HTTP
  • Use JSoup for HTML parsing
  • Add timeouts, retries, and backoff
  • Implement pagination
  • Add a proxy rotation abstraction (and where ProxiesAPI fits)

By the end, you’ll have a minimal but production-shaped scraper you can adapt to most sites.


What we’re building

We’ll implement a scraper with:

  1. Fetcher (HTTP client with headers + timeouts)
  2. Retry utility (backoff + jitter)
  3. Parser (JSoup selectors)
  4. Crawler (pagination loop + dedupe)
  5. Exporter (CSV)

Even if your target site changes, this structure holds.


Dependencies

If you use Maven:

<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
  </dependency>
</dependencies>

JSoup does HTML parsing and also supports simple requests, but for control (timeouts, proxies, headers) we’ll use Java’s HttpClient.
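If you use Gradle instead, the equivalent dependency declaration (same artifact and version as the Maven snippet above) looks like:

```
dependencies {
    implementation("org.jsoup:jsoup:1.17.2")
}
```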


Step 1: Create an HttpClient fetcher

Java 11+ ships with a modern HTTP client.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Fetcher {
  private final HttpClient client;

  public Fetcher() {
    this.client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .followRedirects(HttpClient.Redirect.NORMAL)
        .build();
  }

  public String get(String url) throws IOException, InterruptedException {
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create(url))
        .timeout(Duration.ofSeconds(30))
        .header("User-Agent", "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)")
        .header("Accept", "text/html,application/xhtml+xml")
        .GET()
        .build();

    HttpResponse<String> res = client.send(req, HttpResponse.BodyHandlers.ofString());

    int code = res.statusCode();
    if (code >= 400) {
      throw new IOException("HTTP " + code + " for " + url);
    }

    return res.body();
  }
}

Why User-Agent matters

Many sites treat the default Java UA as suspicious. A boring UA reduces friction.


Step 2: Retries with exponential backoff + jitter

Scrapers fail for boring reasons:

  • transient 503s
  • timeouts
  • occasional 429 throttling

Backoff is the difference between “works once” and “works nightly”.

import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {
  public static <T> T withBackoff(Callable<T> fn, int maxAttempts) throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fn.call();
      } catch (Exception e) {
        if (attempt == maxAttempts) throw e;

        long baseMs = (long) Math.min(20_000, Math.pow(2, attempt - 1) * 1000);
        long jitter = ThreadLocalRandom.current().nextLong(0, 400);
        long sleepMs = baseMs + jitter;

        System.out.println("attempt " + attempt + " failed: " + e.getMessage() + " — sleep " + sleepMs + "ms");
        Thread.sleep(sleepMs);
      }
    }
    throw new IllegalStateException("unreachable");
  }
}

Step 3: Parse HTML with JSoup selectors

JSoup uses the same CSS selector syntax you'd use in browser devtools.

Example: parse a list page with cards like:

<article class="card">
  <a class="title" href="/item/123">Item title</a>
  <span class="price">$19.99</span>
</article>

Parser:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class Parser {

  public record Item(String title, String url, String price) {}

  public static List<Item> parseItems(String html, String baseUrl) {
    Document doc = Jsoup.parse(html, baseUrl);

    List<Item> out = new ArrayList<>();
    for (Element card : doc.select("article.card")) {
      Element a = card.selectFirst("a.title");
      if (a == null) continue;

      String title = a.text();
      String url = a.absUrl("href");

      Element priceEl = card.selectFirst("span.price");
      String price = priceEl != null ? priceEl.text() : null;

      out.add(new Item(title, url, price));
    }
    return out;
  }

  public static String findNextPage(String html, String baseUrl) {
    Document doc = Jsoup.parse(html, baseUrl);

    // common patterns
    Element relNext = doc.selectFirst("a[rel=next]");
    if (relNext != null) return relNext.absUrl("href");

    // fallback: link text contains 'Next'
    for (Element a : doc.select("a[href]")) {
      if (a.text().toLowerCase().contains("next")) {
        return a.absUrl("href");
      }
    }

    return null;
  }
}

Step 4: Pagination loop + dedupe

Pagination is almost always “fetch → parse → next link”.

Key rules:

  • cap pages while developing
  • dedupe by canonical URL
  • stop if next URL is null

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Crawler {

  public static void main(String[] args) throws Exception {
    String startUrl = args.length > 0 ? args[0] : "https://example.com/list";

    Fetcher fetcher = new Fetcher();

    Set<String> seen = new HashSet<>();

    String url = startUrl;
    int page = 0;

    while (url != null && page < 5) {
      page++;

      // 'url' is reassigned each iteration, so copy it to an effectively
      // final local before capturing it in the lambda — otherwise this
      // won't compile
      final String pageUrl = url;
      String html = Retry.withBackoff(() -> fetcher.get(pageUrl), 5);
      List<Parser.Item> items = Parser.parseItems(html, pageUrl);

      int newCount = 0;
      for (Parser.Item it : items) {
        if (seen.add(it.url())) {
          newCount++;
          System.out.println(it.title() + " | " + it.price() + " | " + it.url());
        }
      }

      System.out.println("page " + page + ": items=" + items.size() + " new=" + newCount + " total=" + seen.size());

      url = Parser.findNextPage(html, url);
    }
  }
}

Step 5: Export to CSV

For quick exports, write CSV manually:

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class CsvExport {
  public static void write(String path, List<Parser.Item> items) throws IOException {
    try (FileWriter w = new FileWriter(path)) {
      w.write("title,url,price\n");
      for (Parser.Item it : items) {
        String title = it.title().replace("\"", "\"\"");
        String url = it.url().replace("\"", "\"\"");
        String price = it.price() != null ? it.price().replace("\"", "\"\"") : "";

        w.write("\"" + title + "\",\"" + url + "\",\"" + price + "\"\n");
      }
    }
  }
}

Proxy rotation patterns (and where ProxiesAPI fits)

For many sites, you’ll eventually need some form of proxy strategy.

A clean way is to abstract it:

  • ProxyProvider returns the next proxy endpoint
  • Fetcher uses that proxy for a request

Basic structure

public interface ProxyProvider {
  String nextProxy(); // e.g., http://user:pass@host:port
}
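A minimal implementation could rotate through a fixed list round-robin. A sketch (the interface is repeated so the snippet compiles on its own; the proxy endpoints you'd pass in are placeholders):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

interface ProxyProvider {
  String nextProxy();
}

// Round-robin rotation over a fixed list; thread-safe via AtomicInteger.
class RoundRobinProxyProvider implements ProxyProvider {
  private final List<String> proxies;
  private final AtomicInteger idx = new AtomicInteger();

  RoundRobinProxyProvider(List<String> proxies) {
    if (proxies.isEmpty()) throw new IllegalArgumentException("no proxies");
    this.proxies = List.copyOf(proxies);
  }

  @Override
  public String nextProxy() {
    // floorMod keeps the index valid even after the counter overflows
    return proxies.get(Math.floorMod(idx.getAndIncrement(), proxies.size()));
  }
}
```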

In Java HttpClient, proxies are configured via ProxySelector on the client builder.
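A sketch of that wiring (host and port are placeholders; note that HttpClient ignores `user:pass` embedded in a proxy URL, so authenticated proxies typically need an `Authenticator` on the builder as well):

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;
import java.time.Duration;

public class ProxiedClientFactory {
  // Build a client that routes all its requests through one proxy endpoint.
  public static HttpClient through(String host, int port) {
    return HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
        .build();
  }
}
```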

If you use ProxiesAPI, you often don’t need to manage a list of proxies yourself — you send requests through ProxiesAPI and it handles rotation upstream.

The exact integration depends on your ProxiesAPI plan/endpoint, but the architectural point is:

  • keep proxy logic out of parsing
  • keep it in the fetch layer
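One way to keep that separation is a tiny helper in the fetch layer that wraps the target URL for a rotation-API-style endpoint. The base URL and parameter names below are illustrative placeholders, not ProxiesAPI's documented API — check your dashboard for the real ones:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ApiRouting {
  // Wrap a target URL so the request is routed through the rotation endpoint.
  // apiBase and the query parameter names are placeholders.
  public static String wrap(String apiBase, String key, String targetUrl) {
    return apiBase + "?auth_key=" + key
        + "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
  }
}
```

The rest of the pipeline (parser, crawler, exporter) never sees the wrapping — only the fetch layer changes.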

Practical advice: what usually breaks first

  • No timeouts → your jobs hang forever
  • No backoff → you intensify blocks and get 403/429 storms
  • Parsing too much → keep list pages light; fetch details only when needed
  • No caching → you re-download the same pages nightly
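The caching point is the cheapest win. A minimal disk-cache sketch — assuming storing raw HTML keyed by a hash of the URL is acceptable for your use case; the fetch is passed in as a `Callable` so the cache stays independent of the fetcher:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.concurrent.Callable;

public class HtmlCache {
  private final Path dir;

  public HtmlCache(Path dir) throws Exception {
    this.dir = dir;
    Files.createDirectories(dir);
  }

  // Hash the URL so arbitrary URLs map to safe, fixed-length filenames.
  private Path keyPath(String url) throws Exception {
    byte[] h = MessageDigest.getInstance("SHA-256")
        .digest(url.getBytes(StandardCharsets.UTF_8));
    return dir.resolve(
        Base64.getUrlEncoder().withoutPadding().encodeToString(h) + ".html");
  }

  // Return cached HTML if present; otherwise fetch, store, and return it.
  public String getOrFetch(String url, Callable<String> fetch) throws Exception {
    Path p = keyPath(url);
    if (Files.exists(p)) return Files.readString(p);
    String html = fetch.call();
    Files.writeString(p, html);
    return html;
  }
}
```

Clear the cache directory (or add an age check) when you need fresh data.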

QA checklist

  • your fetcher sets connect+read timeouts
  • you back off on failures
  • selectors are tight (avoid matching the whole page)
  • pagination stops correctly
  • you dedupe

Next upgrades

  • store results in SQLite (resume after crash)
  • concurrency with a fixed worker pool (but keep rate limits)
  • HTML snapshots for selector regression tests
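For the worker-pool upgrade, a sketch of the shape it takes — the `"fetched:"` result here is a stand-in for a real `fetcher.get(url)` call, and you'd still want per-host rate limiting before raising the thread count:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolCrawl {
  // Fetch a batch of URLs with a fixed number of worker threads,
  // preserving input order in the results.
  public static List<String> runAll(List<String> urls, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String url : urls) {
        // placeholder work; swap in Retry.withBackoff(() -> fetcher.get(url), 5)
        futures.add(pool.submit(() -> "fetched:" + url));
      }
      List<String> out = new ArrayList<>();
      for (Future<String> f : futures) out.add(f.get());
      return out;
    } finally {
      pool.shutdown();
    }
  }
}
```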

When Java scrapers scale, ProxiesAPI keeps them stable

Java is excellent for long-running crawlers — but once you increase request volume, blocks and throttling show up. ProxiesAPI adds a resilient proxy + retry layer so your Java scraper can run reliably in production.

Related guides

How to Scrape E-Commerce Websites: A Practical Guide
A practical playbook for ecommerce scraping: category discovery, pagination patterns, product detail extraction, variants, rate limits, retries, and proxy-backed fetching with ProxiesAPI.
How to Scrape Google Search Results with Python (Without Getting Blocked)
A practical SERP scraping workflow in Python: handle consent/interstitials, parse organic results defensively, rotate IPs, backoff on blocks, and export clean results. Includes a ProxiesAPI-backed fetch layer.
Web Scraping with PHP: cURL + DOMDocument Tutorial (2026)
A practical web scraping php starter: fetch HTML with cURL, parse with DOMDocument/XPath, and scale safely with retries and ProxiesAPI.
Web Scraping with Python: The Complete 2026 Tutorial
A from-scratch, production-minded guide to web scraping in Python: requests + BeautifulSoup, pagination, retries, caching, proxies, and a reusable scraper template.