Web Scraping with Java: JSoup + HttpClient Guide

Jun 20, 2026 · guide · #web scraping java, #java, #jsoup, #httpclient, #html parsing, #proxiesapi

If you searched for "web scraping Java", you probably do not need a giant crawling framework on day one. Most teams can get very far with two pieces:

HttpClient for fetching pages
JSoup for parsing HTML

That combination is fast to ship, easy to test, and much easier to maintain than a browser-first stack for ordinary HTML pages.

This guide shows how to:

fetch HTML with Java HttpClient
parse pages with JSoup selectors
follow pagination
retry transient failures
add a proxy wrapper when the scraper grows up

The stack in one sentence

Use HttpClient for network control and JSoup for DOM parsing.

JSoup can fetch pages on its own through Jsoup.connect, but splitting the work between the two tools is usually the better choice:

Tool	Best at	Why it matters
`HttpClient`	timeouts, headers, redirects, transport control	production fetch behavior
`JSoup`	parsing and CSS selectors	fast, readable extraction code

If the target site is mostly server-rendered HTML, this is the sweet spot.

Maven dependency

<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
  </dependency>
</dependencies>

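The rest of the guide targets Java 21. HttpClient has shipped with the JDK since Java 11, but the record used in Step 3 needs Java 16 or newer; on older versions, replace it with a plain class.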

Step 1: Create a real fetcher

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Fetcher {
  private final HttpClient client;

  public Fetcher() {
    this.client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .followRedirects(HttpClient.Redirect.NORMAL)
        .build();
  }

  public String get(String url) throws IOException, InterruptedException {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(url))
        .timeout(Duration.ofSeconds(30))
        .header("User-Agent", "Mozilla/5.0 (compatible; ProxiesAPI-Java-Guide/1.0)")
        .header("Accept", "text/html,application/xhtml+xml")
        .GET()
        .build();

    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() >= 400) {
      throw new IOException("HTTP " + response.statusCode() + " for " + url);
    }
    return response.body();
  }
}

This is the boring code that keeps scrapers alive: explicit timeout, explicit headers, explicit error handling.
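
One way to check this fetch behavior without touching a real site is to point the same kind of request at a throwaway local server. The sketch below is an assumption-laden illustration, not part of the original guide: the class name LocalFetchCheck and the /list path are hypothetical, and it uses the JDK's built-in com.sun.net.httpserver.HttpServer as the stub. The request code mirrors Fetcher.get.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical helper: serves one HTML page on loopback and fetches it back.
final class LocalFetchCheck {
  static String fetchFromStub(String html) throws IOException, InterruptedException {
    // Port 0 asks the OS for any free port.
    HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
    server.createContext("/list", exchange -> {
      byte[] body = html.getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start();
    try {
      String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/list";
      HttpClient client = HttpClient.newBuilder()
          .connectTimeout(Duration.ofSeconds(5))
          .build();
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(url))
          .timeout(Duration.ofSeconds(5))
          .header("Accept", "text/html,application/xhtml+xml")
          .GET()
          .build();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      // Same rule as Fetcher.get: treat 4xx and 5xx as failures.
      if (response.statusCode() >= 400) {
        throw new IOException("HTTP " + response.statusCode() + " for " + url);
      }
      return response.body();
    } finally {
      server.stop(0); // always release the port
    }
  }
}
```

Changing the handler to return 503 or to sleep past the timeout is a cheap way to see how the error path behaves before pointing the scraper at a real target.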

Step 2: Add retries with backoff

Web scraping breaks on ordinary infrastructure problems all the time: 429s, slow responses, temporary 503s, and the occasional reset.

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {
  public static <T> T withBackoff(Callable<T> fn, int maxAttempts) throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fn.call();
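      } catch (InterruptedException e) {
        // Never retry an interrupt: restore the flag and let the caller stop.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        // Out of attempts: surface the last failure to the caller.
        if (attempt == maxAttempts) throw e;

        // Exponential backoff (1s, 2s, 4s, ...) capped at 20s, plus 0-399 ms of random jitter.
        long sleepMs = (long) Math.min(20_000, Math.pow(2, attempt - 1) * 1000)
            + ThreadLocalRandom.current().nextLong(0, 400);

        Thread.sleep(sleepMs);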
      }
    }
    throw new IllegalStateException("unreachable");
  }
}
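
To make the delay arithmetic concrete, here is a small sketch that isolates the deterministic part of the formula used in withBackoff (jitter left out). The class name BackoffSchedule is hypothetical and exists only for illustration.

```java
// Hypothetical helper: the non-random part of the delay from Retry.withBackoff.
final class BackoffSchedule {
  // attempt is 1-based, matching the loop counter in withBackoff.
  static long baseDelayMs(int attempt) {
    return (long) Math.min(20_000, Math.pow(2, attempt - 1) * 1000);
  }

  public static void main(String[] args) {
    for (int attempt = 1; attempt <= 6; attempt++) {
      System.out.println("after attempt " + attempt + ": " + baseDelayMs(attempt) + " ms");
    }
  }
}
```

The base delays come out as 1000, 2000, 4000, 8000, and 16000 ms, and the 20000 ms cap takes over from the sixth attempt. With maxAttempts set to 5 there are four sleeps, so a call that fails every time spends about 15 seconds sleeping, plus jitter and the time the requests themselves take.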

Without retries, a scraper is just a demo.
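
One refinement worth considering: withBackoff retries on any failure, including a 404 that will never succeed. A possible approach, sketched below under stated assumptions, is to have the fetcher throw an exception that carries the status code and to retry only the cases that tend to be transient. HttpStatusException and RetryPolicy are hypothetical names, not part of the guide's code, and the status rule is deliberately coarse.

```java
import java.io.IOException;

// Hypothetical exception type: an IOException that remembers the HTTP status.
final class HttpStatusException extends IOException {
  private final int statusCode;

  HttpStatusException(int statusCode, String url) {
    super("HTTP " + statusCode + " for " + url);
    this.statusCode = statusCode;
  }

  int statusCode() {
    return statusCode;
  }
}

// Hypothetical policy: decides whether a failure is worth another attempt.
final class RetryPolicy {
  static boolean isRetryable(Exception e) {
    // An interrupt means "stop", so it is never retried.
    if (e instanceof InterruptedException) return false;

    if (e instanceof HttpStatusException) {
      int code = ((HttpStatusException) e).statusCode();
      // Request timeout, rate limiting, and server-side errors are usually transient.
      return code == 408 || code == 429 || code >= 500;
    }

    // Other I/O failures (timeouts, connection resets) are retried;
    // anything else is treated as a programming error.
    return e instanceof IOException;
  }
}
```

To wire this in, Fetcher.get would throw HttpStatusException instead of a plain IOException, and the catch block in withBackoff would rethrow immediately when isRetryable returns false.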

Step 3: Parse HTML with JSoup

Assume the target list page contains cards like:

<article class="card">
  <a class="title" href="/item/123">Example item</a>
  <span class="price">$19.99</span>
</article>

JSoup turns that into very readable extraction logic:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class Parser {
  public record Item(String title, String url, String price) {}

  public static List<Item> parseItems(String html, String baseUrl) {
    Document doc = Jsoup.parse(html, baseUrl);
    List<Item> items = new ArrayList<>();

    for (Element card : doc.select("article.card")) {
      Element titleEl = card.selectFirst("a.title");
      if (titleEl == null) continue;

      Element priceEl = card.selectFirst("span.price");
      items.add(new Item(
          titleEl.text(),
          titleEl.absUrl("href"),
          priceEl != null ? priceEl.text() : null
      ));
    }

    return items;
  }
}

The important habit is to keep parsing methods small and selector-specific. When the site changes, you want one tiny method to update, not one 500-line crawler to debug.
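
The same habit applies to cleaning extracted values. Item keeps the price as raw text; if you need a number, a small dedicated method keeps that logic out of the selector code. The sketch below is an illustration only: PriceParser is a hypothetical name, and it assumes US-style text such as "$19.99" or "$1,299.00". Other locales would need different handling.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Hypothetical helper: turns price text like "$19.99" into a BigDecimal.
final class PriceParser {
  static Optional<BigDecimal> parsePrice(String text) {
    if (text == null) return Optional.empty();

    // Drop currency symbols, thousands separators, and whitespace.
    String cleaned = text.replaceAll("[^0-9.]", "");
    if (cleaned.isEmpty()) return Optional.empty();

    try {
      return Optional.of(new BigDecimal(cleaned));
    } catch (NumberFormatException e) {
      // Leftovers such as "1.2.3" or "." are not valid numbers.
      return Optional.empty();
    }
  }
}
```

Returning Optional instead of throwing lets the crawler log and skip a malformed price without losing the rest of the card.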

Step 4: Follow pagination

Pagination is usually “find next link, fetch next page, stop when null.”

// Goes inside the Parser class from Step 3 (the crawler calls Parser.nextPageUrl).
public static String nextPageUrl(String html, String baseUrl) {
  Document doc = Jsoup.parse(html, baseUrl);

  Element relNext = doc.selectFirst("a[rel=next]");
  if (relNext != null) return relNext.absUrl("href");

  for (Element link : doc.select("a[href]")) {
    if (link.text().toLowerCase().contains("next")) {
      return link.absUrl("href");
    }
  }

  return null;
}

Crawler loop:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Crawler {
  public static void main(String[] args) throws Exception {
    String url = args.length > 0 ? args[0] : "https://example.com/list";
    Fetcher fetcher = new Fetcher();
    Set<String> seen = new HashSet<>();

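    int page = 0;
    // Stop after 5 pages, or earlier when there is no next link
    // (absUrl returns "" rather than null when it cannot resolve one).
    while (url != null && !url.isEmpty() && page < 5) {
      page++;
      // Lambdas can only capture effectively final locals, and url is reassigned below.
      final String pageUrl = url;
      String html = Retry.withBackoff(() -> fetcher.get(pageUrl), 5);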
      List<Parser.Item> items = Parser.parseItems(html, url);

      for (Parser.Item item : items) {
        if (seen.add(item.url())) {
          System.out.println(item.title() + " | " + item.price() + " | " + item.url());
        }
      }

      url = Parser.nextPageUrl(html, url);
    }
  }
}

That gets you a working, paginated scraper without any browser automation.
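
A page cap alone does not protect against a site whose "next" link points back to a page you already fetched. One way to guard against that is to track visited page URLs as well as item URLs. The sketch below isolates just the loop logic so it can be exercised without any network access; PageWalker and its nextPage parameter are hypothetical stand-ins for the fetch-and-parse step.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper: walks "next" links while guarding against loops.
final class PageWalker {
  // nextPage returns the following URL, or null / "" when there is none.
  static List<String> walk(String startUrl, UnaryOperator<String> nextPage, int maxPages) {
    List<String> visited = new ArrayList<>();
    Set<String> seenPages = new HashSet<>();
    String url = startUrl;

    // Stop on: no next link, page cap reached, or a URL we have already visited.
    while (url != null && !url.isEmpty()
        && visited.size() < maxPages
        && seenPages.add(url)) {
      visited.add(url);
      url = nextPage.apply(url);
    }
    return visited;
  }
}
```

In the real crawler, nextPage would fetch the page, print its items, and return Parser.nextPageUrl(html, url).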

Where proxies fit

You do not need proxies for every Java scraper. Add them when:

volume increases
you start crawling multiple sections or regions
the target begins throttling or soft-blocking

One clean pattern is to keep target URL creation separate from transport. For example:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ProxyHelper {
  public static String proxiesApiUrl(String targetUrl) {
    String key = System.getenv("PROXIESAPI_KEY");
    if (key == null || key.isBlank()) return targetUrl;

    return "http://api.proxiesapi.com/?auth_key="
        + URLEncoder.encode(key, StandardCharsets.UTF_8)
        + "&url="
        + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
  }
}

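Then call ProxyHelper.proxiesApiUrl(url) before the request is built, and keep passing the original target URL to the parser as the base URL so that relative links resolve against the target site rather than the proxy endpoint. Your parser stays exactly the same.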
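
To see what the wrapped URL looks like, here is a variant of the same construction with the key passed in as a parameter instead of read from the environment. ProxyUrlDemo and the "demo-key" value are hypothetical and only there to make the output reproducible; the endpoint and query parameters are the ones used in ProxyHelper above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: same URL construction as ProxyHelper, key injected.
final class ProxyUrlDemo {
  static String wrap(String targetUrl, String key) {
    // No key configured: fall back to fetching the target directly.
    if (key == null || key.isBlank()) return targetUrl;

    return "http://api.proxiesapi.com/?auth_key="
        + URLEncoder.encode(key, StandardCharsets.UTF_8)
        + "&url="
        + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String target = "https://example.com/list?page=2";
    // prints http://api.proxiesapi.com/?auth_key=demo-key&url=https%3A%2F%2Fexample.com%2Flist%3Fpage%3D2
    System.out.println(wrap(target, "demo-key"));
  }
}
```

Encoding the target URL matters here: without it, the target's own query string (the ?page=2 part) would be read as parameters of the proxy request.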

When to stay simple vs when to escalate

Approach	Use it when	Avoid it when
`HttpClient + JSoup`	static HTML, pagination, clean selectors	you need heavy client-side JS execution
Browser automation	React-heavy pages, login flows, infinite scroll	plain HTML would do the job

That is the real productivity rule for Java scraping: do not pay for browser complexity unless the target forces you to.

For most first versions, HttpClient + JSoup is enough. It is fast, readable, and easy to productionize once you add retries, logging, and a proxy layer.

When Java scrapers move from scripts to services, ProxiesAPI helps

Java is a strong choice for long-running scrapers. Once request volume goes up, ProxiesAPI gives you a simpler way to add a proxy layer without rebuilding your parser and crawl loops.
