Web Scraping with Java: JSoup + HttpClient Guide
If you searched for "web scraping Java", you probably do not need a giant crawling framework on day one. Most teams can get very far with two pieces:
- HttpClient for fetching pages
- JSoup for parsing HTML
That combination is fast to ship, easy to test, and much easier to maintain than a browser-first stack for ordinary HTML pages.
This guide shows how to:
- fetch HTML with Java HttpClient
- parse pages with JSoup selectors
- follow pagination
- retry transient failures
- add a proxy wrapper when the scraper grows up
The stack in one sentence
Use HttpClient for network control and JSoup for DOM parsing.
That split is usually better than relying on JSoup alone for everything because:
| Tool | Best at | Why it matters |
|---|---|---|
| HttpClient | timeouts, headers, redirects, transport control | production fetch behavior |
| JSoup | parsing and CSS selectors | fast, readable extraction code |
If the target site is mostly server-rendered HTML, this is the sweet spot.
Maven dependency
<dependencies>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.17.2</version>
</dependency>
</dependencies>
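The rest of the guide uses Java 21 syntax, but the core idea works on Java 11+, where java.net.http.HttpClient first became a standard API. The one exception is the record used in the parser, which needs Java 16+; on older versions, replace it with a small immutable class.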
Step 1: Create a real fetcher
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
public class Fetcher {
private final HttpClient client;
public Fetcher() {
this.client = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.followRedirects(HttpClient.Redirect.NORMAL)
.build();
}
public String get(String url) throws IOException, InterruptedException {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(url))
.timeout(Duration.ofSeconds(30))
.header("User-Agent", "Mozilla/5.0 (compatible; ProxiesAPI-Java-Guide/1.0)")
.header("Accept", "text/html,application/xhtml+xml")
.GET()
.build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
if (response.statusCode() >= 400) {
throw new IOException("HTTP " + response.statusCode() + " for " + url);
}
return response.body();
}
}
This is the boring code that keeps scrapers alive: explicit timeout, explicit headers, explicit error handling.
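One limitation of throwing a plain IOException is that callers cannot tell a 404 from a 503 without parsing the message. A minimal sketch of one way around that is an exception that carries the status code. `HttpStatusException` is a hypothetical name introduced here for illustration and is not part of the original fetcher.

```java
import java.io.IOException;

// Hypothetical exception type (an assumption, not from the original guide):
// it keeps the status code so callers can react differently to 404 and 503.
final class HttpStatusException extends IOException {
    private final int statusCode;

    HttpStatusException(int statusCode, String url) {
        // Same message format the fetcher already uses.
        super("HTTP " + statusCode + " for " + url);
        this.statusCode = statusCode;
    }

    int statusCode() {
        return statusCode;
    }

    public static void main(String[] args) {
        try {
            throw new HttpStatusException(503, "https://example.com/list");
        } catch (HttpStatusException e) {
            System.out.println(e.getMessage() + " (status " + e.statusCode() + ")");
        }
    }
}
```

Because it extends IOException, the existing `throws IOException` signature on `get` would not need to change if the fetcher threw this type instead.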
Step 2: Add retries with backoff
Web scraping breaks on ordinary infrastructure problems all the time: 429s, slow responses, temporary 503s, and the occasional reset.
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
public class Retry {
public static <T> T withBackoff(Callable<T> fn, int maxAttempts) throws Exception {
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return fn.call();
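} catch (InterruptedException e) {
// Do not retry after an interrupt: restore the flag and stop.
Thread.currentThread().interrupt();
throw e;
} catch (Exception e) {
if (attempt == maxAttempts) throw e;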
long sleepMs = (long) Math.min(20_000, Math.pow(2, attempt - 1) * 1000)
+ ThreadLocalRandom.current().nextLong(0, 400);
Thread.sleep(sleepMs);
}
}
throw new IllegalStateException("unreachable");
}
}
Without retries, a scraper is just a demo.
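The helper above retries on any exception, which also means it retries permanent failures such as a 404. One possible refinement is to decide per status code and keep the delay calculation in a small, testable method. The sketch below is illustrative only: `RetryPolicy`, `isRetryable`, and `baseDelayMs` are hypothetical names, and the list of retryable statuses is an assumption to adjust for your target.

```java
import java.util.Set;

// Hypothetical helper: the names and the status list are illustrative assumptions.
final class RetryPolicy {
    // Statuses often treated as transient; tune this set for the site you scrape.
    private static final Set<Integer> RETRYABLE = Set.of(408, 429, 500, 502, 503, 504);

    static boolean isRetryable(int status) {
        return RETRYABLE.contains(status);
    }

    // Exponential delay without jitter: 1s, 2s, 4s, ... capped at capMs.
    // This mirrors the formula in Retry.withBackoff, minus the random part.
    static long baseDelayMs(int attempt, long capMs) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        int shift = Math.min(attempt - 1, 30); // avoid overflow for large attempts
        return Math.min(capMs, 1000L << shift);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 6; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + baseDelayMs(attempt, 20_000) + " ms");
        }
        System.out.println("retry 404? " + isRetryable(404));
        System.out.println("retry 503? " + isRetryable(503));
    }
}
```

Keeping the jitter out of this method makes the schedule easy to assert on in a unit test; the random offset can still be added at the call site.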
Step 3: Parse HTML with JSoup
Assume the target list page contains cards like:
<article class="card">
<a class="title" href="/item/123">Example item</a>
<span class="price">$19.99</span>
</article>
JSoup turns that into very readable extraction logic:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;
public class Parser {
public record Item(String title, String url, String price) {}
public static List<Item> parseItems(String html, String baseUrl) {
Document doc = Jsoup.parse(html, baseUrl);
List<Item> items = new ArrayList<>();
for (Element card : doc.select("article.card")) {
Element titleEl = card.selectFirst("a.title");
if (titleEl == null) continue;
Element priceEl = card.selectFirst("span.price");
items.add(new Item(
titleEl.text(),
titleEl.absUrl("href"),
priceEl != null ? priceEl.text() : null
));
}
return items;
}
}
The important habit is to keep parsing methods small and selector-specific. When the site changes, you want one tiny method to update, not one 500-line crawler to debug.
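As one example of a small, single-purpose method, the price text from the sample card (`$19.99`) can be normalized separately from the selector logic. This is a minimal sketch under stated assumptions: `PriceParser` is a hypothetical helper, and it assumes prices use a dot as the decimal separator and commas only as thousands separators.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Hypothetical helper (not part of the original Parser class).
// Assumes "." is the decimal separator and "," only groups thousands.
final class PriceParser {
    static Optional<BigDecimal> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        // Keep digits and the decimal point; drop currency symbols, commas, spaces.
        String cleaned = raw.replaceAll("[^0-9.]", "");
        if (cleaned.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new BigDecimal(cleaned));
        } catch (NumberFormatException e) {
            // Covers leftovers such as "." or "1.2.3".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("$19.99"));
        System.out.println(parse("$1,299.00"));
        System.out.println(parse("Sold out"));
    }
}
```

Returning an Optional keeps "no price on this card" distinct from a parse failure in the crawler itself, which matches the null check already used for `priceEl`.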
Step 4: Follow pagination
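Pagination is usually “find next link, fetch next page, stop when null.” The method below belongs in the Parser class from Step 3, because the crawler calls it as Parser.nextPageUrl.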
public static String nextPageUrl(String html, String baseUrl) {
Document doc = Jsoup.parse(html, baseUrl);
Element relNext = doc.selectFirst("a[rel=next]");
if (relNext != null) return relNext.absUrl("href");
for (Element link : doc.select("a[href]")) {
if (link.text().toLowerCase().contains("next")) {
return link.absUrl("href");
}
}
return null;
}
Crawler loop:
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public class Crawler {
public static void main(String[] args) throws Exception {
String url = args.length > 0 ? args[0] : "https://example.com/list";
Fetcher fetcher = new Fetcher();
Set<String> seen = new HashSet<>();
int page = 0;
while (url != null && page < 5) {
page++;
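// url is reassigned below, and a lambda can only capture an effectively final variable.
final String current = url;
String html = Retry.withBackoff(() -> fetcher.get(current), 5);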
List<Parser.Item> items = Parser.parseItems(html, url);
for (Parser.Item item : items) {
if (seen.add(item.url())) {
System.out.println(item.title() + " | " + item.price() + " | " + item.url());
}
}
url = Parser.nextPageUrl(html, url);
}
}
}
That gets you a working, paginated scraper without any browser automation.
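The loop above caps the crawl at five pages, but it would still refetch pages if a "next" link pointed back to one already visited. A minimal sketch of a guard against that is shown here with the fetch-and-parse step abstracted into a function. `PageWalker` and `walk` are hypothetical names; the `next` function stands in for "fetch this URL and return its next-page URL". The empty-string check is there because JSoup's `absUrl` returns an empty string when it cannot build an absolute URL.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: follows next-page links with a cycle guard.
final class PageWalker {
    // Stops on null, an empty URL, a URL seen before, or after maxPages pages.
    static List<String> walk(String start, Function<String, String> next, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String url = start;
        while (url != null && !url.isEmpty()
                && visited.size() < maxPages
                && seen.add(url)) { // add returns false for a repeated URL
            visited.add(url);
            url = next.apply(url);
        }
        return visited;
    }

    public static void main(String[] args) {
        // Fake site where page 2 links back to page 1.
        Function<String, String> next = u -> u.endsWith("1") ? "/list?page=2" : "/list?page=1";
        System.out.println(walk("/list?page=1", next, 10));
    }
}
```

In the crawler, the same idea amounts to keeping a `Set<String>` of page URLs next to the existing `seen` set of item URLs and breaking out of the loop when `add` returns false.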
Where proxies fit
You do not need proxies for every Java scraper. Add them when:
- volume increases
- you start crawling multiple sections or regions
- the target begins throttling or soft-blocking
One clean pattern is to keep target URL creation separate from transport. For example:
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
public class ProxyHelper {
public static String proxiesApiUrl(String targetUrl) {
String key = System.getenv("PROXIESAPI_KEY");
if (key == null || key.isBlank()) return targetUrl;
return "http://api.proxiesapi.com/?auth_key="
+ URLEncoder.encode(key, StandardCharsets.UTF_8)
+ "&url="
+ URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
}
}
Then call ProxyHelper.proxiesApiUrl(url) before the request is built. Your parser stays exactly the same.
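To make the wrapping concrete, here is a small sketch of what the wrapped URL looks like. `ProxyUrlDemo` and `wrap` are hypothetical names; the method uses the same URL shape as ProxyHelper but takes the key as a parameter (an assumption made here so the example does not depend on an environment variable).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical variant of ProxyHelper with the key passed in explicitly.
final class ProxyUrlDemo {
    static String wrap(String key, String targetUrl) {
        // No key configured: fall back to a direct request.
        if (key == null || key.isBlank()) {
            return targetUrl;
        }
        return "http://api.proxiesapi.com/?auth_key="
                + URLEncoder.encode(key, StandardCharsets.UTF_8)
                + "&url="
                + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String target = "https://example.com/list?page=2";
        // "demo-key" is a placeholder, not a real credential.
        System.out.println(wrap("demo-key", target));
        System.out.println(wrap("", target));
    }
}
```

One detail to watch in the crawl loop: fetch the wrapped URL, but keep passing the original target URL as the base URL to the parsing methods. Otherwise relative links would presumably resolve against the proxy host instead of the real site.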
When to stay simple vs when to escalate
| Approach | Use it when | Avoid it when |
|---|---|---|
| HttpClient + JSoup | static HTML, pagination, clean selectors | you need heavy client-side JS execution |
| Browser automation | React-heavy pages, login flows, infinite scroll | plain HTML would do the job |
That is the real productivity rule for Java scraping: do not pay for browser complexity unless the target forces you to.
For most first versions, HttpClient + JSoup is enough. It is fast, readable, and easy to productionize once you add retries, logging, and a proxy layer.
Java is a strong choice for long-running scrapers. Once request volume goes up, ProxiesAPI gives you a simpler way to add a proxy layer without rebuilding your parser and crawl loops.