Web Scraping with Java: JSoup + HttpClient Guide (2026)
If you’re building a scraper that needs to run for months, integrate into back-end services, and handle high throughput, Java is a great choice.
This guide is a practical, end-to-end walkthrough of web scraping with Java:
- Use Java HttpClient (built-in) for HTTP
- Use JSoup for HTML parsing
- Add timeouts, retries, and backoff
- Implement pagination
- Add a proxy rotation abstraction (and where ProxiesAPI fits)
By the end, you’ll have a minimal but production-shaped scraper you can adapt to most sites.
What we’re building
We’ll implement a scraper with:
- Fetcher (HTTP client with headers + timeouts)
- Retry utility (backoff + jitter)
- Parser (JSoup selectors)
- Crawler (pagination loop + dedupe)
- Exporter (CSV)
Even if your target site changes, this structure holds.
Dependencies
If you use Maven:
<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
  </dependency>
</dependencies>
JSoup does HTML parsing and also supports simple requests, but for control (timeouts, proxies, headers) we’ll use Java’s HttpClient.
Step 1: Create an HttpClient fetcher
Java 11+ ships with a modern HTTP client.
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Fetcher {
    private final HttpClient client;

    public Fetcher() {
        this.client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    public String get(String url) throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)")
                .header("Accept", "text/html,application/xhtml+xml")
                .GET()
                .build();
        HttpResponse<String> res = client.send(req, HttpResponse.BodyHandlers.ofString());
        int code = res.statusCode();
        if (code >= 400) {
            throw new IOException("HTTP " + code + " for " + url);
        }
        return res.body();
    }
}
Why User-Agent matters
Many sites treat the default Java UA as suspicious. A boring UA reduces friction.
Step 2: Retries with exponential backoff + jitter
Scrapers fail for boring reasons:
- transient 503s
- timeouts
- occasional 429 throttling
Backoff is the difference between “works once” and “works nightly”.
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {
    public static <T> T withBackoff(Callable<T> fn, int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fn.call();
            } catch (Exception e) {
                if (attempt == maxAttempts) throw e;
                long baseMs = (long) Math.min(20_000, Math.pow(2, attempt - 1) * 1000);
                long jitter = ThreadLocalRandom.current().nextLong(0, 400);
                long sleepMs = baseMs + jitter;
                System.out.println("attempt " + attempt + " failed: " + e.getMessage() + " — sleep " + sleepMs + "ms");
                Thread.sleep(sleepMs);
            }
        }
        throw new IllegalStateException("unreachable");
    }
}
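As a sanity check, here is the schedule that formula produces. The helper below mirrors the delay math above (base delay only; jitter is added on top at runtime):

```java
// Base delay schedule used by Retry: exponential, capped at 20 seconds.
public class BackoffDemo {
    static long baseDelayMs(int attempt) {
        return (long) Math.min(20_000, Math.pow(2, attempt - 1) * 1000);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 6; attempt++) {
            System.out.println("attempt " + attempt + " -> base delay " + baseDelayMs(attempt) + "ms");
        }
        // attempt 1 -> 1000ms, 2 -> 2000ms, 3 -> 4000ms,
        // attempt 4 -> 8000ms, 5 -> 16000ms, 6 -> 20000ms (cap)
    }
}
```

The cap matters: without it, attempt 10 would wait over eight minutes.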
Step 3: Parse HTML with JSoup selectors
JSoup queries HTML with CSS selectors, the same syntax you use in browser devtools.
Example: parse a list page with cards like:
<article class="card">
  <a class="title" href="/item/123">Item title</a>
  <span class="price">$19.99</span>
</article>
Parser:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class Parser {
    public record Item(String title, String url, String price) {}

    public static List<Item> parseItems(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        List<Item> out = new ArrayList<>();
        for (Element card : doc.select("article.card")) {
            Element a = card.selectFirst("a.title");
            if (a == null) continue;
            String title = a.text();
            String url = a.absUrl("href");
            Element priceEl = card.selectFirst("span.price");
            String price = priceEl != null ? priceEl.text() : null;
            out.add(new Item(title, url, price));
        }
        return out;
    }

    public static String findNextPage(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        // common pattern: <a rel="next">
        Element relNext = doc.selectFirst("a[rel=next]");
        if (relNext != null) return relNext.absUrl("href");
        // fallback: link text contains 'Next'
        for (Element a : doc.select("a[href]")) {
            if (a.text().toLowerCase().contains("next")) {
                return a.absUrl("href");
            }
        }
        return null;
    }
}
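To see the selectors in action, here is a quick standalone check against the sample card markup from above (the base URL is the hypothetical list page, which is what makes `absUrl` resolve the relative `href`):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Run the selectors from Parser against the sample card HTML.
public class ParserDemo {
    public static void main(String[] args) {
        String html = """
            <article class="card">
              <a class="title" href="/item/123">Item title</a>
              <span class="price">$19.99</span>
            </article>""";
        Document doc = Jsoup.parse(html, "https://example.com/list");
        Element a = doc.selectFirst("article.card a.title");
        System.out.println(a.text());         // Item title
        System.out.println(a.absUrl("href")); // https://example.com/item/123
        System.out.println(doc.selectFirst("span.price").text()); // $19.99
    }
}
```

Passing the base URL to `Jsoup.parse` is what lets `absUrl("href")` turn `/item/123` into a full URL; forget it and you get an empty string.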
Step 4: Pagination loop + dedupe
Pagination is almost always “fetch → parse → next link”.
Key rules:
- cap pages while developing
- dedupe by canonical URL
- stop if next URL is null
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Crawler {
    public static void main(String[] args) throws Exception {
        String startUrl = args.length > 0 ? args[0] : "https://example.com/list";
        Fetcher fetcher = new Fetcher();
        Set<String> seen = new HashSet<>();
        String url = startUrl;
        int page = 0;
        while (url != null && page < 5) {
            page++;
            // the lambda below needs an effectively final variable, so copy `url`
            final String pageUrl = url;
            String html = Retry.withBackoff(() -> fetcher.get(pageUrl), 5);
            List<Parser.Item> items = Parser.parseItems(html, pageUrl);
            int newCount = 0;
            for (Parser.Item it : items) {
                if (seen.add(it.url())) {
                    newCount++;
                    System.out.println(it.title() + " | " + it.price() + " | " + it.url());
                }
            }
            System.out.println("page " + page + ": items=" + items.size() + " new=" + newCount + " total=" + seen.size());
            url = Parser.findNextPage(html, pageUrl);
        }
    }
}
Step 5: Export to CSV
For quick exports, write CSV manually:
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class CsvExport {
    public static void write(String path, List<Parser.Item> items) throws IOException {
        try (FileWriter w = new FileWriter(path, StandardCharsets.UTF_8)) {
            w.write("title,url,price\n");
            for (Parser.Item it : items) {
                String title = it.title().replace("\"", "\"\"");
                String url = it.url().replace("\"", "\"\"");
                String price = it.price() != null ? it.price().replace("\"", "\"\"") : "";
                w.write("\"" + title + "\",\"" + url + "\",\"" + price + "\"\n");
            }
        }
    }
}
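The escaping rule doing the work there is easy to get wrong, so here it is isolated: wrap every field in quotes and double any embedded quotes (RFC 4180 style):

```java
// The CSV field escaping used by the exporter above, as a standalone helper.
public class CsvDemo {
    static String csvField(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(csvField("say \"hi\", ok")); // "say ""hi"", ok"
    }
}
```

Quoting every field unconditionally is slightly wasteful but means commas and newlines inside titles can never break a row.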
Proxy rotation patterns (and where ProxiesAPI fits)
For many sites, you’ll eventually need some form of proxy strategy.
A clean way is to abstract it:
- ProxyProvider returns the next proxy endpoint
- Fetcher uses that proxy for the request
Basic structure
public interface ProxyProvider {
    String nextProxy(); // e.g., http://user:pass@host:port
}
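One simple way to satisfy that contract is a round-robin over a fixed list. This sketch is shown standalone (it has the same `nextProxy()` method rather than literally implementing the interface, to keep it self-contained), and the proxy URLs are placeholders:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin proxy provider: cycles through a fixed list.
// AtomicInteger makes it safe to share across a worker pool.
public class RoundRobinProvider {
    private final List<String> proxies;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinProvider(List<String> proxies) {
        this.proxies = List.copyOf(proxies);
    }

    public String nextProxy() {
        return proxies.get(Math.floorMod(next.getAndIncrement(), proxies.size()));
    }

    public static void main(String[] args) {
        RoundRobinProvider p = new RoundRobinProvider(List.of("http://p1:8080", "http://p2:8080"));
        System.out.println(p.nextProxy()); // http://p1:8080
        System.out.println(p.nextProxy()); // http://p2:8080
        System.out.println(p.nextProxy()); // http://p1:8080
    }
}
```

`Math.floorMod` keeps the index valid even after the counter wraps around `Integer.MAX_VALUE` on very long runs.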
In Java HttpClient, proxies are configured via ProxySelector on the client builder.
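Concretely, wiring a single proxy into the client looks like this (host and port are placeholders; substitute whatever your provider gives you):

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;

// Build an HttpClient that routes all requests through one HTTP proxy.
public class ProxyFetcherSketch {
    public static HttpClient withProxy(String host, int port) {
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = withProxy("proxy.example.com", 8080);
        System.out.println(client.proxy().isPresent()); // true
    }
}
```

Note that HttpClient does not read credentials out of a `user:pass@host` URL; for authenticated proxies you set an `Authenticator` on the builder instead.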
If you use ProxiesAPI, you often don’t need to manage a list of proxies yourself — you send requests through ProxiesAPI and it handles rotation upstream.
The exact integration depends on your ProxiesAPI plan/endpoint, but the architectural point is:
- keep proxy logic out of parsing
- keep it in the fetch layer
Practical advice: what usually breaks first
- No timeouts → your jobs hang forever
- No backoff → you hammer an already-throttling site and turn occasional 403/429s into storms
- Parsing too much → keep list pages light; fetch details only when needed
- No caching → you re-download the same pages nightly
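On that last point, a cache can be very simple. Here is a sketch of a naive on-disk page cache; the one-file-per-URL layout, SHA-256 keying, and lack of expiry are my own assumptions for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Optional;

// Naive on-disk page cache: one file per URL, keyed by SHA-256 of the URL.
public class PageCache {
    private final Path dir;

    public PageCache(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static PageCache inTempDir() {
        return new PageCache(tempDir());
    }

    private static Path tempDir() {
        try {
            return Files.createTempDirectory("pagecache");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path pathFor(String url) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-256").digest(url.getBytes(StandardCharsets.UTF_8));
            return dir.resolve(HexFormat.of().formatHex(h) + ".html");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present
        }
    }

    public Optional<String> get(String url) {
        try {
            Path p = pathFor(url);
            return Files.exists(p) ? Optional.of(Files.readString(p)) : Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void put(String url, String html) {
        try {
            Files.writeString(pathFor(url), html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the crawler, check `cache.get(url)` before calling the fetcher and `put` after a successful fetch; add expiry once pages change often enough to matter.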
QA checklist
- your fetcher sets connect+read timeouts
- you back off on failures
- selectors are tight (avoid matching the whole page)
- pagination stops correctly
- you dedupe
Next upgrades
- store results in SQLite (resume after crash)
- concurrency with a fixed worker pool (but keep rate limits)
- HTML snapshots for selector regression tests
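For the worker-pool upgrade, the shape is a fixed `ExecutorService` plus futures collected in order. In this sketch, `fetchOne` is a placeholder instead of a network call; swap in the `Fetcher` + `Retry` combination from earlier:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Fetch a batch of URLs with a fixed-size worker pool.
public class BatchFetch {
    static String fetchOne(String url) {
        return "html-for-" + url; // placeholder for a real fetch with retries
    }

    public static List<String> fetchAll(List<String> urls, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String u : urls) futures.add(pool.submit(() -> fetchOne(u)));
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    out.add(f.get()); // results come back in submission order
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(List.of("a", "b"), 2));
    }
}
```

Keep the pool small (2 to 4 workers) and pair it with per-host rate limiting; a big pool without limits is just a faster way to get blocked.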
Java is excellent for long-running crawlers — but once you increase request volume, blocks and throttling show up. ProxiesAPI adds a resilient proxy + retry layer so your Java scraper can run reliably in production.