Web Scraping with Kotlin: Ktor + Jsoup Tutorial (2026)
Kotlin is a great language for scrapers:
- fast enough for most workloads
- strong typing for clean data models
- great HTTP clients
- easy concurrency when you need it
In this guide we’ll build a real Kotlin scraper using:
- Ktor client for HTTP requests (timeouts, headers, retries)
- Jsoup for HTML parsing with CSS selectors
We’ll scrape a simple target (Hacker News front page) to keep the focus on architecture:
- fetch HTML with sane timeouts
- parse rows with Jsoup selectors
- paginate
- export results to JSON
- show where ProxiesAPI fits (without rewriting your parser)
Kotlin scraping usually fails for boring reasons: timeouts, flaky responses, and blocks. ProxiesAPI belongs in the fetch layer so your parsing logic (Jsoup selectors) stays clean while you scale.
Project setup (Gradle)
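Create a Kotlin JVM project and apply the kotlinx.serialization Gradle plugin (without it, the @Serializable class in Step 3 gets no generated serializer). Then add the dependencies: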
dependencies {
    // HTTP client + CIO engine
    implementation("io.ktor:ktor-client-core:2.3.12")
    implementation("io.ktor:ktor-client-cio:2.3.12")
    // Only needed if you also decode JSON responses through Ktor; this guide does not use them
    implementation("io.ktor:ktor-client-content-negotiation:2.3.12")
    implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.12")
    // HTML parsing
    implementation("org.jsoup:jsoup:1.18.1")
    // JSON export
    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.7.3")
}
We’ll use:
- CIO engine (simple + fast)
- kotlinx serialization (easy JSON export)
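The dependency list alone does not enable kotlinx serialization; the compiler plugin has to be applied as well. A minimal sketch of the rest of a build.gradle.kts follows. The Kotlin version shown is an assumption, so match it to your own toolchain.

```kotlin
// build.gradle.kts (Kotlin DSL). Version numbers here are assumptions, not requirements.
plugins {
    kotlin("jvm") version "2.0.20"
    // Generates serializers for classes annotated with @Serializable
    kotlin("plugin.serialization") version "2.0.20"
}

repositories {
    mavenCentral()
}
```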
Step 1: Build an HTTP client with timeouts + UA
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.plugins.*
import io.ktor.client.request.*
import io.ktor.http.*

fun httpClient(): HttpClient =
    HttpClient(CIO) {
        install(HttpTimeout) {
            connectTimeoutMillis = 10_000
            requestTimeoutMillis = 30_000
            socketTimeoutMillis = 30_000
        }
        defaultRequest {
            header(HttpHeaders.UserAgent, "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)")
            header(HttpHeaders.AcceptLanguage, "en-US,en;q=0.9")
        }
    }
Step 2: Fetch HTML with small retry logic
Scrapers often fail because of transient network issues. You don’t need a big framework to handle that, just:
- retry a few times
- exponential backoff
import io.ktor.client.statement.*
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.delay

suspend fun fetchHtml(client: HttpClient, url: String, retries: Int = 3): String {
    var lastError: Exception? = null
    for (attempt in 1..retries) {
        try {
            val resp: HttpResponse = client.get(url)
            if (!resp.status.isSuccess()) {
                throw IllegalStateException("HTTP ${resp.status.value} for $url")
            }
            return resp.bodyAsText()
        } catch (e: CancellationException) {
            throw e // never swallow coroutine cancellation
        } catch (e: Exception) {
            lastError = e
            if (attempt < retries) {
                // Exponential backoff: 400 ms, 800 ms, 1600 ms, ...
                // No delay after the final attempt; we just rethrow below.
                delay(400L shl (attempt - 1))
            }
        }
    }
    throw lastError ?: IllegalStateException("failed to fetch $url")
}
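The retry delay above can also be pulled out into a small pure function, which makes the schedule easy to test and tune. The sketch below is one way to do that, not something Ktor provides: backoffMillis, its default values, and the optional jitter are all hypothetical.

```kotlin
import kotlin.random.Random

/**
 * Delay in milliseconds before retry number [attempt] (1-based).
 * Doubles from [baseMs] on each attempt and is capped at [capMs].
 * [jitter] is a fraction (0.0 to 1.0) of random spread around the capped value.
 */
fun backoffMillis(
    attempt: Int,
    baseMs: Long = 400,
    capMs: Long = 10_000,
    jitter: Double = 0.0,
    rng: Random = Random.Default,
): Long {
    require(attempt >= 1) { "attempt is 1-based" }
    require(jitter in 0.0..1.0) { "jitter must be between 0.0 and 1.0" }
    // Clamp the shift so large attempt numbers cannot overflow the Long.
    val doubled = baseMs * (1L shl (attempt - 1).coerceAtMost(20))
    val capped = minOf(doubled, capMs)
    val spread = (capped * jitter).toLong()
    return if (spread == 0L) capped else capped + rng.nextLong(-spread, spread + 1)
}
```

With the defaults, attempts 1 through 3 wait 400, 800, and 1600 ms. Inside fetchHtml the call would be delay(backoffMillis(attempt)); a little jitter helps when many workers retry against the same host at once.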
Step 3: Parse HTML with Jsoup selectors
Jsoup supports CSS selectors, so a selector you try out in browser DevTools usually carries over unchanged.
For Hacker News:
- story rows are tr.athing
- title link is span.titleline > a
- metadata lives in the next sibling row under td.subtext
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

@kotlinx.serialization.Serializable
data class HnStory(
    val id: String?,
    val title: String?,
    val url: String?,
    val points: Int?,
    val author: String?,
    val age: String?,
    val comments: Int?,
    val itemUrl: String?
)

// Extracts the first run of digits, e.g. "123 points" -> 123.
fun parseInt(text: String?): Int? {
    if (text.isNullOrBlank()) return null
    val m = Regex("(\\d+)").find(text) ?: return null
    return m.groupValues[1].toIntOrNull()
}

fun parseFrontPage(html: String, base: String = "https://news.ycombinator.com"): List<HnStory> {
    val doc: Document = Jsoup.parse(html, base)
    return doc.select("tr.athing").map { row ->
        val id = row.attr("id").ifBlank { null }
        val a = row.selectFirst("span.titleline > a")
        val title = a?.text()
        // absUrl resolves relative links (e.g. "item?id=...") against base
        val href = a?.absUrl("href")?.ifBlank { null }
        // Metadata sits in the row that follows the story row
        val sub = row.nextElementSibling()?.selectFirst("td.subtext")
        val points = parseInt(sub?.selectFirst("span.score")?.text())
        val author = sub?.selectFirst("a.hnuser")?.text()
        val age = sub?.selectFirst("span.age a")?.text()
        // Only a link whose text mentions comments counts; otherwise a last link
        // such as "2 hours ago" would be parsed as 2 comments
        val commentsLink = sub?.select("a")?.lastOrNull { "comment" in it.text() }
        val comments = parseInt(commentsLink?.text())
        HnStory(
            id = id,
            title = title,
            url = href,
            points = points,
            author = author,
            age = age,
            comments = comments,
            itemUrl = id?.let { "$base/item?id=$it" }
        )
    }
}
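A digits-only regex is convenient but indiscriminate, which matters when picking the comment count out of the subtext links. The snippet below repeats the same regex idea under a hypothetical name (firstInt) so that it runs on its own, and adds a hypothetical commentCount guard that only trusts text mentioning comments. The strings used to exercise it are illustrative, not captured from a live page.

```kotlin
// Same idea as parseInt above: the first run of digits, or null if there is none.
fun firstInt(text: String?): Int? =
    text?.let { Regex("\\d+").find(it)?.value?.toIntOrNull() }

// Hypothetical guard: treat the text as a comment count only if it says "comment".
fun commentCount(linkText: String?): Int? =
    linkText?.takeIf { "comment" in it }?.let { firstInt(it) }
```

firstInt("2 hours ago") returns 2, while commentCount("2 hours ago") returns null. That difference is why a row without a comments link should not fall back to whichever link happens to be last.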
Step 4: Pagination (crawl N pages)
HN paginates its front page with a ?p=N query parameter (page 1 is the bare URL).
suspend fun crawlHnFrontPages(client: HttpClient, pages: Int = 3): List<HnStory> {
    val all = mutableListOf<HnStory>()
    val seen = mutableSetOf<String>()
    for (p in 1..pages) {
        val url = if (p == 1) "https://news.ycombinator.com/" else "https://news.ycombinator.com/?p=$p"
        val html = fetchHtml(client, url)
        val batch = parseFrontPage(html)
        for (s in batch) {
            val key = s.id ?: continue
            if (key in seen) continue
            seen.add(key)
            all.add(s)
        }
    }
    return all
}
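A fixed page count works for a demo, but a crawler often cannot know in advance how many pages exist. One option is to stop early, sketched here under the assumption that a page contributing no new ids means the listing has run out. crawlUntilExhausted and its parameters are hypothetical; the page fetcher is passed in as a plain function so the sketch runs without network access.

```kotlin
/**
 * Crawls pages 1..maxPages, de-duplicating by key, and stops early
 * once a page contributes nothing new.
 */
fun <T : Any> crawlUntilExhausted(
    maxPages: Int,
    keyOf: (T) -> String?,
    fetchPage: (Int) -> List<T>,
): List<T> {
    val byKey = LinkedHashMap<String, T>() // keeps first-seen order
    for (page in 1..maxPages) {
        val before = byKey.size
        for (item in fetchPage(page)) {
            val key = keyOf(item) ?: continue // skip items without a stable key
            byKey.putIfAbsent(key, item)
        }
        if (byKey.size == before) break // nothing new: assume we ran past the last page
    }
    return byKey.values.toList()
}
```

In a real crawler, fetchPage would wrap fetchHtml plus parseFrontPage, and the function would be marked suspend with a suspending lambda.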
Step 5: Export JSON (kotlinx serialization)
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import java.nio.file.Files
import java.nio.file.Path

fun writeJson(path: Path, stories: List<HnStory>) {
    val json = Json { prettyPrint = true }
    Files.writeString(path, json.encodeToString(stories))
}
Step 6: Where ProxiesAPI fits (fetch layer only)
If you’re crawling a lot of pages, you’ll hit:
- intermittent blocks
- inconsistent responses
- timeouts
With ProxiesAPI you fetch through a wrapper URL, but you still parse the same HTML.
import java.net.URLEncoder

fun proxiesapiWrap(targetUrl: String, apiKey: String): String {
    val base = "http://api.proxiesapi.com/"
    val encodedUrl = URLEncoder.encode(targetUrl, "UTF-8")
    return "$base?key=$apiKey&url=$encodedUrl"
}
Use it like:
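import kotlinx.coroutines.runBlocking

// fetchHtml is a suspend function, so it has to be called from a coroutine
fun main() = runBlocking {
    httpClient().use { client ->
        val apiKey = "API_KEY" // placeholder: load the real key from configuration
        val target = "https://news.ycombinator.com/"
        val wrapped = proxiesapiWrap(target, apiKey)
        val html = fetchHtml(client, wrapped)
        val stories = parseFrontPage(html)
        println("parsed ${stories.size} stories")
    }
}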
Notice the win: only the URL passed to Ktor changes; the Jsoup selectors do not.
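One way to keep that switch in a single place is a small function that decides which URL to fetch. This is a sketch, not a ProxiesAPI client library: resolveFetchUrl is a hypothetical name, and the endpoint and the key/url parameter names are copied from proxiesapiWrap above, so verify them against the provider’s documentation. It also shows why the target must be URL-encoded: otherwise the target’s own query string would be read as parameters of the wrapper URL.

```kotlin
import java.net.URLEncoder

/** Returns [targetUrl] unchanged when no key is configured, otherwise the wrapped URL. */
fun resolveFetchUrl(targetUrl: String, apiKey: String?): String {
    if (apiKey.isNullOrBlank()) return targetUrl // direct fetch, e.g. during local development
    val endpoint = "http://api.proxiesapi.com/" // taken from proxiesapiWrap in this guide
    val key = URLEncoder.encode(apiKey, "UTF-8")
    val url = URLEncoder.encode(targetUrl, "UTF-8")
    return "$endpoint?key=$key&url=$url"
}
```

A call such as fetchHtml(client, resolveFetchUrl(target, System.getenv("PROXIESAPI_KEY"))) would then work with or without a key; the environment variable name is only an example.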
Practical advice for Kotlin scrapers
| Problem | What to do |
|---|---|
| Selectors break | Save sample HTML and update Jsoup selectors |
| Requests hang | Add timeouts + retries (don’t rely on defaults) |
| Duplicates across pages | Use a stable key (id/url) and a seen set |
| Getting blocked | Slow down + add fetch-layer resilience (ProxiesAPI) |
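For the first row of the table, keeping a copy of each fetched page makes selector breakage reproducible offline. A minimal sketch, assuming a local snapshot directory and a hypothetical saveSnapshot helper that names files by a hash of the URL:

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

/** Writes [html] into [dir] under a name derived from the SHA-256 of [url]; returns the file path. */
fun saveSnapshot(dir: Path, url: String, html: String): Path {
    val digest = MessageDigest.getInstance("SHA-256").digest(url.toByteArray(Charsets.UTF_8))
    // First 8 bytes as hex are enough for a readable, stable file name
    val name = digest.take(8).joinToString("") { "%02x".format(it) } + ".html"
    Files.createDirectories(dir)
    val file = dir.resolve(name)
    Files.write(file, html.toByteArray(Charsets.UTF_8)) // overwrites an older snapshot of the same URL
    return file
}
```

A saved file can then be read back and passed to parseFrontPage in a unit test, so a selector fix can be checked without hitting the site again.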
If you want to scale further, the next “real” steps are:
- queue + concurrency limits
- structured logging (per URL)
- persistent storage (SQLite/Postgres)
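As a rough illustration of the first item, the sketch below caps how many tasks run at once. It uses only JDK thread pools so that it runs without extra dependencies; in a Ktor-based scraper, coroutines combined with the Semaphore from kotlinx.coroutines.sync would be the more natural fit. mapBounded is a hypothetical helper, not part of any library.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

/**
 * Applies [task] to every input with at most [maxParallel] tasks running at once.
 * Results are returned in input order.
 */
fun <A, B> mapBounded(inputs: List<A>, maxParallel: Int, task: (A) -> B): List<B> {
    require(maxParallel >= 1) { "maxParallel must be at least 1" }
    val pool = Executors.newFixedThreadPool(maxParallel)
    try {
        // invokeAll blocks until every task has finished
        val futures = pool.invokeAll(inputs.map { input -> Callable<B> { task(input) } })
        // get() rethrows a task failure wrapped in ExecutionException
        return futures.map { it.get() }
    } finally {
        pool.shutdown()
    }
}
```

The pool size is the concurrency limit: with maxParallel = 2, no more than two fetches are in flight regardless of how many URLs are queued.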