Web Scraping with Kotlin: Ktor + Jsoup Tutorial (2026)

Kotlin is a great language for scrapers:

  • fast enough for most workloads
  • strong typing for clean data models
  • great HTTP clients
  • easy concurrency when you need it

In this guide we’ll build a real Kotlin scraper using:

  • Ktor client for HTTP requests (timeouts, headers, retries)
  • Jsoup for HTML parsing with CSS selectors

We’ll scrape a simple target (Hacker News front page) to keep the focus on architecture:

  1. fetch HTML with sane timeouts
  2. parse rows with Jsoup selectors
  3. paginate
  4. export results to JSON
  5. show where ProxiesAPI fits (without rewriting your parser)

Keep Kotlin crawlers resilient with ProxiesAPI

Kotlin scraping usually fails for boring reasons: timeouts, flaky responses, and blocks. ProxiesAPI belongs in the fetch layer so your parsing logic (Jsoup selectors) stays clean while you scale.


Project setup (Gradle)

Create a Kotlin JVM project. Add dependencies:

dependencies {
    implementation("io.ktor:ktor-client-core:2.3.12")
    implementation("io.ktor:ktor-client-cio:2.3.12")
    implementation("io.ktor:ktor-client-content-negotiation:2.3.12")
    implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.12")
    implementation("org.jsoup:jsoup:1.18.1")

    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.7.3")
}

We’ll use:

  • CIO engine (simple + fast)
  • kotlinx serialization (easy JSON export)
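
The @Serializable annotation used for the data model in this guide relies on the kotlinx.serialization compiler plugin, which the dependency list alone does not enable. A minimal sketch of the matching build.gradle.kts blocks follows; the Kotlin version shown is an assumption, so use whichever version matches your toolchain and your kotlinx-serialization-json release.

```kotlin
// build.gradle.kts (sketch; the version numbers are assumptions)
plugins {
    kotlin("jvm") version "2.0.20"
    // Generates serializers for classes marked @Serializable.
    kotlin("plugin.serialization") version "2.0.20"
}

repositories {
    mavenCentral()
}
```

Without the plugin, the project may compile but calls such as encodeToString on a @Serializable class typically fail at runtime because no serializer was generated.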

Step 1: Build an HTTP client with timeouts + UA

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.plugins.*
import io.ktor.client.request.*
import io.ktor.http.*

fun httpClient(): HttpClient =
    HttpClient(CIO) {
        install(HttpTimeout) {
            connectTimeoutMillis = 10_000
            requestTimeoutMillis = 30_000
            socketTimeoutMillis = 30_000
        }

        defaultRequest {
            header(HttpHeaders.UserAgent, "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)")
            header(HttpHeaders.AcceptLanguage, "en-US,en;q=0.9")
        }
    }

Step 2: Fetch HTML with small retry logic

Scraping often fails because of transient network issues. You don’t need a huge framework, just:

  • retry a few times
  • exponential backoff
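
import io.ktor.client.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.delay

suspend fun fetchHtml(client: HttpClient, url: String, retries: Int = 3): String {
    var lastError: Exception? = null

    for (attempt in 1..retries) {
        try {
            val resp: HttpResponse = client.get(url)
            if (!resp.status.isSuccess()) {
                throw IllegalStateException("HTTP ${resp.status.value} for $url")
            }
            return resp.bodyAsText()
        } catch (e: CancellationException) {
            // Never swallow coroutine cancellation; let it propagate.
            throw e
        } catch (e: Exception) {
            lastError = e
            // Exponential backoff: 400 ms, 800 ms, 1600 ms, ...
            // Skip the wait after the final attempt.
            if (attempt < retries) {
                delay(400L shl (attempt - 1))
            }
        }
    }

    throw lastError ?: IllegalStateException("failed to fetch $url")
}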
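
The backoff schedule can also be pulled out into a small pure function, which makes it easy to test and tune. This is a sketch rather than part of the scraper above: backoffMillis, baseMs, and capMs are hypothetical names, and doubling with a cap is one common policy among several.

```kotlin
// Hypothetical helper: delay before retry number `attempt` (1-based).
// Doubles the base delay on each attempt and never exceeds capMs.
fun backoffMillis(attempt: Int, baseMs: Long = 400L, capMs: Long = 10_000L): Long {
    require(attempt >= 1) { "attempt must be >= 1" }
    // Clamp the shift so large attempt numbers stay well inside Long range
    // for typical base delays.
    val shift = (attempt - 1).coerceAtMost(20)
    return (baseMs shl shift).coerceAtMost(capMs)
}

fun main() {
    println((1..6).map { backoffMillis(it) })
    // prints [400, 800, 1600, 3200, 6400, 10000]
}
```

The cap matters once retry counts grow: without it, a handful of extra attempts can turn into waits of several minutes.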

Step 3: Parse HTML with Jsoup selectors

Jsoup supports CSS selectors much like the ones you can test in browser DevTools.

For Hacker News:

  • story rows are tr.athing
  • title link is span.titleline > a
  • metadata lives in the next sibling row under td.subtext

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

@kotlinx.serialization.Serializable
data class HnStory(
    val id: String?,
    val title: String?,
    val url: String?,
    val points: Int?,
    val author: String?,
    val age: String?,
    val comments: Int?,
    val itemUrl: String?
)

// Extracts the first run of digits from text such as "42 points".
fun parseInt(text: String?): Int? {
    if (text.isNullOrBlank()) return null
    val m = Regex("(\\d+)").find(text) ?: return null
    return m.groupValues[1].toIntOrNull()
}

fun parseFrontPage(html: String, base: String = "https://news.ycombinator.com"): List<HnStory> {
    val doc: Document = Jsoup.parse(html, base)

    return doc.select("tr.athing").map { row ->
        val id = row.attr("id").ifBlank { null }

        val a = row.selectFirst("span.titleline > a")
        val title = a?.text()
        // absUrl resolves relative links (e.g. "item?id=...") against the base URI.
        val href = a?.absUrl("href")?.ifBlank { null }

        // Points, author, and age live in the row that follows the story row.
        val sub = row.nextElementSibling()?.selectFirst("td.subtext")
        val points = parseInt(sub?.selectFirst("span.score")?.text())
        val author = sub?.selectFirst("a.hnuser")?.text()
        val age = sub?.selectFirst("span.age a")?.text()

        // The comments link is the last link in the subtext. Rows without one
        // (e.g. job posts) end with the age link, so require the word "comment"
        // to avoid reading "2 hours ago" as 2 comments.
        val comments = sub?.select("a")?.lastOrNull()?.text()
            ?.takeIf { it.contains("comment") }
            ?.let { parseInt(it) }

        HnStory(
            id = id,
            title = title,
            url = href,
            points = points,
            author = author,
            age = age,
            comments = comments,
            itemUrl = id?.let { "$base/item?id=$it" }
        )
    }
}
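
To see the sibling-row lookup in isolation, it can help to run the same selectors against a tiny fixture. The sketch below is illustrative: the HTML is hand-written to imitate the layout described above (it is not captured from the live site), and firstStory is a hypothetical helper, not part of the scraper.

```kotlin
import org.jsoup.Jsoup

// Hand-written fixture that imitates the row layout described above.
// It is illustrative only, not captured from the live site.
val fixture = """
    <table>
      <tr class="athing" id="1001">
        <td><span class="titleline"><a href="https://example.com/post">Example post</a></span></td>
      </tr>
      <tr>
        <td class="subtext">
          <span class="score">42 points</span> by <a class="hnuser">alice</a>
          <span class="age"><a href="item?id=1001">2 hours ago</a></span>
          | <a href="item?id=1001">7 comments</a>
        </td>
      </tr>
    </table>
""".trimIndent()

// Returns (title, score text, author) for the first story row in the HTML.
fun firstStory(html: String): Triple<String?, String?, String?> {
    val doc = Jsoup.parse(html)
    val row = doc.selectFirst("tr.athing")
    // Metadata sits in the next sibling row, not inside the story row itself.
    val sub = row?.nextElementSibling()?.selectFirst("td.subtext")
    return Triple(
        row?.selectFirst("span.titleline > a")?.text(),
        sub?.selectFirst("span.score")?.text(),
        sub?.selectFirst("a.hnuser")?.text()
    )
}

fun main() {
    println(firstStory(fixture))
    // prints (Example post, 42 points, alice)
}
```

Keeping a fixture like this next to the parser gives you a fast check to rerun whenever the selectors change.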

Step 4: Pagination (crawl N pages)

Hacker News paginates the front page with a ?p=N query parameter.

suspend fun crawlHnFrontPages(client: HttpClient, pages: Int = 3): List<HnStory> {
    val all = mutableListOf<HnStory>()
    val seen = mutableSetOf<String>()

    for (p in 1..pages) {
        val url = if (p == 1) "https://news.ycombinator.com/" else "https://news.ycombinator.com/?p=$p"
        val html = fetchHtml(client, url)
        val batch = parseFrontPage(html)

        for (s in batch) {
            val key = s.id ?: continue
            if (key in seen) continue
            seen.add(key)
            all.add(s)
        }
    }

    return all
}
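
A fixed page count keeps requesting pages even after the listing has run out. One possible refinement, sketched here with hypothetical names (collectPages, keyOf, fetchPage), is to stop at the first empty page while keeping the same seen-set deduplication. The function is inline, so the fetchPage lambda may call suspend functions when collectPages itself is called from a coroutine.

```kotlin
// Hypothetical helper: collects items page by page, deduplicates by key,
// and stops early at the first empty page.
inline fun <T, K> collectPages(
    maxPages: Int,
    keyOf: (T) -> K?,
    fetchPage: (Int) -> List<T>
): List<T> {
    val seen = mutableSetOf<K>()
    val all = mutableListOf<T>()
    for (p in 1..maxPages) {
        val batch = fetchPage(p)
        if (batch.isEmpty()) break            // ran past the last page
        for (item in batch) {
            val key = keyOf(item) ?: continue // skip items without a stable key
            if (seen.add(key)) all.add(item)  // add() returns false for duplicates
        }
    }
    return all
}

fun main() {
    // Fake pages standing in for parsed results; page 3 does not exist.
    val pages = mapOf(1 to listOf("a", "b"), 2 to listOf("b", "c"))
    val result = collectPages(maxPages = 5, keyOf = { it }) { p -> pages[p] ?: emptyList() }
    println(result) // prints [a, b, c]
}
```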

Step 5: Export JSON (kotlinx serialization)

import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import java.nio.file.Files
import java.nio.file.Path

fun writeJson(path: Path, stories: List<HnStory>) {
    val json = Json { prettyPrint = true }
    Files.writeString(path, json.encodeToString(stories))
}

Step 6: Where ProxiesAPI fits (fetch layer only)

If you’re crawling a lot of pages, you’ll hit:

  • intermittent blocks
  • inconsistent responses
  • timeouts

With ProxiesAPI you fetch through a wrapper URL, but you still parse the same HTML.

import java.net.URLEncoder

fun proxiesapiWrap(targetUrl: String, apiKey: String): String {
    val base = "http://api.proxiesapi.com/"
    val encodedUrl = URLEncoder.encode(targetUrl, "UTF-8")
    return "$base?key=$apiKey&url=$encodedUrl"
}

Use it like:

val client = httpClient()
val apiKey = "API_KEY"
val target = "https://news.ycombinator.com/"
val wrapped = proxiesapiWrap(target, apiKey)
val html = fetchHtml(client, wrapped)
val stories = parseFrontPage(html)

Notice the win: only the URL passed to Ktor changes; the Jsoup selectors do not.
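
The URL-encoding step matters as soon as the target has its own query string, such as a paginated URL. The sketch below reuses the endpoint and the key and url parameter names exactly as they appear in this guide; wrap is a hypothetical stand-in for proxiesapiWrap so the example runs on its own.

```kotlin
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Stand-in for proxiesapiWrap; endpoint and parameter names are copied from this guide.
fun wrap(targetUrl: String, apiKey: String): String {
    // Encoding keeps the target's own "?" and "=" from being read as wrapper parameters.
    val encoded = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8)
    return "http://api.proxiesapi.com/?key=$apiKey&url=$encoded"
}

fun main() {
    println(wrap("https://news.ycombinator.com/?p=2", "API_KEY"))
    // prints http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fnews.ycombinator.com%2F%3Fp%3D2
}
```

Without the encoding, a target URL containing "&" would be split into separate wrapper parameters and the request would hit the wrong page.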


Practical advice for Kotlin scrapers

Problem                    What to do
Selectors break            Save sample HTML and update Jsoup selectors
Requests hang              Add timeouts + retries (don’t rely on defaults)
Duplicates across pages    Use a stable key (id/url) and a seen set
Getting blocked            Slow down + add fetch-layer resilience (ProxiesAPI)

If you want to scale further, the next “real” steps are:

  • queue + concurrency limits
  • structured logging (per URL)
  • persistent storage (SQLite/Postgres)
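
For the concurrency-limit item, one possible starting point is a coroutine Semaphore that caps the number of in-flight requests. This is a sketch under assumptions: fetchAllLimited and maxConcurrent are hypothetical names, and the fetch parameter stands in for a real fetcher such as a function that wraps fetchHtml. It uses kotlinx.coroutines, which Ktor already pulls in.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

// Hypothetical helper: runs `fetch` for every URL with at most `maxConcurrent`
// calls in flight. Results come back in the same order as `urls`.
suspend fun <T> fetchAllLimited(
    urls: List<String>,
    maxConcurrent: Int = 4,
    fetch: suspend (String) -> T
): List<T> = coroutineScope {
    val gate = Semaphore(maxConcurrent)
    urls.map { url ->
        // Each task waits for a permit before it starts fetching.
        async { gate.withPermit { fetch(url) } }
    }.awaitAll()
}

fun main() = runBlocking {
    // Stand-in fetcher so the sketch runs without network access.
    val lengths = fetchAllLimited(listOf("a", "bb", "ccc"), maxConcurrent = 2) { it.length }
    println(lengths) // prints [1, 2, 3]
}
```

If any fetch throws, coroutineScope cancels the remaining tasks and rethrows, so pair this with per-URL error handling if you want partial results.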
