Web Scraping with Kotlin: Ktor + Jsoup Tutorial (2026)

Kotlin is a great language for scrapers:

  • JVM performance that is fast enough for most scraping workloads
  • strong typing (data classes, null safety) for clean data models
  • mature HTTP clients such as Ktor
  • coroutines for concurrency when you need it

In this guide we’ll build a real Kotlin scraper using:

  • Ktor client for HTTP requests (timeouts, headers, retries)
  • Jsoup for HTML parsing with CSS selectors

We’ll scrape a simple target (Hacker News front page) to keep the focus on architecture:

  1. fetch HTML with sane timeouts
  2. parse rows with Jsoup selectors
  3. paginate
  4. export results to JSON
  5. show where ProxiesAPI fits (without rewriting your parser)

Keep Kotlin crawlers resilient with ProxiesAPI

Kotlin scraping usually fails for boring reasons: timeouts, flaky responses, and blocks. ProxiesAPI belongs in the fetch layer so your parsing logic (Jsoup selectors) stays clean while you scale.


Project setup (Gradle)

Create a Kotlin JVM project. Add dependencies:

dependencies {
    implementation("io.ktor:ktor-client-core:2.3.12")
    implementation("io.ktor:ktor-client-cio:2.3.12")
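    // Optional: only needed to send/receive JSON through Ktor; this guide never installs ContentNegotiation
    implementation("io.ktor:ktor-client-content-negotiation:2.3.12")
    implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.12")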
    implementation("org.jsoup:jsoup:1.18.1")

    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.7.3")
}

We’ll use:

  • CIO engine (simple + fast)
  • kotlinx serialization (easy JSON export)
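
One thing the dependency list does not cover: classes annotated with @Serializable only compile into working serializers when the kotlinx.serialization compiler plugin is applied. A minimal sketch of the matching plugins block for build.gradle.kts follows. The version numbers are assumptions, so match them to your own Kotlin toolchain.

```kotlin
// build.gradle.kts (sketch; the version numbers below are assumptions)
plugins {
    kotlin("jvm") version "2.0.20"
    // Generates serializers for classes annotated with @Serializable
    kotlin("plugin.serialization") version "2.0.20"
}

repositories {
    mavenCentral()
}
```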

Step 1: Build an HTTP client with timeouts + UA

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.plugins.*
import io.ktor.client.request.*
import io.ktor.http.*

fun httpClient(): HttpClient =
    HttpClient(CIO) {
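        install(HttpTimeout) {
            // Max time to establish the TCP connection
            connectTimeoutMillis = 10_000
            // Max time for the whole request, from sending to receiving the response
            requestTimeoutMillis = 30_000
            // Max idle time between two data packets
            socketTimeoutMillis = 30_000
        }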

        defaultRequest {
            header(HttpHeaders.UserAgent, "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)")
            header(HttpHeaders.AcceptLanguage, "en-US,en;q=0.9")
        }
    }

Step 2: Fetch HTML with small retry logic

Scrapers often fail because of transient network issues. You don’t need a big framework for that, just:

  • a few retries
  • a backoff that grows with each attempt (here 400 ms × attempt²)

import io.ktor.client.statement.*
import kotlinx.coroutines.delay

suspend fun fetchHtml(client: HttpClient, url: String, retries: Int = 3): String {
    var lastError: Throwable? = null

    for (attempt in 1..retries) {
        try {
            val resp: HttpResponse = client.get(url)
            if (!resp.status.isSuccess()) {
                throw IllegalStateException("HTTP ${resp.status.value} for $url")
            }
            return resp.bodyAsText()
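        } catch (t: Throwable) {
            // Never swallow coroutine cancellation; let it propagate.
            if (t is kotlinx.coroutines.CancellationException) throw t
            lastError = t
            // Back off before the next attempt, but not after the final one.
            if (attempt < retries) {
                val backoffMs = 400L * attempt * attempt
                delay(backoffMs)
            }
        }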
    }

    throw lastError ?: IllegalStateException("failed to fetch $url")
}
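
fetchHtml retries every failure, including responses such as 404 that will not change on a second try. One way to tighten that is to keep the retry policy in small pure functions that are easy to test. The sketch below is an illustration only: backoffMs and isRetryableStatus are hypothetical helper names, not part of Ktor, and the list of retryable status codes is an assumption you should tune for your target.

```kotlin
// Hypothetical helpers, not part of Ktor: a retry policy kept separate from the HTTP call.

/** Delay before the next attempt, using the same 400 ms × attempt² rule as fetchHtml. */
fun backoffMs(attempt: Int, baseMs: Long = 400L): Long = baseMs * attempt * attempt

/** Assumption: only request timeouts, rate limits, and server errors are worth retrying. */
fun isRetryableStatus(code: Int): Boolean = code == 408 || code == 429 || code in 500..599

fun main() {
    println((1..3).map { backoffMs(it) })   // [400, 1600, 3600]
    println(isRetryableStatus(404))          // false
    println(isRetryableStatus(503))          // true
}
```

Inside the retry loop you could then rethrow immediately when the status is not retryable instead of sleeping and trying again.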

Step 3: Parse HTML with Jsoup selectors

Jsoup accepts CSS selectors, so a selector you test in browser DevTools usually carries over unchanged.

For Hacker News:

  • story rows are tr.athing
  • title link is span.titleline > a
  • metadata lives in the next sibling row under td.subtext

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

@kotlinx.serialization.Serializable
data class HnStory(
    val id: String?,
    val title: String?,
    val url: String?,
    val points: Int?,
    val author: String?,
    val age: String?,
    val comments: Int?,
    val itemUrl: String?
)

fun parseInt(text: String?): Int? {
    if (text.isNullOrBlank()) return null
    val m = Regex("(\\d+)").find(text) ?: return null
    return m.groupValues[1].toIntOrNull()
}

fun parseFrontPage(html: String, base: String = "https://news.ycombinator.com"): List<HnStory> {
    val doc: Document = Jsoup.parse(html, base)

    return doc.select("tr.athing").map { row ->
        val id = row.attr("id").ifBlank { null }

        val a = row.selectFirst("span.titleline > a")
        val title = a?.text()
        // absUrl resolves relative links (e.g. item?id=...) against the base URL
        val href = a?.absUrl("href")?.ifBlank { null }

        val sub = row.nextElementSibling()?.selectFirst("td.subtext")
        val points = parseInt(sub?.selectFirst("span.score")?.text())
        val author = sub?.selectFirst("a.hnuser")?.text()
        val age = sub?.selectFirst("span.age a")?.text()

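        // Pick the link that mentions comments; job posts and "discuss" links carry no count,
        // and blindly taking the last link would misread "2 hours ago" as 2 comments.
        val comments = parseInt(
            sub?.select("a")?.lastOrNull { it.text().contains("comment") }?.text()
        )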

        HnStory(
            id = id,
            title = title,
            url = href,
            points = points,
            author = author,
            age = age,
            comments = comments,
            itemUrl = id?.let { "$base/item?id=$it" }
        )
    }
}
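
Selectors are easiest to trust when you can run them against a saved snippet instead of the live site. The sketch below applies the same three selectors to a tiny inline fixture. The HTML is hand-written to imitate the structure described above, not a captured page, and firstStorySummary is a hypothetical helper name.

```kotlin
import org.jsoup.Jsoup

// Hand-written fixture that imitates the HN markup described above (an assumption,
// not a captured page).
val sampleHtml = """
    <table>
      <tr class="athing" id="1001">
        <td><span class="titleline"><a href="https://example.com/post">Example post</a></span></td>
      </tr>
      <tr>
        <td class="subtext">
          <span class="score">42 points</span> by <a class="hnuser">alice</a>
        </td>
      </tr>
    </table>
""".trimIndent()

/** Returns "title | score | author" for the first story row, or null if no row matches. */
fun firstStorySummary(html: String): String? {
    val doc = Jsoup.parse(html, "https://news.ycombinator.com")
    val row = doc.selectFirst("tr.athing") ?: return null
    val title = row.selectFirst("span.titleline > a")?.text()
    // Metadata sits in the row that follows the story row.
    val sub = row.nextElementSibling()?.selectFirst("td.subtext")
    val score = sub?.selectFirst("span.score")?.text()
    val author = sub?.selectFirst("a.hnuser")?.text()
    return "$title | $score | $author"
}

fun main() {
    println(firstStorySummary(sampleHtml))   // Example post | 42 points | alice
}
```

When the site changes its markup, a fixture like this fails in one obvious place, which is faster to debug than a crawl that silently returns nulls.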

Step 4: Pagination (crawl N pages)

HN paginates the front page with a ?p=N query parameter; page 1 is the default.

suspend fun crawlHnFrontPages(client: HttpClient, pages: Int = 3): List<HnStory> {
    val all = mutableListOf<HnStory>()
    val seen = mutableSetOf<String>()

    for (p in 1..pages) {
        val url = if (p == 1) "https://news.ycombinator.com/" else "https://news.ycombinator.com/?p=$p"
        val html = fetchHtml(client, url)
        val batch = parseFrontPage(html)

        for (s in batch) {
            val key = s.id ?: continue
            if (key in seen) continue
            seen.add(key)
            all.add(s)
        }
    }

    return all
}
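
The seen set matters because stories can move between pages while a crawl is running, so the same id may show up twice. If you already hold every page in memory, the standard library's distinctBy gives the same first-occurrence-wins behavior in one expression. A minimal sketch, using a hypothetical Story type as a stand-in for HnStory:

```kotlin
// Stand-in for HnStory with only the fields this example needs (hypothetical type).
data class Story(val id: String?, val title: String)

/** Flattens pages, drops rows without an id, and keeps the first story seen per id. */
fun dedupe(pages: List<List<Story>>): List<Story> =
    pages.flatten()
        .filter { it.id != null }
        .distinctBy { it.id }

fun main() {
    val page1 = listOf(Story("1", "A"), Story("2", "B"))
    val page2 = listOf(Story("2", "B (moved to page 2)"), Story("3", "C"), Story(null, "no id"))
    println(dedupe(listOf(page1, page2)).map { it.id })   // [1, 2, 3]
}
```

The explicit loop with a seen set is still the better fit when you stream results page by page, because it never needs the full list up front.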

Step 5: Export JSON (kotlinx serialization)

import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import java.nio.file.Files
import java.nio.file.Path

fun writeJson(path: Path, stories: List<HnStory>) {
    val json = Json { prettyPrint = true }
    Files.writeString(path, json.encodeToString(stories))
}
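
A cheap sanity check for the export is a round trip: encode the list, decode it again, and compare. The sketch below assumes the serialization compiler plugin is applied and uses a hypothetical StoryRecord type trimmed down from HnStory.

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

// Stand-in for HnStory (hypothetical, trimmed to two fields).
@Serializable
data class StoryRecord(val id: String?, val points: Int?)

/** Encodes the list to a JSON string and decodes it again, returning the decoded copy. */
fun roundTrip(stories: List<StoryRecord>): List<StoryRecord> {
    val text = Json.encodeToString(stories)
    return Json.decodeFromString<List<StoryRecord>>(text)
}

fun main() {
    val stories = listOf(StoryRecord("1001", 42), StoryRecord("1002", null))
    println(Json.encodeToString(stories))
    println(roundTrip(stories) == stories)   // true
}
```

Because data classes compare by value, the equality check covers every field, including the nullable ones.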

Step 6: Where ProxiesAPI fits (fetch layer only)

If you’re crawling a lot of pages, you’ll hit:

  • intermittent blocks
  • inconsistent responses
  • timeouts

With ProxiesAPI you fetch through a wrapper URL, but you still parse the same HTML.

import java.net.URLEncoder

fun proxiesapiWrap(targetUrl: String, apiKey: String): String {
    val base = "http://api.proxiesapi.com/"
    val encodedUrl = URLEncoder.encode(targetUrl, "UTF-8")
    return "$base?key=$apiKey&url=$encodedUrl"
}

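Use it from a coroutine (fetchHtml is a suspend function, so call it inside runBlocking or another suspend function):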

val client = httpClient()
val apiKey = "API_KEY"
val target = "https://news.ycombinator.com/"
val wrapped = proxiesapiWrap(target, apiKey)
val html = fetchHtml(client, wrapped)
val stories = parseFrontPage(html)

Notice the win: Ktor fetch URL changes; Jsoup selectors do not.
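
To make that switch a configuration choice rather than a code change, the wrapping can sit behind one function that falls back to the direct URL when no key is set. This is a sketch under assumptions: resolveFetchUrl and the PROXIESAPI_KEY environment variable are hypothetical names, and the endpoint and query parameters are simply copied from proxiesapiWrap above.

```kotlin
import java.net.URLEncoder

// Endpoint and parameter names are copied from proxiesapiWrap above.
// PROXIESAPI_KEY is a hypothetical environment variable name.
fun resolveFetchUrl(
    targetUrl: String,
    apiKey: String? = System.getenv("PROXIESAPI_KEY")
): String {
    // No key configured: fetch the target directly.
    if (apiKey.isNullOrBlank()) return targetUrl
    val encodedUrl = URLEncoder.encode(targetUrl, "UTF-8")
    return "http://api.proxiesapi.com/?key=$apiKey&url=$encodedUrl"
}

fun main() {
    println(resolveFetchUrl("https://news.ycombinator.com/?p=2", apiKey = null))
    // https://news.ycombinator.com/?p=2
    println(resolveFetchUrl("https://news.ycombinator.com/?p=2", apiKey = "API_KEY"))
    // http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fnews.ycombinator.com%2F%3Fp%3D2
}
```

Reading the key from the environment also keeps it out of source control.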


Practical advice for Kotlin scrapers

Problem                    What to do
Selectors break            Save sample HTML and update Jsoup selectors
Requests hang              Add timeouts + retries (don’t rely on defaults)
Duplicates across pages    Use a stable key (id/url) and a seen set
Getting blocked            Slow down + add fetch-layer resilience (ProxiesAPI)

If you want to scale further, the next “real” steps are:

  • queue + concurrency limits
  • structured logging (per URL)
  • persistent storage (SQLite/Postgres)
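
For the first item, a coroutine Semaphore is one simple way to cap how many requests are in flight at once. The sketch below is an illustration under assumptions: fetchAllLimited is a hypothetical helper, and the fetch lambda is a fake that only sleeps so the example runs offline.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

/**
 * Runs [fetch] for every URL with at most [maxConcurrent] calls in flight.
 * Results come back in the same order as [urls].
 */
suspend fun <T> fetchAllLimited(
    urls: List<String>,
    maxConcurrent: Int,
    fetch: suspend (String) -> T
): List<T> = coroutineScope {
    val permits = Semaphore(maxConcurrent)
    urls.map { url ->
        // Each coroutine waits for a permit before it starts fetching.
        async { permits.withPermit { fetch(url) } }
    }.awaitAll()
}

fun main() = runBlocking {
    val urls = (1..5).map { "https://news.ycombinator.com/?p=$it" }
    // Fake fetch so the example runs offline; swap in a real fetch function.
    val lengths = fetchAllLimited(urls, maxConcurrent = 2) { url ->
        delay(50)
        url.length
    }
    println(lengths.size)   // 5
}
```

A low limit (two or three) is usually the polite choice for a single site; raise it only if the target tolerates the load.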