Web Scraping with Kotlin: Ktor + Jsoup Tutorial (2026)

May 20, 2026 · tutorial · #web scraping with kotlin, #kotlin, #ktor, #jsoup, #html, #pagination, #retries, #proxies

Kotlin is a great language for scrapers:

fast enough for most workloads
strong typing for clean data models
great HTTP clients
easy concurrency when you need it

In this guide we’ll build a real Kotlin scraper using:

Ktor client for HTTP requests (timeouts, headers, retries)
Jsoup for HTML parsing with CSS selectors

We’ll scrape a simple target (Hacker News front page) to keep the focus on architecture:

fetch HTML with sane timeouts
parse rows with Jsoup selectors
paginate
export results to JSON
show where ProxiesAPI fits (without rewriting your parser)

Keep Kotlin crawlers resilient with ProxiesAPI

Kotlin scraping usually fails for boring reasons: timeouts, flaky responses, and blocks. ProxiesAPI belongs in the fetch layer so your parsing logic (Jsoup selectors) stays clean while you scale.


Project setup (Gradle)

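Create a Kotlin JVM project and apply the kotlinx.serialization Gradle plugin (the @Serializable annotation used in Step 3 needs it). Then add the dependencies: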

dependencies {
    implementation("io.ktor:ktor-client-core:2.3.12")
    implementation("io.ktor:ktor-client-cio:2.3.12")
    implementation("io.ktor:ktor-client-content-negotiation:2.3.12")
    implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.12")
    implementation("org.jsoup:jsoup:1.18.1")

    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.7.3")
}
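For reference, the plugin setup could look roughly like the sketch below. The Kotlin version shown is an assumption; use whichever version your project already targets, and keep both plugin versions identical.

```kotlin
// build.gradle.kts (sketch; the version numbers are assumptions, match them to your project)
plugins {
    kotlin("jvm") version "2.0.20"
    // Generates the serializers behind @Serializable at compile time.
    kotlin("plugin.serialization") version "2.0.20"
}

repositories {
    mavenCentral()
}
```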

We’ll use:

CIO engine (simple + fast)
kotlinx serialization (easy JSON export)

Step 1: Build an HTTP client with timeouts + UA

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.plugins.*
import io.ktor.client.request.*
import io.ktor.http.*

fun httpClient(): HttpClient =
    HttpClient(CIO) {
        install(HttpTimeout) {
            connectTimeoutMillis = 10_000
            requestTimeoutMillis = 30_000
            socketTimeoutMillis = 30_000
        }

        defaultRequest {
            header(HttpHeaders.UserAgent, "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)")
            header(HttpHeaders.AcceptLanguage, "en-US,en;q=0.9")
        }
    }

Step 2: Fetch HTML with small retry logic

Scraping often fails because of transient network issues. You don’t need a huge framework, just:

retry a few times
exponential backoff

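import io.ktor.client.statement.*
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.delay

// Relies on the Step 1 imports for HttpClient, get() and isSuccess().
suspend fun fetchHtml(client: HttpClient, url: String, retries: Int = 3): String {
    var lastError: Exception? = null

    for (attempt in 1..retries) {
        try {
            val resp: HttpResponse = client.get(url)
            if (!resp.status.isSuccess()) {
                throw IllegalStateException("HTTP ${resp.status.value} for $url")
            }
            return resp.bodyAsText()
        } catch (e: CancellationException) {
            throw e // never swallow coroutine cancellation
        } catch (e: Exception) {
            lastError = e
            // Exponential backoff (400 ms, 800 ms, ...); skip the wait after the final attempt.
            if (attempt < retries) delay(400L shl (attempt - 1))
        }
    }

    throw lastError ?: IllegalStateException("failed to fetch $url")
}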
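One refinement worth considering: the function above retries every failure, including a 404 that will never succeed. The sketch below separates "is this status worth retrying?" from "how long should I wait?". Both helpers are hypothetical names introduced here for illustration, and the choice of retryable codes (408, 429, 5xx) is an assumption you may want to tune per site.

```kotlin
// Hypothetical helper (not part of Ktor): is a failed HTTP status worth retrying?
fun isRetryableStatus(code: Int): Boolean =
    code == 408 || code == 429 || code in 500..599

// Hypothetical helper: exponential backoff in milliseconds for a 1-based attempt number.
fun backoffMillis(attempt: Int, baseMs: Long = 400L, capMs: Long = 10_000L): Long {
    require(attempt >= 1) { "attempt is 1-based" }
    val shift = (attempt - 1).coerceAtMost(20) // keep the shift small to avoid overflow
    return (baseMs shl shift).coerceAtMost(capMs)
}

fun main() {
    println(listOf(404, 429, 503).map(::isRetryableStatus)) // [false, true, true]
    println((1..6).map { backoffMillis(it) }) // [400, 800, 1600, 3200, 6400, 10000]
}
```

Inside a retry loop you would fail fast when isRetryableStatus returns false and otherwise wait backoffMillis(attempt) before the next attempt.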

Step 3: Parse HTML with Jsoup selectors

Jsoup supports CSS selectors, so a selector you have tested in browser DevTools usually works unchanged in select() and selectFirst().

For Hacker News:

story rows are tr.athing
title link is span.titleline > a
metadata lives in the next sibling row under td.subtext
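Before wiring these selectors into a full parser, it can help to check them against a tiny fragment. The HTML below is hand-written to mimic the row layout described above (an assumption, not a saved Hacker News page), and extractTitleAndScore is a hypothetical helper name.

```kotlin
import org.jsoup.Jsoup

// Hand-written fragment that mimics the story row + subtext row layout (assumption).
val sampleHtml = """
    <table>
      <tr class="athing" id="1">
        <td class="title"><span class="titleline"><a href="https://example.com/post">Example post</a></span></td>
      </tr>
      <tr>
        <td class="subtext"><span class="score">42 points</span> by <a class="hnuser">alice</a></td>
      </tr>
    </table>
""".trimIndent()

// Hypothetical helper: returns the title and score text of the first story row.
fun extractTitleAndScore(html: String): Pair<String?, String?> {
    val row = Jsoup.parse(html).selectFirst("tr.athing")
    val title = row?.selectFirst("span.titleline > a")?.text()
    // The metadata sits in the sibling row that follows the story row.
    val score = row?.nextElementSibling()?.selectFirst("td.subtext span.score")?.text()
    return title to score
}

fun main() {
    val (title, score) = extractTitleAndScore(sampleHtml)
    println("$title | $score") // Example post | 42 points
}
```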

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

@kotlinx.serialization.Serializable
data class HnStory(
    val id: String?,
    val title: String?,
    val url: String?,
    val points: Int?,
    val author: String?,
    val age: String?,
    val comments: Int?,
    val itemUrl: String?
)

fun parseInt(text: String?): Int? {
    if (text.isNullOrBlank()) return null
    val m = Regex("(\\d+)").find(text) ?: return null
    return m.groupValues[1].toIntOrNull()
}

fun parseFrontPage(html: String, base: String = "https://news.ycombinator.com"): List<HnStory> {
    val doc: Document = Jsoup.parse(html, base)

    return doc.select("tr.athing").map { row ->
        val id = row.attr("id").ifBlank { null }

        val a = row.selectFirst("span.titleline > a")
        val title = a?.text()
        // absUrl resolves relative links (e.g. item?id=...) against the base URI.
        val href = a?.absUrl("href")?.ifBlank { null }

        val sub = row.nextElementSibling()?.selectFirst("td.subtext")
        val points = parseInt(sub?.selectFirst("span.score")?.text())
        val author = sub?.selectFirst("a.hnuser")?.text()
        val age = sub?.selectFirst("span.age a")?.text()

        // Only the "N comments" link carries a count; "discuss" and job-post links do not.
        val comments = sub?.select("a")?.lastOrNull()?.text()
            ?.takeIf { "comment" in it }?.let { parseInt(it) }

        HnStory(
            id = id,
            title = title,
            url = href,
            points = points,
            author = author,
            age = age,
            comments = comments,
            itemUrl = id?.let { "$base/item?id=$it" }
        )
    }
}

Step 4: Pagination (crawl N pages)

Hacker News paginates the front page with a ?p=N query parameter; page 1 is the bare URL.

suspend fun crawlHnFrontPages(client: HttpClient, pages: Int = 3): List<HnStory> {
    val all = mutableListOf<HnStory>()
    val seen = mutableSetOf<String>()

    for (p in 1..pages) {
        val url = if (p == 1) "https://news.ycombinator.com/" else "https://news.ycombinator.com/?p=$p"
        val html = fetchHtml(client, url)
        val batch = parseFrontPage(html)

        for (s in batch) {
            val key = s.id ?: continue
            if (key in seen) continue
            seen.add(key)
            all.add(s)
        }
    }

    return all
}
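If you crawl more than a few pages, two additions are worth a look: a pause between requests and an early stop when a page yields nothing new. The sketch below keeps that logic generic so it can be tested without network access. crawlPages and its parameters are hypothetical names, and the 1-second default pause is an arbitrary assumption.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical generic helper: walks pages 1..maxPages, de-duplicates by key,
// pauses between pages, and stops early when a page yields nothing new.
suspend fun <T> crawlPages(
    maxPages: Int,
    pauseMs: Long = 1_000L,
    keyOf: (T) -> String?,
    fetchPage: suspend (Int) -> List<T>,
): List<T> {
    val seen = mutableSetOf<String>()
    val all = mutableListOf<T>()
    for (page in 1..maxPages) {
        val fresh = fetchPage(page).filter { item ->
            val key = keyOf(item)
            key != null && seen.add(key) // add() returns false for keys already seen
        }
        if (fresh.isEmpty()) break // nothing new: likely past the last page
        all += fresh
        if (page < maxPages) delay(pauseMs)
    }
    return all
}

fun main() = runBlocking {
    // Fake page source standing in for fetchHtml + parseFrontPage.
    val pages = mapOf(1 to listOf("a", "b"), 2 to listOf("b", "c"), 3 to emptyList())
    val result = crawlPages<String>(
        maxPages = 5,
        pauseMs = 0L,
        keyOf = { it },
        fetchPage = { p -> pages[p].orEmpty() },
    )
    println(result) // [a, b, c]
}
```

In the real crawler, fetchPage would call fetchHtml and parseFrontPage, and keyOf would return the story id.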

Step 5: Export JSON (kotlinx serialization)

import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import java.nio.file.Files
import java.nio.file.Path

fun writeJson(path: Path, stories: List<HnStory>) {
    val json = Json { prettyPrint = true }
    Files.writeString(path, json.encodeToString(stories))
}

Step 6: Where ProxiesAPI fits (fetch layer only)

If you’re crawling a lot of pages, you’ll hit:

intermittent blocks
inconsistent responses
timeouts

With ProxiesAPI you fetch through a wrapper URL, but you still parse the same HTML.

import java.net.URLEncoder

fun proxiesapiWrap(targetUrl: String, apiKey: String): String {
    val base = "http://api.proxiesapi.com/"
    val encodedUrl = URLEncoder.encode(targetUrl, "UTF-8")
    return "$base?key=$apiKey&url=$encodedUrl"
}
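The encoding step matters because the target URL may carry its own query string. Left unencoded, an & in the target would be read as a separator in the wrapper URL. A small sketch of what the encoder produces (encodeQueryValue is a hypothetical helper name, and the example.com URL is illustrative):

```kotlin
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Encodes a value for safe use inside a query string (form encoding, UTF-8).
fun encodeQueryValue(value: String): String =
    URLEncoder.encode(value, StandardCharsets.UTF_8)

fun main() {
    val target = "https://example.com/search?q=kotlin&page=2"
    println(encodeQueryValue(target))
    // https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dkotlin%26page%3D2
}
```

The Charset overload used here requires Java 10 or newer.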

Use it like:

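import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val apiKey = "API_KEY" // placeholder: load the real key from configuration
    val target = "https://news.ycombinator.com/"
    httpClient().use { client -> // close the client when done
        val html = fetchHtml(client, proxiesapiWrap(target, apiKey))
        val stories = parseFrontPage(html)
        println("Parsed ${stories.size} stories")
    }
}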

Notice the win: only the URL passed to Ktor changes; the Jsoup selectors do not.

Practical advice for Kotlin scrapers

Problem	What to do
Selectors break	Save sample HTML and update Jsoup selectors
Requests hang	Add timeouts + retries (don’t rely on defaults)
Duplicates across pages	Use a stable key (id/url) and a `seen` set
Getting blocked	Slow down + add fetch-layer resilience (ProxiesAPI)

If you want to scale further, the next “real” steps are:

queue + concurrency limits
structured logging (per URL)
persistent storage (SQLite/Postgres)
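As a starting point for the concurrency-limit step, coroutines plus a semaphore go a long way. The sketch below is one possible shape, not a full queue. mapWithLimit is a hypothetical helper, and the lambda stands in for fetchHtml so the example runs without network access.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

// Hypothetical helper: run `block` for every item with at most `limit` calls in flight.
suspend fun <T, R> mapWithLimit(items: List<T>, limit: Int, block: suspend (T) -> R): List<R> =
    coroutineScope {
        val gate = Semaphore(limit)
        items.map { item ->
            async { gate.withPermit { block(item) } }
        }.awaitAll() // results keep the order of `items`
    }

fun main() = runBlocking {
    val urls = (1..5).map { "https://news.ycombinator.com/?p=$it" }
    // Stand-in for fetchHtml: returns the URL length instead of doing network I/O.
    val sizes = mapWithLimit(urls, limit = 2) { url -> url.length }
    println(sizes)
}
```

A low limit (2 to 4) is a reasonable first guess for a single site; raise it only after checking error rates.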
