Web Scraping with Go (Colly Framework): Complete Guide

Go is an underrated scraping language.

If you care about:

  • performance
  • stable binaries
  • predictable concurrency
  • low memory overhead

Go + Colly is a great stack.

In this guide, you’ll build a complete Colly scraper that includes what real scraping jobs need:

  • clean selectors
  • concurrency control
  • rate limiting
  • retries + backoff
  • pagination
  • exporting data to JSON + CSV
  • a practical pattern to route requests through ProxiesAPI

Make your Go scraper more reliable with ProxiesAPI

Go scrapers can run fast—and fast scrapers hit network failures faster. ProxiesAPI gives you a simple proxy route you can plug into your Go request layer to improve crawl stability as you scale.


Why Colly is a good default

Colly is a fast HTTP crawling framework for Go.

It’s good for server-rendered HTML sites because:

  • it handles request callbacks cleanly
  • it supports rate limiting
  • it can run many requests concurrently (when the collector is created with colly.Async(true))

It’s not suited to pages that depend on heavy JavaScript rendering: Colly fetches and parses HTML but does not execute scripts.

For JS-heavy targets, use Playwright (or a first-party API).


Setup

Create a new Go module:

mkdir go-colly-scraper
cd go-colly-scraper
go mod init example.com/scraper

Install Colly:

go get github.com/gocolly/colly/v2

The scraper we’ll build

We’ll build a crawler that:

  1. starts from a listing page
  2. extracts items (title + URL)
  3. follows pagination
  4. exports results

To keep this tutorial runnable, we’ll use a friendly target:

  • Hacker News: https://news.ycombinator.com/

The patterns apply to any target.


Step 1: A robust Colly collector (timeouts, UA, allowed domains)

package main

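import (
  "encoding/csv"
  "encoding/json"
  "log"
  "os"
  "strings"
  "time"

  // Go refuses to compile unused imports, so only what this file uses is listed.
  // Add "net/url" when you add the ProxiesAPI helper from Step 4.
  "github.com/gocolly/colly/v2"
  "github.com/gocolly/colly/v2/extensions"
)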

type Row struct {
  Title string `json:"title"`
  URL   string `json:"url"`
}

func mustEnv(key string) string {
  v := strings.TrimSpace(os.Getenv(key))
  if v == "" {
    log.Fatalf("missing env var %s", key)
  }
  return v
}

func main() {
  start := "https://news.ycombinator.com/"

  c := colly.NewCollector(
    colly.AllowedDomains("news.ycombinator.com"),
    colly.MaxDepth(4), // each "More" hop adds one level, so this caps the crawl at 4 pages
  )

  // Good defaults
  extensions.RandomUserAgent(c)
  extensions.Referer(c)

  c.SetRequestTimeout(45 * time.Second)

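  // Rate limit: wait 900ms plus up to 400ms of random jitter between requests.
  // Parallelism only takes effect in async mode (colly.Async(true)).
  if err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       900 * time.Millisecond,
    RandomDelay: 400 * time.Millisecond,
  }); err != nil {
    log.Fatal(err)
  }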

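  // Collect results. A plain slice and map are fine because this collector is synchronous;
  // guard them with a sync.Mutex if you switch to colly.Async(true).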
  rows := make([]Row, 0, 200)
  seen := make(map[string]bool)

  c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
    a := e.DOM.Find("span.titleline > a")
    href, _ := a.Attr("href")
    title := strings.TrimSpace(a.Text())
    if href == "" || title == "" {
      return
    }

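    // Most HN story links are absolute, but self posts (Ask HN, etc.) use relative
    // hrefs such as "item?id=..."; AbsoluteURL resolves both against the page URL.
    u := e.Request.AbsoluteURL(href)
    if u == "" || seen[u] {
      return
    }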
    seen[u] = true

    rows = append(rows, Row{Title: title, URL: u})
  })

  // Pagination: follow the "More" link
  c.OnHTML("a.morelink", func(e *colly.HTMLElement) {
    next := e.Request.AbsoluteURL(e.Attr("href"))
    if next != "" {
      _ = e.Request.Visit(next)
    }
  })

  c.OnRequest(func(r *colly.Request) {
    log.Printf("fetch %s", r.URL.String())
  })

  c.OnError(func(r *colly.Response, err error) {
    log.Printf("error %s: %v", r.Request.URL.String(), err)
  })

  if err := c.Visit(start); err != nil {
    log.Fatal(err)
  }
  c.Wait()

  if len(rows) == 0 {
    log.Fatal("no rows scraped")
  }

  // Export
  mustWriteJSON(rows, "output.json")
  mustWriteCSV(rows, "output.csv")

  log.Printf("done rows=%d", len(rows))
}

func mustWriteJSON(rows []Row, path string) {
  f, err := os.Create(path)
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()

  enc := json.NewEncoder(f)
  enc.SetIndent("", "  ")
  if err := enc.Encode(rows); err != nil {
    log.Fatal(err)
  }

  log.Printf("wrote %s", path)
}

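func mustWriteCSV(rows []Row, path string) {
  f, err := os.Create(path)
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()

  w := csv.NewWriter(f)
  if err := w.Write([]string{"title", "url"}); err != nil {
    log.Fatal(err)
  }
  for _, r := range rows {
    if err := w.Write([]string{r.Title, r.URL}); err != nil {
      log.Fatal(err)
    }
  }

  // Flush buffered rows and surface any write error before reporting success.
  w.Flush()
  if err := w.Error(); err != nil {
    log.Fatal(err)
  }

  log.Printf("wrote %s", path)
}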

This is already a practical scraper: it fetches pages, extracts rows, paginates, and exports.
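
The seen map above is keyed on the URL string exactly as it appears in the page, so two spellings of the same link count as different entries. A minimal sketch of a normalization step follows. The function name normalizeURL and the specific rules (lowercase host, drop the fragment, sort query keys) are assumptions for illustration, not something Colly does for you, and some sites will need different rules.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL returns a canonical form of raw for use as a dedup key.
// The rules are illustrative assumptions: lowercase the host, drop the
// fragment, and sort query parameters by key.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.RawFragment = ""
	u.RawQuery = u.Query().Encode() // Encode emits keys in sorted order
	return u.String(), nil
}

func main() {
	seen := make(map[string]bool)
	links := []string{
		"https://News.YCombinator.com/item?id=1#comments",
		"https://news.ycombinator.com/item?id=1",
	}
	for _, raw := range links {
		key, err := normalizeURL(raw)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		if seen[key] {
			fmt.Println("duplicate:", raw)
			continue
		}
		seen[key] = true
		fmt.Println("new:", key)
	}
}
```

In the scraper you would call this on u before checking seen[u].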


Step 2: Concurrency + rate limiting (the part most people get wrong)

Go makes it tempting to go fast.

But for scraping, “fast” without constraints becomes:

  • timeouts
  • 429 rate limiting
  • blocks

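Use Colly’s LimitRule to cap:

  • Parallelism: the maximum number of concurrent requests to matching domains. It only takes effect when the collector is created with colly.Async(true); the Step 1 collector is synchronous and fetches one page at a time.
  • Delay: the fixed wait between requests to matching domains
  • RandomDelay: an extra random wait, up to the given duration, added on top of Delay

A good starting point for most sites is:

  • Parallelism: 2–4
  • Delay: 500ms–1500ms

Then monitor the success rate (the share of responses that come back 2xx) and watch for 429s and timeouts. Loosen the limits only while that rate stays high.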
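
To reason about what a given rule allows, a rough estimate helps. The sketch below assumes each parallel slot spends the response latency, then Delay, then on average half of RandomDelay on every request. That is a simplified model of the limiter, not a guarantee from Colly. The limits type, the step1Limits values, and the 300ms latency are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// limits mirrors the LimitRule fields discussed above, plus an assumed
// average response latency for the target site.
type limits struct {
	Parallelism int
	Delay       time.Duration
	RandomDelay time.Duration
	Latency     time.Duration // assumption: measured or guessed per target
}

// ratePerSecond estimates steady-state requests per second under a simple
// model: each slot handles one request every Latency + Delay + RandomDelay/2.
func (l limits) ratePerSecond() float64 {
	perRequest := l.Latency + l.Delay + l.RandomDelay/2
	if l.Parallelism <= 0 || perRequest <= 0 {
		return 0
	}
	return float64(l.Parallelism) / perRequest.Seconds()
}

// step1Limits uses the Step 1 rule values with an assumed 300ms latency.
var step1Limits = limits{
	Parallelism: 2,
	Delay:       900 * time.Millisecond,
	RandomDelay: 400 * time.Millisecond,
	Latency:     300 * time.Millisecond,
}

func main() {
	fmt.Printf("estimated ceiling: %.2f requests/sec\n", step1Limits.ratePerSecond())
}
```

With a synchronous collector the effective Parallelism is 1, so the estimate halves. Treat the number as a ceiling to compare settings, not as a measurement.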


Step 3: Retries with backoff (Colly pattern)

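Colly doesn’t do automatic retries by default. A practical approach is:

  • in OnError, retry the failed request and track the attempt count

Here’s a simple approach using the request context. Register it in place of the logging-only OnError handler from Step 1, since Colly runs every registered OnError callback: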

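c.OnError(func(r *colly.Response, err error) {
  // The context travels with the request, so the counter survives Retry().
  // A missing or non-int value leaves retries at 0.
  retries, _ := r.Ctx.GetAny("retries").(int)

  if retries >= 3 {
    log.Printf("give up %s: %v", r.Request.URL, err)
    return
  }

  retries++
  r.Ctx.Put("retries", retries)

  // Exponential backoff: 2s, 4s, 8s. Sleeping here blocks this callback.
  backoff := time.Duration(1<<retries) * time.Second
  log.Printf("retry %s in %s (attempt %d)", r.Request.URL, backoff, retries)
  time.Sleep(backoff)

  if retryErr := r.Request.Retry(); retryErr != nil {
    log.Printf("retry failed %s: %v", r.Request.URL, retryErr)
  }
})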

This is intentionally simple. In production you’ll also separate:

  • retryable errors (timeouts, 502/503)
  • non-retryable errors (404)
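
A minimal sketch of that split, using only the standard library. The names isRetryable and backoffDelay are hypothetical. Treating 429 and 504 as retryable and capping the wait at 30 seconds are assumptions beyond what the list above states.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

const (
	baseBackoff = time.Second      // first retry waits 2 * baseBackoff
	maxBackoff  = 30 * time.Second // assumed cap so waits stay bounded
)

// isRetryable reports whether a failed fetch is worth retrying.
// Timeouts and a few 5xx/429 statuses are retried; everything else
// (for example 404) is treated as permanent.
func isRetryable(status int, err error) bool {
	var netErr net.Error
	if errors.Is(err, context.DeadlineExceeded) ||
		(errors.As(err, &netErr) && netErr.Timeout()) {
		return true
	}
	switch status {
	case http.StatusTooManyRequests, // 429 (assumption)
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504 (assumption)
		return true
	}
	return false
}

// backoffDelay returns baseBackoff * 2^attempt, capped at maxBackoff.
// Attempts below 1 are treated as 1.
func backoffDelay(attempt int) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	if attempt > 10 { // avoid shifting into overflow territory
		return maxBackoff
	}
	d := baseBackoff << uint(attempt)
	if d > maxBackoff {
		return maxBackoff
	}
	return d
}

func main() {
	for _, status := range []int{404, 502, 503} {
		fmt.Printf("status %d retryable=%v\n", status, isRetryable(status, nil))
	}
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d waits %s\n", attempt, backoffDelay(attempt))
	}
}
```

Inside the OnError callback you would pass r.StatusCode and err to isRetryable and return early when it reports false.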

Step 4: ProxiesAPI integration (pattern)

ProxiesAPI works by calling:

http://api.proxiesapi.com/?auth_key=KEY&url=TARGET

Colly has proxy support, but ProxiesAPI isn’t a “proxy server” URL in the traditional sense. It’s an HTTP API that fetches the target for you.

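So the simplest pattern in Go is:

  • if useProxies is on, rewrite the request URL to the ProxiesAPI endpoint
  • keep the original target URL in context (for logging/dedup)

Two things change once requests go to the API host. colly.AllowedDomains must also list api.proxiesapi.com, otherwise Colly rejects the request as a forbidden domain. And relative links, including the "More" link, must be resolved against the original target URL and visited through the same wrapper, because e.Request.URL now points at the API.

Example helper (it uses "net/url"):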

func proxiesApiFetchURL(authKey string, target string) string {
  u, _ := url.Parse("http://api.proxiesapi.com/")
  q := u.Query()
  q.Set("auth_key", authKey)
  q.Set("url", target)
  u.RawQuery = q.Encode()
  return u.String()
}

Then, in main, define a visit helper and call visit(start) instead of c.Visit(start):

authKey := mustEnv("PROXIESAPI_KEY")
useProxies := true

visit := func(target string) error {
  if !useProxies {
    return c.Visit(target)
  }

  fetchURL := proxiesApiFetchURL(authKey, target)
  reqCtx := colly.NewContext()
  reqCtx.Put("original_url", target)

  return c.Request("GET", fetchURL, nil, reqCtx, nil)
}

And update your logs:

c.OnRequest(func(r *colly.Request) {
  orig := r.Ctx.Get("original_url")
  if orig != "" {
    log.Printf("fetch via ProxiesAPI orig=%s", orig)
  } else {
    log.Printf("fetch %s", r.URL.String())
  }
})

This approach is honest and explicit: you’re not pretending ProxiesAPI is a normal proxy; you’re using it as a fetch layer.
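
One consequence of the fetch-layer pattern is that the request URL Colly sees is the API endpoint, so relative links cannot be resolved against it. The sketch below shows one way to handle a pagination link: resolve the href against the original target, then wrap the result again. The function names wrapTarget and nextFetchURL are hypothetical. The endpoint and query parameters are the ones given above.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// Endpoint as described in this section.
const proxiesAPIEndpoint = "http://api.proxiesapi.com/"

// wrapTarget builds the API URL that fetches target on our behalf.
func wrapTarget(authKey, target string) string {
	q := url.Values{}
	q.Set("auth_key", authKey)
	q.Set("url", target)
	return proxiesAPIEndpoint + "?" + q.Encode()
}

// nextFetchURL resolves href against the ORIGINAL page URL (not the API
// URL) and returns both the resolved target and its wrapped fetch URL.
func nextFetchURL(authKey, originalURL, href string) (target, fetchURL string, err error) {
	base, err := url.Parse(originalURL)
	if err != nil {
		return "", "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", "", err
	}
	target = base.ResolveReference(ref).String()
	return target, wrapTarget(authKey, target), nil
}

func main() {
	// "KEY" is a placeholder; "?p=2" is the shape of HN's "More" link.
	target, fetchURL, err := nextFetchURL("KEY", "https://news.ycombinator.com/", "?p=2")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("target:", target)
	fmt.Println("fetch: ", fetchURL)
}
```

In the pagination callback you would read original_url from e.Request.Ctx, pass it with the href to nextFetchURL, store the returned target as the new original_url, and request the returned fetch URL.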


Comparison: Go scraping options (2026)

Library            | Best for       | Pros                  | Cons
Colly              | HTML crawling  | Fast, clean callbacks | Not for JS rendering
goquery + net/http | Custom parsers | Minimal deps          | You build crawler tooling
Playwright (Go)    | JS-heavy sites | Accurate rendering    | Heavier, slower

Practical advice for production scrapers

  1. Log every request (URL + status + latency)
  2. Store a “seen” set (don’t rescrape same URLs)
  3. Persist checkpoints (SQLite works well)
  4. Backpressure over speed (rate limiting beats bans)
  5. Separate fetch vs parse (makes debugging 10x easier)
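
For item 1, one option is to log at the transport layer so that every request is covered regardless of which callback issued it. The sketch below wraps an http.RoundTripper and records URL, status, and latency. The names loggingTransport, requestLog, and stubTransport are hypothetical, and the stub stands in for the network so the example runs offline. Colly's collector has a WithTransport method that accepts an http.RoundTripper, so a wrapper like this could be installed there around http.DefaultTransport.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// requestLog is one log record per HTTP round trip.
type requestLog struct {
	URL     string
	Status  int // 0 when the request failed before a response arrived
	Latency time.Duration
}

// loggingTransport times each round trip and hands a record to record().
type loggingTransport struct {
	next   http.RoundTripper
	record func(requestLog)
}

func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)
	entry := requestLog{URL: req.URL.String(), Latency: time.Since(start)}
	if resp != nil {
		entry.Status = resp.StatusCode
	}
	t.record(entry)
	return resp, err
}

// stubTransport is a stand-in for the network: it answers every request
// with an empty 200 response. Use http.DefaultTransport in a real scraper.
type stubTransport struct{}

func (stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// runDemo performs one GET through the logging transport and returns
// the record it produced.
func runDemo() (requestLog, error) {
	var got requestLog
	client := &http.Client{
		Transport: &loggingTransport{
			next:   stubTransport{},
			record: func(e requestLog) { got = e },
		},
	}
	resp, err := client.Get("http://example.test/list")
	if err != nil {
		return got, err
	}
	resp.Body.Close()
	return got, nil
}

func main() {
	e, err := runDemo()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("url=%s status=%d latency=%s", e.URL, e.Status, e.Latency)
}
```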

QA checklist

  • Your collector has timeouts
  • Rate limit is set (parallelism + delays)
  • Pagination works and increases row count
  • Exported JSON/CSV files open correctly
  • Retry logic doesn’t loop infinitely