Web Scraping with Go (Colly Framework): Complete Guide

Go is an underrated scraping language.

If you care about:

  • performance
  • stable binaries
  • predictable concurrency
  • low memory overhead

Go + Colly is a great stack.

In this guide, you’ll build a complete Colly scraper that includes what real scraping jobs need:

  • clean selectors
  • concurrency control
  • rate limiting
  • retries + backoff
  • pagination
  • exporting data to JSON + CSV
  • a practical pattern to route requests through ProxiesAPI

Make your Go scraper more reliable with ProxiesAPI

Go scrapers can run fast—and fast scrapers hit network failures faster. ProxiesAPI gives you a simple proxy route you can plug into your Go request layer to improve crawl stability as you scale.


Why Colly is a good default

Colly is a fast HTTP crawling framework for Go.

It’s good for server-rendered HTML sites because:

  • it handles request callbacks cleanly
  • it supports rate limiting
  • it can run many requests concurrently

It’s not for:

  • heavy JS rendering

For JS-heavy targets, use Playwright (or a first-party API).


Setup

Create a new Go module:

mkdir go-colly-scraper
cd go-colly-scraper
go mod init example.com/scraper

Install Colly:

go get github.com/gocolly/colly/v2

The scraper we’ll build

We’ll build a crawler that:

  1. starts from a listing page
  2. extracts items (title + URL)
  3. follows pagination
  4. exports results

To keep this tutorial runnable, we’ll use a friendly target:

  • Hacker News: https://news.ycombinator.com/

The patterns apply to any target.


Step 1: A robust Colly collector (timeouts, UA, allowed domains)

package main

import (
  "encoding/csv"
  "encoding/json"
  "log"
  "os"
  "strings"
  "time"

  "github.com/gocolly/colly/v2"
  "github.com/gocolly/colly/v2/extensions"
)

type Row struct {
  Title string `json:"title"`
  URL   string `json:"url"`
}

func mustEnv(key string) string {
  v := strings.TrimSpace(os.Getenv(key))
  if v == "" {
    log.Fatalf("missing env var %s", key)
  }
  return v
}

func main() {
  start := "https://news.ycombinator.com/"

  c := colly.NewCollector(
    colly.AllowedDomains("news.ycombinator.com"),
    colly.MaxDepth(4),
    colly.Async(true), // run requests concurrently; pairs with c.Wait() below
  )

  // Good defaults
  extensions.RandomUserAgent(c)
  extensions.Referer(c)

  c.SetRequestTimeout(45 * time.Second)

  // Rate limit: roughly 1 request/sec with small jitter
  if err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       900 * time.Millisecond,
    RandomDelay: 400 * time.Millisecond,
  }); err != nil {
    log.Fatal(err)
  }

  // Collect results
  rows := make([]Row, 0, 200)
  seen := make(map[string]bool)

  c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
    a := e.DOM.Find("span.titleline > a")
    href, _ := a.Attr("href")
    title := strings.TrimSpace(a.Text())
    if href == "" || title == "" {
      return
    }

    // Resolve relative links (HN self-posts use "item?id=..."); external links are already absolute
    u := e.Request.AbsoluteURL(href)
    if seen[u] {
      return
    }
    seen[u] = true

    rows = append(rows, Row{Title: title, URL: u})
  })

  // Pagination: follow the "More" link
  c.OnHTML("a.morelink", func(e *colly.HTMLElement) {
    next := e.Request.AbsoluteURL(e.Attr("href"))
    if next != "" {
      _ = e.Request.Visit(next)
    }
  })

  c.OnRequest(func(r *colly.Request) {
    log.Printf("fetch %s", r.URL.String())
  })

  c.OnError(func(r *colly.Response, err error) {
    log.Printf("error %s: %v", r.Request.URL.String(), err)
  })

  if err := c.Visit(start); err != nil {
    log.Fatal(err)
  }
  c.Wait()

  if len(rows) == 0 {
    log.Fatal("no rows scraped")
  }

  // Export
  mustWriteJSON(rows, "output.json")
  mustWriteCSV(rows, "output.csv")

  log.Printf("done rows=%d", len(rows))
}

func mustWriteJSON(rows []Row, path string) {
  f, err := os.Create(path)
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()

  enc := json.NewEncoder(f)
  enc.SetIndent("", "  ")
  if err := enc.Encode(rows); err != nil {
    log.Fatal(err)
  }

  log.Printf("wrote %s", path)
}

func mustWriteCSV(rows []Row, path string) {
  f, err := os.Create(path)
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()

  w := csv.NewWriter(f)
  defer w.Flush()

  _ = w.Write([]string{"title", "url"})
  for _, r := range rows {
    _ = w.Write([]string{r.Title, r.URL})
  }

  log.Printf("wrote %s", path)
}

This is already a practical scraper: it fetches pages, extracts rows, paginates, and exports.


Step 2: Concurrency + rate limiting (the part most people get wrong)

Go makes it tempting to go fast.

But for scraping, “fast” without constraints becomes:

  • timeouts
  • 429 (rate-limited) responses
  • IP blocks

Use Colly’s LimitRule to cap:

  • Parallelism
  • Delay
  • RandomDelay

A good starting point for most sites is:

  • Parallelism: 2–4
  • Delay: 500ms–1500ms

Then monitor success rate.


Step 3: Retries with backoff (Colly pattern)

Colly doesn’t retry failed requests automatically. A practical approach:

  • in OnError, retry the request with an attempt counter stored in the request context

Here’s a simple version:

c.OnError(func(r *colly.Response, err error) {
  retriesAny := r.Ctx.GetAny("retries")
  retries := 0
  if retriesAny != nil {
    retries = retriesAny.(int)
  }

  if retries >= 3 {
    log.Printf("give up %s: %v", r.Request.URL, err)
    return
  }

  retries++
  r.Ctx.Put("retries", retries)

  // Exponential backoff: 2s, 4s, 8s (this sleeps the worker goroutine,
  // which is fine for small crawls)
  backoff := time.Duration(1<<retries) * time.Second
  log.Printf("retry %s in %s (attempt %d)", r.Request.URL, backoff, retries)
  time.Sleep(backoff)

  _ = r.Request.Retry()
})

This is intentionally simple. In production you’ll also separate:

  • retryable errors (timeouts, 502/503)
  • non-retryable errors (404)
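That split can be made explicit with a small classifier. The status codes below are a reasonable default, not a universal rule; adapt them to your targets (network errors, where you get no status code, are usually retryable too):

```go
package main

import "fmt"

// shouldRetry reports whether a response status code is worth retrying.
// 5xx and 429 are usually transient; most 4xx codes are permanent.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode == 429: // rate limited: back off and retry
		return true
	case statusCode >= 500 && statusCode <= 599: // server-side failure
		return true
	default: // 404, 403, etc. won't improve on retry
		return false
	}
}

func main() {
	for _, code := range []int{200, 404, 429, 503} {
		fmt.Printf("%d retry=%v\n", code, shouldRetry(code))
	}
}
```

In the OnError handler above, you’d check `r.StatusCode` against this before re-queuing.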

Step 4: ProxiesAPI integration (pattern)

ProxiesAPI works by calling:

http://api.proxiesapi.com/?auth_key=KEY&url=TARGET

Colly has proxy support, but ProxiesAPI isn’t a “proxy server” URL in the traditional sense—it’s an HTTP API that fetches the target for you.

So the simplest pattern in Go is:

  • if useProxiesApi is on, rewrite the request URL to the ProxiesAPI endpoint
  • keep the original target URL in context (for logging/dedup)
  • add api.proxiesapi.com to your collector’s AllowedDomains (otherwise Colly will refuse the rewritten request)
Example helper:

func proxiesApiFetchURL(authKey string, target string) string {
  u, _ := url.Parse("http://api.proxiesapi.com/")
  q := u.Query()
  q.Set("auth_key", authKey)
  q.Set("url", target)
  u.RawQuery = q.Encode()
  return u.String()
}

Then, before visiting:

authKey := mustEnv("PROXIESAPI_KEY")
useProxies := true

visit := func(target string) error {
  if !useProxies {
    return c.Visit(target)
  }

  fetchURL := proxiesApiFetchURL(authKey, target)
  reqCtx := colly.NewContext()
  reqCtx.Put("original_url", target)

  return c.Request("GET", fetchURL, nil, reqCtx, nil)
}

And update your logs:

c.OnRequest(func(r *colly.Request) {
  orig := r.Ctx.Get("original_url")
  if orig != "" {
    log.Printf("fetch via ProxiesAPI orig=%s", orig)
  } else {
    log.Printf("fetch %s", r.URL.String())
  }
})

This approach is honest and explicit: you’re not pretending ProxiesAPI is a normal proxy; you’re using it as a fetch layer.


Comparison: Go scraping options (2026)

Library            | Best for       | Pros                  | Cons
-------------------|----------------|-----------------------|---------------------------
Colly              | HTML crawling  | Fast, clean callbacks | Not for JS rendering
goquery + net/http | Custom parsers | Minimal deps          | You build crawler tooling
Playwright (Go)    | JS-heavy sites | Accurate rendering    | Heavier, slower

Practical advice for production scrapers

  1. Log every request (URL + status + latency)
  2. Store a “seen” set (don’t rescrape same URLs)
  3. Persist checkpoints (SQLite works well)
  4. Backpressure over speed (rate limiting beats bans)
  5. Separate fetch vs parse (makes debugging 10x easier)

QA checklist

  • Your collector has timeouts
  • Rate limit is set (parallelism + delays)
  • Pagination works and increases row count
  • Exported JSON/CSV files open correctly
  • Retry logic doesn’t loop infinitely
