Web Scraping with Go (Colly Framework): Complete Guide
Go is an underrated scraping language.
If you care about:
- performance
- stable binaries
- predictable concurrency
- low memory overhead
…Go + Colly is a great stack.
In this guide, you’ll build a complete Colly scraper that includes what real scraping jobs need:
- clean selectors
- concurrency control
- rate limiting
- retries + backoff
- pagination
- exporting data to JSON + CSV
- a practical pattern to route requests through ProxiesAPI
Go scrapers can run fast—and fast scrapers hit network failures faster. ProxiesAPI gives you a simple proxy route you can plug into your Go request layer to improve crawl stability as you scale.
Why Colly is a good default
Colly is a fast HTTP crawling framework for Go.
It’s good for server-rendered HTML sites because:
- it handles request callbacks cleanly
- it supports rate limiting
- it can run many requests concurrently
It’s not for heavy JS rendering.
For JS-heavy targets, use Playwright (or a first-party API).
Setup
Create a new Go module:
```shell
mkdir go-colly-scraper
cd go-colly-scraper
go mod init example.com/scraper
```
Install Colly:
```shell
go get github.com/gocolly/colly/v2
```
The scraper we’ll build
We’ll build a crawler that:
- starts from a listing page
- extracts items (title + URL)
- follows pagination
- exports results
To keep this tutorial runnable, we’ll use a friendly target:
- Hacker News:
https://news.ycombinator.com/
The patterns apply to any target.
Step 1: A robust Colly collector (timeouts, UA, allowed domains)
```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"log"
	"os"
	"strings"
	"sync"
	"time"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

type Row struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

func mustEnv(key string) string {
	v := strings.TrimSpace(os.Getenv(key))
	if v == "" {
		log.Fatalf("missing env var %s", key)
	}
	return v
}

func main() {
	start := "https://news.ycombinator.com/"
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
		colly.MaxDepth(4),
		colly.Async(true), // required for Parallelism below to take effect
	)
	// Good defaults
	extensions.RandomUserAgent(c)
	extensions.Referer(c)
	c.SetRequestTimeout(45 * time.Second)
	// Rate limit: 2 workers, ~0.9-1.3s spacing per worker
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       900 * time.Millisecond,
		RandomDelay: 400 * time.Millisecond,
	}); err != nil {
		log.Fatal(err)
	}
	// Collect results (mutex-guarded: callbacks run concurrently in async mode)
	var mu sync.Mutex
	rows := make([]Row, 0, 200)
	seen := make(map[string]bool)
	c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
		a := e.DOM.Find("span.titleline > a")
		href, _ := a.Attr("href")
		title := strings.TrimSpace(a.Text())
		if href == "" || title == "" {
			return
		}
		// HN story links are usually absolute; for relative hrefs
		// use e.Request.AbsoluteURL (or url.Parse + ResolveReference)
		u := href
		mu.Lock()
		defer mu.Unlock()
		if seen[u] {
			return
		}
		seen[u] = true
		rows = append(rows, Row{Title: title, URL: u})
	})
	// Pagination: follow the "More" link
	c.OnHTML("a.morelink", func(e *colly.HTMLElement) {
		next := e.Request.AbsoluteURL(e.Attr("href"))
		if next != "" {
			_ = e.Request.Visit(next)
		}
	})
	c.OnRequest(func(r *colly.Request) {
		log.Printf("fetch %s", r.URL.String())
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Printf("error %s: %v", r.Request.URL.String(), err)
	})
	if err := c.Visit(start); err != nil {
		log.Fatal(err)
	}
	c.Wait()
	if len(rows) == 0 {
		log.Fatal("no rows scraped")
	}
	// Export
	mustWriteJSON(rows, "output.json")
	mustWriteCSV(rows, "output.csv")
	log.Printf("done rows=%d", len(rows))
}

func mustWriteJSON(rows []Row, path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	enc := json.NewEncoder(f)
	enc.SetIndent("", "  ")
	if err := enc.Encode(rows); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %s", path)
}

func mustWriteCSV(rows []Row, path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := csv.NewWriter(f)
	defer w.Flush()
	_ = w.Write([]string{"title", "url"})
	for _, r := range rows {
		_ = w.Write([]string{r.Title, r.URL})
	}
	log.Printf("wrote %s", path)
}
```
This is already a practical scraper: it fetches pages, extracts rows, paginates, and exports.
Step 2: Concurrency + rate limiting (the part most people get wrong)
Go makes it tempting to go fast.
But for scraping, “fast” without constraints becomes:
- timeouts
- 429 rate limiting
- blocks
Use Colly’s LimitRule to cap:
- Parallelism
- Delay
- RandomDelay
A good starting point for most sites is:
- Parallelism: 2–4
- Delay: 500ms–1500ms
Then monitor success rate.
Step 3: Retries with backoff (Colly pattern)
Colly doesn’t do automatic retries by default. A practical approach is:
- in OnError, re-queue the URL with a retry counter
Here’s a simple version using the request context:
```go
c.OnError(func(r *colly.Response, err error) {
	retriesAny := r.Ctx.GetAny("retries")
	retries := 0
	if retriesAny != nil {
		retries = retriesAny.(int)
	}
	if retries >= 3 {
		log.Printf("give up %s: %v", r.Request.URL, err)
		return
	}
	retries++
	r.Ctx.Put("retries", retries)
	backoff := time.Duration(1<<retries) * time.Second
	log.Printf("retry %s in %s (attempt %d)", r.Request.URL, backoff, retries)
	time.Sleep(backoff)
	_ = r.Request.Retry()
})
```
This is intentionally simple. In production you’ll also separate:
- retryable errors (timeouts, 502/503)
- non-retryable errors (404)
Step 4: ProxiesAPI integration (pattern)
ProxiesAPI works by calling:
```
http://api.proxiesapi.com/?auth_key=KEY&url=TARGET
```
Colly has proxy support, but ProxiesAPI isn’t a “proxy server” URL in the traditional sense—it’s an HTTP API that fetches the target for you.
So the simplest pattern in Go is:
- if useProxiesApi is on, rewrite the request URL to the ProxiesAPI endpoint
- keep the original target URL in context (for logging/dedup)
Example helper (uses net/url from the standard library):

```go
func proxiesApiFetchURL(authKey string, target string) string {
	u, _ := url.Parse("http://api.proxiesapi.com/")
	q := u.Query()
	q.Set("auth_key", authKey)
	q.Set("url", target)
	u.RawQuery = q.Encode()
	return u.String()
}
```
Then, before visiting:
```go
authKey := mustEnv("PROXIESAPI_KEY")
useProxies := true

visit := func(target string) error {
	if !useProxies {
		return c.Visit(target)
	}
	fetchURL := proxiesApiFetchURL(authKey, target)
	reqCtx := colly.NewContext()
	reqCtx.Put("original_url", target)
	return c.Request("GET", fetchURL, nil, reqCtx, nil)
}
```
And update your logs:
```go
c.OnRequest(func(r *colly.Request) {
	orig := r.Ctx.Get("original_url")
	if orig != "" {
		log.Printf("fetch via ProxiesAPI orig=%s", orig)
	} else {
		log.Printf("fetch %s", r.URL.String())
	}
})
```
This approach is honest and explicit: you’re not pretending ProxiesAPI is a normal proxy; you’re using it as a fetch layer.
Comparison: Go scraping options (2026)
| Library | Best for | Pros | Cons |
|---|---|---|---|
| Colly | HTML crawling | Fast, clean callbacks | Not for JS rendering |
| goquery + net/http | Custom parsers | Minimal deps | You build crawler tooling |
| Playwright (Go) | JS-heavy sites | Accurate rendering | Heavier, slower |
Practical advice for production scrapers
- Log every request (URL + status + latency)
- Store a “seen” set (don’t rescrape same URLs)
- Persist checkpoints (SQLite works well)
- Backpressure over speed (rate limiting beats bans)
- Separate fetch vs parse (makes debugging 10x easier)
QA checklist
- Your collector has timeouts
- Rate limit is set (parallelism + delays)
- Pagination works and increases row count
- Exported JSON/CSV files open correctly
- Retry logic doesn’t loop infinitely
That speed cuts both ways: the faster you crawl, the sooner you hit network failures. Routing requests through ProxiesAPI, as shown above, is a simple way to keep crawls stable as you scale.