Web Scraping with Go (Colly Framework): Complete Guide
Go is an underrated scraping language.
If you care about:
- performance
- stable binaries
- predictable concurrency
- low memory overhead
…Go + Colly is a great stack.
In this guide, you’ll build a complete Colly scraper that includes what real scraping jobs need:
- clean selectors
- concurrency control
- rate limiting
- retries + backoff
- pagination
- exporting data to JSON + CSV
- a practical pattern to route requests through ProxiesAPI
Go scrapers can run fast—and fast scrapers hit network failures faster. ProxiesAPI gives you a simple proxy route you can plug into your Go request layer to improve crawl stability as you scale.
Why Colly is a good default
Colly is a fast HTTP crawling framework for Go.
It’s good for server-rendered HTML sites because:
- it handles request callbacks cleanly
- it supports rate limiting
- it can run many requests concurrently
It’s not for heavy JS rendering.
For JS-heavy targets, use Playwright (or a first-party API).
Setup
Create a new Go module:
```shell
mkdir go-colly-scraper
cd go-colly-scraper
go mod init example.com/scraper
```
Install Colly:
```shell
go get github.com/gocolly/colly/v2
```
The scraper we’ll build
We’ll build a crawler that:
- starts from a listing page
- extracts items (title + URL)
- follows pagination
- exports results
To keep this tutorial runnable, we’ll use a friendly target:
- Hacker News:
https://news.ycombinator.com/
The patterns apply to any target.
Step 1: A robust Colly collector (timeouts, UA, allowed domains)
```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"log"
	"os"
	"strings"
	"sync"
	"time"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

type Row struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

func mustEnv(key string) string {
	v := strings.TrimSpace(os.Getenv(key))
	if v == "" {
		log.Fatalf("missing env var %s", key)
	}
	return v
}

func main() {
	start := "https://news.ycombinator.com/"
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
		colly.MaxDepth(4),
		colly.Async(true), // required for Parallelism below to take effect
	)
	// Good defaults
	extensions.RandomUserAgent(c)
	extensions.Referer(c)
	c.SetRequestTimeout(45 * time.Second)
	// Rate limit: 2 workers, ~0.9-1.3s spacing per worker
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       900 * time.Millisecond,
		RandomDelay: 400 * time.Millisecond,
	}); err != nil {
		log.Fatal(err)
	}
	// Collect results (mutex-guarded: callbacks run concurrently in async mode)
	var mu sync.Mutex
	rows := make([]Row, 0, 200)
	seen := make(map[string]bool)
	c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
		a := e.DOM.Find("span.titleline > a")
		href, _ := a.Attr("href")
		title := strings.TrimSpace(a.Text())
		if href == "" || title == "" {
			return
		}
		// HN story links are usually absolute; for relative hrefs
		// use e.Request.AbsoluteURL (or url.Parse + ResolveReference)
		u := href
		mu.Lock()
		defer mu.Unlock()
		if seen[u] {
			return
		}
		seen[u] = true
		rows = append(rows, Row{Title: title, URL: u})
	})
	// Pagination: follow the "More" link
	c.OnHTML("a.morelink", func(e *colly.HTMLElement) {
		next := e.Request.AbsoluteURL(e.Attr("href"))
		if next != "" {
			_ = e.Request.Visit(next)
		}
	})
	c.OnRequest(func(r *colly.Request) {
		log.Printf("fetch %s", r.URL.String())
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Printf("error %s: %v", r.Request.URL.String(), err)
	})
	if err := c.Visit(start); err != nil {
		log.Fatal(err)
	}
	c.Wait()
	if len(rows) == 0 {
		log.Fatal("no rows scraped")
	}
	// Export
	mustWriteJSON(rows, "output.json")
	mustWriteCSV(rows, "output.csv")
	log.Printf("done rows=%d", len(rows))
}

func mustWriteJSON(rows []Row, path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	enc := json.NewEncoder(f)
	enc.SetIndent("", "  ")
	if err := enc.Encode(rows); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %s", path)
}

func mustWriteCSV(rows []Row, path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := csv.NewWriter(f)
	defer w.Flush()
	_ = w.Write([]string{"title", "url"})
	for _, r := range rows {
		_ = w.Write([]string{r.Title, r.URL})
	}
	log.Printf("wrote %s", path)
}
```
This is already a practical scraper: it fetches pages, extracts rows, paginates, and exports.
Step 2: Concurrency + rate limiting (the part most people get wrong)
Go makes it tempting to go fast.
But for scraping, “fast” without constraints becomes:
- timeouts
- 429 rate limiting
- blocks
Use Colly’s LimitRule to cap:
- Parallelism
- Delay
- RandomDelay
A good starting point for most sites is:
- Parallelism: 2–4
- Delay: 500ms–1500ms
Then monitor success rate.
Step 3: Retries with backoff (Colly pattern)
Colly doesn’t do automatic retries by default. A practical approach is:
- in OnError, re-queue the URL with a retry counter
Here’s a simple version using the request context:
```go
c.OnError(func(r *colly.Response, err error) {
	retriesAny := r.Ctx.GetAny("retries")
	retries := 0
	if retriesAny != nil {
		retries = retriesAny.(int)
	}
	if retries >= 3 {
		log.Printf("give up %s: %v", r.Request.URL, err)
		return
	}
	retries++
	r.Ctx.Put("retries", retries)
	backoff := time.Duration(1<<retries) * time.Second
	log.Printf("retry %s in %s (attempt %d)", r.Request.URL, backoff, retries)
	time.Sleep(backoff)
	_ = r.Request.Retry()
})
```
This is intentionally simple. In production you’ll also separate:
- retryable errors (timeouts, 502/503)
- non-retryable errors (404)
Step 4: ProxiesAPI integration (pattern)
ProxiesAPI works by calling:
```
http://api.proxiesapi.com/?auth_key=KEY&url=TARGET
```
Colly has proxy support, but ProxiesAPI isn’t a “proxy server” URL in the traditional sense—it’s an HTTP API that fetches the target for you.
So the simplest pattern in Go is:
- if useProxiesApi is on, rewrite the request URL to the ProxiesAPI endpoint
- keep the original target URL in context (for logging/dedup)
Example helper (uses net/url from the standard library):

```go
func proxiesApiFetchURL(authKey string, target string) string {
	u, _ := url.Parse("http://api.proxiesapi.com/")
	q := u.Query()
	q.Set("auth_key", authKey)
	q.Set("url", target)
	u.RawQuery = q.Encode()
	return u.String()
}
```
Then, before visiting:
```go
authKey := mustEnv("PROXIESAPI_KEY")
useProxies := true

visit := func(target string) error {
	if !useProxies {
		return c.Visit(target)
	}
	fetchURL := proxiesApiFetchURL(authKey, target)
	reqCtx := colly.NewContext()
	reqCtx.Put("original_url", target)
	return c.Request("GET", fetchURL, nil, reqCtx, nil)
}
```
And update your logs:
```go
c.OnRequest(func(r *colly.Request) {
	orig := r.Ctx.Get("original_url")
	if orig != "" {
		log.Printf("fetch via ProxiesAPI orig=%s", orig)
	} else {
		log.Printf("fetch %s", r.URL.String())
	}
})
```
This approach is honest and explicit: you’re not pretending ProxiesAPI is a normal proxy; you’re using it as a fetch layer.
Comparison: Go scraping options (2026)
| Library | Best for | Pros | Cons |
|---|---|---|---|
| Colly | HTML crawling | Fast, clean callbacks | Not for JS rendering |
| goquery + net/http | Custom parsers | Minimal deps | You build crawler tooling |
| Playwright (Go) | JS-heavy sites | Accurate rendering | Heavier, slower |
Practical advice for production scrapers
- Log every request (URL + status + latency)
- Store a “seen” set (don’t rescrape same URLs)
- Persist checkpoints (SQLite works well)
- Backpressure over speed (rate limiting beats bans)
- Separate fetch vs parse (makes debugging 10x easier)
QA checklist
- Your collector has timeouts
- Rate limit is set (parallelism + delays)
- Pagination works and increases row count
- Exported JSON/CSV files open correctly
- Retry logic doesn’t loop infinitely
That speed cuts both ways: the faster you crawl, the sooner you hit network failures. Routing requests through ProxiesAPI, as shown above, is a simple way to keep crawls stable as you scale.