Web Scraping Queues: Concurrency, Retries, and Backpressure in Production
A web scraping queue is not just a list of URLs waiting for workers. In production, the queue is the control system that decides:
- how many requests run at once
- when a failed job should retry
- when workers should slow down
- how to avoid flooding one domain while another sits idle
If you skip that control system, the scraper usually fails in one of two ways:
- it overwhelms the target and gets blocked
- it overwhelms itself with retries, memory growth, or stuck workers
This guide covers the production basics: concurrency, retries, and backpressure, with practical patterns you can implement quickly.
ProxiesAPI can make outbound requests more reliable, but your queue still needs bounded concurrency, retry discipline, and backpressure. Otherwise you just fail faster.
The three jobs your scraping queue must do
At minimum, a queue for scraping should do three things well:
| Job | What it means | Failure if missing |
|---|---|---|
| schedule work | decide which URL runs next | hot targets dominate the queue |
| control concurrency | cap how many workers run together | you self-DDoS or get blocked |
| absorb failure | retry safely without retry storms | transient errors become outages |
Everything else is optional compared with those three.
Concurrency should be bounded, not "as fast as possible"
A common beginner mistake is launching a huge number of workers because the machine can handle it. Machine capacity is the wrong limit.
The real limits are:
- target-site tolerance
- proxy pool capacity
- database write throughput
- parser CPU cost
That means concurrency should be bounded globally and, ideally, per domain.
| Queue policy | Result |
|---|---|
| unlimited concurrency | bursty failures, blocks, unstable latencies |
| small fixed concurrency | stable, predictable runs |
| per-domain concurrency caps | better fairness and fewer hot-spot bans |
In practice, many scrapers become healthier when concurrency goes down, not up.
Retries should be selective
Retries are necessary, but not every error deserves one.
| Condition | Retry? | Reason |
|---|---|---|
| timeout | yes | often transient |
| 429 rate limited | yes, with longer delay | target is asking you to slow down |
| 500/502/503/504 | yes | upstream instability |
| parser bug | no | code will fail the same way again |
| hard 404 | usually no | likely permanent |
| repeated captcha/challenge page | not immediately | needs a slower policy or different routing |
This is the core principle: retry transport failures, not logic failures.
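The table above can be sketched as a small classification function. This is a minimal illustration, not a complete policy: `Decision` and `classify_failure` are hypothetical names, the function is meant to be called only for failed fetches, and challenge-page detection is left out because it depends on the target.

```python
from __future__ import annotations

from enum import Enum


class Decision(Enum):
    RETRY = "retry"            # retry with the normal backoff
    RETRY_SLOW = "retry_slow"  # retry, but with a longer delay
    DROP = "drop"              # do not retry


RETRYABLE_STATUS = {500, 502, 503, 504}


def classify_failure(status: int | None = None, error: Exception | None = None) -> Decision:
    """Map a failed fetch to a retry decision, following the table above."""
    if isinstance(error, TimeoutError):
        return Decision.RETRY
    if error is not None:
        # Any other local exception (for example a parser bug) is treated
        # as a logic failure that would fail the same way again.
        return Decision.DROP
    if status == 429:
        return Decision.RETRY_SLOW
    if status in RETRYABLE_STATUS:
        return Decision.RETRY
    # Everything else, including a hard 404, is treated as permanent.
    return Decision.DROP
```

Keeping this decision in one function makes it easy to audit which failures are allowed to consume retry budget.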
Backpressure is how the system tells itself to slow down
Backpressure means your system can detect overload and reduce the rate of new work instead of pretending everything is fine.
In scraping, overload usually shows up as:
- queue length growing faster than it drains
- worker latency climbing
- rising 429 or 5xx rates
- database writes falling behind
- proxy errors increasing
Without backpressure, operators respond by adding more workers, which often makes the incident worse.
A practical queue shape
A solid production queue often looks like this:
- producer discovers or refreshes URLs
- scheduler ranks them and pushes jobs into a work queue
- workers fetch and parse
- failures go to a retry queue with delay metadata
- poison jobs go to a dead-letter queue
This separation matters because "failed once, retry after a delay" and "failed too often, stop retrying" are different states, and neither belongs back in the main work queue immediately.
Example: bounded worker pool with retry metadata
Here is a compact Python example using asyncio to show the control flow:
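```python
from __future__ import annotations

import asyncio
import itertools
import random
from dataclasses import dataclass, field
from time import monotonic


@dataclass
class Job:
    url: str
    domain: str
    attempt: int = 1
    next_run_at: float = field(default_factory=monotonic)


MAX_CONCURRENCY = 8   # global cap: number of worker tasks
MAX_RETRIES = 4       # maximum attempts per job, including the first
PER_DOMAIN_LIMIT = 2  # in-flight fetches allowed per domain
domain_semaphores: dict[str, asyncio.Semaphore] = {}
tie_breaker = itertools.count()  # stops PriorityQueue from comparing Job objects


def domain_gate(domain: str) -> asyncio.Semaphore:
    if domain not in domain_semaphores:
        domain_semaphores[domain] = asyncio.Semaphore(PER_DOMAIN_LIMIT)
    return domain_semaphores[domain]


async def fetch(job: Job) -> str:
    # Replace with real HTTP work.
    await asyncio.sleep(0.3)
    if random.random() < 0.15:
        raise TimeoutError("transient timeout")
    return f"<html>{job.url}</html>"


async def handle(job: Job) -> None:
    async with domain_gate(job.domain):
        html = await fetch(job)
    print("fetched", job.url, "bytes", len(html))


async def worker(work_queue: asyncio.Queue, retry_queue: asyncio.PriorityQueue) -> None:
    while True:
        job = await work_queue.get()
        try:
            await handle(job)
        except TimeoutError:
            if job.attempt < MAX_RETRIES:
                # Exponential backoff capped at 60 s, plus up to 1 s of jitter.
                delay = min(2 ** job.attempt, 60) + random.random()
                retry_job = Job(job.url, job.domain, job.attempt + 1, monotonic() + delay)
                await retry_queue.put((retry_job.next_run_at, next(tie_breaker), retry_job))
            else:
                print("dead-letter", job.url)
        finally:
            work_queue.task_done()


async def retry_pump(work_queue: asyncio.Queue, retry_queue: asyncio.PriorityQueue) -> None:
    # Move each retry back to the work queue once its scheduled time arrives.
    while True:
        run_at, _, job = await retry_queue.get()
        await asyncio.sleep(max(0.0, run_at - monotonic()))
        await work_queue.put(job)
        retry_queue.task_done()


async def main(jobs: list[Job]) -> None:
    work_queue: asyncio.Queue = asyncio.Queue()
    retry_queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for job in jobs:
        work_queue.put_nowait(job)
    tasks = [asyncio.create_task(worker(work_queue, retry_queue)) for _ in range(MAX_CONCURRENCY)]
    tasks.append(asyncio.create_task(retry_pump(work_queue, retry_queue)))
    for _ in range(MAX_RETRIES):  # one pass per possible attempt
        await work_queue.join()
        await retry_queue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(main([Job(f"https://example.com/p/{i}", "example.com") for i in range(20)]))
```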
This example shows the important parts:
- bounded global concurrency
- per-domain concurrency caps
- delayed retries instead of instant loops
- a dead-letter outcome after enough failures
Why immediate retries are dangerous
Immediate retries create retry storms:
- the same unstable target gets hit again instantly
- workers stay occupied by the same failing jobs
- fresh work never gets a chance
That is why retry queues need scheduled delays, not just "put it back at the end."
Exponential backoff with jitter is the default safe choice.
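As a rough sketch of that default, the helper below computes a capped exponential delay and then applies "full jitter" (a uniform draw between zero and the cap for that attempt). The worker example earlier adds a small random offset on top of the exponential delay instead; both are common variants. The name `backoff_delay` and the `base` and `cap` defaults are assumptions for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return a retry delay in seconds for a 1-based attempt number."""
    # The ceiling doubles with each attempt until it reaches the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: draw uniformly from [0, ceiling] so retries from many
    # workers do not all land on the same instant.
    return random.uniform(0.0, ceiling)
```

Full jitter spreads retries more widely than a fixed delay plus a small offset, at the cost of occasionally retrying sooner than the nominal exponential schedule.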
Backpressure rules worth implementing early
You do not need a full distributed systems thesis. A few simple rules solve most real issues.
| Signal | Backpressure action |
|---|---|
| queue size above threshold | pause discovery or reduce enqueue rate |
| 429 rate spikes on one domain | lower that domain's concurrency |
| average fetch latency doubles | reduce global concurrency |
| retry queue dominates total work | stop adding fresh low-value jobs |
| database lag rises | slow workers before writes start failing |
These rules turn overload from a surprise into a managed state.
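One way to make those rules concrete is a pure function that maps current signals to a list of actions. The sketch below is illustrative only: the `Signals` fields, the action names, and every threshold are assumptions that would need tuning against real traffic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Signals:
    queue_size: int
    rate_429: float          # fraction of recent responses from one domain that were 429
    avg_latency: float       # seconds, recent window
    baseline_latency: float  # seconds, measured while healthy
    retry_share: float       # retry jobs divided by total queued jobs
    db_lag_seconds: float    # how far database writes are behind


def backpressure_actions(
    s: Signals,
    max_queue: int = 10_000,
    max_429: float = 0.05,
    max_retry_share: float = 0.5,
    max_db_lag: float = 5.0,
) -> list[str]:
    """Return the actions suggested by the table above, in table order."""
    actions: list[str] = []
    if s.queue_size > max_queue:
        actions.append("pause_discovery")
    if s.rate_429 > max_429:
        actions.append("lower_domain_concurrency")
    if s.baseline_latency > 0 and s.avg_latency >= 2 * s.baseline_latency:
        actions.append("reduce_global_concurrency")
    if s.retry_share > max_retry_share:
        actions.append("stop_low_value_enqueue")
    if s.db_lag_seconds > max_db_lag:
        actions.append("slow_workers")
    return actions
```

Because the function has no side effects, it can be evaluated on a timer and unit-tested separately from whatever code actually changes worker counts.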
Separate high-value jobs from background jobs
Not every URL should compete in the same pool.
Good examples of separate lanes:
- urgent refresh jobs for high-value product pages
- normal scheduled recrawls
- low-priority discovery jobs
- heavy browser jobs that need Playwright
If you mix all of those together, simple HTML tasks get stuck behind expensive browser work, and the queue stops feeling predictable.
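A simple way to get separate lanes is one queue per lane, each with its own fixed number of workers. The following is a minimal sketch under that assumption; the lane names and worker counts in `LANES` are made up for illustration, and the fetch step is a placeholder.

```python
from __future__ import annotations

import asyncio

# Lane name -> number of dedicated workers (illustrative values).
LANES = {"urgent": 4, "scheduled": 3, "discovery": 1, "browser": 1}


async def lane_worker(lane: str, queue: asyncio.Queue, done: list) -> None:
    while True:
        url = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder for the real fetch and parse
            done.append((lane, url))
        finally:
            queue.task_done()


async def run_lanes(jobs: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Process jobs grouped by lane; each lane has its own queue and workers."""
    queues = {lane: asyncio.Queue() for lane in LANES}
    done: list[tuple[str, str]] = []
    tasks = [
        asyncio.create_task(lane_worker(lane, queues[lane], done))
        for lane, count in LANES.items()
        for _ in range(count)
    ]
    for lane, urls in jobs.items():
        for url in urls:
            queues[lane].put_nowait(url)
    # Wait until every lane has drained, then stop the workers.
    await asyncio.gather(*(q.join() for q in queues.values()))
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return done
```

With dedicated workers, a backlog of slow browser jobs can only occupy the browser lane's single worker, so the urgent lane keeps its full capacity.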
Where ProxiesAPI fits
ProxiesAPI belongs in the fetch layer, not the queue layer.
That means:
- workers decide what to fetch and when
- the fetch layer decides how to route the request
- parser logic stays unchanged
This separation is useful because queue behavior problems are rarely fixed by proxy routing alone. If the queue is unbounded or retries are undisciplined, better networking just lets the system misbehave more efficiently.
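The layering can be sketched by passing the fetch function into the worker step as a parameter. Everything here is hypothetical: `direct_fetch` and `proxied_fetch` are placeholders that make no network calls and do not represent any real proxy client's API, and `parse` stands in for whatever parser the scraper already has.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable

# A fetcher is any async callable that turns a URL into HTML.
Fetcher = Callable[[str], Awaitable[str]]


async def direct_fetch(url: str) -> str:
    # Placeholder: a real implementation would issue the HTTP request here.
    await asyncio.sleep(0)
    return f"<html>direct:{url}</html>"


async def proxied_fetch(url: str) -> str:
    # Placeholder: a real implementation would route through a proxy service here.
    await asyncio.sleep(0)
    return f"<html>proxied:{url}</html>"


def parse(html: str) -> str:
    # Stand-in parser: strip the wrapper tags. It never sees how the page was fetched.
    return html[len("<html>"):-len("</html>")]


async def process(url: str, fetch: Fetcher) -> str:
    # The queue and worker decide what to fetch and when; `fetch` decides how.
    html = await fetch(url)
    return parse(html)
```

Swapping `direct_fetch` for `proxied_fetch` changes routing without touching `process` or `parse`, which is the separation described above.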
Production checklist
Before calling your scraper "production-ready," check these:
| Capability | Why it matters |
|---|---|
| bounded global concurrency | prevents self-inflicted spikes |
| per-domain concurrency caps | protects targets and lowers ban risk |
| delayed retry queue | avoids retry storms |
| dead-letter handling | stops hopeless jobs from looping forever |
| queue metrics | lets you see overload before users do |
| priority lanes | protects important jobs from noisy background work |
This is the boring engineering that keeps scrapers alive.
The practical takeaway
If you are designing a web scraping queue, do not start with the question "How many workers can I run?"
Start with:
- what work deserves priority
- how much concurrency each target can tolerate
- which failures deserve retries
- what signal should make the system slow down
That is the difference between a scraper that runs fast in a demo and a scraper that survives in production for months.