Async Web Scraping in Python: asyncio + aiohttp Guide (Patterns That Don’t Get You Banned)
Async scraping is the fastest way to collect data at scale in Python.
It’s also the fastest way to:
- get rate-limited (429)
- melt your own network stack
- accidentally DDoS a site
- get banned because your crawler looks like a glitchy robot
This guide is a practical template you can actually reuse.
You’ll build:
- a bounded-concurrency async fetcher (aiohttp + semaphores)
- retry + exponential backoff + jitter
- timeouts that prevent hung sockets
- polite per-host limits
- a clean “export” step (JSON/CSV)
Async scrapers amplify both throughput and failure modes. Keep concurrency bounded, retries polite, and add ProxiesAPI when you need clean IP rotation at scale.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install aiohttp aiodns cchardet
Notes:
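- aiodns and cchardet are optional speed-ups that can help at higher concurrency. cchardet is no longer actively maintained and may fail to build on recent Python versions; if it does, leave it out.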
The golden rule: concurrency is a knob, not a vibe
Most “async scraping tutorials” stop at:
await asyncio.gather(*tasks)
That’s not a scraper, it’s a stress test.
What you actually want:
- a global concurrency limit (so you don’t open 2,000 sockets)
- a per-host limit (so you don’t hammer one domain)
- a retry strategy (so transient failures don’t kill the crawl)
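To see what the global limit buys you, here is a minimal sketch with no network calls. `fake_fetch` is a hypothetical stand-in for a real request (it only sleeps), and `run_bounded` is an illustrative name; the function reports how many calls overlapped at the peak.

```python
import asyncio


async def run_bounded(n_tasks: int, limit: int) -> int:
    """Run n_tasks fake fetches and return the peak number in flight."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_fetch(i: int) -> int:
        # Hypothetical stand-in for an HTTP request: it only sleeps.
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)
            in_flight -= 1
        return i

    # All tasks are created up front, but the semaphore gates how many run.
    await asyncio.gather(*(fake_fetch(i) for i in range(n_tasks)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(run_bounded(n_tasks=50, limit=5)))  # prints 5
```

Fifty tasks are scheduled, yet the peak never exceeds the semaphore's limit. That limit is the knob you tune.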
A production-grade template (copy/paste friendly)
from __future__ import annotations

import asyncio
import json
import random
from dataclasses import dataclass

import aiohttp


@dataclass(frozen=True)
class CrawlConfig:
    total_concurrency: int = 20  # max requests in flight across all hosts
    per_host: int = 5  # max open connections to any single host
    timeout_s: float = 30.0  # total time budget per request
    max_retries: int = 3  # attempts per URL, including the first
    # Optional proxy URL (including ProxiesAPI upstream proxy mode, if you use it).
    proxy: str | None = None


def backoff_s(attempt: int) -> float:
    # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter.
    return (2**attempt) + random.random()


async def fetch(
    session: aiohttp.ClientSession,
    url: str,
    *,
    config: CrawlConfig,
    sem: asyncio.Semaphore,
) -> str:
    # The semaphore is held across retries, so a URL that is backing off
    # still occupies a slot and total pressure on the target stays bounded.
    async with sem:
        last_exc: Exception | None = None
        for attempt in range(config.max_retries):
            try:
                async with session.get(url, proxy=config.proxy) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                last_exc = e
                # Don't sleep after the final attempt; just report the failure.
                if attempt < config.max_retries - 1:
                    await asyncio.sleep(backoff_s(attempt))
        raise last_exc or RuntimeError(f"no attempts made for {url}")


async def crawl(urls: list[str], config: CrawlConfig) -> list[dict]:
    sem = asyncio.Semaphore(config.total_concurrency)
    timeout = aiohttp.ClientTimeout(total=config.timeout_s)
    # limit_per_host caps simultaneous connections to any one host.
    # TLS certificate verification stays at aiohttp's default (enabled).
    connector = aiohttp.TCPConnector(limit_per_host=config.per_host)
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)"
    }
    async with aiohttp.ClientSession(timeout=timeout, connector=connector, headers=headers) as session:
        tasks = [fetch(session, u, config=config, sem=sem) for u in urls]
        # return_exceptions=True turns failures into values instead of aborting.
        pages = await asyncio.gather(*tasks, return_exceptions=True)

    out: list[dict] = []
    for url, result in zip(urls, pages):
        if isinstance(result, BaseException):
            # Include the exception type: str() of a timeout is often empty.
            out.append({"url": url, "ok": False, "error": f"{type(result).__name__}: {result}"})
        else:
            out.append({"url": url, "ok": True, "html_len": len(result)})
    return out


if __name__ == "__main__":
    urls = [
        "https://example.com/",
        "https://www.iana.org/domains/reserved",
    ]
    cfg = CrawlConfig(total_concurrency=10, per_host=3)
    data = asyncio.run(crawl(urls, cfg))
    print(json.dumps(data, indent=2))
This gives you a stable spine:
- if a page fails, you get an error record, not a crashed crawl
- you can tune concurrency based on target behavior
- you have one place to plug in a proxy
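One refinement worth considering: `backoff_s` grows without bound, and the template treats every failure the same way. The sketch below is one possible variant, not part of the template. `capped_backoff_s`, `retry_delay_s`, and `RETRYABLE_STATUSES` are hypothetical names, and the set of retryable status codes is a common convention rather than a rule.

```python
from __future__ import annotations

import random

# Assumed convention: retry throttling and transient server errors only.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


def capped_backoff_s(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2**attempt)))


def retry_delay_s(status: int, attempt: int, retry_after: str | None = None) -> float | None:
    """Return seconds to wait before retrying, or None if the status is not retryable."""
    if status not in RETRYABLE_STATUSES:
        return None
    # Honor a numeric Retry-After header when the server sends one.
    # (The HTTP-date form of the header is ignored in this sketch.)
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    return capped_backoff_s(attempt)
```

With something like this, a 404 fails immediately instead of being retried, and a 429 that carries `Retry-After: 7` waits the seven seconds the server asked for.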
Practical patterns that keep you unbanned
1) Don’t spike traffic
If you need 1,000 pages, don’t fetch them all at once.
Batch your crawl and insert a short pause between batches.
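A minimal sketch of batching, assuming you already have some coroutine that fetches one URL. `fetch_one`, `batches`, `crawl_in_batches`, and the default batch size and pause are all illustrative choices, not values from the template.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def crawl_in_batches(
    urls: Sequence[str],
    fetch_one: Callable[[str], Awaitable[str]],
    batch_size: int = 50,
    pause_s: float = 2.0,
) -> List[object]:
    """Fetch urls batch by batch, sleeping pause_s between batches."""
    results: List[object] = []
    chunks = list(batches(urls, batch_size))
    for index, chunk in enumerate(chunks):
        # return_exceptions=True keeps one failure from aborting the batch.
        results.extend(
            await asyncio.gather(*(fetch_one(u) for u in chunk), return_exceptions=True)
        )
        if index < len(chunks) - 1:
            await asyncio.sleep(pause_s)  # no pause after the last batch
    return results
```

Results come back in the same order as the input URLs, with exceptions in place of pages that failed.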
2) Cache HTML while iterating
When you’re developing extraction logic, cache pages to disk so you don’t repeatedly hit the site.
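One way to do this is a small read-through cache keyed by a hash of the URL. This is a development-time sketch: `cached_fetch`, `cache_path`, the `.cache` directory, and `fetch_one` (any coroutine that returns HTML for a URL) are hypothetical names, and the file I/O is blocking, which is usually acceptable while iterating locally.

```python
import hashlib
from pathlib import Path
from typing import Awaitable, Callable


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable filename using its SHA-256 digest."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


async def cached_fetch(
    url: str,
    fetch_one: Callable[[str], Awaitable[str]],
    cache_dir: Path = Path(".cache"),
) -> str:
    """Return cached HTML if present; otherwise fetch, store, and return it."""
    path = cache_path(cache_dir, url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = await fetch_one(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Delete the cache directory when you want fresh pages. There is no expiry logic here.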
3) Separate fetch from parse
Fetch returns HTML/JSON.
Parse turns it into structured rows.
When your fetch layer is reliable, you can swap parsers and tools without rewriting everything.
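As a rough illustration of the parse side, the sketch below turns HTML into a row using only the standard library. `TitleParser` and `parse_page` are hypothetical names, and extracting the page title is just an example field; a real parser would pull whatever your rows need.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first non-empty <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title: str | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title = (self.title or "") + data


def parse_page(url: str, html: str) -> dict:
    """Pure function: HTML in, structured row out. No network access."""
    parser = TitleParser()
    parser.feed(html)
    title = parser.title.strip() if parser.title else None
    return {"url": url, "title": title, "html_len": len(html)}
```

Because `parse_page` never touches the network, you can run it against cached HTML and unit-test it with plain strings.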
Exporting results (JSON + CSV)
Once you have structured rows, exporting should be boring:
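import pandas as pd  # extra dependency, not installed in Setup: pip install pandas
# data is the list of row dicts returned by crawl()
pd.DataFrame(data).to_json("crawl.json", orient="records", indent=2)
pd.DataFrame(data).to_csv("crawl.csv", index=False)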
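If you would rather not pull in pandas for this step, the standard library is enough. The sketch below assumes rows shaped like the template's output, where failed rows carry an `error` key and successful rows carry `html_len`; `export_rows` is a hypothetical helper name.

```python
import csv
import json
from typing import Dict, List


def export_rows(rows: List[Dict], json_path: str, csv_path: str) -> List[str]:
    """Write rows to JSON and CSV; return the CSV column order."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2, ensure_ascii=False)

    # Union of keys in first-seen order, since rows may differ in shape.
    columns: List[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        # restval fills in cells for keys a row does not have.
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

Collecting the union of keys matters because `csv.DictWriter` raises an error on keys that are missing from `fieldnames`.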
Where ProxiesAPI fits (without hype)
Async scraping fails in predictable ways:
- rate limits
- burst traffic from one IP
- noisy failure rates at scale
That’s where proxy routing can help — not as a “bypass button”, but as a way to keep traffic distribution cleaner when you’re crawling a lot.
Keep the integration as a config value (proxy=), so you can turn it on only when it’s needed.
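One simple way to make the proxy switchable is to read it from the environment. This is only a sketch: the variable name `SCRAPER_PROXY` and the helper `proxy_from_env` are made up for illustration.

```python
from __future__ import annotations

import os
from typing import Mapping


def proxy_from_env(env: Mapping[str, str] = os.environ, var: str = "SCRAPER_PROXY") -> str | None:
    """Return the proxy URL from the environment, or None when unset or blank."""
    value = env.get(var, "").strip()
    return value or None
```

You could then build the config with `CrawlConfig(proxy=proxy_from_env())`. With the variable unset, requests go out directly.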
Wrap-up
You now have an async scraping template that’s fast and sane:
- bounded concurrency
- per-host limits
- retries + backoff + jitter
- timeouts
- structured output + easy export
If you want to level it up next:
- add URL deduplication and priority queues
- instrument latency + status code stats
- implement robots.txt checks + per-site rules
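Two of these are small enough to sketch with the standard library. `dedupe_urls` and `allowed_urls` are hypothetical helpers. The second assumes you have already downloaded the robots.txt text for the host that all the given URLs belong to, because robots.txt rules apply per host.

```python
from typing import Iterable, List
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Drop fragments and repeated URLs, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        clean, _fragment = urldefrag(url)
        if clean not in seen:
            seen.add(clean)
            unique.append(clean)
    return unique


def allowed_urls(urls: Iterable[str], robots_txt: str, user_agent: str) -> List[str]:
    """Keep only URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

Running both filters before the crawl keeps disallowed or duplicate URLs out of the fetch queue entirely.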