Async Web Scraping in Python: asyncio + aiohttp Guide (Patterns That Don’t Get You Banned)

Jun 01, 2026 · guide · #python, #asyncio, #aiohttp, #web-scraping, #concurrency, #rate-limits, #backoff, #proxies

Async scraping is one of the fastest ways to collect data at scale in Python: a single process can keep many requests in flight while it waits on the network.

It’s also the fastest way to:

get rate-limited (429)
melt your own network stack
accidentally DDoS a site
get banned because your crawler looks like a glitchy robot

This guide is a practical template you can actually reuse.

You’ll build:

a bounded-concurrency async fetcher (aiohttp + semaphores)
retry + exponential backoff + jitter
timeouts that prevent hung sockets
polite per-host limits
a clean “export” step (JSON/CSV)

Async speed is great — keep your crawl stable with ProxiesAPI

Async scrapers amplify both throughput and failure modes. Keep concurrency bounded, retries polite, and add ProxiesAPI when you need clean IP rotation at scale.


Setup

python3 -m venv .venv
source .venv/bin/activate
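pip install aiohttp aiodns pandas

Notes:

aiodns is an optional speed-up for DNS resolution that can help at higher concurrency.
Older tutorials often install cchardet as well; it is no longer maintained and may not build on recent Python versions, so it is omitted here. aiohttp works without it.
pandas is only needed for the CSV export step near the end.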

The golden rule: concurrency is a knob, not a vibe

Most “async scraping tutorials” stop at:

await asyncio.gather(*tasks)

That’s not a scraper, it’s a stress test.

What you actually want:

a global concurrency limit (so you don’t open 2,000 sockets)
a per-host limit (so you don’t hammer one domain)
a retry strategy (so transient failures don’t kill the crawl)
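
To make the first point concrete, here is a minimal sketch that swaps the network call for asyncio.sleep and records how many coroutines are in flight at once. The names fake_fetch and run_bounded are hypothetical and exist only for this illustration. The point is that the semaphore caps in-flight work no matter how many coroutines gather() starts.

```python
import asyncio


async def fake_fetch(i: int, sem: asyncio.Semaphore, stats: dict) -> int:
    # Stand-in for a real HTTP request (assumption: no network involved).
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)
        stats["in_flight"] -= 1
    return i


async def run_bounded(n_tasks: int, limit: int) -> dict:
    # All coroutines are started together, but at most `limit` of them
    # are inside the semaphore at any moment.
    sem = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(fake_fetch(i, sem, stats) for i in range(n_tasks)))
    stats["done"] = len(results)
    return stats


if __name__ == "__main__":
    print(asyncio.run(run_bounded(n_tasks=200, limit=20)))
```

The peak value in the returned dict should never exceed the limit you pass in, which is the property you want from a real fetcher as well.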

A production-grade template (copy/paste friendly)

from __future__ import annotations

import asyncio
import json
import random
from dataclasses import dataclass

import aiohttp


@dataclass(frozen=True)
class CrawlConfig:
    total_concurrency: int = 20
    per_host: int = 5
    timeout_s: float = 30.0
    max_retries: int = 3

    # Optional proxy URL (including ProxiesAPI upstream proxy mode, if you use it).
    proxy: str | None = None


def backoff_s(attempt: int) -> float:
    # exponential backoff + jitter
    return (2**attempt) + random.random()


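# Status codes worth retrying: rate limiting and transient server errors.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


async def fetch(
    session: aiohttp.ClientSession,
    url: str,
    *,
    config: CrawlConfig,
    sem: asyncio.Semaphore,
) -> str:
    async with sem:
        last_exc: Exception | None = None
        for attempt in range(config.max_retries):
            try:
                async with session.get(url, proxy=config.proxy) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except aiohttp.ClientResponseError as e:
                # Responses such as 403 or 404 will not succeed on retry; fail fast.
                if e.status not in RETRYABLE_STATUS:
                    raise
                last_exc = e
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                # Connection problems and timeouts are usually transient.
                last_exc = e
            # Back off before the next attempt, but not after the final one.
            if attempt < config.max_retries - 1:
                await asyncio.sleep(backoff_s(attempt))
        if last_exc is None:
            raise ValueError("max_retries must be at least 1")
        raise last_exc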


async def crawl(urls: list[str], config: CrawlConfig) -> list[dict]:
    sem = asyncio.Semaphore(config.total_concurrency)

    timeout = aiohttp.ClientTimeout(total=config.timeout_s)
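    # limit_per_host caps simultaneous connections to any single host.
    # TLS certificate verification is left at aiohttp's default (enabled).
    connector = aiohttp.TCPConnector(limit_per_host=config.per_host)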

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)"
    }

    async with aiohttp.ClientSession(timeout=timeout, connector=connector, headers=headers) as session:
        tasks = [fetch(session, u, config=config, sem=sem) for u in urls]
        pages = await asyncio.gather(*tasks, return_exceptions=True)

    out: list[dict] = []
    for url, result in zip(urls, pages):
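        if isinstance(result, BaseException):
            # Include the exception type: some errors (e.g. timeouts) have an empty message.
            out.append({"url": url, "ok": False, "error": f"{type(result).__name__}: {result}"})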
        else:
            out.append({"url": url, "ok": True, "html_len": len(result)})
    return out


if __name__ == "__main__":
    urls = [
        "https://example.com/",
        "https://www.iana.org/domains/reserved",
    ]
    cfg = CrawlConfig(total_concurrency=10, per_host=3)
    data = asyncio.run(crawl(urls, cfg))
    print(json.dumps(data, indent=2))

This gives you a stable spine:

if a page fails, you get an error record, not a crashed crawl
you can tune concurrency based on target behavior
you have one place to plug in a proxy
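
One variation worth considering for backoff_s: the version in the template grows without bound as the attempt number rises. A common alternative is capped exponential backoff with "full jitter", where the delay is drawn uniformly between zero and a capped exponential ceiling. The sketch below is an alternative under stated assumptions, not part of the template; capped_backoff_s, base_s and cap_s are hypothetical names.

```python
import random


def capped_backoff_s(attempt: int, *, base_s: float = 1.0, cap_s: float = 30.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Delays are random, so the printed values differ on every run.
    for attempt in range(6):
        print(attempt, round(capped_backoff_s(attempt), 2))
```

The cap keeps a long retry chain from sleeping for minutes, and the wide jitter range spreads out retries from many coroutines that failed at the same moment.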

Practical patterns that keep you unbanned

1) Don’t spike traffic

If you need 1,000 pages, don’t fetch them all at once.

Batch your crawl and insert a short pause between batches.
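
A minimal sketch of that batching idea, assuming you already have some coroutine that fetches one URL. The names chunked, crawl_in_batches, fetch_one, batch_size and pause_s are hypothetical, and the defaults are placeholders to tune per target rather than recommendations.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable, Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def crawl_in_batches(
    urls: list[str],
    fetch_one: Callable[[str], Awaitable[str]],
    *,
    batch_size: int = 50,
    pause_s: float = 2.0,
) -> list[object]:
    """Fetch `urls` batch by batch, sleeping `pause_s` between batches."""
    results: list[object] = []
    batches = list(chunked(urls, batch_size))
    for i, batch in enumerate(batches):
        # return_exceptions=True keeps one failed URL from aborting the batch.
        batch_results = await asyncio.gather(
            *(fetch_one(u) for u in batch), return_exceptions=True
        )
        results.extend(batch_results)
        # Pause between batches, but not after the last one.
        if i < len(batches) - 1:
            await asyncio.sleep(pause_s)
    return results
```

Results come back in the same order as the input URLs, with exception objects in place of pages that failed.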

2) Cache HTML while iterating

When you’re developing extraction logic, cache pages to disk so you don’t repeatedly hit the site.
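
One simple way to do this is a directory of files keyed by a hash of the URL. The sketch below uses only the standard library; cache_path, load_cached and save_cached are hypothetical helper names, and the layout (one .html file per URL) is an assumption rather than a requirement.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def cache_path(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name using a SHA-256 digest of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def load_cached(url: str, cache_dir: Path) -> str | None:
    """Return the cached HTML for `url`, or None if it has not been saved yet."""
    path = cache_path(url, cache_dir)
    return path.read_text(encoding="utf-8") if path.exists() else None


def save_cached(url: str, html: str, cache_dir: Path) -> Path:
    """Write `html` to the cache, creating the directory if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(url, cache_dir)
    path.write_text(html, encoding="utf-8")
    return path
```

In a fetch wrapper you would call load_cached first and only hit the network (then save_cached) when it returns None.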

3) Separate fetch from parse

Fetch returns HTML/JSON.

Parse turns it into structured rows.

When your fetch layer is reliable, you can swap parsers and tools without rewriting everything.
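
As an illustration of the parse side, here is a sketch of a pure function that takes HTML and returns one structured row. It extracts only the page title, using the standard library's html.parser; TitleParser and parse_page are hypothetical names, and a real project might use a dedicated HTML parsing library instead.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def parse_page(url: str, html: str) -> dict:
    """Turn raw HTML into one structured row. No network access happens here."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return {"url": url, "title": title or None, "html_len": len(html)}
```

Because parse_page never touches the network, you can run it against cached pages and test it with plain strings.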

Exporting results (JSON + CSV)

Once you have structured rows, exporting should be boring:

import pandas as pd

pd.DataFrame(data).to_csv("crawl.csv", index=False)
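
If you would rather not depend on pandas, the same export can be sketched with the standard library. The function name export_rows and the file paths are hypothetical; the row shape is assumed to match the dicts returned by crawl(), where error rows and success rows carry different keys.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path


def export_rows(rows: list[dict], json_path: Path, csv_path: Path) -> None:
    """Write the same rows to a JSON file and a CSV file."""
    json_path.write_text(json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8")

    # Rows may have different keys (success rows vs. error rows), so use the
    # union of all keys, in first-seen order, as the CSV header.
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with csv_path.open("w", newline="", encoding="utf-8") as f:
        # restval fills in columns that a given row does not have.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Usage would look like export_rows(data, Path("crawl.json"), Path("crawl.csv")).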

Where ProxiesAPI fits (without hype)

Async scraping fails in predictable ways:

rate limits
burst traffic from one IP
noisy failure rates at scale

That’s where proxy routing can help — not as a “bypass button”, but as a way to keep traffic distribution cleaner when you’re crawling a lot.

Keep the integration as a config value (proxy=), so you can turn it on only when it’s needed.

Wrap-up

You now have an async scraping template that’s fast and sane:

bounded concurrency
per-host limits
retries + backoff + jitter
timeouts
structured output + easy export

If you want to level it up next:

add URL deduplication and priority queues
instrument latency + status code stats
implement robots.txt checks + per-site rules
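
For the robots.txt item, the standard library already ships a parser. The sketch below assumes you have downloaded the robots.txt body yourself (for example with the fetcher above) and only shows the checking step; build_robots, allowed and the "my-crawler" user agent are hypothetical names.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the parsed rules."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_robots("User-agent: *\nDisallow: /private/\n")
    print(allowed(rules, "my-crawler", "https://example.com/private/report"))
    print(allowed(rules, "my-crawler", "https://example.com/public/page"))
```

A natural place to call allowed() is just before a URL is queued, so disallowed pages never reach the fetch layer.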

