Async Web Scraping in Python: asyncio + aiohttp Guide (Patterns That Don’t Get You Banned)

Async scraping is one of the fastest ways to collect data at scale in Python.

It’s also the fastest way to:

  • get rate-limited (429)
  • melt your own network stack
  • accidentally DDoS a site
  • get banned because your crawler looks like a glitchy robot

This guide is a practical template you can actually reuse.

You’ll build:

  • a bounded-concurrency async fetcher (aiohttp + semaphores)
  • retry + exponential backoff + jitter
  • timeouts that prevent hung sockets
  • polite per-host limits
  • a clean “export” step (JSON/CSV)

Async speed is great: keep your crawl stable with ProxiesAPI

Async scrapers amplify both throughput and failure modes. Keep concurrency bounded, retries polite, and add ProxiesAPI when you need clean IP rotation at scale.


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install aiohttp aiodns cchardet

Notes:

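  • aiodns and cchardet are optional speed-ups that can help at higher concurrency. cchardet is no longer maintained and may fail to build on recent Python versions; if the install fails, drop it and keep aiohttp + aiodns.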

The golden rule: concurrency is a knob, not a vibe

Most “async scraping tutorials” stop at:

await asyncio.gather(*tasks)

That’s not a scraper, it’s a stress test.

What you actually want:

  • a global concurrency limit (so you don’t open 2,000 sockets)
  • a per-host limit (so you don’t hammer one domain)
  • a retry strategy (so transient failures don’t kill the crawl)

A production-grade template (copy/paste friendly)

from __future__ import annotations

import asyncio
import json
import random
from dataclasses import dataclass

import aiohttp


@dataclass(frozen=True)
class CrawlConfig:
    total_concurrency: int = 20
    per_host: int = 5
    timeout_s: float = 30.0
    max_retries: int = 3

    # Optional proxy URL (including ProxiesAPI upstream proxy mode, if you use it).
    proxy: str | None = None


def backoff_s(attempt: int) -> float:
    # exponential backoff + jitter
    return (2**attempt) + random.random()


async def fetch(
    session: aiohttp.ClientSession,
    url: str,
    *,
    config: CrawlConfig,
    sem: asyncio.Semaphore,
) -> str:
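    last_exc: Exception | None = None
    for attempt in range(config.max_retries):
        try:
            # Hold a concurrency slot only while the request is in flight.
            async with sem, session.get(url, proxy=config.proxy) as resp:
                resp.raise_for_status()
                return await resp.text()
        except aiohttp.ClientResponseError as e:
            # Retry 429 and 5xx; other 4xx responses will not improve on retry.
            if e.status != 429 and e.status < 500:
                raise
            last_exc = e
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            # Connection problems and timeouts are usually transient.
            last_exc = e
        # Back off outside the semaphore, and skip the sleep after the final attempt.
        if attempt < config.max_retries - 1:
            await asyncio.sleep(backoff_s(attempt))
    raise last_exc or RuntimeError(f"no attempts made for {url}")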


async def crawl(urls: list[str], config: CrawlConfig) -> list[dict]:
    sem = asyncio.Semaphore(config.total_concurrency)

    timeout = aiohttp.ClientTimeout(total=config.timeout_s)
    # TLS certificate verification stays on (the default); passing ssl=False would disable it.
    connector = aiohttp.TCPConnector(limit_per_host=config.per_host)

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)"
    }

    async with aiohttp.ClientSession(timeout=timeout, connector=connector, headers=headers) as session:
        tasks = [fetch(session, u, config=config, sem=sem) for u in urls]
        pages = await asyncio.gather(*tasks, return_exceptions=True)

    out: list[dict] = []
    for url, result in zip(urls, pages):
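        # gather(return_exceptions=True) can also return CancelledError, a BaseException.
        if isinstance(result, BaseException):
            # repr() keeps the exception type; str() is empty for some errors (e.g. timeouts).
            out.append({"url": url, "ok": False, "error": repr(result)})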
        else:
            out.append({"url": url, "ok": True, "html_len": len(result)})
    return out


if __name__ == "__main__":
    urls = [
        "https://example.com/",
        "https://www.iana.org/domains/reserved",
    ]
    cfg = CrawlConfig(total_concurrency=10, per_host=3)
    data = asyncio.run(crawl(urls, cfg))
    print(json.dumps(data, indent=2))

This gives you a stable spine:

  • if a page fails, you get an error record, not a crashed crawl
  • you can tune concurrency based on target behavior
  • you have one place to plug in a proxy

Practical patterns that keep you unbanned

1) Don’t spike traffic

If you need 1,000 pages, don’t fetch them all at once.

Batch your crawl and insert a short pause between batches.
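
A minimal sketch of that batching idea, assuming you already have an async function that crawls one list of URLs (such as crawl from the template). The names batched, crawl_in_batches, batch_size, and pause_s are illustrative, and the default values are placeholders to tune per target.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable, Iterator


def batched(urls: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start : start + size]


async def crawl_in_batches(
    urls: list[str],
    crawl_batch: Callable[[list[str]], Awaitable[list[dict]]],
    *,
    batch_size: int = 100,
    pause_s: float = 2.0,
) -> list[dict]:
    """Run `crawl_batch` on one slice at a time, pausing between slices."""
    results: list[dict] = []
    batches = list(batched(urls, batch_size))
    for i, batch in enumerate(batches):
        results.extend(await crawl_batch(batch))
        if i < len(batches) - 1:  # no pause needed after the last batch
            await asyncio.sleep(pause_s)
    return results
```

With the template above, usage would look like: asyncio.run(crawl_in_batches(urls, lambda b: crawl(b, cfg))).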

2) Cache HTML while iterating

When you’re developing extraction logic, cache pages to disk so you don’t repeatedly hit the site.
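
One simple way to do this is a file-per-URL cache keyed by a hash of the URL. This is a sketch, not a full caching layer: the .cache/html directory and the function names are assumptions for this example, and there is no expiry logic.

```python
from __future__ import annotations

import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache/html")  # hypothetical location; pick whatever suits your project


def cache_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Map a URL to a stable, filesystem-safe file name."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def read_cached(url: str, cache_dir: Path = CACHE_DIR) -> str | None:
    """Return the cached HTML for `url`, or None if it was never stored."""
    path = cache_path(url, cache_dir)
    return path.read_text(encoding="utf-8") if path.exists() else None


def write_cached(url: str, html: str, cache_dir: Path = CACHE_DIR) -> None:
    """Store HTML for `url`, creating the cache directory on first use."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path(url, cache_dir).write_text(html, encoding="utf-8")
```

During development, check read_cached(url) first and only call fetch on a miss, then write_cached the result.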

3) Separate fetch from parse

Fetch returns HTML/JSON.

Parse turns it into structured rows.

When your fetch layer is reliable, you can swap parsers and tools without rewriting everything.
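
As a sketch of what the parse side can look like, here is a function that takes a URL plus already-fetched HTML and returns a row. It uses only the standard library's html.parser and extracts just the page title; TitleParser and parse_page are illustrative names, and a real project would likely swap in its parser of choice.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def parse_page(url: str, html: str) -> dict:
    """Pure function: HTML in, structured row out. No network access."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return {"url": url, "title": title or None, "html_len": len(html)}
```

Because parse_page never touches the network, you can run it against cached pages and unit-test it with plain strings.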


Exporting results (JSON + CSV)

Once you have structured rows, exporting should be boring. If pandas is installed (pip install pandas), CSV export is a one-liner:

import pandas as pd

pd.DataFrame(data).to_csv("crawl.csv", index=False)
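
If you would rather not pull in pandas, the standard library covers both formats. The sketch below assumes rows shaped like the output of crawl, where success and error records carry different keys; export_rows is a name invented for this example.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path


def export_rows(rows: list[dict], json_path: str, csv_path: str) -> None:
    """Write the same rows to a JSON file and a CSV file."""
    Path(json_path).write_text(
        json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Rows can have different keys (ok rows vs. error rows), so use the
    # union of all keys as CSV columns, in first-seen order.
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)  # missing keys become ""
        writer.writeheader()
        writer.writerows(rows)
```

Calling export_rows(data, "crawl.json", "crawl.csv") after the crawl writes both files.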

Where ProxiesAPI fits (without hype)

Async scraping fails in predictable ways:

  • rate limits
  • burst traffic from one IP
  • noisy failure rates at scale

That’s where proxy routing can help — not as a “bypass button”, but as a way to keep traffic distribution cleaner when you’re crawling a lot.

Keep the integration as a config value (proxy=), so you can turn it on only when it’s needed.
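
One way to keep that switch out of the code is to read the proxy URL from an environment variable. This is only a sketch: the variable name SCRAPER_PROXY_URL and the helper proxy_from_env are made up for this example, and the URL format your proxy provider expects is something to confirm in their documentation.

```python
from __future__ import annotations

import os


def proxy_from_env(var: str = "SCRAPER_PROXY_URL") -> str | None:
    """Return the proxy URL from the environment, or None when unset or blank."""
    value = os.environ.get(var, "").strip()
    return value or None
```

The template would then be configured with CrawlConfig(proxy=proxy_from_env()), so an unset variable means direct connections.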


Wrap-up

You now have an async scraping template that’s fast and sane:

  • bounded concurrency
  • per-host limits
  • retries + backoff + jitter
  • timeouts
  • structured output + easy export

If you want to level it up next:

  • add URL deduplication and priority queues
  • instrument latency + status code stats
  • implement robots.txt checks + per-site rules