Web Scraping with Scrapy: Getting Started Guide (2026)

Scrapy is the fastest way to go from “I can scrape one page” to “I can crawl a whole site without my laptop melting.”

If you’ve only used requests + BeautifulSoup, Scrapy will feel different at first:

  • you don’t write loops — you yield requests
  • you don’t manage concurrency — Scrapy does
  • you don’t manually follow pagination — you parse links and schedule them
  • you don’t think in “scripts” — you think in spiders + pipelines

This guide is a getting-started tutorial aimed at real-world use. It covers:

  • a complete spider you can run today
  • clean item output
  • pagination
  • pipelines
  • and how to add proxy rotation in a sane way (including ProxiesAPI)

Add proxy rotation to Scrapy with ProxiesAPI

Scrapy is built for scale — but at scale, blocks and uneven success rates show up fast. ProxiesAPI gives you a stable proxy endpoint so your spiders can keep crawling without brittle IP management.


When you should use Scrapy (and when you shouldn’t)

Use Scrapy when:

  • you’re scraping many pages (hundreds to millions)
  • you need retries, concurrency, throttling
  • you want structured outputs and clean exports
  • you need to maintain a crawler long-term

Don’t use Scrapy when:

  • the site is heavily JS-rendered and you need a browser (use Playwright)
  • you only need one-off HTML parsing for a single page

A common pattern is: Scrapy for bulk crawling + Playwright for a subset of JS-only pages.


Setup: create a Scrapy project

python -m venv .venv
source .venv/bin/activate
pip install scrapy python-dotenv

scrapy startproject demo_crawler
cd demo_crawler

Your structure will look like:

demo_crawler/
  scrapy.cfg
  demo_crawler/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
      __init__.py

Step 1: Build a spider that scrapes a list page + detail pages

In the real world you usually:

  1. scrape a listing page
  2. follow links to details
  3. export items

We’ll demonstrate on Books to Scrape (a classic practice site):

https://books.toscrape.com/

Create: demo_crawler/spiders/books.py

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # listing page: extract book detail links
        for a in response.css("article.product_pod h3 a::attr(href)"):
            yield response.follow(a.get(), callback=self.parse_detail)

        # pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        title = response.css("div.product_main h1::text").get()
        price = response.css("p.price_color::text").get()
        availability = response.css("p.availability::text").getall()
        availability = " ".join([a.strip() for a in availability if a.strip()])

        yield {
            "title": title,
            "price": price,
            "availability": availability,
            "url": response.url,
        }

Run it:

scrapy crawl books -O books.json

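You should get a JSON array of roughly 1,000 items, one per book across the site's 50 listing pages. Note that -O overwrites the output file, while -o appends to it.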
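
A quick sanity check on the export can catch selector problems early. The sketch below assumes the books.json file and the four field names from the spider above; the helper names load_export and summarize are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path

REQUIRED_FIELDS = ("title", "price", "availability", "url")


def load_export(path):
    """Read the JSON array written by `scrapy crawl books -O books.json`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(items):
    """Count items, empty required fields, and repeated URLs."""
    missing = Counter()
    for item in items:
        for field in REQUIRED_FIELDS:
            if not item.get(field):
                missing[field] += 1
    urls = [item.get("url") for item in items]
    return {
        "count": len(items),
        "missing": dict(missing),
        "duplicate_urls": len(urls) - len(set(urls)),
    }
```

After a crawl, summarize(load_export("books.json")) gives a small report. A non-empty "missing" entry usually means a selector stopped matching on some pages.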


Step 2: Scrapy selectors (CSS vs XPath)

Scrapy supports both. CSS is usually faster to write; XPath is sometimes more precise.

Examples:

  • CSS: response.css("h1::text").get()
  • XPath: response.xpath("//h1/text()").get()

Rules of thumb:

  • Prefer CSS for most work
  • Switch to XPath when you need to match on text content or step to a parent or sibling node, which CSS selectors cannot express

Step 3: Items (optional, but nice)

If your project grows, define items so you don’t scatter field names everywhere.

demo_crawler/items.py

import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()
    url = scrapy.Field()

Then yield a BookItem instead of a plain dict from parse_detail in your spider.


Step 4: Pipelines (clean + enrich data)

Pipelines are where you:

  • normalize values
  • validate required fields
  • write to DB

demo_crawler/pipelines.py

import re


class CleanPipeline:
    def process_item(self, item, spider):
        if "price" in item and item["price"]:
            item["price"] = re.sub(r"\s+", " ", item["price"]).strip()
        return item

Enable it in settings.py:

ITEM_PIPELINES = {
    "demo_crawler.pipelines.CleanPipeline": 300,
}
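
Pipelines can also enrich items. The sketch below adds a numeric price next to the raw string; the class name PriceToNumberPipeline and the price_value field are made up for this example. It assumes the plain dict items from Step 1 (with BookItem you would need to declare price_value as a field first) and does not handle thousands separators.

```python
import re

PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)")


def parse_price(raw):
    """Extract the numeric part of a price string such as '£51.77'; None if absent."""
    match = PRICE_RE.search(raw or "")
    return float(match.group(1)) if match else None


class PriceToNumberPipeline:
    """Hypothetical pipeline: stores a float price_value alongside the raw price."""

    def process_item(self, item, spider):
        item["price_value"] = parse_price(item.get("price"))
        return item
```

To run it after CleanPipeline, register it in ITEM_PIPELINES with a higher number (for example 400). Pipelines run in ascending order of that value.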

Step 5: Throttling, retries, and being a good citizen

In settings.py:

ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10
RETRY_ENABLED = True
RETRY_TIMES = 3
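
To see what these delay settings mean in practice, here is a back-of-the-envelope calculation. It assumes the delay is the bottleneck for a single domain; AutoThrottle can raise the delay when the server is slow but does not go below DOWNLOAD_DELAY, so the result is a lower bound. The function names are illustrative.

```python
def randomized_delay_bounds(download_delay):
    """With RANDOMIZE_DOWNLOAD_DELAY on, Scrapy waits between 0.5x and 1.5x the delay."""
    return 0.5 * download_delay, 1.5 * download_delay


def min_crawl_seconds(num_requests, download_delay):
    """Lower bound on crawl time for one domain when the delay is the bottleneck."""
    return num_requests * download_delay


low, high = randomized_delay_bounds(0.5)  # 0.25 and 0.75 seconds

# Books to Scrape: about 1,000 detail pages plus 50 listing pages.
estimate = min_crawl_seconds(1050, 0.5)   # 525.0 seconds, a little under 9 minutes
```

If that estimate is too slow for your use case, lowering DOWNLOAD_DELAY is the lever, at the cost of putting more load on the site.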

Step 6: Proxy rotation in Scrapy (the right layer)

In Scrapy, proxies belong in downloader middleware.

There are two patterns:

  1. Single proxy endpoint (you set one proxy for all requests)
  2. Per-request proxy selection (you rotate among endpoints)

If your proxy provider gives you a stable endpoint that rotates IPs behind the scenes, option (1) is perfect.
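
For option (2), a minimal sketch of per-request rotation could look like the following. The class name, the PROXY_LIST setting, and the proxy URLs are all made up for illustration; the demo at the bottom uses a stand-in object instead of a real Scrapy request so it runs without a crawl.

```python
import itertools
from types import SimpleNamespace


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: round-robins over a fixed proxy list."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("proxy_urls must not be empty")
        self._proxies = itertools.cycle(proxy_urls)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed setting name, defined by you in settings.py.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Assign the next proxy in the cycle to this request.
        request.meta["proxy"] = next(self._proxies)
        return None  # let Scrapy continue processing the request


# Stand-in for scrapy.Request, just enough to show the rotation.
demo_requests = [SimpleNamespace(meta={}) for _ in range(3)]
middleware = RotatingProxyMiddleware(
    ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
)
for req in demo_requests:
    middleware.process_request(req, spider=None)
assigned = [req.meta["proxy"] for req in demo_requests]
```

Round-robin is the simplest policy. Register the class in DOWNLOADER_MIDDLEWARES with a priority below 750 so it runs before Scrapy's built-in HttpProxyMiddleware, which reads request.meta["proxy"].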

ProxiesAPI in Scrapy

Store your key in a .env file in the project root (next to scrapy.cfg), and keep that file out of version control:

PROXIESAPI_KEY="YOUR_KEY_HERE"

demo_crawler/middlewares.py

import os
from dotenv import load_dotenv

load_dotenv()


class ProxiesApiProxyMiddleware:
    def __init__(self):
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY")

        # Adapt this if your account uses a different endpoint format.
        self.proxy_url = f"http://{key}:@proxy.proxiesapi.com:10000"

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy_url

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "demo_crawler.middlewares.ProxiesApiProxyMiddleware": 350,
}

This is the cleanest integration: your spider logic stays the same; networking becomes more resilient.


Common Scrapy gotchas (and fixes)

1) “My spider is fast but gets blocked”

Fix:

  • enable AutoThrottle
  • add a delay
  • reduce concurrency
  • add proxies (ProxiesAPI)

Useful settings:

CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0.5

2) “My selectors work in browser but not in Scrapy”

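The browser's inspector shows the DOM after JavaScript has run. Scrapy only sees the raw HTML response, so data rendered client-side will be missing.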

Fix:

  • view-source and verify the HTML has the data
  • if not, use Playwright or an API endpoint
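
The view-source check can also be scripted. This is a minimal sketch using only the standard library; the helper names and the User-Agent string are assumptions for the example.

```python
from urllib.request import Request, urlopen


def fetch_raw_html(url, timeout=10):
    """Download the page without executing JavaScript, similar to what Scrapy receives."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; demo-check)"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def missing_from_html(html, expected_snippets):
    """Return the snippets that do not appear in the raw HTML."""
    return [snippet for snippet in expected_snippets if snippet not in html]
```

Pass a few strings you can see in the browser (a title, a price) to missing_from_html along with the output of fetch_raw_html. Anything that comes back as missing is probably rendered by JavaScript or loaded from a separate API call.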

3) “Pagination loops forever”

Fix:

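  • build next-page requests with response.follow, which resolves relative links against the current page URL
  • rely on Scrapy's duplicate filter, which is on by default and drops requests it has already seen; avoid dont_filter=True on pagination requests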
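
The idea behind both fixes can be shown in plain Python. This sketch is not Scrapy's implementation (its duplicate filter works on request fingerprints); it only illustrates why resolving links to absolute URLs and tracking what you have seen stops a loop. The function names are made up.

```python
from urllib.parse import urldefrag, urljoin


def normalize(base_url, href):
    """Resolve a relative link against the page URL and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


seen = set()


def should_visit(base_url, href):
    """Return True the first time a normalized URL is offered, False afterwards."""
    url = normalize(base_url, href)
    if url in seen:
        return False
    seen.add(url)
    return True
```

Two links that look different on the page, such as "page-3.html" and "page-3.html#top", normalize to the same URL, so the second one is skipped.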

A minimal production checklist

  • clear start URLs
  • pagination + detail page crawling
  • proper exports (-O JSON/CSV)
  • throttling enabled
  • retries enabled
  • proxy middleware configured (if needed)

Summary

If you want to do web scraping with Scrapy seriously, the winning combination is:

  • Scrapy spiders for structured crawling
  • pipelines for clean data
  • AutoThrottle + delays for polite behavior
  • ProxiesAPI for higher success rates as you scale

Once you have this baseline, you can evolve into distributed crawling, incremental runs, and storing outputs in a real database.
