Web Scraping with Scrapy: Getting Started Guide (2026)

Scrapy is the fastest way to go from “I can scrape one page” to “I can crawl a whole site without my laptop melting.”

If you’ve only used requests + BeautifulSoup, Scrapy will feel different at first:

  • you don’t write loops — you yield requests
  • you don’t manage concurrency — Scrapy does
  • you don’t manually follow pagination — you parse links and schedule them
  • you don’t think in “scripts” — you think in spiders + pipelines

This guide is a getting-started tutorial on web scraping with Scrapy, built for real-world use:

  • a complete spider you can run today
  • clean item output
  • pagination
  • pipelines
  • and how to add proxy rotation in a sane way (including ProxiesAPI)

Add proxy rotation to Scrapy with ProxiesAPI

Scrapy is built for scale — but at scale, blocks and uneven success rates show up fast. ProxiesAPI gives you a stable proxy endpoint so your spiders can keep crawling without brittle IP management.


When you should use Scrapy (and when you shouldn’t)

Use Scrapy when:

  • you’re scraping many pages (hundreds to millions)
  • you need retries, concurrency, throttling
  • you want structured outputs and clean exports
  • you need to maintain a crawler long-term

Don’t use Scrapy when:

  • the site is heavily JS-rendered and you need a browser (use Playwright)
  • you only need one-off HTML parsing for a single page

A common pattern is: Scrapy for bulk crawling + Playwright for a subset of JS-only pages.


Setup: create a Scrapy project

python -m venv .venv
source .venv/bin/activate
pip install scrapy python-dotenv

scrapy startproject demo_crawler
cd demo_crawler

Your structure will look like:

demo_crawler/
  scrapy.cfg
  demo_crawler/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
      __init__.py

Step 1: Build a spider that scrapes a list page + detail pages

In the real world you usually:

  1. scrape a listing page
  2. follow links to details
  3. export items

We’ll demonstrate on Books to Scrape (a classic practice site):

https://books.toscrape.com/

Create: demo_crawler/spiders/books.py

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # listing page: extract book detail links
        for a in response.css("article.product_pod h3 a::attr(href)"):
            yield response.follow(a.get(), callback=self.parse_detail)

        # pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        title = response.css("div.product_main h1::text").get()
        price = response.css("p.price_color::text").get()
        availability = response.css("p.availability::text").getall()
        availability = " ".join([a.strip() for a in availability if a.strip()])

        yield {
            "title": title,
            "price": price,
            "availability": availability,
            "url": response.url,
        }

Run it:

scrapy crawl books -O books.json

You should get a JSON array of 1,000 items (the site's full catalog: 50 pages of 20 books each). Note that capital -O overwrites the output file on each run, while lowercase -o appends to it.


Step 2: Scrapy selectors (CSS vs XPath)

Scrapy supports both. CSS is usually faster to write; XPath is sometimes more precise.

Examples:

  • CSS: response.css("h1::text").get()
  • XPath: response.xpath("//h1/text()").get()

Rules of thumb:

  • Prefer CSS for most work
  • Switch to XPath when you need complex relationships

Step 3: Items (optional, but nice)

If your project grows, define items so you don’t scatter field names everywhere.

demo_crawler/items.py

import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()
    url = scrapy.Field()

Then yield a BookItem() in your spider.


Step 4: Pipelines (clean + enrich data)

Pipelines are where you:

  • normalize values
  • validate required fields
  • write to DB

demo_crawler/pipelines.py

import re


class CleanPipeline:
    def process_item(self, item, spider):
        if "price" in item and item["price"]:
            item["price"] = re.sub(r"\s+", " ", item["price"]).strip()
        return item

Enable it in settings.py:

ITEM_PIPELINES = {
    "demo_crawler.pipelines.CleanPipeline": 300,
}

Step 5: Throttling, retries, and being a good citizen

In settings.py:

ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10
RETRY_ENABLED = True
RETRY_TIMES = 3

Step 6: Proxy rotation in Scrapy (the right layer)

In Scrapy, proxies belong in downloader middleware.

There are two patterns:

  1. Single proxy endpoint (you set one proxy for all requests)
  2. Per-request proxy selection (you rotate among endpoints)

If your proxy provider gives you a stable endpoint that rotates IPs behind the scenes, option (1) is perfect.
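For completeness, pattern (2) can be sketched as a middleware that cycles through a list of endpoints. The PROXY_LIST setting name below is this example's own convention (not a Scrapy built-in), and the endpoint URLs are placeholders:

```python
from itertools import cycle


class RotatingProxyMiddleware:
    """Pattern (2): rotate among several proxy endpoints, one per request.

    Expects a PROXY_LIST setting, e.g.
        PROXY_LIST = [
            "http://user:pass@proxy1.example.com:8000",
            "http://user:pass@proxy2.example.com:8000",
        ]
    """

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("PROXY_LIST is empty")
        self._proxies = cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler to build the middleware with settings
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Round-robin: each outgoing request gets the next proxy in the cycle
        request.meta["proxy"] = next(self._proxies)
```

Enable it in DOWNLOADER_MIDDLEWARES the same way as the ProxiesAPI middleware below; you'd use one pattern or the other, not both.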

ProxiesAPI in Scrapy

Store your key in .env:

PROXIESAPI_KEY="YOUR_KEY_HERE"

demo_crawler/middlewares.py

import os
from dotenv import load_dotenv

load_dotenv()


class ProxiesApiProxyMiddleware:
    def __init__(self):
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY")

        # Adapt this if your account uses a different endpoint format.
        self.proxy_url = f"http://{key}:@proxy.proxiesapi.com:10000"

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy_url

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "demo_crawler.middlewares.ProxiesApiProxyMiddleware": 350,
}

This is the cleanest integration: your spider logic stays the same; networking becomes more resilient.


Common Scrapy gotchas (and fixes)

1) “My spider is fast but gets blocked”

Fix:

  • enable AutoThrottle
  • add a delay
  • reduce concurrency
  • add proxies (ProxiesAPI)

Useful settings:

CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0.5

2) “My selectors work in browser but not in Scrapy”

You might be looking at client-side rendered DOM.

Fix:

  • view-source and verify the HTML has the data
  • if not, use Playwright or an API endpoint

3) “Pagination loops forever”

Fix:

  • normalize URLs via response.follow
  • keep a seen set (or use dupefilter behavior)
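Scrapy's default dupefilter already skips requests whose fingerprint it has seen, and response.follow resolves relative URLs for you. If you're scheduling URLs by hand, the same idea is a few lines of stdlib Python (the function name here is just for illustration):

```python
from urllib.parse import urljoin, urldefrag


def dedup_links(base_url, hrefs, seen):
    """Yield absolute, fragment-free URLs that haven't been seen yet."""
    for href in hrefs:
        # Resolve relative links against the current page, drop #fragments
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            yield url


seen = set()
base = "https://books.toscrape.com/catalogue/page-2.html"
links = list(dedup_links(base, ["page-3.html", "page-3.html#top"], seen))
print(links)  # only one URL: the #top variant is the same page
```

Normalizing before the seen-check is what prevents the "same page under two spellings" loop.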

A minimal production checklist

  • clear start URLs
  • pagination + detail page crawling
  • proper exports (-O JSON/CSV)
  • throttling enabled
  • retries enabled
  • proxy middleware configured (if needed)

Summary

If you want to do web scraping with Scrapy seriously, the winning combination is:

  • Scrapy spiders for structured crawling
  • pipelines for clean data
  • AutoThrottle + delays for polite behavior
  • ProxiesAPI for higher success rates as you scale

Once you have this baseline, you can evolve into distributed crawling, incremental runs, and storing outputs in a real database.

