Web Scraping with Scrapy: Getting Started Guide (2026)
Scrapy is the fastest way to go from “I can scrape one page” to “I can crawl a whole site without my laptop melting.”
If you’ve only used requests + BeautifulSoup, Scrapy will feel different at first:
- you don’t write loops — you yield requests
- you don’t manage concurrency — Scrapy does
- you don’t manually follow pagination — you parse links and schedule them
- you don’t think in “scripts” — you think in spiders + pipelines
This guide is a getting-started tutorial for web scraping with Scrapy, geared toward real-world use:
- a complete spider you can run today
- clean item output
- pagination
- pipelines
- and how to add proxy rotation in a sane way (including ProxiesAPI)
Scrapy is built for scale — but at scale, blocks and uneven success rates show up fast. ProxiesAPI gives you a stable proxy endpoint so your spiders can keep crawling without brittle IP management.
When you should use Scrapy (and when you shouldn’t)
Use Scrapy when:
- you’re scraping many pages (hundreds to millions)
- you need retries, concurrency, throttling
- you want structured outputs and clean exports
- you need to maintain a crawler long-term
Don’t use Scrapy when:
- the site is heavily JS-rendered and you need a browser (use Playwright)
- you only need one-off HTML parsing for a single page
A common pattern is: Scrapy for bulk crawling + Playwright for a subset of JS-only pages.
Setup: create a Scrapy project
python -m venv .venv
source .venv/bin/activate
pip install scrapy python-dotenv
scrapy startproject demo_crawler
cd demo_crawler
Your structure will look like:
demo_crawler/
    scrapy.cfg
    demo_crawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Step 1: Build a spider that scrapes a list page + detail pages
In the real world you usually:
- scrape a listing page
- follow links to details
- export items
We’ll demonstrate on Books to Scrape (a classic practice site):
https://books.toscrape.com/
Create: demo_crawler/spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # listing page: extract book detail links
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

        # pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        title = response.css("div.product_main h1::text").get()
        price = response.css("p.price_color::text").get()
        availability = response.css("p.availability::text").getall()
        availability = " ".join(a.strip() for a in availability if a.strip())
        yield {
            "title": title,
            "price": price,
            "availability": availability,
            "url": response.url,
        }
Run it:
scrapy crawl books -O books.json
You should get a JSON array with hundreds of items.
Step 2: Scrapy selectors (CSS vs XPath)
Scrapy supports both. CSS is usually faster to write; XPath is sometimes more precise.
Examples:
- CSS: response.css("h1::text").get()
- XPath: response.xpath("//h1/text()").get()
Rules of thumb:
- Prefer CSS for most work
- Switch to XPath when you need complex relationships
Step 3: Items (optional, but nice)
If your project grows, define items so you don’t scatter field names everywhere.
demo_crawler/items.py
import scrapy
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()
    url = scrapy.Field()
Then yield a BookItem() in your spider.
Step 4: Pipelines (clean + enrich data)
Pipelines are where you:
- normalize values
- validate required fields
- write to DB
demo_crawler/pipelines.py
import re
class CleanPipeline:
    def process_item(self, item, spider):
        if "price" in item and item["price"]:
            item["price"] = re.sub(r"\s+", " ", item["price"]).strip()
        return item
Enable it in settings.py:
ITEM_PIPELINES = {
    "demo_crawler.pipelines.CleanPipeline": 300,
}
Step 5: Throttling, retries, and being a good citizen
In settings.py:
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10
RETRY_ENABLED = True
RETRY_TIMES = 3
Step 6: Proxy rotation in Scrapy (the right layer)
In Scrapy, proxies belong in downloader middleware.
There are two patterns:
- Single proxy endpoint (you set one proxy for all requests)
- Per-request proxy selection (you rotate among endpoints)
If your proxy provider gives you a stable endpoint that rotates IPs behind the scenes, option (1) is perfect.
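Option (2) looks like the sketch below. The proxy URLs are placeholders, and in Scrapy the request argument would be a real scrapy.Request; only its meta dict is used here:

```python
import random

# Placeholder endpoints -- substitute your own proxy list
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


class RotatingProxyMiddleware:
    """Downloader middleware sketch: assign a random proxy per request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
```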
ProxiesAPI in Scrapy
Store your key in .env:
PROXIESAPI_KEY="YOUR_KEY_HERE"
demo_crawler/middlewares.py
import os

from dotenv import load_dotenv

load_dotenv()


class ProxiesApiProxyMiddleware:
    def __init__(self):
        key = os.getenv("PROXIESAPI_KEY")
        if not key:
            raise RuntimeError("Missing PROXIESAPI_KEY")
        # Adapt this if your account uses a different endpoint format.
        self.proxy_url = f"http://{key}:@proxy.proxiesapi.com:10000"

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy_url
Enable the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "demo_crawler.middlewares.ProxiesApiProxyMiddleware": 350,
}
This is the cleanest integration: your spider logic stays the same; networking becomes more resilient.
Common Scrapy gotchas (and fixes)
1) “My spider is fast but gets blocked”
Fix:
- enable AutoThrottle
- add a delay
- reduce concurrency
- add proxies (ProxiesAPI)
Useful settings:
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0.5
2) “My selectors work in browser but not in Scrapy”
You might be looking at client-side rendered DOM.
Fix:
- view-source and verify the HTML has the data
- if not, use Playwright or an API endpoint
3) “Pagination loops forever”
Fix:
- normalize URLs via response.follow
- keep a seen set (or rely on Scrapy's built-in dupefilter behavior)
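The URL resolution response.follow does for you can be mimicked with the standard library; stripping fragments is a common extra step so "#reviews"-style variants don't register as new pages:

```python
from urllib.parse import urljoin, urldefrag

base = "https://books.toscrape.com/catalogue/page-2.html"

# Resolve a relative link against the current page, as response.follow does
absolute = urljoin(base, "page-3.html")

# Drop the fragment so "#reviews" variants don't look like new URLs
clean, _ = urldefrag(absolute + "#reviews")
```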
A minimal production checklist
- clear start URLs
- pagination + detail page crawling
- proper exports (-O JSON/CSV)
- throttling enabled
- retries enabled
- proxy middleware configured (if needed)
Summary
If you want to do web scraping with Scrapy seriously, the winning combination is:
- Scrapy spiders for structured crawling
- pipelines for clean data
- AutoThrottle + delays for polite behavior
- ProxiesAPI for higher success rates as you scale
Once you have this baseline, you can evolve into distributed crawling, incremental runs, and storing outputs in a real database.