Beautiful Soup vs Scrapy vs Selenium (2026): Which Python Scraper Should You Use?
If you’re scraping with Python, you’ll hear three names over and over:
- Beautiful Soup (usually with requests)
- Scrapy
- Selenium (or Playwright, in the same “browser automation” category)
They’re not interchangeable.
They solve different problems, and choosing the wrong one costs you weeks:
- slow scripts that never finish
- brittle selectors
- banned IPs
- “works on my machine” crawlers that fail in production
This guide is a decision framework — not a religious argument.
No framework solves blocking by itself. Keep reliability in your fetch layer (timeouts, retries, optional ProxiesAPI) so you can swap tools without rewriting your extraction logic.
TL;DR decision rules
Pick Beautiful Soup when:
- the site is mostly server-rendered HTML
- you’re scraping dozens to hundreds of pages
- you want full control over parsing and exports
Pick Scrapy when:
- you’re scraping thousands to millions of pages
- you need a real crawling engine (queues, dedupe, pipelines)
- you care about throughput and resilience
Pick Selenium when:
- the content requires real browser rendering
- navigation requires clicks, scroll, or authenticated sessions
- anti-bot measures break simple HTTP fetches
Comparison table (practical)
| Tool | Best for | Throughput | Reliability | Complexity |
|---|---|---|---|---|
| Beautiful Soup + requests | simple sites, quick scripts | Medium | Medium | Low |
| Scrapy | large crawls, structured pipelines | High | High | Medium |
| Selenium | JS-heavy sites, complex flows | Low | Medium | High |
Notes:
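- Scrapy is “fast” because it is asynchronous and designed for crawling: many requests are in flight at once instead of each one waiting for the previous response.
- Selenium is “slow” because every page is loaded and rendered in a full browser.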
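To make the throughput difference concrete, here is a small simulation rather than a real crawl. It assumes each "request" is just a 50 ms sleep standing in for network latency, and it uses asyncio for brevity (Scrapy itself is built on Twisted, so this shows the idea, not Scrapy's internals). All function names are made up for the illustration.

```python
import asyncio
import time

LATENCY = 0.05  # pretend every request spends 50 ms waiting on the network


async def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP request: the time is spent waiting, not computing.
    await asyncio.sleep(LATENCY)
    return f"<html>{url}</html>"


async def crawl_sequential(urls: list[str]) -> list[str]:
    # One request at a time: total time grows with len(urls).
    return [await fake_fetch(u) for u in urls]


async def crawl_concurrent(urls: list[str]) -> list[str]:
    # All requests in flight at once: the waits overlap.
    return list(await asyncio.gather(*(fake_fetch(u) for u in urls)))


def timed(coro) -> tuple[list[str], float]:
    # Run a coroutine to completion and report how long it took.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    _, slow = timed(crawl_sequential(urls))
    _, fast = timed(crawl_concurrent(urls))
    print(f"sequential: {slow:.2f}s, concurrent: {fast:.2f}s")
```

Both versions return the same pages; only the waiting overlaps. A browser-driven scraper sits at the other extreme, since each page also pays for rendering.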
Beautiful Soup: the surgical knife
Beautiful Soup shines when you already know the URLs (or can generate them) and the HTML is present server-side.
Minimal pattern:
import requests
from bs4 import BeautifulSoup

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
session = requests.Session()  # reuses connections across requests

def fetch(url: str) -> str:
    r = session.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()  # fail loudly on 4xx/5xx responses
    return r.text

def parse(html: str) -> list[dict]:
    # "lxml" needs the lxml package; "html.parser" is the built-in alternative.
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for card in soup.select(".card"):
        a = card.select_one("a.title")
        rows.append({"title": a.get_text(strip=True) if a else None})
    return rows
Where it breaks down:
- you need a real queue + dedupe
- you need to auto-discover new pages (true crawling)
- the site is JS-rendered so the HTML you fetch is empty
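For a sense of what "a real queue + dedupe" involves, here is a minimal sketch of a hand-rolled crawl frontier using only the standard library. The `crawl` function, its `fetch` callback, and the `max_pages` cap are assumptions made for illustration. A real crawler also needs retries, politeness, and persistence, which is the part Scrapy provides.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url: str, fetch: Callable[[str], str], max_pages: int = 100) -> list[str]:
    """Breadth-first crawl of one host; returns URLs in the order visited."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}  # dedupe: every URL is enqueued at most once
    visited: list[str] = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduping.
            absolute, _ = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing `fetch` in as a parameter keeps the frontier independent of how pages are downloaded, so the same loop works with requests, a browser, or a test stub.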
Scrapy: the crawling engine
Scrapy is not “a parser.” It’s an engine:
- concurrent requests
- built-in retries and throttling hooks
- request fingerprinting + dedupe
- pipelines for cleaning/export/storage
Minimal spider:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        # Follow every link on the listing page; Scrapy skips duplicate requests.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Each yielded dict is an item that flows into the pipelines.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }
Where Scrapy wins:
- large-scale crawls
- crawl politeness (delays, concurrency limits)
- storing results in a durable pipeline
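Politeness and retries in Scrapy are mostly configuration. The sketch below lists a few real Scrapy setting names with illustrative values; the numbers are assumptions to tune per target, not recommendations from this guide. The dictionary can be assigned to a spider's `custom_settings` attribute or its entries copied into `settings.py`.

```python
# Illustrative values only; tune them for the site you are crawling.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect robots.txt
    "DOWNLOAD_DELAY": 1.0,                # seconds between requests to the same site
    "CONCURRENT_REQUESTS": 16,            # global cap on in-flight requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap per target domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to observed latency
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,                     # retries on top of the first attempt
}

# Usage inside a spider (assumes Scrapy is installed):
#
#     class ExampleSpider(scrapy.Spider):
#         name = "example"
#         custom_settings = POLITE_SETTINGS
```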
Where Scrapy struggles:
- JS-heavy sites (you’ll need a renderer integration such as scrapy-playwright or Splash)
- flows that require clicking/scrolling/auth in a real browser
Selenium: the browser hammer
Selenium is the “make it work” tool when the page isn’t really a document — it’s an app.
Minimal pattern:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")
    els = driver.find_elements(By.CSS_SELECTOR, "a.title")
    rows = [{"title": e.text, "url": e.get_attribute("href")} for e in els]
finally:
    # Always close the browser, even if a lookup above fails.
    driver.quit()
Selenium is expensive:
- browser startup cost
- page rendering cost
- bot detection is more intense (automated browsers expose signals of their own, such as the navigator.webdriver flag)
Use it when you must, and keep your run size reasonable.
The real secret: architecture matters more than tool choice
Most scrapers fail because the system is not cleanly separated.
Design for:
- Fetch layer (timeouts, retries, rate limits, optional ProxiesAPI)
- Parse layer (selectors → raw fields)
- Normalize layer (types, defaults, cleanup)
- Export/store layer (JSON/CSV/DB)
If you keep these boundaries, you can migrate:
- Beautiful Soup → Scrapy
- Selenium → Playwright
- direct fetch → ProxiesAPI
…without rewriting the entire project.
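The layer boundaries above can be sketched as four small pieces. Everything here is hypothetical naming for illustration: the `Fetcher` type, the `run` helper, and the `a.title` markup carried over from the earlier Beautiful Soup example. The parse layer uses the standard library's `html.parser` only to keep the sketch dependency-free; in practice it would be Beautiful Soup or Scrapy selectors.

```python
import json
from html.parser import HTMLParser
from typing import Callable

Fetcher = Callable[[str], str]  # fetch layer: URL in, HTML out


class TitleParser(HTMLParser):
    """Parse layer: collects the raw text of every <a class="title">."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


def parse(html: str) -> list[dict]:
    parser = TitleParser()
    parser.feed(html)
    return [{"title": t} for t in parser.titles]


def normalize(row: dict) -> dict:
    # Normalize layer: collapse whitespace, turn empty strings into None.
    title = " ".join((row.get("title") or "").split())
    return {"title": title or None}


def export_json(rows: list[dict]) -> str:
    # Export layer: knows nothing about HTML or HTTP.
    return json.dumps(rows, indent=2)


def run(url: str, fetch: Fetcher) -> str:
    # Swapping requests for a browser or a proxy only changes `fetch`.
    return export_json([normalize(r) for r in parse(fetch(url))])
```

Because `run` receives the fetcher as an argument, moving from a direct fetch to a browser or a proxy service touches one function and leaves parsing, normalizing, and exporting alone.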
When proxies actually matter
Proxies are not a cheat code.
They matter when:
- your request volume increases (you look like a bot)
- the target throttles by IP
- you hit geo restrictions
- you need higher success rate across many URLs
If you’re scraping 10 pages once, solve the basics first:
- correct selectors
- timeouts
- backoff
- politeness
Then, if you still hit blocks at scale, move reliability into the fetch layer.
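As one way to picture "reliability in the fetch layer", here is a minimal retry-with-backoff sketch. The `get` callable stands in for whatever actually downloads the page (requests, a browser, a proxy service). The function name, its parameters, and the choice to retry on any exception are assumptions for illustration; a real fetcher would retry only on timeouts and retryable status codes.

```python
import random
import time
from typing import Callable


def fetch_with_retries(
    url: str,
    get: Callable[[str], str],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get(url), retrying with exponential backoff plus jitter."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempt = 0
    while True:
        try:
            return get(url)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

The `sleep` parameter exists so the backoff can be tested without actually waiting; in normal use the default `time.sleep` applies.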
A simple “choose your tool” checklist
- Is the HTML present in curl output?
  - Yes → Beautiful Soup or Scrapy
  - No → Selenium/Playwright (rendered)
- Do you need to discover URLs by following links?
  - Yes → Scrapy
  - No → Beautiful Soup is often enough
- Are you scraping 10k+ pages?
  - Yes → Scrapy (or you’ll reinvent it badly)
- Is the site essentially a single-page app?
  - Yes → Selenium/Playwright
If you follow those rules, you’ll be right most of the time.
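The checklist reads naturally as a small function. This is only a sketch of the rules above: the parameter names and the returned labels are invented for illustration, and the 10,000-page threshold comes from the checklist itself.

```python
def choose_tool(
    html_in_curl_output: bool,
    needs_link_discovery: bool = False,
    page_count: int = 0,
    is_single_page_app: bool = False,
) -> str:
    # No server-side HTML, or an SPA: a real browser has to render the page.
    if is_single_page_app or not html_in_curl_output:
        return "Selenium/Playwright"
    # Following links or 10k+ pages: use a crawling engine.
    if needs_link_discovery or page_count >= 10_000:
        return "Scrapy"
    # Known URLs, server-rendered HTML, modest volume.
    return "Beautiful Soup"


print(choose_tool(True, page_count=200))             # Beautiful Soup
print(choose_tool(True, needs_link_discovery=True))  # Scrapy
print(choose_tool(False))                            # Selenium/Playwright
```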