Selenium Web Scraping with Python: Complete Guide
If you’re searching for Selenium web scraping with Python, you likely have a site where:
- the HTML you get from requests.get(...) is empty or missing key data
- the page requires scrolling/clicking to load results
- content only appears after JavaScript runs
Selenium can solve those — but it’s also the slowest and most brittle tool in your scraping toolbox. This guide is opinionated: use Selenium when you must, and know when to switch tools.
Selenium gives you a real browser, but at scale you still hit IP-based throttling. ProxiesAPI helps on the network layer (especially your non-browser discovery/fetch jobs) so the overall crawl stays resilient.
Selenium vs alternatives (pick the right tool)
| Tool | Best for | Pros | Cons |
|---|---|---|---|
| requests + BeautifulSoup | server-rendered HTML | fastest, cheapest, easiest to scale | fails on JS apps |
| Playwright | modern JS sites | reliable waits, auto-waits, great debugging | heavier than requests |
| Selenium | legacy apps + complex UI flows | huge ecosystem, broad compatibility | slow, flakier, more bot detection |
| Direct API/XHR scraping | data behind JSON calls | fastest + most stable (when allowed) | requires endpoint discovery + sometimes auth |
Recommendation:
- Start with requests + parse.
- If the HTML is missing, try Playwright next.
- Use Selenium when you specifically need its ecosystem or compatibility.
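The first step of that recommendation can be checked mechanically: fetch the page once without a browser and look for a piece of text you know appears on the rendered page. The sketch below is one way to do that check. The helper name has_expected_data and the sample HTML strings are illustrative assumptions, not part of any library.

```python
import re


def has_expected_data(html: str, marker: str) -> bool:
    """Return True if `marker` appears in the visible part of raw HTML.

    Script and style bodies are stripped first, so a marker that only occurs
    inside a JavaScript bundle does not count as server-rendered content.
    """
    without_scripts = re.sub(
        r"<(script|style)\b.*?</\1\s*>", "", html, flags=re.DOTALL | re.IGNORECASE
    )
    return marker.lower() in without_scripts.lower()


# Hypothetical responses: one server-rendered page, one empty JS shell.
server_rendered = "<html><body><h1>Blue Widget</h1><p>$10</p></body></html>"
js_shell = (
    "<html><body><div id='root'></div>"
    "<script>var title = 'Blue Widget';</script></body></html>"
)

print(has_expected_data(server_rendered, "Blue Widget"))  # True
print(has_expected_data(js_shell, "Blue Widget"))         # False
```

In practice you would pass in requests.get(url, timeout=10).text. A False result suggests moving on to a browser tool, although some sites embed usable JSON inside script tags, which this check deliberately ignores.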
Setup (Python + Selenium)
python -m venv .venv
source .venv/bin/activate
pip install selenium
Chrome + driver
Selenium needs a browser and a matching driver.
- For Chrome, install a compatible chromedriver (must match major Chrome version).
- On macOS (Homebrew):
brew install chromedriver
If the driver is mismatched you’ll see errors like:
session not created: This version of ChromeDriver only supports Chrome version ...
Fix: update Chrome or update chromedriver.
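A mismatch can often be caught before a session is created by comparing the major version reported by the browser and by the driver. The sketch below only parses version strings. The helper names are hypothetical, and the sample strings assume the usual "name major.minor.build.patch" output of the two --version commands.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'ChromeDriver 120.0.6099.71'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """True when browser and driver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Example strings only; on a real machine these would come from running
# `chromedriver --version` and the browser's own `--version` command.
print(versions_match("Google Chrome 120.0.6099.109", "ChromeDriver 120.0.6099.71"))  # True
print(versions_match("Google Chrome 121.0.6167.85", "ChromeDriver 120.0.6099.71"))   # False
```

Recent Selenium releases (4.6 and later) also bundle Selenium Manager, which attempts to locate or download a matching driver automatically when none is on the PATH, so a manual check like this mostly matters when you pin the driver yourself.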
The core pattern: explicit waits (not sleeps)
The #1 Selenium mistake is writing time.sleep(5) everywhere. Instead, wait for a specific condition.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    wait = WebDriverWait(driver, 20)  # poll for up to 20 seconds
    driver.get("https://example.com")
    # Block until the <h1> is in the DOM, then read it
    h1 = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1")))
    print(h1.text)
finally:
    driver.quit()  # always close the browser, even if the wait times out
“Page loaded” isn’t enough
Many sites load in stages (HTML → JS → XHR → render). Waiting for document.readyState alone often gives you “empty” pages. Wait for:
- an element to exist
- a list length to be > 0
- text to appear
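Those three conditions map onto Selenium waits fairly directly. The sketch below is one possible shape, not the only one: at_least is a hypothetical helper built on the fact that WebDriverWait.until() accepts any callable taking the driver, and the selectors li.result and h1 plus the text "Results" are placeholders for your own page.

```python
def at_least(selector: str, minimum: int = 1):
    """Build a wait condition: succeed once `minimum` elements match `selector`.

    WebDriverWait.until() accepts any callable that takes the driver and
    returns a truthy value when the wait should stop.
    """
    def condition(driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        found = driver.find_elements("css selector", selector)
        return found if len(found) >= minimum else False

    return condition


def wait_for_results(driver, timeout: float = 20):
    """Wait for a non-empty result list, then for heading text to appear."""
    # Imported here so the condition above stays usable without selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # An element exists and the list length is > 0 (placeholder selector).
    items = wait.until(at_least("li.result", 1))
    # Specific text has appeared inside the heading (placeholder text).
    wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "h1"), "Results"))
    return items
```

Calling wait_for_results(driver) after driver.get(...) returns the matched elements, or raises selenium's TimeoutException if the list never fills in.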
Selectors: CSS first, XPath second
Prefer CSS selectors:
- div.card
- a[href^="/product/"]
- button[data-testid="submit"]
Use XPath when you need structural selection (parent/sibling relationships) that CSS can’t express cleanly.
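As an illustration of structural selection, the sketch below builds two XPath expressions: one that picks the sibling following a label, and one that climbs from a link to its enclosing card. The markup it assumes (a dt/dd definition list and a div with class card) and all function names are hypothetical.

```python
def value_after_label(label: str) -> str:
    """XPath for the <dd> that directly follows a <dt> with the given text."""
    # Assumes `label` contains no double quotes; XPath 1.0 has no escape syntax.
    return f'//dt[normalize-space()="{label}"]/following-sibling::dd[1]'


def card_containing_link(href_prefix: str) -> str:
    """XPath for the nearest div.card ancestor of a link (selecting a parent)."""
    return (
        f'//a[starts-with(@href, "{href_prefix}")]'
        '/ancestor::div[contains(concat(" ", normalize-space(@class), " "), " card ")][1]'
    )


def read_spec(driver, label: str) -> str:
    """Return the text of the value shown next to `label`."""
    # "xpath" is the string value of selenium's By.XPATH.
    return driver.find_element("xpath", value_after_label(label)).text


print(value_after_label("Weight"))
# //dt[normalize-space()="Weight"]/following-sibling::dd[1]
```

Keeping the expression builders separate from the driver call makes them easy to log and to test without launching a browser.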
Headless mode (and why behavior changes)
Headless is great for CI/servers, but it can change rendering and break lazy loading. Always set a viewport size.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")  # Chrome's newer headless mode
opts.add_argument("--window-size=1400,900")  # explicit viewport size
opts.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=opts)
If a scrape works headed but fails headless, it’s usually:
- missing viewport size
- missing waits for content
- blocked requests (403/captcha/empty HTML)
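To tell those three causes apart, it helps to capture what the headless browser actually saw at the moment of failure. The helper below is a minimal sketch with a hypothetical name, dump_debug, and an output layout of my own choosing. It relies only on the driver's save_screenshot method and its page_source, current_url and title properties.

```python
from pathlib import Path


def dump_debug(driver, out_dir: str, label: str) -> dict:
    """Save a screenshot and the page source so a headless failure can be inspected."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)

    png_path = folder / f"{label}.png"
    html_path = folder / f"{label}.html"

    driver.save_screenshot(str(png_path))  # what the browser rendered
    html = driver.page_source              # DOM serialised at this moment
    html_path.write_text(html, encoding="utf-8")

    return {
        "url": driver.current_url,
        "title": driver.title,
        "html_bytes": len(html.encode("utf-8")),
        "screenshot": str(png_path),
        "html": str(html_path),
    }
```

A tiny html_bytes value or a captcha in the screenshot points to blocking, while a layout that looks like a narrow mobile page points to the viewport.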
Anti-bot basics (low-risk, practical)
You can’t “outsmart” every system, but you can avoid obvious mistakes:
- pace requests and add jitter
- reuse a browser session (don’t relaunch per URL)
- detect blocks early (empty content, captcha markers)
- stop and back off when failure rates spike
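The four habits above can be combined into a small crawl loop. Everything in this sketch is an assumption made for illustration: the marker strings, the 500-character threshold, the window size and the helper names. Tune them against the pages you actually see.

```python
import random
import time
from collections import deque

BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")  # assumed examples


def jittered_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Seconds to wait before the next page: base plus a random extra."""
    return base + random.uniform(0, jitter)


def looks_blocked(html: str, min_length: int = 500) -> bool:
    """Heuristic block check: very short page or a known marker in the text."""
    lowered = html.lower()
    return len(html) < min_length or any(m in lowered for m in BLOCK_MARKERS)


class FailureWindow:
    """Track the last `size` outcomes and report when too many were failures."""

    def __init__(self, size: int = 20, threshold: float = 0.5):
        self.outcomes = deque(maxlen=size)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def should_back_off(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) >= self.threshold


def crawl(driver, urls, sleep=time.sleep):
    """Visit URLs with one reused driver, pausing between pages."""
    window = FailureWindow()
    pages = {}
    for url in urls:
        driver.get(url)
        html = driver.page_source
        blocked = looks_blocked(html)
        window.record(not blocked)
        if not blocked:
            pages[url] = html
        if window.should_back_off():
            break  # stop early instead of hammering a site that is refusing us
        sleep(jittered_delay())
    return pages
```

The sleep parameter exists so the loop can be exercised in tests without real delays. In production the default time.sleep is used.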
If the site’s terms prohibit scraping, don’t do it (or get explicit permission).
Export data cleanly
Browser automation fails; your exports should survive partial runs. Extract dicts and write them frequently.
import csv
rows = [{"name": "Item A", "price": "$10"}, {"name": "Item B", "price": "$12"}]
with open("out.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    w.writeheader()
    w.writerows(rows)
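To write frequently rather than once at the end, one option is to append each batch as soon as it is extracted. The sketch below assumes a fixed column list (FIELDS) and a hypothetical helper name, append_rows. It writes the header only when the file is new or empty.

```python
import csv
import os

FIELDS = ["name", "price"]  # assumed columns, matching the rows above


def append_rows(path: str, rows: list) -> None:
    """Append dict rows to a CSV, adding the header only for a new or empty file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
        f.flush()
        os.fsync(f.fileno())  # push the batch to disk before scraping continues
```

Calling append_rows("out.csv", batch) after every page means a crash mid-run costs at most the current batch.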
When Selenium is the wrong tool
Selenium becomes a liability when you need to crawl thousands of URLs or run continuously. Switch away when:
- your bottleneck is IP blocks (not rendering)
- you can fetch JSON/XHR endpoints directly
- you can parse HTML without a browser
A common production pattern is hybrid scraping:
- try requests first (fast path)
- fall back to Selenium for the minority of pages that truly need rendering
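The hybrid pattern can be sketched as a single function that takes the two fetch strategies as arguments. The names fetch_html, fast_fetch, render and is_complete are hypothetical, and how you decide that a page is complete is up to you.

```python
from typing import Callable


def fetch_html(
    url: str,
    fast_fetch: Callable[[str], str],
    render: Callable[[str], str],
    is_complete: Callable[[str], bool],
) -> tuple:
    """Return (html, path_used), trying the cheap HTTP path before the browser."""
    try:
        html = fast_fetch(url)
        if is_complete(html):
            return html, "http"
    except Exception:
        pass  # network error or block: fall through to the browser path
    return render(url), "browser"
```

With requests and Selenium, fast_fetch could wrap requests.get(url, timeout=10).text, and render could call driver.get(url) and return driver.page_source. Logging the returned path_used shows what fraction of pages really needs the browser.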
Where ProxiesAPI fits (sensibly)
Selenium itself doesn’t plug into ProxiesAPI via a single wrapper URL without extra browser proxy configuration — but ProxiesAPI still helps in two common architectures:
- Discovery via HTTP, rendering only when needed: use ProxiesAPI on your bulk fetch layer (category pages, listing pages, sitemaps), then pass only hard URLs to Selenium.
- Hybrid scraping with fallbacks: ProxiesAPI stabilizes your HTTP fetches so your crawler spends less time failing and less time falling back to the expensive browser path.
Keep the system modular (fetch → parse → export → renderer fallback) and Selenium stays a tool — not your whole product.