# Web Scraping Dynamic Content: How to Handle JavaScript-Rendered Pages (Without Overusing Headless)
Dynamic pages are where new scrapers go to die.
You visit a page in Chrome and see a rich UI… then you call `requests.get(url)` and your scraper gets:
- an empty shell
- a “please enable JavaScript” message
- a blob of script tags
That’s the classic web scraping dynamic content problem.
The mistake most people make is jumping straight to “run headless Chrome for everything”.
Headless works, but it's slower, more expensive, and harder to scale.
This guide gives you a decision framework and practical patterns so you can:
- scrape many dynamic sites using only HTTP (no browser)
- use headless only when you truly need it
- keep costs and complexity down
Dynamic pages often mean more requests, more retries, and more failure modes. ProxiesAPI helps stabilize the network layer so your hybrid (HTML + headless) scraper stays dependable.
## Step 1: Diagnose what “dynamic” means
A page can feel dynamic for different reasons:
- Server-rendered HTML but updates after load (you can still scrape HTML)
- HTML shell + JSON API calls (best case: scrape the JSON)
- GraphQL / internal API behind auth (still sometimes usable)
- Heavily client-rendered + protected (headless may be required)
### Quick test: view-source vs Elements

- `view-source:` shows the initial HTML from the server
- DevTools “Elements” shows the DOM after JS runs
If view-source already contains the data you need, you don’t need headless.
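The same test can be automated. A minimal sketch: fetch the raw HTML and look for a marker string you can see in the rendered page (a product name, a price); the marker here is an arbitrary example.

```python
import requests

def fetch_raw_html(url: str) -> str:
    # What the server sends before any JavaScript runs -- equivalent to view-source:
    r = requests.get(url, timeout=(10, 30))
    r.raise_for_status()
    return r.text

def data_in_html(html: str, marker: str) -> bool:
    # If a value you saw in DevTools is already in the raw HTML,
    # you can skip headless entirely.
    return marker in html
```

If `data_in_html(fetch_raw_html(url), "Acme Widget")` returns True, plain HTTP scraping is enough.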
## Step 2: The cheapest path first (pure HTTP)

### Pattern A — scrape server-rendered HTML

This is the easiest case.
```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com", timeout=(10, 30))
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")
items = [h.get_text(strip=True) for h in soup.select("h2.item-title")]
print(items[:5])
```
### Pattern B — find the JSON API the page calls
Most “dynamic” sites load data via XHR/fetch.
How to find it:
- Open DevTools → Network
- Filter by Fetch/XHR
- Reload page
- Click requests that return JSON
Then replicate that request in Python.
```python
import requests

api_url = "https://example.com/api/products?page=1"
r = requests.get(api_url, timeout=(10, 30), headers={
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0",
})
r.raise_for_status()

data = r.json()
print(data.keys())
```
This is the highest-leverage trick in scraping.
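Once you've replicated one request, pagination is usually just a loop. A sketch assuming a `page` query parameter and a `products` key in the response (both hypothetical; copy the real names from the Network tab):

```python
import requests

def fetch_all_pages(base_url: str, max_pages: int = 50) -> list[dict]:
    """Walk a paginated JSON endpoint until it returns an empty page."""
    items: list[dict] = []
    for page in range(1, max_pages + 1):
        r = requests.get(base_url, params={"page": page},
                         timeout=(10, 30),
                         headers={"Accept": "application/json"})
        r.raise_for_status()
        batch = r.json().get("products", [])
        if not batch:
            break  # an empty page means we've paged past the end
        items.extend(batch)
    return items
```

The `max_pages` cap is a safety valve so a misbehaving endpoint can't trap the crawler in an infinite loop.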
## Step 3: Intermediate options before headless

### Option 1 — parse embedded JSON (`__NEXT_DATA__`, hydration state)
Frameworks like Next.js often embed data in the HTML.
Look for:
- `__NEXT_DATA__`
- `window.__APOLLO_STATE__`
- `__NUXT__`
Example extractor:
```python
import json
from bs4 import BeautifulSoup

def extract_next_data(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    script = soup.select_one("script#__NEXT_DATA__")
    if not script:
        return None
    return json.loads(script.get_text())
```
If this works, you get clean structured data with zero browser automation.
### Option 2 — use “render endpoints” (when they exist)

Some sites expose endpoints that return pre-rendered HTML fragments (partials) instead of JSON.
Same playbook: identify in Network tab, replicate.
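A sketch of that playbook, assuming a hypothetical partial endpoint and CSS class (substitute whatever you actually find in the Network tab):

```python
import requests
from bs4 import BeautifulSoup

def fetch_partial(url: str) -> str:
    # Some partial endpoints only return the fragment when this header is set.
    r = requests.get(url, timeout=(10, 30),
                     headers={"X-Requested-With": "XMLHttpRequest"})
    r.raise_for_status()
    return r.text

def parse_partial(fragment_html: str) -> list[str]:
    # An HTML fragment parses exactly like a full page.
    soup = BeautifulSoup(fragment_html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(".product-title")]
```

`.product-title` is a placeholder selector; the win is that a partial is much smaller than the full page and needs no JS execution.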
## Step 4: When you actually need headless (Playwright)
Use headless when:
- data is only present after complex JS execution
- the API calls are heavily protected / signed
- the DOM is assembled in a way that’s painful to replicate
### Minimal Playwright example (Python)

```bash
pip install playwright
python -m playwright install chromium
```
```python
from playwright.sync_api import sync_playwright

def scrape_with_browser(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

html = scrape_with_browser("https://example.com")
print(len(html))
```
### Don’t overuse it
If you run Playwright for every URL, your crawl becomes:
- slow (seconds per page)
- resource-heavy
- harder to parallelize
Instead, use a hybrid architecture.
## A hybrid architecture that scales
A practical pattern:
- Try HTTP-only scraping first
- If fields are missing, fallback to headless for that URL
- Cache rendered HTML / extracted JSON
Pseudo-code:
```python
def scrape(url):
    html = fetch_http(url)
    data = parse(html)
    if data_is_complete(data):
        return data
    html = fetch_headless(url)
    return parse(html)
This keeps headless usage low (and costs down).
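The key piece to make this runnable is the completeness check. A sketch where the selectors and required fields are hypothetical and site-specific:

```python
from bs4 import BeautifulSoup

# Hypothetical: the fields that must be present before we trust the HTTP-only result.
REQUIRED_FIELDS = ("title", "price")

def parse(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    def text(selector: str):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else None

    # Placeholder selectors -- replace with the ones your target site uses.
    return {"title": text("h1.title"), "price": text("span.price")}

def data_is_complete(data: dict) -> bool:
    # Fall back to headless only when the cheap path left gaps.
    return all(data.get(field) for field in REQUIRED_FIELDS)
```

A client-rendered shell yields empty fields, `data_is_complete` returns False, and only that URL pays the headless cost.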
## Cost control (the hidden part)
Dynamic scraping gets expensive because:
- more retries
- more time per URL
- more failures
Ways to keep costs down:
- cache aggressively (ETags, last-modified, local snapshots)
- crawl incrementally (only new/changed URLs)
- avoid full renders when JSON endpoints exist
- batch headless tasks (reuse a browser instance)
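The first bullet is often the biggest win. A sketch of conditional requests with ETags, using an in-memory cache for brevity (a real crawler would persist it to disk or a database):

```python
import requests

# Tiny in-memory cache: url -> (etag, body).
_cache: dict[str, tuple[str, str]] = {}

def fetch_cached(url: str) -> str:
    headers = {}
    if url in _cache:
        headers["If-None-Match"] = _cache[url][0]  # send the stored ETag
    r = requests.get(url, timeout=(10, 30), headers=headers)
    if r.status_code == 304:
        # Not modified: reuse the cached body -- no re-download, no re-render.
        return _cache[url][1]
    r.raise_for_status()
    etag = r.headers.get("ETag")
    if etag:
        _cache[url] = (etag, r.text)
    return r.text
```

Unchanged pages come back as a cheap 304 instead of a full body, and never need to reach the headless fallback at all.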
## Where ProxiesAPI helps
Dynamic scraping often increases request volume because:
- you fetch HTML + API calls
- you retry more
- you have more failure modes
ProxiesAPI helps by stabilizing the network layer:
- higher request success rate
- fewer hard blocks
- more predictable crawl schedules
It won’t replace headless, but it makes your pipeline less fragile.
## Comparison table: approaches
| Approach | When it works | Pros | Cons |
|---|---|---|---|
| Requests + BeautifulSoup | HTML contains data | Fast, cheap | Breaks on client-only pages |
| JSON API replication | Data loaded via XHR | Clean structured data | APIs can change / require headers |
| Embedded state parsing | Next/Nuxt hydration | Very efficient | Site-specific parsing |
| Playwright headless | Complex JS-only pages | Most robust | Slow and costly |
| Hybrid | Most real projects | Balanced | More engineering |
## Practical checklist
- Check `view-source:` first
- Look for XHR JSON endpoints
- Search HTML for `__NEXT_DATA__` etc.
- Use Playwright only as a fallback
- Cache everything
- Add retries + backoff
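For the last checklist item, a minimal retry-with-backoff sketch (the retryable status set is a common choice, not a standard):

```python
import random
import time

import requests

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_backoff(url: str, max_attempts: int = 4) -> requests.Response:
    for attempt in range(max_attempts):
        try:
            r = requests.get(url, timeout=(10, 30))
            if r.status_code not in RETRYABLE:
                r.raise_for_status()  # non-retryable errors (404 etc.) surface immediately
                return r
        except (requests.ConnectionError, requests.Timeout):
            pass  # treat transport errors like retryable statuses
        # Exponential backoff (1s, 2s, 4s, ...) plus jitter so parallel workers desynchronize.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```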
## Next upgrades
- build per-site “strategy configs” (http/json/headless)
- add a block-page classifier (captcha / 403 / consent)
- use a queue + worker model for headless fallbacks
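The block-page classifier can start as a plain heuristic. A sketch (the marker strings are examples; tune them per site):

```python
def classify_response(status_code: int, html: str) -> str:
    """Rough triage: decide whether a response is usable or some kind of block page."""
    text = html.lower()
    if status_code in (403, 429):
        return "blocked"
    if "captcha" in text or "are you a robot" in text:
        return "captcha"
    if "accept cookies" in text or "cookie consent" in text:
        return "consent"
    return "ok"
```

Routing "blocked" and "captcha" results to a different retry policy (or to the headless queue) keeps them from silently polluting your parsed data.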