Web Scraping Dynamic Content: How to Handle JavaScript-Rendered Pages
“Why is my scraper returning an empty page?”
If you’ve scraped a few websites, you’ve hit this: you requests.get() a URL, parse the HTML, and… the content isn’t there.
That’s usually because the page is JavaScript-rendered:
- the server returns a lightweight shell
- the browser runs JS
- JS calls APIs (XHR/fetch)
- the page fills in results after load
This post gives you a practical decision tree to handle web scraping dynamic content without overengineering.
We’ll cover:
- how to detect JS-rendered pages
- how to find the underlying API calls (the easiest path)
- when to scrape “HTML endpoints” instead
- when to use a headless browser (Playwright)
- where proxies help (and when they don’t)
JS-heavy sites often mean more requests (API calls + assets) and more rate-limiting. ProxiesAPI gives you a proxy layer you can turn on when reliability matters.
Step 1: Identify whether the content is JS-rendered
A fast checklist:
- View Page Source in your browser
- if the data you want is missing, it’s likely rendered by JS
- curl or requests the page, then compare the HTML to what you see in the browser
- Look for placeholders in the HTML
- an empty <div id="app">, lots of script tags, and minimal markup
Quick terminal test
curl -s "https://example.com" | head -n 40
If you don’t see your target data anywhere in the HTML, you have three main paths.
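The curl test above can also be scripted. A minimal sketch (the URL and marker are placeholders — the marker is any text you can see on the live page in your browser, like a product name):

```python
import requests

def looks_js_rendered(html: str, marker: str) -> bool:
    """True if a marker you can see in the browser is missing from the
    raw HTML -- a strong hint the content is rendered by JavaScript."""
    return marker not in html

def check(url: str, marker: str) -> bool:
    """Fetch the page without executing JS, then run the check."""
    html = requests.get(
        url, headers={"User-Agent": "Mozilla/5.0"}, timeout=(10, 30)
    ).text
    return looks_js_rendered(html, marker)
```

If `check("https://example.com/products", "laptop")` returns True, the data you want isn't in the server response, and you're in decision-tree territory.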
The decision tree (what to do next)
Path A (best): scrape the underlying API calls
Most JS pages are powered by JSON responses.
If you can find the API endpoint that returns the data, you get:
- faster requests
- simpler parsing
- less brittle selectors
How to find it:
- Open DevTools → Network tab
- Filter by Fetch/XHR
- Reload the page
- Click requests that look like
search,listings,products,graphql, etc. - Check the Response tab
If you see JSON with the fields you want, that’s your target.
Python example: call a JSON API directly
import requests

TIMEOUT = (10, 30)

r = requests.get(
    "https://api.example.com/search",
    params={"q": "laptop", "page": 1},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
    },
    timeout=TIMEOUT,
)
r.raise_for_status()

data = r.json()
items = data.get("items", [])
print("items", len(items))
if items:
    print(items[0])
If the API requires headers/cookies (common), capture them from DevTools and reproduce.
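A minimal way to reproduce the browser's request context is a requests.Session preloaded with the headers and cookies you captured. The values below are hypothetical placeholders — copy the real ones from the request's Headers pane in DevTools (header names and cookie names vary per site):

```python
import requests

def build_api_session(session_cookie: str) -> requests.Session:
    """Reproduce the browser's request context for a JSON API.

    Every header/cookie value here is a placeholder -- replace with the
    actual values captured from DevTools for your target site.
    """
    s = requests.Session()
    s.headers.update({
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        "Referer": "https://example.com/search",  # some APIs check this
        "X-Requested-With": "XMLHttpRequest",     # common XHR marker
    })
    s.cookies.set("sessionid", session_cookie)    # cookie name varies per site
    return s
```

Then call the API through the session, e.g. `build_api_session(cookie).get(url, params=..., timeout=(10, 30))`, so every request carries the same context.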
Path B: scrape an HTML endpoint (often hidden)
Some sites provide HTML endpoints for:
- SEO crawlers
- older clients
- alternate views
Examples:
- adding query params like ?output=1 or ?render=1
- switching to an “AMP” or “print” version
How to discover:
- search the HTML for alternate links (rel="amphtml", canonical)
- check if the site has a /sitemap.xml
- look at internal navigation links (sometimes they point to server-rendered pages)
This approach keeps things simple: requests + BeautifulSoup.
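Once you've found a server-rendered endpoint, parsing is plain BeautifulSoup. A sketch with hypothetical selectors (adjust .product / .title / .price to the actual markup of the page you found):

```python
from bs4 import BeautifulSoup

def parse_listings(html: str) -> list[dict]:
    """Parse a server-rendered listings page.

    The CSS selectors are hypothetical examples -- inspect the real
    HTML endpoint and adjust them to its markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select(".product"):
        title = card.select_one(".title")
        price = card.select_one(".price")
        items.append({
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return items
```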
Path C (fallback): use a headless browser
If the site:
- requires JS to render the content and
- hides data behind GraphQL calls with complex signatures or
- requires interactions (infinite scroll, button clicks, logged-in flows)
…then headless browser automation is the pragmatic choice.
The best tool today is Playwright.
Playwright example (Python): extract rendered HTML
pip install playwright
python -m playwright install chromium
import asyncio
from playwright.async_api import async_playwright

async def main():
    url = "https://example.com/products"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        # If content loads after scroll:
        # await page.mouse.wheel(0, 2000)
        # await page.wait_for_timeout(1000)

        html = await page.content()
        print("html bytes", len(html))
        await browser.close()

asyncio.run(main())
Once you have HTML, parse with BeautifulSoup as usual.
Performance + reliability tradeoffs
A quick comparison:
- API scraping (XHR/JSON)
  - Fastest, most stable
  - Requires investigation in DevTools
- HTML endpoint scraping
  - Simple, cheap
  - Might be incomplete or removed
- Headless browser scraping
  - Most compatible
  - Slowest, most resource-heavy
  - More moving parts (timeouts, navigation, selectors)
Practical anti-block basics (without overclaiming)
Dynamic sites often mean:
- more requests per “page” (API calls + assets)
- stricter rate limits
- bot detection heuristics
Practical steps that help:
- set realistic timeouts
- retry on 429/503 with backoff
- keep concurrency low (start with 1–3)
- cache responses (huge for debugging)
- rotate user agents sparingly (don’t randomize every request)
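The retry-with-backoff step can be sketched in a few lines. This is a minimal version: the delay doubles each attempt, plus a little jitter so concurrent workers don't retry in lockstep:

```python
import random
import time
import requests

RETRY_STATUSES = {429, 503}

def backoff_delays(max_tries: int, base: float = 1.0) -> list[float]:
    """Exponential schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(max_tries)]

def fetch_with_backoff(url: str, *, session=None, max_tries: int = 4,
                       **kwargs) -> requests.Response:
    """GET with retries on 429/503, sleeping through the schedule + jitter."""
    sess = session or requests.Session()
    for delay in backoff_delays(max_tries):
        resp = sess.get(url, timeout=(10, 30), **kwargs)
        if resp.status_code not in RETRY_STATUSES:
            return resp
        time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids lockstep
    return resp
```

Caching sits naturally on top of this: wrap `fetch_with_backoff` and write responses to disk keyed by URL, so re-running during debugging doesn't burn requests.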
Where ProxiesAPI fits (honestly)
Proxies are not a silver bullet. But they do help in common failure modes:
- your tracker runs every hour/day and starts getting 429s
- some runs fail due to regional/rate-limit variability
- your IP gets temporarily throttled after repeated API calls
If you keep proxy usage as a toggle in your fetch layer, you can turn ProxiesAPI on for:
- scheduled jobs
- larger watchlists
- higher page depth
…and keep local dev/prototyping proxy-free.
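One way to keep that toggle: a single fetch entry point that routes either direct or through the proxy layer, controlled by an environment variable. The endpoint and parameter names below are assumptions in the style of ProxiesAPI-like services — confirm the exact API with your provider's docs:

```python
import os
import requests

# Assumed endpoint/parameter names -- verify against your provider's docs.
PROXY_ENDPOINT = "http://api.proxiesapi.com"

def route(url: str, use_proxy: bool, auth_key: str = "") -> tuple[str, dict]:
    """Decide whether a request goes direct or through the proxy layer."""
    if use_proxy:
        return PROXY_ENDPOINT, {"auth_key": auth_key, "url": url}
    return url, {}

def fetch(url: str, **kwargs) -> requests.Response:
    """Single fetch entry point: toggle with the USE_PROXY env var."""
    use_proxy = os.environ.get("USE_PROXY") == "1"
    target, params = route(url, use_proxy,
                           os.environ.get("PROXIESAPI_KEY", ""))
    return requests.get(target, params=params, timeout=(10, 60), **kwargs)
```

Local runs leave USE_PROXY unset and hit sites directly; scheduled jobs export USE_PROXY=1 and the same code routes through the proxy.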
A simple playbook you can reuse
When a page is dynamic:
- Try API scraping first (Network → XHR → JSON)
- If no obvious API, try an alternate HTML endpoint
- If interactions are required, use Playwright
- Add proxies only when reliability demands it
That sequence keeps your scraper fast, maintainable, and easier to debug.