Web Scraping Dynamic Content: 5 Reliable Ways to Handle JavaScript-Rendered Pages
If you’ve ever run a scraper and gotten back an HTML page with… basically nothing in it, you’ve met the problem:
dynamic content.
In 2026, many sites render key data with JavaScript:
- product lists
- availability/pricing
- reviews
- infinite scroll feeds
This post is a practical playbook for web scraping dynamic content without cargo-culting headless browsers for everything.
We’ll cover:
- how to detect JS-rendered content
- 5 reliable strategies (from simplest to heaviest)
- when each strategy wins
- practical code examples
Even when the content is rendered by JavaScript, you still have to fetch multiple endpoints reliably (HTML, XHR JSON, assets). ProxiesAPI gives you a stable fetch primitive you can apply to both page HTML and API calls.
First: how to detect a JavaScript-rendered page
Before you pick a tool, do a 30-second diagnosis.
1) View Source vs Inspect Element
- View Source shows the raw HTML returned by the server.
- Inspect Element shows the DOM after JavaScript runs.
If “Inspect” shows lots of items but “View Source” doesn’t, the content is likely rendered client-side.
2) Quick curl test
curl -s "https://example.com" | head -n 30
If you don’t see the data you expect (product names, prices, etc.), it’s probably dynamic.
3) Network tab: XHR/Fetch calls
Open DevTools → Network → filter by Fetch/XHR.
If you see JSON responses containing your target data, you’re in luck: you can often scrape the API directly.
Strategy 1: Scrape the underlying JSON/XHR endpoint (best default)
If the site fetches data via an API call, you can usually:
- reproduce the request (URL + headers + params)
- parse JSON directly
- avoid browser automation entirely
Example: mimic an XHR JSON call
import requests

TIMEOUT = (10, 30)

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
    "Accept": "application/json,text/plain,*/*",
})

def fetch_json(url: str) -> dict:
    r = session.get(url, timeout=TIMEOUT)
    r.raise_for_status()
    return r.json()

# Replace with the real XHR URL you find in DevTools.
# data = fetch_json("https://example.com/api/search?q=shoes&page=1")
# print(data.keys())
Pros
- fastest + most reliable
- structured data (no messy HTML parsing)
- easier to paginate
Cons
- endpoints may require auth tokens
- signatures/anti-bot can exist
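If the endpoint does want a token, you can often copy it straight out of the request headers in DevTools and attach it to the session from the example above. A minimal sketch; the header names and values below are placeholders, not something every site expects:

# Placeholders: copy the real header names/values from the XHR request
# in DevTools (Network tab -> the request -> Request Headers).
session.headers.update({
    "Authorization": "Bearer PASTE_TOKEN_FROM_DEVTOOLS",
    "Referer": "https://example.com/search",
})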
Strategy 2: Reverse-engineer pagination + filters (keep it API-first)
Dynamic sites often paginate via:
- page numbers like page=2
- cursors like cursor=abc123
- offsets like offset=40
Once you find the request shape, implement a crawler that:
- iterates pages/cursors
- dedupes results
- respects rate limits
from time import sleep

def crawl_pages(base_url: str, pages: int = 5) -> list[dict]:
    out = []
    for p in range(1, pages + 1):
        url = f"{base_url}&page={p}"
        data = fetch_json(url)
        # TODO: adapt to your endpoint structure
        items = data.get("items") or []
        out.extend(items)
        print("page", p, "items", len(items), "total", len(out))
        sleep(0.5)
    return out
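The sketch above assumes page numbers. Cursor-based endpoints look slightly different, and this is also where dedupe comes in. A hedged variant reusing fetch_json and sleep from above; the field names (items, next_cursor, id) are assumptions you’d adapt to your endpoint:

def crawl_cursor(base_url: str, max_pages: int = 10) -> list[dict]:
    # Field names ("items", "next_cursor", "id") are assumptions;
    # adapt them to whatever your endpoint actually returns.
    seen_ids = set()
    out = []
    cursor = None
    for _ in range(max_pages):
        url = base_url if cursor is None else f"{base_url}&cursor={cursor}"
        data = fetch_json(url)
        for item in data.get("items") or []:
            item_id = item.get("id")
            if item_id in seen_ids:
                continue  # dedupe across pages
            seen_ids.add(item_id)
            out.append(item)
        cursor = data.get("next_cursor")
        if not cursor:
            break
        sleep(0.5)
    return out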
This approach scales better than “open a browser and scroll.”
Strategy 3: Use a headless browser (Playwright) when you must
Sometimes:
- the API endpoints are heavily protected
- data is assembled from multiple calls
- the page uses complex runtime rendering
That’s when a browser automation tool like Playwright is appropriate.
Minimal Playwright example (Python)
pip install playwright
playwright install
from playwright.sync_api import sync_playwright

def get_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

# html = get_rendered_html("https://example.com")
# print(len(html))
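networkidle is a blunt instrument. If you know a selector that only appears once the data has rendered, waiting for it is usually faster and less flaky. A sketch under that assumption; the selector is a placeholder:

from playwright.sync_api import sync_playwright

def get_html_when_ready(url: str, selector: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait for an element that only exists after the JS has rendered
        # the data you care about (e.g. a product card).
        page.wait_for_selector(selector, timeout=30_000)
        html = page.content()
        browser.close()
    return html

# html = get_html_when_ready("https://example.com", ".product-card")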
Pros
- most “human-like” rendering
- works when there’s no clean API
Cons
- slower, heavier, more expensive
- more moving parts (browser installs, timeouts)
Strategy 4: Hybrid approach (browser to discover API, Requests to crawl)
This is the move that most “serious” scrapers end up using:
- open the page in Playwright
- capture the XHR requests that contain data
- extract the real API URL + headers
- switch back to Requests for the bulk crawl
Why it’s great:
- you use the browser only for discovery
- your crawler stays fast and cheap
Conceptually:
Playwright (1 time) -> discover endpoint + tokens
Requests (N times) -> crawl pages, parse JSON
Playwright can listen to network responses; once you have the endpoint, you can implement the crawler with the retry/timeouts patterns from your Requests stack.
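A minimal sketch of the discovery step: load the page once and record which responses came back as JSON. The function name and the content-type filter are our choices, not a fixed recipe:

from playwright.sync_api import sync_playwright

def discover_json_endpoints(url: str) -> list[str]:
    found = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def on_response(response):
            # Keep anything that looks like a JSON API response.
            ctype = response.headers.get("content-type", "")
            if "application/json" in ctype:
                found.append(response.url)

        page.on("response", on_response)
        page.goto(url, wait_until="networkidle")
        browser.close()
    return found

# endpoints = discover_json_endpoints("https://example.com")
# for u in endpoints:
#     print(u)

From there, pick the endpoint that carries your data and crawl it with the Requests code from Strategies 1 and 2.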
Strategy 5: Last resort techniques (when the site fights back)
If you’re dealing with aggressive anti-bot measures, you may need to combine:
- session/cookie management
- realistic headers
- rate limiting
- multiple fetch strategies
And importantly: detect bot pages.
A practical heuristic:
- if the HTML contains “enable JavaScript”, “are you a robot”, or CAPTCHA markers, treat the page as blocked and don’t feed it into your parser
def looks_blocked(html: str) -> bool:
    markers = [
        "captcha",
        "are you a robot",
        "enable javascript",
    ]
    h = (html or "").lower()
    return any(m in h for m in markers)
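Putting those pieces together, a hedged retry wrapper that reuses session and TIMEOUT from Strategy 1, backs off between attempts, and refuses to hand blocked pages to your parser:

import time
import requests

def fetch_html_with_retry(url: str, attempts: int = 3) -> str | None:
    for attempt in range(1, attempts + 1):
        try:
            r = session.get(url, timeout=TIMEOUT)
            r.raise_for_status()
            if not looks_blocked(r.text):
                return r.text
            print("blocked response, backing off:", url)
        except requests.RequestException as exc:
            print("attempt", attempt, "failed:", exc)
        time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, 8s
    return None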
Decision table: which strategy should you use?
| Situation | Best strategy |
|---|---|
| Data is in XHR JSON | 1) scrape the API |
| API paginates cleanly | 2) reverse-engineer pagination |
| No usable API, content only after JS | 3) Playwright |
| You can discover API via browser | 4) Hybrid |
| Heavy anti-bot | 5) Last resort combo |
Where ProxiesAPI fits (honestly)
Dynamic scraping often means you’re fetching more than one thing:
- the initial HTML
- one or more JSON endpoints
- detail endpoints
Even if you use Playwright for rendering, your pipeline usually still includes plain HTTP calls for scale.
ProxiesAPI can help by giving you a consistent fetch wrapper for both HTML and JSON endpoints:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com/api/search?page=1" | head
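The same call from Python, letting requests URL-encode the target URL for you, which matters once the target has its own query string. The key and url parameter names mirror the curl example above; check the docs for your account for the exact parameters:

import requests

def fetch_via_proxiesapi(target_url: str, api_key: str) -> str:
    # Passing the target as a param lets requests percent-encode it,
    # so nested query strings (?page=1 etc.) survive intact.
    r = requests.get(
        "http://api.proxiesapi.com/",
        params={"key": api_key, "url": target_url},
        timeout=(10, 30),
    )
    r.raise_for_status()
    return r.text

# html = fetch_via_proxiesapi("https://example.com/api/search?page=1", "API_KEY")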
In practice:
- use API-first strategies when possible
- reserve browsers for discovery or truly browser-only pages
- keep retries/timeouts/dedupe regardless of approach
Quick checklist
- confirm it’s dynamic (View Source vs Inspect)
- look for XHR JSON endpoints first
- implement pagination + dedupe
- add retries + timeouts
- use Playwright only when necessary