JSON-LD Scraping: Extract Structured Data Without Brittle Selectors
A lot of scrapers start the hard way: by reverse-engineering a page’s CSS selectors, then patching breakages every time the frontend team moves a <div>.
JSON-LD gives you a better first move.
Many sites embed structured data in:
<script type="application/ld+json">...</script>
Those blocks often contain the exact fields you wanted from the HTML anyway:
- product names and prices
- ratings and review counts
- article headlines and publish dates
- recipe ingredients and nutrition fields
So before you write twenty selectors, check whether the page already ships the data as JSON-LD.
In this guide we’ll build a JSON-LD-first scraper, show how to normalize common schema types, and fall back to HTML only when necessary.
JSON-LD often gives you the cleanest product or article fields on a page. Once you know which script blocks matter, ProxiesAPI helps you fetch more of those pages reliably without turning HTML parsing into your bottleneck.
Why JSON-LD scraping is usually the better first pass
JSON-LD has three big advantages:
- it is structured already
- it often maps directly to Schema.org types
- it breaks less often than presentation-layer selectors
That does not make it perfect. Some sites omit fields, duplicate blocks, or ship multiple objects in an @graph. But even then, JSON-LD is usually the fastest path to a reliable extraction.
What JSON-LD looks like
A product page might include a block like this:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Noise-Cancelling Headphones",
  "sku": "NC-100",
  "brand": {"@type": "Brand", "name": "Acme"},
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "199.99",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "1832"
  }
}
</script>
If you only need the name, price, stock state, and rating, the page just handed them to you.
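As a quick illustration, here is a minimal sketch that reads those four fields from the example payload with nothing but the standard library. The `raw` string is a trimmed copy of the block above, and the `summary` keys are names chosen for this example only.

```python
import json

# Trimmed copy of the JSON payload from the example <script> block.
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Noise-Cancelling Headphones",
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "199.99",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "1832"
  }
}
"""

product = json.loads(raw)

# Field names in `summary` are illustrative, not part of Schema.org.
summary = {
    "name": product["name"],
    "price": product["offers"]["price"],
    "in_stock": product["offers"]["availability"].endswith("/InStock"),
    "rating": product["aggregateRating"]["ratingValue"],
}
print(summary)
```

Note that the price and rating arrive as strings here. Real pages vary, so treat the direct key lookups as something that only works for a known-good payload.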
Step 1: Fetch the page
Create json_ld_scraper.py:
from __future__ import annotations

import json
import os
from typing import Any

import requests
from bs4 import BeautifulSoup

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    )
}

session = requests.Session()
session.headers.update(HEADERS)


def build_proxies() -> dict[str, str] | None:
    # Proxy is optional: with no PROXIESAPI_PROXY set, requests go out directly.
    proxy = os.getenv("PROXIESAPI_PROXY")
    if not proxy:
        return None
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


PROXIES = build_proxies()


def fetch_html(url: str) -> str:
    response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)
    response.raise_for_status()
    return response.text
As usual, the ProxiesAPI integration is just the proxies=PROXIES hook. The parsing logic does not change.
Step 2: Extract every JSON-LD script block
Some pages have one block. Others have many. Some use a list, and others wrap multiple objects in @graph.
So the correct first step is: collect everything.
def extract_json_ld_blocks(html: str) -> list[Any]:
    soup = BeautifulSoup(html, "lxml")
    blocks: list[Any] = []

    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        raw = raw.strip()
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Skip malformed blocks instead of failing the whole page.
            continue
        blocks.append(data)

    return blocks
Do not assume there is only one script or only one object.
Step 3: Flatten lists and @graph
This is the part many quick demos skip.
One page may return:
- a single object
- a list of objects
- one object containing @graph
Flatten that early so the rest of your parser stays simple.
def flatten_json_ld(blocks: list[Any]) -> list[dict[str, Any]]:
    out: list[dict[str, Any]] = []

    for block in blocks:
        if isinstance(block, dict):
            graph = block.get("@graph")
            if isinstance(graph, list):
                # One wrapper object holding many nodes
                for item in graph:
                    if isinstance(item, dict):
                        out.append(item)
            else:
                out.append(block)
        elif isinstance(block, list):
            # A top-level array of objects
            for item in block:
                if isinstance(item, dict):
                    out.append(item)

    return out
Now you have one list of JSON-like objects you can filter by schema type.
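The function above handles one level of nesting. If you run into pages where a top-level list itself contains an @graph wrapper, a recursive variant is one option. The sketch below is an alternative under that assumption, not a replacement the original scraper requires; `iter_nodes` and the sample `blocks` are hypothetical names and data.

```python
from typing import Any, Dict, Iterator


def iter_nodes(data: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict node from parsed JSON-LD, expanding lists and @graph."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        graph = data.get("@graph")
        if isinstance(graph, list):
            # The wrapper object itself is dropped; only its nodes are yielded.
            yield from iter_nodes(graph)
        else:
            yield data
    # Strings, numbers, and None are ignored.


# Made-up sample covering the three shapes: object, list, and @graph.
blocks = [
    {"@type": "WebSite", "name": "Example"},
    [{"@type": "BreadcrumbList"}, {"@graph": [{"@type": "Product", "name": "Widget"}]}],
    {"@context": "https://schema.org", "@graph": [{"@type": "Organization"}, "not-a-dict"]},
]

types = [node.get("@type") for node in iter_nodes(blocks)]
print(types)  # ['WebSite', 'BreadcrumbList', 'Product', 'Organization']
```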
Step 4: Select the objects that matter
For most scraping jobs, you only care about a few schema types:
- Product
- Article or NewsArticle
- Recipe
- Review
- FAQPage
def get_types(obj: dict[str, Any]) -> set[str]:
    # @type may be a single string or a list of strings.
    raw_type = obj.get("@type")
    if isinstance(raw_type, list):
        return {str(t) for t in raw_type}
    if isinstance(raw_type, str):
        return {raw_type}
    return set()


def pick_first_of_type(items: list[dict[str, Any]], wanted: set[str]) -> dict[str, Any] | None:
    for item in items:
        if get_types(item) & wanted:
            return item
    return None
Once you do that, the scraper becomes much more readable.
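One wrinkle worth planning for: some pages write @type as a full URL or with a prefix rather than a bare name, in which case an exact match against "Product" fails. The helper below is a hedged sketch of one way to reduce those spellings to the bare name before comparing. `local_type_name` is a hypothetical helper, not part of the scraper above.

```python
def local_type_name(raw_type: str) -> str:
    """Reduce 'https://schema.org/Product' or 'schema:Product' to 'Product'."""
    name = raw_type.strip().rstrip("/#")
    # Keep the part after the last '/' or '#', as found in a full IRI.
    for sep in ("/", "#"):
        if sep in name:
            name = name.rsplit(sep, 1)[1]
    # Then drop a compact prefix such as 'schema:'.
    if ":" in name:
        name = name.rsplit(":", 1)[1]
    return name


for value in ("Product", "schema:Product", "https://schema.org/Product"):
    print(value, "->", local_type_name(value))
```

Applying this inside the set comprehension of a type lookup would let the same `wanted` set match all three spellings.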
Step 5: Normalize common schema types
Product example
def normalize_product(product: dict[str, Any]) -> dict[str, Any]:
    offers = product.get("offers") or {}
    rating = product.get("aggregateRating") or {}
    brand = product.get("brand") or {}

    # offers may be a list; use the first entry.
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    # Guard against unexpected shapes (for example a bare string).
    if not isinstance(offers, dict):
        offers = {}
    if not isinstance(rating, dict):
        rating = {}

    return {
        "schema_type": "Product",
        "name": product.get("name"),
        "sku": product.get("sku"),
        "brand": brand.get("name") if isinstance(brand, dict) else brand,
        "price": offers.get("price"),
        "currency": offers.get("priceCurrency"),
        "availability": offers.get("availability"),
        "rating_value": rating.get("ratingValue"),
        "review_count": rating.get("reviewCount"),
        "url": product.get("url"),
    }
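The normalized product still carries raw values: in the earlier example the price, rating, and review count are strings, and availability is a Schema.org URL. If you want typed values downstream, a small coercion layer is one option. The helpers below are a sketch with hypothetical names (`to_decimal`, `to_int`, `availability_label`), and they assume prices use a dot as the decimal separator.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Optional


def to_decimal(value: Any) -> Optional[Decimal]:
    """Parse a value such as '199.99' or 199.99; return None if it is not a finite number."""
    if value is None:
        return None
    try:
        number = Decimal(str(value).strip())
    except InvalidOperation:
        return None
    return number if number.is_finite() else None


def to_int(value: Any) -> Optional[int]:
    """Parse a whole-number value such as '1832'; return None for anything else."""
    number = to_decimal(value)
    if number is None or number != number.to_integral_value():
        return None
    return int(number)


def availability_label(value: Any) -> Optional[str]:
    """Turn 'https://schema.org/InStock' into 'InStock'."""
    if not isinstance(value, str) or not value.strip():
        return None
    return value.strip().rstrip("/").rsplit("/", 1)[-1]


print(to_decimal("199.99"), to_int("1832"), availability_label("https://schema.org/InStock"))
```

Decimal is used rather than float so that prices keep their exact written value.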
Article example
def normalize_article(article: dict[str, Any]) -> dict[str, Any]:
    author = article.get("author")

    # author may be a list of Person objects, a single object, or a plain string.
    if isinstance(author, list):
        author = ", ".join(
            str(a.get("name") or "") if isinstance(a, dict) else str(a) for a in author
        )
    elif isinstance(author, dict):
        author = author.get("name")

    return {
        "schema_type": "Article",
        "headline": article.get("headline"),
        "description": article.get("description"),
        "date_published": article.get("datePublished"),
        "date_modified": article.get("dateModified"),
        "author": author,
        "url": article.get("url"),
    }
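The date fields come back as strings. If you need real datetime objects for sorting or filtering, the sketch below shows one stdlib-only approach. It assumes the values are ISO 8601 strings, which is common but not guaranteed, and `parse_iso_datetime` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Any, Optional


def parse_iso_datetime(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 date or datetime string; return None if it does not parse."""
    if not isinstance(value, str) or not value.strip():
        return None
    text = value.strip()
    # fromisoformat() in older Python versions rejects a trailing 'Z',
    # so rewrite it as an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


print(parse_iso_datetime("2024-05-01T09:30:00Z"))
print(parse_iso_datetime("2024-05-01"))
print(parse_iso_datetime("last Tuesday"))
```

Date-only values parse to midnight with no timezone, so mixing them with timezone-aware values in comparisons needs care.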
Recipe example
def normalize_recipe(recipe: dict[str, Any]) -> dict[str, Any]:
    rating = recipe.get("aggregateRating") or {}
    # Guard against unexpected shapes before calling .get().
    if not isinstance(rating, dict):
        rating = {}

    return {
        "schema_type": "Recipe",
        "name": recipe.get("name"),
        "category": recipe.get("recipeCategory"),
        "cuisine": recipe.get("recipeCuisine"),
        "yield": recipe.get("recipeYield"),
        "ingredients": recipe.get("recipeIngredient") or [],
        "rating_value": rating.get("ratingValue"),
        "review_count": rating.get("reviewCount"),
        "total_time": recipe.get("totalTime"),
        "url": recipe.get("url"),
    }
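Schema.org defines totalTime as an ISO 8601 duration, so values typically look like PT1H30M rather than "90 minutes". The sketch below converts the common day/hour/minute/second forms to whole minutes. `duration_to_minutes` is a hypothetical helper, and it deliberately ignores the year, month, and week designators.

```python
import re
from typing import Any, Optional

# Matches forms like PT45M, PT1H30M, P1DT2H. Years, months, and weeks are not handled.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value: Any) -> Optional[int]:
    """Convert an ISO 8601 duration string to whole minutes; None if unsupported."""
    if not isinstance(value, str):
        return None
    match = _DURATION_RE.match(value.strip().upper())
    # Reject strings like 'P' or 'PT' that match but carry no components.
    if not match or not any(match.groupdict().values()):
        return None
    parts = {key: int(val or 0) for key, val in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return total_seconds // 60


print(duration_to_minutes("PT1H30M"))  # 90
print(duration_to_minutes("PT45M"))    # 45
```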
Step 6: Build a JSON-LD-first scraper with HTML fallback
This is the production pattern you actually want:
- try JSON-LD first
- see what is missing
- fill only the missing fields from HTML selectors
def scrape_page(url: str) -> dict[str, Any]:
    html = fetch_html(url)
    soup = BeautifulSoup(html, "lxml")
    blocks = extract_json_ld_blocks(html)
    items = flatten_json_ld(blocks)

    product = pick_first_of_type(items, {"Product"})
    article = pick_first_of_type(items, {"Article", "NewsArticle", "BlogPosting"})
    recipe = pick_first_of_type(items, {"Recipe"})

    if product:
        data = normalize_product(product)
        # HTML fallback only for the missing field
        if not data.get("name"):
            title = soup.select_one("h1")
            data["name"] = title.get_text(" ", strip=True) if title else None
        return data

    if article:
        data = normalize_article(article)
        if not data.get("headline"):
            title = soup.select_one("h1")
            data["headline"] = title.get_text(" ", strip=True) if title else None
        return data

    if recipe:
        data = normalize_recipe(recipe)
        if not data.get("ingredients"):
            data["ingredients"] = [
                li.get_text(" ", strip=True)
                for li in soup.select(".ingredients-item")
            ]
        return data

    # No supported schema type found on the page
    return {"schema_type": None, "url": url}
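As the number of fallback fields grows, the repeated "if not data.get(...)" checks can be pulled into one helper. The sketch below is one possible shape, not the article's required structure: `fill_missing` is a hypothetical name, and each fallback is a zero-argument callable so that the HTML lookup only runs when the JSON-LD value is actually missing.

```python
from typing import Any, Callable, Dict


def fill_missing(
    data: Dict[str, Any], fallbacks: Dict[str, Callable[[], Any]]
) -> Dict[str, Any]:
    """Return a copy of data with empty fields filled by calling their fallback."""
    merged = dict(data)
    for field, get_value in fallbacks.items():
        # Treat None, empty string, and empty list as "missing".
        if merged.get(field) in (None, "", []):
            merged[field] = get_value()
    return merged


calls = []


def h1_text() -> str:
    # Stand-in for a selector lookup such as soup.select_one("h1").
    calls.append("h1")
    return "Fallback Title"


record = {"name": "Noise-Cancelling Headphones", "price": None}
filled = fill_missing(record, {"name": h1_text, "price": lambda: "199.99"})
print(filled)  # name is kept from JSON-LD, price comes from the fallback
print(calls)   # [] because the name fallback never ran
```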
Notice what we are not doing:
- we are not building the whole scraper from CSS selectors first
- we are not parsing dozens of presentational elements unless JSON-LD misses something
That keeps the maintenance surface much smaller.
Comparison: JSON-LD-first vs selector-first
| Approach | Strength | Weakness |
|---|---|---|
| JSON-LD-first | Cleaner data, faster implementation, fewer brittle selectors | Some sites omit fields or ship noisy graph objects |
| Selector-first | Works even when no structured data exists | More fragile, more time-consuming, harder to maintain |
| Hybrid | Best production choice for many sites | Slightly more code, but lower long-term pain |
The hybrid approach wins most of the time:
- JSON-LD for the obvious fields
- HTML selectors only for the gaps
Common JSON-LD scraping issues
1. Multiple unrelated objects
A page may contain Organization, BreadcrumbList, and WebSite before the Product you actually want. Always filter by @type.
2. Invalid JSON
Some sites ship malformed blocks. Wrap json.loads() in try/except and skip broken scripts instead of killing the whole scrape.
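One malformed case can be recovered cheaply: raw control characters, such as a literal newline inside a string value. Python's json module rejects those by default but accepts them with strict=False. The sketch below tries strict parsing first and then the lenient mode. `parse_json_ld` is a hypothetical helper, and it returns None for blocks that still fail, which also means a block containing only JSON null is indistinguishable from a failure.

```python
import json
from typing import Any


def parse_json_ld(raw: str) -> Any:
    """Parse a JSON-LD payload, retrying leniently; return None if it cannot be parsed."""
    text = raw.strip()
    for strict in (True, False):
        try:
            # strict=False allows control characters (e.g. raw newlines) inside strings.
            return json.loads(text, strict=strict)
        except json.JSONDecodeError:
            continue
    return None


# A raw newline inside a string value is invalid in strict JSON.
broken = '{"@type": "Product", "description": "Line one\nLine two"}'
print(parse_json_ld(broken))
print(parse_json_ld('{"@type": }'))  # still unparseable, so None
```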
3. Arrays vs single objects
offers, author, and other properties may be a dict on one site and a list on another. Normalize both shapes.
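A small pair of helpers can absorb most of that variation before the normalizers run. This is a sketch with hypothetical names (`as_list`, `first_dict`), shown on made-up sample values.

```python
from typing import Any, Dict, List


def as_list(value: Any) -> List[Any]:
    """Wrap a single value in a list; pass lists through; map None to an empty list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def first_dict(value: Any) -> Dict[str, Any]:
    """Return the first dict found in value (single or list), or an empty dict."""
    for item in as_list(value):
        if isinstance(item, dict):
            return item
    return {}


single = {"@type": "Offer", "price": "199.99"}
many = [{"@type": "Offer", "price": "189.00"}, {"@type": "Offer", "price": "205.00"}]

print(first_dict(single)["price"])  # 199.99
print(first_dict(many)["price"])    # 189.00
print(first_dict(None))             # {}
```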
4. Incomplete structured data
Sometimes the JSON-LD has the price but not the visible discount, or the headline but not the body text. That is exactly when a selective HTML fallback makes sense.
Where ProxiesAPI fits
JSON-LD reduces parser brittleness. It does not remove fetch-layer problems.
If you are scraping many product or article pages, you still may need:
- retry handling
- proxy rotation
- better success rates over time
That is why the fetch layer should stay separate from the parser:
response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)
Once that is in place, your JSON-LD extraction code can scale without becoming the thing that breaks first.
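Retry handling can live in the same fetch layer. The sketch below wraps any fetch function in a retry loop with exponential backoff. `fetch_with_retries` and its parameters are hypothetical, the fake fetcher exists only to make the example runnable offline, and real code would catch a narrower exception type (for example requests.RequestException) instead of Exception.

```python
import time
from typing import Callable, List


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on failure with delays of base_delay * 2**n seconds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:  # assumption: narrow this to network errors in real code
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Offline demo: a fake fetcher that fails twice, then succeeds.
state = {"calls": 0}
delays: List[float] = []


def flaky_fetch(url: str) -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


html = fetch_with_retries(flaky_fetch, "https://example.com/p/1", sleep=delays.append)
print(html, delays)  # <html>ok</html> [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the function testable without real waiting. In the scraper above, `fetch_html` could be passed as the `fetch` argument.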
Final takeaway
JSON-LD scraping is one of the highest-leverage habits you can build into a scraper.
Before you touch the visible HTML, check for:
script[type="application/ld+json"]
If the page already gives you structured Product, Article, Recipe, or Review data, use that first. Then add HTML selectors only where the schema is incomplete.
That one change usually leads to:
- fewer selectors
- fewer breakages
- cleaner datasets
- faster production scrapers