JSON-LD Scraping: Extract Structured Data Without Brittle Selectors

A lot of scrapers start the hard way: by reverse-engineering a page’s CSS selectors, then patching breakages every time the frontend team moves a <div>.

JSON-LD gives you a better first move.

Many sites embed structured data in:

<script type="application/ld+json">...</script>

Those blocks often contain the exact fields you wanted from the HTML anyway:

  • product names and prices
  • ratings and review counts
  • article headlines and publish dates
  • recipe ingredients and nutrition fields

So before you write twenty selectors, check whether the page already ships the data as JSON-LD.

In this guide we’ll build a JSON-LD-first scraper, show how to normalize common schema types, and fall back to HTML only when necessary.

Pull the clean data first, then scale the fetch layer

JSON-LD often gives you the cleanest product or article fields on a page. Once you know which script blocks matter, ProxiesAPI helps you fetch more of those pages reliably without turning HTML parsing into your bottleneck.


Why JSON-LD scraping is usually the better first pass

JSON-LD has three big advantages:

  1. it is structured already
  2. it often maps directly to Schema.org types
  3. it breaks less often than presentation-layer selectors

That does not make it perfect. Some sites omit fields, duplicate blocks, or ship multiple objects in an @graph. But even then, JSON-LD is usually the fastest path to a reliable extraction.


What JSON-LD looks like

A product page might include a block like this:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Noise-Cancelling Headphones",
  "sku": "NC-100",
  "brand": {"@type": "Brand", "name": "Acme"},
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "199.99",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "1832"
  }
}
</script>

If you only need the name, price, stock state, and rating, the page just handed them to you.
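
As a quick check, the block above can be read with nothing but the standard library. This is a minimal sketch, assuming the JSON text has already been pulled out of the script tag; `raw` is a trimmed copy of the example, not output from a real site.

```python
import json

# Trimmed copy of the example block above, without the <script> wrapper.
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Noise-Cancelling Headphones",
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "199.99",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "1832"
  }
}
"""

product = json.loads(raw)

name = product["name"]
price = product["offers"]["price"]  # still a string at this point
in_stock = product["offers"]["availability"].endswith("/InStock")
rating = product["aggregateRating"]["ratingValue"]

print(name, price, in_stock, rating)
```

Note that the price and rating arrive as strings here, so any numeric conversion is a separate step.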


Step 1: Fetch the page

Create json_ld_scraper.py:

from __future__ import annotations

import json
import os
from typing import Any

import requests
from bs4 import BeautifulSoup

TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    )
}

session = requests.Session()
session.headers.update(HEADERS)


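def build_proxies() -> dict[str, str] | None:
    # PROXIESAPI_PROXY is expected without a scheme (for example host:port),
    # because "http://" is prepended below.
    proxy = os.getenv("PROXIESAPI_PROXY")
    if not proxy:
        return None  # no proxy configured
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}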


PROXIES = build_proxies()


def fetch_html(url: str) -> str:
    response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)
    response.raise_for_status()
    return response.text

The ProxiesAPI integration is just the proxies=PROXIES hook. When PROXIESAPI_PROXY is unset, PROXIES is None and requests falls back to its default behavior. The parsing logic does not change either way.


Step 2: Extract every JSON-LD script block

Some pages have one block. Others have many. Some use a list, and others wrap multiple objects in @graph.

So the correct first step is: collect everything.

def extract_json_ld_blocks(html: str) -> list[Any]:
    soup = BeautifulSoup(html, "lxml")
    blocks: list[Any] = []

    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        raw = raw.strip()
        if not raw:
            continue

        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue

        blocks.append(data)

    return blocks

Do not assume there is only one script or only one object.
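
To see why collecting everything matters, here is a small self-contained demonstration. It is a sketch that swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies; `JsonLdCollector`, `parse_blocks`, and `SAMPLE_HTML` are hypothetical names and are not part of the scraper above.

```python
import json
from html.parser import HTMLParser
from typing import Any


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._buffer = []
        self.raw_blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.raw_blocks.append("".join(self._buffer))
            self._inside = False


def parse_blocks(html: str) -> list[Any]:
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for raw in collector.raw_blocks:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip the malformed block, keep the rest
    return parsed


# Made-up page: three blocks, the second has a trailing comma (invalid JSON).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Acme"}</script>
<script type="application/ld+json">{"@type": "Product", "name": "Widget",}</script>
<script type="application/ld+json">[{"@type": "BreadcrumbList"}, {"@type": "Product", "name": "Widget"}]</script>
</head><body></body></html>
"""

blocks = parse_blocks(SAMPLE_HTML)
print(len(blocks), [type(b).__name__ for b in blocks])
```

Two blocks survive: one dict and one list. A scraper that read only the first script tag would have stopped at the Organization object and never seen the Product.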


Step 3: Flatten lists and @graph

This is the part many quick demos skip.

One page may return:

  • a single object
  • a list of objects
  • one object containing @graph

Flatten that early so the rest of your parser stays simple.

def flatten_json_ld(blocks: list[Any]) -> list[dict[str, Any]]:
    out: list[dict[str, Any]] = []

    for block in blocks:
        if isinstance(block, dict):
            graph = block.get("@graph")
            if isinstance(graph, list):
                for item in graph:
                    if isinstance(item, dict):
                        out.append(item)
            else:
                out.append(block)

        elif isinstance(block, list):
            for item in block:
                if isinstance(item, dict):
                    out.append(item)

    return out

Now you have one list of JSON-like objects you can filter by schema type.
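
The function above covers the three common shapes. A list whose items themselves carry an @graph would pass through unexpanded, so if that shows up in practice, a recursive variant is one option. The following is a sketch only; `flatten_nested` is a hypothetical name and the sample data is made up.

```python
from typing import Any


def flatten_nested(node: Any) -> list[dict[str, Any]]:
    """Recursively flatten lists and @graph containers into a flat list of dicts."""
    if isinstance(node, list):
        out = []
        for child in node:
            out.extend(flatten_nested(child))
        return out
    if isinstance(node, dict):
        graph = node.get("@graph")
        if isinstance(graph, list):
            return flatten_nested(graph)
        return [node]
    return []  # strings, numbers, None: nothing to collect


blocks = [
    {"@type": "WebSite", "name": "Acme"},
    [{"@type": "BreadcrumbList"}, {"@graph": [{"@type": "Product", "name": "Widget"}]}],
    {"@context": "https://schema.org", "@graph": [{"@type": "Article"}, "stray string"]},
]

items = flatten_nested(blocks)
print([item.get("@type") for item in items])
# ['WebSite', 'BreadcrumbList', 'Product', 'Article']
```

Like the original, this drops the wrapper object that holds @graph, so any fields on the wrapper itself are not kept.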


Step 4: Select the objects that matter

For most scraping jobs, you only care about a few schema types:

  • Product
  • Article or NewsArticle
  • Recipe
  • Review
  • FAQPage

def get_types(obj: dict[str, Any]) -> set[str]:
    raw_type = obj.get("@type")
    if isinstance(raw_type, list):
        return {str(t) for t in raw_type}
    if isinstance(raw_type, str):
        return {raw_type}
    return set()


def pick_first_of_type(items: list[dict[str, Any]], wanted: set[str]) -> dict[str, Any] | None:
    for item in items:
        if get_types(item) & wanted:
            return item
    return None

Once you do that, the scraper becomes much more readable.
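
One case the exact-match check does not cover: JSON-LD allows @type to be written as a full IRI or with a prefix, so a site may emit "https://schema.org/Product" or "schema:Product" rather than "Product". Assuming you run into that, a sketch like the following reduces each value to its bare name before matching. `normalized_types` is a hypothetical helper, not part of the scraper above.

```python
from typing import Any


def normalized_types(obj: dict[str, Any]) -> set[str]:
    """Return bare type names, e.g. 'Product' for 'https://schema.org/Product'."""
    raw = obj.get("@type")
    values = raw if isinstance(raw, list) else [raw]
    names = set()
    for value in values:
        if not isinstance(value, str):
            continue
        # Keep only the part after the last '/', '#', or ':' separator.
        for sep in ("/", "#", ":"):
            value = value.rsplit(sep, 1)[-1]
        if value:
            names.add(value)
    return names


# Made-up items showing the three spellings of @type.
items = [
    {"@type": "Organization", "name": "Acme"},
    {"@type": ["Product", "Thing"], "name": "Widget"},
    {"@type": "https://schema.org/Recipe", "name": "Soup"},
]

wanted = {"Recipe"}
match = next((item for item in items if normalized_types(item) & wanted), None)
print(match["name"] if match else None)  # Soup
```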


Step 5: Normalize common schema types

Product example

def normalize_product(product: dict[str, Any]) -> dict[str, Any]:
    offers = product.get("offers") or {}
    rating = product.get("aggregateRating") or {}
    brand = product.get("brand") or {}

    # offers may be a list of Offer objects; use the first one.
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    # Guard against unexpected shapes (for example a bare string) before calling .get().
    if not isinstance(offers, dict):
        offers = {}
    if not isinstance(rating, dict):
        rating = {}

    return {
        "schema_type": "Product",
        "name": product.get("name"),
        "sku": product.get("sku"),
        # brand is either a Brand object or a plain string.
        "brand": brand.get("name") if isinstance(brand, dict) else brand,
        "price": offers.get("price"),
        "currency": offers.get("priceCurrency"),
        "availability": offers.get("availability"),
        "rating_value": rating.get("ratingValue"),
        "review_count": rating.get("reviewCount"),
        "url": product.get("url"),
    }
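
The normalized product still carries price, rating, and review count as strings, exactly as the JSON-LD example shipped them. If you need numbers downstream, a conversion step along these lines is one option. This is a sketch under assumptions: `to_decimal`, `to_int`, and `short_availability` are hypothetical helpers, and values with thousands separators are deliberately treated as unparseable rather than guessed at.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Optional


def to_decimal(value: Any) -> Optional[Decimal]:
    """Parse a price-like value such as "199.99" or 199.99; None if unparseable."""
    if value is None or isinstance(value, bool):
        return None
    try:
        number = Decimal(str(value).strip())
    except InvalidOperation:
        return None
    return number if number.is_finite() else None


def to_int(value: Any) -> Optional[int]:
    """Parse a count such as "1832"; None for non-integers like "4.6"."""
    number = to_decimal(value)
    if number is None or number != number.to_integral_value():
        return None
    return int(number)


def short_availability(value: Any) -> Optional[str]:
    """Turn "https://schema.org/InStock" into "InStock"."""
    if not isinstance(value, str) or not value:
        return None
    return value.rstrip("/").rsplit("/", 1)[-1]


# Values as they appear in the example product block.
record = {
    "price": "199.99",
    "rating_value": "4.6",
    "review_count": "1832",
    "availability": "https://schema.org/InStock",
}

print(
    to_decimal(record["price"]),
    to_decimal(record["rating_value"]),
    to_int(record["review_count"]),
    short_availability(record["availability"]),
)
```

Decimal is used for the price so that "199.99" is not turned into a binary float along the way.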

Article example

def normalize_article(article: dict[str, Any]) -> dict[str, Any]:
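    author = article.get("author")
    if isinstance(author, list):
        # Authors may be Person/Organization dicts or plain strings; drop empty names.
        names = [
            str(a.get("name") or "") if isinstance(a, dict) else str(a) for a in author
        ]
        author = ", ".join(n for n in names if n) or None
    elif isinstance(author, dict):
        author = author.get("name")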

    return {
        "schema_type": "Article",
        "headline": article.get("headline"),
        "description": article.get("description"),
        "date_published": article.get("datePublished"),
        "date_modified": article.get("dateModified"),
        "author": author,
        "url": article.get("url"),
    }
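
The datePublished and dateModified values come through as strings. Schema.org expects ISO 8601 there, so assuming the site follows that, they can be parsed with the standard library. `parse_iso_datetime` is a hypothetical helper; it returns None rather than raising when a site ships something else.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def parse_iso_datetime(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 date or datetime string; None if it does not parse."""
    if not isinstance(value, str) or not value.strip():
        return None
    text = value.strip()
    # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z".
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


published = parse_iso_datetime("2024-05-01T09:30:00Z")
date_only = parse_iso_datetime("2024-05-01")
garbage = parse_iso_datetime("yesterday")

print(published, date_only, garbage)
```

Date-only values parse to midnight with no timezone attached, so be careful when comparing them against timezone-aware values.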

Recipe example

def normalize_recipe(recipe: dict[str, Any]) -> dict[str, Any]:
    rating = recipe.get("aggregateRating") or {}

    return {
        "schema_type": "Recipe",
        "name": recipe.get("name"),
        "category": recipe.get("recipeCategory"),
        "cuisine": recipe.get("recipeCuisine"),
        "yield": recipe.get("recipeYield"),
        "ingredients": recipe.get("recipeIngredient") or [],
        "rating_value": rating.get("ratingValue"),
        "review_count": rating.get("reviewCount"),
        "total_time": recipe.get("totalTime"),
        "url": recipe.get("url"),
    }
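
The totalTime field is typically an ISO 8601 duration such as "PT1H30M", which is awkward to sort or filter on. A small converter is one way to handle it. This sketch assumes only the day, hour, minute, and second components appear; `duration_to_minutes` is a hypothetical helper, and anything it does not recognize becomes None.

```python
import re
from typing import Any, Optional

# Matches durations like P1DT2H, PT1H30M, PT45M, PT90S (whole numbers only).
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value: Any) -> Optional[int]:
    """Convert an ISO 8601 duration string to whole minutes; None if unrecognized."""
    if not isinstance(value, str):
        return None
    match = _DURATION_RE.match(value.strip().upper())
    # Reject strings such as "P" or "PT" that match but carry no components.
    if not match or not any(match.groupdict().values()):
        return None
    parts = {key: int(val) if val else 0 for key, val in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return total_seconds // 60


print(duration_to_minutes("PT1H30M"))     # 90
print(duration_to_minutes("P1DT2H"))      # 1560
print(duration_to_minutes("45 minutes"))  # None
```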

Step 6: Build a JSON-LD-first scraper with HTML fallback

This is the production pattern you actually want:

  1. try JSON-LD first
  2. see what is missing
  3. fill only the missing fields from HTML selectors

def scrape_page(url: str) -> dict[str, Any]:
    html = fetch_html(url)
    soup = BeautifulSoup(html, "lxml")

    blocks = extract_json_ld_blocks(html)
    items = flatten_json_ld(blocks)

    product = pick_first_of_type(items, {"Product"})
    article = pick_first_of_type(items, {"Article", "NewsArticle", "BlogPosting"})
    recipe = pick_first_of_type(items, {"Recipe"})

    if product:
        data = normalize_product(product)
        if not data.get("name"):
            title = soup.select_one("h1")
            data["name"] = title.get_text(" ", strip=True) if title else None
        return data

    if article:
        data = normalize_article(article)
        if not data.get("headline"):
            title = soup.select_one("h1")
            data["headline"] = title.get_text(" ", strip=True) if title else None
        return data

    if recipe:
        data = normalize_recipe(recipe)
        if not data.get("ingredients"):
            data["ingredients"] = [
                li.get_text(" ", strip=True)
                for li in soup.select(".ingredients-item")
            ]
        return data

    return {"schema_type": None, "url": url}

Notice what we are not doing:

  • we are not building the whole scraper from CSS selectors first
  • we are not parsing dozens of presentational elements unless JSON-LD misses something

That keeps the maintenance surface much smaller.
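
The three-step pattern (try JSON-LD, see what is missing, fill only the gaps) can also be written as a small generic helper rather than one if-block per field. The following is a sketch, not a replacement for scrape_page: `missing_fields` and `fill_missing` are hypothetical names, and the fallback functions here are stand-ins for real selector lookups such as soup.select_one("h1").

```python
from typing import Any, Callable


def missing_fields(data: dict[str, Any], required: list[str]) -> list[str]:
    """Names of required fields that are absent or empty in the JSON-LD record."""
    return [name for name in required if data.get(name) in (None, "", [])]


def fill_missing(
    data: dict[str, Any],
    required: list[str],
    fallbacks: dict[str, Callable[[], Any]],
) -> dict[str, Any]:
    """Run an HTML fallback only for fields JSON-LD did not provide."""
    filled = dict(data)
    for name in missing_fields(filled, required):
        fallback = fallbacks.get(name)
        if fallback is not None:
            filled[name] = fallback()
    return filled


calls = []  # records which fallbacks actually ran


def name_from_h1() -> str:
    calls.append("name")
    return "Noise-Cancelling Headphones"  # stand-in for an <h1> lookup


def price_from_html() -> str:
    calls.append("price")
    return "0.00"  # never used below, because JSON-LD already had a price


record = {"name": None, "price": "199.99", "currency": "USD"}
result = fill_missing(
    record,
    required=["name", "price"],
    fallbacks={"name": name_from_h1, "price": price_from_html},
)

print(result["name"], result["price"], calls)
```

Because the fallbacks are callables, a selector runs only when its field is actually missing, which keeps the HTML side of the scraper as small as the text above suggests.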


Comparison: JSON-LD-first vs selector-first

Approach | Strength | Weakness
JSON-LD-first | Cleaner data, faster implementation, fewer brittle selectors | Some sites omit fields or ship noisy graph objects
Selector-first | Works even when no structured data exists | More fragile, more time-consuming, harder to maintain
Hybrid | Best production choice for many sites | Slightly more code, but lower long-term pain

The hybrid approach wins most of the time:

  • JSON-LD for the obvious fields
  • HTML selectors only for the gaps

Common JSON-LD scraping issues

1. Multiple unrelated objects

A page may contain Organization, BreadcrumbList, and WebSite before the Product you actually want. Always filter by @type.

2. Invalid JSON

Some sites ship malformed blocks. Wrap json.loads() in try/except and skip broken scripts instead of killing the whole scrape.
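
Skipping is the safe default, but some breakage is recoverable. The sketch below assumes two specific kinds, both of which are my assumptions rather than something this guide measured: blocks wrapped in HTML comment or CDATA markers, and strings containing literal newlines or tabs. `load_json_ld_lenient` is a hypothetical helper; anything it cannot repair still returns None.

```python
import json
from typing import Any, Optional

# Wrapper markers occasionally left around the JSON by templates (assumed cases).
_PREFIXES = ("<!--", "//<![CDATA[", "<![CDATA[")
_SUFFIXES = ("-->", "//]]>", "]]>")


def load_json_ld_lenient(raw: str) -> Optional[Any]:
    """Parse a JSON-LD block, tolerating wrappers and raw control characters."""
    text = raw.strip()
    for prefix in _PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):].lstrip()
    for suffix in _SUFFIXES:
        if text.endswith(suffix):
            text = text[: -len(suffix)].rstrip()
    if not text:
        return None
    try:
        # strict=False allows literal newlines and tabs inside JSON strings.
        return json.loads(text, strict=False)
    except json.JSONDecodeError:
        return None


wrapped = load_json_ld_lenient('<!-- {"@type": "Product"} -->')
multiline = load_json_ld_lenient('{"description": "line one\nline two"}')
broken = load_json_ld_lenient('{"@type": "Product",}')

print(wrapped, multiline, broken)
```

A trailing comma is still rejected here. Repairing that kind of error means guessing at the author's intent, so skipping the block remains the safer choice.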

3. Arrays vs single objects

offers, author, and other properties may be a dict on one site and a list on another. Normalize both shapes.
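
One way to normalize both shapes is to coerce every such property to a list before reading it. A minimal sketch, with `as_list` and `names_of` as hypothetical helpers:

```python
from typing import Any


def as_list(value: Any) -> list[Any]:
    """Wrap a single value in a list; pass lists through; None becomes []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def names_of(value: Any) -> list[str]:
    """Collect names from a dict, a string, or a list mixing both."""
    names = []
    for item in as_list(value):
        if isinstance(item, dict):
            item = item.get("name")
        if isinstance(item, str) and item.strip():
            names.append(item.strip())
    return names


print(names_of({"@type": "Person", "name": "Ada"}))  # ['Ada']
print(names_of([{"name": "Ada"}, "Grace"]))          # ['Ada', 'Grace']
print(names_of(None))                                # []
```

With this in place, the calling code always iterates over a list and never needs to branch on the shape a particular site chose.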

4. Incomplete structured data

Sometimes the JSON-LD has the price but not the visible discount, or the headline but not the body text. That is exactly when a selective HTML fallback makes sense.


Where ProxiesAPI fits

JSON-LD reduces parser brittleness. It does not remove fetch-layer problems.

If you are scraping many product or article pages, you still may need:

  • retry handling
  • proxy rotation
  • better success rates over time

That is why the fetch layer should stay separate from the parser:

response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)

Once that is in place, your JSON-LD extraction code can scale without becoming the thing that breaks first.
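
For the retry handling mentioned above, one common approach with requests is to mount an adapter configured with urllib3's Retry. This is a sketch under assumptions: `build_retrying_session` is a hypothetical helper, the status codes and backoff values are illustrative defaults rather than recommendations from ProxiesAPI, and the `allowed_methods` argument requires urllib3 1.26 or newer.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total: int = 3, backoff: float = 0.5) -> requests.Session:
    """Session that retries idempotent requests on transient HTTP errors."""
    retry = Retry(
        total=total,                                  # maximum retry attempts
        backoff_factor=backoff,                       # growing sleep between attempts
        status_forcelist=(429, 500, 502, 503, 504),   # retry on these responses
        allowed_methods=("GET", "HEAD"),              # never retry non-idempotent calls
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


retrying_session = build_retrying_session()
adapter = retrying_session.get_adapter("https://example.com/")
print(adapter.max_retries.total)
```

The session built this way can stand in for the plain requests.Session() in Step 1; the proxies=PROXIES argument and the parser are unaffected.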


Final takeaway

JSON-LD scraping is one of the highest-leverage habits you can build into a scraper.

Before you touch the visible HTML, check for:

script[type="application/ld+json"]

If the page already gives you structured Product, Article, Recipe, or Review data, use that first. Then add HTML selectors only where the schema is incomplete.

That one change usually leads to:

  • fewer selectors
  • fewer breakages
  • cleaner datasets
  • faster production scrapers