JSON-LD Scraping: Extract Structured Data Without Brittle Selectors

Jun 18, 2026 · tutorial · #web-scraping, #json ld scraping, #python, #json-ld, #schema.org, #beautifulsoup, #proxies

A lot of scrapers start the hard way: by reverse-engineering a page’s CSS selectors, then patching breakages every time the frontend team moves a <div>.

JSON-LD gives you a better first move.

Many sites embed structured data in:

<script type="application/ld+json">...</script>

Those blocks often contain the exact fields you wanted from the HTML anyway:

product names and prices
ratings and review counts
article headlines and publish dates
recipe ingredients and nutrition fields

So before you write twenty selectors, check whether the page already ships the data as JSON-LD.
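That check can be a few lines of standard-library Python. The sketch below is one way to do it, not the scraper built later in this guide: `JsonLdProbe` and `probe_types` are hypothetical names, and it only looks at top-level objects (it does not descend into @graph).

```python
from __future__ import annotations

import json
from html.parser import HTMLParser


class JsonLdProbe(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.raw_blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self.raw_blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.raw_blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def probe_types(html: str) -> list[str]:
    """Return the top-level @type values found in the page's JSON-LD."""
    probe = JsonLdProbe()
    probe.feed(html)
    found: list[str] = []
    for raw in probe.raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # broken block: ignore it for this quick check
        for node in data if isinstance(data, list) else [data]:
            if not isinstance(node, dict):
                continue
            raw_type = node.get("@type")
            for t in raw_type if isinstance(raw_type, list) else [raw_type]:
                if isinstance(t, str):
                    found.append(t)
    return found
```

If `probe_types(html)` comes back with Product, Article, or Recipe, the page is a good candidate for the JSON-LD-first approach.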

In this guide we’ll build a JSON-LD-first scraper, show how to normalize common schema types, and fall back to HTML only when necessary.

Pull the clean data first, then scale the fetch layer

JSON-LD often gives you the cleanest product or article fields on a page. Once you know which script blocks matter, ProxiesAPI helps you fetch more of those pages reliably without turning HTML parsing into your bottleneck.


Why JSON-LD scraping is usually the better first pass

JSON-LD has three big advantages:

it is structured already
it often maps directly to Schema.org types
it breaks less often than presentation-layer selectors

That does not make it perfect. Some sites omit fields, duplicate blocks, or ship multiple objects in an @graph. But even then, JSON-LD is usually the fastest path to a reliable extraction.

What JSON-LD looks like

A product page might include a block like this:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Noise-Cancelling Headphones",
  "sku": "NC-100",
  "brand": {"@type": "Brand", "name": "Acme"},
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "199.99",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "1832"
  }
}
</script>

If you only need the name, price, stock state, and rating, the page just handed them to you.
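As a quick illustration, here is a trimmed copy of the block above read with nothing but `json.loads`. The `summary` dict and its key names are made up for this example. Note that every value arrives as a string, exactly as the page wrote it.

```python
import json

raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Noise-Cancelling Headphones",
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "199.99",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "1832"
  }
}
"""

product = json.loads(raw)

# Direct key access is fine for a known-good sample; real pages need .get().
summary = {
    "name": product["name"],
    "price": product["offers"]["price"],
    "in_stock": product["offers"]["availability"].endswith("/InStock"),
    "rating": product["aggregateRating"]["ratingValue"],
}
print(summary)
```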

Step 1: Fetch the page

Create json_ld_scraper.py:

from __future__ import annotations

import json
import os
from typing import Any

import requests
from bs4 import BeautifulSoup

TIMEOUT = (10, 30)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    )
}

session = requests.Session()
session.headers.update(HEADERS)


def build_proxies() -> dict[str, str] | None:
    proxy = os.getenv("PROXIESAPI_PROXY")
    if not proxy:
        return None
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


PROXIES = build_proxies()


def fetch_html(url: str) -> str:
    response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)
    response.raise_for_status()
    return response.text

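The ProxiesAPI integration is just the proxies=PROXIES argument on session.get(). When PROXIESAPI_PROXY is unset, build_proxies() returns None and no explicit proxy is passed. The parsing logic does not change either way.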

Step 2: Extract every JSON-LD script block

Some pages have one block. Others have many. Some use a list, and others wrap multiple objects in @graph.

So the correct first step is: collect everything.

def extract_json_ld_blocks(html: str) -> list[Any]:
    soup = BeautifulSoup(html, "lxml")
    blocks: list[Any] = []

    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        raw = raw.strip()
        if not raw:
            continue

        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue

        blocks.append(data)

    return blocks

Do not assume there is only one script or only one object.

Step 3: Flatten lists and @graph

This is the part many quick demos skip.

One page may return:

a single object
a list of objects
one object containing @graph

Flatten that early so the rest of your parser stays simple.

def flatten_json_ld(blocks: list[Any]) -> list[dict[str, Any]]:
    out: list[dict[str, Any]] = []

    for block in blocks:
        if isinstance(block, dict):
            graph = block.get("@graph")
            if isinstance(graph, list):
                for item in graph:
                    if isinstance(item, dict):
                        out.append(item)
            else:
                out.append(block)

        elif isinstance(block, list):
            for item in block:
                if isinstance(item, dict):
                    out.append(item)

    return out

Now you have one list of JSON-like objects you can filter by schema type.
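The function above covers the three shapes listed, but not a list whose items carry their own @graph. If you run into that, a recursive generator is one option. This is a sketch with a hypothetical name (`iter_nodes`), and it assumes a dict with @graph is a pure wrapper whose other keys can be dropped.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_nodes(value: Any) -> Iterator[dict[str, Any]]:
    """Yield every dict node, descending through lists and @graph wrappers."""
    if isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)
    elif isinstance(value, dict):
        graph = value.get("@graph")
        if isinstance(graph, (list, dict)):
            # Treat the dict as a wrapper: emit its children, not the wrapper.
            yield from iter_nodes(graph)
        else:
            yield value
    # Strings, numbers, and None are ignored.
```

`list(iter_nodes(blocks))` can stand in for `flatten_json_ld(blocks)` when nested shapes show up.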

Step 4: Select the objects that matter

For most scraping jobs, you only care about a few schema types:

Product
Article, NewsArticle, or BlogPosting
Recipe
Review
FAQPage

def get_types(obj: dict[str, Any]) -> set[str]:
    raw_type = obj.get("@type")
    if isinstance(raw_type, list):
        return {str(t) for t in raw_type}
    if isinstance(raw_type, str):
        return {raw_type}
    return set()


def pick_first_of_type(items: list[dict[str, Any]], wanted: set[str]) -> dict[str, Any] | None:
    for item in items:
        if get_types(item) & wanted:
            return item
    return None

Once you do that, the scraper becomes much more readable.
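One caveat: JSON-LD allows @type to be written as a full IRI or with a prefix, so a page may say "https://schema.org/Product" or "schema:Product" instead of "Product". A small normalizer keeps the type filter from missing those. The helper names below are hypothetical, and the sketch assumes the short name is whatever follows the last "/", "#", or ":".

```python
from __future__ import annotations

from typing import Any


def short_type(raw_type: str) -> str:
    """Reduce 'https://schema.org/Product' or 'schema:Product' to 'Product'."""
    for sep in ("/", "#", ":"):
        raw_type = raw_type.rsplit(sep, 1)[-1]
    return raw_type


def normalized_types(obj: dict[str, Any]) -> set[str]:
    """Like get_types(), but with IRI and prefix forms collapsed to short names."""
    raw = obj.get("@type")
    values = raw if isinstance(raw, list) else [raw]
    return {short_type(v) for v in values if isinstance(v, str)}
```

Swapping `normalized_types` in for `get_types` inside `pick_first_of_type` would let `{"Product"}` match all three spellings.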

Step 5: Normalize common schema types

Product example

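def normalize_product(product: dict[str, Any]) -> dict[str, Any]:
    offers = product.get("offers") or {}
    rating = product.get("aggregateRating") or {}
    brand = product.get("brand") or {}

    # offers and brand may be a single object or a list; take the first entry.
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    if isinstance(brand, list):
        brand = brand[0] if brand else {}

    # Guard against unexpected shapes (e.g. a bare string) before calling .get().
    if not isinstance(offers, dict):
        offers = {}
    if not isinstance(rating, dict):
        rating = {}

    return {
        "schema_type": "Product",
        "name": product.get("name"),
        "sku": product.get("sku"),
        "brand": brand.get("name") if isinstance(brand, dict) else brand,
        # AggregateOffer exposes lowPrice instead of price.
        "price": offers.get("price", offers.get("lowPrice")),
        "currency": offers.get("priceCurrency"),
        "availability": offers.get("availability"),
        "rating_value": rating.get("ratingValue"),
        "review_count": rating.get("reviewCount"),
        "url": product.get("url"),
    }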
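The normalized values are still raw strings: the price is "199.99" and availability is a Schema.org URL. If you want typed values downstream, a second pass can convert them. This is a sketch with hypothetical helper names. It assumes prices use a dot as the decimal separator and returns None for anything it cannot parse rather than guessing.

```python
from __future__ import annotations

from decimal import Decimal, InvalidOperation
from typing import Any


def to_decimal(value: Any) -> Decimal | None:
    """Convert '199.99' or 199.99 to Decimal; return None if it does not parse."""
    if value is None:
        return None
    try:
        return Decimal(str(value).strip())
    except InvalidOperation:
        return None


def availability_token(value: Any) -> str | None:
    """Reduce 'https://schema.org/InStock' (or plain 'InStock') to 'InStock'."""
    if not isinstance(value, str) or not value:
        return None
    return value.rstrip("/").rsplit("/", 1)[-1]
```

Decimal avoids the rounding surprises of float when prices are later summed or compared.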

Article example

def normalize_article(article: dict[str, Any]) -> dict[str, Any]:
    author = article.get("author")
    if isinstance(author, list):
        author = ", ".join(
            a.get("name", "") if isinstance(a, dict) else str(a) for a in author
        )
    elif isinstance(author, dict):
        author = author.get("name")

    return {
        "schema_type": "Article",
        "headline": article.get("headline"),
        "description": article.get("description"),
        "date_published": article.get("datePublished"),
        "date_modified": article.get("dateModified"),
        "author": author,
        "url": article.get("url"),
    }
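datePublished and dateModified normally arrive as ISO 8601 strings. If you need real datetime objects, the standard library can parse most of them. The helper name below is hypothetical. The "Z" handling is there because `datetime.fromisoformat()` rejects a trailing "Z" before Python 3.11.

```python
from __future__ import annotations

from datetime import datetime


def parse_iso_datetime(value: str | None) -> datetime | None:
    """Parse an ISO 8601 date or datetime string; return None if it does not parse."""
    if not value:
        return None
    text = value.strip()
    if text.endswith("Z"):
        # Rewrite the UTC designator into an explicit offset.
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None
```

Values without a time zone come back as naive datetimes, so decide on a default zone before comparing dates across sites.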

Recipe example

def normalize_recipe(recipe: dict[str, Any]) -> dict[str, Any]:
    rating = recipe.get("aggregateRating") or {}
    if not isinstance(rating, dict):  # guard against unexpected shapes
        rating = {}

    return {
        "schema_type": "Recipe",
        "name": recipe.get("name"),
        "category": recipe.get("recipeCategory"),
        "cuisine": recipe.get("recipeCuisine"),
        "yield": recipe.get("recipeYield"),
        "ingredients": recipe.get("recipeIngredient") or [],
        "rating_value": rating.get("ratingValue"),
        "review_count": rating.get("reviewCount"),
        "total_time": recipe.get("totalTime"),
        "url": recipe.get("url"),
    }
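totalTime is usually an ISO 8601 duration such as "PT1H30M", which is awkward to sort or filter on. A small converter turns it into minutes. This sketch uses a hypothetical name and only handles days, hours, minutes, and seconds. Weeks, months, years, and fractional values return None.

```python
from __future__ import annotations

import re

# P[nD][T[nH][nM][nS]] with whole-number components only.
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def duration_minutes(value: object) -> int | None:
    """Convert an ISO 8601 duration such as 'PT1H30M' to whole minutes."""
    if not isinstance(value, str):
        return None
    match = _DURATION.match(value.strip().upper())
    if match is None or not any(match.groups()):
        return None  # not a duration, or an empty one like "PT"
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return days * 1440 + hours * 60 + minutes + seconds // 60
```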

Step 6: Build a JSON-LD-first scraper with HTML fallback

This is the production pattern you actually want:

try JSON-LD first
see what is missing
fill only the missing fields from HTML selectors

def scrape_page(url: str) -> dict[str, Any]:
    html = fetch_html(url)
    soup = BeautifulSoup(html, "lxml")

    blocks = extract_json_ld_blocks(html)
    items = flatten_json_ld(blocks)

    product = pick_first_of_type(items, {"Product"})
    article = pick_first_of_type(items, {"Article", "NewsArticle", "BlogPosting"})
    recipe = pick_first_of_type(items, {"Recipe"})

    if product:
        data = normalize_product(product)
        if not data.get("name"):
            title = soup.select_one("h1")
            data["name"] = title.get_text(" ", strip=True) if title else None
        return data

    if article:
        data = normalize_article(article)
        if not data.get("headline"):
            title = soup.select_one("h1")
            data["headline"] = title.get_text(" ", strip=True) if title else None
        return data

    if recipe:
        data = normalize_recipe(recipe)
        if not data.get("ingredients"):
            data["ingredients"] = [
                li.get_text(" ", strip=True)
                for li in soup.select(".ingredients-item")
            ]
        return data

    return {"schema_type": None, "url": url}

Notice what we are not doing:

we are not building the whole scraper from CSS selectors first
we are not parsing dozens of presentational elements unless JSON-LD misses something

That keeps the maintenance surface much smaller.
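If the per-type fallback branches start to pile up, the same "fill only the gaps" idea can be pulled into one helper. The sketch below is an illustration with a hypothetical name (`fill_missing`). Each fallback is a zero-argument callable, so the HTML lookup runs only when JSON-LD left the field empty.

```python
from __future__ import annotations

from typing import Any, Callable

_EMPTY = (None, "", [])


def fill_missing(
    data: dict[str, Any],
    fallbacks: dict[str, Callable[[], Any]],
) -> list[str]:
    """Fill empty fields in place; return the names of the fields that were filled."""
    filled: list[str] = []
    for field, get_value in fallbacks.items():
        if data.get(field) in _EMPTY:
            value = get_value()  # only runs when the field is empty
            if value not in _EMPTY:
                data[field] = value
                filled.append(field)
    return filled
```

A fallback for the product name could be something like `lambda: soup.select_one("h1").get_text(" ", strip=True)` guarded for a missing h1. The returned list is handy for logging which pages needed the HTML path.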

Comparison: JSON-LD-first vs selector-first

Approach	Strength	Weakness
JSON-LD-first	Cleaner data, faster implementation, fewer brittle selectors	Some sites omit fields or ship noisy graph objects
Selector-first	Works even when no structured data exists	More fragile, more time-consuming, harder to maintain
Hybrid	Best production choice for many sites	Slightly more code, but lower long-term pain

The hybrid approach wins most of the time:

JSON-LD for the obvious fields
HTML selectors only for the gaps

Common JSON-LD scraping issues

1. Multiple unrelated objects

A page may contain Organization, BreadcrumbList, and WebSite before the Product you actually want. Always filter by @type.

2. Invalid JSON

Some sites ship malformed blocks. Wrap json.loads() in try/except and skip broken scripts instead of killing the whole scrape.
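Skipping is the safe default, but some breakage is cheap to repair first. The sketch below is a hypothetical helper that handles two cases, on the assumption that they are the ones you are hitting: HTML comment or CDATA wrappers around the payload, and raw control characters (such as literal newlines) inside strings, which `json.loads(..., strict=False)` accepts.

```python
import json
import re

# A leading "<!--" or CDATA opener, or a trailing "-->" or CDATA closer.
_WRAPPERS = re.compile(
    r"^\s*(?:<!--|//\s*<!\[CDATA\[|<!\[CDATA\[)|(?://\s*\]\]>|\]\]>|-->)\s*$"
)


def loads_lenient(raw):
    """Parse a JSON-LD payload, tolerating wrappers and control characters."""
    text = _WRAPPERS.sub("", raw).strip()
    try:
        # strict=False allows raw control characters inside JSON strings.
        return json.loads(text, strict=False)
    except json.JSONDecodeError:
        return None  # still broken: let the caller skip this block
```

Anything beyond that (trailing commas, unescaped quotes) is usually better skipped than patched with regexes.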

3. Arrays vs single objects

offers, author, and other properties may be a dict on one site and a list on another. Normalize both shapes.
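Two tiny helpers cover most of this. The names are hypothetical. They accept a dict, a list, a scalar, or None and always hand back a predictable shape.

```python
from __future__ import annotations

from typing import Any


def as_list(value: Any) -> list[Any]:
    """Wrap a single value in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def first_dict(value: Any) -> dict[str, Any]:
    """Return the first dict in a value that may be a dict, a list, or anything else."""
    for item in as_list(value):
        if isinstance(item, dict):
            return item
    return {}
```

With these, `first_dict(product.get("offers")).get("price")` works whether offers is one Offer, a list of them, or missing.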

4. Incomplete structured data

Sometimes the JSON-LD has the price but not the visible discount, or the headline but not the body text. That is exactly when a selective HTML fallback makes sense.

Where ProxiesAPI fits

JSON-LD reduces parser brittleness. It does not remove fetch-layer problems.

If you are scraping many product or article pages, you still may need:

retry handling
proxy rotation
better success rates over time

That is why the fetch layer should stay separate from the parser:

response = session.get(url, timeout=TIMEOUT, proxies=PROXIES)

Once that is in place, your JSON-LD extraction code can scale without becoming the thing that breaks first.
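Retry handling can live in the same layer. The sketch below is a generic wrapper with a hypothetical name (`fetch_with_retries`). It retries on any exception with exponential backoff plus jitter. In a real scraper you would likely narrow the except clause to network errors such as requests.RequestException.

```python
from __future__ import annotations

import random
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying failures with exponential backoff plus jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```

Usage is `html = fetch_with_retries(fetch_html, url)`. The parser never needs to know a retry happened.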

Final takeaway

JSON-LD scraping is one of the highest-leverage habits you can build into a scraper.

Before you touch the visible HTML, check for:

script[type="application/ld+json"]

If the page already gives you structured Product, Article, Recipe, or Review data, use that first. Then add HTML selectors only where the schema is incomplete.

That one change usually leads to:

fewer selectors
fewer breakages
cleaner datasets
faster production scrapers
