Scrape Book Reviews and Ratings from Goodreads with Python (JSON-LD + Top Reviews)

Jun 01, 2026 · tutorial · #python, #goodreads, #web-scraping, #requests, #beautifulsoup, #json, #proxies

Goodreads book pages contain a surprisingly clean source of truth for ratings and review counts: JSON-LD (structured data embedded in HTML).

In this tutorial you’ll build a scraper that:

fetches a Goodreads book page
extracts ratingValue, ratingCount, and reviewCount from the JSON-LD block
collects a few top review snippets (where present)
outputs clean JSON


When pages get slow or flaky, add ProxiesAPI

Goodreads pages are heavy and can be inconsistent under load. Keep your scraper reliable with timeouts, retries, and optional ProxiesAPI routing so your extraction stays simple.

Get 1,000 free API calls View pricing

What we’re scraping (Goodreads structure)

A typical book page looks like:

https://www.goodreads.com/book/show/4671.The_Great_Gatsby

In the HTML you’ll usually find:

a <script type="application/ld+json">...</script> block describing the book
visible HTML sections for reviews (may paginate / load dynamically)

The JSON-LD is the most stable place to start because it’s designed for machines (search engines), not humans.
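To see what such a block looks like before writing the scraper, here is a minimal sketch that uses only the standard library. `SAMPLE_HTML` is a made-up, heavily trimmed page and `JsonLdCollector` is a hypothetical helper; the field values are invented for illustration and are not real Goodreads data.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


# Hypothetical sample; a real page is far larger and may hold several blocks.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Book", "name": "Example Book",
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": 4.2,
                     "ratingCount": 1000, "reviewCount": 50}}
</script>
</head><body><article>Review text ...</article></body></html>
"""

collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
parsed = [json.loads(block) for block in collector.blocks]
types = [item.get("@type") for item in parsed]
print(types)
```

Printing the `@type` of each block is a quick way to confirm which one describes the book before relying on it.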

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML (ProxiesAPI-ready)

Even before adding proxies, treat every HTTP request as production traffic:

set a real User-Agent
use connect/read timeouts
add retries with backoff for transient failures

from __future__ import annotations

import os
import random
import time
import requests

TIMEOUT = (10, 30)  # connect, read
UA = "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)"

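# Optional: route through an HTTP proxy (including ProxiesAPI upstream proxy mode, if you use it).
# requests also reads HTTP_PROXY / HTTPS_PROXY on its own; passing them explicitly makes the routing visible.
PROXY = os.environ.get("HTTP_PROXY") or os.environ.get("HTTPS_PROXY")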

session = requests.Session()
session.headers.update({"User-Agent": UA})


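def fetch(url: str, attempts: int = 3) -> str:
    proxies = {"http": PROXY, "https": PROXY} if PROXY else None
    last_exc: Exception | None = None
    for i in range(attempts):
        try:
            r = session.get(url, timeout=TIMEOUT, proxies=proxies)
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:  # network and HTTP errors only
            last_exc = e
            status = getattr(e.response, "status_code", None)
            # Client errors other than 429 (for example 404) will not succeed on retry.
            if status is not None and 400 <= status < 500 and status != 429:
                break
            if i < attempts - 1:  # no point sleeping after the final attempt
                time.sleep((2**i) + random.random())
    raise last_exc or RuntimeError("fetch() needs attempts >= 1")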
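The retry loop waits roughly 1 s, 2 s, 4 s and so on, plus up to a second of random jitter. A minimal sketch that makes this schedule explicit and easy to test in isolation follows. `backoff_delays`, `base`, `cap`, and `rng` are hypothetical names introduced here, and the cap is an added assumption, not something the tutorial's `fetch` enforces.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the sleep before each retry: base * 2**i plus up to 1 s of jitter, capped.

    The list has one entry fewer than `attempts`, because nothing has to be
    waited for after the final attempt.
    """
    rng = rng or random.Random()
    delays = []
    for i in range(max(attempts - 1, 0)):
        delay = base * (2 ** i) + rng.random()
        delays.append(min(delay, cap))
    return delays


# Seeded generator so repeated runs of the demo give the same schedule.
delays = backoff_delays(attempts=4, rng=random.Random(0))
print([round(d, 2) for d in delays])
```

The jitter keeps several workers that failed at the same moment from retrying in lockstep, and the cap stops a long retry budget from producing minute-long sleeps.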

Step 2: Extract ratings from JSON-LD (the stable path)

The JSON-LD block is usually a Book schema with aggregateRating.

import json
from bs4 import BeautifulSoup


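def extract_jsonld_book(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    # A page can carry several JSON-LD blocks, each an object or a list of objects.
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks instead of crashing
        for item in data if isinstance(data, list) else [data]:
            types = item.get("@type") if isinstance(item, dict) else None
            if "Book" in (types if isinstance(types, list) else [types]):
                return item
    return None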


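def _author_name(author) -> str | None:
    # schema.org allows "author" to be one object, a list of objects, or a plain string.
    if isinstance(author, list):
        author = author[0] if author else None
    if isinstance(author, dict):
        return author.get("name")
    return author if isinstance(author, str) else None


def extract_ratings(book_schema: dict) -> dict:
    agg = book_schema.get("aggregateRating") or {}
    return {
        "title": book_schema.get("name"),
        "author": _author_name(book_schema.get("author")),
        "rating_value": agg.get("ratingValue"),
        "rating_count": agg.get("ratingCount"),
        "review_count": agg.get("reviewCount"),
        "image": book_schema.get("image"),
        "url": book_schema.get("url"),
    }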

This gets you the “big numbers” reliably without fragile selectors.
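JSON-LD values can arrive as numbers or as strings, so downstream code is easier to write if the three metrics are coerced to consistent types. The sketch below assumes the dictionary shape returned by `extract_ratings`; `to_float`, `to_int`, and `normalize_ratings` are hypothetical helpers, and the sample values are invented. Which types Goodreads actually emits is not stated here, so the code accepts both.

```python
import math
from typing import Any, Optional


def to_float(value: Any) -> Optional[float]:
    """Best-effort conversion to a finite float; None when the value is not numeric."""
    if value is None or isinstance(value, bool):
        return None
    try:
        number = float(str(value).replace(",", "").strip())
    except ValueError:
        return None
    return number if math.isfinite(number) else None


def to_int(value: Any) -> Optional[int]:
    """Convert via to_float so strings such as "1,234" are handled too."""
    number = to_float(value)
    return int(number) if number is not None else None


def normalize_ratings(ratings: dict) -> dict:
    """Return a copy with the three metrics coerced to float/int (or None)."""
    out = dict(ratings)
    out["rating_value"] = to_float(ratings.get("rating_value"))
    out["rating_count"] = to_int(ratings.get("rating_count"))
    out["review_count"] = to_int(ratings.get("review_count"))
    return out


# Invented sample mixing strings and numbers.
sample = {"title": "Example Book", "rating_value": "3.93",
          "rating_count": "1,234", "review_count": 56}
normalized = normalize_ratings(sample)
print(normalized)
```

Returning `None` for anything unparseable keeps a single odd value from aborting a long crawl, while still making the gap visible in the output.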

Step 3: Extract a few review snippets (best-effort)

Reviews are the part that changes most often.

The robust strategy is:

select a review “container” pattern (often an <article> or a div with stable attributes)
extract only what you can confidently locate
keep it best-effort and don’t break your whole pipeline if reviews are missing

def extract_top_reviews(html: str, limit: int = 5) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    reviews: list[dict] = []
    seen: set[str] = set()

    # Goodreads markup evolves; treat this as a pattern, not a promise.
    # Prefer <article> blocks and fall back to <div> only if the page has none.
    # Selecting both at once also matches outer wrappers that contain every review.
    for node in soup.select("article") or soup.select("div"):
        text = node.get_text(" ", strip=True)
        if len(text) < 120:
            continue
        lowered = text.lower()
        if "likes" in lowered and "comment" in lowered:
            # common review footer markers
            snippet = text[:420] + ("…" if len(text) > 420 else "")
            if snippet not in seen:  # nested matches repeat the same text
                seen.add(snippet)
                reviews.append({"snippet": snippet})
        if len(reviews) >= limit:
            break

    return reviews

This intentionally trades precision for resilience: a text heuristic keeps working when class names change, but it can also pick up blocks that are not reviews.

If you want the cleanest review extraction, use your browser devtools once, identify a stable review container, then tighten the selectors.
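Whichever container you settle on, the snippet itself benefits from a little cleanup. Cutting at a fixed character offset can split a word in half, so the sketch below collapses whitespace and truncates at a word boundary instead. `make_snippet` is a hypothetical helper, and the 420-character default simply mirrors the limit used in `extract_top_reviews`.

```python
import re


def make_snippet(text: str, max_len: int = 420) -> str:
    """Collapse whitespace and truncate at a word boundary, adding an ellipsis."""
    clean = re.sub(r"\s+", " ", text).strip()
    if len(clean) <= max_len:
        return clean
    cut = clean[:max_len]
    # Back up to the last space so a word is not split; keep the hard cut if there is none.
    if " " in cut:
        cut = cut[: cut.rindex(" ")]
    return cut.rstrip(" ,;:.") + "…"


print(make_snippet("A  long   review\nwith uneven   whitespace.", max_len=20))
# → A long review with…
```

Short texts are returned unchanged apart from whitespace normalization, so the helper is safe to apply to every candidate block.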

Step 4: Put it together (clean JSON output)

import json


def scrape_goodreads_book(url: str) -> dict:
    html = fetch(url)
    schema = extract_jsonld_book(html)
    out = extract_ratings(schema) if schema else {}
    out["url"] = out.get("url") or url  # JSON-LD may omit the URL; fall back to the one requested
    out["top_reviews"] = extract_top_reviews(html, limit=5)
    return out


if __name__ == "__main__":
    url = "https://www.goodreads.com/book/show/4671.The_Great_Gatsby"
    data = scrape_goodreads_book(url)
    print(json.dumps(data, indent=2, ensure_ascii=False))

Practical notes (don’t get blocked)

Don’t hammer Goodreads. Add delays and keep concurrency low.
Prefer JSON-LD for high-level metrics: it is published for search engines and tends to change less often than the visible markup.
Treat review scraping as best-effort unless you lock down selectors.
Use caching when iterating on selectors so you don’t repeatedly fetch the same page.
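The caching and delay advice can be combined in one small wrapper. This is a sketch under stated assumptions: `cached_fetch` is a hypothetical helper, the fetch function is passed in as a parameter (in this tutorial it would be `fetch`), and the two-second default pause is an arbitrary starting point, not a documented Goodreads limit.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetcher: Callable[[str], str], cache_dir: Path,
                 delay_s: float = 2.0) -> str:
    """Return cached HTML for url if present; otherwise fetch, store, and pause."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetcher(url)
    path.write_text(html, encoding="utf-8")
    time.sleep(delay_s)  # pause only after a real request, never on a cache hit
    return html


# Demo with a stand-in fetcher so the sketch runs offline.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>stub</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = cached_fetch("https://example.com/book/1", fake_fetch, cache, delay_s=0)
    second = cached_fetch("https://example.com/book/1", fake_fetch, cache, delay_s=0)

print(len(calls))  # → 1
```

Hashing the URL gives a filesystem-safe file name, and because cache hits skip the pause, iterating on selectors stays fast without adding load on the site.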

Wrap-up

You now have a Goodreads scraper that:

reliably extracts rating + rating count + review count from JSON-LD
collects a handful of review snippets as a best-effort enhancement
outputs clean JSON for downstream pipelines

A natural next step is to crawl a list of book URLs and store the results in a database. Keep the crawl polite.
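One way that crawl could look is sketched below, using SQLite from the standard library. Everything here is an assumption for illustration: `crawl_to_sqlite`, the `books` table layout, and the three-second default pause are hypothetical, and the scrape function is injected as a parameter (in this tutorial it would be `scrape_goodreads_book`).

```python
import json
import sqlite3
import time
from typing import Callable, Iterable


def crawl_to_sqlite(urls: Iterable[str], scrape: Callable[[str], dict],
                    conn: sqlite3.Connection, delay_s: float = 3.0) -> int:
    """Scrape URLs one at a time and upsert each result as JSON. Returns rows saved."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books (url TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    saved = 0
    for url in urls:
        try:
            record = scrape(url)
        except Exception as exc:  # one bad page should not stop the crawl
            print(f"skip {url}: {exc}")
            continue
        conn.execute(
            "INSERT OR REPLACE INTO books (url, data) VALUES (?, ?)",
            (url, json.dumps(record, ensure_ascii=False)),
        )
        conn.commit()
        saved += 1
        time.sleep(delay_s)  # sequential requests with a fixed pause in between
    return saved


# Stand-in scraper so the sketch runs offline.
def fake_scrape(url: str) -> dict:
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return {"url": url, "rating_value": 4.0}


conn = sqlite3.connect(":memory:")
saved = crawl_to_sqlite(
    ["https://example.com/a", "https://example.com/bad", "https://example.com/b"],
    fake_scrape, conn, delay_s=0,
)
rows = conn.execute("SELECT url FROM books ORDER BY url").fetchall()
print(saved, rows)
```

Using the URL as the primary key makes reruns idempotent, and committing after every row means an interrupted crawl keeps what it has already collected.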

