Scrape Book Reviews and Ratings from Goodreads with Python (JSON-LD + Top Reviews)

Goodreads book pages contain a surprisingly clean source of truth for ratings and review counts: JSON-LD (structured data embedded in HTML).

In this tutorial you’ll build a scraper that:

  • fetches a Goodreads book page
  • extracts ratingValue, ratingCount, and reviewCount from the JSON-LD block
  • collects a few top review snippets (where present)
  • outputs clean JSON

Goodreads book page (we’ll scrape ratings + review counts + review snippets)

When pages get slow or flaky, add ProxiesAPI

Goodreads pages are heavy and can be inconsistent under load. Keep your scraper reliable with timeouts, retries, and optional ProxiesAPI routing so your extraction stays simple.


What we’re scraping (Goodreads structure)

A typical book page looks like:

  • https://www.goodreads.com/book/show/4671.The_Great_Gatsby

In the HTML you’ll usually find:

  • a <script type="application/ld+json">...</script> block describing the book
  • visible HTML sections for reviews (may paginate / load dynamically)

The JSON-LD is the most stable place to start because it’s designed for machines (search engines), not humans.
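
For orientation, here is a sketch of the rough shape such a block tends to have. The field names follow schema.org's Book and AggregateRating types; the values are placeholders invented for illustration, not real Goodreads data, and the real payload may carry more or differently nested fields.

```python
import json

# Placeholder payload: field names follow schema.org (Book, AggregateRating).
# Every value below is invented for illustration, not taken from Goodreads.
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "Example Title",
  "author": [{"@type": "Person", "name": "Example Author"}],
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": 4.1,
    "ratingCount": 1200,
    "reviewCount": 85
  }
}
"""

book = json.loads(SAMPLE_JSONLD)
rating = book["aggregateRating"]
print(book["@type"], rating["ratingValue"], rating["ratingCount"], rating["reviewCount"])
```

Once the block is parsed, the three numbers this tutorial cares about are plain dictionary lookups.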


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML (ProxiesAPI-ready)

Even before proxies, treat HTTP like production:

  • set a real User-Agent
  • use connect/read timeouts
  • add retries with backoff for transient failures

from __future__ import annotations

import os
import random
import time
import requests

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
UA = "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)"

# Optional: route through an HTTP proxy (including ProxiesAPI upstream proxy mode, if you use it).
PROXY = os.environ.get("HTTP_PROXY") or os.environ.get("HTTPS_PROXY")

session = requests.Session()
session.headers.update({"User-Agent": UA})


def fetch(url: str, attempts: int = 3) -> str:
    proxies = {"http": PROXY, "https": PROXY} if PROXY else None
    last_exc: Exception | None = None
    for i in range(attempts):
        try:
            r = session.get(url, timeout=TIMEOUT, proxies=proxies)
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_exc = e
            status = getattr(e.response, "status_code", None)
            # Most client errors (404, 403, ...) will not succeed on retry.
            if status is not None and 400 <= status < 500 and status not in (408, 429):
                break
            if i < attempts - 1:
                # Exponential backoff with jitter; no sleep after the final attempt.
                time.sleep((2**i) + random.random())
    raise last_exc if last_exc else ValueError("attempts must be >= 1")

Step 2: Extract ratings from JSON-LD (the stable path)

The JSON-LD block is usually a Book schema with aggregateRating.

import json
from bs4 import BeautifulSoup


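def extract_jsonld_book(html: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")
    # A page can carry several JSON-LD blocks; return the first Book object.
    for script in soup.select('script[type="application/ld+json"]'):
        if not script.string:
            continue
        try:
            data = json.loads(script.string)
        except json.JSONDecodeError:
            continue  # malformed block: skip it instead of crashing
        # JSON-LD may be a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            if "Book" in (types if isinstance(types, list) else [types]):
                return item
    return None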


def extract_ratings(book_schema: dict) -> dict:
    agg = book_schema.get("aggregateRating") or {}
    # "author" may be a single object, a list of objects, or a plain string.
    author = book_schema.get("author") or {}
    if isinstance(author, list):
        author = author[0] if author else {}
    return {
        "title": book_schema.get("name"),
        "author": author.get("name") if isinstance(author, dict) else author,
        "rating_value": agg.get("ratingValue"),
        "rating_count": agg.get("ratingCount"),
        "review_count": agg.get("reviewCount"),
        "image": book_schema.get("image"),
        "url": book_schema.get("url"),
    }

This gets you the “big numbers” reliably without fragile selectors.
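
One caveat: JSON-LD publishers are free to emit numbers either as JSON numbers or as strings, so downstream code should not assume a type. The helper below is a minimal sketch of one way to normalize them; `to_number` is a hypothetical name introduced here, not part of any library, and treating commas as thousands separators is an assumption.

```python
def to_number(value):
    """Return an int or float for numeric-looking input, otherwise None."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int; never treat it as a count
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        # Assumption: commas are thousands separators, as in "1,234".
        cleaned = value.replace(",", "").strip()
        try:
            number = float(cleaned)
        except ValueError:
            return None
        # Keep whole numbers written without a decimal point as ints.
        if number.is_integer() and "." not in cleaned:
            return int(number)
        return number
    return None


print(to_number("4.12"), to_number("1,234"), to_number(None))
```

Applying this to rating_value, rating_count, and review_count before writing the output keeps the JSON consistent across pages.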


Step 3: Extract a few review snippets (best-effort)

Reviews are the part that changes most often.

The robust strategy is:

  • select a review “container” pattern (often an <article> or a div with stable attributes)
  • extract only what you can confidently locate
  • keep it best-effort and don’t break your whole pipeline if reviews are missing

def extract_top_reviews(html: str, limit: int = 5) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    reviews: list[dict] = []
    seen: set[str] = set()

    # Goodreads markup evolves; treat this as a pattern, not a promise.
    # Prefer <article> blocks and fall back to <div> only if there are none.
    for node in soup.select("article") or soup.select("div"):
        # Skip wrappers that contain another candidate; otherwise the outermost
        # container (whose text is the whole page) would match first.
        if node.find(node.name) is not None:
            continue
        text = node.get_text(" ", strip=True)
        if len(text) < 120 or text in seen:
            continue
        lowered = text.lower()
        if "likes" in lowered and "comment" in lowered:
            # common review footer markers
            seen.add(text)
            reviews.append({"snippet": text[:420] + ("…" if len(text) > 420 else "")})
        if len(reviews) >= limit:
            break

    return reviews

This intentionally prefers “good enough + stable” over a brittle selector soup.

If you want the cleanest review extraction, use your browser devtools once, identify a stable review container, then tighten the selectors.
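
As a sketch of what a tightened version could look like: the two selectors below are hypothetical placeholders, so substitute whatever you actually find in devtools. The parser here is the stdlib-backed "html.parser" only to keep the example dependency-light; "lxml" works the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors: replace both with what you see in devtools.
REVIEW_SELECTOR = "article.ReviewCard"
TEXT_SELECTOR = "section.ReviewText"


def extract_reviews_tight(html, limit=5, max_chars=420):
    """Collect up to `limit` review snippets from known review containers."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for card in soup.select(REVIEW_SELECTOR):
        body = card.select_one(TEXT_SELECTOR)
        if body is None:
            continue  # card without a text section: skip it, do not fail
        text = body.get_text(" ", strip=True)
        if not text:
            continue
        snippet = text[:max_chars] + ("…" if len(text) > max_chars else "")
        reviews.append({"snippet": snippet})
        if len(reviews) >= limit:
            break
    return reviews
```

Because only the text section is read, footers such as like and comment counts stay out of the snippet.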


Step 4: Put it together (clean JSON output)

import json


def scrape_goodreads_book(url: str) -> dict:
    html = fetch(url)
    schema = extract_jsonld_book(html)
    out = extract_ratings(schema) if schema else {}
    out["url"] = out.get("url") or url  # fall back to the requested URL
    out["top_reviews"] = extract_top_reviews(html, limit=5)
    return out


if __name__ == "__main__":
    url = "https://www.goodreads.com/book/show/4671.The_Great_Gatsby"
    data = scrape_goodreads_book(url)
    print(json.dumps(data, indent=2, ensure_ascii=False))

Practical notes (don’t get blocked)

  • Don’t hammer Goodreads. Add delays and keep concurrency low.
  • Prefer JSON-LD for high-level metrics: it tends to change far less often than the visible markup.
  • Treat review scraping as best-effort unless you lock down selectors.
  • Use caching when iterating on selectors so you don’t repeatedly fetch the same page.
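
The caching and delay advice can be sketched as a small wrapper around any fetch function. `cached_fetch`, the `.cache_html` directory, and the 2-second default delay are all assumptions made for this example; tune them to your own setup.

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path(".cache_html")  # hypothetical cache location


def cached_fetch(url, fetcher, cache_dir=CACHE_DIR, delay_s=2.0):
    """Return HTML for `url`, calling `fetcher(url)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it is safe to use as a file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetcher(url)
    path.write_text(html, encoding="utf-8")
    time.sleep(delay_s)  # pause only after a real network request
    return html
```

With the fetch function from Step 1 this would be called as cached_fetch(url, fetch). Delete the cache directory when you need fresh numbers.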

Wrap-up

You now have a Goodreads scraper that:

  • reliably extracts rating + rating count + review count from JSON-LD
  • collects a handful of review snippets as a best-effort enhancement
  • outputs clean JSON for downstream pipelines

A natural next step is to crawl a list of book URLs and store the results in a database. Just keep the crawl polite.
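
A minimal sketch of that next step, using SQLite from the standard library. The function name, the table layout, and the 3-second delay are assumptions for illustration; `scrape` stands for any callable that takes a URL and returns a dict, such as scrape_goodreads_book.

```python
import json
import sqlite3
import time


def crawl_to_sqlite(urls, scrape, db_path="books.db", delay_s=3.0):
    """Scrape each URL in turn and upsert the result as JSON. Returns rows saved."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS books ("
            "url TEXT PRIMARY KEY, payload TEXT NOT NULL, scraped_at REAL NOT NULL)"
        )
        saved = 0
        for i, url in enumerate(urls):
            if i:
                time.sleep(delay_s)  # sequential and slow on purpose
            try:
                record = scrape(url)
            except Exception as exc:  # one bad page should not stop the crawl
                print(f"skipped {url}: {exc}")
                continue
            conn.execute(
                "INSERT OR REPLACE INTO books VALUES (?, ?, ?)",
                (url, json.dumps(record, ensure_ascii=False), time.time()),
            )
            conn.commit()  # commit per row so partial crawls are kept
            saved += 1
        return saved
    finally:
        conn.close()
```

Storing the whole record as a JSON string keeps the schema stable while the scraper's output fields are still changing.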
