Scrape Book Reviews and Ratings from Goodreads with Python (JSON-LD + Top Reviews)
Goodreads book pages contain a surprisingly clean source of truth for ratings and review counts: JSON-LD (structured data embedded in HTML).
In this tutorial you’ll build a scraper that:
- fetches a Goodreads book page
- extracts ratingValue, ratingCount, and reviewCount from the JSON-LD block
- collects a few top review snippets (where present)
- outputs clean JSON

Goodreads pages are heavy and can be inconsistent under load. Keep your scraper reliable with timeouts, retries, and optional ProxiesAPI routing so your extraction stays simple.
What we’re scraping (Goodreads structure)
A typical book page looks like:
https://www.goodreads.com/book/show/4671.The_Great_Gatsby
In the HTML you’ll usually find:
- a <script type="application/ld+json">...</script> block describing the book
- visible HTML sections for reviews (may paginate / load dynamically)
The JSON-LD is the most stable place to start because it’s designed for machines (search engines), not humans.
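To make the shape concrete, here is a minimal sketch of what a trimmed Book payload might look like once parsed. Every value below is a made-up placeholder, not data from a real Goodreads page, and the exact set of keys on a live page is an assumption. The only part this tutorial relies on is aggregateRating with ratingValue, ratingCount, and reviewCount.

```python
import json

# Hypothetical, trimmed JSON-LD payload. All values are placeholders,
# not numbers taken from a real Goodreads page.
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "Example Book",
  "author": [{"@type": "Person", "name": "Example Author"}],
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": 4.0,
    "ratingCount": 1000,
    "reviewCount": 100
  }
}
"""

book = json.loads(SAMPLE_JSONLD)
rating = book["aggregateRating"]

# The three "big numbers" live under aggregateRating.
print(book["@type"], rating["ratingValue"], rating["ratingCount"], rating["reviewCount"])
```

Because the block is plain JSON, json.loads is all you need once you have located the script tag.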
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
Step 1: Fetch HTML (ProxiesAPI-ready)
Even before proxies, treat HTTP like production:
- set a real User-Agent
- use connect/read timeouts
- add retries with backoff for transient failures
from __future__ import annotations

import os
import random
import time

import requests

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
UA = "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0; +https://proxiesapi.com)"

# Optional: route through an HTTP proxy (including ProxiesAPI upstream proxy mode, if you use it).
PROXY = os.environ.get("HTTP_PROXY") or os.environ.get("HTTPS_PROXY")

session = requests.Session()
session.headers.update({"User-Agent": UA})


def fetch(url: str, attempts: int = 3) -> str:
    """Fetch a URL, retrying failed requests with exponential backoff and jitter."""
    proxies = {"http": PROXY, "https": PROXY} if PROXY else None
    last_exc: Exception | None = None
    for i in range(attempts):
        try:
            r = session.get(url, timeout=TIMEOUT, proxies=proxies)
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_exc = e
            if i < attempts - 1:  # no point sleeping after the final attempt
                time.sleep((2**i) + random.random())
    raise last_exc or ValueError("attempts must be at least 1")
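The retry loop above waits roughly 2**i seconds plus up to a second of random jitter after failed attempt i. A small sketch makes the resulting schedule visible before you pick a value for attempts. The names backoff_schedule, base, and cap are hypothetical and not part of the original code, and the cap is an added assumption to keep long retry chains bounded.

```python
from __future__ import annotations

import random


def backoff_schedule(attempts: int, base: float = 2.0, cap: float = 30.0,
                     rng: random.Random | None = None) -> list[float]:
    """Return the sleep (in seconds) planned between consecutive attempts.

    Hypothetical helper: delay i is min(cap, base**i) plus up to one second
    of random jitter. There are attempts - 1 delays because nothing needs to
    be slept after the final attempt.
    """
    rng = rng or random.Random()
    return [min(cap, base ** i) + rng.random() for i in range(max(attempts - 1, 0))]


if __name__ == "__main__":
    for i, delay in enumerate(backoff_schedule(5), start=1):
        print(f"after attempt {i}: sleep ~{delay:.2f}s")
```

The jitter matters when several workers fail at the same moment: without it they would all retry in lockstep.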
Step 2: Extract ratings from JSON-LD (the stable path)
The JSON-LD block is usually a Book schema with aggregateRating.
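import json

from bs4 import BeautifulSoup


def extract_jsonld_book(html: str) -> dict | None:
    """Return the first JSON-LD block on the page as a dict, or None."""
    soup = BeautifulSoup(html, "lxml")
    script = soup.select_one('script[type="application/ld+json"]')
    if not script or not script.string:
        return None
    try:
        data = json.loads(script.string)
    except json.JSONDecodeError:
        return None  # malformed JSON-LD: treat it as missing
    return data if isinstance(data, dict) else None


def extract_ratings(book_schema: dict) -> dict:
    """Pull the headline fields out of a Book schema dict."""
    agg = book_schema.get("aggregateRating") or {}
    # "author" may be a list of Person objects or a single object.
    author = book_schema.get("author") or {}
    if isinstance(author, list):
        author = author[0] if author else {}
    return {
        "title": book_schema.get("name"),
        "author": author.get("name") if isinstance(author, dict) else author,
        "rating_value": agg.get("ratingValue"),
        "rating_count": agg.get("ratingCount"),
        "review_count": agg.get("reviewCount"),
        "image": book_schema.get("image"),
        "url": book_schema.get("url"),
    }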
This gets you the “big numbers” reliably without fragile selectors.
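Two variations are worth guarding against. JSON-LD in general can wrap entities in a top-level list or an @graph array, and publishers sometimes serialize numbers as strings. Whether Goodreads does either on a given page is an assumption here, and extract_jsonld_book above simply returns None when the payload is not a dict. The sketch below shows one way to handle both cases; find_book_schema and to_number are hypothetical helper names, and the sample values are placeholders.

```python
from __future__ import annotations

import json


def find_book_schema(payload: object) -> dict | None:
    """Return the first dict whose @type includes "Book", searching lists and @graph."""
    if isinstance(payload, list):
        for item in payload:
            found = find_book_schema(item)
            if found is not None:
                return found
        return None
    if isinstance(payload, dict):
        types = payload.get("@type")
        types = types if isinstance(types, list) else [types]
        if "Book" in types:
            return payload
        # Fall back to a nested @graph array, if there is one.
        return find_book_schema(payload.get("@graph"))
    return None


def to_number(value: object) -> float | None:
    """Coerce 4.2, "4.2", or "1,234" to a float; return None if that is not possible."""
    if isinstance(value, bool) or value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    try:
        return float(str(value).replace(",", "").strip())
    except ValueError:
        return None


if __name__ == "__main__":
    # Placeholder payload: a list with a non-Book entry first.
    payload = json.loads(
        '[{"@type": "WebSite"}, {"@type": "Book", "aggregateRating": {"ratingValue": "4.2"}}]'
    )
    book = find_book_schema(payload)
    print(to_number(book["aggregateRating"]["ratingValue"]))
```

If you adopt this, pass the result of json.loads to find_book_schema instead of rejecting non-dict payloads outright.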
Step 3: Extract a few review snippets (best-effort)
Reviews are the part that changes most often.
The robust strategy is:
- select a review “container” pattern (often an <article> or a div with stable attributes)
- extract only what you can confidently locate
- keep it best-effort and don’t break your whole pipeline if reviews are missing
def extract_top_reviews(html: str, limit: int = 5) -> list[dict]:
    """Best-effort: collect up to `limit` text snippets that look like reviews."""
    soup = BeautifulSoup(html, "lxml")
    reviews: list[dict] = []
    seen: set[str] = set()

    # Goodreads markup evolves; treat this as a pattern, not a promise.
    # Start broad: look for blocks that look like review text.
    for node in soup.select("article, div"):
        text = node.get_text(" ", strip=True)
        if not text or len(text) < 120:
            continue
        lowered = text.lower()
        # Common review footer markers.
        if "likes" in lowered and "comment" in lowered:
            snippet = text[:420] + ("…" if len(text) > 420 else "")
            if snippet in seen:  # nested containers often repeat the same text
                continue
            seen.add(snippet)
            reviews.append({"snippet": snippet})
            if len(reviews) >= limit:
                break
    return reviews
This intentionally prefers “good enough + stable” over a brittle selector soup.
If you want the cleanest review extraction, use your browser devtools once, identify a stable review container, then tighten the selectors.
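Whichever selectors you settle on, slicing review text at a fixed character count, as text[:420] does, can cut a word in half and keeps whatever stray whitespace the markup produced. A minimal sketch of a tidier trimming step follows; make_snippet is a hypothetical helper, not part of the original code, and the 420-character default simply mirrors the limit used above.

```python
import re


def make_snippet(text: str, max_len: int = 420) -> str:
    """Collapse whitespace and cut at the last word boundary within max_len."""
    clean = re.sub(r"\s+", " ", text).strip()
    if len(clean) <= max_len:
        return clean
    # Drop the (possibly partial) last word, then trailing punctuation.
    cut = clean[:max_len].rsplit(" ", 1)[0]
    return cut.rstrip(" ,;:.") + "…"
```

You could call make_snippet(text) in place of the manual slice when building each review dict.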
Step 4: Put it together (clean JSON output)
import json


def scrape_goodreads_book(url: str) -> dict:
    html = fetch(url)
    schema = extract_jsonld_book(html) or {}
    out = extract_ratings(schema) if schema else {"url": url}
    out["top_reviews"] = extract_top_reviews(html, limit=5)
    return out


if __name__ == "__main__":
    url = "https://www.goodreads.com/book/show/4671.The_Great_Gatsby"
    data = scrape_goodreads_book(url)
    print(json.dumps(data, indent=2, ensure_ascii=False))
Practical notes (don’t get blocked)
- Don’t hammer Goodreads. Add delays and keep concurrency low.
- Prefer JSON-LD for high-level metrics: it’s stable.
- Treat review scraping as best-effort unless you lock down selectors.
- Use caching when iterating on selectors so you don’t repeatedly fetch the same page.
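The caching and delay advice can be sketched as a thin wrapper around whatever fetch function you use. This is an illustration under assumptions: cached_fetch, the .cache directory, and the 2-second default delay are hypothetical choices, not recommendations from Goodreads or ProxiesAPI.

```python
from __future__ import annotations

import hashlib
import time
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch_fn: Callable[[str], str],
                 cache_dir: str = ".cache", delay_s: float = 2.0) -> str:
    """Return cached HTML for url, calling fetch_fn only on a cache miss."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it is safe to use as a file name.
    path = folder / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch_fn(url)
    path.write_text(html, encoding="utf-8")
    time.sleep(delay_s)  # pause only after a real network request
    return html
```

While you iterate on selectors, cached_fetch(url, fetch) hits the network once per URL and reads from disk afterwards. Delete the cache directory when you want fresh pages.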
Wrap-up
You now have a Goodreads scraper that:
- reliably extracts rating + rating count + review count from JSON-LD
- collects a handful of review snippets as a best-effort enhancement
- outputs clean JSON for downstream pipelines
A natural next step is to crawl a list of book URLs and store the results in a database. Just keep the crawl polite.
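As a starting point for that, here is a minimal sketch that scrapes URLs one at a time and stores each result as JSON in SQLite. The function name crawl_to_sqlite, the books table layout, and the 3-second default delay are all assumptions made for illustration; the scrape function is passed in so the sketch does not depend on any particular scraper.

```python
from __future__ import annotations

import json
import sqlite3
import time
from typing import Callable, Iterable


def crawl_to_sqlite(urls: Iterable[str], scrape_fn: Callable[[str], dict],
                    conn: sqlite3.Connection, delay_s: float = 3.0) -> int:
    """Scrape each URL in turn and upsert the result as JSON; return rows saved."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books (url TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    saved = 0
    for url in urls:
        try:
            record = scrape_fn(url)
        except Exception as exc:  # keep crawling if one page fails
            print(f"skipped {url}: {exc}")
        else:
            conn.execute(
                "INSERT OR REPLACE INTO books (url, payload) VALUES (?, ?)",
                (url, json.dumps(record, ensure_ascii=False)),
            )
            conn.commit()
            saved += 1
        time.sleep(delay_s)  # sequential, one request at a time
    return saved
```

With the functions from this tutorial, a call could look like crawl_to_sqlite(urls, scrape_goodreads_book, sqlite3.connect("books.db")). Keying the table on the URL means re-running the crawl updates rows instead of duplicating them.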