Scrape Book Data from Goodreads with Python (List Pages + Pagination)
Goodreads list pages (Listopia) are a common starting point for building book datasets: titles, authors, average rating, rating count, and more.
In this tutorial you’ll build a practical scraper that:
- fetches Goodreads list pages via ProxiesAPI (optional, recommended at scale)
- extracts book rows with stable selectors
- paginates until you hit a limit or the list ends
- exports data to CSV and JSON

Directory-style sites often rate-limit when you scale from 20 URLs to 2,000. ProxiesAPI fits cleanly into your fetch layer so retries and proxy rotation stay a one-function change.
What we’re scraping
Example list page:
https://www.goodreads.com/list/show/1.Best_Books_Ever
Pagination typically looks like:
...?page=2
...?page=3
We’ll scrape each list row for:
- title
- author
- average rating
- rating count (when visible)
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas
Step 1: A resilient fetch layer (with optional ProxiesAPI)
ProxiesAPI works by fetching the target URL through their endpoint:
http://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https://example.com
import os
import random
import time
import urllib.parse

import requests

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY", "")
TIMEOUT = (10, 40)  # (connect, read) timeouts in seconds

session = requests.Session()


def proxiesapi_url(target_url: str) -> str:
    """Wrap target_url in the ProxiesAPI endpoint, URL-encoding both parameters."""
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY in your environment")
    return (
        "http://api.proxiesapi.com/?auth_key="
        + urllib.parse.quote(PROXIESAPI_KEY, safe="")
        + "&url="
        + urllib.parse.quote(target_url, safe="")
    )


def fetch(url: str, *, use_proxiesapi: bool = True, max_retries: int = 4) -> str:
    # Build the URL outside the retry loop so a missing key fails immediately
    # instead of being retried with backoff.
    final_url = proxiesapi_url(url) if use_proxiesapi else url
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(
                final_url,
                timeout=TIMEOUT,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/123.0 Safari/537.36"
                    ),
                    "Accept-Language": "en-US,en;q=0.9",
                },
            )
            r.raise_for_status()
            html = r.text
            # Heuristic: real list pages are large, so a tiny body is
            # usually a consent or bot-check page.
            if not html or len(html) < 20_000:
                raise RuntimeError(f"Suspiciously small HTML ({len(html)} characters)")
            return html
        except Exception as e:
            last_err = e
            if attempt < max_retries:
                # Exponential backoff (1s, 2s, 4s, capped at 10s) plus jitter
                time.sleep(min(10, 2 ** (attempt - 1)) + random.random())
    raise RuntimeError(f"Fetch failed after {max_retries} attempts: {last_err}")
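The target URL has to be percent-encoded before it goes into the url parameter; otherwise its own query string (for example ?page=2) would be read as parameters of the proxy endpoint. A minimal offline sketch of that round trip follows. The endpoint and parameter names come from the example URL above; the helper name wrap_url and the key value demo-key are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://api.proxiesapi.com/"  # endpoint as shown in the example URL above


def wrap_url(target_url: str, auth_key: str) -> str:
    """Hypothetical helper: percent-encode both parameters into the query string."""
    return ENDPOINT + "?" + urlencode({"auth_key": auth_key, "url": target_url})


target = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page=2"
wrapped = wrap_url(target, "demo-key")  # "demo-key" is a placeholder, not a real key

# Decoding the wrapped URL should give back the original target unchanged.
decoded = parse_qs(urlsplit(wrapped).query)
print(wrapped)
print(decoded["url"][0])
```

No request is sent here, so it is safe to run while checking how a URL will be encoded.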
Step 2: Identify list row selectors
On most Goodreads Listopia pages, each book row is a “table row-like” block containing:
- a title link (usually an <a class="bookTitle">)
- an author link (usually an <a class="authorName">)
- an average rating snippet (text around “avg rating”)
We’ll parse using BeautifulSoup and keep the selector logic small and testable.
import re

from bs4 import BeautifulSoup

# Note: the `X | None` annotations below require Python 3.10+.
# Goodreads usually puts the number before the label ("4.28 avg rating"),
# so accept the number on either side of "avg rating".
AVG_RE = re.compile(
    r"([0-9]+(?:\.[0-9]+)?)\s*avg rating|avg rating\s*([0-9]+(?:\.[0-9]+)?)",
    re.I,
)
RATINGS_RE = re.compile(r"([0-9,]+)\s*ratings", re.I)


def parse_float(text: str) -> float | None:
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d[\d,]*)", text or "")
    return int(m.group(1).replace(",", "")) if m else None


def parse_list_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    # Most list pages render rows under .tableList
    rows = soup.select(".tableList tr")
    items: list[dict] = []
    for row in rows:
        title_a = row.select_one("a.bookTitle")
        author_a = row.select_one("a.authorName")
        title = title_a.get_text(" ", strip=True) if title_a else None
        author = author_a.get_text(" ", strip=True) if author_a else None
        href = title_a.get("href") if title_a else None
        # Relative links become absolute; anything else is kept as-is
        url = f"https://www.goodreads.com{href}" if href and href.startswith("/") else href
        # Rating metadata has no dedicated selector here, so search the row text
        meta = row.get_text(" ", strip=True)
        avg = None
        ratings = None
        m_avg = AVG_RE.search(meta)
        if m_avg:
            avg = parse_float(m_avg.group(1) or m_avg.group(2))
        m_r = RATINGS_RE.search(meta)
        if m_r:
            ratings = parse_int(m_r.group(1))
        if title:  # skip rows without a title link
            items.append({
                "title": title,
                "author": author,
                "avg_rating": avg,
                "ratings": ratings,
                "url": url,
            })
    return items
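The rating regexes are the part most likely to break silently, so it can help to test them against a plain string before running a crawl. The sketch below uses only the standard library. The sample text is an assumed layout, not captured from a live page, and extract_rating_fields is a hypothetical helper that accepts the number on either side of the "avg rating" label.

```python
import re
from typing import Optional, Tuple

# Assumed layout of the rating text in a list row (\u2014 is the dash separator).
SAMPLE = "4.28 avg rating \u2014 1,234,567 ratings"

AVG_RE = re.compile(
    r"([0-9]+(?:\.[0-9]+)?)\s*avg rating|avg rating\s*([0-9]+(?:\.[0-9]+)?)",
    re.I,
)
RATINGS_RE = re.compile(r"([0-9][0-9,]*)\s*ratings", re.I)


def extract_rating_fields(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Return (average rating, rating count); either is None when not found."""
    m_avg = AVG_RE.search(text)
    avg = float(m_avg.group(1) or m_avg.group(2)) if m_avg else None
    m_count = RATINGS_RE.search(text)
    count = int(m_count.group(1).replace(",", "")) if m_count else None
    return avg, count


print(extract_rating_fields(SAMPLE))
print(extract_rating_fields("no rating metadata in this row"))
```

If a real row prints (None, None), compare its text against the patterns before changing any selectors.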
Step 3: Paginate safely (don’t assume infinite pages)
Goodreads list pages often expose explicit paging controls. A simple and robust strategy:
- request ?page=N
- stop when you get no rows or when the page repeats the previous page
- cap the crawl with max_pages
def paged_list_url(base: str, page: int) -> str:
    # Append with "&" when the base URL already has a query string
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}page={page}"


def scrape_list(
    base_url: str, *, max_pages: int = 5, use_proxiesapi: bool = True
) -> list[dict]:
    all_items: list[dict] = []
    last_first_title = None
    for page in range(1, max_pages + 1):
        url = paged_list_url(base_url, page)
        html = fetch(url, use_proxiesapi=use_proxiesapi)
        items = parse_list_page(html)
        if not items:
            break  # no rows: the list has ended
        first_title = items[0].get("title")
        if first_title and first_title == last_first_title:
            break  # same first row as the previous page: the page repeated
        last_first_title = first_title
        all_items.extend(items)
    return all_items
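The stop conditions can be exercised without touching the network by passing in a stand-in page loader. In the sketch below, crawl_pages and fake_loader are hypothetical names and the titles are made up. It compares the full tuple of titles rather than only the first one, which is a slightly stricter variant of the repeat check, not the exact logic used above.

```python
from typing import Callable, Dict, List


def crawl_pages(load_page: Callable[[int], List[Dict]], max_pages: int = 5) -> List[Dict]:
    """Collect items page by page until a page is empty, repeats, or the cap is hit."""
    collected: List[Dict] = []
    previous_titles = None
    for page in range(1, max_pages + 1):
        items = load_page(page)
        if not items:
            break  # the list has ended
        titles = tuple(item.get("title") for item in items)
        if titles == previous_titles:
            break  # the same page was served again
        previous_titles = titles
        collected.extend(items)
    return collected


# Stand-in data: two distinct pages; any later page number repeats page 2.
FAKE_PAGES = {
    1: [{"title": "Book A"}, {"title": "Book B"}],
    2: [{"title": "Book C"}],
}


def fake_loader(page: int) -> List[Dict]:
    return FAKE_PAGES.get(page, FAKE_PAGES[2])


print([item["title"] for item in crawl_pages(fake_loader, max_pages=10)])
```

Because the loader is injected, the same loop could be pointed at a real fetch-and-parse function later without changing the stop logic.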
Step 4: Export to CSV + JSON
import json

import pandas as pd

if __name__ == "__main__":
    base = "https://www.goodreads.com/list/show/1.Best_Books_Ever"
    items = scrape_list(base, max_pages=3)
    print("books:", len(items))

    # JSON export
    with open("goodreads-list.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)

    # CSV export
    df = pd.DataFrame(items)
    df.to_csv("goodreads-list.csv", index=False)
    print(df.head(5))
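If pandas is more than the job needs, the standard library's csv.DictWriter can write the same rows. The sketch below writes to an in-memory buffer so it runs without creating files. The sample row is made up, rows_to_csv is a hypothetical helper, and the column list mirrors the keys produced by parse_list_page().

```python
import csv
import io
from typing import Dict, List

FIELDS = ["title", "author", "avg_rating", "ratings", "url"]


def rows_to_csv(rows: List[Dict]) -> str:
    """Serialize rows to CSV text; missing values (None) become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up row for illustration only.
sample_rows = [
    {
        "title": "Example Book",
        "author": "Jane Doe",
        "avg_rating": 4.1,
        "ratings": None,
        "url": "https://example.com/book",
    }
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

To write a file instead, open it with newline="" and encoding="utf-8" and pass the file object to DictWriter in place of the buffer.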
Common issues (and how to handle them)
- Consent / bot pages: HTML is too small or contains “verify you are human”
  - backoff + retries
  - lower request rate
  - add a proxy-backed fetch layer (ProxiesAPI)
- Selector drift: a.bookTitle or .tableList tr changes
  - keep parse_list_page() small and adjust it when it breaks
- Pagination surprises: some lists reorder, or show localized variants
  - cap max_pages
  - detect repetition with the “first title repeats” check
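The first item in that list can be turned into a small check that runs before parsing. This is a sketch under assumptions: looks_blocked is a hypothetical helper, the marker phrases are illustrative rather than an exhaustive list of what Goodreads may return, and the size threshold reuses the 20,000-character heuristic from fetch().

```python
BLOCK_MARKERS = ("verify you are human", "captcha")  # illustrative phrases only
MIN_HTML_CHARS = 20_000  # same heuristic threshold as the fetch layer


def looks_blocked(html: str, min_chars: int = MIN_HTML_CHARS) -> bool:
    """Return True when a response is probably a consent or bot-check page."""
    if not html or len(html) < min_chars:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


print(looks_blocked("<html>short page</html>"))
print(looks_blocked("x" * 30_000))
```

A caller could treat a True result like any other failed attempt: back off, retry, and lower the request rate if it keeps happening.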
Where ProxiesAPI fits (no hype)
Goodreads scraping success is mostly a network problem as you scale: rate limits, throttling, and inconsistent responses.
ProxiesAPI helps by giving you:
- a consistent fetch URL that you can toggle on/off
- fewer sudden failures when you paginate
- a clean separation between fetch and parse
That separation is what makes your scraper maintainable.