Scrape Book Data from Goodreads with Python (List Pages + Pagination)

Goodreads list pages (Listopia) are a common starting point for building book datasets: titles, authors, average rating, rating count, and more.

In this tutorial you’ll build a practical scraper that:

  • fetches Goodreads list pages via ProxiesAPI (optional, recommended at scale)
  • extracts book rows with stable selectors
  • paginates until you hit a limit or the list ends
  • exports data to CSV and JSON

Goodreads Listopia page (we’ll scrape book rows + paginate)

When pagination starts failing, ProxiesAPI keeps the fetch layer stable

Directory-style sites often rate-limit when you scale from 20 URLs to 2,000. ProxiesAPI fits cleanly into your fetch layer so retries and proxy rotation stay a one-function change.


What we’re scraping

Example list page:

  • https://www.goodreads.com/list/show/1.Best_Books_Ever

Pagination typically looks like:

  • ...?page=2
  • ...?page=3

We’ll scrape each list row for:

  • title
  • author
  • average rating
  • rating count (when visible)

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas

The code in this tutorial uses X | None type hints, so it needs Python 3.10 or newer.

Step 1: A resilient fetch layer (with optional ProxiesAPI)

ProxiesAPI works by fetching the target URL through their endpoint:

http://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https://example.com

import os
import time
import random
import urllib.parse
import requests

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY", "")
TIMEOUT = (10, 40)  # connect, read

session = requests.Session()


def proxiesapi_url(target_url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY in your environment")

    return (
        "http://api.proxiesapi.com/?auth_key="
        + urllib.parse.quote(PROXIESAPI_KEY, safe="")
        + "&url="
        + urllib.parse.quote(target_url, safe="")
    )


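def fetch(url: str, *, use_proxiesapi: bool = True, max_retries: int = 4) -> str:
    # Build the request URL once, outside the retry loop, so a missing
    # PROXIESAPI_KEY fails immediately instead of being retried with backoff.
    final_url = proxiesapi_url(url) if use_proxiesapi else url
    last_err = None

    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(
                final_url,
                timeout=TIMEOUT,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/123.0 Safari/537.36"
                    ),
                    "Accept-Language": "en-US,en;q=0.9",
                },
            )
            r.raise_for_status()
            html = r.text
            # Heuristic: consent and bot pages tend to be much smaller than a real list page.
            if not html or len(html) < 20_000:
                raise RuntimeError(f"Suspiciously small HTML ({len(html)} characters)")
            return html

        except (requests.RequestException, RuntimeError) as e:
            last_err = e
            # Exponential backoff (capped at 10s) plus jitter; skip the wait after the final attempt.
            if attempt < max_retries:
                time.sleep(min(10, 2 ** (attempt - 1)) + random.random())

    raise RuntimeError(f"Fetch failed after {max_retries} attempts: {last_err}") from last_err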
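
The retry delay in fetch() is exponential backoff with a cap, plus random jitter. As a rough sketch, here is that schedule pulled out into a standalone helper so you can see the waits it produces. The name backoff_delay and its cap and jitter parameters are illustrative and not part of the tutorial's code.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, cap: float = 10.0, jitter: Optional[float] = None) -> float:
    """Seconds to wait after a failed attempt: 2**(attempt-1), capped, plus jitter."""
    if jitter is None:
        # Up to one second of random jitter so parallel workers don't retry in lockstep.
        jitter = random.random()
    return min(cap, 2 ** (attempt - 1)) + jitter


if __name__ == "__main__":
    for attempt in range(1, 7):
        base = backoff_delay(attempt, jitter=0.0)
        print(f"attempt {attempt}: {base:.0f}s base wait, plus 0-1s of jitter")
```

With the cap at 10 seconds, the base waits grow 1, 2, 4, 8 and then stay at 10, so raising max_retries past five adds a fixed cost per extra attempt rather than a doubling one.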

Step 2: Identify list row selectors

On most Goodreads Listopia pages, each book row is a “table row-like” block containing:

  • a title link (usually an <a class="bookTitle">)
  • an author link (usually an <a class="authorName">)
  • an average rating snippet (text around “avg rating”)

We’ll parse using BeautifulSoup and keep the selector logic small and testable.

import re
from bs4 import BeautifulSoup


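# List rows typically print the number before the label (e.g. "4.20 avg rating"),
# so capture the digits that precede "avg rating".
AVG_RE = re.compile(r"([0-9]+(?:\.[0-9]+)?)\s*avg rating", re.I)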
RATINGS_RE = re.compile(r"([0-9,]+)\s*ratings", re.I)


def parse_float(text: str) -> float | None:
    try:
        return float(text)
    except Exception:
        return None


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d[\d,]*)", text or "")
    return int(m.group(1).replace(",", "")) if m else None


def parse_list_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Most list pages render rows under .tableList
    rows = soup.select(".tableList tr")
    items: list[dict] = []

    for row in rows:
        title_a = row.select_one("a.bookTitle")
        author_a = row.select_one("a.authorName")

        title = title_a.get_text(" ", strip=True) if title_a else None
        author = author_a.get_text(" ", strip=True) if author_a else None

        href = title_a.get("href") if title_a else None
        url = f"https://www.goodreads.com{href}" if href and href.startswith("/") else href

        meta = row.get_text(" ", strip=True)
        avg = None
        ratings = None

        m_avg = AVG_RE.search(meta)
        if m_avg:
            avg = parse_float(m_avg.group(1))

        m_r = RATINGS_RE.search(meta)
        if m_r:
            ratings = parse_int(m_r.group(1))

        if title:
            items.append({
                "title": title,
                "author": author,
                "avg_rating": avg,
                "ratings": ratings,
                "url": url,
            })

    return items
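
Before hitting the network, it helps to check the rating extraction against a known string. The sketch below assumes the row text puts the average before the words "avg rating" and the count before "ratings"; that layout is an assumption to confirm against a saved page. The pattern names, the helper parse_rating_text, and the sample values are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Illustrative patterns; they assume "<number> avg rating ... <count> ratings".
AVG_BEFORE_LABEL = re.compile(r"([0-9]+(?:\.[0-9]+)?)\s*avg rating", re.I)
RATINGS_COUNT = re.compile(r"([0-9][0-9,]*)\s*ratings\b", re.I)


def parse_rating_text(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Return (average rating, rating count); each part is None when not found."""
    m_avg = AVG_BEFORE_LABEL.search(text)
    m_cnt = RATINGS_COUNT.search(text)
    avg = float(m_avg.group(1)) if m_avg else None
    count = int(m_cnt.group(1).replace(",", "")) if m_cnt else None
    return avg, count


if __name__ == "__main__":
    # Made-up sample row text; \u2014 is the dash separating the two parts.
    sample = "Example Book by Example Author 4.20 avg rating \u2014 1,234 ratings"
    print(parse_rating_text(sample))
    print(parse_rating_text("row with no rating info"))
```

If the first value comes back as None on real pages, the label order has probably changed and the pattern needs adjusting.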

Step 3: Paginate safely (don’t assume infinite pages)

Goodreads list pages often expose explicit paging controls. A simple and robust strategy:

  • request ?page=N
  • stop when you get no rows or when the page repeats the previous page
  • cap the crawl with max_pages

def paged_list_url(base: str, page: int) -> str:
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}page={page}"


def scrape_list(base_url: str, *, max_pages: int = 5) -> list[dict]:
    all_items: list[dict] = []
    last_first_title = None

    for page in range(1, max_pages + 1):
        url = paged_list_url(base_url, page)
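        # Route through ProxiesAPI only when a key is configured; otherwise fetch directly.
        html = fetch(url, use_proxiesapi=bool(PROXIESAPI_KEY))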
        items = parse_list_page(html)

        if not items:
            break

        first_title = items[0].get("title")
        if first_title and first_title == last_first_title:
            break

        last_first_title = first_title
        all_items.extend(items)

    return all_items
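
To watch the stop conditions fire without touching the network, here is a sketch that runs the same loop against an in-memory stand-in for the site. The names crawl_pages, fake_fetch_rows, and FAKE_PAGES are hypothetical, and the fake fetcher encodes one assumption: that requesting a page past the end serves the last page again.

```python
from typing import Callable, Dict, List

# Hypothetical three-page list: page number -> already-parsed rows.
FAKE_PAGES: Dict[int, List[dict]] = {
    1: [{"title": "A"}, {"title": "B"}],
    2: [{"title": "C"}, {"title": "D"}],
    3: [{"title": "E"}],
}


def fake_fetch_rows(page: int) -> List[dict]:
    """Assumed site behavior: pages past the end repeat the last page."""
    return FAKE_PAGES[min(page, max(FAKE_PAGES))]


def crawl_pages(get_rows: Callable[[int], List[dict]], max_pages: int) -> List[dict]:
    collected: List[dict] = []
    previous_first = None
    for page in range(1, max_pages + 1):
        rows = get_rows(page)
        if not rows:
            break  # empty page: the list has ended
        first = rows[0].get("title")
        if first and first == previous_first:
            break  # same first title as the previous page: a repeat
        previous_first = first
        collected.extend(rows)
    return collected


if __name__ == "__main__":
    print([r["title"] for r in crawl_pages(fake_fetch_rows, max_pages=10)])
    print([r["title"] for r in crawl_pages(fake_fetch_rows, max_pages=2)])
```

Passing the row getter in as a function also means the loop can be tested with canned data, independent of fetch() and parse_list_page().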

Step 4: Export to CSV + JSON

import json
import pandas as pd


if __name__ == "__main__":
    base = "https://www.goodreads.com/list/show/1.Best_Books_Ever"

    items = scrape_list(base, max_pages=3)
    print("books:", len(items))

    # JSON export
    with open("goodreads-list.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)

    # CSV export
    df = pd.DataFrame(items)
    df.to_csv("goodreads-list.csv", index=False)
    print(df.head(5))
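
If pandas is more than you need for the export step, the standard library's csv module is enough. A minimal sketch, assuming each item is a dict with the five keys produced by parse_list_page(); the helper name items_to_csv and the sample row are illustrative.

```python
import csv
import io
from typing import Iterable, TextIO

FIELDS = ["title", "author", "avg_rating", "ratings", "url"]


def items_to_csv(items: Iterable[dict], out: TextIO) -> None:
    """Write items as CSV in a fixed column order.

    Missing keys and None values become empty cells; unknown keys are ignored.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore", restval="")
    writer.writeheader()
    for item in items:
        writer.writerow(item)


if __name__ == "__main__":
    rows = [{"title": "A, B", "author": "X", "avg_rating": 4.2,
             "ratings": None, "url": "https://example.com/b"}]
    buf = io.StringIO()
    items_to_csv(rows, buf)
    print(buf.getvalue())
    # To write a real file, open it with newline="" as the csv docs recommend:
    # with open("goodreads-list.csv", "w", newline="", encoding="utf-8") as f:
    #     items_to_csv(rows, f)
```

Titles that contain commas are quoted automatically, which is the main thing hand-rolled CSV writing tends to get wrong.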

Common issues (and how to handle them)

  • Consent / bot pages: HTML is too small or contains “verify you are human”
    • backoff + retries
    • lower request rate
    • add a proxy-backed fetch layer (ProxiesAPI)
  • Selector drift: a.bookTitle or .tableList tr changes
    • keep parse_list_page() small and adjust it when it breaks
  • Pagination surprises: some lists reorder, or show localized variants
    • cap max_pages
    • detect repetition with the “first title repeats” check
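
The first bullet can be made concrete with a small predicate over the HTML. This is a sketch only: "verify you are human" is the symptom named above, while the second marker phrase and the 20,000-character default are assumptions to tune against pages you actually receive. The name looks_blocked is illustrative.

```python
from typing import Iterable

# First phrase comes from the symptom listed above; the second is a guess.
BLOCK_MARKERS = ("verify you are human", "unusual traffic")


def looks_blocked(html: str, min_chars: int = 20_000,
                  markers: Iterable[str] = BLOCK_MARKERS) -> bool:
    """Heuristic: True if the page is suspiciously short or contains a challenge phrase."""
    if not html or len(html) < min_chars:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


if __name__ == "__main__":
    print(looks_blocked(""))
    print(looks_blocked("x" * 30_000))
    print(looks_blocked("x" * 30_000 + "Please Verify You Are Human"))
```

A check like this could run right after the response is read, raising an error on True so the existing retry and backoff path handles it.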

Where ProxiesAPI fits (no hype)

Goodreads scraping success is mostly a network problem as you scale: rate limits, throttling, and inconsistent responses.

ProxiesAPI helps by giving you:

  • a consistent fetch URL that you can toggle on/off
  • fewer sudden failures when you paginate
  • a clean separation between fetch and parse

That separation is what makes your scraper maintainable.
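
As a sketch of the on/off toggle, the function below returns the target URL untouched when no key is given and the wrapped ProxiesAPI URL otherwise. It reuses the endpoint format shown in Step 1; the names resolve_fetch_url and PROXIESAPI_ENDPOINT are illustrative.

```python
import urllib.parse

PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/"  # endpoint format from Step 1


def resolve_fetch_url(target_url: str, auth_key: str = "") -> str:
    """Return the URL to request: direct when auth_key is empty, proxied otherwise."""
    if not auth_key:
        return target_url
    # urlencode percent-encodes the target, so its own "?page=2" cannot
    # be mistaken for a parameter of the proxy endpoint.
    query = urllib.parse.urlencode({"auth_key": auth_key, "url": target_url})
    return f"{PROXIESAPI_ENDPOINT}?{query}"


if __name__ == "__main__":
    target = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page=2"
    print(resolve_fetch_url(target))
    print(resolve_fetch_url(target, auth_key="YOUR_KEY"))
```

Because the parser only ever sees HTML, switching between the two modes changes nothing downstream of the fetch call.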
