Scrape Book Data from Goodreads with Python (List Pages + Pagination)

May 22, 2026 · tutorial · #python, #goodreads, #books, #web-scraping, #beautifulsoup, #csv, #json, #proxies

Goodreads list pages (Listopia) are a common starting point for building book datasets: titles, authors, average rating, rating count, and more.

In this tutorial you’ll build a practical scraper that:

fetches Goodreads list pages via ProxiesAPI (optional, recommended at scale)
extracts book rows with stable selectors
paginates until you hit a limit or the list ends
exports data to CSV and JSON

Goodreads Listopia page (we’ll scrape book rows + paginate)

When pagination starts failing, ProxiesAPI keeps the fetch layer stable

Directory-style sites often rate-limit when you scale from 20 URLs to 2,000. ProxiesAPI fits cleanly into your fetch layer so retries and proxy rotation stay a one-function change.


What we’re scraping

Example list page:

https://www.goodreads.com/list/show/1.Best_Books_Ever

Pagination typically looks like:

...?page=2
...?page=3

We’ll scrape each list row for:

title
author
average rating
rating count (when visible)

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml pandas

Step 1: A resilient fetch layer (with optional ProxiesAPI)

ProxiesAPI works by fetching the target URL through their endpoint:

http://api.proxiesapi.com/?auth_key=YOUR_KEY&url=https://example.com

import os
import time
import random
import urllib.parse
import requests

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY", "")
TIMEOUT = (10, 40)  # connect, read

session = requests.Session()


def proxiesapi_url(target_url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY in your environment")

    return (
        "http://api.proxiesapi.com/?auth_key="
        + urllib.parse.quote(PROXIESAPI_KEY, safe="")
        + "&url="
        + urllib.parse.quote(target_url, safe="")
    )


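def fetch(url: str, *, use_proxiesapi: bool | None = None, max_retries: int = 4) -> str:
    # ProxiesAPI is optional: by default, use it only when a key is configured.
    if use_proxiesapi is None:
        use_proxiesapi = bool(PROXIESAPI_KEY)

    # Build the URL outside the retry loop so a missing key fails immediately
    # instead of being retried with backoff.
    final_url = proxiesapi_url(url) if use_proxiesapi else url
    last_err = None

    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(
                final_url,
                timeout=TIMEOUT,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/123.0 Safari/537.36"
                    ),
                    "Accept-Language": "en-US,en;q=0.9",
                },
            )
            r.raise_for_status()
            html = r.text
            # Very small responses are usually consent or bot-check pages.
            if not html or len(html) < 20_000:
                raise RuntimeError(f"Suspiciously small HTML ({len(html)} chars)")
            return html

        except (requests.RequestException, RuntimeError) as e:
            last_err = e
            if attempt < max_retries:
                # Exponential backoff (1s, 2s, 4s, ... capped at 10s) plus jitter.
                time.sleep(min(10, 2 ** (attempt - 1)) + random.random())

    raise RuntimeError(f"Fetch failed after {max_retries} attempts: {last_err}")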

Step 2: Identify list row selectors

On most Goodreads Listopia pages, each book row is a “table row-like” block containing:

a title link (usually an <a class="bookTitle">)
an author link (usually an <a class="authorName">)
an average rating snippet (text around “avg rating”)

We’ll parse using BeautifulSoup and keep the selector logic small and testable.

import re
from bs4 import BeautifulSoup


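# Row text is assumed to read like "4.34 avg rating ... 1,234,567 ratings",
# i.e. the number comes before the phrase "avg rating".
AVG_RE = re.compile(r"([0-9]+(?:\.[0-9]+)?)\s*avg rating", re.I)
RATINGS_RE = re.compile(r"([0-9][0-9,]*)\s*ratings", re.I)


def parse_float(text: str) -> float | None:
    try:
        return float(text)
    except (TypeError, ValueError):
        return None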


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d[\d,]*)", text or "")
    return int(m.group(1).replace(",", "")) if m else None


def parse_list_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Most list pages render rows under .tableList
    rows = soup.select(".tableList tr")
    items: list[dict] = []

    for row in rows:
        title_a = row.select_one("a.bookTitle")
        author_a = row.select_one("a.authorName")

        title = title_a.get_text(" ", strip=True) if title_a else None
        author = author_a.get_text(" ", strip=True) if author_a else None

        href = title_a.get("href") if title_a else None
        url = f"https://www.goodreads.com{href}" if href and href.startswith("/") else href

        meta = row.get_text(" ", strip=True)
        avg = None
        ratings = None

        m_avg = AVG_RE.search(meta)
        if m_avg:
            avg = parse_float(m_avg.group(1))

        m_r = RATINGS_RE.search(meta)
        if m_r:
            ratings = parse_int(m_r.group(1))

        if title:
            items.append({
                "title": title,
                "author": author,
                "avg_rating": avg,
                "ratings": ratings,
                "url": url,
            })

    return items

Step 3: Paginate safely (don’t assume infinite pages)

Goodreads list pages often expose explicit paging controls. A simple and robust strategy:

request ?page=N
stop when you get no rows or when the page repeats the previous page
cap the crawl with max_pages

def paged_list_url(base: str, page: int) -> str:
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}page={page}"


def scrape_list(base_url: str, *, max_pages: int = 5) -> list[dict]:
    all_items: list[dict] = []
    last_first_title = None

    for page in range(1, max_pages + 1):
        url = paged_list_url(base_url, page)
        # Use ProxiesAPI only when a key is configured; otherwise fetch directly.
        html = fetch(url, use_proxiesapi=bool(PROXIESAPI_KEY))
        items = parse_list_page(html)

        if not items:
            break

        first_title = items[0].get("title")
        if first_title and first_title == last_first_title:
            break

        last_first_title = first_title
        all_items.extend(items)

    return all_items
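
The paged_list_url helper appends page=N by string concatenation, which produces a duplicate parameter if the base URL already carries a page value. One possible alternative, sketched here with the standard library only, replaces any existing page parameter instead. The name with_page is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(base_url: str, page: int) -> str:
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Keep every other query parameter in order; drop any existing `page`.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://www.goodreads.com/list/show/1.Best_Books_Ever"
print(with_page(base, 2))              # ...Best_Books_Ever?page=2
print(with_page(base + "?page=2", 3))  # ...Best_Books_Ever?page=3
```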

Step 4: Export to CSV + JSON

import json
import pandas as pd


if __name__ == "__main__":
    base = "https://www.goodreads.com/list/show/1.Best_Books_Ever"

    items = scrape_list(base, max_pages=3)
    print("books:", len(items))

    # JSON export
    with open("goodreads-list.json", "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)

    # CSV export
    df = pd.DataFrame(items)
    df.to_csv("goodreads-list.csv", index=False)
    print(df.head(5))
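
If pandas is only there for the CSV export, the standard library's csv module can do the same job. The sketch below is one way to write it: items_to_csv and FIELDS are hypothetical names, and the sample row is made up.

```python
import csv
import io

# Column order for the export; matches the keys built in parse_list_page().
FIELDS = ["title", "author", "avg_rating", "ratings", "url"]


def items_to_csv(items: list[dict], fields: list[str] = FIELDS) -> str:
    """Serialize scraped items to CSV text (None becomes an empty cell)."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buf.getvalue()


sample = [
    {
        "title": "Example Book, Vol. 1",  # the comma forces CSV quoting
        "author": "Jane Doe",
        "avg_rating": 4.25,
        "ratings": None,
        "url": "https://example.com/book",
    }
]
csv_text = items_to_csv(sample)
print(csv_text)
```

To write straight to disk instead, pass a file opened with open(path, "w", newline="", encoding="utf-8") to csv.DictWriter.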

Common issues (and how to handle them)

Consent / bot pages: HTML is too small or contains “verify you are human”
- backoff + retries
- lower request rate
- add a proxy-backed fetch layer (ProxiesAPI)
Selector drift: a.bookTitle or .tableList tr changes
- keep parse_list_page() small and adjust it when it breaks
Pagination surprises: some lists reorder, or show localized variants
- cap max_pages
- detect repetition with the “first title repeats” check
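
The consent / bot page check above can be sketched as a small predicate that runs before parsing. The 20,000-character floor mirrors the threshold used in fetch(), and "verify you are human" comes from the symptom listed above. The other marker strings and the name looks_blocked are assumptions for illustration, so tune them against pages you actually receive.

```python
# Phrases that suggest an interstitial rather than a real list page.
# Only the first comes from this tutorial; the rest are assumed examples.
BLOCK_MARKERS = ("verify you are human", "captcha", "unusual traffic")


def looks_blocked(html: str, *, min_chars: int = 20_000, markers=BLOCK_MARKERS) -> bool:
    """Heuristic: True if the response is too small or contains a block marker."""
    if not html or len(html) < min_chars:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


print(looks_blocked("<html>tiny</html>"))                          # True
print(looks_blocked("x" * 30_000))                                 # False
print(looks_blocked("x" * 30_000 + "Please Verify You Are Human")) # True
```

A heuristic like this can produce false positives (a long article that mentions "captcha", for example), so it is better used to trigger a retry than to discard data silently.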

Where ProxiesAPI fits (no hype)

Goodreads scraping success is mostly a network problem as you scale: rate limits, throttling, and inconsistent responses.

ProxiesAPI helps by giving you:

a consistent fetch URL that you can toggle on/off
fewer sudden failures when you paginate
a clean separation between fetch and parse

That separation is what makes your scraper maintainable.
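
One way to see that separation pay off: if the pagination loop receives its fetch and parse functions as arguments, the stopping rules can be exercised with canned data and no network. This is a sketch under that assumption. crawl_pages, fake_fetch, and fake_parse are hypothetical names, and the "pages" are just comma-separated strings.

```python
from typing import Callable


def crawl_pages(
    fetch_page: Callable[[int], str],
    parse_page: Callable[[str], list[dict]],
    *,
    max_pages: int = 5,
) -> list[dict]:
    """Collect items page by page until a page is empty, repeats, or the cap is hit."""
    collected: list[dict] = []
    previous_first = None
    for page in range(1, max_pages + 1):
        items = parse_page(fetch_page(page))
        if not items:
            break  # ran past the end of the list
        first = items[0].get("title")
        if first and first == previous_first:
            break  # same first title as the previous page: treat as a repeat
        previous_first = first
        collected.extend(items)
    return collected


# Canned "pages": page 3 repeats page 2, as some sites do past the last page.
FAKE_PAGES = {1: "A,B", 2: "C,D", 3: "C,D"}


def fake_fetch(page: int) -> str:
    return FAKE_PAGES.get(page, "")


def fake_parse(raw: str) -> list[dict]:
    return [{"title": title} for title in raw.split(",") if title]


titles = [item["title"] for item in crawl_pages(fake_fetch, fake_parse, max_pages=10)]
print(titles)  # ['A', 'B', 'C', 'D']
```

Swapping fake_fetch for a function that calls the real fetch layer changes nothing in the loop, which is the point of keeping the two apart.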
