Web Scraping Pagination: 7 Patterns That Don’t Break (Offset, Cursor, Infinite Scroll)

May 31, 2026 · guide · #web-scraping, #pagination, #python, #requests, #retry, #backoff

Pagination is where “my scraper works” turns into “my dataset is wrong.”

It’s not the first page that breaks you. It’s page 73:

duplicate items appear across pages
the last page returns a soft-block HTML template
cursor parameters change without warning
“Load more” endpoints require headers you didn’t copy

This guide is a practical playbook for web scraping pagination in 2026.

Make paginated crawls stable with ProxiesAPI

Pagination failures compound fast at scale. ProxiesAPI fits as a fetch-layer wrapper so intermittent blocks don’t turn into missing pages and silent gaps.

The 7 pagination patterns (and when to use each)

Offset pagination: ?page=3 or ?offset=40
Cursor pagination: ?cursor=eyJpZCI6...
Next-link discovery: follow rel="next" / “Next” anchor
Token-in-HTML pagination: next cursor embedded in the page payload
Infinite scroll endpoints: hidden JSON/XHR calls behind “Load more”
Calendar/time pagination: before=2026-01-01 or until=...
ID-based pagination: after_id=12345 or “seek method”

You can detect which one a site uses by inspecting:

URL parameters when you click next
network requests for XHR calls
HTML link tags (rel="next")

Pattern 1: Offset pagination (simple, but dangerous)

Offset pagination looks like:

?page=2
?offset=20&limit=20

Pros: easy to implement.

Cons:

fragile for changing datasets (new items shift offsets)
duplicates or missing items if the listing updates while you crawl

Mitigation: crawl with a stable sort order or crawl within time windows.

def offset_urls(base: str, pages: int) -> list[str]:
    return [f"{base}?page={p}" for p in range(1, pages + 1)]
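To make the "new items shift offsets" problem concrete, here is a small simulation. It makes no HTTP calls: `fetch_offset_page` and the `listing` data are hypothetical stand-ins for a newest-first listing page.

```python
def fetch_offset_page(listing: list[str], offset: int, limit: int) -> list[str]:
    """Stand-in for an HTTP call: return one page of a newest-first listing."""
    return listing[offset:offset + limit]


# Hypothetical newest-first listing at the moment the crawl starts.
listing = ["e", "d", "c", "b", "a"]

page1 = fetch_offset_page(listing, offset=0, limit=2)  # ["e", "d"]

# A new item is published between the two requests and shifts every offset by one.
listing.insert(0, "f")

page2 = fetch_offset_page(listing, offset=2, limit=2)  # ["d", "c"], so "d" repeats

# Deduping by a canonical key keeps the output clean even when pages overlap.
seen: set[str] = set()
unique = []
for item in page1 + page2:
    if item not in seen:
        seen.add(item)
        unique.append(item)
```

Dedupe only covers the duplicate half of the problem. If an item is deleted mid-crawl, offsets shift the other way and an item is skipped, which is why a stable sort order or time windows still matter.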

Pattern 2: Cursor pagination (most reliable when supported)

Cursor pagination uses a token like ?cursor=... or ?after=...

The next cursor usually comes from the response, not from URL math.

def crawl_cursor(fetch_page, first_url: str, *, max_pages: int = 50):
    url = first_url
    pages = 0
    seen_ids = set()

    while url and pages < max_pages:
        pages += 1
        data = fetch_page(url)  # returns parsed JSON

        for item in data["items"]:
            if item["id"] in seen_ids:
                continue
            seen_ids.add(item["id"])
            yield item

        url = data.get("next_url")  # extracted cursor for next page

Pattern 3: Next-link discovery (HTML)

Many sites include an explicit next link:

from bs4 import BeautifulSoup
from urllib.parse import urljoin


def next_link(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    a = soup.select_one('a[rel="next"]') or soup.find(
        "a", string=lambda s: s and "next" in s.lower()
    )
    return urljoin(current_url, a.get("href")) if a and a.get("href") else None

Pattern 4: Token-in-HTML pagination (payload embedded)

Modern apps embed pagination metadata in:

JSON blobs (Next.js, Apollo, etc.)
hidden inputs
data-* attributes

Tip: search the HTML for strings like "cursor", "pageInfo", or "next".

Pattern 5: Infinite scroll endpoints (XHR / JSON)

Infinite scroll is usually:

page 1 is HTML
“Load more” calls an endpoint returning JSON (or HTML fragments)

The reliable move:

open devtools → Network
trigger “load more”
copy request as cURL
replicate it in requests
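The last step usually comes down to copying the right headers. Below is a minimal sketch that builds (but does not send) such a request with the standard library so it stays dependency-free. The endpoint, the `cursor` parameter name, and the header set are assumptions; use whatever the copied cURL command actually shows.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_load_more_request(endpoint: str, cursor: str, referer: str) -> Request:
    """Build (but do not send) a GET request that mimics the browser's XHR."""
    url = f"{endpoint}?{urlencode({'cursor': cursor})}"
    headers = {
        "Accept": "application/json",
        "Referer": referer,  # many "Load more" endpoints check this
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 (compatible; example-crawler)",
    }
    return Request(url, headers=headers, method="GET")


# Hypothetical endpoint and cursor, for illustration only.
req = build_load_more_request(
    "https://example.com/api/items", "abc 123", "https://example.com/items"
)
```

With requests, the same dictionary goes into `requests.get(endpoint, params={"cursor": cursor}, headers=headers)`.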

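def crawl_load_more(fetch_more, first_payload: dict, *, max_pages: int = 100):
    payload = first_payload
    for _ in range(max_pages):  # hard cap so a repeating cursor cannot loop forever
        items = payload.get("items", [])
        if not items:
            break
        for it in items:
            yield it
        cursor = payload.get("next_cursor")
        if not cursor:
            break  # last page: no cursor to follow
        payload = fetch_more(cursor)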

Pattern 6: Calendar/time pagination (before/until)

For feeds and logs, time-window pagination is stable:

keep the timestamp of the oldest item you saw
request the next window using before/until
dedupe by ID

Pattern 7: ID-based pagination (seek method)

If the site supports after_id / since_id, use it.

It’s reliable as long as IDs only ever increase: new items land after your position instead of shifting the pages you have already read.
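A minimal sketch of the seek loop, assuming a hypothetical `fetch_after(last_id)` callable that returns items with an `id` greater than `last_id` in ascending order. `ITEMS` and `fake_fetch_after` are stand-ins for a real endpoint.

```python
def crawl_after_id(fetch_after, *, start_id: int = 0, max_pages: int = 100):
    """Seek pagination: always ask for items with an ID above the last one seen."""
    last_id = start_id
    for _ in range(max_pages):
        items = fetch_after(last_id)
        if not items:
            break
        yield from items
        new_last = max(it["id"] for it in items)
        if new_last <= last_id:
            break  # IDs did not advance; stop rather than loop forever
        last_id = new_last


# Hypothetical data set with gaps in the ID sequence.
ITEMS = [{"id": i} for i in (3, 7, 8, 12, 15)]


def fake_fetch_after(after_id: int, limit: int = 2) -> list[dict]:
    """Stand-in endpoint: the next `limit` items with id > after_id."""
    return [it for it in ITEMS if it["id"] > after_id][:limit]


ids = [it["id"] for it in crawl_after_id(fake_fetch_after)]
```

Because the position is a value rather than an offset, the last ID doubles as a resume checkpoint.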

The three rules that prevent silent data bugs

Always dedupe by a canonical key (ID or URL).
Detect soft-blocks:

tiny HTML bodies
missing expected selectors
“please verify you are human” templates

Log checkpoints (cursor/page, item count, timestamp) so you can resume.

A production-friendly pagination loop (copy/paste)

import time
import random


def crawl_pages(fetch_html, parse_items, get_next_url, start_url: str, *, max_pages: int = 100):
    url = start_url
    pages = 0
    seen = set()

    while url and pages < max_pages:
        pages += 1
        html = fetch_html(url)

        items = parse_items(html)
        if not items:
            raise RuntimeError("No items parsed — possible soft-block or selector drift")

        for it in items:
            key = it.get("id") or it.get("url")
            if not key or key in seen:
                continue
            seen.add(key)
            yield it

        url = get_next_url(html, url)
        time.sleep(0.5 + random.random())

Where ProxiesAPI helps (honestly)

Pagination multiplies your request count.

If a site fails 2% of the time, that sounds fine… until you fetch 1,000 pages and about 20 of them come back broken.

ProxiesAPI can help stabilize the network layer:

consistent IP rotation when you scale
fewer transient blocks
easier retries (because you route through one wrapper URL)

It won’t fix bad dedupe logic or incorrect next-link extraction — but it can reduce missing pages caused by network instability.
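Whatever sits at the network layer, retries belong in the fetch wrapper. Here is a minimal sketch of exponential backoff with jitter; `fetch` is any callable you supply, and the attempt count and delays are assumptions to tune per site.

```python
import random
import time


def fetch_with_retry(fetch, url: str, *, attempts: int = 4,
                     base_delay: float = 0.5, sleep=time.sleep):
    """Call fetch(url), retrying failures with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:  # in real code, narrow this to network/HTTP errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error instead of skipping the page
            # 0.5s, 1s, 2s, ... plus up to base_delay of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

If failures were independent, a 2% failure rate with four attempts would leave about 0.02^4 = 0.00000016 of pages failing. Real blocks tend to come in streaks, so treat that as a best case and keep the checkpoint and soft-block checks in place.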
