Web Scraping Pagination: 7 Patterns That Don’t Break (Offset, Cursor, Infinite Scroll)

Pagination is where “my scraper works” turns into “my dataset is wrong.”

It’s not the first page that breaks you. It’s page 73:

  • duplicate items appear across pages
  • the last page returns a soft-block HTML template
  • cursor parameters change without warning
  • “Load more” endpoints require headers you didn’t copy

This guide is a practical playbook for web scraping pagination in 2026.

Make paginated crawls stable with ProxiesAPI

Pagination failures compound fast at scale. ProxiesAPI fits as a fetch-layer wrapper so intermittent blocks don’t turn into missing pages and silent gaps.


The 7 pagination patterns (and when to use each)

  1. Offset pagination: ?page=3 or ?offset=40
  2. Cursor pagination: ?cursor=eyJpZCI6...
  3. Next-link discovery: follow rel="next" / “Next” anchor
  4. Token-in-HTML pagination: next cursor embedded in the page payload
  5. Infinite scroll endpoints: hidden JSON/XHR calls behind “Load more”
  6. Calendar/time pagination: before=2026-01-01 or until=...
  7. ID-based pagination: after_id=12345 or “seek method”

You can detect which one a site uses by inspecting:

  • URL parameters when you click next
  • network requests for XHR calls
  • HTML link tags (rel="next")

Pattern 1: Offset pagination (simple, but dangerous)

Offset pagination looks like:

  • ?page=2
  • ?offset=20&limit=20

Pros: easy to implement.

Cons:

  • fragile for changing datasets (new items shift offsets)
  • duplicates or missing items if the listing updates while you crawl

Mitigation: crawl with a stable sort order or crawl within time windows.

def offset_urls(base: str, pages: int) -> list[str]:
    return [f"{base}?page={p}" for p in range(1, pages + 1)]
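offset_urls assumes you already know the page count. When you don't, a loop that stops on an empty or fully repeated page is a reasonable fallback. The sketch below assumes a hypothetical fetch_items(offset, limit) callable that returns a list of dicts with an "id" field; neither the name nor the field comes from a specific site.

```python
from typing import Callable, Iterator


def crawl_offset(
    fetch_items: Callable[[int, int], list],
    *,
    limit: int = 20,
    max_pages: int = 500,
) -> Iterator[dict]:
    """Yield items until a page is empty or contains nothing new."""
    seen_ids = set()
    for page in range(max_pages):
        # fetch_items is a hypothetical (offset, limit) fetcher you supply
        items = fetch_items(page * limit, limit)
        if not items:
            break  # ran past the last page

        new_count = 0
        for it in items:
            if it["id"] in seen_ids:
                continue  # duplicate caused by shifting offsets
            seen_ids.add(it["id"])
            new_count += 1
            yield it

        if new_count == 0:
            break  # the site is repeating a page; stop instead of looping
```

The second stop condition matters because some sites clamp an out-of-range offset to the last page instead of returning an empty list.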

Pattern 2: Cursor pagination (most reliable when supported)

Cursor pagination uses a token like ?cursor=... or ?after=...

The next cursor usually comes from the response, not from URL math.

def crawl_cursor(fetch_page, first_url: str, *, max_pages: int = 50):
    url = first_url
    pages = 0
    seen_ids = set()

    while url and pages < max_pages:
        pages += 1
        data = fetch_page(url)  # returns parsed JSON

        for item in data["items"]:
            if item["id"] in seen_ids:
                continue
            seen_ids.add(item["id"])
            yield item

        url = data.get("next_url")  # extracted cursor for next page

Pattern 3: Next-link discovery (rel="next" or a “Next” anchor)

Many sites include an explicit next link:

from bs4 import BeautifulSoup
from urllib.parse import urljoin


def next_link(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    a = soup.select_one('a[rel="next"]') or soup.find(
        "a", string=lambda s: s and "next" in s.lower()
    )
    return urljoin(current_url, a.get("href")) if a and a.get("href") else None

Pattern 4: Token-in-HTML pagination (payload embedded)

Many modern apps embed pagination metadata in the HTML itself:

  • JSON blobs (Next.js, Apollo, etc.)
  • hidden inputs
  • data-* attributes

Tip: search the HTML for strings like "cursor", "pageInfo", or "next".
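As a rough sketch of that tip, the function below pulls a cursor out of a Next.js-style JSON blob. It assumes the page ships its state in a script tag with id="__NEXT_DATA__" and that the cursor lives under a key named "endCursor" (the GraphQL connection convention). Both are assumptions to check against the real page; the key name is a parameter for that reason.

```python
import json
import re
from typing import Any, Optional

# Assumption: the state lives in <script id="__NEXT_DATA__" ...>{json}</script>
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def find_key(node: Any, key: str) -> Any:
    """Depth-first search for the first non-empty value stored under `key`."""
    if isinstance(node, dict):
        if node.get(key):
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found:
            return found
    return None


def extract_cursor(html: str, key: str = "endCursor") -> Optional[str]:
    """Return the embedded cursor, or None if the blob or key is missing."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        state = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # blob was truncated or is not plain JSON
    return find_key(state, key)
```

A None result is worth logging: it can mean the last page, a changed key name, or a soft-block page with no payload at all.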


Pattern 5: Infinite scroll endpoints (XHR / JSON)

Infinite scroll usually works like this:

  • page 1 is HTML
  • “Load more” calls an endpoint returning JSON (or HTML fragments)

The reliable move:

  1. open devtools → Network
  2. trigger “load more”
  3. copy request as cURL
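  4. replicate it in requests

def crawl_load_more(fetch_more, first_payload: dict, *, max_pages: int = 200):
    payload = first_payload
    for _ in range(max_pages):  # hard cap so a repeating cursor cannot loop forever
        items = payload.get("items", [])
        if not items:
            break
        for it in items:
            yield it
        cursor = payload.get("next_cursor")
        if not cursor:  # last page: nothing left to follow
            break
        payload = fetch_more(cursor)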
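Replicating the copied request mostly means sending the same headers. Here is a minimal sketch using only the standard library (the same idea carries over to requests). The endpoint shape, the cursor query parameter, and the header set are all assumptions; copy the real ones from devtools.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Headers that XHR endpoints commonly check. Illustrative values only:
# replace them with what the copied cURL command actually sends.
DEFAULT_HEADERS = {
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0",
}


def build_load_more_request(
    endpoint: str, cursor: str, referer: str
) -> urllib.request.Request:
    """Build the request for one "Load more" call (hypothetical ?cursor= param)."""
    url = f"{endpoint}?{urlencode({'cursor': cursor})}"
    headers = {**DEFAULT_HEADERS, "Referer": referer}
    return urllib.request.Request(url, headers=headers)


def fetch_load_more(
    endpoint: str, cursor: str, referer: str, timeout: float = 15.0
) -> dict:
    """Send the request and parse the JSON body."""
    req = build_load_more_request(endpoint, cursor, referer)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

To use it as the fetch_more argument of crawl_load_more, bind the endpoint and referer first, for example with functools.partial or a small lambda that takes only the cursor.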

Pattern 6: Calendar/time pagination (before/until)

For feeds and logs, time-window pagination is stable:

  • keep the timestamp of the oldest item you saw
  • request the next window using before/until
  • dedupe by ID
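Those three steps can be sketched as a short loop. It assumes a hypothetical fetch_window(before) callable that returns dicts with "id" and "created_at" fields, where created_at is an ISO 8601 string in a single timezone so that plain string comparison matches chronological order.

```python
from typing import Callable, Iterator, Optional


def crawl_before(
    fetch_window: Callable[[Optional[str]], list],
    *,
    max_pages: int = 200,
) -> Iterator[dict]:
    """Walk backwards in time, deduping items that straddle a window edge."""
    before = None  # None means "start from the newest items"
    seen_ids = set()
    for _ in range(max_pages):
        items = fetch_window(before)  # hypothetical fetcher you supply

        progressed = False
        for it in items:
            if it["id"] in seen_ids:
                continue  # boundary item returned by two adjacent windows
            seen_ids.add(it["id"])
            progressed = True
            yield it

        if not progressed:
            break  # empty window, or nothing new: we reached the end

        # Oldest timestamp seen in this window becomes the next boundary
        before = min(it["created_at"] for it in items)
```

Whether before is inclusive or exclusive varies by site. If it is exclusive and several items share the boundary timestamp, some of them can be skipped, so check how the endpoint treats ties.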

Pattern 7: ID-based pagination (seek method)

If the site supports after_id / since_id, use it.

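It’s reliable when IDs are monotonic (every new item gets a larger ID than all earlier ones), because newly added items never shift the ones you have already crawled. Confirm that first: random or UUID-style IDs don’t qualify.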
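A minimal sketch of the seek loop, assuming a hypothetical fetch_after(last_id) callable that returns items with integer "id" values greater than last_id:

```python
from typing import Callable, Iterator


def crawl_after_id(
    fetch_after: Callable[[int], list],
    *,
    start_id: int = 0,
    max_pages: int = 200,
) -> Iterator[dict]:
    """Seek forward by ID until the endpoint returns nothing newer."""
    last_id = start_id
    for _ in range(max_pages):
        items = fetch_after(last_id)  # hypothetical fetcher you supply
        if not items:
            break  # caught up with the newest item

        for it in items:
            yield it

        newest = max(it["id"] for it in items)
        if newest <= last_id:
            break  # no forward progress; stop rather than loop forever
        last_id = newest
```

Persisting last_id between runs turns the same loop into an incremental crawl that only picks up items added since the previous run.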


The three rules that prevent silent data bugs

  1. Always dedupe by a canonical key (ID or URL).

  2. Detect soft-blocks:

  • tiny HTML bodies
  • missing expected selectors
  • “please verify you are human” templates

  3. Log checkpoints (cursor/page, item count, timestamp) so you can resume.
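The soft-block signals listed above can be combined into one cheap check. This is a heuristic sketch, not a reliable detector: the size threshold and every phrase except “please verify you are human” are assumptions you should tune per site.

```python
# Phrases that often appear on challenge pages. Only the first comes from the
# list above; the others are illustrative guesses.
SOFT_BLOCK_PHRASES = (
    "please verify you are human",
    "unusual traffic",
    "access denied",
)


def looks_soft_blocked(
    html: str,
    required_markers: tuple,
    *,
    min_bytes: int = 2048,
) -> bool:
    """Return True if the response looks like a block page, not a real listing."""
    if len(html.encode("utf-8")) < min_bytes:
        return True  # tiny body: real listing pages are rarely this small

    body = html.lower()
    if any(phrase in body for phrase in SOFT_BLOCK_PHRASES):
        return True  # known challenge wording

    # Every marker (e.g. a class name from your item selector) must be present
    return not all(marker.lower() in body for marker in required_markers)
```

Call it right after each fetch and treat a True result as a retryable failure, not as an empty page.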

A production-friendly pagination loop (copy/paste)

import time
import random


def crawl_pages(fetch_html, parse_items, get_next_url, start_url: str, *, max_pages: int = 100):
    url = start_url
    pages = 0
    seen = set()

    while url and pages < max_pages:
        pages += 1
        html = fetch_html(url)

        items = parse_items(html)
        if not items:
            raise RuntimeError("No items parsed — possible soft-block or selector drift")

        for it in items:
            key = it.get("id") or it.get("url")
            if not key or key in seen:
                continue
            seen.add(key)
            yield it

        url = get_next_url(html, url)
        time.sleep(0.5 + random.random())
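The loop above does not persist anything, so a crash on page 73 means starting over. One way to add the checkpoint rule is a small JSON file written after each page. The file layout and field names here are assumptions, not a fixed format.

```python
import json
import os
import time
from typing import Optional


def save_checkpoint(
    path: str, *, next_url: Optional[str], pages: int, item_count: int
) -> None:
    """Write crawl progress atomically so a crash never leaves a half-written file."""
    state = {
        "next_url": next_url,
        "pages": pages,
        "item_count": item_count,
        "saved_at": time.time(),
    }
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the last saved state, or None when starting fresh."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

On startup, pass the saved next_url as start_url if load_checkpoint returns a state. The seen set is not restored in this sketch, so dedupe again downstream when you resume.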

Where ProxiesAPI helps (honestly)

Pagination multiplies your request count.

If a site fails 2% of the time, that sounds fine… until you fetch 1,000 pages.
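To put numbers on that, here is a back-of-the-envelope calculation. It assumes each fetch fails independently with the same probability, which real blocks often violate (failures tend to cluster), so treat the output as a rough guide.

```python
def expected_failed_pages(pages: int, fail_rate: float, attempts: int = 1) -> float:
    """Expected number of pages still missing after `attempts` tries each."""
    return pages * fail_rate ** attempts


def prob_any_gap(pages: int, fail_rate: float, attempts: int = 1) -> float:
    """Probability that at least one page is missing from the final dataset."""
    page_ok = 1 - fail_rate ** attempts
    return 1 - page_ok ** pages


if __name__ == "__main__":
    for attempts in (1, 2, 3):
        missing = expected_failed_pages(1000, 0.02, attempts)
        gap = prob_any_gap(1000, 0.02, attempts)
        print(f"attempts={attempts}: ~{missing:.3f} missing pages, P(any gap)={gap:.4f}")
```

With a single attempt per page, a 2% failure rate over 1,000 pages means about 20 missing pages, and a gap somewhere is close to certain. Under the same independence assumption, allowing up to three attempts per page brings the chance of any gap below 1%.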

ProxiesAPI can help stabilize the network layer:

  • consistent IP rotation when you scale
  • fewer transient blocks
  • easier retries (because you route through one wrapper URL)

It won’t fix bad dedupe logic or incorrect next-link extraction — but it can reduce missing pages caused by network instability.
