Web Scraping Pagination: 7 Patterns That Don’t Break (Offset, Cursor, Infinite Scroll)
Pagination is where “my scraper works” turns into “my dataset is wrong.”
It’s not the first page that breaks you. It’s page 73:
- duplicate items appear across pages
- the last page returns a soft-block HTML template
- cursor parameters change without warning
- “Load more” endpoints require headers you didn’t copy
This guide is a practical playbook for web scraping pagination in 2026.
Pagination failures compound fast at scale. ProxiesAPI fits as a fetch-layer wrapper so intermittent blocks don’t turn into missing pages and silent gaps.
The 7 pagination patterns (and when to use each)
- Offset pagination: ?page=3 or ?offset=40
- Cursor pagination: ?cursor=eyJpZCI6...
- Next-link discovery: follow rel="next" / “Next” anchor
- Token-in-HTML pagination: next cursor embedded in the page payload
- Infinite scroll endpoints: hidden JSON/XHR calls behind “Load more”
- Calendar/time pagination: before=2026-01-01 or until=...
- ID-based pagination: after_id=12345 or “seek method”
You can detect which one a site uses by inspecting:
- URL parameters when you click next
- network requests for XHR calls
- HTML link tags (rel="next")
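As a rough illustration of the first check, the sketch below guesses the pagination style from the query parameters of a "next" URL. The parameter names come from the list of patterns above; the `PARAM_HINTS` table and the `guess_pagination` helper are hypothetical, and real sites often use other names, so treat the result as a hint rather than a verdict.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical lookup table: query parameter name -> likely pagination style.
PARAM_HINTS = {
    "page": "offset",
    "offset": "offset",
    "cursor": "cursor",
    "after": "cursor",
    "before": "time",
    "until": "time",
    "after_id": "id",
    "since_id": "id",
}


def guess_pagination(next_url):
    """Return a best-guess pagination style for a URL, or None if unknown."""
    query = urlsplit(next_url).query
    for name, _value in parse_qsl(query, keep_blank_values=True):
        hint = PARAM_HINTS.get(name.lower())
        if hint:
            return hint
    return None
```

If this returns None, the next stop is usually the Network tab or the page source, since the pagination state may not be in the URL at all.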
Pattern 1: Offset pagination (simple, but dangerous)
Offset pagination looks like:
- ?page=2
- ?offset=20&limit=20
Pros: easy to implement.
Cons:
- fragile for changing datasets (new items shift offsets)
- duplicates or missing items if the listing updates while you crawl
Mitigation: crawl with a stable sort order or crawl within time windows.
def offset_urls(base: str, pages: int) -> list[str]:
    return [f"{base}?page={p}" for p in range(1, pages + 1)]
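A slightly more defensive sketch of the same idea: it keeps any query parameters the listing URL already has (a `sort` parameter, for example), dedupes by ID, and stops as soon as a page contributes nothing new. The `page` parameter name, the `id` field, and the `fetch_items` callable are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base, page, param="page"):
    """Set (or overwrite) the page parameter without dropping other params."""
    parts = urlsplit(base)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


def crawl_offset(fetch_items, base, *, max_pages=100):
    """Yield unique items page by page; fetch_items(url) returns a list of dicts."""
    seen = set()
    for page in range(1, max_pages + 1):
        new_on_page = 0
        for item in fetch_items(page_url(base, page)):
            if item["id"] in seen:
                continue  # listing shifted while we crawled
            seen.add(item["id"])
            new_on_page += 1
            yield item
        if new_on_page == 0:
            break  # empty page or only repeats: treat as the end
```

Deduping hides repeats, but it cannot recover items that shifted off a page you already fetched, which is why a stable sort order still matters.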
Pattern 2: Cursor pagination (most reliable when supported)
Cursor pagination uses a token like ?cursor=... or ?after=...
The next cursor usually comes from the response, not from URL math.
def crawl_cursor(fetch_page, first_url: str, *, max_pages: int = 50):
    url = first_url
    pages = 0
    seen_ids = set()
    while url and pages < max_pages:
        pages += 1
        data = fetch_page(url)  # returns parsed JSON
        for item in data["items"]:
            if item["id"] in seen_ids:
                continue
            seen_ids.add(item["id"])
            yield item
        url = data.get("next_url")  # extracted cursor for next page
Pattern 3: Next-link discovery (HTML)
Many sites include an explicit next link:
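from bs4 import BeautifulSoup
from urllib.parse import urljoin

def next_link(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    # Prefer an explicit rel="next" (on <a> or <link>); fall back to anchor text.
    tag = soup.select_one('a[rel~="next"], link[rel~="next"]') or soup.find(
        "a", string=lambda s: s and "next" in s.lower()
    )
    href = tag.get("href") if tag else None
    # Resolve relative hrefs against the page we are on.
    return urljoin(current_url, href) if href else None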
Pattern 4: Token-in-HTML pagination (payload embedded)
Modern apps often embed pagination metadata in:
- JSON blobs (Next.js, Apollo, etc.)
- hidden inputs
- data-* attributes
Tip: search the HTML for strings like "cursor", "pageInfo", or "next".
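One way to act on that tip, sketched with the standard library only: pull every `application/json` script blob out of the page and search it recursively for a cursor key. The key name `endCursor` is an assumption (it is common in Relay-style `pageInfo` objects); swap in whatever string you actually found in the page.

```python
import json
import re

# Matches <script ... type="application/json" ...>...</script> blobs.
SCRIPT_JSON_RE = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_key(node, key):
    """Depth-first search for the first non-None value stored under `key`."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_cursor(html, key="endCursor"):
    """Return the first cursor found in any embedded JSON blob, else None."""
    for blob in SCRIPT_JSON_RE.findall(html):
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next script tag
        cursor = find_key(data, key)
        if cursor is not None:
            return cursor
    return None
```

A regex is enough here because the goal is only to locate the JSON blob; the parsing itself is left to `json.loads`.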
Pattern 5: Infinite scroll endpoints (XHR / JSON)
Infinite scroll usually works like this:
- page 1 is HTML
- “Load more” calls an endpoint returning JSON (or HTML fragments)
The reliable move:
- open devtools → Network
- trigger “load more”
- copy request as cURL
- replicate it in requests, keeping the same headers and cookies
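def crawl_load_more(fetch_more, first_payload: dict, *, max_pages: int = 100):
    payload = first_payload
    for _ in range(max_pages):  # hard cap so a repeating cursor cannot loop forever
        items = payload.get("items", [])
        if not items:
            break
        for it in items:
            yield it
        cursor = payload.get("next_cursor")
        if not cursor:  # last page: nothing left to follow
            break
        payload = fetch_more(cursor)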
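The "replicate it" step might look like the sketch below. It uses the standard library's `urllib` instead of requests so it runs without extra installs; the same headers carry over to `requests.get(url, headers=...)`. The endpoint, the `cursor` parameter name, and the exact header set are assumptions: copy the real ones from the cURL command you captured.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_load_more_request(endpoint, cursor, *, referer, user_agent):
    """Build a GET request that mimics the browser's "Load more" XHR call."""
    url = f"{endpoint}?{urlencode({'cursor': cursor})}"
    headers = {
        "Accept": "application/json",
        "Referer": referer,  # some endpoints check where the call came from
        "User-Agent": user_agent,
        "X-Requested-With": "XMLHttpRequest",  # sent by many XHR-based frontends
    }
    return Request(url, headers=headers)


def fetch_json(request, *, timeout=30):
    """Send the request and parse the JSON body (performs network I/O)."""
    with urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

If the endpoint still returns HTML or an error, diff your headers against the copied cURL command one by one; a missing cookie or token header is the usual culprit.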
Pattern 6: Calendar/time pagination (before/until)
For feeds and logs, time-window pagination is stable:
- keep the timestamp of the oldest item you saw
- request the next window using before/until
- dedupe by ID
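Those three steps can be sketched as a small generator. It assumes a hypothetical `fetch_window(before)` callable that returns a list of dicts with `id` and `ts` fields, where `ts` is any comparable timestamp and `before=None` means "start from the newest items".

```python
def crawl_before(fetch_window, *, start_before=None, max_requests=100):
    """Walk backwards through a feed using the oldest timestamp seen so far."""
    seen = set()
    before = start_before
    for _ in range(max_requests):
        items = fetch_window(before)
        new_items = 0
        for item in items:
            if item["id"] in seen:
                continue  # boundary item repeated from the previous window
            seen.add(item["id"])
            new_items += 1
            yield item
        if new_items == 0:
            break  # empty window or only repeats: nothing older to fetch
        # Reuse the oldest timestamp as the next boundary. If the API treats
        # `before` as inclusive, the overlap is absorbed by the ID dedupe.
        before = min(item["ts"] for item in items)
```

One limit of this sketch: if more items share a single timestamp than fit in one window, an inclusive boundary can stall, so check how the target API handles ties.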
Pattern 7: ID-based pagination (seek method)
If the site supports after_id / since_id, use it.
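It’s reliable when IDs increase monotonically (auto-increment keys, for example): items added mid-crawl land after your current position instead of shifting it.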
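A minimal seek-style loop, assuming a hypothetical `fetch_after(after_id)` callable that returns items with an integer `id` greater than `after_id`:

```python
def crawl_after_id(fetch_after, *, start_id=0, max_requests=100):
    """Page forward by always asking for items after the largest ID seen."""
    last_id = start_id
    for _ in range(max_requests):
        items = fetch_after(last_id)
        # Keep only items that actually move us forward; this also protects
        # against an endpoint that ignores the after_id parameter.
        fresh = [item for item in items if item["id"] > last_id]
        if not fresh:
            break
        yield from fresh
        last_id = max(item["id"] for item in fresh)
```

Because the position is a value rather than a page number, `last_id` doubles as a resume checkpoint: persist it and pass it back as `start_id` on the next run.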
The three rules that prevent silent data bugs
- Always dedupe by a canonical key (ID or URL).
- Detect soft-blocks:
  - tiny HTML bodies
  - missing expected selectors
  - “please verify you are human” templates
- Log checkpoints (cursor/page, item count, timestamp) so you can resume.
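The soft-block rule can be sketched as a cheap check that runs before parsing. To stay dependency-free it uses a plain substring as a stand-in for "expected selector"; the 2,000-character threshold and the second phrase in `BLOCK_PHRASES` are assumptions to tune per site.

```python
# Example challenge phrases; extend with whatever the target site serves.
BLOCK_PHRASES = (
    "please verify you are human",
    "unusual traffic",
)


def soft_block_reason(html, expected_marker, *, min_chars=2000):
    """Return a short reason if the page looks soft-blocked, else None."""
    if len(html) < min_chars:
        return "tiny body"
    lowered = html.lower()
    for phrase in BLOCK_PHRASES:
        if phrase in lowered:
            return f"challenge phrase: {phrase}"
    if expected_marker not in html:
        return "missing expected marker"
    return None
```

Returning a reason instead of a bare boolean makes the checkpoint log more useful: you can see at a glance whether page 73 failed because of a challenge page or because the markup changed.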
A production-friendly pagination loop (copy/paste)
import random
import time

def crawl_pages(fetch_html, parse_items, get_next_url, start_url: str, *, max_pages: int = 100):
    url = start_url
    pages = 0
    seen = set()
    while url and pages < max_pages:
        pages += 1
        html = fetch_html(url)
        items = parse_items(html)
        if not items:
            raise RuntimeError("No items parsed: possible soft-block or selector drift")
        for it in items:
            key = it.get("id") or it.get("url")  # canonical dedupe key
            if not key or key in seen:
                continue
            seen.add(key)
            yield it
        url = get_next_url(html, url)
        if url:
            time.sleep(0.5 + random.random())  # polite jitter between pages
Where ProxiesAPI helps (honestly)
Pagination multiplies your request count.
If a site fails 2% of the time, that sounds fine… until you fetch 1,000 pages and roughly 20 of them come back broken or missing.
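The arithmetic behind that claim, as a small sketch. It assumes failures are independent and equally likely on every request, which real blocks rarely are, so read the numbers as a rough guide.

```python
def expected_failures(pages, fail_rate, attempts=1):
    """Expected number of pages that fail on every one of `attempts` tries."""
    return pages * fail_rate ** attempts


def chance_of_clean_run(pages, fail_rate, attempts=1):
    """Probability that no page is lost across the whole crawl."""
    return (1 - fail_rate ** attempts) ** pages


# 1,000 pages at a 2% failure rate, no retries: about 20 lost pages.
print(expected_failures(1000, 0.02))
# The same crawl with up to 3 attempts per page loses far fewer.
print(expected_failures(1000, 0.02, attempts=3))
print(chance_of_clean_run(1000, 0.02, attempts=3))
```

Without retries, a fully clean 1,000-page run is vanishingly unlikely; with three attempts per page it becomes the normal case.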
ProxiesAPI can help stabilize the network layer:
- consistent IP rotation when you scale
- fewer transient blocks
- easier retries (because you route through one wrapper URL)
It won’t fix bad dedupe logic or incorrect next-link extraction — but it can reduce missing pages caused by network instability.
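Whatever fetch layer you use, a retry wrapper is what turns a transient failure into a delayed page instead of a missing one. Below is a generic sketch with exponential backoff and jitter; `fetch` is any callable that takes a URL and raises on failure, and the attempt count and delays are assumptions, not ProxiesAPI settings.

```python
import random
import time


def fetch_with_retries(fetch, url, *, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failed attempts with backoff plus jitter."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to network errors in real code
            last_error = exc
            if attempt < attempts - 1:
                # Waits of roughly 1s, 2s, 4s... plus up to 1s of jitter.
                sleep(base_delay * 2 ** attempt + random.random())
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the wrapper testable: swap in a no-op in tests and keep `time.sleep` in production.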