Incremental Web Scraping: Re-Crawl Only What Changed

Jul 04, 2026 · guides · #incremental web scraping, #web-scraping, #etag, #last-modified, #python, #crawl-strategy

Incremental web scraping is the difference between a demo crawler and a durable data pipeline.

If you recrawl everything every time, you pay for:

repeated requests to unchanged pages
longer runtimes
more proxy usage
noisier diffs
more chances to get blocked for no reason

A better model is simple:

keep a lightweight state record per URL
use change signals before downloading the whole page
refetch fully only when something probably changed

That is what incremental web scraping is about.

Spend proxy budget on changed pages, not the same pages

Incremental web scraping reduces waste before you scale. Once your recrawl policy is selective, ProxiesAPI can focus on the pages that truly need another fetch.


What counts as a change signal?

There is no single perfect signal, so good incremental web scraping stacks combine several:

Signal	Good for	Main risk
`ETag`	stable origin servers and CDNs	weak or inconsistent values
`Last-Modified`	articles, docs, feeds	missing or inaccurate timestamps
Sitemap `lastmod`	content-heavy public sites	can be stale or over-updated
Content hash	final truth after download	requires a body fetch
List-page diffs	marketplaces and search results	misses detail-page-only changes

The trick is not choosing one “winner.” The trick is ordering them by cost.

A practical incremental crawl policy

My default order looks like this:

if the site gives trustworthy sitemap lastmod, use it to prioritize
send a conditional request with If-None-Match and If-Modified-Since
if you still get a fresh 200, compute a content hash
update stored state only after the fetch and any downstream parsing succeed

That gives you both cheap skip logic and a final “did the content really change?” check.

Step 1: Store crawl state per URL

You need somewhere to remember the last known metadata for each page:

URL
last seen ETag
last seen Last-Modified
last content hash
last fetch timestamp

SQLite is enough for many pipelines.

import sqlite3


def init_db(path: str = "crawl_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_state (
            url TEXT PRIMARY KEY,
            etag TEXT,
            last_modified TEXT,
            content_hash TEXT,
            last_seen_at TEXT
        )
        """
    )
    conn.commit()
    return conn

Step 2: Send conditional requests

This is the cheapest high-value move in incremental web scraping.

If you already know an ETag or Last-Modified, send them back in the next request:

import requests


def fetch_conditionally(session: requests.Session, url: str, state: dict | None):
    headers = {}
    if state:
        if state.get("etag"):
            headers["If-None-Match"] = state["etag"]
        if state.get("last_modified"):
            headers["If-Modified-Since"] = state["last_modified"]

    response = session.get(url, headers=headers, timeout=(10, 30))
    return response

If the server returns 304 Not Modified, you just skipped a full download.

That is the dream.
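
A 304 is still a successful check, so it can be useful to record when it happened without touching the stored validators or hash. The sketch below assumes the `page_state` table from Step 1; `touch_last_seen` is a hypothetical helper name, not part of the original example.

```python
import datetime as dt
import sqlite3


def touch_last_seen(conn: sqlite3.Connection, url: str) -> None:
    """Refresh last_seen_at for a URL while leaving etag, last_modified, and hash as-is."""
    now = dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds")
    conn.execute(
        "UPDATE page_state SET last_seen_at = ? WHERE url = ?",
        (now, url),
    )
    conn.commit()
```

Calling this in the 304 branch keeps the timestamp meaningful for any scheduling logic that depends on when a page was last checked.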

Step 3: Hash content after successful `200` responses

Even when the server returns 200, the content may still be unchanged in any meaningful way.

So compute a hash of the response body, ideally after normalizing it:

import hashlib


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

If the new hash equals the old hash, you can safely mark the page as checked without triggering downstream “content changed” work.
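
What "normalized" means depends on the site. As a rough sketch, the version below drops script blocks, style blocks, and HTML comments, then collapses whitespace before hashing. The pattern list is an assumption for illustration; regex-based stripping is a blunt tool, and a real pipeline would tune it per site or use a proper HTML parser.

```python
import hashlib
import re

# Hypothetical patterns for markup that often changes between requests
# without the visible content changing.
VOLATILE_PATTERNS = [
    re.compile(r"<script\b.*?</script>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<style\b.*?</style>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),
]


def normalize_html(html: str) -> str:
    """Strip volatile blocks and collapse runs of whitespace to single spaces."""
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("", html)
    return re.sub(r"\s+", " ", html).strip()


def normalized_hash(html: str) -> str:
    """SHA-256 of the normalized document."""
    return hashlib.sha256(normalize_html(html).encode("utf-8")).hexdigest()
```

Two responses that differ only in an inline analytics token or in indentation then produce the same hash, while a change to the visible text still produces a different one.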

Full example

import datetime as dt
import hashlib
import sqlite3
import requests


def init_db(path: str = "crawl_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_state (
            url TEXT PRIMARY KEY,
            etag TEXT,
            last_modified TEXT,
            content_hash TEXT,
            last_seen_at TEXT
        )
        """
    )
    conn.commit()
    return conn


def load_state(conn: sqlite3.Connection, url: str) -> dict | None:
    row = conn.execute(
        "SELECT url, etag, last_modified, content_hash, last_seen_at FROM page_state WHERE url = ?",
        (url,),
    ).fetchone()
    if not row:
        return None
    return {
        "url": row[0],
        "etag": row[1],
        "last_modified": row[2],
        "content_hash": row[3],
        "last_seen_at": row[4],
    }


def save_state(
    conn: sqlite3.Connection,
    url: str,
    etag: str | None,
    last_modified: str | None,
    content_hash: str | None,
) -> None:
    conn.execute(
        """
        INSERT INTO page_state (url, etag, last_modified, content_hash, last_seen_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            etag = excluded.etag,
            last_modified = excluded.last_modified,
            content_hash = excluded.content_hash,
            last_seen_at = excluded.last_seen_at
        """,
        (
            url,
            etag,
            last_modified,
            content_hash,
            # utcnow() is deprecated since Python 3.12; store a timezone-aware UTC timestamp.
            dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds"),
        ),
    )
    conn.commit()


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fetch_conditionally(session: requests.Session, url: str, state: dict | None):
    headers = {}
    if state:
        if state.get("etag"):
            headers["If-None-Match"] = state["etag"]
        if state.get("last_modified"):
            headers["If-Modified-Since"] = state["last_modified"]
    return session.get(url, headers=headers, timeout=(10, 30))


def crawl_one(session: requests.Session, conn: sqlite3.Connection, url: str) -> dict:
    state = load_state(conn, url)
    response = fetch_conditionally(session, url, state)

    # 304 Not Modified: the server confirmed our stored validators still match.
    if response.status_code == 304:
        return {"url": url, "status": "not_modified"}

    # 4xx/5xx responses raise here, so stored state is left untouched.
    response.raise_for_status()
    body = response.text
    new_hash = sha256_text(body)

    # Fresh 200 with an identical body: refresh the validators, skip downstream work.
    if state and state.get("content_hash") == new_hash:
        save_state(
            conn,
            url,
            response.headers.get("ETag"),
            response.headers.get("Last-Modified"),
            new_hash,
        )
        return {"url": url, "status": "same_content"}

    # Your parser or downstream processor would run here.
    # State is saved only after that step, so a parser crash does not hide the change.
    save_state(
        conn,
        url,
        response.headers.get("ETag"),
        response.headers.get("Last-Modified"),
        new_hash,
    )
    return {"url": url, "status": "changed"}


if __name__ == "__main__":
    conn = init_db()
    session = requests.Session()

    urls = [
        "https://example.com/page-1",
        "https://example.com/page-2",
    ]

    for url in urls:
        print(crawl_one(session, conn, url))

Typical results:

{'url': 'https://example.com/page-1', 'status': 'not_modified'}
{'url': 'https://example.com/page-2', 'status': 'changed'}

Where incremental web scraping really saves money

Large article archives

Old posts change rarely, so conditional requests plus a content hash let most revisits end without any reprocessing.

Ecommerce catalogs

Product detail pages may change only when:

price changes
stock changes
title/image changes

That means you can recrawl detail pages selectively instead of burning cycles on every SKU every night.

Marketplaces

List pages are often the first signal. If the listing IDs on page 1 changed, then go deeper. If not, skip a large part of the tree.
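
The list-page check can be as small as a set comparison. The sketch below assumes you have already extracted listing IDs from the previous and current crawl of the same list page; `diff_listing_ids` and `should_go_deeper` are hypothetical names.

```python
def diff_listing_ids(previous: list[str], current: list[str]) -> dict[str, list[str]]:
    """Return IDs that appeared or disappeared, preserving page order."""
    prev_set, curr_set = set(previous), set(current)
    return {
        "added": [item for item in current if item not in prev_set],
        "removed": [item for item in previous if item not in curr_set],
    }


def should_go_deeper(previous: list[str], current: list[str]) -> bool:
    """True when the set of listings on the page changed."""
    diff = diff_listing_ids(previous, current)
    return bool(diff["added"] or diff["removed"])
```

This version deliberately ignores pure reordering. If ranking changes matter for your use case, compare the lists directly instead.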

Strong patterns for incremental web scraping

Pattern	Why it works	Best use case
Conditional GET with `ETag`	server tells you if the resource changed	docs, blogs, feeds
Conditional GET with `Last-Modified`	easy and widely supported	legacy sites
Hash only selected DOM section	ignores nav/footer noise	content pages with templates
Compare list-page item IDs	cheap early warning	search pages, category grids
Use sitemap `lastmod` as priority score	reduces unnecessary refetches	large public sites

One underrated trick is hashing only the meaningful content region instead of the whole HTML document. That avoids false positives from:

rotating ads
timestamps in footers
personalization snippets

Common mistakes

1. Updating state before parsing succeeds

If the parser crashes but you already saved the new state, you just taught the system to forget a real change.

Save state only after the fetch and parse path succeeds.
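
The ordering can be made explicit with a small wrapper. This is a sketch under assumptions: `parse` stands in for whatever parser you run, and `commit_state` is a hypothetical callback that writes the new ETag, Last-Modified, and hash.

```python
from typing import Callable


def process_then_commit(
    body: str,
    parse: Callable[[str], object],
    commit_state: Callable[[], None],
) -> dict:
    """Run the parser first; persist crawl state only if parsing succeeded."""
    try:
        parsed = parse(body)
    except Exception as exc:  # broad on purpose: any parser failure blocks the commit
        return {"status": "parse_failed", "error": repr(exc)}
    commit_state()
    return {"status": "changed", "parsed": parsed}
```

Because the old hash stays in place after a failure, the next run sees the page as changed again and retries the parse.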

2. Trusting `Last-Modified` blindly

Some servers never update it. Others update it on every deploy even when the page content is identical.

Use it as a hint, then confirm with a hash when needed.

3. Hashing raw HTML without cleanup

If the page injects dynamic tokens, analytics blobs, or timestamps, a raw hash becomes noisy.

Normalize first when possible.

4. Using one recrawl policy for every page type

Homepages, product pages, article pages, and search pages do not change at the same rate.

Incremental web scraping works best when frequency and change detection are scoped by page type.

A simple production policy to start with

If you need a practical first version, use this:

discover candidate URLs from sitemaps and list pages
store state in SQLite
send conditional GETs on every revisit
hash the body when you get 200
trigger downstream processing only for pages whose status is changed

That is enough to turn a brute-force crawler into a disciplined pipeline.

Bottom line

Incremental web scraping is not about being clever. It is about refusing to pay twice for the same unchanged page.

If you remember only three things, make them these:

use cheap change signals first
use a content hash as the final truth
keep per-URL state so every run gets smarter

Once that is in place, you can scale more confidently because your scraper is no longer treating every page like it is new every time.
