Incremental Web Scraping: Re-Crawl Only What Changed

Incremental web scraping is the difference between a demo crawler and a durable data pipeline.

If you recrawl everything every time, you pay for:

  • repeated requests to unchanged pages
  • longer runtimes
  • more proxy usage
  • noisier diffs
  • more chances to get blocked for no reason

A better model is simple:

  1. keep a lightweight state record per URL
  2. use change signals before downloading the whole page
  3. refetch fully only when something probably changed

That is what incremental web scraping is about.

Spend proxy budget on changed pages, not the same pages

Incremental web scraping reduces waste before you scale. Once your recrawl policy is selective, ProxiesAPI can focus on the pages that truly need another fetch.


What counts as a change signal?

There is no single perfect signal, so good incremental web scraping stacks combine several:

Signal           | Good for                        | Main risk
ETag             | stable origin servers and CDNs  | weak or inconsistent values
Last-Modified    | articles, docs, feeds           | missing or inaccurate timestamps
Sitemap lastmod  | content-heavy public sites      | can be stale or over-updated
Content hash     | final truth after download      | requires a body fetch
List-page diffs  | marketplaces and search results | misses detail-page-only changes

The trick is not choosing one “winner.” The trick is ordering them by cost.


A practical incremental crawl policy

My default order looks like this:

  1. if the site gives trustworthy sitemap lastmod, use it to prioritize
  2. send a conditional request with If-None-Match and If-Modified-Since
  3. if you still get a fresh 200, compute a content hash
  4. update stored state only after the fetch and any downstream parsing succeed

That gives you both cheap skip logic and a final “did the content really change?” check.


Step 1: Store crawl state per URL

You need somewhere to remember the last known metadata for each page:

  • URL
  • last seen ETag
  • last seen Last-Modified
  • last content hash
  • last fetch timestamp

SQLite is enough for many pipelines.

import sqlite3


def init_db(path: str = "crawl_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_state (
            url TEXT PRIMARY KEY,
            etag TEXT,
            last_modified TEXT,
            content_hash TEXT,
            last_seen_at TEXT
        )
        """
    )
    conn.commit()
    return conn

Step 2: Send conditional requests

This is the cheapest high-value move in incremental web scraping.

If you already know an ETag or Last-Modified, send them back in the next request:

import requests


def fetch_conditionally(session: requests.Session, url: str, state: dict | None):
    headers = {}
    if state:
        if state.get("etag"):
            headers["If-None-Match"] = state["etag"]
        if state.get("last_modified"):
            headers["If-Modified-Since"] = state["last_modified"]

    response = session.get(url, headers=headers, timeout=(10, 30))
    return response

If the server returns 304 Not Modified, you just skipped a full download.

That is the dream.


Step 3: Hash content after successful 200 responses

Even when the server returns 200, the content may still be unchanged in any meaningful way.

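So compute a hash of the response body. The helper here hashes the text exactly as given, so normalize the body first if the page carries dynamic noise: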

import hashlib


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

If the new hash equals the old hash, you can safely mark the page as checked without triggering downstream “content changed” work.


Full example

import datetime as dt
import hashlib
import sqlite3
import requests


def init_db(path: str = "crawl_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_state (
            url TEXT PRIMARY KEY,
            etag TEXT,
            last_modified TEXT,
            content_hash TEXT,
            last_seen_at TEXT
        )
        """
    )
    conn.commit()
    return conn


def load_state(conn: sqlite3.Connection, url: str) -> dict | None:
    row = conn.execute(
        "SELECT url, etag, last_modified, content_hash, last_seen_at FROM page_state WHERE url = ?",
        (url,),
    ).fetchone()
    if not row:
        return None
    return {
        "url": row[0],
        "etag": row[1],
        "last_modified": row[2],
        "content_hash": row[3],
        "last_seen_at": row[4],
    }


def save_state(
    conn: sqlite3.Connection,
    url: str,
    etag: str | None,
    last_modified: str | None,
    content_hash: str | None,
) -> None:
    conn.execute(
        """
        INSERT INTO page_state (url, etag, last_modified, content_hash, last_seen_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            etag = excluded.etag,
            last_modified = excluded.last_modified,
            content_hash = excluded.content_hash,
            last_seen_at = excluded.last_seen_at
        """,
        (
            url,
            etag,
            last_modified,
            content_hash,
            dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds"),
        ),
    )
    conn.commit()


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fetch_conditionally(session: requests.Session, url: str, state: dict | None):
    headers = {}
    if state:
        if state.get("etag"):
            headers["If-None-Match"] = state["etag"]
        if state.get("last_modified"):
            headers["If-Modified-Since"] = state["last_modified"]
    return session.get(url, headers=headers, timeout=(10, 30))


def crawl_one(session: requests.Session, conn: sqlite3.Connection, url: str) -> dict:
    state = load_state(conn, url)
    response = fetch_conditionally(session, url, state)

    if response.status_code == 304:
        return {"url": url, "status": "not_modified"}

    response.raise_for_status()
    body = response.text
    new_hash = sha256_text(body)

    if state and state.get("content_hash") == new_hash:
        save_state(
            conn,
            url,
            response.headers.get("ETag"),
            response.headers.get("Last-Modified"),
            new_hash,
        )
        return {"url": url, "status": "same_content"}

    # Your parser or downstream processor would run here.
    save_state(
        conn,
        url,
        response.headers.get("ETag"),
        response.headers.get("Last-Modified"),
        new_hash,
    )
    return {"url": url, "status": "changed"}


if __name__ == "__main__":
    conn = init_db()
    session = requests.Session()

    urls = [
        "https://example.com/page-1",
        "https://example.com/page-2",
    ]

    for url in urls:
        print(crawl_one(session, conn, url))

Typical results:

{'url': 'https://example.com/page-1', 'status': 'not_modified'}
{'url': 'https://example.com/page-2', 'status': 'changed'}

Where incremental web scraping really saves money

Large article archives

Old posts change rarely, so most revisits can end in a 304 or a same-hash skip instead of a full reprocess.

Ecommerce catalogs

Product detail pages may change only when:

  • price changes
  • stock changes
  • title/image changes

That means you can recrawl detail pages selectively instead of burning cycles on every SKU every night.

Marketplaces

List pages are often the first signal. If the listing IDs on page 1 changed, then go deeper. If not, skip a large part of the tree.
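
A minimal sketch of that list-page check, assuming you have already extracted listing IDs from the previous and current crawl of page 1. The function name diff_listing_ids is hypothetical:

```python
def diff_listing_ids(previous: list[str], current: list[str]) -> dict:
    """Compare two crawls of the same list page by listing ID."""
    prev_set, curr_set = set(previous), set(current)
    # Preserve page order so new items can be crawled in the order they appear.
    added = [item for item in current if item not in prev_set]
    removed = [item for item in previous if item not in curr_set]
    return {"added": added, "removed": removed, "changed": bool(added or removed)}


result = diff_listing_ids(["a1", "b2", "c3"], ["b2", "c3", "d4"])
print(result)
```

Only the IDs in "added" need a detail-page fetch. Note that this comparison ignores reordering and will not notice edits to a listing whose ID stayed on the page.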


Strong patterns for incremental web scraping

Pattern                               | Why it works                             | Best use case
Conditional GET with ETag             | server tells you if the resource changed | docs, blogs, feeds
Conditional GET with Last-Modified    | easy and widely supported                | legacy sites
Hash only selected DOM section        | ignores nav/footer noise                 | content pages with templates
Compare list-page item IDs            | cheap early warning                      | search pages, category grids
Use sitemap lastmod as priority score | reduces unnecessary refetches            | large public sites

One underrated trick is hashing only the meaningful content region instead of the whole HTML document. That avoids false positives from:

  • rotating ads
  • timestamps in footers
  • personalization snippets
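
One way to sketch region hashing with only the standard library is shown below. It assumes the meaningful content sits inside an <article> element, which will not hold for every site, and the names RegionText and region_hash are made up for this example.

```python
import hashlib
from html.parser import HTMLParser


class RegionText(HTMLParser):
    """Collect the text found inside <tag>...</tag> regions and ignore the rest."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0  # greater than 0 while inside the target region
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def region_hash(html: str, tag: str = "article") -> str:
    """Hash only the text inside the chosen region of the page."""
    parser = RegionText(tag)
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

If the page has no matching element, this hashes an empty string, so a real pipeline would probably want to fall back to a whole-body hash in that case. Inline scripts inside the region are also treated as text here.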

Common mistakes

1. Updating state before parsing succeeds

If the parser crashes but you already saved the new state, you just taught the system to forget a real change.

Save state only after the fetch and parse path succeeds.
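
A small sketch of that ordering, with the parser and the state write passed in as callables. The name process_then_commit and both parameters are hypothetical, not part of the example pipeline:

```python
from typing import Callable


def process_then_commit(
    body: str,
    parse: Callable[[str], dict],
    commit_state: Callable[[], None],
) -> dict:
    """Run the parser first; persist new crawl state only if it did not raise."""
    try:
        record = parse(body)
    except Exception as exc:
        # State is left untouched, so the next run still sees this page as changed.
        return {"status": "parse_failed", "error": str(exc)}
    commit_state()
    return {"status": "changed", "record": record}
```

In the crawler, commit_state would wrap the save_state call, so a parser crash leaves the old ETag and hash in place and the page gets retried on the next run.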

2. Trusting Last-Modified blindly

Some servers never update it. Others update it on every deploy even when the page content is identical.

Use it as a hint, then confirm with a hash when needed.

3. Hashing raw HTML without cleanup

If the page injects dynamic tokens, analytics blobs, or timestamps, a raw hash becomes noisy.

Normalize first when possible.
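
As a deliberately crude illustration, the sketch below strips script and style blocks, HTML comments, and whitespace differences before hashing. The regex approach and the function names are assumptions. A real pipeline would more likely use an HTML parser and site-specific rules.

```python
import hashlib
import re

# Crude patterns: good enough for a sketch, not a full HTML parser.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
WHITESPACE = re.compile(r"\s+")


def normalize_html(html: str) -> str:
    """Drop script/style blocks and comments, then collapse whitespace."""
    html = SCRIPT_OR_STYLE.sub("", html)
    html = HTML_COMMENT.sub("", html)
    return WHITESPACE.sub(" ", html).strip()


def normalized_hash(html: str) -> str:
    """Hash the normalized markup instead of the raw response body."""
    return hashlib.sha256(normalize_html(html).encode("utf-8")).hexdigest()
```

Tokens embedded in attributes, such as CSRF values in form fields, would still change the hash and need their own rules.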

4. Using one recrawl policy for every page type

Homepages, product pages, article pages, and search pages do not change at the same rate.

Incremental web scraping works best when frequency and change detection are scoped by page type.


A simple production policy to start with

If you need a practical first version, use this:

  1. discover candidate URLs from sitemaps and list pages
  2. store state in SQLite
  3. send conditional GETs on every revisit
  4. hash the body when you get 200
  5. trigger downstream processing only for pages whose status is changed

That is enough to turn a brute-force crawler into a disciplined pipeline.


Bottom line

Incremental web scraping is not about being clever. It is about refusing to pay twice for the same unchanged page.

If you remember only three things, make them these:

  • use cheap change signals first
  • use a content hash as the final truth
  • keep per-URL state so every run gets smarter

Once that is in place, you can scale more confidently because your scraper is no longer treating every page like it is new every time.
