Incremental Web Scraping: Re-Crawl Only What Changed
Incremental web scraping is the difference between a demo crawler and a durable data pipeline.
If you recrawl everything every time, you pay for:
- repeated requests to unchanged pages
- longer runtimes
- more proxy usage
- noisier diffs
- more chances to get blocked for no reason
A better model is simple:
- keep a lightweight state record per URL
- use change signals before downloading the whole page
- refetch fully only when something probably changed
That is what incremental web scraping is about.
Incremental web scraping reduces waste before you scale. Once your recrawl policy is selective, ProxiesAPI can focus on the pages that truly need another fetch.
What counts as a change signal?
There is no single perfect signal, so good incremental web scraping stacks combine several:
| Signal | Good for | Main risk |
|---|---|---|
| ETag | stable origin servers and CDNs | weak or inconsistent values |
| Last-Modified | articles, docs, feeds | missing or inaccurate timestamps |
| Sitemap lastmod | content-heavy public sites | can be stale or over-updated |
| Content hash | final truth after download | requires a body fetch |
| List-page diffs | marketplaces and search results | misses detail-page-only changes |
The trick is not choosing one “winner.” The trick is ordering them by cost.
A practical incremental crawl policy
My default order looks like this:
- if the site gives trustworthy sitemap lastmod, use it to prioritize
- send a conditional request with If-None-Match and If-Modified-Since
- if you still get a fresh 200, compute a content hash
- update stored state only after the fetch succeeds
That gives you both cheap skip logic and a final “did the content really change?” check.
Step 1: Store crawl state per URL
You need somewhere to remember the last known metadata for each page:
- URL
- last seen ETag
- last seen Last-Modified
- last content hash
- last fetch timestamp
SQLite is enough for many pipelines.
import sqlite3

def init_db(path: str = "crawl_state.db") -> sqlite3.Connection:
    # One row per URL; validators and hash are NULL until first seen.
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_state (
            url TEXT PRIMARY KEY,
            etag TEXT,
            last_modified TEXT,
            content_hash TEXT,
            last_seen_at TEXT
        )
        """
    )
    conn.commit()
    return conn
Step 2: Send conditional requests
This is the cheapest high-value move in incremental web scraping.
If you already know an ETag or Last-Modified, send them back in the next request:
import requests

def fetch_conditionally(session: requests.Session, url: str, state: dict | None):
    # Echo back whatever validators we stored on the previous visit.
    headers = {}
    if state:
        if state.get("etag"):
            headers["If-None-Match"] = state["etag"]
        if state.get("last_modified"):
            headers["If-Modified-Since"] = state["last_modified"]
    response = session.get(url, headers=headers, timeout=(10, 30))
    return response
If the server returns 304 Not Modified, you just skipped a full download.
That is the dream.
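A 304 still counts as a successful check, so it can be worth recording when it happened even though nothing was downloaded. The sketch below assumes the page_state table from Step 1 and introduces a hypothetical helper, `touch_state`, that refreshes only the timestamp and leaves the stored ETag, Last-Modified, and hash untouched.

```python
import datetime as dt
import sqlite3


def touch_state(conn: sqlite3.Connection, url: str) -> None:
    """Record that a URL was checked without changing validators or hash."""
    now = dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds")
    conn.execute(
        "UPDATE page_state SET last_seen_at = ? WHERE url = ?",
        (now, url),
    )
    conn.commit()
```

Whether you call something like this on every 304 is a design choice: it keeps "last checked" accurate for scheduling, at the cost of one small write per skipped page.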
Step 3: Hash content after successful 200 responses
Even when the server returns 200, the content may still be unchanged in any meaningful way.
So compute a hash of the response body, ideally after normalizing it. The simplest version hashes the decoded text as-is:
import hashlib

def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
If the new hash equals the old hash, you can safely mark the page as checked without triggering downstream “content changed” work.
Full example
import datetime as dt
import hashlib
import sqlite3

import requests

def init_db(path: str = "crawl_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_state (
            url TEXT PRIMARY KEY,
            etag TEXT,
            last_modified TEXT,
            content_hash TEXT,
            last_seen_at TEXT
        )
        """
    )
    conn.commit()
    return conn

def load_state(conn: sqlite3.Connection, url: str) -> dict | None:
    row = conn.execute(
        "SELECT url, etag, last_modified, content_hash, last_seen_at FROM page_state WHERE url = ?",
        (url,),
    ).fetchone()
    if not row:
        return None
    return {
        "url": row[0],
        "etag": row[1],
        "last_modified": row[2],
        "content_hash": row[3],
        "last_seen_at": row[4],
    }

def save_state(
    conn: sqlite3.Connection,
    url: str,
    etag: str | None,
    last_modified: str | None,
    content_hash: str | None,
) -> None:
    # Upsert: insert on first visit, overwrite the stored values afterwards.
    conn.execute(
        """
        INSERT INTO page_state (url, etag, last_modified, content_hash, last_seen_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            etag = excluded.etag,
            last_modified = excluded.last_modified,
            content_hash = excluded.content_hash,
            last_seen_at = excluded.last_seen_at
        """,
        (
            url,
            etag,
            last_modified,
            content_hash,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
            dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds"),
        ),
    )
    conn.commit()

def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fetch_conditionally(session: requests.Session, url: str, state: dict | None):
    headers = {}
    if state:
        if state.get("etag"):
            headers["If-None-Match"] = state["etag"]
        if state.get("last_modified"):
            headers["If-Modified-Since"] = state["last_modified"]
    return session.get(url, headers=headers, timeout=(10, 30))

def crawl_one(session: requests.Session, conn: sqlite3.Connection, url: str) -> dict:
    state = load_state(conn, url)
    response = fetch_conditionally(session, url, state)
    # 304: the stored validators still match, so there is nothing to download.
    if response.status_code == 304:
        return {"url": url, "status": "not_modified"}
    response.raise_for_status()
    body = response.text
    new_hash = sha256_text(body)
    # 200 with an identical hash: refresh validators, skip downstream work.
    if state and state.get("content_hash") == new_hash:
        save_state(
            conn,
            url,
            response.headers.get("ETag"),
            response.headers.get("Last-Modified"),
            new_hash,
        )
        return {"url": url, "status": "same_content"}
    # Your parser or downstream processor would run here.
    save_state(
        conn,
        url,
        response.headers.get("ETag"),
        response.headers.get("Last-Modified"),
        new_hash,
    )
    return {"url": url, "status": "changed"}

if __name__ == "__main__":
    conn = init_db()
    session = requests.Session()
    urls = [
        "https://example.com/page-1",
        "https://example.com/page-2",
    ]
    for url in urls:
        print(crawl_one(session, conn, url))
Typical results:
{'url': 'https://example.com/page-1', 'status': 'not_modified'}
{'url': 'https://example.com/page-2', 'status': 'changed'}
Where incremental web scraping really saves money
Large article archives
Old posts change rarely. A hash plus timestamp policy can cut huge amounts of waste.
Ecommerce catalogs
Product detail pages may change only when:
- price changes
- stock changes
- title/image changes
That means you can recrawl detail pages selectively instead of burning cycles on every SKU every night.
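One way to act on this is to fingerprint only the fields you care about, so that a detail page counts as "changed" only when one of them moves. The sketch below is an illustration under assumptions: the field names in `TRACKED_FIELDS` and the shape of the `product` dict are hypothetical and would come from your own parser.

```python
import hashlib
import json

# Hypothetical field names; use whatever your parser actually extracts.
TRACKED_FIELDS = ("price", "in_stock", "title", "image_url")


def product_fingerprint(product: dict) -> str:
    """Hash only the tracked fields so unrelated page noise is ignored."""
    tracked = {field: product.get(field) for field in TRACKED_FIELDS}
    # sort_keys and fixed separators keep the serialization deterministic.
    payload = json.dumps(tracked, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing this fingerprint next to the page's content hash lets downstream jobs react to price or stock moves without being woken by template changes.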
Marketplaces
List pages are often the first signal. If the listing IDs on page 1 changed, then go deeper. If not, skip a large part of the tree.
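A minimal sketch of that list-page check, assuming you already extract listing IDs as strings from each list page and keep the previous run's IDs somewhere. The function names `diff_listing_ids` and `should_go_deeper` are made up for this example.

```python
def diff_listing_ids(previous: list[str], current: list[str]) -> dict:
    """Compare the listing IDs seen on a list page across two runs."""
    prev_set, curr_set = set(previous), set(current)
    return {
        "added": sorted(curr_set - prev_set),
        "removed": sorted(prev_set - curr_set),
        # Same items in a different order, for example after a re-sort.
        "reordered": prev_set == curr_set and previous != current,
    }


def should_go_deeper(previous: list[str], current: list[str]) -> bool:
    """Crawl detail pages only when items appeared or disappeared."""
    diff = diff_listing_ids(previous, current)
    return bool(diff["added"] or diff["removed"])
```

Keep in mind that this signal misses changes that happen only on detail pages, so it works best alongside a slower periodic recheck of known items.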
Strong patterns for incremental web scraping
| Pattern | Why it works | Best use case |
|---|---|---|
| Conditional GET with ETag | server tells you if the resource changed | docs, blogs, feeds |
| Conditional GET with Last-Modified | easy and widely supported | legacy sites |
| Hash only selected DOM section | ignores nav/footer noise | content pages with templates |
| Compare list-page item IDs | cheap early warning | search pages, category grids |
| Use sitemap lastmod as priority score | reduces unnecessary refetches | large public sites |
One underrated trick is hashing only the meaningful content region instead of the whole HTML document. That avoids false positives from:
- rotating ads
- timestamps in footers
- personalization snippets
Common mistakes
1. Updating state before parsing succeeds
If the parser crashes but you already saved the new state, you just taught the system to forget a real change.
Save state only after the fetch and parse path succeeds.
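The ordering can be made explicit with a small wrapper that runs the parser first and commits state only if it returns normally. This is a sketch: `process_then_save` and its two callback parameters are hypothetical, standing in for your own parser and for a call such as save_state from the full example.

```python
from typing import Callable


def process_then_save(
    body: str,
    parse: Callable[[str], object],
    save: Callable[[], None],
) -> bool:
    """Run the parser first; persist crawl state only if parsing succeeded."""
    try:
        parse(body)
    except Exception:
        # Broad catch for illustration only: leave stored state untouched so
        # the next run sees the page as changed and retries it.
        return False
    save()
    return True
```

In a real pipeline you would likely log the exception and narrow the except clause to the errors your parser is known to raise.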
2. Trusting Last-Modified blindly
Some servers never update it. Others update it on every deploy even when the page content is identical.
Use it as a hint, then confirm with a hash when needed.
3. Hashing raw HTML without cleanup
If the page injects dynamic tokens, analytics blobs, or timestamps, a raw hash becomes noisy.
Normalize first when possible.
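As a rough illustration of what normalizing can mean, the sketch below strips script and style blocks, HTML comments, and hidden form inputs (a common home for CSRF tokens), then collapses whitespace before hashing. The regular expressions are a deliberately simple assumption and will not cover every page; the function names are made up for this example.

```python
import hashlib
import re

# Simple patterns for common sources of per-request noise.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_HIDDEN_INPUT = re.compile(
    r"<input\b[^>]*type=[\"']hidden[\"'][^>]*>", re.IGNORECASE
)
_WHITESPACE = re.compile(r"\s+")


def normalize_html(html: str) -> str:
    """Remove noisy fragments and collapse runs of whitespace."""
    html = _SCRIPT_STYLE.sub("", html)
    html = _COMMENTS.sub("", html)
    html = _HIDDEN_INPUT.sub("", html)
    return _WHITESPACE.sub(" ", html).strip()


def normalized_hash(html: str) -> str:
    """SHA-256 of the normalized document."""
    return hashlib.sha256(normalize_html(html).encode("utf-8")).hexdigest()
```

Regex-based cleanup is fragile on malformed markup, so treat this as a starting point and move to a proper HTML parser if false positives persist.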
4. Using one recrawl policy for every page type
Homepages, product pages, article pages, and search pages do not change at the same rate.
Incremental web scraping works best when frequency and change detection are scoped by page type.
A simple production policy to start with
If you need a practical first version, use this:
- discover candidate URLs from sitemaps and list pages
- store state in SQLite
- send conditional GETs on every revisit
- hash the body when you get 200
- only trigger downstream processing for changed
That is enough to turn a brute-force crawler into a disciplined pipeline.
Bottom line
Incremental web scraping is not about being clever. It is about refusing to pay twice for the same unchanged page.
If you remember only three things, make them these:
- use cheap change signals first
- use a content hash as the final truth
- keep per-URL state so every run gets smarter
Once that is in place, you can scale more confidently because your scraper is no longer treating every page like it is new every time.