Web Scraping Caching: ETag + Last-Modified + Redis (When to Re-fetch vs Reuse)
Most scrapers waste money and risk bans for one reason:
They re-fetch pages that haven’t changed.
If you scrape product pages, docs, directories, or issue lists, you’ll see the same URLs over and over.
This guide shows how to implement web scraping caching that actually works in production:
- HTTP cache headers (ETag, Last-Modified)
- conditional requests (If-None-Match, If-Modified-Since)
- Redis cache keys + TTLs
- content hashing for sites that don’t give you cache headers
ProxiesAPI helps stabilize fetching — but the cheapest request is the one you don’t make. A good cache cuts proxy spend and lowers block risk by reducing request volume.
The 3 caching tiers (use all three)
Tier 1: In-memory (per run) — prevents duplicate fetches inside one crawl.
Tier 2: Persistent cache (Redis / disk) — prevents repeated fetches across runs.
Tier 3: Conditional HTTP requests — “give me the page only if it changed”.
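Tier 1 is the simplest of the three. A minimal sketch, assuming a hypothetical `RunCache` wrapper around whatever fetch function you already use (the `fake_fetch` stand-in below exists only so the example runs offline):

```python
from typing import Callable


class RunCache:
    """Per-run memo: each URL is fetched at most once per crawl."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                 # any callable returning a page body
        self._seen: dict[str, str] = {}     # url -> body, lives only for this run
        self.misses = 0                     # number of real fetches performed

    def get(self, url: str) -> str:
        if url not in self._seen:
            self.misses += 1
            self._seen[url] = self._fetch(url)
        return self._seen[url]


# Stand-in fetcher (an assumption for illustration; swap in your real one).
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


cache = RunCache(fake_fetch)
cache.get("https://example.com/a")
cache.get("https://example.com/a")  # served from memory, no second fetch
```

Because the dictionary dies with the process, this tier needs no TTL; it only removes duplicates that appear within one crawl.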
ETag and Last-Modified (the core idea)
Servers may return:
- ETag: "abc123"
- Last-Modified: Thu, 28 May 2026 10:12:00 GMT
On the next request, you can send:
- If-None-Match: "abc123"
- If-Modified-Since: Thu, 28 May 2026 10:12:00 GMT
If nothing changed, the server responds with 304 Not Modified and no body, so you reuse the copy you already have.
A cache-aware fetcher with Redis
This pattern stores per-URL metadata in Redis:
- last ETag
- last Last-Modified
- cached body (optional; depends on size)
import hashlib
import json
import os
import time

import requests
import redis  # pip install redis

REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
rdb = redis.Redis.from_url(REDIS_URL, decode_responses=True)
session = requests.Session()


def cache_key(url: str) -> str:
    # Hash the URL so keys stay short and free of awkward characters.
    h = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"scrape:cache:v1:{h}"


def get_meta(url: str) -> dict:
    raw = rdb.get(cache_key(url))
    return json.loads(raw) if raw else {}


def set_meta(url: str, meta: dict, ttl_seconds: int) -> None:
    rdb.setex(cache_key(url), ttl_seconds, json.dumps(meta))


def fetch_cached(url: str, *, ttl_seconds: int = 86400, timeout: int = 30) -> tuple[int, str | None]:
    meta = get_meta(url)
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    }
    # Send validators from the previous fetch, if we have any.
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]

    resp = session.get(url, headers=headers, timeout=timeout)

    if resp.status_code == 304:
        # Unchanged: keep the entry alive and reuse the cached body.
        rdb.expire(cache_key(url), ttl_seconds)
        return 304, meta.get("body")

    resp.raise_for_status()
    body = resp.text
    meta2 = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "fetched_at": int(time.time()),
        "body": body,  # optional; consider size limits
    }
    set_meta(url, meta2, ttl_seconds)
    return resp.status_code, body
When cache headers are missing: content hashing
If a site doesn’t send ETag/Last-Modified:
- fetch the body
- compute a hash
- only re-parse when the hash changes
import hashlib


def body_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
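Store the hash alongside the parsed output, and reuse that output when the hash is unchanged. This saves parsing and downstream writes, not the request itself.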
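The store-and-reuse step can be sketched as follows. The `parse_if_changed` helper and the in-process `_parsed` dictionary are assumptions for illustration; in production the same pair of values could sit in Redis next to the fetch metadata.

```python
import hashlib
from typing import Callable


def body_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Hypothetical store: url -> (hash of last body, parsed output).
_parsed: dict[str, tuple[str, dict]] = {}


def parse_if_changed(url: str, body: str, parse: Callable[[str], dict]) -> tuple[dict, bool]:
    """Return (parsed_output, reparsed). parse() runs only when the hash changes."""
    h = body_hash(body)
    cached = _parsed.get(url)
    if cached is not None and cached[0] == h:
        return cached[1], False          # same bytes as last time: reuse
    result = parse(body)
    _parsed[url] = (h, result)
    return result, True


# Toy parser standing in for real selector logic.
def toy_parse(body: str) -> dict:
    return {"length": len(body)}


first, reparsed_first = parse_if_changed("https://example.com/a", "<p>v1</p>", toy_parse)
second, reparsed_second = parse_if_changed("https://example.com/a", "<p>v1</p>", toy_parse)
```

One caveat worth testing on your own targets: pages that embed per-request values (timestamps, CSRF tokens) will hash differently on every fetch, so you may need to hash only the extracted region rather than the whole body.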
TTL strategy (the part most people get wrong)
Different pages change at different rates:
- home page / trending: minutes
- listings: hours
- product pages: hours → days
- evergreen docs: days → weeks
If you’re unsure, start with 1 day (86400 seconds), then measure change rate and adjust.
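One way to encode this is a lookup table plus a simple adjustment rule. This is a sketch only: the category names, the starting values, and the thresholds in `adjust_ttl` are illustrative assumptions, not measured recommendations.

```python
# Hypothetical starting TTLs per page type, in seconds.
TTL_BY_PAGE_TYPE = {
    "trending": 5 * 60,      # minutes
    "listing": 3600,         # hours
    "product": 6 * 3600,     # hours to days
    "docs": 7 * 86400,       # days to weeks
}
DEFAULT_TTL = 86400          # 1 day when the page type is unknown


def ttl_for(page_type: str) -> int:
    return TTL_BY_PAGE_TYPE.get(page_type, DEFAULT_TTL)


def adjust_ttl(current_ttl: int, changed: int, checks: int,
               *, min_ttl: int = 60, max_ttl: int = 30 * 86400) -> int:
    """Halve the TTL if most re-checks found changes; double it if almost none did."""
    if checks == 0:
        return current_ttl
    rate = changed / checks
    if rate > 0.5:
        new_ttl = current_ttl // 2
    elif rate < 0.1:
        new_ttl = current_ttl * 2
    else:
        new_ttl = current_ttl
    # Clamp so the TTL never collapses to zero or grows without bound.
    return max(min_ttl, min(max_ttl, new_ttl))
```

Here `changed` is the number of re-fetches that returned new content (a 200 with a different body) and `checks` is the total number of re-fetches over the measurement window.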
Cache keys that won’t betray you
Your cache key must include everything that changes the response:
- URL (including query params)
- locale/geo variants
- auth state (if any)
If you cache localized content, include a variant dimension in the key (e.g., locale).
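A sketch of a key builder that folds those dimensions in. The `variant_cache_key` name, its `locale`/`geo`/`auth` parameters, and the `v2` prefix are assumptions for illustration; sorting query parameters is also an assumption that holds for most sites but not for ones where parameter order matters.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def variant_cache_key(url: str, *, locale: str = "", geo: str = "", auth: str = "anon") -> str:
    parts = urlsplit(url)
    # Sort query params so ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host; drop the fragment, which is never sent to the server.
    normalized = urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, query, ""))
    # Every dimension that can change the response goes into the hashed string.
    raw = "\n".join([normalized, locale, geo, auth])
    h = hashlib.sha1(raw.encode("utf-8")).hexdigest()
    return f"scrape:cache:v2:{h}"


key_en = variant_cache_key("https://example.com/p?a=1&b=2", locale="en-US")
key_de = variant_cache_key("https://example.com/p?a=1&b=2", locale="de-DE")
```

The two keys differ, so the German and English versions of the same URL never overwrite each other.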
Where ProxiesAPI fits
ProxiesAPI helps when you do need to fetch.
Caching helps you avoid fetching in the first place.
Together:
- cache aggressively (especially stable pages)
- when you miss cache, fetch through a stable layer (ProxiesAPI)
- use retries/backoff and soft-block detection
No overclaims: caching won’t fix broken selectors, and ProxiesAPI won’t fix missing cache keys — but both reduce the request volume that triggers blocks and drives cost.