Web Scraping Caching: ETag + Last-Modified + Redis (When to Re-fetch vs Reuse)

Most scrapers waste money and risk bans for one reason:

They re-fetch pages that haven’t changed.

If you scrape product pages, docs, directories, or issue lists, you’ll see the same URLs over and over.

This guide shows how to implement web scraping caching that actually works in production:

  • HTTP cache headers (ETag, Last-Modified)
  • conditional requests (If-None-Match, If-Modified-Since)
  • Redis cache keys + TTLs
  • content hashing for sites that don’t give you cache headers

Reduce crawl cost by caching aggressively with ProxiesAPI

ProxiesAPI helps stabilize fetching — but the cheapest request is the one you don’t make. A good cache cuts proxy spend and lowers block risk by reducing request volume.


The 3 caching tiers (use all three)

Tier 1: In-memory (per run) — prevents duplicate fetches inside one crawl.

Tier 2: Persistent cache (Redis / disk) — prevents repeated fetches across runs.

Tier 3: Conditional HTTP requests — “give me the page only if it changed”.
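
As a rough illustration of Tier 1, the sketch below keeps fetched bodies in a dict for the lifetime of one crawl. The RunCache class and the fake_fetch stand-in are hypothetical names used only for this example; in a real crawl the fetch function would be whatever HTTP call you already use.

```python
from typing import Callable


class RunCache:
    """Tier 1 sketch: remembers bodies fetched during a single crawl run."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # hypothetical fetch function: url -> body
        self._bodies = {}    # url -> body, lives only as long as this object
        self.misses = 0      # number of times we actually called fetch

    def get(self, url: str) -> str:
        if url not in self._bodies:
            self.misses += 1
            self._bodies[url] = self._fetch(url)
        return self._bodies[url]


# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


cache = RunCache(fake_fetch)
cache.get("https://example.com/a")
cache.get("https://example.com/a")  # served from memory, no second fetch
cache.get("https://example.com/b")
print(cache.misses)  # 2
```

Tiers 2 and 3 then only need to handle URLs that miss this in-memory layer.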


ETag and Last-Modified (the core idea)

Servers may return:

  • ETag: "abc123"
  • Last-Modified: Thu, 28 May 2026 10:12:00 GMT

On the next request, you can send:

  • If-None-Match: "abc123"
  • If-Modified-Since: Thu, 28 May 2026 10:12:00 GMT

If nothing changed, the server responds 304 Not Modified with an empty body, so you reuse your cached copy and skip the download.
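
The exchange above can be sketched without touching the network. The helper names conditional_headers and should_reuse_cached are made up for this illustration; the header names are the standard HTTP ones.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build validator headers from whatever the previous response provided."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def should_reuse_cached(status_code: int) -> bool:
    """304 means the cached copy is still valid; the response carries no body."""
    return status_code == 304


# Headers as they might appear on a first, unconditional response.
first_response_headers = {
    "ETag": '"abc123"',
    "Last-Modified": "Thu, 28 May 2026 10:12:00 GMT",
}

next_request_headers = conditional_headers(
    first_response_headers.get("ETag"),
    first_response_headers.get("Last-Modified"),
)
print(next_request_headers)
```

Sending both validators is safe: when If-None-Match is present, a compliant server evaluates it and ignores If-Modified-Since.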


A cache-aware fetcher with Redis

This pattern stores per-URL metadata in Redis:

  • last ETag
  • last Last-Modified
  • cached body (optional; depends on size)

import hashlib
import json
import os
import time
import requests

import redis  # pip install redis


REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
rdb = redis.Redis.from_url(REDIS_URL, decode_responses=True)

session = requests.Session()


def cache_key(url: str) -> str:
    h = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"scrape:cache:v1:{h}"


def get_meta(url: str) -> dict:
    raw = rdb.get(cache_key(url))
    return json.loads(raw) if raw else {}


def set_meta(url: str, meta: dict, ttl_seconds: int) -> None:
    rdb.setex(cache_key(url), ttl_seconds, json.dumps(meta))


def fetch_cached(url: str, *, ttl_seconds: int = 86400, timeout: int = 30) -> tuple[int, str | None]:
    meta = get_meta(url)

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    }

    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]

    resp = session.get(url, headers=headers, timeout=timeout)

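    if resp.status_code == 304:
        # Unchanged on the server: extend the TTL so the stored validators
        # and body do not expire, then reuse the cached body.
        rdb.expire(cache_key(url), ttl_seconds)
        return 304, meta.get("body")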

    resp.raise_for_status()
    body = resp.text

    meta2 = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "fetched_at": int(time.time()),
        "body": body,  # optional; consider size limits
    }
    set_meta(url, meta2, ttl_seconds)

    return resp.status_code, body
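
The fetcher above always contacts the server, even if only for a 304. One possible refinement, sketched here under assumptions, is to decide up front whether to reuse, revalidate, or fully re-fetch based on the stored metadata. The function name cache_decision and the fresh_for_seconds parameter are hypothetical; the meta fields match the record written by fetch_cached.

```python
import time
from typing import Optional


def cache_decision(meta: dict, fresh_for_seconds: int, now: Optional[float] = None) -> str:
    """Return 'reuse', 'revalidate', or 'fetch' for a cached metadata record."""
    now = time.time() if now is None else now
    if not meta or meta.get("body") is None:
        return "fetch"        # nothing usable cached
    age = now - meta.get("fetched_at", 0)
    if age < fresh_for_seconds:
        return "reuse"        # recent enough: skip the network entirely
    if meta.get("etag") or meta.get("last_modified"):
        return "revalidate"   # send a conditional request, hope for a 304
    return "fetch"            # no validators stored: plain re-fetch


meta = {"etag": '"abc123"', "last_modified": None, "fetched_at": 1000, "body": "<html>"}
print(cache_decision(meta, 300, now=1100))  # reuse
print(cache_decision(meta, 300, now=2000))  # revalidate
print(cache_decision({}, 300, now=2000))    # fetch
```

A short freshness window (minutes) in front of the conditional request removes even the cheap 304 round trips for URLs that are hit repeatedly.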

When cache headers are missing: content hashing

If a site doesn’t send ETag/Last-Modified:

  1. fetch the body
  2. compute a hash
  3. only re-parse when the hash changes

import hashlib


def body_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

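Store the hash alongside the parsed output and reuse that output when the hash is unchanged. You still pay for the request, but you skip parsing and downstream writes.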
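
A minimal sketch of that flow, using a plain dict as a stand-in for Redis. The names parse_if_changed and count_links are hypothetical, and the parser is deliberately trivial.

```python
import hashlib
from typing import Callable


def body_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def parse_if_changed(url: str, body: str, store: dict, parse: Callable[[str], dict]):
    """Return (parsed_output, was_reparsed); reuse stored output if the hash matches."""
    digest = body_hash(body)
    entry = store.get(url)
    if entry and entry["hash"] == digest:
        return entry["parsed"], False
    parsed = parse(body)
    store[url] = {"hash": digest, "parsed": parsed}
    return parsed, True


# Toy parser for the demo: counts anchor tags.
def count_links(body: str) -> dict:
    return {"links": body.count("<a ")}


store = {}
url = "https://example.com/list"
page_v1 = "<a href='/x'>x</a>"
page_v2 = "<a href='/x'>x</a><a href='/y'>y</a>"

first = parse_if_changed(url, page_v1, store, count_links)
second = parse_if_changed(url, page_v1, store, count_links)  # same body: reused
third = parse_if_changed(url, page_v2, store, count_links)   # changed: re-parsed
print(first, second, third)
```

Pages that embed timestamps, CSRF tokens, or rotating ad markup will hash differently on every fetch, so in practice you may need to hash only the extracted region you care about rather than the whole body.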


TTL strategy (the part most people get wrong)

Different pages change at different rates:

  • home page / trending: minutes
  • listings: hours
  • product pages: hours → days
  • evergreen docs: days → weeks

If you’re unsure, start with 1 day (86400 seconds), then measure change rate and adjust.
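
One way to encode these tiers is a small lookup from URL path to TTL. The path prefixes and the exact durations below are illustrative assumptions, not values from any particular site; adjust them after measuring how often your targets actually change.

```python
from urllib.parse import urlparse

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

# Hypothetical path prefixes; replace with the patterns of the site you crawl.
TTL_RULES = [
    ("/trending", 10 * MINUTE),
    ("/search", 3 * HOUR),
    ("/product/", 12 * HOUR),
    ("/docs/", 7 * DAY),
]
DEFAULT_TTL = DAY  # the 1-day starting point suggested above


def ttl_for(url: str) -> int:
    """Pick a TTL in seconds based on the URL path."""
    path = urlparse(url).path
    if path in ("", "/"):
        return 10 * MINUTE  # home page changes quickly
    for prefix, ttl in TTL_RULES:
        if path.startswith(prefix):
            return ttl
    return DEFAULT_TTL


print(ttl_for("https://example.com/"))            # 600
print(ttl_for("https://example.com/docs/intro"))  # 604800
print(ttl_for("https://example.com/about"))       # 86400
```

The result can be passed straight to the ttl_seconds argument of fetch_cached.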


Cache keys that won’t betray you

Your cache key must include everything that changes the response:

  • URL (including query params)
  • locale/geo variants
  • auth state (if any)

If you cache localized content, include a variant dimension in the key (e.g., locale). Note that the cache_key() shown earlier hashes only the URL, so it would mix variants as written.
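
A variant-aware key could look like the sketch below. The function name variant_cache_key and the locale and auth parameters are hypothetical. Sorting query parameters assumes the target site treats parameter order as irrelevant, which is common but not guaranteed.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def variant_cache_key(url: str, *, locale: str = "default", auth: str = "anon") -> str:
    """Cache key covering the normalized URL plus the variant dimensions."""
    parts = urlsplit(url)
    # Assumption: query parameter order does not change the response.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so it is dropped from the key.
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )
    raw = "|".join([normalized, locale, auth])
    return "scrape:cache:v1:" + hashlib.sha1(raw.encode("utf-8")).hexdigest()


k1 = variant_cache_key("https://example.com/p?x=1&y=2", locale="en")
k2 = variant_cache_key("https://EXAMPLE.com/p?y=2&x=1#reviews", locale="en")
k3 = variant_cache_key("https://example.com/p?x=1&y=2", locale="de")
print(k1 == k2, k1 == k3)  # True False
```

Bumping the "v1" segment of the prefix is a simple way to invalidate every entry when the stored record format changes.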


Where ProxiesAPI fits

ProxiesAPI helps when you do need to fetch.

Caching helps you avoid fetching in the first place.

Together:

  • cache aggressively (especially stable pages)
  • when you miss cache, fetch through a stable layer (ProxiesAPI)
  • use retries/backoff and soft-block detection
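
For the last bullet, here is a rough sketch of exponential backoff with jitter and a naive soft-block check. The marker strings and status codes are guesses for illustration only; real block pages vary by site. The point for caching is that a response flagged this way should not be written to the cache.

```python
import random
from typing import Optional

# Hypothetical phrases that often show up on block or challenge pages.
SOFT_BLOCK_MARKERS = ("captcha", "verify you are a human", "unusual traffic")


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> list:
    """Delay before each retry: random value in [0, min(cap, base * 2**i)]."""
    if rng is None:
        rng = random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]


def looks_soft_blocked(status_code: int, body: str) -> bool:
    """Heuristic only: rate-limit/forbidden statuses or a known marker in the body."""
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in SOFT_BLOCK_MARKERS)


delays = backoff_delays(4, rng=random.Random(0))
print([round(d, 2) for d in delays])
print(looks_soft_blocked(200, "Please verify you are a human"))  # True
print(looks_soft_blocked(200, "<html>ok</html>"))                # False
```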

No overclaims: caching won’t fix broken selectors, and ProxiesAPI won’t fix missing cache keys — but both reduce the request volume that triggers blocks and drives cost.
