Web Scraping Caching: ETag + Last-Modified + Redis (When to Re-fetch vs Reuse)

May 31, 2026 · guide · #web-scraping, #caching, #etag, #last-modified, #redis, #python, #requests

Most scrapers waste money and risk bans for one reason:

They re-fetch pages that haven’t changed.

If you scrape product pages, docs, directories, or issue lists, you’ll see the same URLs over and over.

This guide shows how to implement web scraping caching that actually works in production:

HTTP cache headers (ETag, Last-Modified)
conditional requests (If-None-Match, If-Modified-Since)
Redis cache keys + TTLs
content hashing for sites that don’t give you cache headers


Reduce crawl cost by caching aggressively with ProxiesAPI

ProxiesAPI helps stabilize fetching — but the cheapest request is the one you don’t make. A good cache cuts proxy spend and lowers block risk by reducing request volume.


The 3 caching tiers (use all three)

Tier 1: In-memory (per run) — prevents duplicate fetches inside one crawl.

Tier 2: Persistent cache (Redis / disk) — prevents repeated fetches across runs.

Tier 3: Conditional HTTP requests — “give me the page only if it changed”.
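As a rough illustration of Tier 1, here is a minimal sketch of a per-run cache. The `RunCache` class and its `fetch` callable are hypothetical names used for illustration; the stand-in fetcher below makes no network calls.

```python
from typing import Callable, Dict


class RunCache:
    """Tier 1 sketch: remembers bodies fetched during a single crawl run."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                # hypothetical fetcher: url -> body
        self._bodies: Dict[str, str] = {}  # lives only as long as this run
        self.network_calls = 0             # counter kept for illustration

    def get(self, url: str) -> str:
        # Call the underlying fetcher only the first time a URL is seen.
        if url not in self._bodies:
            self._bodies[url] = self._fetch(url)
            self.network_calls += 1
        return self._bodies[url]


# Usage with a stand-in fetcher (no network involved).
cache = RunCache(lambda url: f"<html>{url}</html>")
cache.get("https://example.com/a")
cache.get("https://example.com/a")  # served from memory
cache.get("https://example.com/b")
print(cache.network_calls)  # 2
```

Tiers 2 and 3 sit behind this layer: a miss here falls through to the persistent cache, and only then to a conditional request.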

ETag and Last-Modified (the core idea)

Servers may return:

ETag: "abc123"
Last-Modified: Thu, 28 May 2026 10:12:00 GMT

On the next request, you can send:

If-None-Match: "abc123"
If-Modified-Since: Thu, 28 May 2026 10:12:00 GMT

If nothing changed, the server responds 304 Not Modified with an empty body, and you reuse your cached copy.
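The header exchange above can be sketched as a small pure function. The name `conditional_headers` is an assumption made for illustration; it only builds the request headers from whatever validators you stored and sends nothing.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build conditional request headers from previously stored validators."""
    headers: Dict[str, str] = {}
    if etag:
        # Send the ETag back exactly as received, quotes included.
        headers["If-None-Match"] = etag
    if last_modified:
        # Send the Last-Modified value back verbatim; do not reformat the date.
        headers["If-Modified-Since"] = last_modified
    return headers


print(conditional_headers('"abc123"', None))
# {'If-None-Match': '"abc123"'}
```

With no stored validators the function returns an empty dict, so the request is a normal unconditional GET and a 304 cannot come back.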

A cache-aware fetcher with Redis

This pattern stores per-URL metadata in Redis:

last ETag
last Last-Modified
cached body (optional; depends on size)

import hashlib
import json
import os
import time
import requests

import redis  # pip install redis


REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
rdb = redis.Redis.from_url(REDIS_URL, decode_responses=True)

session = requests.Session()


def cache_key(url: str) -> str:
    h = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"scrape:cache:v1:{h}"


def get_meta(url: str) -> dict:
    raw = rdb.get(cache_key(url))
    return json.loads(raw) if raw else {}


def set_meta(url: str, meta: dict, ttl_seconds: int) -> None:
    rdb.setex(cache_key(url), ttl_seconds, json.dumps(meta))


def fetch_cached(url: str, *, ttl_seconds: int = 86400, timeout: int = 30) -> tuple[int, str | None]:
    meta = get_meta(url)

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    }

    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]

    resp = session.get(url, headers=headers, timeout=timeout)

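    if resp.status_code == 304:
        # Unchanged: refresh the TTL so the stored validators and body
        # don't expire and force a full re-download later.
        rdb.expire(cache_key(url), ttl_seconds)
        return 304, meta.get("body")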

    resp.raise_for_status()
    body = resp.text

    meta2 = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "fetched_at": int(time.time()),
        "body": body,  # optional; consider size limits
    }
    set_meta(url, meta2, ttl_seconds)

    return resp.status_code, body
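The fetcher above always makes a request, even if it is only a conditional one. One possible refinement, sketched here under assumptions, is to skip the network entirely while an entry is still young. The `is_fresh` and `decide` helpers and the `max_age_seconds` parameter are hypothetical; they read the same `fetched_at`, `body`, `etag`, and `last_modified` fields that the metadata dict stores.

```python
import time
from typing import Optional


def is_fresh(meta: dict, max_age_seconds: int, now: Optional[float] = None) -> bool:
    """True when the cached entry is young enough to reuse without any request."""
    fetched_at = meta.get("fetched_at")
    if fetched_at is None or meta.get("body") is None:
        return False
    current = time.time() if now is None else now
    return (current - fetched_at) < max_age_seconds


def decide(meta: dict, max_age_seconds: int, now: Optional[float] = None) -> str:
    """Pick one of three actions for a URL: reuse, revalidate, or refetch."""
    if is_fresh(meta, max_age_seconds, now):
        return "reuse"       # no request at all
    if meta.get("etag") or meta.get("last_modified"):
        return "revalidate"  # conditional request, hoping for a 304
    return "refetch"         # plain GET, nothing to validate against


print(decide({"fetched_at": 900, "body": "x"}, 3600, now=1000))  # reuse
```

Calling `decide` before the fetcher would let a crawl reuse young entries for free and spend conditional requests only on older ones.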

When cache headers are missing: content hashing

If a site doesn’t send ETag/Last-Modified:

fetch the body
compute a hash
only re-parse when the hash changes

import hashlib


def body_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

Store the hash alongside the parsed output and reuse that output when the hash is unchanged. This saves parsing and downstream writes, not the request itself.
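A minimal sketch of that hash check follows, with an in-memory dict standing in for Redis. The `ParseCache` class and its `parse` callable are hypothetical names used only for illustration.

```python
import hashlib
from typing import Any, Callable, Dict, Tuple


def body_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class ParseCache:
    """Re-parse a page only when the hash of its body changes."""

    def __init__(self, parse: Callable[[str], Any]) -> None:
        self._parse = parse                           # hypothetical parser: body -> data
        self._seen: Dict[str, Tuple[str, Any]] = {}   # url -> (hash, parsed output)
        self.parse_calls = 0                          # counter kept for illustration

    def get(self, url: str, body: str) -> Any:
        digest = body_hash(body)
        cached = self._seen.get(url)
        if cached is not None and cached[0] == digest:
            return cached[1]                          # unchanged: reuse parsed output
        parsed = self._parse(body)
        self.parse_calls += 1
        self._seen[url] = (digest, parsed)
        return parsed


pc = ParseCache(lambda body: body.upper())
pc.get("https://example.com/a", "<p>v1</p>")
pc.get("https://example.com/a", "<p>v1</p>")  # same hash, parser not called
pc.get("https://example.com/a", "<p>v2</p>")  # hash changed, parser runs again
print(pc.parse_calls)  # 2
```

Pages that embed timestamps, CSRF tokens, or rotating ad markup will hash differently on every fetch, so it may work better to hash only the extracted region you care about.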

TTL strategy (the part most people get wrong)

Different pages change at different rates:

home page / trending: minutes
listings: hours
product pages: hours → days
evergreen docs: days → weeks

If you’re unsure, start with 1 day (86400 seconds), then measure change rate and adjust.
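One way to encode these tiers is a lookup table with the 1-day fallback. The page-type labels and the specific values below are assumptions picked from within the ranges above, not measured numbers.

```python
# Hypothetical page-type labels; the values are illustrative starting points.
TTL_BY_PAGE_TYPE = {
    "trending": 5 * 60,     # minutes
    "listing": 3 * 3600,    # hours
    "product": 12 * 3600,   # hours to days
    "docs": 7 * 86400,      # days to weeks
}
DEFAULT_TTL = 86400         # 1 day when the page type is unknown


def ttl_for(page_type: str) -> int:
    """Return the TTL in seconds for a page type, defaulting to one day."""
    return TTL_BY_PAGE_TYPE.get(page_type, DEFAULT_TTL)


print(ttl_for("trending"))  # 300
print(ttl_for("unknown"))   # 86400
```

The result can be passed as `ttl_seconds` to the cache-aware fetcher once you have classified the URL.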

Cache keys that won’t betray you

Your cache key must include everything that changes the response:

URL (including query params)
locale/geo variants
auth state (if any)

If you cache localized content, include a variant dimension in the key (e.g., locale).
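A sketch of a variant-aware key, under stated assumptions: the `variant_cache_key` name and the `locale` and `auth` parameters are hypothetical, and sorting query parameters assumes the target site ignores their order.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def variant_cache_key(url: str, *, locale: str = "", auth: str = "anon") -> str:
    """Build a cache key from the normalized URL plus response-changing variants."""
    parts = urlsplit(url)
    # Assumption: parameter order does not change the response, so sort it
    # to let ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Host names are case-insensitive; fragments are never sent to the server.
    normalized = urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))
    raw = "|".join([normalized, locale, auth])
    return "scrape:cache:v1:" + hashlib.sha1(raw.encode("utf-8")).hexdigest()


same = variant_cache_key("https://example.com/p?a=1&b=2") == variant_cache_key(
    "https://EXAMPLE.com/p?b=2&a=1"
)
print(same)  # True
```

Passing `locale="en-US"` and `locale="de-DE"` for the same URL yields two different keys, so localized variants never overwrite each other.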

Where ProxiesAPI fits

ProxiesAPI helps when you do need to fetch.

Caching helps you avoid fetching in the first place.

Together:

cache aggressively (especially stable pages)
when you miss cache, fetch through a stable layer (ProxiesAPI)
use retries/backoff and soft-block detection
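The last item can be sketched roughly as follows. The marker strings are assumptions; real block pages vary by site, so treat this as a starting heuristic rather than a reliable detector.

```python
from typing import List

# Assumed marker strings; tune these per target site.
SOFT_BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")


def looks_soft_blocked(status_code: int, body: str) -> bool:
    """Heuristic: a 200 response whose body looks like a block page."""
    lowered = body.lower()
    return status_code == 200 and any(marker in lowered for marker in SOFT_BLOCK_MARKERS)


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    # In practice you would usually add random jitter to each delay.
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]
print(looks_soft_blocked(200, "<h1>Please solve this CAPTCHA</h1>"))  # True
```

Whatever heuristic you use, check it before writing to the cache: storing a block page means every later cache hit serves that block page.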

No overclaims: caching won’t fix broken selectors, and ProxiesAPI won’t fix missing cache keys — but both reduce the request volume that triggers blocks and drives cost.
