Crawl Budget for Web Scraping: How to Prioritize URLs and Avoid Waste

Crawl budget for web scraping is the discipline of deciding which URLs deserve a request right now and which ones do not.

That sounds obvious, but most scrapers waste a large share of their requests on fetches that return nothing new:

  • recrawling dormant listings every hour
  • hitting search result pages after the relevant items have already been extracted
  • retrying permanently broken URLs like they are temporary failures
  • treating high-value pages and low-value pages exactly the same

Crawl budget is not an SEO-only concept. If you want a crawler that stays cheap, fast, and useful, treat it as an engineering constraint.

Use your requests where they matter instead of burning them evenly

Proxy capacity and retry logic help, but they do not solve bad prioritization. ProxiesAPI works best when your crawler already knows which URLs are worth another fetch.


What crawl budget means in scraping

For a scraper, crawl budget is the combination of:

  • available requests per hour or day
  • worker capacity
  • proxy bandwidth or credit usage
  • target-site tolerance
  • the business value of each fetched page

The mistake is assuming every discovered URL should be fetched immediately.

In practice, a good crawler makes tradeoffs. A product page with price changes every few hours deserves more attention than a terms-of-service page that changes once a year.


Start with URL tiers, not one giant queue

Before writing any code, classify URL types by value and expected volatility.

| URL type | Example | Business value | Change frequency | Suggested priority |
|---|---|---|---|---|
| product or listing detail | /product/123 | high | medium to high | highest |
| category or search result | /search?q=ssd | medium | high | medium |
| seller or company profile | /store/acme | medium | low to medium | medium |
| help or policy pages | /shipping-policy | low | low | lowest |
| known broken or duplicate URLs | redirect loops, 404s | none | none | skip |

This tiering alone cuts waste because you stop pretending that every page is equally valuable.
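
As a rough illustration, the tiers above could be assigned by matching URL paths against a small rule table. The patterns, the `TIER_RULES` name, and the numeric tiers below are assumptions made up for this sketch; a real site needs its own rules.

```python
import re
from urllib.parse import urlparse

# Hypothetical path patterns modeled on the example URLs in the table.
# Each rule is (pattern, url_type, tier); a lower tier is fetched first.
TIER_RULES = [
    (re.compile(r"^/product/\d+$"), "product", 0),
    (re.compile(r"^/search$"), "search", 1),
    (re.compile(r"^/store/[^/]+$"), "seller", 1),
    (re.compile(r"^/(shipping-policy|help/.+)$"), "help", 2),
]

# Assumption: anything unrecognized goes to the lowest tier for later review.
UNKNOWN_TIER = 3


def classify_url(url: str) -> tuple[str, int]:
    """Return (url_type, tier) for a URL based on its path only."""
    # Ignore the query string and a trailing slash when matching.
    path = urlparse(url).path.rstrip("/") or "/"
    for pattern, url_type, tier in TIER_RULES:
        if pattern.match(path):
            return url_type, tier
    return "unknown", UNKNOWN_TIER
```

Keeping the rules in one table makes it cheap to review how a new URL pattern will be treated before it reaches the queue.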


The four signals that should drive priority

The best recrawl policies usually combine four signals:

| Signal | Question it answers | Example |
|---|---|---|
| business value | does this page matter if it changes? | price page vs FAQ page |
| freshness risk | how likely is it to change soon? | airline search result vs static docs page |
| extraction yield | does this page reveal new entities or links? | category page with pagination |
| failure history | has this URL been worth the pain? | chronic 403 or empty page |

One simple scoring formula:

priority = business_value + freshness_risk + extraction_yield - failure_penalty

You do not need a fancy ML model. A practical heuristic beats a uniform crawl every time.


A simple scoring model you can ship quickly

Here is a small Python model for assigning recrawl priority:

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class UrlRecord:
    url: str
    url_type: str
    last_fetched_at: datetime | None
    last_changed_at: datetime | None
    consecutive_failures: int
    business_value: int   # 1..5
    extraction_yield: int # 1..5


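def freshness_risk(record: UrlRecord, now: datetime) -> int:
    """Score 1..5: the more recently the page last changed, the higher the risk."""
    # No change history yet: assume the page is volatile until proven otherwise.
    if record.last_changed_at is None:
        return 5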

    age = now - record.last_changed_at
    if age <= timedelta(hours=6):
        return 5
    if age <= timedelta(days=1):
        return 4
    if age <= timedelta(days=3):
        return 3
    if age <= timedelta(days=7):
        return 2
    return 1


def failure_penalty(record: UrlRecord) -> int:
    # Each consecutive failure costs one point, capped at 4.
    return min(record.consecutive_failures, 4)


def priority_score(record: UrlRecord, now: datetime) -> int:
    return (
        record.business_value
        + freshness_risk(record, now)
        + record.extraction_yield
        - failure_penalty(record)
    )

This is deliberately simple. The goal is not mathematical beauty. The goal is a stable, explainable ordering.


Recrawl intervals should be earned

Do not assign the same recrawl interval to everything. Let pages earn their frequency based on observed change rates.

| Page behavior | Recommended recrawl |
|---|---|
| changes multiple times per day | every 15 to 60 minutes |
| changes daily | every 6 to 24 hours |
| changes weekly | every 2 to 7 days |
| rarely changes | every 14 to 30 days |
| never produced useful data | archive or skip |

This sounds conservative, but it is how you free capacity for the pages that actually move.
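
One way to turn the table above into code is to estimate changes per day from history and pick an interval inside the matching band. This is a sketch under assumptions: the function name, its parameters, the rate thresholds, and the specific interval chosen within each band are illustrative, not prescribed values.

```python
from datetime import timedelta
from typing import Optional

# Assumption: require a few fetches before declaring a page useless.
MIN_FETCHES_BEFORE_SKIP = 5


def recrawl_interval(
    changes_seen: int,
    days_observed: float,
    fetches: int,
    records_extracted: int,
) -> Optional[timedelta]:
    """Return the next recrawl interval, or None to archive/skip the URL."""
    # "Never produced useful data": enough attempts, nothing extracted.
    if fetches >= MIN_FETCHES_BEFORE_SKIP and records_extracted == 0:
        return None
    # No history yet: start in the "daily" band and let evidence adjust it.
    if days_observed <= 0:
        return timedelta(hours=12)

    changes_per_day = changes_seen / days_observed
    if changes_per_day >= 2:      # changes multiple times per day
        return timedelta(minutes=30)
    if changes_per_day >= 0.5:    # changes roughly daily
        return timedelta(hours=12)
    if changes_per_day >= 0.1:    # changes roughly weekly
        return timedelta(days=3)
    return timedelta(days=21)     # rarely changes
```

For example, a page that changed 12 times over 3 days lands in the 30-minute band, while a page with no changes in 30 days drops to a 21-day interval.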


Use discovery crawls and refresh crawls differently

One crawler usually does two very different jobs:

  1. discovery: find new URLs
  2. refresh: revisit known high-value URLs

If you mix them into one undifferentiated queue, discovery pages can starve your important refreshes.

The better pattern is:

  • a light discovery queue for category pages, sitemaps, or feeds
  • a separate refresh queue for known valuable detail pages
  • different rate limits and concurrency settings for each

This is one of the fastest ways to improve crawl budget without buying more infrastructure.
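
A minimal sketch of the split, assuming a fixed per-cycle request budget and a reserved share for refresh work. The function name `plan_cycle` and the 80/20 default are assumptions for illustration; separate rate limits and concurrency settings per queue would sit on top of this.

```python
from collections import deque
from typing import Deque


def plan_cycle(
    refresh: Deque[str],
    discovery: Deque[str],
    budget: int,
    refresh_share: float = 0.8,
) -> list[tuple[str, str]]:
    """Pick up to `budget` URLs for one cycle as (queue_name, url) pairs."""
    refresh_quota = round(budget * refresh_share)
    plan: list[tuple[str, str]] = []

    # Refresh work is served first, up to its reserved quota.
    while refresh and len(plan) < refresh_quota:
        plan.append(("refresh", refresh.popleft()))

    # Discovery gets the remainder, plus any refresh quota left unused.
    while discovery and len(plan) < budget:
        plan.append(("discovery", discovery.popleft()))

    # If discovery ran dry, hand the leftover budget back to refresh.
    while refresh and len(plan) < budget:
        plan.append(("refresh", refresh.popleft()))

    return plan
```

Because each queue has its own quota, a burst of newly discovered category pages cannot crowd out the detail pages that need refreshing.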


Track "change rate," not just "fetched at"

Many systems store only last_fetched_at. That is not enough.

What you really want is:

  • last_fetched_at
  • last_changed_at
  • last_status_code
  • content_hash
  • consecutive_failures
  • records_extracted

If the page has not changed in 20 consecutive fetches, the system should recrawl it less often. If the page changed on three straight visits, it should move up in priority.

That means your budget responds to evidence instead of guesswork.
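
The fields listed above can be kept current with a small update step after every fetch. The sketch below is one possible shape: `FetchState`, `record_fetch`, and the extra `unchanged_streak` counter are hypothetical names introduced here, and hashing the raw body is an assumption that only holds when the page has no per-request noise.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FetchState:
    last_fetched_at: Optional[datetime] = None
    last_changed_at: Optional[datetime] = None
    last_status_code: Optional[int] = None
    content_hash: Optional[str] = None
    consecutive_failures: int = 0
    records_extracted: int = 0
    unchanged_streak: int = 0  # hypothetical helper, not in the list above


def record_fetch(
    state: FetchState,
    status_code: int,
    body: bytes,
    records_extracted: int,
    now: datetime,
) -> bool:
    """Update `state` in place and return True if the content changed."""
    state.last_fetched_at = now
    state.last_status_code = status_code

    # Treat any 4xx/5xx as a failure and leave the change history untouched.
    if status_code >= 400:
        state.consecutive_failures += 1
        return False

    state.consecutive_failures = 0
    state.records_extracted = records_extracted

    new_hash = hashlib.sha256(body).hexdigest()
    changed = new_hash != state.content_hash
    if changed:
        state.content_hash = new_hash
        state.last_changed_at = now
        state.unchanged_streak = 0
    else:
        state.unchanged_streak += 1
    return changed
```

If the target embeds timestamps, CSRF tokens, or rotating ads, hashing the extracted fields instead of the raw body is likely to give a more honest change signal.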


Example: prioritize a frontier with a heap

Here is a small example that turns priority scores into a fetch order:

import heapq
from datetime import datetime, timezone


def build_frontier(records: list[UrlRecord]) -> list[tuple[int, str]]:
    now = datetime.now(timezone.utc)
    heap: list[tuple[int, str]] = []

    for record in records:
        score = priority_score(record, now)
        # heapq is a min-heap, so push the negated score to pop the highest priority first.
        heapq.heappush(heap, (-score, record.url))

    return heap


records = [
    UrlRecord(
        url="https://example.com/product/1",
        url_type="product",
        last_fetched_at=None,
        last_changed_at=None,
        consecutive_failures=0,
        business_value=5,
        extraction_yield=4,
    ),
    UrlRecord(
        url="https://example.com/help/refunds",
        url_type="help",
        last_fetched_at=None,
        last_changed_at=None,
        consecutive_failures=0,
        business_value=1,
        extraction_yield=1,
    ),
]

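frontier = build_frontier(records)
# The product page scores 5 + 5 + 4 - 0 = 14, so it pops first (score is stored negated).
print(heapq.heappop(frontier))  # (-14, 'https://example.com/product/1')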

That same pattern works whether you have 200 URLs or 20 million.
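
To connect the frontier back to an actual budget, a cycle can pop only as many URLs as it can afford. This is a self-contained sketch; the `take_top` name and the `(score, url)` input shape are assumptions chosen for the example.

```python
import heapq


def take_top(scored_urls: list[tuple[int, str]], budget: int) -> list[str]:
    """Return up to `budget` URLs, highest score first."""
    # Negate scores because heapq is a min-heap.
    heap = [(-score, url) for score, url in scored_urls]
    heapq.heapify(heap)  # builds the heap in O(n)

    selected: list[str] = []
    while heap and len(selected) < budget:
        _, url = heapq.heappop(heap)
        selected.append(url)
    return selected
```

URLs left in the heap are not dropped; they simply wait for a later cycle, by which time their scores may have been recomputed.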


Where teams waste crawl budget most often

| Waste pattern | Why it hurts | Better move |
|---|---|---|
| recrawling result pages too often | lots of requests, little new data | scrape result pages for discovery, refresh detail pages directly |
| retrying hard failures forever | burns worker time and proxy credits | quarantine after N failures and review separately |
| no deduplication | same entity fetched from many paths | canonicalize URLs before queueing |
| no change detection | unchanged pages keep winning budget | store content hashes and slow down stable pages |
| same policy for every source | ignores target-specific behavior | tune per domain and URL type |

This is why crawl budget is operational, not theoretical.
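
Two of the "better moves" in the table, canonicalizing before queueing and quarantining after N failures, can be sketched with the standard library alone. The tracking-parameter list, the failure threshold, and the set of statuses treated as permanent are assumptions for illustration; some sites are sensitive to trailing slashes or parameter order, so the rules should be checked per target.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical values; tune them per target site.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}
PERMANENT_STATUSES = {404, 410}
MAX_CONSECUTIVE_FAILURES = 5


def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent variants map to one queue entry."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest for a stable order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded because it never reaches the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def should_quarantine(status_code: int, consecutive_failures: int) -> bool:
    """Decide whether a URL should leave the retry loop for manual review."""
    if status_code in PERMANENT_STATUSES:
        return True
    return consecutive_failures >= MAX_CONSECUTIVE_FAILURES
```

Running `canonicalize` on every URL before it enters the frontier means deduplication becomes a plain set-membership check.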


How ProxiesAPI fits into the picture

ProxiesAPI can improve fetch reliability, but reliability is only half of efficiency.

If your prioritization is bad, proxy-backed requests still get wasted on low-value pages.

The right order is:

  1. choose the right URLs
  2. choose the right recrawl interval
  3. route the fetches through a stable network layer when scale demands it

That is how you turn crawl budget into better data instead of just more requests.


A practical rollout plan

If your scraper is currently naive, do this in order:

  1. add URL types and business-value scores
  2. track change history and failures
  3. split discovery from refresh queues
  4. recrawl based on observed volatility
  5. add ProxiesAPI or other network hardening when request volume grows

Most teams start at step 5 and wonder why the crawler is still wasteful. The answer is that throughput does not replace prioritization.

If you remember one thing, make it this: crawl budget for web scraping is about saying "not now" to the wrong URLs so you can say "yes" to the right ones more often.
