Crawl Budget for Web Scraping: How to Prioritize URLs and Avoid Waste
Crawl budget for web scraping is the discipline of deciding which URLs deserve a request right now and which ones do not.
That sounds obvious, but most scrapers waste a shocking number of requests on fetches that return little or no new data:
- recrawling dormant listings every hour
- hitting search result pages after the relevant items have already been extracted
- retrying permanently broken URLs like they are temporary failures
- treating high-value pages and low-value pages exactly the same
Crawl budget is not an SEO-only concept. If you want a crawler that stays cheap, fast, and useful, it is an engineering constraint.
Proxy capacity and retry logic help, but they do not solve bad prioritization. ProxiesAPI works best when your crawler already knows which URLs are worth another fetch.
What crawl budget means in scraping
For a scraper, crawl budget is the combination of:
- available requests per hour or day
- worker capacity
- proxy bandwidth or credit usage
- target-site tolerance
- the business value of each fetched page
The mistake is assuming every discovered URL should be fetched immediately.
In practice, a good crawler makes tradeoffs. A product page whose price changes every few hours deserves more attention than a terms-of-service page that changes once a year.
Start with URL tiers, not one giant queue
Before writing any code, classify URL types by value and expected volatility.
| URL type | Example | Business value | Change frequency | Suggested priority |
|---|---|---|---|---|
| product or listing detail | /product/123 | high | medium to high | highest |
| category or search result | /search?q=ssd | medium | high | medium |
| seller or company profile | /store/acme | medium | low to medium | medium |
| help or policy pages | /shipping-policy | low | low | lowest |
| known broken or duplicate URLs | redirect loops, 404s | none | none | skip |
This tiering alone cuts waste because you stop pretending that every page is equally valuable.
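As a rough illustration of the tiering table, the sketch below maps a URL to a type and a tier number by matching its path. The path patterns, the tier numbers, and the names `TIER_RULES` and `classify_url` are all assumptions made for this example; a real target site needs its own rules.

```python
import re
from urllib.parse import urlparse

# Hypothetical path patterns and tier numbers (higher = more important).
# Adjust both to the real URL scheme of the site you crawl.
TIER_RULES = [
    ("detail", re.compile(r"^/product/\d+$"), 3),
    ("listing", re.compile(r"^/search$"), 2),
    ("profile", re.compile(r"^/store/[^/]+$"), 2),
    ("policy", re.compile(r"^/(shipping-policy|terms|help)(/.*)?$"), 1),
]


def classify_url(url: str) -> tuple[str, int]:
    """Return (url_type, tier). Paths that match no rule get tier 0."""
    path = urlparse(url).path.rstrip("/") or "/"
    for url_type, pattern, tier in TIER_RULES:
        if pattern.match(path):
            return url_type, tier
    return "unknown", 0


print(classify_url("https://example.com/product/123"))
print(classify_url("https://example.com/search?q=ssd"))
```

Unmatched URLs land in tier 0 here so they can be reviewed before they consume any budget, instead of being crawled by default.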
The four signals that should drive priority
The best recrawl policies usually combine four signals:
| Signal | Question it answers | Example |
|---|---|---|
| business value | does this page matter if it changes? | price page vs FAQ page |
| freshness risk | how likely is it to change soon? | airline search result vs static docs page |
| extraction yield | does this page reveal new entities or links? | category page with pagination |
| failure history | has this URL been worth the pain? | chronic 403 or empty page |
One simple scoring formula:
priority = business_value + freshness_risk + extraction_yield - failure_penalty
You do not need a fancy ML model. A practical heuristic almost always beats a uniform crawl.
A simple scoring model you can ship quickly
Here is a small Python model for assigning recrawl priority:
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class UrlRecord:
    url: str
    url_type: str
    last_fetched_at: datetime | None
    last_changed_at: datetime | None
    consecutive_failures: int
    business_value: int  # 1..5
    extraction_yield: int  # 1..5


def freshness_risk(record: UrlRecord, now: datetime) -> int:
    # A page with no recorded change is treated as maximally risky.
    if record.last_changed_at is None:
        return 5
    # The more recently the page changed, the higher the risk score.
    age = now - record.last_changed_at
    if age <= timedelta(hours=6):
        return 5
    if age <= timedelta(days=1):
        return 4
    if age <= timedelta(days=3):
        return 3
    if age <= timedelta(days=7):
        return 2
    return 1


def failure_penalty(record: UrlRecord) -> int:
    # Cap the penalty so a long failure streak cannot dominate the score.
    return min(record.consecutive_failures, 4)


def priority_score(record: UrlRecord, now: datetime) -> int:
    # Additive score: a higher value means the URL should be fetched sooner.
    return (
        record.business_value
        + freshness_risk(record, now)
        + record.extraction_yield
        - failure_penalty(record)
    )
This is deliberately simple. The goal is not mathematical beauty. The goal is a stable, explainable ordering.
Recrawl intervals should be earned
Do not assign the same recrawl interval to everything. Let pages earn their frequency based on observed change rates.
| Page behavior | Recommended recrawl |
|---|---|
| changes multiple times per day | every 15 to 60 minutes |
| changes daily | every 6 to 24 hours |
| changes weekly | every 2 to 7 days |
| rarely changes | every 14 to 30 days |
| never produced useful data | archive or skip |
This sounds conservative, but it is how you free capacity for the pages that actually move.
Use discovery crawls and refresh crawls differently
One crawler usually does two very different jobs:
- discovery: find new URLs
- refresh: revisit known high-value URLs
If you mix them into one undifferentiated queue, discovery pages can starve your important refreshes.
The better pattern is:
- a light discovery queue for category pages, sitemaps, or feeds
- a separate refresh queue for known valuable detail pages
- different rate limits and concurrency settings for each
This is one of the fastest ways to improve crawl budget without buying more infrastructure.
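A minimal sketch of the split, assuming a per-cycle quota stands in for the separate rate limits and concurrency settings. The class name `SplitFrontier` and its methods are hypothetical, not part of any library.

```python
import heapq
from collections import deque


class SplitFrontier:
    """Keeps discovery and refresh work apart, with a per-cycle quota for each."""

    def __init__(self, refresh_quota: int, discovery_quota: int) -> None:
        self.refresh_quota = refresh_quota
        self.discovery_quota = discovery_quota
        self._refresh: list[tuple[int, str]] = []  # max-heap via negated score
        self._discovery: deque[str] = deque()  # plain FIFO

    def add_refresh(self, url: str, score: int) -> None:
        heapq.heappush(self._refresh, (-score, url))

    def add_discovery(self, url: str) -> None:
        self._discovery.append(url)

    def next_batch(self) -> list[str]:
        """Return one cycle of URLs: refresh first, then a capped slice of discovery."""
        batch: list[str] = []
        for _ in range(min(self.refresh_quota, len(self._refresh))):
            batch.append(heapq.heappop(self._refresh)[1])
        for _ in range(min(self.discovery_quota, len(self._discovery))):
            batch.append(self._discovery.popleft())
        return batch


frontier = SplitFrontier(refresh_quota=2, discovery_quota=1)
frontier.add_discovery("https://example.com/search?q=ssd")
frontier.add_refresh("https://example.com/product/1", score=14)
frontier.add_refresh("https://example.com/product/2", score=9)
print(frontier.next_batch())
```

Because each queue has its own quota, a flood of newly discovered category pages cannot push detail-page refreshes out of a cycle.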
Track "change rate," not just "fetched at"
Many systems store only last_fetched_at. That is not enough.
What you really want is:
- last_fetched_at
- last_changed_at
- last_status_code
- content_hash
- consecutive_failures
- records_extracted
If the page has not changed in 20 consecutive fetches, the system should recrawl it less often. If the page changed on three straight visits, it should move up in priority.
That means your budget responds to evidence instead of guesswork.
Example: prioritize a frontier with a heap
Here is a small example that turns priority scores into a fetch order:
import heapq
from datetime import datetime, timezone


def build_frontier(records: list[UrlRecord]) -> list[tuple[int, str]]:
    now = datetime.now(timezone.utc)
    heap: list[tuple[int, str]] = []
    for record in records:
        score = priority_score(record, now)
        # heapq is a min-heap, so push the negated score to pop the highest first.
        heapq.heappush(heap, (-score, record.url))
    return heap


records = [
    UrlRecord(
        url="https://example.com/product/1",
        url_type="product",
        last_fetched_at=None,
        last_changed_at=None,
        consecutive_failures=0,
        business_value=5,
        extraction_yield=4,
    ),
    UrlRecord(
        url="https://example.com/help/refunds",
        url_type="help",
        last_fetched_at=None,
        last_changed_at=None,
        consecutive_failures=0,
        business_value=1,
        extraction_yield=1,
    ),
]

frontier = build_frontier(records)
print(heapq.heappop(frontier))  # (-14, 'https://example.com/product/1')
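The same pattern works whether you have 200 URLs or 20 million, although at the larger end the frontier usually lives in a database or persistent queue rather than an in-memory heap.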
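To connect the ordering back to a budget, a crawl cycle can pop only as many URLs as it can afford and leave the rest queued. This is a small sketch; `take_within_budget` is a hypothetical helper, and it assumes the heap holds `(-score, url)` tuples like the frontier above.

```python
import heapq


def take_within_budget(frontier: list[tuple[int, str]], budget: int) -> list[str]:
    """Pop at most `budget` URLs, highest score first. The rest stay queued."""
    selected: list[str] = []
    while frontier and len(selected) < budget:
        _neg_score, url = heapq.heappop(frontier)
        selected.append(url)
    return selected


heap: list[tuple[int, str]] = []
for score, url in [(14, "/product/1"), (7, "/help/refunds"), (10, "/store/acme")]:
    heapq.heappush(heap, (-score, url))

print(take_within_budget(heap, budget=2))
```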
Where teams waste crawl budget most often
| Waste pattern | Why it hurts | Better move |
|---|---|---|
| recrawling result pages too often | lots of requests, little new data | scrape result pages for discovery, refresh detail pages directly |
| retrying hard failures forever | burns worker time and proxy credits | quarantine after N failures and review separately |
| no deduplication | same entity fetched from many paths | canonicalize URLs before queueing |
| no change detection | unchanged pages keep winning budget | store content hashes and slow down stable pages |
| same policy for every source | ignores target-specific behavior | tune per domain and URL type |
This is why crawl budget is operational, not theoretical.
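As an example of the "canonicalize URLs before queueing" row, here is a rough sketch built on the standard library's `urllib.parse`. The `TRACKING_PARAMS` set and the helper names are assumptions, and the rules themselves (dropping fragments, sorting query parameters, stripping trailing slashes) are not safe for every site, so check them against the target before relying on them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters assumed not to affect page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}


def canonicalize(url: str) -> str:
    """Normalize a URL so different paths to the same entity compare equal."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # The empty last element drops the #fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def enqueue_unique(urls: list[str], seen: set[str]) -> list[str]:
    """Return canonical URLs not seen before, and record them in `seen`."""
    fresh: list[str] = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


seen: set[str] = set()
print(enqueue_unique(
    [
        "https://example.com/product/1?ref=home",
        "https://Example.com/product/1/#reviews",
        "https://example.com/product/2",
    ],
    seen,
))
```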
How ProxiesAPI fits into the picture
ProxiesAPI can improve fetch reliability, but reliability is only half of efficiency.
If your prioritization is bad, proxy-backed requests still get wasted on low-value pages.
The right order is:
1. choose the right URLs
2. choose the right recrawl interval
3. route the fetches through a stable network layer when scale demands it
That is how you turn crawl budget into better data instead of just more requests.
A practical rollout plan
If your scraper is currently naive, do this in order:
1. add URL types and business-value scores
2. track change history and failures
3. split discovery from refresh queues
4. recrawl based on observed volatility
5. add ProxiesAPI or other network hardening when request volume grows
Most teams start at step 5 and wonder why the crawler is still wasteful. The answer is that throughput does not replace prioritization.
If you remember one thing, make it this: crawl budget for web scraping is about saying "not now" to the wrong URLs so you can say "yes" to the right ones more often.