robots.txt for Web Scraping: What It Really Means (and What It Doesn’t)
If you’ve ever scraped a site seriously, you’ve run into robots.txt.
And if you’ve googled it, you’ve seen two extremes:
- “robots.txt is legally binding—never scrape anything disallowed”
- “robots.txt is just a suggestion—ignore it”
Reality is more nuanced.
This post explains what robots.txt actually does, what it doesn’t do, and how to use it as a sane input into a real scraping policy.
Proxies help with stability, not permission. ProxiesAPI can keep large crawls reliable — but the real win is pairing it with a respectful crawl policy and rate limits.
What robots.txt is (mechanically)
robots.txt is a plain text file (usually at https://example.com/robots.txt) that provides crawl directives for automated user agents.
It was designed for web crawlers (search engines) to coordinate load and to avoid crawling “junk” areas like:
- search result pages
- internal account pages
- “infinite” filters / sort combinations
Key point
robots.txt is interpreted by clients.
The server doesn’t enforce it the way it enforces auth, CAPTCHAs, or IP blocks.
What robots.txt is NOT
1) It’s not authentication
If a URL is publicly accessible, robots.txt doesn’t magically protect it.
If a URL is sensitive, the correct protection is:
- authentication, or
- permission checks, or
- removing it from public access
2) It’s not an access control system
A Disallow: line doesn’t stop a browser from loading the URL.
3) It’s not a guarantee of “allowed”
Even if robots allows a path, you can still violate:
- the site’s Terms of Service
- privacy expectations
- rate limits / abuse rules
What robots.txt DOES mean in practice
It’s best to treat robots.txt as:
- a load-management and intent signal (“please don’t crawl these paths”)
- one input to your crawler behavior
- a place where some sites publish a contact email / sitemap hints
For many organizations, “disallowed in robots” is the line their legal/compliance team expects you to respect, even if it’s not strictly “enforced”.
How robots directives work (quick reference)
Robots files are made of groups:
User-agent: SomeBot
Disallow: /private/
Allow: /private/public-report.pdf
Common directives you’ll see:
| Directive | Meaning | Typical scraper behavior |
|---|---|---|
| User-agent | Which bot the rules apply to | Set a clear UA, don’t pretend to be Google |
| Disallow | Paths you should not crawl | Avoid those paths in your queue |
| Allow | Exceptions to disallow | Include if explicitly allowed |
| Crawl-delay | Suggested delay between requests | Treat as a minimum delay if present |
| Sitemap | Sitemap URL(s) | Great for URL discovery |
Important nuance: matching rules differ by parser, and many scrapers don’t implement full Google-style robots semantics.
If you need strict interpretation, use a known parser library.
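As a rough illustration of the group shown earlier, the sketch below feeds a small robots file to Python’s built-in urllib.robotparser and queries it. The file contents, the bot names, and the example.com URLs are made up for illustration. The Allow exception is printed rather than assumed, because first-match and longest-match parsers can disagree on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: SomeBot
Disallow: /private/
Allow: /private/public-report.pdf
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A path under the disallowed prefix, a path no rule mentions,
# and a bot that has no group in this file.
blocked = rp.can_fetch("SomeBot", "https://example.com/private/notes.html")
open_path = rp.can_fetch("SomeBot", "https://example.com/blog/post-1")
other_bot = rp.can_fetch("OtherBot", "https://example.com/private/notes.html")

delay = rp.crawl_delay("SomeBot")  # Crawl-delay for this bot's group, or None
sitemaps = rp.site_maps()          # list of Sitemap URLs, or None (Python 3.8+)

print(blocked, open_path, other_bot, delay, sitemaps)

# The Allow exception is where parsers differ (first match vs longest match),
# so check what your parser actually returns instead of assuming.
print(rp.can_fetch("SomeBot", "https://example.com/private/public-report.pdf"))
```

If the answer for the Allow exception matters to your crawl, that is a sign you want a parser whose matching semantics you have verified.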
The legal question: is robots.txt “binding”?
I’m not a lawyer, but here’s the practical framing:
- robots.txt is not a contract you “agree” to by default
- many sites still use it as a policy marker
- legal outcomes depend more on:
  - ToS wording
  - whether you bypassed technical restrictions
  - jurisdiction and specific facts
  - how you used the data (republishing, reselling, personal research, etc)
So the best operational approach is:
- If you can avoid disallowed paths, do it.
- If you can’t, decide consciously: document why, limit scope, and consider contacting the site.
A sane robots-aware scraping policy
Here’s the policy I recommend for most teams:
- Always fetch and log robots.txt for the origin.
- Use a clear, honest User-Agent for your scraper (include a URL/email).
- Respect Disallow for broad crawl jobs by default.
- Apply rate limiting regardless of robots:
  - per host
  - per path cluster (API vs HTML)
- Prefer sitemaps for discovery when available.
- Never scrape authenticated pages unless you own the account / have explicit permission.
- Don’t collect personal data you don’t need.
Robots is not the whole policy — it’s the “front cover”.
How to check robots.txt programmatically (Python)
This snippet fetches robots.txt and uses Python’s built-in parser to ask “is this URL allowed for my user agent?”
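import urllib.robotparser
from urllib.parse import urlparse

import requests

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds


def can_fetch(target_url: str, user_agent: str = "MyScraperBot") -> bool:
    """Return True if robots.txt lets user_agent fetch target_url."""
    u = urlparse(target_url)
    robots_url = f"{u.scheme}://{u.netloc}/robots.txt"

    r = requests.get(robots_url, timeout=TIMEOUT, headers={"User-Agent": user_agent})
    # No robots.txt published: nothing is disallowed.
    if r.status_code == 404:
        return True
    # Any other HTTP error raises, so the caller can decide to back off.
    r.raise_for_status()

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(r.text.splitlines())
    return rp.can_fetch(user_agent, target_url)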
Two caveats:
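- Python’s urllib.robotparser doesn’t perfectly match Google’s robots behavior. For example, it has historically applied rules in file order (first match wins) rather than picking the longest match, and it has not treated * and $ in paths as wildcards.
- Many sites omit crawl-delay or publish rules only for specific bots.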
Still, it’s a useful baseline.
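Fetching robots.txt again for every URL is wasteful, so one option is to keep one parsed copy per origin. The sketch below is only that, a sketch: RobotsCache and the injected fetch callable are hypothetical names, not part of any library, and the stub fetcher stands in for a real HTTP client.

```python
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Hypothetical helper: one parsed robots.txt per origin."""

    def __init__(self, fetch: Callable[[str], str], user_agent: str) -> None:
        # `fetch` is an assumed callable: robots URL in, robots.txt text out.
        self._fetch = fetch
        self._user_agent = user_agent
        self._parsers: Dict[str, RobotFileParser] = {}

    def allowed(self, url: str) -> bool:
        parts = urlparse(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            # First URL seen for this origin: fetch and parse once.
            parser = RobotFileParser()
            parser.parse(self._fetch(f"{origin}/robots.txt").splitlines())
            self._parsers[origin] = parser
        return parser.can_fetch(self._user_agent, url)


# Usage with a stub fetcher; a real one would wrap an HTTP client.
calls = []


def fake_fetch(robots_url: str) -> str:
    calls.append(robots_url)
    return "User-agent: *\nDisallow: /search\n"


cache = RobotsCache(fake_fetch, "MyScraperBot")
results = [
    cache.allowed("https://example.com/search?q=shoes"),
    cache.allowed("https://example.com/products/42"),
]
print(results, len(calls))
```

Injecting the fetcher keeps the cache testable offline and leaves timeouts, retries, and logging of the raw file to whatever HTTP layer you already use. A long-running crawler would also want an expiry so stale rules get refreshed.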
A respectful crawl-delay pattern (even if robots has none)
Even when robots doesn’t specify crawl-delay, you should pick one.
For example:
- 1 request/second per host (safe default)
- 0.2–0.5 req/sec if pages are heavy
- cap concurrency per host (e.g. 2–4)
That discipline prevents “accidental DDoS” scraping.
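One way to apply these defaults is a small per-host throttle that enforces a minimum gap between requests and treats a robots crawl-delay as a floor. HostThrottle and its parameters are hypothetical names for illustration. The sketch is single-threaded and does not cap concurrency, which would need a separate per-host semaphore or queue.

```python
import random
import time
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


class HostThrottle:
    """Hypothetical per-host limiter: minimum gap between requests, plus jitter."""

    def __init__(
        self,
        default_delay: float = 1.0,
        jitter: float = 0.25,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.default_delay = default_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._next_ok: Dict[str, float] = {}

    def wait(self, url: str, robots_delay: Optional[float] = None) -> float:
        """Block until the URL's host may be hit again; return seconds slept."""
        host = urlparse(url).netloc
        # A robots Crawl-delay, if any, is a floor and never shortens the default.
        delay = max(self.default_delay, robots_delay or 0.0)
        now = self._clock()
        pause = max(0.0, self._next_ok.get(host, now) - now)
        if pause:
            self._sleep(pause)
        # Random jitter keeps requests from landing on a fixed beat.
        self._next_ok[host] = now + pause + delay + random.uniform(0.0, self.jitter)
        return pause


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


throttle = HostThrottle(jitter=0.0, clock=lambda: fake_now[0], sleep=fake_sleep)
first = throttle.wait("https://example.com/a")   # first hit on this host
second = throttle.wait("https://example.com/b")  # same host, must wait
other = throttle.wait("https://example.org/a")   # different host, no wait
print(first, second, other)
```

In a real crawler you would call wait() right before each request and pass the value from the parser’s crawl_delay() for that host when one is published.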
Where proxies fit (and where they don’t)
Proxies are often misunderstood.
Proxies help with:
- stability across large URL sets
- reducing random blocks due to bursty traffic
- handling geo variance when required
Proxies do NOT help with:
- permission
- legal/ethical compliance
- “making scraping okay”
Use ProxiesAPI as an infrastructure layer, then enforce your crawl policy above it.
Quick checklist (save this)
- Fetch + store robots.txt for every origin
- Use a real User-Agent with contact info
- Respect disallowed paths for broad crawls by default
- Rate limit per host + add jitter
- Prefer sitemaps for discovery
- Avoid personal data unless necessary