How to Find All URLs on Any Website: 5 Methods
If you’ve ever tried to crawl a site and realized you don’t even have a complete URL list, you’re not alone.
People search for "find all URLs on a website" because it's surprisingly non-trivial:
- some sites have multiple sitemaps
- some pages are orphaned (not linked)
- some URLs are hidden behind internal search
- some content is generated dynamically
This guide gives you 5 practical methods — from “fast and official” to “brute-force but effective” — and includes a working Python crawler you can use today.
URL discovery is the first step of every crawl. ProxiesAPI helps keep the fetch layer consistent as you move from a few sitemap requests to large, multi-step crawling workflows across many sites.
Method 1) Check sitemap.xml (the fastest win)
Start here.
Most sites expose one of these:
```
https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/sitemap/sitemap.xml
```
Quick terminal check
```shell
curl -s https://example.com/sitemap.xml | head
```
You might see a <urlset> (a list of URLs) or a <sitemapindex> (a list of sitemap files).
Minimal Python sitemap fetch
```python
import requests
import xml.etree.ElementTree as ET

TIMEOUT = (10, 30)

def fetch_text(url: str) -> str:
    r = requests.get(
        url,
        timeout=TIMEOUT,
        headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"},
    )
    r.raise_for_status()
    return r.text

def parse_sitemap_urls(xml_text: str) -> list[str]:
    root = ET.fromstring(xml_text)

    # The XML uses namespaces; easiest is to ignore the namespace by stripping it.
    def strip(tag: str) -> str:
        return tag.split('}', 1)[-1]

    urls = []
    for el in root.iter():
        if strip(el.tag) == "loc" and el.text:
            urls.append(el.text.strip())
    return urls

xml_text = fetch_text("https://example.com/sitemap.xml")
urls = parse_sitemap_urls(xml_text)
print("found", len(urls))
print(urls[:5])
```
If you got a sitemap index, the same parser will return the child sitemap URLs — fetch each and merge.
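If you'd rather handle indexes automatically, one way is to check the root tag and recurse. This is a sketch; `collect_sitemap_urls` and `_strip_ns` are helper names introduced here, and the `fetch` parameter is any callable that returns XML text for a URL (for example, the `fetch_text` helper above):

```python
import xml.etree.ElementTree as ET

def _strip_ns(tag: str) -> str:
    # Drop the "{namespace}" prefix ElementTree puts on tag names.
    return tag.split("}", 1)[-1]

def collect_sitemap_urls(xml_text: str, fetch) -> list[str]:
    """Return page URLs from a sitemap, recursing through sitemap indexes.

    `fetch(url) -> str` must return the XML text of a child sitemap.
    """
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter()
            if _strip_ns(el.tag) == "loc" and el.text]
    if _strip_ns(root.tag) == "sitemapindex":
        urls: list[str] = []
        for child in locs:   # each <loc> here points at a child sitemap file
            urls.extend(collect_sitemap_urls(fetch(child), fetch))
        return urls
    return locs              # plain <urlset>: the <loc>s are page URLs
```

With the earlier helper, `collect_sitemap_urls(fetch_text("https://example.com/sitemap.xml"), fetch_text)` returns the merged page list whether the root is a `<urlset>` or a `<sitemapindex>`.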
Method 2) Read robots.txt (it often points to sitemaps)
robots.txt is not a URL list, but it frequently contains sitemap hints.
Try `https://example.com/robots.txt` in a browser, or from a terminal:

```shell
curl -s https://example.com/robots.txt | grep -i sitemap
```
You’ll often find:

```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-1.xml
```
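Pulling those hints out programmatically is a few lines; `sitemaps_from_robots` is a helper name introduced here, matching lines case-insensitively since robots.txt keys vary in casing:

```python
def sitemaps_from_robots(robots_text: str) -> list[str]:
    """Extract the URLs from Sitemap: lines in a robots.txt body."""
    found = []
    for line in robots_text.splitlines():
        key, _, value = line.partition(":")   # split on the FIRST colon only
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Feed it the body fetched from `https://example.com/robots.txt` and pass each result to the sitemap parser from Method 1.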
Method 3) Crawl internal links (discover what’s actually reachable)
If sitemaps are missing or incomplete, crawl.
A good crawler:
- starts from a small set of seeds (homepage + top nav pages)
- stays on the same host
- avoids non-HTML assets
- deduplicates URLs (canonicalization)
- limits depth/pages so you don’t explode
A simple Python crawler (requests + BeautifulSoup)
This crawler:
- BFS crawls from a seed URL
- collects internal URLs
- respects a max page limit
```python
from collections import deque
from urllib.parse import urljoin, urlparse, urldefrag

import requests
from bs4 import BeautifulSoup

TIMEOUT = (10, 30)

def is_html_response(resp: requests.Response) -> bool:
    ctype = resp.headers.get("Content-Type", "")
    return "text/html" in ctype

def normalize_url(base: str, href: str) -> str | None:
    if not href:
        return None
    url = urljoin(base, href)
    url, _frag = urldefrag(url)
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    return url

def crawl_site(seed: str, max_pages: int = 200) -> list[str]:
    seed_parsed = urlparse(seed)
    host = seed_parsed.netloc
    q = deque([seed])
    seen = set([seed])
    out = []
    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"})
    while q and len(out) < max_pages:
        url = q.popleft()
        try:
            resp = s.get(url, timeout=TIMEOUT, allow_redirects=True)
            resp.raise_for_status()
        except Exception as e:
            print("fetch failed", url, e)
            continue
        if not is_html_response(resp):
            continue
        out.append(url)
        soup = BeautifulSoup(resp.text, "lxml")
        for a in soup.select("a[href]"):
            nxt = normalize_url(url, a.get("href"))
            if not nxt:
                continue
            p = urlparse(nxt)
            if p.netloc != host:
                continue
            # avoid obvious non-content file types
            if any(p.path.lower().endswith(ext) for ext in (".jpg", ".png", ".gif", ".pdf", ".zip")):
                continue
            if nxt not in seen:
                seen.add(nxt)
                q.append(nxt)
    return out

urls = crawl_site("https://example.com/", max_pages=100)
print("discovered", len(urls))
print(urls[:10])
```
Where ProxiesAPI fits in crawling
When you crawl, you’re making many fetches. Swapping the fetch layer to ProxiesAPI can be as simple as wrapping each URL:
```shell
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com/" | head
```
In Python, you can wrap URLs before GET and keep parsing unchanged.
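A minimal sketch of that wrapping, assuming the endpoint shown in the curl line above; `via_proxiesapi` is a helper name introduced here, and `API_KEY` is a placeholder for your real key:

```python
from urllib.parse import quote

PROXIESAPI_ENDPOINT = "http://api.proxiesapi.com/"

def via_proxiesapi(target_url: str, api_key: str) -> str:
    # URL-encode the target so its own query string survives the wrapping.
    return f"{PROXIESAPI_ENDPOINT}?key={api_key}&url={quote(target_url, safe='')}"
```

Inside the crawler loop, `s.get(via_proxiesapi(url, API_KEY), ...)` replaces `s.get(url, ...)`; the parsing code stays unchanged.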
Method 4) Use site: search (find URLs search engines already know)
Google/Bing (and other engines) can act like a “third-party index”.
Example query:

```
site:example.com
```
Pros:
- finds orphaned pages you won’t reach by crawling
Cons:
- not complete
- may show old/redirected URLs
- results are capped and sampling-based
If you need completeness, treat this as a supplement.
Method 5) Export from your own sources (analytics + CMS)
If you control the site, the most accurate URL list is internal:
- CMS export (WordPress, Webflow, Shopify, custom DB)
- analytics (GA4, Plausible) pages report
- server logs
This is often the best method for internal audits because it captures URLs that are live but not necessarily linked.
Compare the 5 methods (when to use what)
| Method | Best for | Coverage | Effort | Notes |
|---|---|---|---|---|
| Sitemap | SEO-friendly sites | High | Low | Start here; also check sitemap indexes |
| robots.txt | Discover sitemap hints | Medium | Low | Commonly points to the real sitemap |
| Crawl links | Reachable internal pages | Medium–High | Medium | Misses orphan pages; depends on seeds |
| site: search | Orphan / long-tail discovery | Low–Medium | Low | Sampling + caps; supplement only |
| Internal exports | Sites you control | Very High | Medium | Most complete if you have access |
A practical workflow (recommended)
If you need a “mostly complete” URL set:
- fetch `robots.txt` → collect sitemap URLs
- fetch sitemaps → extract all `<loc>` URLs
- crawl from homepage to discover non-sitemap pages
- (optional) supplement with `site:` search
Then dedupe + normalize (remove fragments, unify trailing slashes).
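That final pass can be sketched as a small helper. `canonicalize` and `dedupe` are names introduced here, and the rules (lowercase scheme/host, drop fragments, strip trailing slashes) are one reasonable convention; adjust them if your site treats `/a` and `/a/` as different pages:

```python
from urllib.parse import urldefrag, urlparse, urlunparse

def canonicalize(url: str) -> str:
    """Drop fragments, lowercase scheme/host, unify trailing slashes."""
    url, _frag = urldefrag(url)
    p = urlparse(url)
    path = p.path.rstrip("/") or "/"   # "/a/" and "/a" become the same key
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path,
                       p.params, p.query, ""))

def dedupe(urls):
    """Keep first-seen order while collapsing canonical duplicates."""
    seen, out = set(), []
    for u in urls:
        c = canonicalize(u)
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out
```

Run the merged output of all four methods through `dedupe` before handing it to the next stage of your pipeline.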
Where ProxiesAPI fits (honestly)
URL discovery isn’t just one request — it’s a pipeline:
- robots.txt fetch
- sitemap index fetch
- many sitemap fetches
- lots of internal page fetches
ProxiesAPI helps when you want a consistent fetch interface across all those steps, especially when you run URL discovery frequently across many domains.
Checklist
- you tried sitemap.xml and sitemap_index.xml
- you checked robots.txt for sitemap hints
- your crawler stays on-host and dedupes URLs
- you set max_pages so the crawl doesn’t explode
- you normalized URLs (remove fragments, consistent scheme/host)