How to Find All URLs on Any Website: 5 Methods (Sitemaps, Crawling, Search & More)
If you’ve ever tried to find all URLs on a website, you’ve probably hit the same wall:
- you find the obvious pages (homepage, a couple of categories)
- then you realize the site has thousands of deep pages
- and the real question becomes: what’s the most reliable way to discover them all?
This guide is a practical playbook: 5 methods you can combine, depending on the site.
We’ll cover:
- Sitemap discovery (sitemap.xml, sitemap index files)
- robots.txt discovery (sites often declare sitemaps there)
- In-page internal link extraction (quick wins)
- Crawling with rules (BFS crawl with dedupe + limits)
- Search-based discovery (when the site hides URLs behind JS or forms)
Along the way, you’ll get working Python code you can adapt into a production URL discovery pipeline.
URL discovery is network-heavy: you’ll hit sitemaps, category pages, pagination, and lots of internal links. ProxiesAPI helps keep large URL discovery runs stable by proxying requests through a single, simple HTTP interface.
Before you start: what “all URLs” really means
A website can expose URLs via multiple channels:
- public HTML links (discoverable by crawling)
- sitemaps (often the best, most complete source)
- parameterized URLs (filters, tracking params, pagination)
- JS-generated links (harder without a browser)
- API endpoints (sometimes the real data source)
So the goal is usually:
- All canonical, indexable URLs you care about (SEO view)
- or All reachable internal URLs (crawler view)
In this tutorial, we’ll aim for high coverage without junk.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for fetching
- BeautifulSoup (lxml) for parsing
Shared utilities (URL normalization + filtering)
URL discovery goes sideways if you don’t normalize.
Here’s a small utility layer that:
- converts relative links to absolute
- strips URL fragments (#section)
- optionally strips tracking parameters (simple version)
- filters to same-host URLs
from __future__ import annotations
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode
TRACKING_PARAMS = {
"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
"gclid", "fbclid",
}
def normalize_url(url: str, base: str) -> str | None:
if not url:
return None
abs_url = urljoin(base, url)
p = urlparse(abs_url)
# only http(s)
if p.scheme not in {"http", "https"}:
return None
# drop fragments
p = p._replace(fragment="")
# strip common tracking params (keep others)
q = [(k, v) for (k, v) in parse_qsl(p.query, keep_blank_values=True) if k not in TRACKING_PARAMS]
p = p._replace(query=urlencode(q, doseq=True))
return urlunparse(p)
def same_host(url: str, root: str) -> bool:
return urlparse(url).netloc == urlparse(root).netloc
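Quick sanity check, using the helpers above (example.com is just a placeholder):
base = "https://example.com/blog/"
print(normalize_url("../pricing?utm_source=news#plans", base))
# -> https://example.com/pricing
print(same_host("https://example.com/about", "https://example.com"))
# -> True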
Method 1 — Sitemaps (best coverage, fastest)
If a site provides a sitemap, use it first.
Common URLs:
- https://example.com/sitemap.xml
- https://example.com/sitemap_index.xml
- https://example.com/sitemap/sitemap.xml
Fetch + parse sitemap XML
Sitemaps are XML. You can parse with xml.etree.ElementTree.
import requests
import xml.etree.ElementTree as ET
TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds
def fetch(url: str) -> str:
r = requests.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"})
r.raise_for_status()
return r.text
def parse_sitemap(xml_text: str) -> list[str]:
"""Returns URLs from <urlset> sitemaps."""
root = ET.fromstring(xml_text)
# Namespace-safe parsing: grab all <loc> regardless of namespace
urls = []
for loc in root.findall(".//{*}loc"):
if loc.text:
urls.append(loc.text.strip())
return urls
sitemap_url = "https://example.com/sitemap.xml"
xml_text = fetch(sitemap_url)
urls = parse_sitemap(xml_text)
print("sitemap urls:", len(urls))
print(urls[:5])
Handle sitemap indexes
Many large sites use a sitemap index containing links to multiple sitemap files.
def is_sitemap_index(xml_text: str) -> bool:
return "<sitemapindex" in xml_text
def discover_from_sitemap(url: str, limit: int = 200_000) -> list[str]:
xml_text = fetch(url)
if is_sitemap_index(xml_text):
parts = parse_sitemap(xml_text) # locs to sub-sitemaps
out = []
for part in parts:
sub = fetch(part)
out.extend(parse_sitemap(sub))
if len(out) >= limit:
break
return out[:limit]
return parse_sitemap(xml_text)[:limit]
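If you don't know which sitemap URL a site uses, a small probe over the common locations listed earlier works well. This is a sketch that assumes the fetch() and discover_from_sitemap() helpers above; the candidate list is illustrative, not exhaustive.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]

def find_sitemap_urls(root_url: str) -> list[str]:
    # Try common sitemap locations; return URLs from the first one that parses.
    for path in COMMON_SITEMAP_PATHS:
        candidate = root_url.rstrip("/") + path
        try:
            return discover_from_sitemap(candidate)
        except (requests.RequestException, ET.ParseError):
            continue  # 404, network error, or non-XML response: try the next candidate
    return []

print("sitemap urls:", len(find_sitemap_urls("https://example.com")))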
ProxiesAPI variant (drop-in network layer)
If you’re doing URL discovery across many sites, the fetch layer becomes the fragile part.
ProxiesAPI gives you a single HTTP endpoint. You pass the target URL and your key:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com/sitemap.xml"
In Python:
import requests
PROXIESAPI_KEY = "API_KEY" # replace
def fetch_via_proxiesapi(target_url: str) -> str:
api_url = "http://api.proxiesapi.com/"
r = requests.get(
api_url,
params={"key": PROXIESAPI_KEY, "url": target_url},
timeout=TIMEOUT,
headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"},
)
r.raise_for_status()
return r.text
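The rest of Method 1 stays the same; only the network call changes:
# Same sitemap parsing as before, just fetched through ProxiesAPI.
xml_text = fetch_via_proxiesapi("https://example.com/sitemap.xml")
urls = parse_sitemap(xml_text)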
Method 2 — robots.txt (often declares the sitemap)
Even if sitemap.xml isn't where you expect, many sites declare their sitemap locations in robots.txt:
https://example.com/robots.txt
Example lines:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_posts.xml
Parse it like this:
import re
def discover_sitemaps_from_robots(root_url: str) -> list[str]:
robots_url = root_url.rstrip("/") + "/robots.txt"
txt = fetch(robots_url)
sitemaps = []
for line in txt.splitlines():
m = re.match(r"^\s*Sitemap:\s*(\S+)\s*$", line, flags=re.I)
if m:
sitemaps.append(m.group(1))
return sitemaps
root = "https://example.com"
print(discover_sitemaps_from_robots(root))
In practice:
- fetch robots.txt
- extract all sitemap URLs
- parse each sitemap (and indexes)
- merge + dedupe (see the sketch below)
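Here's a minimal sketch of that sequence, built on the discover_sitemaps_from_robots() and discover_from_sitemap() helpers above:
def discover_urls_via_robots(root_url: str) -> list[str]:
    # robots.txt -> sitemap URLs -> page URLs, merged and deduped
    all_urls: set[str] = set()
    for sitemap_url in discover_sitemaps_from_robots(root_url):
        try:
            all_urls.update(discover_from_sitemap(sitemap_url))
        except requests.RequestException as e:
            print("skipping sitemap", sitemap_url, e)
    return sorted(all_urls)

print("urls via robots.txt:", len(discover_urls_via_robots("https://example.com")))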
Method 3 — Extract internal links from a page (quick win)
If you just need a list of internal URLs from a few key pages (home, category, search results), parsing <a href> is easy.
from bs4 import BeautifulSoup
def extract_links_from_html(html: str, base_url: str, root_url: str) -> set[str]:
soup = BeautifulSoup(html, "lxml")
out: set[str] = set()
for a in soup.select("a[href]"):
href = a.get("href")
u = normalize_url(href, base_url)
if not u:
continue
if same_host(u, root_url):
out.add(u)
return out
root = "https://example.com"
html = fetch(root)
links = extract_links_from_html(html, base_url=root, root_url=root)
print("internal links from home:", len(links))
print(list(sorted(links))[:10])
Terminal sanity check
python url_discovery.py
Typical output:
internal links from home: 128
https://example.com/about
https://example.com/blog/
https://example.com/pricing
...
Method 4 — Crawl the site (BFS with limits)
Crawling is where you get breadth.
A sensible production crawl should include:
- dedupe (a seen set)
- a queue (BFS)
- rate limiting and timeouts
- scope rules (same host, allowlist paths, extension filters)
- max pages to avoid infinite loops
Here’s a simple crawler you can extend.
import time
from collections import deque
@dataclass
class CrawlConfig:
root: str
max_pages: int = 2000
delay_s: float = 0.2
def crawl_internal_urls(cfg: CrawlConfig) -> list[str]:
    q = deque([cfg.root])              # BFS frontier
    seen: set[str] = set([cfg.root])   # dedupe: never enqueue the same URL twice
    out: list[str] = []                # successfully fetched URLs
    while q and len(out) < cfg.max_pages:
        url = q.popleft()
        try:
            html = fetch(url)
        except requests.HTTPError as e:
            print("HTTP error", url, e)
            continue
        except requests.RequestException as e:
            print("Request error", url, e)
            continue
        out.append(url)
        # extract same-host links and enqueue any we haven't seen yet
        new_links = extract_links_from_html(html, base_url=url, root_url=cfg.root)
        for link in new_links:
            if link in seen:
                continue
            seen.add(link)
            q.append(link)
        print("crawled", len(out), "queued", len(q), "seen", len(seen))
        time.sleep(cfg.delay_s)        # politeness delay between requests
    return out
cfg = CrawlConfig(root="https://example.com", max_pages=200)
urls = crawl_internal_urls(cfg)
print("total crawled:", len(urls))
Make it scale: crawl with ProxiesAPI
If you’re crawling thousands of pages, your request pipeline should be simple, consistent, and easy to retry.
The easiest swap is: keep everything the same, but use fetch_via_proxiesapi().
def fetch_crawl(url: str) -> str:
# return fetch(url) # direct
return fetch_via_proxiesapi(url) # via ProxiesAPI
Then use fetch_crawl() inside crawl_internal_urls().
Method 5 — Search-based discovery (when crawling isn’t enough)
Some sites:
- hide important URLs behind JS
- require form submits
- block bots from deep paths
In these cases, search-based discovery can reveal URLs your crawler misses.
Two common approaches:
- Search for a keyword: site:example.com some keyword
- Search for a pattern: site:example.com inurl:product
You can do this manually, or via a search API.
If you do it programmatically, treat it as a seed generator: you still need to dedupe and normalize.
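For example, if a search API (or a manual search) gives you a list of result URLs, a small sketch for turning them into clean crawl seeds, reusing the normalization helpers from earlier, looks like this:
def seeds_from_search_results(result_urls: list[str], root_url: str) -> list[str]:
    # Normalize, keep same-host URLs only, and dedupe before feeding the crawler.
    seeds: set[str] = set()
    for raw in result_urls:
        u = normalize_url(raw, root_url)
        if u and same_host(u, root_url):
            seeds.add(u)
    return sorted(seeds)

# Feed them into the BFS queue from Method 4, e.g.:
# q = deque(seeds_from_search_results(results, "https://example.com"))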
Putting it together: a practical URL discovery pipeline
A robust workflow looks like this:
- robots.txt → sitemap URLs
- sitemap parsing → bulk URLs
- crawl key sections (category pages, pagination) to catch non-sitemap URLs
- search-based seeds when coverage is low
- normalize + dedupe + export (see the pipeline sketch below)
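Here's one way to wire those steps together. It's a sketch assembled from the helpers defined earlier in this guide (robots.txt discovery, sitemap parsing, the BFS crawler), not a drop-in library:
def discover_all_urls(root_url: str, max_crawl_pages: int = 500) -> list[str]:
    discovered: set[str] = set()

    # 1-2. robots.txt -> sitemaps -> bulk URLs
    try:
        sitemaps = discover_sitemaps_from_robots(root_url)
    except requests.RequestException:
        sitemaps = []
    for sitemap_url in sitemaps:
        try:
            discovered.update(discover_from_sitemap(sitemap_url))
        except requests.RequestException:
            continue

    # 3. crawl to catch pages the sitemaps miss
    cfg = CrawlConfig(root=root_url, max_pages=max_crawl_pages)
    discovered.update(crawl_internal_urls(cfg))

    # 4. add search-based seeds here if coverage still looks low

    # 5. normalize + dedupe (drop anything that fails normalization or leaves the host)
    clean = {normalize_url(u, root_url) for u in discovered}
    return sorted(u for u in clean if u and same_host(u, root_url))

urls = discover_all_urls("https://example.com")
print("total discovered:", len(urls))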
Export to JSONL (easy for pipelines)
import json
with open("discovered_urls.jsonl", "w", encoding="utf-8") as f:
for u in sorted(set(urls)):
f.write(json.dumps({"url": u}, ensure_ascii=False) + "\n")
print("wrote discovered_urls.jsonl")
Common pitfalls (and how to avoid them)
- Infinite URL spaces: calendars, faceted search, ?page= loops → add caps and allowlists.
- Duplicate content: same page with many params → strip tracking params; consider canonical URLs.
- Non-HTML URLs: PDFs, images → filter by extension or Content-Type (see the fetch_html sketch below).
- Overloading the server: add delays, respect robots.txt, and keep max_pages sane.
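For the non-HTML pitfall, one hedged option is a fetch variant that checks Content-Type before handing the body to a parser:
def fetch_html(url: str) -> str | None:
    # Like fetch(), but returns None for non-HTML responses (PDFs, images, etc.).
    r = requests.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"})
    r.raise_for_status()
    if "text/html" not in r.headers.get("Content-Type", ""):
        return None
    return r.text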
FAQ
Is sitemap.xml always complete?
No. Many sites omit URLs (especially private or user-generated pages). But it’s the best place to start.
Can I find URLs behind JavaScript?
Sometimes not with pure requests. You may need a browser (Playwright) or an API the site uses.
What’s the fastest method?
If available: robots.txt → sitemaps.
Where ProxiesAPI fits (honestly)
URL discovery is mostly a networking problem:
- many requests
- many retries
- inconsistent responses
ProxiesAPI helps by giving you a simple “fetch this URL” interface you can plug into any crawler.
You still need to be polite (delays, caps) and scope your crawl.