How to Find All URLs on Any Website: 5 Methods

If you’ve ever tried to crawl a site and realized you don’t even have a complete URL list, you’re not alone.

People search for “find all URLs on a website” because it’s surprisingly non-trivial:

  • some sites have multiple sitemaps
  • some pages are orphaned (not linked)
  • some URLs are hidden behind internal search
  • some content is generated dynamically

This guide gives you 5 practical methods — from “fast and official” to “brute-force but effective” — and includes a working Python crawler you can use today.

Turn URL discovery into a reliable pipeline with ProxiesAPI

URL discovery is the first step of every crawl. ProxiesAPI helps keep the fetch layer consistent as you move from a few sitemap requests to large, multi-step crawling workflows across many sites.


Method 1) Check sitemap.xml (the fastest win)

Start here.

Most sites expose one of these:

  • https://example.com/sitemap.xml
  • https://example.com/sitemap_index.xml
  • https://example.com/sitemap/sitemap.xml

Quick terminal check

curl -s https://example.com/sitemap.xml | head

You might see a <urlset> (a list of URLs) or a <sitemapindex> (a list of sitemap files).

Minimal Python sitemap fetch

import requests
import xml.etree.ElementTree as ET

TIMEOUT = (10, 30)

def fetch_text(url: str) -> str:
    r = requests.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"})
    r.raise_for_status()
    return r.text


def parse_sitemap_urls(xml_text: str) -> list[str]:
    root = ET.fromstring(xml_text)

    # The XML uses namespaces; easiest is to ignore the namespace by stripping it.
    def strip(tag: str) -> str:
        return tag.split('}', 1)[-1]

    urls = []
    for el in root.iter():
        if strip(el.tag) == "loc" and el.text:
            urls.append(el.text.strip())
    return urls

xml_text = fetch_text("https://example.com/sitemap.xml")
urls = parse_sitemap_urls(xml_text)
print("found", len(urls))
print(urls[:5])

If you got a sitemap index, the same parser will return the child sitemap URLs — fetch each and merge.
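To make the “fetch each and merge” step concrete, here’s a minimal sketch. It redefines the same namespace-agnostic <loc> parser (as `_locs`) so the snippet is self-contained, and takes the fetch function as a parameter so you can pass in the `fetch_text` from above, or any wrapped fetcher:

```python
import xml.etree.ElementTree as ET

def _locs(xml_text: str) -> list[str]:
    # collect every <loc> value, ignoring XML namespaces
    # (same idea as parse_sitemap_urls above)
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter()
            if el.tag.split("}", 1)[-1] == "loc" and el.text]

def expand_sitemap(xml_text: str, fetch) -> list[str]:
    """Return page URLs. Given a <sitemapindex>, fetch each child sitemap and merge."""
    root = ET.fromstring(xml_text)
    if root.tag.split("}", 1)[-1] != "sitemapindex":
        return _locs(xml_text)               # plain <urlset>: already page URLs
    urls: list[str] = []
    for child_sitemap in _locs(xml_text):    # in an index, <loc> values are sitemap files
        urls.extend(_locs(fetch(child_sitemap)))
    return urls
```

Call it as `expand_sitemap(fetch_text(sitemap_url), fetch_text)`. Note this goes one level deep, which covers the common case of a single index over plain sitemaps.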


Method 2) Read robots.txt (it often points to sitemaps)

robots.txt is not a URL list, but it frequently contains sitemap hints.

Try:

  • https://example.com/robots.txt

Quick terminal check

curl -s https://example.com/robots.txt | grep -i sitemap

You’ll often find:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-1.xml
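Pulling those Sitemap: lines out in Python is a small job. A sketch (the field name is matched case-insensitively, since robots.txt directive names are conventionally case-insensitive):

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    # Split each line at the first colon; keep lines whose key is "sitemap".
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

Feed it the body from the same `fetch_text` helper used in Method 1, then pass each returned URL through the sitemap parser.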

Method 3) Crawl the site’s internal links

If sitemaps aren’t available or are incomplete, crawl.

A good crawler:

  • starts from a small set of seeds (homepage + top nav pages)
  • stays on the same host
  • avoids non-HTML assets
  • deduplicates URLs (canonicalization)
  • limits depth/pages so you don’t explode

A simple Python crawler (requests + BeautifulSoup)

This crawler:

  • BFS crawls from a seed URL
  • collects internal URLs
  • respects a max page limit

from collections import deque
from urllib.parse import urljoin, urlparse, urldefrag
import requests
from bs4 import BeautifulSoup

TIMEOUT = (10, 30)

def is_html_response(resp: requests.Response) -> bool:
    ctype = resp.headers.get("Content-Type", "")
    return "text/html" in ctype


def normalize_url(base: str, href: str) -> str | None:
    if not href:
        return None

    url = urljoin(base, href)
    url, _frag = urldefrag(url)

    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None

    return url


def crawl_site(seed: str, max_pages: int = 200) -> list[str]:
    seed_parsed = urlparse(seed)
    host = seed_parsed.netloc

    q = deque([seed])
    seen = set([seed])
    out = []

    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"})

    while q and len(out) < max_pages:
        url = q.popleft()

        try:
            resp = s.get(url, timeout=TIMEOUT, allow_redirects=True)
            resp.raise_for_status()
        except Exception as e:
            print("fetch failed", url, e)
            continue

        if not is_html_response(resp):
            continue

        out.append(url)

        soup = BeautifulSoup(resp.text, "lxml")
        for a in soup.select("a[href]"):
            nxt = normalize_url(url, a.get("href"))
            if not nxt:
                continue

            p = urlparse(nxt)
            if p.netloc != host:
                continue

            # avoid obvious non-content file types
            if any(p.path.lower().endswith(ext) for ext in (".jpg", ".png", ".gif", ".pdf", ".zip")):
                continue

            if nxt not in seen:
                seen.add(nxt)
                q.append(nxt)

    return out

urls = crawl_site("https://example.com/", max_pages=100)
print("discovered", len(urls))
print(urls[:10])

Where ProxiesAPI fits in crawling

When you crawl, you’re making many fetches. Swapping the fetch layer to ProxiesAPI can be as simple as wrapping each URL:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com/" | head

In Python, you can wrap URLs before GET and keep parsing unchanged.
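As a sketch of that wrapping (the endpoint and `key`/`url` query format follow the curl example above; `API_KEY` is a placeholder for your real key, and `quote()` keeps query strings in the target URL from leaking into the outer request):

```python
from urllib.parse import quote

API_KEY = "API_KEY"  # placeholder: your ProxiesAPI key

def via_proxiesapi(target_url: str) -> str:
    # Percent-encode the whole target URL so its own "?" and "&"
    # don't get parsed as parameters of the ProxiesAPI request.
    return "http://api.proxiesapi.com/?key=" + API_KEY + "&url=" + quote(target_url, safe="")
```

Then `s.get(via_proxiesapi(url), ...)` replaces `s.get(url, ...)` in the crawler, and everything downstream of the response stays the same.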


Method 4) Use site: search (find URLs search engines already know)

Google/Bing (and other engines) can act like a “third-party index”.

Example query:

  • site:example.com

Pros:

  • finds orphaned pages you won’t reach by crawling

Cons:

  • not complete
  • may show old/redirected URLs
  • results are capped and sampling-based

If you need completeness, treat this as a supplement.


Method 5) Export from your own sources (analytics + CMS)

If you control the site, the most accurate URL list is internal:

  • CMS export (WordPress, Webflow, Shopify, custom DB)
  • analytics (GA4, Plausible) pages report
  • server logs

This is often the best method for internal audits because it captures URLs that are live but not necessarily linked.
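If you have raw server logs, even a simple pass over the request lines yields a URL list. A sketch assuming the common/combined access-log format (the regex, example lines, and host prefix are illustrative; adjust for your log format):

```python
import re

# Matches the quoted request line in common/combined log format, e.g.
# 203.0.113.1 - - [10/Oct/2025:13:55:36 +0000] "GET /blog/post-1 HTTP/1.1" 200 512
REQUEST_RE = re.compile(r'"(?:GET|HEAD) ([^ "]+) HTTP/[^"]*"')

def paths_from_access_log(lines, host="https://example.com"):
    seen = set()
    out = []
    for line in lines:
        m = REQUEST_RE.search(line)
        if not m:
            continue  # POSTs, malformed lines, etc.
        path = m.group(1)
        if path not in seen:
            seen.add(path)
            out.append(host + path)
    return out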


Compare the 5 methods (when to use what)

Method            Best for                      Coverage     Effort   Notes
Sitemap           SEO-friendly sites            High         Low      Start here; also check sitemap indexes
robots.txt        Discover sitemap hints        Medium       Low      Commonly points to the real sitemap
Crawl links       Reachable internal pages      Medium–High  Medium   Misses orphan pages; depends on seeds
site: search      Orphan / long-tail discovery  Low–Medium   Low      Sampling + caps; supplement only
Internal exports  Sites you control             Very High    Medium   Most complete if you have access

If you need a “mostly complete” URL set:

  1. fetch robots.txt → collect sitemap URLs
  2. fetch sitemaps → extract all <loc> URLs
  3. crawl from homepage to discover non-sitemap pages
  4. (optional) supplement with site: search

Then dedupe + normalize (remove fragments, unify trailing slashes).
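A minimal sketch of that dedupe + normalize step (the trailing-slash rule here is one possible convention; pick whichever form your site actually serves):

```python
from urllib.parse import urldefrag, urlparse, urlunparse

def normalize(url: str) -> str:
    # drop fragments, lowercase scheme/host, strip trailing slash on non-root paths
    url, _ = urldefrag(url)
    p = urlparse(url)
    path = p.path
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path or "/", p.params, p.query, ""))

def dedupe(urls):
    seen = set()
    out = []
    for u in urls:
        n = normalize(u)
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```

Run every list from methods 1–4 through `dedupe()` before comparing or exporting, so `/a/`, `/a#top`, and `/a` count as one URL.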


Where ProxiesAPI fits (honestly)

URL discovery isn’t just one request — it’s a pipeline:

  • robots.txt fetch
  • sitemap index fetch
  • many sitemap fetches
  • lots of internal page fetches

ProxiesAPI helps when you want a consistent fetch interface across all those steps, especially when you run URL discovery frequently across many domains.


Checklist

  • you tried sitemap.xml and sitemap_index.xml
  • you checked robots.txt for sitemap hints
  • your crawler stays on-host and dedupes URLs
  • you set max_pages so the crawl doesn’t explode
  • you normalized URLs (remove fragments, consistent scheme/host)

Related guides

  • Scrape a WordPress Site via sitemap_index.xml (Python): Crawl, Extract, Dedupe, Export
    A production-grade, sitemap-first WordPress scraper in Python (no guessed selectors): crawl sitemaps, fetch posts, extract clean text + metadata, and export to CSV/JSON.
  • Web Scraping with Python: The Complete 2026 Tutorial
    A from-scratch, production-minded guide to web scraping in Python: requests + BeautifulSoup, pagination, retries, caching, proxies, and a reusable scraper template.
  • How to Scrape IMDb Top 250 with Python (Without Guessing Selectors)
    A real-world IMDb scraping tutorial covering browser-rendered HTML, verified selectors, sample output, and why naive requests can fail.
  • Scrape Wikipedia Article Data at Scale (Tables + Infobox + Links)
    Extract structured fields from many Wikipedia pages (infobox + tables + links) with ProxiesAPI + Python, then save to CSV/JSON.