How to Find All URLs on Any Website: 5 Methods (Sitemaps, Crawling, Search & More)

If you’ve ever tried to find all URLs on a website, you’ve probably hit the same wall:

  • you find the obvious pages (homepage, a couple of categories)
  • then you realize the site has thousands of deep pages
  • and the real question becomes: what’s the most reliable way to discover them all?

This guide is a practical playbook: 5 methods you can combine, depending on the site.

We’ll cover:

  1. Sitemap discovery (sitemap.xml, sitemap index files)
  2. robots.txt discovery (sites often declare sitemaps there)
  3. In-page internal link extraction (quick wins)
  4. Crawling with rules (BFS crawl with dedupe + limits)
  5. Search-based discovery (when the site hides URLs behind JS or forms)

Along the way, you’ll get working Python code you can adapt into a production URL discovery pipeline.

Discover URLs at scale with ProxiesAPI

URL discovery is network-heavy: you’ll hit sitemaps, category pages, pagination, and lots of internal links. ProxiesAPI helps keep large URL discovery runs stable by proxying requests through a single, simple HTTP interface.


Before you start: what “all URLs” really means

A website can expose URLs via multiple channels:

  • public HTML links (discoverable by crawling)
  • sitemaps (often the best, most complete source)
  • parameterized URLs (filters, tracking params, pagination)
  • JS-generated links (harder without a browser)
  • API endpoints (sometimes the real data source)

So the goal is usually:

  • All canonical, indexable URLs you care about (SEO view)
  • or All reachable internal URLs (crawler view)

In this tutorial, we’ll aim for high coverage without junk.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for fetching
  • BeautifulSoup with the lxml parser for parsing HTML

Shared utilities (URL normalization + filtering)

URL discovery goes sideways if you don’t normalize.

Here’s a small utility layer that:

  • converts relative links to absolute
  • strips URL fragments (#section)
  • optionally strips tracking parameters (simple version)
  • filters to same-host URLs

from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode


TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid",
}


def normalize_url(url: str, base: str) -> str | None:
    if not url:
        return None

    abs_url = urljoin(base, url)
    p = urlparse(abs_url)

    # only http(s)
    if p.scheme not in {"http", "https"}:
        return None

    # drop fragments
    p = p._replace(fragment="")

    # strip common tracking params (keep others)
    q = [(k, v) for (k, v) in parse_qsl(p.query, keep_blank_values=True) if k not in TRACKING_PARAMS]
    p = p._replace(query=urlencode(q, doseq=True))

    return urlunparse(p)


def same_host(url: str, root: str) -> bool:
    return urlparse(url).netloc == urlparse(root).netloc
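
A quick sanity check of what the normalizer does (the URLs here are purely illustrative):

base = "https://example.com/blog/"

print(normalize_url("/pricing#plans", base))
# -> https://example.com/pricing

print(normalize_url("post-1?utm_source=newsletter&page=2", base))
# -> https://example.com/blog/post-1?page=2

print(same_host("https://example.com/about", base))
# -> True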

Method 1 — Sitemaps (best coverage, fastest)

If a site provides a sitemap, use it first.

Common URLs:

  • https://example.com/sitemap.xml
  • https://example.com/sitemap_index.xml
  • https://example.com/sitemap/sitemap.xml
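
If you're not sure which location a site uses, you can probe the common candidates. A minimal sketch (the candidate list is convention, not a standard):

import requests

CANDIDATE_SITEMAPS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]


def find_sitemap(root_url: str) -> str | None:
    """Return the first candidate sitemap URL that answers 200, else None."""
    for path in CANDIDATE_SITEMAPS:
        url = root_url.rstrip("/") + path
        try:
            r = requests.get(url, timeout=(10, 30))
        except requests.RequestException:
            continue
        if r.status_code == 200:
            return url
    return None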

Fetch + parse sitemap XML

Sitemaps are XML, so you can parse them with Python's built-in xml.etree.ElementTree.

import requests
import xml.etree.ElementTree as ET

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds


def fetch(url: str) -> str:
    r = requests.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"})
    r.raise_for_status()
    return r.text


def parse_sitemap(xml_text: str) -> list[str]:
    """Returns URLs from <urlset> sitemaps."""
    root = ET.fromstring(xml_text)

    # Namespace-safe parsing: grab all <loc> regardless of namespace
    urls = []
    for loc in root.findall(".//{*}loc"):
        if loc.text:
            urls.append(loc.text.strip())
    return urls


sitemap_url = "https://example.com/sitemap.xml"
xml_text = fetch(sitemap_url)
urls = parse_sitemap(xml_text)
print("sitemap urls:", len(urls))
print(urls[:5])

Handle sitemap indexes

Many large sites use a sitemap index containing links to multiple sitemap files.


def is_sitemap_index(xml_text: str) -> bool:
    # cheap substring check to branch between <urlset> and <sitemapindex>
    return "<sitemapindex" in xml_text


def discover_from_sitemap(url: str, limit: int = 200_000) -> list[str]:
    xml_text = fetch(url)

    if is_sitemap_index(xml_text):
        parts = parse_sitemap(xml_text)  # locs to sub-sitemaps
        out = []
        for part in parts:
            sub = fetch(part)
            out.extend(parse_sitemap(sub))
            if len(out) >= limit:
                break
        return out[:limit]

    return parse_sitemap(xml_text)[:limit]
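
Usage is the same whether the URL points at a plain sitemap or an index. Note that this sketch only descends one level; an index that points at further indexes would need recursion.

urls = discover_from_sitemap("https://example.com/sitemap.xml")
print("discovered:", len(urls))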

ProxiesAPI variant (drop-in network layer)

If you’re doing URL discovery across many sites, the fetch layer becomes the fragile part.

ProxiesAPI gives you a single HTTP endpoint. You pass the target URL and your key:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com/sitemap.xml"

In Python:

import requests

PROXIESAPI_KEY = "API_KEY"  # replace


def fetch_via_proxiesapi(target_url: str) -> str:
    api_url = "http://api.proxiesapi.com/"
    r = requests.get(
        api_url,
        params={"key": PROXIESAPI_KEY, "url": target_url},
        timeout=TIMEOUT,
        headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"},
    )
    r.raise_for_status()
    return r.text

Method 2 — robots.txt (often declares the sitemap)

Even if sitemap.xml is not where you expect, many sites declare it in robots.txt:

  • https://example.com/robots.txt

Example lines:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_posts.xml

Parse it like this:

import re


def discover_sitemaps_from_robots(root_url: str) -> list[str]:
    robots_url = root_url.rstrip("/") + "/robots.txt"
    txt = fetch(robots_url)

    sitemaps = []
    for line in txt.splitlines():
        m = re.match(r"^\s*Sitemap:\s*(\S+)\s*$", line, flags=re.I)
        if m:
            sitemaps.append(m.group(1))
    return sitemaps


root = "https://example.com"
print(discover_sitemaps_from_robots(root))

In practice:

  1. fetch robots.txt
  2. extract all sitemap URLs
  3. parse each sitemap (and indexes)
  4. merge + dedupe
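
Here's a sketch wiring those four steps together, using the helpers defined above:

def discover_all_sitemap_urls(root_url: str) -> list[str]:
    """robots.txt -> sitemap URLs -> parsed URLs, merged and deduped."""
    sitemaps = discover_sitemaps_from_robots(root_url)
    if not sitemaps:
        # fall back to the conventional location
        sitemaps = [root_url.rstrip("/") + "/sitemap.xml"]

    all_urls: set[str] = set()
    for sm in sitemaps:
        try:
            all_urls.update(discover_from_sitemap(sm))
        except (requests.RequestException, ET.ParseError):
            continue  # skip sitemaps that fail to fetch or parse
    return sorted(all_urls)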

Method 3 — In-page internal link extraction (quick wins)

If you just need a list of internal URLs from a few key pages (home, category, search results), parsing <a href> links is straightforward.

from bs4 import BeautifulSoup


def extract_links_from_html(html: str, base_url: str, root_url: str) -> set[str]:
    soup = BeautifulSoup(html, "lxml")
    out: set[str] = set()

    for a in soup.select("a[href]"):
        href = a.get("href")
        u = normalize_url(href, base_url)
        if not u:
            continue
        if same_host(u, root_url):
            out.add(u)

    return out


root = "https://example.com"
html = fetch(root)
links = extract_links_from_html(html, base_url=root, root_url=root)
print("internal links from home:", len(links))
for u in sorted(links)[:10]:
    print(u)

Terminal sanity check

Save the snippets so far as url_discovery.py and run:

python url_discovery.py

Typical output:

internal links from home: 128
https://example.com/about
https://example.com/blog/
https://example.com/pricing
...

Method 4 — Crawl the site (BFS with limits)

Crawling is where you get breadth.

A sensible production crawl should include:

  • dedupe (seen set)
  • a queue (BFS)
  • rate limiting and timeouts
  • scope rules (same host, allowlist paths, extension filters; see the in_scope() sketch after the crawler)
  • max pages to avoid infinite loops

Here’s a simple crawler you can extend.

import time
from collections import deque


@dataclass
class CrawlConfig:
    root: str
    max_pages: int = 2000
    delay_s: float = 0.2


def crawl_internal_urls(cfg: CrawlConfig) -> list[str]:
    q = deque([cfg.root])
    seen: set[str] = set([cfg.root])
    out: list[str] = []

    while q and len(out) < cfg.max_pages:
        url = q.popleft()

        try:
            html = fetch(url)
        except requests.HTTPError as e:
            print("HTTP error", url, e)
            continue
        except requests.RequestException as e:
            print("Request error", url, e)
            continue

        out.append(url)

        new_links = extract_links_from_html(html, base_url=url, root_url=cfg.root)
        for link in new_links:
            if link in seen:
                continue
            seen.add(link)
            q.append(link)

        print("crawled", len(out), "queued", len(q), "seen", len(seen))
        time.sleep(cfg.delay_s)

    return out


cfg = CrawlConfig(root="https://example.com", max_pages=200)
urls = crawl_internal_urls(cfg)
print("total crawled:", len(urls))

Make it scale: crawl with ProxiesAPI

If you’re crawling thousands of pages, your request pipeline should be simple, consistent, and easy to retry.

The easiest swap is: keep everything the same, but use fetch_via_proxiesapi().


def fetch_crawl(url: str) -> str:
    # return fetch(url)                 # direct
    return fetch_via_proxiesapi(url)    # via ProxiesAPI

Then call fetch_crawl() instead of fetch() inside crawl_internal_urls().


Method 5 — Search-based discovery (when crawling isn’t enough)

Some sites:

  • hide important URLs behind JS
  • require form submits
  • block bots from deep paths

In these cases, search-based discovery can reveal URLs your crawler misses.

Two common approaches:

  1. Keyword search: site:example.com some keyword
  2. Pattern search: site:example.com inurl:product

You can do this manually, or via a search API.

If you do it programmatically, treat it as a seed generator: you still need to dedupe and normalize.
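
However you obtain the results, feed them through the same normalization layer before crawling. A sketch, assuming search_results is a list of URL strings returned by whatever search API you use:

def seeds_from_search(search_results: list[str], root_url: str) -> set[str]:
    """Normalize and scope-filter raw search-result URLs."""
    seeds: set[str] = set()
    for raw in search_results:
        u = normalize_url(raw, base=root_url)
        if u and same_host(u, root_url):
            seeds.add(u)
    return seeds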


Putting it together: a practical URL discovery pipeline

A robust workflow looks like this:

  1. robots.txt → sitemap URLs
  2. sitemap parsing → bulk URLs
  3. crawl key sections (category pages, pagination) to catch non-sitemap URLs
  4. search-based seeds when coverage is low
  5. normalize + dedupe + export
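
A sketch of that pipeline, reusing the helpers from the previous sections (discover_all_sitemap_urls is the robots-to-sitemaps combiner from Method 2):

def discover_site_urls(root_url: str, crawl_pages: int = 500) -> list[str]:
    urls: set[str] = set()

    # steps 1-2: robots.txt -> sitemaps -> bulk URLs
    urls.update(discover_all_sitemap_urls(root_url))

    # step 3: shallow crawl to catch non-sitemap URLs
    urls.update(crawl_internal_urls(CrawlConfig(root=root_url, max_pages=crawl_pages)))

    # step 4: merge in search-based seeds here if coverage is low

    # step 5: normalize + dedupe
    cleaned = {u for u in (normalize_url(x, base=root_url) for x in urls) if u}
    return sorted(cleaned)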

Export to JSONL (easy for pipelines)

import json

with open("discovered_urls.jsonl", "w", encoding="utf-8") as f:
    for u in sorted(set(urls)):
        f.write(json.dumps({"url": u}, ensure_ascii=False) + "\n")

print("wrote discovered_urls.jsonl")

Common pitfalls (and how to avoid them)

  • Infinite URL spaces: calendars, faceted search, ?page= loops → add caps and allowlists.
  • Duplicate content: same page with many params → strip tracking params; consider canonical URLs.
  • Non-HTML URLs: PDFs, images → filter by extension or Content-Type.
  • Overloading the server: add delays, respect robots, and keep max_pages sane.
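
For the non-HTML case, a Content-Type check at fetch time is more reliable than extension filtering alone. A sketch:

def fetch_html_only(url: str) -> str | None:
    r = requests.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"})
    r.raise_for_status()
    if "text/html" not in r.headers.get("Content-Type", ""):
        return None  # skip PDFs, images, etc.
    return r.text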

FAQ

Is sitemap.xml always complete?

No. Many sites omit URLs (especially private or user-generated pages). But it’s the best place to start.

Can I find URLs behind JavaScript?

Not always with plain requests. You may need a browser (e.g., Playwright) or the API the site itself uses.
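
If you do need rendering, a minimal Playwright sketch (after pip install playwright and playwright install chromium) looks like this:

from playwright.sync_api import sync_playwright


def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html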

What’s the fastest method?

If available: robots.txt → sitemaps.


Where ProxiesAPI fits (honestly)

URL discovery is mostly a networking problem:

  • many requests
  • many retries
  • inconsistent responses

ProxiesAPI helps by giving you a simple “fetch this URL” interface you can plug into any crawler.

You still need to be polite (delays, caps) and scope your crawl.

Related guides

  • Scrape a WordPress Site via sitemap_index.xml (Python): Crawl, Extract, Dedupe, Export
    A production-grade, sitemap-first WordPress scraper in Python (no guessed selectors): crawl sitemaps, fetch posts, extract clean text + metadata, and export to CSV/JSON.
  • What Is Web Scraping? A Plain-English Guide for 2026 (With Real Examples)
    A beginner-friendly explanation of what web scraping is, how it differs from APIs, common use cases, risks (blocks/legal), and a real end-to-end Python example with ProxiesAPI.
  • Scrape Flight Prices from Google Flights (Python + ProxiesAPI)
    Build a routes→prices dataset from Google Flights with pagination-safe requests, retries, and a proof screenshot. Includes export to CSV/JSON and pragmatic anti-blocking guidance.
  • How to Scrape IMDb Top 250 with Python (Without Guessing Selectors)
    A real-world IMDb scraping tutorial covering browser-rendered HTML, verified selectors, sample output, and why naive requests can fail.