How to Find All URLs on Any Website: 5 Methods (Sitemaps, Crawling, Search & More)
If you’ve ever tried to find all URLs on a website, you’ve probably hit the same wall:
- you find the obvious pages (homepage, a couple of categories)
- then you realize the site has thousands of deep pages
- and the real question becomes: what’s the most reliable way to discover them all?
This guide is a practical playbook: 5 methods you can combine, depending on the site.
We’ll cover:
- Sitemap discovery (sitemap.xml, sitemap index files)
- robots.txt discovery (sites often declare sitemaps there)
- In-page internal link extraction (quick wins)
- Crawling with rules (BFS crawl with dedupe + limits)
- Search-based discovery (when the site hides URLs behind JS or forms)
Along the way, you’ll get working Python code you can adapt into a production URL discovery pipeline.
URL discovery is network-heavy: you’ll hit sitemaps, category pages, pagination, and lots of internal links. ProxiesAPI helps keep large URL discovery runs stable by proxying requests through a single, simple HTTP interface.
Before you start: what “all URLs” really means
A website can expose URLs via multiple channels:
- public HTML links (discoverable by crawling)
- sitemaps (often the best, most complete source)
- parameterized URLs (filters, tracking params, pagination)
- JS-generated links (harder without a browser)
- API endpoints (sometimes the real data source)
So the goal is usually:
- All canonical, indexable URLs you care about (SEO view)
- or All reachable internal URLs (crawler view)
In this tutorial, we’ll aim for high coverage without junk.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for fetching
- BeautifulSoup (lxml) for parsing
Shared utilities (URL normalization + filtering)
URL discovery goes sideways if you don’t normalize.
Here’s a small utility layer that:
- converts relative links to absolute
- strips URL fragments (#section)
- optionally strips tracking parameters (simple version)
- filters to same-host URLs
from __future__ import annotations
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode
TRACKING_PARAMS = {
"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
"gclid", "fbclid",
}
def normalize_url(url: str, base: str) -> str | None:
if not url:
return None
abs_url = urljoin(base, url)
p = urlparse(abs_url)
# only http(s)
if p.scheme not in {"http", "https"}:
return None
# drop fragments
p = p._replace(fragment="")
# strip common tracking params (keep others)
q = [(k, v) for (k, v) in parse_qsl(p.query, keep_blank_values=True) if k not in TRACKING_PARAMS]
p = p._replace(query=urlencode(q, doseq=True))
return urlunparse(p)
def same_host(url: str, root: str) -> bool:
return urlparse(url).netloc == urlparse(root).netloc
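Quick sanity check, using the helpers above (example.com is just a placeholder):
base = "https://example.com/blog/"
print(normalize_url("../pricing?utm_source=news#plans", base))
# -> https://example.com/pricing
print(same_host("https://example.com/about", "https://example.com"))
# -> True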
Method 1 — Sitemaps (best coverage, fastest)
If a site provides a sitemap, use it first.
Common URLs:
- https://example.com/sitemap.xml
- https://example.com/sitemap_index.xml
- https://example.com/sitemap/sitemap.xml
Fetch + parse sitemap XML
Sitemaps are XML. You can parse with xml.etree.ElementTree.
import requests
import xml.etree.ElementTree as ET
TIMEOUT = (10, 30)  # (connect timeout, read timeout) in seconds
def fetch(url: str) -> str:
r = requests.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"})
r.raise_for_status()
return r.text
def parse_sitemap(xml_text: str) -> list[str]:
"""Returns URLs from <urlset> sitemaps."""
root = ET.fromstring(xml_text)
# Namespace-safe parsing: grab all <loc> regardless of namespace
urls = []
for loc in root.findall(".//{*}loc"):
if loc.text:
urls.append(loc.text.strip())
return urls
sitemap_url = "https://example.com/sitemap.xml"
xml_text = fetch(sitemap_url)
urls = parse_sitemap(xml_text)
print("sitemap urls:", len(urls))
print(urls[:5])
Handle sitemap indexes
Many large sites use a sitemap index containing links to multiple sitemap files.
def is_sitemap_index(xml_text: str) -> bool:
return "<sitemapindex" in xml_text
def discover_from_sitemap(url: str, limit: int = 200_000) -> list[str]:
xml_text = fetch(url)
if is_sitemap_index(xml_text):
parts = parse_sitemap(xml_text) # locs to sub-sitemaps
out = []
for part in parts:
sub = fetch(part)
out.extend(parse_sitemap(sub))
if len(out) >= limit:
break
return out[:limit]
return parse_sitemap(xml_text)[:limit]
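If you don't know which sitemap URL a site uses, a small probe over the common locations listed earlier works well. This is a sketch that assumes the fetch() and discover_from_sitemap() helpers above; the candidate list is illustrative, not exhaustive.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]

def find_sitemap_urls(root_url: str) -> list[str]:
    # Try common sitemap locations; return URLs from the first one that parses.
    for path in COMMON_SITEMAP_PATHS:
        candidate = root_url.rstrip("/") + path
        try:
            return discover_from_sitemap(candidate)
        except (requests.RequestException, ET.ParseError):
            continue  # 404, network error, or non-XML response: try the next candidate
    return []

print("sitemap urls:", len(find_sitemap_urls("https://example.com")))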
ProxiesAPI variant (drop-in network layer)
If you’re doing URL discovery across many sites, the fetch layer becomes the fragile part.
ProxiesAPI gives you a single HTTP endpoint. You pass the target URL and your key:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com/sitemap.xml"
In Python:
import requests
PROXIESAPI_KEY = "API_KEY" # replace
def fetch_via_proxiesapi(target_url: str) -> str:
api_url = "http://api.proxiesapi.com/"
r = requests.get(
api_url,
params={"key": PROXIESAPI_KEY, "url": target_url},
timeout=TIMEOUT,
headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"},
)
r.raise_for_status()
return r.text
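The rest of Method 1 stays the same; only the network call changes:
# Same sitemap parsing as before, just fetched through ProxiesAPI.
xml_text = fetch_via_proxiesapi("https://example.com/sitemap.xml")
urls = parse_sitemap(xml_text)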
Method 2 — robots.txt (often declares the sitemap)
Even if sitemap.xml isn't where you expect, many sites declare their sitemap locations in robots.txt:
https://example.com/robots.txt
Example lines:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_posts.xml
Parse it like this:
import re
def discover_sitemaps_from_robots(root_url: str) -> list[str]:
robots_url = root_url.rstrip("/") + "/robots.txt"
txt = fetch(robots_url)
sitemaps = []
for line in txt.splitlines():
m = re.match(r"^\s*Sitemap:\s*(\S+)\s*$", line, flags=re.I)
if m:
sitemaps.append(m.group(1))
return sitemaps
root = "https://example.com"
print(discover_sitemaps_from_robots(root))
In practice:
- fetch robots.txt
- extract all sitemap URLs
- parse each sitemap (and indexes)
- merge + dedupe (see the sketch below)
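Here's a minimal sketch of that sequence, built on the discover_sitemaps_from_robots() and discover_from_sitemap() helpers above:
def discover_urls_via_robots(root_url: str) -> list[str]:
    # robots.txt -> sitemap URLs -> page URLs, merged and deduped
    all_urls: set[str] = set()
    for sitemap_url in discover_sitemaps_from_robots(root_url):
        try:
            all_urls.update(discover_from_sitemap(sitemap_url))
        except requests.RequestException as e:
            print("skipping sitemap", sitemap_url, e)
    return sorted(all_urls)

print("urls via robots.txt:", len(discover_urls_via_robots("https://example.com")))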
Method 3 — Extract internal links from a page (quick win)
If you just need a list of internal URLs from a few key pages (home, category, search results), parsing <a href> is easy.
from bs4 import BeautifulSoup
def extract_links_from_html(html: str, base_url: str, root_url: str) -> set[str]:
soup = BeautifulSoup(html, "lxml")
out: set[str] = set()
for a in soup.select("a[href]"):
href = a.get("href")
u = normalize_url(href, base_url)
if not u:
continue
if same_host(u, root_url):
out.add(u)
return out
root = "https://example.com"
html = fetch(root)
links = extract_links_from_html(html, base_url=root, root_url=root)
print("internal links from home:", len(links))
print(list(sorted(links))[:10])
Terminal sanity check
python url_discovery.py
Typical output:
internal links from home: 128
https://example.com/about
https://example.com/blog/
https://example.com/pricing
...
Method 4 — Crawl the site (BFS with limits)
Crawling is where you get breadth.
A sensible production crawl should include:
- dedupe (a seen set)
- a queue (BFS)
- rate limiting and timeouts
- scope rules (same host, allowlist paths, extension filters)
- max pages to avoid infinite loops
Here’s a simple crawler you can extend.
import time
from collections import deque
@dataclass
class CrawlConfig:
root: str
max_pages: int = 2000
delay_s: float = 0.2
def crawl_internal_urls(cfg: CrawlConfig) -> list[str]:
    q = deque([cfg.root])              # BFS frontier
    seen: set[str] = set([cfg.root])   # dedupe: never enqueue the same URL twice
    out: list[str] = []                # successfully fetched URLs
    while q and len(out) < cfg.max_pages:
        url = q.popleft()
        try:
            html = fetch(url)
        except requests.HTTPError as e:
            print("HTTP error", url, e)
            continue
        except requests.RequestException as e:
            print("Request error", url, e)
            continue
        out.append(url)
        # extract same-host links and enqueue any we haven't seen yet
        new_links = extract_links_from_html(html, base_url=url, root_url=cfg.root)
        for link in new_links:
            if link in seen:
                continue
            seen.add(link)
            q.append(link)
        print("crawled", len(out), "queued", len(q), "seen", len(seen))
        time.sleep(cfg.delay_s)        # politeness delay between requests
    return out
cfg = CrawlConfig(root="https://example.com", max_pages=200)
urls = crawl_internal_urls(cfg)
print("total crawled:", len(urls))
Make it scale: crawl with ProxiesAPI
If you’re crawling thousands of pages, your request pipeline should be simple, consistent, and easy to retry.
The easiest swap is: keep everything the same, but use fetch_via_proxiesapi().
def fetch_crawl(url: str) -> str:
# return fetch(url) # direct
return fetch_via_proxiesapi(url) # via ProxiesAPI
Then use fetch_crawl() inside crawl_internal_urls().
Method 5 — Search-based discovery (when crawling isn’t enough)
Some sites:
- hide important URLs behind JS
- require form submits
- block bots from deep paths
In these cases, search-based discovery can reveal URLs your crawler misses.
Two common approaches:
- Search for a keyword: site:example.com some keyword
- Search for a pattern: site:example.com inurl:product
You can do this manually, or via a search API.
If you do it programmatically, treat it as a seed generator: you still need to dedupe and normalize.
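For example, if a search API (or a manual search) gives you a list of result URLs, a small sketch for turning them into clean crawl seeds, reusing the normalization helpers from earlier, looks like this:
def seeds_from_search_results(result_urls: list[str], root_url: str) -> list[str]:
    # Normalize, keep same-host URLs only, and dedupe before feeding the crawler.
    seeds: set[str] = set()
    for raw in result_urls:
        u = normalize_url(raw, root_url)
        if u and same_host(u, root_url):
            seeds.add(u)
    return sorted(seeds)

# Feed them into the BFS queue from Method 4, e.g.:
# q = deque(seeds_from_search_results(results, "https://example.com"))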
Putting it together: a practical URL discovery pipeline
A robust workflow looks like this:
- robots.txt → sitemap URLs
- sitemap parsing → bulk URLs
- crawl key sections (category pages, pagination) to catch non-sitemap URLs
- search-based seeds when coverage is low
- normalize + dedupe + export (see the pipeline sketch below)
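Here's one way to wire those steps together. It's a sketch assembled from the helpers defined earlier in this guide (robots.txt discovery, sitemap parsing, the BFS crawler), not a drop-in library:
def discover_all_urls(root_url: str, max_crawl_pages: int = 500) -> list[str]:
    discovered: set[str] = set()

    # 1-2. robots.txt -> sitemaps -> bulk URLs
    try:
        sitemaps = discover_sitemaps_from_robots(root_url)
    except requests.RequestException:
        sitemaps = []
    for sitemap_url in sitemaps:
        try:
            discovered.update(discover_from_sitemap(sitemap_url))
        except requests.RequestException:
            continue

    # 3. crawl to catch pages the sitemaps miss
    cfg = CrawlConfig(root=root_url, max_pages=max_crawl_pages)
    discovered.update(crawl_internal_urls(cfg))

    # 4. add search-based seeds here if coverage still looks low

    # 5. normalize + dedupe (drop anything that fails normalization or leaves the host)
    clean = {normalize_url(u, root_url) for u in discovered}
    return sorted(u for u in clean if u and same_host(u, root_url))

urls = discover_all_urls("https://example.com")
print("total discovered:", len(urls))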
Export to JSONL (easy for pipelines)
import json
with open("discovered_urls.jsonl", "w", encoding="utf-8") as f:
for u in sorted(set(urls)):
f.write(json.dumps({"url": u}, ensure_ascii=False) + "\n")
print("wrote discovered_urls.jsonl")
Common pitfalls (and how to avoid them)
- Infinite URL spaces: calendars, faceted search, ?page= loops → add caps and allowlists.
- Duplicate content: same page with many params → strip tracking params; consider canonical URLs.
- Non-HTML URLs: PDFs, images → filter by extension or Content-Type (see the fetch_html sketch below).
- Overloading the server: add delays, respect robots.txt, and keep max_pages sane.
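For the non-HTML pitfall, one hedged option is a fetch variant that checks Content-Type before handing the body to a parser:
def fetch_html(url: str) -> str | None:
    # Like fetch(), but returns None for non-HTML responses (PDFs, images, etc.).
    r = requests.get(url, timeout=TIMEOUT, headers={"User-Agent": "Mozilla/5.0 (URL-Discovery-Bot)"})
    r.raise_for_status()
    if "text/html" not in r.headers.get("Content-Type", ""):
        return None
    return r.text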
FAQ
Is sitemap.xml always complete?
No. Many sites omit URLs (especially private or user-generated pages). But it’s the best place to start.
Can I find URLs behind JavaScript?
Sometimes not with pure requests. You may need a browser (Playwright) or an API the site uses.
What’s the fastest method?
If available: robots.txt → sitemaps.
Where ProxiesAPI fits (honestly)
URL discovery is mostly a networking problem:
- many requests
- many retries
- inconsistent responses
ProxiesAPI helps by giving you a simple “fetch this URL” interface you can plug into any crawler.
You still need to be polite (delays, caps) and scope your crawl.