Scrape Government Contract Data from SAM.gov (Opportunities + Details)

SAM.gov is the canonical place to find US government contracting opportunities.

The catch: it’s not a static “one page = one dataset” site. A real pipeline usually looks like:

  1. query / search
  2. paginate results
  3. fetch details per opportunity
  4. normalize into a clean schema
  5. export to JSON/CSV (or insert into a DB)

In this guide we’ll build exactly that in Python, using ProxiesAPI to make the network layer resilient.

Screenshot: SAM.gov opportunities search page (we'll crawl results and follow detail pages)

Make high-volume SAM.gov crawls reliable with ProxiesAPI

Opportunity search is bursty (many pages + details). ProxiesAPI helps stabilize the request layer with IP rotation and consistent fetch behavior so your scraper can focus on parsing and normalization.


What we’re scraping (shape of the crawl)

At a high level, we want to produce records like:

  • notice id
  • title
  • agency
  • posted date
  • response deadline
  • place of performance
  • set-aside / NAICS (if available)
  • URL
  • detail text/attachments links (if needed)

Depending on SAM.gov’s current implementation, the “search” and “detail” experiences may be:

  • server-rendered HTML pages
  • HTML pages that call JSON APIs behind the scenes

Best practice: prefer official JSON endpoints if they’re stable and publicly accessible. If not, scrape HTML.

This tutorial shows a hybrid approach:

  • start from the search page (so you’re not guessing)
  • if you can identify a JSON API call, use it
  • otherwise parse HTML with conservative selectors
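
The "JSON if available, otherwise HTML" decision can be made per response. The helper below is a minimal sketch with a hypothetical name (`classify_body`). It trusts a JSON content type when one is supplied and otherwise checks whether the body actually parses as JSON. The tutorial's fetch function returns only text, so the content type argument is optional.

```python
import json


def classify_body(body: str, content_type: str = "") -> str:
    """Return "json" when the response looks like JSON, else "html"."""
    if "json" in content_type.lower():
        return "json"
    stripped = body.lstrip()
    # Only attempt a parse when the body starts like a JSON object or array.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    return "html"


print(classify_body('{"results": []}'))
print(classify_body("<html><body>results</body></html>"))
```

A crawler can branch on the result: hand JSON bodies to `json.loads` and everything else to the HTML parser.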

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dotenv

ProxiesAPI fetch wrapper (retries + timeouts)

You’ll need a ProxiesAPI key:

export PROXIESAPI_KEY="YOUR_KEY"

import os
import random
import urllib.parse

import requests
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY")
TIMEOUT = (10, 40)  # (connect, read) timeouts in seconds

# One session reuses connections across requests.
session = requests.Session()

UA_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
]


def build_proxiesapi_url(target_url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Missing PROXIESAPI_KEY")

    # Example format; adjust for your ProxiesAPI plan.
    # urlencode percent-encodes target_url, so its own query string survives.
    return "https://api.proxiesapi.com/?" + urllib.parse.urlencode(
        {
            "auth_key": PROXIESAPI_KEY,
            "url": target_url,
            # Optional (if supported):
            # "country": "US",
            # "render": "false",
        }
    )


# Retry only network/HTTP errors. A missing key is a configuration error
# and should fail immediately instead of being retried with backoff.
@retry(
    retry=retry_if_exception_type(requests.RequestException),
    stop=stop_after_attempt(6),
    wait=wait_exponential_jitter(initial=1, max=25),
    reraise=True,  # surface the last requests error, not tenacity.RetryError
)
def fetch(url: str) -> str:
    headers = {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
    }
    r = session.get(build_proxiesapi_url(url), headers=headers, timeout=TIMEOUT)
    r.raise_for_status()  # 4xx/5xx raise HTTPError, which triggers a retry
    return r.text
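
One detail worth checking in a wrapper like this is that the target URL's own query string survives being nested inside the ProxiesAPI URL. `urllib.parse.urlencode` percent-encodes it, and the sketch below confirms that by round-tripping. The key, the target URL, and its `page` parameter are made-up placeholders, and the endpoint is the same example format as above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

target = "https://sam.gov/search/?index=opp&page=2"  # placeholder target
wrapped = "https://api.proxiesapi.com/?" + urlencode(
    {"auth_key": "DEMO_KEY", "url": target}  # DEMO_KEY is not a real key
)

# The target's "&" and "=" are percent-encoded, so they cannot be mistaken
# for parameters of the outer URL.
params = parse_qs(urlsplit(wrapped).query)
round_tripped = params["url"][0]
print(wrapped)
print(round_tripped == target)
```

If the round trip ever fails, the target URL was probably concatenated by hand instead of encoded.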

Step 1: Pick a real SAM.gov query URL

Open SAM.gov → search for opportunities with a narrow query.

Examples of filters you might apply:

  • keyword: “cybersecurity”
  • posted date: last 30 days
  • place of performance: a state

Copy the URL from your browser.

You’ll end up with a URL like (placeholder):

SEARCH_URL = "https://sam.gov/search/?index=opp"  # replace with your real query URL
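
Before crawling, it can help to confirm that the URL you copied really encodes the filters you chose. The following is a minimal sketch using only the standard library. The helper names and the parameter names in the sample URL (`keywords`, `page`) are assumptions for illustration, so check them against your own copied URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def describe_query(url: str) -> dict[str, str]:
    """Return the query parameters of a URL as a dict (last value wins)."""
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def with_param(url: str, key: str, value: str) -> str:
    """Return the URL with one query parameter set or replaced."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params[key] = value
    return urlunsplit(parts._replace(query=urlencode(params)))


sample = "https://sam.gov/search/?index=opp&keywords=cybersecurity"  # hypothetical
print(describe_query(sample))
print(with_param(sample, "page", "2"))
```

Printing the parsed parameters once per run also gives you a record of which query produced a given dataset.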

Step 2: Parse search results and collect detail links

We’ll parse the search results HTML and try to identify links that lead to an opportunity’s detail page.

Because class names on modern web apps can be unstable, the robust approach is:

  • look for anchors whose href contains stable tokens (/opp/, opportunity, notice, etc.)
  • then validate by fetching one and confirming it contains expected fields

from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://sam.gov"


def parse_results(html: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")

    links = []

    # Best-effort selectors: find anchors likely pointing to opportunity detail pages.
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue

        # Heuristic patterns; adjust after inspecting HTML.
        if any(tok in href for tok in ["/opp/", "opportunity", "notice", "/awards/"]):
            url = urljoin(BASE, href)
            text = a.get_text(" ", strip=True)[:200]
            links.append({"url": url, "anchor": text})

    # Deduplicate by URL
    seen = set()
    out = []
    for x in links:
        if x["url"] in seen:
            continue
        seen.add(x["url"])
        out.append(x)

    # Pagination: try rel=next, then a "Next" button
    next_url = None
    rel_next = soup.select_one("a[rel='next']")
    if rel_next and rel_next.get("href"):
        next_url = urljoin(BASE, rel_next["href"])
    else:
        for a in soup.select("a[href]"):
            if a.get_text(" ", strip=True).lower() in {"next", "next page"}:
                next_url = urljoin(BASE, a["href"])
                break

    return out, next_url

This finds candidates. The next step is to fetch detail pages and parse what you need.
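
Broad token matching will also pick up navigation and help links. Once you have seen a few real detail URLs, a stricter pattern can cut the noise. The sketch below assumes detail URLs have the path shape /opp/<id>/view. That shape is an assumption to verify against your own results, and `filter_detail_links` is an illustrative helper, not part of the tutorial code.

```python
import re
from urllib.parse import urlsplit

# Assumed shape of a detail URL path: /opp/<id>/view (verify before relying on it).
DETAIL_PATH = re.compile(r"^/opp/([A-Za-z0-9_-]+)/view/?$")


def filter_detail_links(candidates: list[dict]) -> list[dict]:
    """Keep candidates whose path matches DETAIL_PATH and attach the captured id."""
    kept = []
    for cand in candidates:
        m = DETAIL_PATH.match(urlsplit(cand["url"]).path)
        if m:
            kept.append({**cand, "opp_id": m.group(1)})
    return kept


# Made-up candidates in the same shape parse_results returns.
candidates = [
    {"url": "https://sam.gov/opp/abc123/view", "anchor": "Example notice"},
    {"url": "https://sam.gov/content/opportunities", "anchor": "About"},
]
print(filter_detail_links(candidates))
```

Matching on the parsed path rather than the raw string means query strings and fragments do not affect the result.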


Step 3: Parse fields from an opportunity detail page

On a detail page, look for stable elements:

  • JSON-LD (application/ld+json)
  • meta tags
  • headings/labels
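
The first two items can be checked mechanically. The sketch below collects JSON-LD blocks and meta tags using only the standard library, so it runs without the tutorial's dependencies. With BeautifulSoup the same idea is `soup.select('script[type="application/ld+json"]')`. Whether a given SAM.gov page exposes either is not guaranteed, and the class name and sample HTML are illustrative.

```python
import json
from html.parser import HTMLParser


class StructuredDataParser(HTMLParser):
    """Collect JSON-LD blocks and <meta> name/property -> content pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}
        self.json_ld: list = []
        self._in_ld = False
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            key = a.get("name") or a.get("property")
            if key and a.get("content") is not None:
                self.meta[key] = a["content"]
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self._in_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it rather than fail the page


sample = """<html><head>
<meta property="og:title" content="Example notice">
<script type="application/ld+json">{"@type": "Thing", "name": "Example"}</script>
</head><body></body></html>"""

parser = StructuredDataParser()
parser.feed(sample)
print(parser.meta, parser.json_ld)
```

If both collections come back empty on real pages, fall back to the heading and label approach.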

Here’s a parser that:

  • pulls the page title
  • searches text for common labels
  • keeps a raw_text_excerpt for debugging

import re
from bs4 import BeautifulSoup


def find_label_value(text: str, label: str) -> str | None:
    # Very conservative: "Label: value" patterns
    # Works surprisingly well on many detail pages.
    pat = rf"{re.escape(label)}\s*[:\-]\s*(.+)"
    m = re.search(pat, text, flags=re.IGNORECASE)
    if not m:
        return None
    # keep only the first line of the match (defensive: "." already stops at a newline)
    val = m.group(1).strip()
    val = val.split("\n")[0].strip()
    return val[:300]


def parse_detail(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    title = soup.select_one("title")
    title_text = title.get_text(" ", strip=True) if title else None

    page_text = soup.get_text("\n", strip=True)

    record = {
        "url": url,
        "title": title_text,
        "notice_id": find_label_value(page_text, "Notice ID") or find_label_value(page_text, "Solicitation Number"),
        "agency": find_label_value(page_text, "Agency") or find_label_value(page_text, "Organization"),
        "posted_date": find_label_value(page_text, "Posted Date") or find_label_value(page_text, "Posted"),
        "response_deadline": find_label_value(page_text, "Response Date") or find_label_value(page_text, "Response Deadline"),
        "naics": find_label_value(page_text, "NAICS") or find_label_value(page_text, "NAICS Code"),
        "set_aside": find_label_value(page_text, "Set-Aside") or find_label_value(page_text, "Set Aside"),
        "place_of_performance": find_label_value(page_text, "Place of Performance"),
        "raw_text_excerpt": page_text[:1000],
    }

    return record

This is intentionally “best effort.” SAM.gov’s UI evolves, so after your first run you’ll refine label keys and selectors based on what you actually see in the HTML.


Step 4: Crawl: results → details → export

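import csv
import json
import time


def crawl(search_url: str, max_pages: int = 3, max_items: int = 50, delay: float = 1.0) -> list[dict]:
    out: list[dict] = []
    seen = set()  # detail URLs already attempted, across all result pages

    url = search_url
    page = 0

    while url and page < max_pages and len(out) < max_items:
        page += 1
        html = fetch(url)
        items, next_url = parse_results(html)
        print(f"page {page}: candidates={len(items)}")

        for it in items:
            if it["url"] in seen:
                continue
            seen.add(it["url"])

            try:
                detail_html = fetch(it["url"])
            except Exception as exc:
                # Retries are exhausted: skip this item and keep crawling.
                print(f"skip {it['url']}: {exc}")
                continue

            rec = parse_detail(detail_html, it["url"])
            out.append(rec)

            if len(out) >= max_items:
                break
            time.sleep(delay)  # brief pause between detail requests

        url = next_url

    return out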


def export_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def export_csv(rows: list[dict], path: str) -> None:
    if not rows:
        raise ValueError("No rows")
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        w.writeheader()
        w.writerows(rows)


if __name__ == "__main__":
    SEARCH_URL = "PASTE_YOUR_SAM_GOV_SEARCH_URL_HERE"
    rows = crawl(SEARCH_URL, max_pages=2, max_items=25)
    export_json(rows, "sam_opportunities.json")
    export_csv(rows, "sam_opportunities.csv")
    print("wrote", len(rows), "opportunities")

Hardening tips for SAM.gov

Use narrower queries

Start narrow (keyword + agency + date) so your first run is debuggable.

Persist intermediate state

Write discovered opportunity URLs to disk so you can resume.
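
A minimal sketch of resumable state, assuming a JSON Lines file with one discovered URL per line. The file name and helper names are hypothetical. Appending on discovery means a crash loses at most the URL being written.

```python
import json
from pathlib import Path


def load_seen(path: Path) -> set[str]:
    """Read previously recorded URLs from a JSON Lines file (one object per line)."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["url"] for line in f if line.strip()}


def record_url(path: Path, url: str, seen: set[str]) -> bool:
    """Append the URL to the file if it is new; return True when it was added."""
    if url in seen:
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"url": url}) + "\n")
    seen.add(url)
    return True


state_file = Path("discovered_urls.jsonl")  # hypothetical file name
seen = load_seen(state_file)
print(f"resuming with {len(seen)} known URLs")
```

In the crawl loop, calling `record_url` before fetching a detail page would replace the in-memory `seen` check.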

Add caching

Cache fetch(url) responses (even a simple on-disk cache, such as the diskcache package) to avoid re-fetching during development.
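
If you would rather not add a dependency, a file-per-URL cache is enough for development. The sketch below is one possible shape, with a hypothetical helper name. It takes the fetch function as an argument, has no expiry, and keys files by a SHA-256 hash of the URL.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetcher: Callable[[str], str], cache_dir: Path) -> str:
    """Return the cached body for a URL, calling fetcher(url) only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetcher(url)
    path.write_text(body, encoding="utf-8")
    return body
```

Usage would look like `cached_fetch(url, fetch, Path(".cache"))`. Delete the directory whenever you want fresh responses.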

Watch for API calls

Open DevTools → Network while loading a results page. If you see a stable JSON endpoint returning opportunity cards, prefer that instead of parsing HTML.


Where ProxiesAPI fits (honestly)

SAM.gov scraping is naturally “many requests”:

  • results pages
  • detail pages

Even moderate crawls can become flaky if IP reputation or rate patterns trip defenses. ProxiesAPI helps you keep the transport layer stable so your crawler can focus on parsing, normalization, and export.


Next upgrades

  • normalize dates to ISO-8601
  • store to SQLite/Postgres
  • enrich with agency metadata
  • schedule daily crawls for new opportunities