Scrape Government Contract Data from SAM.gov (Opportunities + Details)

SAM.gov (the System for Award Management) is the official US government portal where federal contract opportunities are posted. If you’re building a market intelligence dataset (opportunity titles, NAICS codes, agencies, due dates, attachments, etc.), you’ll usually need to collect:

  • search results (lists of opportunities)
  • detail pages (the full opportunity record)

In this tutorial we’ll build a repeatable dataset builder:

  • fetch results using a stable search URL pattern (configurable)
  • paginate
  • extract opportunity URLs/IDs
  • fetch + parse details
  • export clean JSON and CSV

SAM.gov contract opportunities (we’ll crawl results → then details)

Make SAM.gov collection more reliable with ProxiesAPI

Government portals can be slow and rate-limited. ProxiesAPI can help keep long-running opportunity crawls stable when you’re collecting many pages and detail URLs.


A reality check: SAM.gov has APIs (use them if you can)

Before scraping, check whether the official SAM.gov APIs cover your use case; for contract opportunities there is a public Get Opportunities API.

Scraping is still useful when:

  • you need fields not exposed in the API
  • you’re prototyping quickly
  • you need to validate UI-visible data

But if the API works for your requirements, it’ll be more stable long-term.

This guide focuses on building a scraping pipeline that you can adapt, not on promising permanence.
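
If you want to sanity-check the API route first, a quick probe is cheap. The endpoint and parameter names below (api.sam.gov, postedFrom/postedTo) are assumptions based on the public Get Opportunities docs; verify them against the current SAM.gov API documentation and your API key before relying on this.

# Hedged sketch: probe the official Get Opportunities API before scraping.
# The endpoint and parameter names are assumptions -- confirm them against
# the current SAM.gov API docs, and request an API key via your SAM.gov account.
import os

import requests

SAM_API_KEY = os.getenv("SAM_API_KEY")


def probe_opportunities_api() -> None:
    resp = requests.get(
        "https://api.sam.gov/opportunities/v2/search",  # assumed endpoint
        params={
            "api_key": SAM_API_KEY,
            "postedFrom": "01/01/2024",  # assumed MM/DD/YYYY date format
            "postedTo": "01/31/2024",
            "limit": 10,
        },
        timeout=30,
    )
    print(resp.status_code)
    if resp.ok:
        print(list(resp.json().keys()))  # inspect the response shape yourself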


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

The code below uses modern type hints (list[str], str | None), so run it on Python 3.10 or newer.

Step 1: Networking layer (timeouts + retries)

import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 40)  # (connect timeout, read timeout) in seconds

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


class FetchError(Exception):
    pass


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(FetchError),
)
def fetch_html(session: requests.Session, url: str) -> FetchResult:
    r = session.get(url, headers=DEFAULT_HEADERS, timeout=TIMEOUT)

    if r.status_code in (403, 429):
        raise FetchError(f"blocked: {r.status_code}")
    if r.status_code >= 500:
        raise FetchError(f"server error: {r.status_code}")

    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> None:
    time.sleep(random.uniform(min_s, max_s))

Step 2: Start from a known-good search URL (filters)

SAM.gov search is filter-driven. The most reliable approach is:

  1. set your filters in the UI
  2. copy the resulting URL
  3. use that as START_URL

That way you’re not guessing query parameters.

Examples of filters you might apply:

  • posted date range (last 30 days)
  • NAICS code(s)
  • set-aside type
  • place of performance

For the scraper, we treat the start URL as a configurable input:

START_URL = "PASTE_SAMGOV_OPPORTUNITIES_SEARCH_URL_HERE"

Step 3: Extract opportunity detail URLs from the results page

SAM.gov is a modern web app. Depending on how it renders for you, you may see:

  • server-rendered HTML with usable links
  • client-rendered content (harder with raw requests)

This tutorial uses a best-effort HTML extraction approach, plus a fallback to Playwright later if needed.

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://sam.gov"

# Heuristic: opportunity links often include an "opp" path.
OPP_LINK_RE = re.compile(r"/opp/|/opportunities/")


def extract_opportunity_urls(results_html: str) -> list[str]:
    soup = BeautifulSoup(results_html, "lxml")

    urls: list[str] = []
    seen: set[str] = set()

    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue

        if not OPP_LINK_RE.search(href):
            continue

        abs_url = urljoin(BASE, href)
        if abs_url not in seen:
            seen.add(abs_url)
            urls.append(abs_url)

    return urls

Sanity check

session = requests.Session()
res = fetch_html(session, START_URL)
urls = extract_opportunity_urls(res.text)
print("found", len(urls), "opportunity urls")
print(urls[:5])

If you get 0, you likely need a browser-based fetch (Playwright) for the initial HTML. SAM.gov’s search UI is largely client-rendered, so this is the common case, not a sign your code is broken.
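
If you also want stable IDs for deduplicating across runs, you can try to pull them out of the detail URLs. The pattern below assumes links contain an ID segment after /opp/; confirm it against the URLs your extraction step actually returns.

# Assumption: detail URLs carry an opportunity/notice ID segment after "/opp/".
# Adjust the pattern to whatever the extracted links actually look like.
OPP_ID_RE = re.compile(r"/opp/([0-9a-fA-F-]{10,})")


def extract_opportunity_id(url: str) -> str | None:
    m = OPP_ID_RE.search(url)
    return m.group(1) if m else None


# Example: pair IDs with URLs so re-runs can skip already-collected opportunities.
opp_ids = [extract_opportunity_id(u) for u in urls]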


Step 4: Parse opportunity details (HTML-first, JSON-when-available)

Opportunity pages often contain structured labels (“Agency”, “Posted Date”, etc.).

A practical parser strategy:

  • capture the page title
  • extract a small set of labeled fields from definition lists / table rows
  • keep the raw URL + scrape timestamp

from datetime import datetime, timezone


def normalize_space(s: str | None) -> str | None:
    if s is None:
        return None
    return " ".join(s.split())


def parse_labeled_fields(soup: BeautifulSoup) -> dict[str, str]:
    out: dict[str, str] = {}

    # Generic approach: look for label/value pairs in the DOM.
    # This is intentionally flexible; adjust selectors based on what you see.
    for row in soup.select("div, li, tr"):
        text = row.get_text(" ", strip=True)
        if not text:
            continue

        # Very lightweight heuristics for common fields.
        if "Posted Date" in text and "|" not in text:
            out.setdefault("posted_date_text", text)
        if "Response Date" in text and "|" not in text:
            out.setdefault("response_date_text", text)
        if "NAICS" in text:
            out.setdefault("naics_text", text)
        if "Set Aside" in text:
            out.setdefault("set_aside_text", text)

    return out


def parse_opportunity_detail(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    title = soup.title.get_text(strip=True) if soup.title else None
    h1 = soup.select_one("h1")
    h1_text = normalize_space(h1.get_text(" ", strip=True) if h1 else None)

    fields = parse_labeled_fields(soup)

    return {
        "url": url,
        "title_tag": title,
        "title": h1_text,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
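
The *_text fields above are raw label strings. A small post-processing pass can pull cleaner values out of them; for example, opportunity NAICS codes are six digits, so a simple regex recovers the code itself:

NAICS_CODE_RE = re.compile(r"\b(\d{6})\b")


def extract_naics_code(naics_text: str | None) -> str | None:
    # Pull the first six-digit code out of a raw "NAICS ..." label string.
    if not naics_text:
        return None
    m = NAICS_CODE_RE.search(naics_text)
    return m.group(1) if m else None


# Usage after parsing: row["naics_code"] = extract_naics_code(row.get("naics_text"))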

Step 5: Crawl results → crawl details → export

This is the dataset builder loop.

import csv
import json


def crawl_opportunities(start_url: str, max_items: int = 50) -> list[dict]:
    session = requests.Session()

    res = fetch_html(session, start_url)
    opp_urls = extract_opportunity_urls(res.text)

    rows: list[dict] = []

    for i, url in enumerate(opp_urls[:max_items], start=1):
        try:
            polite_sleep(1.0, 2.0)
            detail = fetch_html(session, url)
            rows.append(parse_opportunity_detail(detail.text, url))
            print(f"[{i}/{min(len(opp_urls), max_items)}] ok {url}")
        except Exception as e:
            print(f"[{i}] failed {url}: {e}")

    return rows


def export_json(path: str, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def export_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        return
    keys = sorted({k for r in rows for k in r.keys()})
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=keys)
        w.writeheader()
        w.writerows(rows)


rows = crawl_opportunities(START_URL, max_items=30)
export_json("samgov_opportunities.json", rows)
export_csv("samgov_opportunities.csv", rows)
print("wrote", len(rows), "rows")

If HTML extraction fails: use Playwright for rendering

If SAM.gov returns minimal HTML, you’ll need a browser.

A pragmatic hybrid:

  • use Playwright to fetch the rendered HTML (or intercept the JSON XHR)
  • keep the rest of the pipeline identical

Skeleton:

# pip install playwright
# playwright install

from playwright.sync_api import sync_playwright


def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

# html = fetch_rendered_html(START_URL)
# urls = extract_opportunity_urls(html)
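
If you go the Playwright route, intercepting the search JSON responses is usually cleaner than parsing rendered HTML. A minimal sketch follows; the url_fragment filter is an assumption, so open DevTools → Network on the search page and copy the real request path.

def capture_search_json(url: str, url_fragment: str = "/search") -> list[dict]:
    # Collect JSON bodies from responses whose URL contains url_fragment.
    # Replace url_fragment with the real API path you see in DevTools.
    captured: list[dict] = []

    def on_response(response):
        content_type = response.headers.get("content-type", "")
        if url_fragment in response.url and "application/json" in content_type:
            try:
                captured.append(response.json())
            except Exception:
                pass  # ignore empty or non-JSON bodies

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", on_response)
        page.goto(url, wait_until="networkidle")
        browser.close()

    return captured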

Where ProxiesAPI fits

For SAM.gov datasets, you’ll typically run long jobs:

  • many search pages
  • many detail URLs
  • occasional failures

ProxiesAPI can help by stabilizing the fetch step when you encounter throttling.

A simple pattern is to route requests through ProxiesAPI while leaving your parsing unchanged.

import os

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")


def fetch_html_via_proxiesapi(session: requests.Session, url: str) -> FetchResult:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY")

    proxiesapi_url = "https://api.proxiesapi.com"
    params = {"auth_key": PROXIESAPI_KEY, "url": url}

    r = session.get(proxiesapi_url, params=params, headers=DEFAULT_HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)
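
One way to wire this in without touching the parsers: a wrapper that tries the direct request first and falls back to ProxiesAPI only when the direct fetch keeps failing (both helpers are defined earlier in this tutorial).

def fetch_html_resilient(session: requests.Session, url: str) -> FetchResult:
    # Direct fetch first; fall back to ProxiesAPI after retries are exhausted.
    try:
        return fetch_html(session, url)
    except (FetchError, requests.RequestException) as e:
        if not PROXIESAPI_KEY:
            raise
        print(f"direct fetch failed ({e}); retrying via ProxiesAPI")
        return fetch_html_via_proxiesapi(session, url)


# Swap this in wherever crawl_opportunities calls fetch_html.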

QA checklist

  • start URL loads consistently (not blocked)
  • opportunity URLs extracted > 0
  • detail parser returns non-empty titles for several items
  • exports load cleanly
  • you’re sleeping between requests

Next upgrades

  • paginate results pages (iterate your search URL parameters)
  • extract structured fields by observing exact labels in the DOM
  • intercept JSON API calls in Playwright for the cleanest data
  • store in SQLite/Postgres for incremental updates
