Scrape Government Contract Data from SAM.gov with Python (Opportunities + Details)

SAM.gov is the US government’s system of record for many contracting opportunities. If you’re building a searchable feed or a dataset for analysis (set-asides, NAICS codes, deadlines, agencies, contacts), you usually need a two-step pipeline:

  1. Search / list: collect opportunity IDs and summary fields across many pages
  2. Detail enrichment: visit each opportunity’s detail page and extract structured fields

In this guide we’ll build that pipeline in Python, using ProxiesAPI in the network layer.

SAM.gov opportunities search page (we’ll capture rows + pagination)

Make SAM.gov data pulls reliable with ProxiesAPI

Opportunity lists are easy. The hard part is scaling detail-page enrichment without timeouts, throttling, and flaky responses. ProxiesAPI helps stabilize the fetch layer so your pipeline finishes.


A quick reality check: prefer official exports when available

SAM.gov publishes official APIs and data extracts for many use cases (for example, the Get Opportunities Public API).

If an official API meets your needs, use it. Scraping is best for:

  • prototyping
  • filling gaps where APIs are limited
  • building a “good enough” internal dataset quickly

This tutorial focuses on public pages and the mechanics of a robust list→detail enrichment flow.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

ProxiesAPI fetch layer

Keep the ProxiesAPI integration simple: wrap it behind a fetch() helper so the rest of your scraper is plain Python.

import os
import time
import random
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY")
TIMEOUT = (10, 50)

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})


class FetchError(Exception):
    pass


def proxiesapi_url(target_url: str) -> str:
    if not PROXIESAPI_KEY:
        raise RuntimeError("Set PROXIESAPI_KEY env var")
    return f"https://proxiesapi.com/api?auth_key={PROXIESAPI_KEY}&url={requests.utils.quote(target_url, safe='')}"


@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=1, max=15),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str) -> str:
    r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
    if r.status_code >= 400:
        raise FetchError(f"HTTP {r.status_code}")
    text = r.text or ""
    if len(text) < 3000:
        raise FetchError("Response too small (possible block/interstitial)")
    return text


def jitter_sleep(min_s=0.5, max_s=1.3):
    time.sleep(random.uniform(min_s, max_s))
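Because the target URL travels inside a query parameter, it has to be fully percent-encoded or its own `?` and `&` would be misread by the proxy endpoint. Here's a standalone sketch of that encoding behavior using the stdlib `quote` (which is what `requests.utils.quote` wraps); the `demo` key is a placeholder:

```python
from urllib.parse import quote

def build_proxy_url(auth_key: str, target_url: str) -> str:
    # safe='' percent-encodes every reserved character, including : / ? & =
    return f"https://proxiesapi.com/api?auth_key={auth_key}&url={quote(target_url, safe='')}"

url = build_proxy_url("demo", "https://sam.gov/search/?index=opp&page=2")
print(url)
```

Note that the final URL contains exactly one literal `?`: the target's query string arrives at the proxy as opaque encoded text, so its `page=2` parameter survives intact.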

Step 1: Find and parse the opportunities list

SAM.gov search pages are dynamic and can change. The scraping approach that survives changes best is:

  • Treat the list page as an HTML document
  • Extract stable identifiers (notice ID / solicitation ID) and the detail link
  • Keep selectors defensive and add fallbacks

Here’s a list-page parser that looks for common patterns:

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://sam.gov"


def parse_opportunity_cards(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []

    # Many rows include a link to a detail route; we collect those.
    for a in soup.select('a[href*="/opp/"], a[href*="opportunity"], a[href*="/opportunities/"]'):
        href = a.get("href")
        if not href:
            continue

        url = href if href.startswith("http") else urljoin(BASE, href)
        title = a.get_text(" ", strip=True)

        # Try to pull a nearby ID-like token
        parent = a.find_parent(["div", "li", "article"]) or a
        blob = parent.get_text(" ", strip=True)
        m = re.search(r"\b([A-Z0-9][A-Z0-9\-]{5,})\b", blob)
        notice_id = m.group(1) if m else None

        out.append({
            "title": title or None,
            "notice_id": notice_id,
            "detail_url": url,
        })

    # Dedupe by URL
    seen = set()
    deduped = []
    for r in out:
        if r["detail_url"] in seen:
            continue
        seen.add(r["detail_url"])
        deduped.append(r)

    return deduped
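The ID regex above is a heuristic, so it's worth sanity-checking against a realistic row blob (the solicitation number below is made up):

```python
import re

# Typical text blob from an opportunity row: type, solicitation number, agency
blob = "Presolicitation W912DY-25-R-0012 Department of the Army Posted Jun 3"
m = re.search(r"\b([A-Z0-9][A-Z0-9\-]{5,})\b", blob)
print(m.group(1) if m else None)  # → W912DY-25-R-0012
```

Lowercase words never match, but be aware that any all-caps word of six or more letters (e.g., "NOTICE") would, so treat `notice_id` as a hint until the detail page confirms it.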

Pagination

Search pages often include query params for page/offset.

If you have a working search URL from your browser, the easiest approach is to:

  • copy that URL
  • keep it as your SEARCH_URL
  • update a single page parameter (e.g., page= or offset=)

Because this varies, we’ll implement a helper that adds/replaces page.

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def set_query_param(url: str, key: str, value: str) -> str:
    parts = urlparse(url)
    q = parse_qs(parts.query)
    q[key] = [value]
    return urlunparse((parts.scheme, parts.netloc, parts.path, parts.params, urlencode(q, doseq=True), parts.fragment))


def crawl_search(search_url: str, pages: int = 3) -> list[dict]:
    all_rows = []
    seen_urls = set()

    for p in range(1, pages + 1):
        page_url = set_query_param(search_url, "page", str(p))
        html = fetch(page_url)
        batch = parse_opportunity_cards(html)

        for r in batch:
            u = r["detail_url"]
            if u in seen_urls:
                continue
            seen_urls.add(u)
            all_rows.append(r)

        print(f"page {p}/{pages}: {len(batch)} cards (total {len(all_rows)})")
        jitter_sleep()

    return all_rows
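A quick check that the helper replaces an existing page parameter without disturbing the rest of the query (the helper is repeated here so the snippet runs standalone):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_query_param(url: str, key: str, value: str) -> str:
    # Same helper as above: parse the query, overwrite one key, rebuild the URL
    parts = urlparse(url)
    q = parse_qs(parts.query)
    q[key] = [value]
    return urlunparse((parts.scheme, parts.netloc, parts.path, parts.params,
                       urlencode(q, doseq=True), parts.fragment))

print(set_query_param("https://sam.gov/search/?index=opp&page=1", "page", "3"))
# → https://sam.gov/search/?index=opp&page=3
```

The same call also works when the parameter is absent: it simply appends `page=N` to the existing query string.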

Step 2: Enrich with detail-page fields

On the detail page, you typically want:

  • Agency
  • Posted date / response deadline
  • NAICS / PSC codes
  • Set-aside / contract type
  • Place of performance
  • A short description

The exact HTML varies, so treat the detail page as a “label → value” document.

A resilient approach:

  • Build a small function to find a value next to a label
  • Use multiple label variants (e.g., Response Date, Due Date)

from dataclasses import dataclass, asdict


@dataclass
class SamOpportunity:
    notice_id: str | None
    title: str | None
    detail_url: str
    agency: str | None
    posted_date: str | None
    response_deadline: str | None
    naics: str | None
    set_aside: str | None
    place_of_performance: str | None


def text_or_none(el):
    return el.get_text(" ", strip=True) if el else None


def find_value_by_label(soup: BeautifulSoup, labels: list[str]) -> str | None:
    # Look for label text in dt/dd pairs, or in two-column rows.
    label_set = {l.strip().lower() for l in labels}

    # 1) definition lists
    for dt in soup.select("dt"):
        t = dt.get_text(" ", strip=True).lower()
        if t in label_set:
            dd = dt.find_next_sibling("dd")
            return text_or_none(dd)

    # 2) generic: find an element whose text is exactly a label
    for lab in soup.find_all(string=True):
        t = (lab or "").strip().lower()
        if t in label_set:
            el = lab.parent
            # try next sibling
            sib = el.find_next_sibling()
            if sib:
                val = text_or_none(sib)
                if val and val.lower() not in label_set:
                    return val

    return None
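To see why the dt/dd pass works, here is the same "label → value" idea reduced to the stdlib html.parser; BeautifulSoup does the real work above, this is only an illustration of the pattern on a toy snippet:

```python
from html.parser import HTMLParser

class DtDdPairs(HTMLParser):
    """Collects dt/dd pairs as a lowercase-label → value dict."""
    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None
        self._label = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag in ("dt", "dd"):
            self._buf.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "dt":
            self._label = " ".join(self._buf).strip().lower()
        elif tag == "dd" and self._label:
            self.pairs[self._label] = " ".join(self._buf).strip()
            self._label = None
        self._tag = None

html = ("<dl><dt>NAICS Code</dt><dd>541511</dd>"
        "<dt>Set-Aside</dt><dd>Total Small Business</dd></dl>")
p = DtDdPairs()
p.feed(html)
print(p.pairs["naics code"])  # → 541511
```

Lowercasing the label on the way in is what lets `find_value_by_label` accept several label variants with one lookup.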


def parse_detail(detail_url: str, html: str, base_row: dict) -> SamOpportunity:
    soup = BeautifulSoup(html, "lxml")

    title = base_row.get("title")
    notice_id = base_row.get("notice_id")

    agency = find_value_by_label(soup, ["Agency", "Office", "Department"])
    posted_date = find_value_by_label(soup, ["Posted Date", "Publish Date", "Posted"])
    response_deadline = find_value_by_label(soup, ["Response Date", "Due Date", "Response Deadline"])
    naics = find_value_by_label(soup, ["NAICS", "NAICS Code"])
    set_aside = find_value_by_label(soup, ["Set-Aside", "Set Aside"])
    place = find_value_by_label(soup, ["Place of Performance", "Place"])

    # If the title wasn’t captured from list page, try h1
    if not title:
        h1 = soup.select_one("h1")
        title = text_or_none(h1)

    return SamOpportunity(
        notice_id=notice_id,
        title=title,
        detail_url=detail_url,
        agency=agency,
        posted_date=posted_date,
        response_deadline=response_deadline,
        naics=naics,
        set_aside=set_aside,
        place_of_performance=place,
    )

Step 3: Full pipeline + export

import csv


def build_dataset(search_url: str, pages: int = 2) -> list[SamOpportunity]:
    base_rows = crawl_search(search_url, pages=pages)
    out: list[SamOpportunity] = []

    for i, r in enumerate(base_rows, start=1):
        url = r["detail_url"]
        html = fetch(url)
        opp = parse_detail(url, html, r)
        out.append(opp)
        print(f"{i}/{len(base_rows)} {opp.notice_id} {opp.title}")
        jitter_sleep()

    return out


def export_csv(rows: list[SamOpportunity], path: str = "samgov_opportunities.csv"):
    if not rows:
        raise RuntimeError("No rows")
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=list(asdict(rows[0]).keys()))
        w.writeheader()
        for r in rows:
            w.writerow(asdict(r))


if __name__ == "__main__":
    # Tip: run a search in your browser (e.g., keyword=software, set a date range), then copy the URL.
    SEARCH_URL = "https://sam.gov/search/?index=opp&sort=-modifiedDate&keywords=software"

    rows = build_dataset(SEARCH_URL, pages=2)
    export_csv(rows)
    print("exported", len(rows))

Hard-earned tips for SAM.gov scraping

  1. Don’t overfit selectors. Favor label/value parsing instead of fragile classnames.
  2. Expect mixed layouts. Some opportunities render different sections depending on type.
  3. Build a debug mode. Save HTML when parsing fails.
  4. Scale carefully. Detail pages are the expensive part; cache results by notice_id.

Debug helper:

from pathlib import Path

def save_debug_html(name: str, html: str):
    Path("debug_html").mkdir(exist_ok=True)
    Path(f"debug_html/{name}.html").write_text(html, encoding="utf-8")

QA checklist

  • List crawl returns unique detail URLs
  • Detail enrichment extracts agency + response deadline for most rows
  • CSV export has consistent columns

Next upgrades

  • Parse structured JSON if SAM.gov embeds it in scripts
  • Store in SQLite/Postgres with unique indexes (notice_id)
  • Add incremental refresh (re-crawl last 7 days daily)
  • Add concurrency (carefully) once your block rate is low
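The SQLite upgrade can be as small as one table with a PRIMARY KEY on notice_id, so re-crawls upsert instead of duplicating rows (the schema below is a sketch covering only a few of the dataclass fields):

```python
import sqlite3

def upsert_opportunities(conn: sqlite3.Connection, rows: list[dict]) -> None:
    # PRIMARY KEY on notice_id makes repeated crawls idempotent
    conn.execute("""CREATE TABLE IF NOT EXISTS opportunities (
        notice_id  TEXT PRIMARY KEY,
        title      TEXT,
        detail_url TEXT)""")
    conn.executemany(
        "INSERT INTO opportunities (notice_id, title, detail_url) "
        "VALUES (:notice_id, :title, :detail_url) "
        "ON CONFLICT(notice_id) DO UPDATE SET "
        "title=excluded.title, detail_url=excluded.detail_url",
        rows,
    )
    conn.commit()
```

With this in place, the incremental refresh is just "crawl recent pages, upsert everything": unchanged notices are overwritten in place and new ones are inserted.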

Once you move from “a few pages” to “hundreds of pages + thousands of detail URLs”, using ProxiesAPI in the fetch layer makes the whole pipeline far less brittle.
