Scrape Government Contract Data from SAM.gov with Python (Opportunities + Details)
SAM.gov is the US government’s system of record for many contracting opportunities. If you’re building a searchable feed or a dataset for analysis (set-asides, NAICS codes, deadlines, agencies, contacts), you usually need a two-step pipeline:
- Search / list: collect opportunity IDs and summary fields across many pages
- Detail enrichment: visit each opportunity’s detail page and extract structured fields
In this guide we’ll build that pipeline in Python, using ProxiesAPI in the network layer.

Opportunity lists are easy. The hard part is scaling detail-page enrichment without timeouts, throttling, and flaky responses. ProxiesAPI helps stabilize the fetch layer so your pipeline finishes.
A quick reality check: prefer official exports when available
SAM.gov provides APIs and data services for some use cases.
If an official API meets your needs, use it—scraping is best for:
- prototyping
- filling gaps where APIs are limited
- building a “good enough” internal dataset quickly
This tutorial focuses on public pages and the mechanics of a robust list→detail enrichment flow.
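For completeness, here is roughly what a call to SAM.gov's public opportunities search API looks like if you go the official route. Treat the endpoint, parameter names, and date format below as assumptions to verify against the current API documentation; you'll also need your own API key.
# Hedged sketch: query the official opportunities search API instead of scraping.
# Endpoint, parameter names, and date format are assumptions — check the docs.
import os
import requests

SAM_API_KEY = os.environ.get("SAM_API_KEY")  # assumed env var name

def search_opportunities_api(keywords: str, posted_from: str, posted_to: str, limit: int = 25):
    # postedFrom / postedTo are assumed to use MM/dd/yyyy
    resp = requests.get(
        "https://api.sam.gov/opportunities/v2/search",  # assumed endpoint
        params={
            "api_key": SAM_API_KEY,
            "keywords": keywords,
            "postedFrom": posted_from,
            "postedTo": posted_to,
            "limit": limit,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()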
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
ProxiesAPI fetch layer
As with any site-specific tutorial, keep the ProxiesAPI integration simple: wrap a fetch() helper so the rest of your scraper is normal Python.
import os
import time
import random
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY")
TIMEOUT = (10, 50)  # (connect, read) timeouts in seconds
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
})
class FetchError(Exception):
pass
def proxiesapi_url(target_url: str) -> str:
if not PROXIESAPI_KEY:
raise RuntimeError("Set PROXIESAPI_KEY env var")
return f"https://proxiesapi.com/api?auth_key={PROXIESAPI_KEY}&url={requests.utils.quote(target_url, safe='')}"
@retry(
reraise=True,
stop=stop_after_attempt(4),
wait=wait_exponential(multiplier=1, min=1, max=15),
retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str) -> str:
r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
if r.status_code >= 400:
raise FetchError(f"HTTP {r.status_code}")
text = r.text or ""
if len(text) < 3000:
raise FetchError("Response too small (possible block/interstitial)")
return text
def jitter_sleep(min_s=0.5, max_s=1.3):
time.sleep(random.uniform(min_s, max_s))
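Before writing any parsing code, do a quick smoke test of the fetch layer. The URL here is just a placeholder; use a search URL copied from your own browser.
# Smoke test for fetch(); substitute a search URL you copied from your browser.
test_url = "https://sam.gov/search/?index=opp&keywords=software"
html = fetch(test_url)
print(f"fetched {len(html)} characters from {test_url}")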
Step 1: Find and parse the opportunities list
SAM.gov search pages are dynamic and can change. The scraping approach that survives changes best is:
- Treat the list page as an HTML document
- Extract stable identifiers (notice ID / solicitation ID) and the detail link
- Keep selectors defensive and add fallbacks
Here’s a list-page parser that looks for common patterns:
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin
BASE = "https://sam.gov"
def parse_opportunity_cards(html: str) -> list[dict]:
soup = BeautifulSoup(html, "lxml")
out = []
# Many rows include a link to a detail route; we collect those.
for a in soup.select('a[href*="/opp/"] , a[href*="opportunity"], a[href*="/opportunities/"]'):
href = a.get("href")
if not href:
continue
url = href if href.startswith("http") else urljoin(BASE, href)
title = a.get_text(" ", strip=True)
# Try to pull a nearby ID-like token
parent = a.find_parent(["div", "li", "article"]) or a
blob = parent.get_text(" ", strip=True)
m = re.search(r"\b([A-Z0-9][A-Z0-9\-]{5,})\b", blob)
notice_id = m.group(1) if m else None
out.append({
"title": title or None,
"notice_id": notice_id,
"detail_url": url,
})
# Dedupe by URL
seen = set()
deduped = []
for r in out:
if r["detail_url"] in seen:
continue
seen.add(r["detail_url"])
deduped.append(r)
return deduped
Pagination
Search pages often include query params for page/offset.
If you have a working search URL from your browser, the easiest approach is to:
- copy that URL
- keep it as your SEARCH_URL
- update a single page parameter (e.g., page= or offset=)
Because this varies, we’ll implement a helper that adds/replaces page.
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
def set_query_param(url: str, key: str, value: str) -> str:
parts = urlparse(url)
q = parse_qs(parts.query)
q[key] = [value]
return urlunparse((parts.scheme, parts.netloc, parts.path, parts.params, urlencode(q, doseq=True), parts.fragment))
def crawl_search(search_url: str, pages: int = 3) -> list[dict]:
all_rows = []
seen_urls = set()
for p in range(1, pages + 1):
page_url = set_query_param(search_url, "page", str(p))
html = fetch(page_url)
batch = parse_opportunity_cards(html)
for r in batch:
u = r["detail_url"]
if u in seen_urls:
continue
seen_urls.add(u)
all_rows.append(r)
print(f"page {p}/{pages}: {len(batch)} cards (total {len(all_rows)})")
jitter_sleep()
return all_rows
Step 2: Enrich with detail-page fields
On the detail page, you typically want:
- Agency
- Posted date / response deadline
- NAICS / PSC codes
- Set-aside / contract type
- Place of performance
- A short description
The exact HTML varies, so treat the detail page as a “label → value” document.
A resilient approach:
- Build a small function to find a value next to a label
- Use multiple label variants (e.g., Response Date, Due Date)
from dataclasses import dataclass, asdict
@dataclass
class SamOpportunity:
notice_id: str | None
title: str | None
detail_url: str
agency: str | None
posted_date: str | None
response_deadline: str | None
naics: str | None
set_aside: str | None
place_of_performance: str | None
def text_or_none(el):
return el.get_text(" ", strip=True) if el else None
def find_value_by_label(soup: BeautifulSoup, labels: list[str]) -> str | None:
# Look for label text in dt/dd pairs, or in two-column rows.
label_set = {l.strip().lower() for l in labels}
# 1) definition lists
for dt in soup.select("dt"):
t = dt.get_text(" ", strip=True).lower()
if t in label_set:
dd = dt.find_next_sibling("dd")
return text_or_none(dd)
# 2) generic: find an element whose text is exactly a label
for lab in soup.find_all(string=True):
t = (lab or "").strip().lower()
if t in label_set:
el = lab.parent
# try next sibling
sib = el.find_next_sibling()
if sib:
val = text_or_none(sib)
if val and val.lower() not in label_set:
return val
return None
def parse_detail(detail_url: str, html: str, base_row: dict) -> SamOpportunity:
soup = BeautifulSoup(html, "lxml")
title = base_row.get("title")
notice_id = base_row.get("notice_id")
agency = find_value_by_label(soup, ["Agency", "Office", "Department"])
posted_date = find_value_by_label(soup, ["Posted Date", "Publish Date", "Posted"])
response_deadline = find_value_by_label(soup, ["Response Date", "Due Date", "Response Deadline"])
naics = find_value_by_label(soup, ["NAICS", "NAICS Code"])
set_aside = find_value_by_label(soup, ["Set-Aside", "Set Aside"])
place = find_value_by_label(soup, ["Place of Performance", "Place"])
# If the title wasn’t captured from list page, try h1
if not title:
h1 = soup.select_one("h1")
title = text_or_none(h1)
return SamOpportunity(
notice_id=notice_id,
title=title,
detail_url=detail_url,
agency=agency,
posted_date=posted_date,
response_deadline=response_deadline,
naics=naics,
set_aside=set_aside,
place_of_performance=place,
)
Step 3: Full pipeline + export
import csv
def build_dataset(search_url: str, pages: int = 2) -> list[SamOpportunity]:
base_rows = crawl_search(search_url, pages=pages)
out: list[SamOpportunity] = []
for i, r in enumerate(base_rows, start=1):
url = r["detail_url"]
html = fetch(url)
opp = parse_detail(url, html, r)
out.append(opp)
print(f"{i}/{len(base_rows)} {opp.notice_id} {opp.title}")
jitter_sleep()
return out
def export_csv(rows: list[SamOpportunity], path: str = "samgov_opportunities.csv"):
if not rows:
raise RuntimeError("No rows")
with open(path, "w", newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=list(asdict(rows[0]).keys()))
w.writeheader()
for r in rows:
w.writerow(asdict(r))
if __name__ == "__main__":
# Tip: run a search in your browser (e.g., keyword=software, set a date range), then copy the URL.
SEARCH_URL = "https://sam.gov/search/?index=opp&sort=-modifiedDate&keywords=software"
rows = build_dataset(SEARCH_URL, pages=2)
export_csv(rows)
print("exported", len(rows))
Hard-earned tips for SAM.gov scraping
- Don’t overfit selectors. Favor label/value parsing instead of fragile classnames.
- Expect mixed layouts. Some opportunities render different sections depending on type.
- Build a debug mode. Save HTML when parsing fails.
- Scale carefully. Detail pages are the expensive part; cache results by notice_id (a minimal cache sketch follows the debug helper below).
Debug helper:
from pathlib import Path
def save_debug_html(name: str, html: str):
Path("debug_html").mkdir(exist_ok=True)
Path(f"debug_html/{name}.html").write_text(html, encoding="utf-8")
QA checklist
- List crawl returns unique detail URLs
- Detail enrichment extracts agency + response deadline for most rows
- CSV export has consistent columns
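The first three checks are easy to automate with a few assertions over the rows returned by build_dataset; the coverage threshold below is an arbitrary starting point.
# Sanity checks over the enriched rows from build_dataset().
def qa_checks(rows: list[SamOpportunity]) -> None:
    urls = [r.detail_url for r in rows]
    assert len(urls) == len(set(urls)), "duplicate detail URLs in crawl output"

    with_agency = sum(1 for r in rows if r.agency)
    with_deadline = sum(1 for r in rows if r.response_deadline)
    print(f"agency coverage: {with_agency}/{len(rows)}")
    print(f"deadline coverage: {with_deadline}/{len(rows)}")

    # Flag a likely parsing regression if coverage drops sharply (threshold is arbitrary).
    if rows and with_deadline / len(rows) < 0.5:
        print("WARNING: fewer than half the rows have a response deadline")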
Next upgrades
- Parse structured JSON if SAM.gov embeds it in scripts
- Store in SQLite/Postgres with unique indexes on notice_id (see the SQLite sketch after this list)
- Add incremental refresh (re-crawl last 7 days daily)
- Add concurrency (carefully) once your block rate is low
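As a starting point for the storage upgrade, here is a minimal SQLite sketch where notice_id acts as the unique key so re-crawls upsert rows instead of duplicating them. Table and column names are illustrative.
# Minimal SQLite storage; notice_id is the primary key so repeated crawls upsert.
import sqlite3

def init_db(path: str = "samgov.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS opportunities (
            notice_id TEXT PRIMARY KEY,
            title TEXT,
            detail_url TEXT,
            agency TEXT,
            posted_date TEXT,
            response_deadline TEXT,
            naics TEXT,
            set_aside TEXT,
            place_of_performance TEXT
        )
    """)
    return conn

def upsert_rows(conn: sqlite3.Connection, rows: list[SamOpportunity]) -> None:
    for r in rows:
        if not r.notice_id:
            continue  # skip rows without a stable key
        conn.execute(
            """INSERT INTO opportunities VALUES (?,?,?,?,?,?,?,?,?)
               ON CONFLICT(notice_id) DO UPDATE SET
                 title=excluded.title, detail_url=excluded.detail_url,
                 agency=excluded.agency, posted_date=excluded.posted_date,
                 response_deadline=excluded.response_deadline, naics=excluded.naics,
                 set_aside=excluded.set_aside,
                 place_of_performance=excluded.place_of_performance""",
            (r.notice_id, r.title, r.detail_url, r.agency, r.posted_date,
             r.response_deadline, r.naics, r.set_aside, r.place_of_performance),
        )
    conn.commit()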
Once you move from “a few pages” to “hundreds of pages + thousands of detail URLs”, using ProxiesAPI in the fetch layer makes the whole pipeline far less brittle.