Scrape SAM.gov Contract Opportunities with Python (Search API + Dataset + Screenshots)
SAM.gov is the official U.S. government system for contract opportunities. The good news: unlike many “scrape-only” targets, SAM.gov provides a documented API for pulling opportunities.
So instead of reverse-engineering HTML, we’ll build a clean dataset builder around the official endpoint:
- query opportunities with filters (date range, keywords, set-asides, NAICS)
- paginate reliably
- normalize the results into a tidy schema
- export JSONL (stream-friendly) and CSV
We’ll also capture screenshots of the SAM.gov UI so it’s easy to validate what the API data corresponds to.

SAM.gov offers an official API for opportunities — use it first. When your workflow expands to crawling attachment pages, vendor sites, and linked documents, ProxiesAPI helps make the network layer reliable (timeouts, retries, IP rotation) without rewriting your scraper.
Important notes (before you touch the API)
- Get an API key
  - SAM.gov uses API keys for access.
  - You'll pass it as a header, usually X-Api-Key (there's a quick check right after this list).
- Prefer the API over HTML
  - HTML scraping is more brittle.
  - The API is designed for exactly this workflow.
- Respect rate limits
  - Keep reasonable pagination and caching.
  - Retry politely on transient errors.
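Before writing any real code, you can verify the key is wired up correctly. This is a minimal sketch: it reads the key from an environment variable (SAMGOV_API_KEY is simply the variable name used throughout this guide) and attaches it as the X-Api-Key header on a requests session.

import os

import requests

# Minimal check: read the key from the environment and attach it as a header.
# SAMGOV_API_KEY is just the env var name this guide uses; pick any name you like.
session = requests.Session()
session.headers["X-Api-Key"] = os.environ["SAMGOV_API_KEY"]
print("X-Api-Key header set:", "X-Api-Key" in session.headers)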
Step 1: Environment and dependencies
The code in this guide uses modern type-hint syntax (str | None, tuple[int, int]), so use Python 3.10 or newer.
python -m venv .venv
source .venv/bin/activate
pip install requests tenacity python-dateutil pandas
Step 2: Define the Opportunities API client
SAM.gov’s API surface evolves, but the common pattern is:
- base: https://api.sam.gov/
- opportunities search endpoint (varies by version)
- request parameters for filters + pagination
We’ll implement a client where the endpoint URL is configurable so you can adjust it if SAM changes versions.
Create a file samgov_opps.py:
import os
import time
import random
from dataclasses import dataclass
from typing import Any

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


@dataclass
class SamGovConfig:
    api_key: str
    base_url: str = "https://api.sam.gov"
    # Common opportunities endpoint (update if SAM publishes a different path/version)
    opportunities_path: str = "/opportunities/v2/search"
    timeout: tuple[int, int] = (10, 45)  # connect, read
    min_delay_s: float = 0.4
    max_delay_s: float = 1.2


class SamGovClient:
    def __init__(self, cfg: SamGovConfig):
        self.cfg = cfg
        self.session = requests.Session()
        self.session.headers.update({
            "Accept": "application/json",
            "User-Agent": "proxiesapi-guides/1.0 (+https://proxiesapi.com)",
            "X-Api-Key": cfg.api_key,
        })

    def _polite_sleep(self):
        time.sleep(random.uniform(self.cfg.min_delay_s, self.cfg.max_delay_s))

    @retry(
        stop=stop_after_attempt(6),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        retry=retry_if_exception_type((requests.RequestException,)),
        reraise=True,
    )
    def get(self, path: str, params: dict[str, Any]) -> dict:
        self._polite_sleep()
        url = self.cfg.base_url.rstrip("/") + path
        r = self.session.get(url, params=params, timeout=self.cfg.timeout)
        r.raise_for_status()
        return r.json()


def load_client() -> SamGovClient:
    api_key = os.getenv("SAMGOV_API_KEY")
    if not api_key:
        raise RuntimeError("Missing SAMGOV_API_KEY in environment")
    return SamGovClient(SamGovConfig(api_key=api_key))
This client gives you:
- retries
- timeouts
- jittered delays
That’s the foundation for “it runs overnight” rather than “it runs once”.
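If an overnight run starts hitting throttling, the knobs already live in SamGovConfig, so you can slow things down without touching the client code. Continuing in samgov_opps.py (os is already imported there); the numbers below are illustrative, not official rate-limit guidance.

# Illustrative values only: slower pacing and a longer read timeout for big result pages.
cfg = SamGovConfig(
    api_key=os.environ["SAMGOV_API_KEY"],
    min_delay_s=1.0,
    max_delay_s=2.5,
    timeout=(10, 90),  # connect, read
)
client = SamGovClient(cfg)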
Step 3: Search opportunities with filters
You typically want to filter by:
- posted date range
- keyword text
- set-aside types
- place of performance
- NAICS codes
We’ll implement a minimal, sane default search:
- last N days
- optional keyword
- page size
from datetime import datetime, timedelta, timezone


def iso_date(d: datetime) -> str:
    # SAM endpoints often accept YYYY-MM-DD or ISO8601; keep it simple.
    return d.strftime("%Y-%m-%d")


def search_opportunities(
    client: SamGovClient,
    keyword: str | None = None,
    days_back: int = 7,
    page_size: int = 100,
    offset: int = 0,
) -> dict:
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days_back)
    params = {
        "limit": page_size,
        "offset": offset,
        # Parameter names vary by version; keep these configurable if needed.
        "postedFrom": iso_date(start),
        "postedTo": iso_date(end),
    }
    if keyword:
        params["q"] = keyword
    return client.get(client.cfg.opportunities_path, params=params)
Inspect one response
if __name__ == "__main__":
    client = load_client()
    data = search_opportunities(client, keyword="software", days_back=3, page_size=25)
    print(data.keys())
If this fails with a 404 or schema mismatch, SAM has likely moved the endpoint path/params.
Fix: update opportunities_path and param names to match current SAM documentation.
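Because the path and parameter names are plain configuration, adapting to a version change is usually a one- or two-line edit. The path below is a made-up placeholder, not a real SAM.gov route; substitute whatever the current documentation specifies.

# Hypothetical example: point the client at a different (made-up) path.
cfg = SamGovConfig(
    api_key=os.environ["SAMGOV_API_KEY"],
    opportunities_path="/prod/opportunities/v3/search",  # placeholder, not a real route
)
client = SamGovClient(cfg)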
Step 4: Normalize results into a tidy schema
API responses usually contain nested objects. For analysis, you want flat, consistent rows.
Here’s a normalizer that tries to extract fields that commonly exist:
- notice id
- title
- solicitation number
- agency / office
- posted date
- response due date
- NAICS
- place of performance
- URL
from dateutil import parser as dtparse


def safe_dt(s: str | None):
    if not s:
        return None
    try:
        return dtparse.parse(s).isoformat()
    except Exception:
        return s


def normalize_notice(n: dict) -> dict:
    # Agency/office sometimes arrive as nested objects, sometimes as plain strings.
    agency = n.get("department") or n.get("agency")
    office = n.get("office")
    return {
        "notice_id": n.get("noticeId") or n.get("id") or n.get("notice_id"),
        "title": n.get("title") or n.get("noticeTitle"),
        "solicitation_number": n.get("solicitationNumber") or n.get("solicitation_number"),
        "agency": agency.get("name") if isinstance(agency, dict) else agency,
        "office": office.get("name") if isinstance(office, dict) else office,
        "posted_date": safe_dt(n.get("postedDate") or n.get("posted_date")),
        "response_deadline": safe_dt(n.get("responseDeadLine") or n.get("response_deadline") or n.get("dueDate")),
        "naics": n.get("naics") or n.get("naicsCode"),
        "place_of_performance": n.get("placeOfPerformance") or n.get("place_of_performance"),
        "type": n.get("type"),
        "set_aside": n.get("setAside") or n.get("set_aside"),
        "ui_url": n.get("uiLink") or n.get("url") or n.get("link"),
        "raw": n,
    }
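A quick way to sanity-check the mapping is to run it on a hand-written notice dict. The values below are made up purely for illustration; they just exercise the key fallbacks above.

# Made-up record for a smoke test of normalize_notice (not real SAM.gov data).
sample = {
    "noticeId": "TEST-0001",
    "title": "Example software support notice",
    "postedDate": "2024-01-15",
    "department": {"name": "Example Department"},
    "naicsCode": "541512",
}
row = normalize_notice(sample)
print(row["notice_id"], row["agency"], row["posted_date"])  # TEST-0001 Example Department 2024-01-15T00:00:00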
Step 5: Pagination (crawl all pages)
Most SAM endpoints use offset + limit.
We’ll keep crawling until we get fewer than limit items.
import json


def iter_notices(payload: dict) -> list[dict]:
    # Different versions use different keys.
    for key in ["opportunities", "data", "results", "notices"]:
        v = payload.get(key)
        if isinstance(v, list):
            return v
    return []


def crawl_all(
    client: SamGovClient,
    keyword: str | None = None,
    days_back: int = 7,
    page_size: int = 100,
    max_pages: int = 50,
) -> list[dict]:
    out = []
    for page in range(max_pages):
        offset = page * page_size
        payload = search_opportunities(
            client,
            keyword=keyword,
            days_back=days_back,
            page_size=page_size,
            offset=offset,
        )
        notices = iter_notices(payload)
        print(f"page {page+1}: {len(notices)}")
        for n in notices:
            out.append(normalize_notice(n))
        if len(notices) < page_size:
            break
    return out


if __name__ == "__main__":
    client = load_client()
    rows = crawl_all(client, keyword="cybersecurity", days_back=14, page_size=50)
    print("rows:", len(rows))
    if rows:  # guard against an empty result set
        print(json.dumps(rows[0], indent=2)[:800])
Step 6: Export to JSONL + CSV
import pandas as pd


def export(rows: list[dict], stem: str = "samgov_opportunities"):
    jsonl_path = f"{stem}.jsonl"
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for r in rows:
            # JSONL keeps the full record (including "raw") so you can reprocess later.
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    # Flatten for CSV: drop the bulky "raw" object by default.
    flat = [{k: v for k, v in r.items() if k != "raw"} for r in rows]
    df = pd.DataFrame(flat)
    csv_path = f"{stem}.csv"
    df.to_csv(csv_path, index=False)
    print("wrote", jsonl_path, "and", csv_path, "rows:", len(rows))
Optional: Crawl linked attachments (where ProxiesAPI helps)
The SAM API gets you opportunity records. But real workflows often also:
- download attachments (PDFs, Word docs)
- crawl vendor/agency pages linked from notices
- follow “additional info” URLs
Those are classic “web scraping” tasks and can be much less stable than the API.
If you add an attachments crawler, that’s where ProxiesAPI is useful:
- consistent proxy configuration across many target domains
- fewer failures on large batch downloads
- IP rotation when a domain throttles you
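As a starting point, here's a minimal sketch of an attachment downloader. It assumes the raw notice record exposes attachment URLs under a resourceLinks key (field names vary by API version, so inspect a real payload first), and proxy_url is a placeholder for whatever proxy endpoint you configure; plug in your provider's URL format per their documentation.

import pathlib

import requests


def download_attachments(rows: list[dict], out_dir: str = "attachments", proxy_url: str | None = None):
    # proxy_url is a placeholder; if set, all downloads are routed through it.
    session = requests.Session()
    if proxy_url:
        session.proxies = {"http": proxy_url, "https": proxy_url}
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    for row in rows:
        # Assumption: attachment URLs live under "resourceLinks" in the raw record.
        links = (row.get("raw") or {}).get("resourceLinks") or []
        for i, link in enumerate(links):
            try:
                r = session.get(link, timeout=(10, 120))
                r.raise_for_status()
            except requests.RequestException as exc:
                print("failed:", link, exc)
                continue
            name = f"{row.get('notice_id') or 'unknown'}_{i}.bin"
            (pathlib.Path(out_dir) / name).write_bytes(r.content)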
QA checklist
- you can run 1 request and see a JSON payload
- crawling prints pages and stops naturally
- notice_id and title look correct for 10 samples
- exports open in your spreadsheet tool
- if endpoint/params changed, you updated opportunities_path and param names
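If you want the checklist to be repeatable, a small script over the CSV covers most of it. It assumes the default samgov_opportunities.csv filename from Step 6.

import pandas as pd

# Quick QA pass over the exported CSV.
df = pd.read_csv("samgov_opportunities.csv")
print("rows:", len(df))
print("missing notice_id:", df["notice_id"].isna().sum())
print("missing title:", df["title"].isna().sum())
print(df[["notice_id", "title", "posted_date", "ui_url"]].head(10))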
A quick “do not do this” list
- Don’t scrape HTML when the API gives you the data.
- Don’t parallelize aggressively; start with serial pagination.
- Don’t store raw PII or sensitive fields unnecessarily.
If you build around the API first, you’ll end up with a cleaner, more robust dataset pipeline — and you’ll only need scraping/proxies for the parts that truly require it.