Scrape Crunchbase Company Data

Crunchbase is useful when you need a compact company research dataset: name, description, categories, location, website, and a few ranking or momentum signals in one place.

The catch is that Crunchbase is not a simple static HTML site. The public pages are wrapped in a large client-side app, and direct requests often return a Cloudflare block page instead of the company profile you wanted.

In this guide we will use a practical two-step pattern:

  • discover organization profile URLs from the public company search page
  • fetch each profile page, render the HTML, and parse structured data plus visible score signals

The result is a CSV you can use for lead lists, market maps, or founder research.


Route tougher company pages through ProxiesAPI

Crunchbase mixes Cloudflare protection with a JavaScript-heavy app. ProxiesAPI gives you one fetch layer for rendered HTML while you keep the parsing code simple.


What we are scraping

We will use two public page types:

  • discover page: https://www.crunchbase.com/discover/organization.companies
  • profile page: https://www.crunchbase.com/organization/openai

On the discover page, the useful pattern is the organization link itself:

  • result links look like a[aria-label][href^="/organization/"]
  • the rendered result rows live inside Crunchbase grid-row elements

On profile pages, the cleanest source is usually the structured data block:

  • script[type="application/ld+json"]

Crunchbase also exposes useful visible text signals in the rendered DOM, such as:

  • Growth Score
  • CB Rank
  • Heat Score
  • company type / funding stage text such as Private or Venture - Series Unknown

That combination is enough for a solid research dataset.
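
To make the target dataset concrete, here is a minimal sketch of one row as a dataclass. The class name `CompanyRecord` and its field names are illustrative assumptions, not a Crunchbase schema. They mirror the fields listed above, with every score left optional.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CompanyRecord:
    """One row of the research dataset (hypothetical field names)."""

    profile_url: str
    name: Optional[str] = None
    description: Optional[str] = None
    website: Optional[str] = None
    location: str = ""
    categories: List[str] = field(default_factory=list)
    # Visible score signals are optional because the rendered page can change.
    growth_score: Optional[int] = None
    cb_rank: Optional[int] = None
    heat_score: Optional[int] = None
    stage_signal: Optional[str] = None


# Only the profile URL is required; everything else defaults to "unknown".
record = CompanyRecord(
    profile_url="https://www.crunchbase.com/organization/openai",
    name="OpenAI",
)
row = asdict(record)
```

Keeping the scores as `Optional[int]` means a missing label produces an empty cell in the export rather than a crash.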


Setup

python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 pandas lxml

We will use:

  • requests for HTTP
  • BeautifulSoup for HTML parsing
  • pandas for export

If you plan to fetch through ProxiesAPI, set your key first:

export PROXIESAPI_KEY="YOUR_KEY"

Step 1: Fetch rendered HTML

Crunchbase is one of those sites where "download the raw HTML and parse it" is usually not enough. A direct request often fails before the real app loads.

This helper routes requests through ProxiesAPI when a key is present and falls back to a direct request otherwise. It also passes render=1 because the page content is JavaScript-heavy.

import os
import time
import requests
from urllib.parse import urlencode

TIMEOUT = (20, 60)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY", "").strip()


def proxiesapi_url(target_url: str, render: bool = True) -> str:
    if not PROXIESAPI_KEY:
        return target_url
    params = {
        "auth_key": PROXIESAPI_KEY,
        "url": target_url,
    }
    if render:
        params["render"] = "1"
    return "https://api.proxiesapi.com/?" + urlencode(params)


def fetch_html(url: str, session: requests.Session | None = None) -> str:
    s = session or requests.Session()
    final_url = proxiesapi_url(url, render=True)
    r = s.get(final_url, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()

    text = r.text
    lowered = text.lower()
    if "attention required" in lowered or "sorry, you have been blocked" in lowered:
        raise RuntimeError(f"blocked while fetching {url}")
    return text

The helper makes no promise that a request will succeed: if the target still returns a block page, the script fails loudly instead of silently parsing junk.
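
Because a loud failure is often transient, it can help to retry a few times before giving up. The following is a minimal sketch, not part of the original script: `with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary assumptions. The demo uses a fake fetch function and a fake sleep so it runs without network access.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**n seconds and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Broad catch for the sketch; narrow it to the errors you expect.
            if attempt == attempts - 1:
                raise  # out of attempts: keep failing loudly
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Demo: a stand-in fetch that is "blocked" twice, then succeeds.
calls = {"n": 0}
delays: List[float] = []


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("blocked while fetching (simulated)")
    return "<html>ok</html>"


result = with_retries(flaky_fetch, attempts=3, base_delay=2.0, sleep=delays.append)
```

In the real script the call would look like `with_retries(lambda: fetch_html(url, session=session))`.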


Step 2: Discover company profile URLs

The company search page already contains links to organization profiles. That gives us a simple way to bootstrap a crawl.

from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.crunchbase.com"
DISCOVER_URL = f"{BASE}/discover/organization.companies"


def parse_company_links(html: str, limit: int = 25) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    seen = set()
    urls = []

    for a in soup.select('a[aria-label][href^="/organization/"]'):
        href = a.get("href", "").strip()
        label = a.get("aria-label", "").strip()
        if not href or not label:
            continue

        full_url = urljoin(BASE, href)
        if full_url in seen:
            continue

        seen.add(full_url)
        urls.append(full_url)

        if len(urls) >= limit:
            break

    return urls


session = requests.Session()
discover_html = fetch_html(DISCOVER_URL, session=session)
company_urls = parse_company_links(discover_html, limit=10)

print("discovered:", len(company_urls))
print(company_urls[:5])

Typical output:

discovered: 10
['https://www.crunchbase.com/organization/european-investment-bank',
 'https://www.crunchbase.com/organization/coreweave',
 'https://www.crunchbase.com/organization/xai', ...]

For a lot of research tasks, that is enough: discover a small batch of companies, then enrich the details from the profile pages.
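
Discovered links can differ only by query string, fragment, or trailing slash, so exact-string deduplication may let the same company through twice. Below is a small sketch of normalizing profile URLs before deduplicating. The helper names are hypothetical, and the rule that everything after the slug can be dropped is an assumption about how profile sub-pages are laid out.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_profile_url(url: str) -> Optional[str]:
    """Reduce a link to https://www.crunchbase.com/organization/<slug>."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "organization":
        return None  # not an organization profile link
    # Drop query string, fragment, and anything after the slug.
    path = f"/organization/{segments[1]}"
    return urlunsplit(("https", "www.crunchbase.com", path, "", ""))


def dedupe_profile_urls(urls: Iterable[str]) -> List[str]:
    """Normalize, then keep the first occurrence of each profile."""
    seen = set()
    unique = []
    for url in urls:
        normalized = normalize_profile_url(url)
        if normalized is None or normalized in seen:
            continue
        seen.add(normalized)
        unique.append(normalized)
    return unique


sample_links = [
    "/organization/openai?utm=1",
    "https://www.crunchbase.com/organization/openai#top",
    "https://www.crunchbase.com/organization/coreweave/",
    "/discover/organization.companies",
]
unique_urls = dedupe_profile_urls(sample_links)
```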


Step 3: Parse the profile JSON-LD

The rendered profile page includes application/ld+json blocks. That is much more stable than scraping visible labels one by one.

import json


def find_org_jsonld(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    for tag in soup.select('script[type="application/ld+json"]'):
        raw = (tag.string or tag.get_text() or "").strip()
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue

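        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue  # skip bare strings or numbers in the JSON-LD list
            entity = item.get("mainEntity", item)
            if not isinstance(entity, dict):
                continue
            # "@type" can be a single string or a list of strings.
            types = entity.get("@type")
            if not isinstance(types, list):
                types = [types]
            if {t for t in types if isinstance(t, str)} & {"Corporation", "Organization"}:
                return entity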

    raise ValueError("Could not find organization JSON-LD")


def clean_list(value) -> list[str]:
    if value is None:
        return []
    if isinstance(value, str):
        return [x.strip() for x in value.split(",") if x.strip()]
    if isinstance(value, list):
        return [str(x).strip() for x in value if str(x).strip()]
    return [str(value).strip()]


def parse_profile_fields(html: str, profile_url: str) -> dict:
    entity = find_org_jsonld(html)
    soup = BeautifulSoup(html, "lxml")
    page_text = soup.get_text(" ", strip=True)

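    address = entity.get("address")
    if not isinstance(address, dict):
        address = {}  # address may be missing or a plain string
    location = ", ".join(
        part for part in [
            address.get("addressLocality"),
            address.get("addressRegion"),
            address.get("addressCountry"),
        ] if isinstance(part, str) and part
    )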

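    import re  # hoisted out of the loop; ideally this sits with the module imports

    scores = {}
    for label in ["Growth Score", "CB Rank", "Heat Score"]:
        # Allow thousands separators so a value like "1,204" is not cut to 1.
        m = re.search(rf"{re.escape(label)}\s+(\d[\d,]*)", page_text)
        key = label.lower().replace(" ", "_")
        scores[key] = int(m.group(1).replace(",", "")) if m else None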

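    # Coarse heuristic: the first candidate found anywhere in the page text wins,
    # so short labels such as "Private" or "Seed" can match unrelated text.
    stage = None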
    for candidate in [
        "Venture - Series Unknown",
        "Private",
        "Public",
        "Seed",
        "Series A",
        "Series B",
    ]:
        if candidate in page_text:
            stage = candidate
            break

    return {
        "profile_url": profile_url,
        "name": entity.get("name"),
        "description": entity.get("description"),
        "website": entity.get("url"),
        "location": location,
        "categories": clean_list(entity.get("keywords")),
        "linkedin": next((u for u in clean_list(entity.get("sameAs")) if "linkedin.com" in u), None),
        "growth_score": scores["growth_score"],
        "cb_rank": scores["cb_rank"],
        "heat_score": scores["heat_score"],
        "stage_signal": stage,
    }

This gives you a useful structure without depending on brittle nth-child selectors.
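
The visible score labels are matched with a regular expression over the flattened page text. Here is a standalone sketch of that step, which you can test without fetching anything. The sample text is made up for illustration, and `extract_score` is a hypothetical helper name. It accepts thousands separators, since a rank can be a number like 1,204.

```python
import re
from typing import Optional


def extract_score(page_text: str, label: str) -> Optional[int]:
    """Return the integer that follows `label` in the text, or None."""
    # re.escape keeps the label literal; the group allows "1,204" style values.
    match = re.search(rf"{re.escape(label)}\s+(\d[\d,]*)", page_text)
    if not match:
        return None
    return int(match.group(1).replace(",", ""))


# Made-up text standing in for soup.get_text(" ", strip=True).
sample_text = "Acme Growth Score 97 CB Rank 1,204 Heat Score 95 Private"

growth = extract_score(sample_text, "Growth Score")
rank = extract_score(sample_text, "CB Rank")
missing = extract_score(sample_text, "Trend Score")
```

A label that is absent yields `None`, which keeps the score columns optional.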


Step 4: Crawl a batch and export CSV

import pandas as pd


def crawl_companies(limit: int = 10, delay_seconds: float = 2.0) -> pd.DataFrame:
    session = requests.Session()
    discover_html = fetch_html(DISCOVER_URL, session=session)
    company_urls = parse_company_links(discover_html, limit=limit)

    rows = []
    for i, url in enumerate(company_urls, start=1):
        print(f"[{i}/{len(company_urls)}] {url}")
        html = fetch_html(url, session=session)
        row = parse_profile_fields(html, profile_url=url)
        rows.append(row)
        time.sleep(delay_seconds)

    return pd.DataFrame(rows)


if __name__ == "__main__":
    df = crawl_companies(limit=10, delay_seconds=2.5)
    df["categories"] = df["categories"].apply(lambda xs: "; ".join(xs))
    df.to_csv("crunchbase_companies.csv", index=False)
    print(df.head(3).to_dict(orient="records"))

Example output shape (the linkedin and stage_signal keys are omitted here for brevity):

[
  {
    'profile_url': 'https://www.crunchbase.com/organization/coreweave',
    'name': 'CoreWeave',
    'description': 'CoreWeave is a cloud infrastructure provider purpose-built for AI.',
    'website': 'https://www.coreweave.com',
    'location': 'Roseland, New Jersey, United States',
    'categories': 'Artificial Intelligence; Cloud Computing; GPU',
    'growth_score': 97,
    'cb_rank': 2,
    'heat_score': 95
  }
]
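
When you read the CSV back, two details of the export matter: categories were joined with "; ", and a score column that contains any missing value is written by pandas as floats (for example 97.0) with empty cells for the gaps. The sketch below reverses both using only the standard library. `load_rows` is a hypothetical helper, and the inline CSV is sample data shaped like the example above, with the heat score left blank to show the missing case.

```python
import csv
import io
from typing import Dict, List

SCORE_COLUMNS = ("growth_score", "cb_rank", "heat_score")


def load_rows(fileobj) -> List[Dict]:
    """Parse the exported CSV back into dicts with typed fields."""
    rows = []
    for raw in csv.DictReader(fileobj):
        row = dict(raw)
        # Undo the "; ".join(...) applied before export.
        row["categories"] = [
            c.strip() for c in (row.get("categories") or "").split(";") if c.strip()
        ]
        for key in SCORE_COLUMNS:
            value = (row.get(key) or "").strip()
            # int(float(...)) accepts both "97" and "97.0".
            row[key] = int(float(value)) if value else None
        rows.append(row)
    return rows


sample_csv = (
    "profile_url,name,categories,growth_score,cb_rank,heat_score\n"
    "https://www.crunchbase.com/organization/coreweave,CoreWeave,"
    "Artificial Intelligence; Cloud Computing; GPU,97.0,2.0,\n"
)
rows = load_rows(io.StringIO(sample_csv))
```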

Practical notes for Crunchbase

Crunchbase is a good example of why scraper architecture matters more than clever selectors.

Use this checklist:

  • fail if the response contains a block page
  • dedupe profile URLs before crawling
  • keep request rates low
  • treat scores as optional fields because the visible page can change
  • prefer JSON-LD when it exists
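
One way to apply the checklist in the crawl loop is to record failures per URL instead of letting one bad profile abort the whole batch. This is a sketch under assumptions: `crawl_profiles` is a hypothetical function, and the fetch and parse callables are injected so the demo can use fakes in place of `fetch_html` and `parse_profile_fields`.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def crawl_profiles(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str, str], Dict],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[Dict], List[Tuple[str, str]]]:
    """Fetch and parse each URL, pausing between requests.

    Returns (rows, failures); each failure is (url, error description).
    """
    rows: List[Dict] = []
    failures: List[Tuple[str, str]] = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # pause between requests, not before the first
        try:
            rows.append(parse(fetch(url), url))
        except Exception as exc:
            # Record the failure explicitly rather than dropping it silently.
            failures.append((url, f"{type(exc).__name__}: {exc}"))
    return rows, failures


# Demo with fakes: the second URL is "blocked".
def fake_fetch(url: str) -> str:
    if url == "b":
        raise RuntimeError(f"blocked while fetching {url}")
    return "<html></html>"


def fake_parse(html: str, url: str) -> Dict:
    return {"profile_url": url}


pauses: List[float] = []
rows, failures = crawl_profiles(
    ["a", "b", "c"], fake_fetch, fake_parse, delay_seconds=2.5, sleep=pauses.append
)
```

Whether a block page should skip one URL or stop the whole run is a design choice. If blocks tend to come in streaks, stopping early may be the kinder option for both sides.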

Also remember that discover results are paginated and filterable. Once the basic flow works, you can expand it by:

  • storing the query URL you used for discovery
  • following the next result page
  • segmenting by industry or geography

If your goal is account enrichment rather than full-site crawling, a smaller high-quality crawl is usually better than a giant noisy one.


When to switch to Playwright

Stay with requests + rendered HTML when:

  • the rendered response contains the fields you need
  • you only need profile text, links, and visible scores

Switch to Playwright when:

  • the page requires button clicks or scrolling before data appears
  • you need data hidden behind tabs or dialogs
  • the rendered HTML route stops exposing stable structure
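
A cheap way to make that decision per page is to check whether the rendered HTML still contains the markers your parser relies on. The sketch below is illustrative only: `missing_markers` and the marker list are assumptions you would adapt to the fields you actually extract, and the sample HTML is made up.

```python
from typing import List, Sequence

# Strings the parser depends on (assumed; adjust to your own fields).
REQUIRED_MARKERS = ("application/ld+json", "CB Rank")


def missing_markers(html: str, markers: Sequence[str] = REQUIRED_MARKERS) -> List[str]:
    """Return the markers that do not appear in the rendered HTML."""
    lowered = html.lower()
    return [m for m in markers if m.lower() not in lowered]


complete_page = (
    '<script type="application/ld+json">{"@type": "Organization"}</script>'
    "<span>CB Rank 2</span>"
)
shell_page = "<html><body><div id='app'></div></body></html>"

ok = missing_markers(complete_page)       # nothing missing: keep using requests
degraded = missing_markers(shell_page)    # markers missing: consider a browser
```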

For many research pipelines, the hybrid pattern is enough:

  • discover URLs from rendered HTML
  • parse JSON-LD and visible text
  • export a narrow, reliable CSV

That gets you company names, descriptions, categories, websites, and lightweight momentum signals from Crunchbase without building a full browser automation stack from day one.
