How to Build a Job Board by Scraping Indeed + LinkedIn (Pipeline + Deduping)

Building a job board sounds simple: “scrape jobs, show them on a website.”

In practice, job sites are some of the most aggressively protected pages on the internet:

  • heavy anti-bot
  • frequent HTML changes
  • dynamic rendering
  • strict rate limiting

So the goal of this post is not to give you a brittle script.

It’s to give you a pipeline design you can actually run:

  • Collect jobs from Indeed and LinkedIn (at least to a URL + metadata level)
  • Normalize them into a single schema
  • Deduplicate (same job appears in both places)
  • Enrich with company + location standardization
  • Refresh on a cadence so your board stays current

And we’ll be honest about the constraints: for certain LinkedIn surfaces, you’ll likely need either a logged-in session, a browser automation layer, or a third-party data provider.

Use ProxiesAPI when you move from prototype to pipeline

The hard part of job scraping isn’t parsing HTML—it’s running the pipeline every day without getting rate-limited. ProxiesAPI helps by rotating IPs and smoothing intermittent blocks.


The first-principles approach

A job board is a data product. Treat it like one:

  1. Acquire job records reliably
  2. Normalize into a single schema
  3. De-duplicate and score for quality
  4. Store with history (so you can detect changes)
  5. Serve via an API + frontend
  6. Refresh with careful scheduling

Scraping is only step 1—and it’s rarely the hardest part long-term.


What you should collect (minimum viable job schema)

Start with a schema you can safely populate from both sites, even if fields are missing:

{
  "source": "indeed|linkedin",
  "source_id": "string",
  "source_url": "string",
  "title": "string",
  "company": "string",
  "location_raw": "string",
  "location_norm": "string",
  "remote_type": "onsite|hybrid|remote|unknown",
  "posted_at": "ISO8601|unknown",
  "description_text": "string",
  "employment_type": "full-time|contract|intern|unknown",
  "seniority": "string|unknown",
  "salary_raw": "string|unknown",
  "tags": ["string"],
  "first_seen_at": "ISO8601",
  "last_seen_at": "ISO8601",
  "hash": "string"
}

Two fields matter a lot:

  • source_id (unique per source)
  • hash (your cross-source dedupe key)
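In Python, this schema can be mirrored as a TypedDict so the normalizer, dedupe pass, and storage layer share one shape. This is a sketch; the field names simply follow the JSON above:

```python
from typing import List, TypedDict


class JobRecord(TypedDict):
    source: str             # "indeed" or "linkedin"
    source_id: str          # unique per source
    source_url: str
    title: str
    company: str
    location_raw: str
    location_norm: str
    remote_type: str        # "onsite" | "hybrid" | "remote" | "unknown"
    posted_at: str          # ISO8601 or "unknown"
    description_text: str
    employment_type: str    # "full-time" | "contract" | "intern" | "unknown"
    seniority: str
    salary_raw: str
    tags: List[str]
    first_seen_at: str      # ISO8601
    last_seen_at: str       # ISO8601
    hash: str               # cross-source dedupe key
```

Static type checkers will then catch missing fields wherever you construct records.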

Collection layer: crawl “search → job detail”

Both Indeed and LinkedIn follow a similar shape:

  • a search results page (many jobs)
  • a job detail page (one job)

A robust pipeline uses a two-step crawl:

  1. Search crawl: collect job URLs + lightweight metadata
  2. Detail crawl: fetch each job page to extract the full record

Why two steps?

  • search pages are cheap and allow incremental discovery
  • details are expensive—only fetch what’s new
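The two-step shape can be sketched in a few lines. Here `fetch`, `extract_job_urls`, and `extract_job` are hypothetical helpers (concrete versions appear later in this post), and `seen` stands in for the job_urls table:

```python
def crawl(search_urls, seen, fetch, extract_job_urls, extract_job):
    """Two-step crawl: discover cheaply, fetch details only for new URLs."""
    jobs = []
    for search_url in search_urls:
        # Step 1: search crawl (cheap, many jobs per page)
        for job_url in extract_job_urls(fetch(search_url)):
            if job_url in seen:
                continue  # already discovered; skip the expensive detail fetch
            seen.add(job_url)
            # Step 2: detail crawl (expensive, one request per job)
            jobs.append(extract_job(fetch(job_url), job_url))
    return jobs
```

The `seen` check is what makes the pipeline incremental: re-running the search crawl costs a handful of requests, not one per job.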

Data structures to persist

You want at least 3 tables (or collections):

  • job_urls (discovery queue)
  • jobs (canonical job records)
  • crawl_runs (run status, errors, stats)
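If you prototype on SQLite, those three tables might look like the following. Table and column names here are illustrative, not prescriptive:

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS job_urls (        -- discovery queue
    url           TEXT PRIMARY KEY,
    source        TEXT NOT NULL,
    discovered_at TEXT NOT NULL,
    fetched_at    TEXT                       -- NULL until the detail crawl succeeds
);
CREATE TABLE IF NOT EXISTS jobs (            -- canonical job records
    hash          TEXT PRIMARY KEY,          -- cross-source dedupe key
    source        TEXT NOT NULL,
    source_id     TEXT NOT NULL,
    source_url    TEXT NOT NULL,
    title         TEXT,
    company       TEXT,
    location_norm TEXT,
    first_seen_at TEXT NOT NULL,
    last_seen_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS crawl_runs (      -- run status, errors, stats
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at  TEXT NOT NULL,
    finished_at TEXT,
    status      TEXT,                        -- 'ok' | 'partial' | 'failed'
    errors      INTEGER DEFAULT 0
);
"""

# Use a file path (e.g. "jobboard.db") in production; :memory: keeps the demo self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

Making `hash` the primary key of jobs gives you cross-source dedupe at insert time via `INSERT OR IGNORE` / upserts.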

Network reliability: timeouts, retries, proxies

When you scale from 50 requests/day to 50,000, your failure rate goes from “rare” to “constant.”

So you need a boring-but-solid network layer:

  • timeouts (connect + read)
  • bounded retries
  • backoff on 403/429
  • proxy rotation for bursty workloads

Here’s a practical ProxiesAPI-powered fetch helper you can reuse.

import os
import time
import urllib.parse
import requests

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY", "")
TIMEOUT = (10, 30)

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
})


def proxiesapi_url(target_url: str) -> str:
    qs = urllib.parse.urlencode({"auth_key": PROXIESAPI_KEY, "url": target_url})
    return f"https://api.proxiesapi.com/?{qs}"


def fetch(url: str, retries: int = 5) -> str:
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
            if r.status_code in (403, 429, 500, 502, 503, 504):
                time.sleep(min(2 ** attempt, 30))
                continue
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_exc = e
            time.sleep(min(2 ** attempt, 30))
    raise RuntimeError(f"Failed to fetch: {url}") from last_exc

Parsing strategy (don’t fight the whole DOM)

For job pages, your objective is not “extract everything.”

It’s:

  • title
  • company
  • location
  • posted date (if available)
  • description text

If you can extract those reliably, you can build a useful board.

Indeed: selectors and JSON-LD

Many job pages include structured data (application/ld+json). When present, it’s more stable than CSS classes.

import json
from bs4 import BeautifulSoup


def extract_jsonld(soup: BeautifulSoup) -> list[dict]:
    out = []
    for tag in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(tag.get_text(strip=True))
            if isinstance(data, dict):
                out.append(data)
            elif isinstance(data, list):
                out.extend([x for x in data if isinstance(x, dict)])
        except Exception:
            continue
    return out

Then you can map fields:

  • title
  • hiringOrganization.name
  • jobLocation.address.addressLocality
  • description
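A defensive mapping for a schema.org JobPosting node might look like this. It consumes the dicts produced by `extract_jsonld` above; `datePosted` is an additional JobPosting field beyond the list, and every lookup is guarded because real pages vary:

```python
from typing import Optional


def map_jobposting(data: dict) -> Optional[dict]:
    # Only JobPosting nodes carry the fields we need.
    if data.get("@type") != "JobPosting":
        return None
    org = data.get("hiringOrganization") or {}
    loc = data.get("jobLocation") or {}
    if isinstance(loc, list):  # some pages list multiple locations
        loc = loc[0] if loc else {}
    addr = loc.get("address") or {} if isinstance(loc, dict) else {}
    return {
        "title": data.get("title", ""),
        "company": org.get("name", "") if isinstance(org, dict) else str(org),
        "location_raw": addr.get("addressLocality", "") if isinstance(addr, dict) else "",
        "posted_at": data.get("datePosted", "unknown"),
        "description_text": data.get("description", ""),
    }
```

Run it over every dict `extract_jsonld` returns and keep the first non-None result.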

LinkedIn: be realistic

LinkedIn is notorious for anti-bot and logged-in gating.

A practical approach is:

  • treat LinkedIn as a discovery source (URLs + metadata) unless you can reliably fetch details
  • if you need full descriptions at scale, consider:
    • a browser automation worker (Playwright) with careful pacing
    • an official API or data provider

You can still build a job board that links out to LinkedIn for the details.


Deduping: how to detect the “same job” across sources

Cross-site dedupe is the difference between a clean board and a spammy one.

1) Normalize company + title

  • lowercase
  • strip punctuation
  • collapse whitespace
  • optionally strip noisy seniority prefixes (“Sr”, “Senior”, “Lead”), depending on your needs

import re
import hashlib


def norm_text(s: str) -> str:
    s = (s or "").lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s


def job_hash(title: str, company: str, location: str) -> str:
    base = "|".join([norm_text(title), norm_text(company), norm_text(location)])
    return hashlib.sha1(base.encode("utf-8")).hexdigest()

This hash won’t be perfect, but it’s a strong baseline.

2) Add URL canonicalization

Sometimes two URLs point to the same job. Store a canonical form:

  • strip tracking params (utm_*)
  • normalize scheme/host
  • keep the job ID when present
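Here is a sketch of that canonicalization with urllib.parse. The exact parameter blocklist (ref, trk) is an assumption you should tune per site:

```python
import urllib.parse

# Parameters assumed to be tracking noise; extend per site.
TRACKING_PARAMS = {"ref", "trk"}


def canonical_url(url: str) -> str:
    """Strip tracking params, normalize scheme/host, drop fragments."""
    parts = urllib.parse.urlsplit(url)
    query = [
        (k, v)
        for k, v in urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
        # keep job IDs (e.g. Indeed's jk=); drop utm_* and known trackers
        if not k.lower().startswith("utm_") and k.lower() not in TRACKING_PARAMS
    ]
    return urllib.parse.urlunsplit((
        "https",                  # normalize scheme
        parts.netloc.lower(),     # normalize host case
        parts.path,
        urllib.parse.urlencode(query),
        "",                       # drop fragment
    ))
```

Store this canonical form alongside the raw URL so two discovery paths to the same posting collapse into one queue entry.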

3) Use fuzzy matching as a second pass

For high quality, you can run a second pass:

  • same company
  • similar title (token similarity)
  • same city/remote type
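One simple way to do the title comparison is Jaccard similarity over normalized tokens. `norm_text` is repeated here so the example is self-contained, and the 0.6 threshold is a starting point, not a rule:

```python
import re


def norm_text(s: str) -> str:
    """Same normalizer as above, repeated for a self-contained example."""
    s = (s or "").lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    return re.sub(r"\s+", " ", s).strip()


def title_similarity(a: str, b: str) -> float:
    """Jaccard similarity over title tokens: |A & B| / |A | B|."""
    ta, tb = set(norm_text(a).split()), set(norm_text(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def likely_same_job(a: dict, b: dict, threshold: float = 0.6) -> bool:
    """Second-pass check: same company, same normalized location, similar title."""
    return (
        norm_text(a["company"]) == norm_text(b["company"])
        and a.get("location_norm") == b.get("location_norm")
        and title_similarity(a["title"], b["title"]) >= threshold
    )
```

Because the company and location gates run first, the fuzzy comparison only touches a small candidate set, so it stays cheap even on a large board.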

Refresh cadence: keep it current without hammering the sites

A job board dies when it shows stale listings.

Typical cadence:

  • Search pages: every 1–6 hours (depending on niche)
  • Detail pages: fetch on first discovery, then refresh every 1–3 days
  • Expiry: if a job hasn’t been seen for 7–30 days, mark as expired

Key principle: don’t re-fetch everything.

Make your pipeline incremental:

  • only fetch details for URLs that are new or need refresh
  • track last_seen_at by re-crawling search pages
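That incremental logic fits in one function. Datetimes are ISO8601 strings per the schema; the thresholds are picked from the cadence ranges above, and `fetched_at` is an assumed bookkeeping field on the stored record:

```python
from datetime import datetime, timedelta

REFRESH_AFTER = timedelta(days=2)    # re-fetch details every 1–3 days
EXPIRE_AFTER = timedelta(days=14)    # mark expired after 7–30 days unseen


def classify(job: dict, now: datetime) -> str:
    """Return 'fetch', 'expire', 'refresh', or 'skip' for one stored job."""
    if not job.get("fetched_at"):
        return "fetch"                              # new URL: first detail fetch
    last_seen = datetime.fromisoformat(job["last_seen_at"])
    if now - last_seen > EXPIRE_AFTER:
        return "expire"                             # gone from search pages
    fetched = datetime.fromisoformat(job["fetched_at"])
    if now - fetched > REFRESH_AFTER:
        return "refresh"                            # details may be stale
    return "skip"                                   # nothing to do
```

Run this over the job_urls table at the start of each scheduled run and you get your work queue for free.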

Comparison: Indeed vs LinkedIn (for a job board)

  Dimension           | Indeed                      | LinkedIn
  Coverage            | Broad, many roles           | Strong for white-collar / tech
  Anti-bot difficulty | High                        | Very high
  Structured data     | Often JSON-LD               | Less consistently accessible
  Best use            | Core acquisition + details  | Discovery + outbound linking (unless you have a browser worker)

Legal and ethical guardrails

  • Read each site’s terms.
  • Respect robots.txt where it makes sense for your risk tolerance.
  • Don’t collect personal data you don’t need.
  • If you get cease-and-desist, stop and reassess.

If you’re building a serious business, consider using licensed job data providers.


Where ProxiesAPI fits (honestly)

Proxies don’t “solve” scraping.

But once you have a good pipeline (timeouts, retries, pacing), ProxiesAPI helps with:

  • rotating IPs so you don’t burn one address
  • smoothing occasional 403/429 spikes
  • keeping scheduled jobs from failing randomly

Think of it as an infrastructure layer: not a hack.


A practical next step

If you’re implementing this, do it in phases:

  1. Indeed search → URL queue → detail scrape → store
  2. Add LinkedIn as discovery-only (URLs + minimal metadata)
  3. Add dedupe + enrichment
  4. Add refresh scheduling + monitoring

Once the pipeline is stable, you can obsess over UI.
