How to Build a Job Board by Scraping Indeed + LinkedIn (Pipeline + Deduping)
Building a job board sounds simple: “scrape jobs, show them on a website.”
In practice, job sites are some of the most aggressively protected pages on the internet:
- heavy anti-bot
- frequent HTML changes
- dynamic rendering
- strict rate limiting
So the goal of this post is not to give you a brittle script.
It’s to give you a pipeline design you can actually run:
- Collect jobs from Indeed and LinkedIn (at least to a URL + metadata level)
- Normalize them into a single schema
- Deduplicate (same job appears in both places)
- Enrich with company + location standardization
- Refresh on a cadence so your board stays current
And we’ll be honest about the constraints: for certain LinkedIn surfaces, you’ll likely need either a logged-in session, a browser automation layer, or a third-party data provider.
The hard part of job scraping isn’t parsing HTML—it’s running the pipeline every day without getting rate-limited. ProxiesAPI helps by rotating IPs and smoothing intermittent blocks.
The first-principles approach
A job board is a data product. Treat it like one:
- Acquire job records reliably
- Normalize into a single schema
- De-duplicate and score for quality
- Store with history (so you can detect changes)
- Serve via an API + frontend
- Refresh with careful scheduling
Scraping is only step 1—and it’s rarely the hardest part long-term.
What you should collect (minimum viable job schema)
Start with a schema you can safely populate from both sites, even if fields are missing:
```json
{
  "source": "indeed|linkedin",
  "source_id": "string",
  "source_url": "string",
  "title": "string",
  "company": "string",
  "location_raw": "string",
  "location_norm": "string",
  "remote_type": "onsite|hybrid|remote|unknown",
  "posted_at": "ISO8601|unknown",
  "description_text": "string",
  "employment_type": "full-time|contract|intern|unknown",
  "seniority": "string|unknown",
  "salary_raw": "string|unknown",
  "tags": ["string"],
  "first_seen_at": "ISO8601",
  "last_seen_at": "ISO8601",
  "hash": "string"
}
```
Two fields matter a lot:
- `source_id` (unique per source)
- `hash` (your cross-source dedupe key)
Collection layer: crawl “search → job detail”
Both Indeed and LinkedIn follow a similar shape:
- a search results page (many jobs)
- a job detail page (one job)
A robust pipeline uses a two-step crawl:
- Search crawl: collect job URLs + lightweight metadata
- Detail crawl: fetch each job page to extract the full record
Why two steps?
- search pages are cheap and allow incremental discovery
- details are expensive—only fetch what’s new
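The search side of this split can be sketched as a small incremental-discovery step. `discover_new_urls` and the in-memory `seen` set below are illustrative stand-ins for a persistent URL queue:

```python
# Incremental discovery: only surface job URLs we haven't queued before.
# In production, `seen` would be backed by a persistent store, not a set.

def discover_new_urls(search_results: list[dict], seen: set[str]) -> list[str]:
    """Return the URLs from one search page that are new, marking them seen."""
    new_urls = []
    for item in search_results:
        url = item.get("url")
        if url and url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls

seen: set[str] = set()
page1 = [{"url": "https://example.com/job/1"}, {"url": "https://example.com/job/2"}]
page2 = [{"url": "https://example.com/job/2"}, {"url": "https://example.com/job/3"}]

first_batch = discover_new_urls(page1, seen)   # both URLs are new
second_batch = discover_new_urls(page2, seen)  # only job/3 is new
```

Only `second_batch` goes to the expensive detail crawl; job/2 was already discovered on the first pass.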
Data structures to persist
You want at least 3 tables (or collections):
- `job_urls` (discovery queue)
- `jobs` (canonical job records)
- `crawl_runs` (run status, errors, stats)
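Here's a minimal sketch of those three tables in SQLite; the column names are assumptions, not a prescribed schema:

```python
import sqlite3

# Illustrative minimal schema for the three tables.
DDL = """
CREATE TABLE IF NOT EXISTS job_urls (
    url TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',  -- pending | fetched | failed
    discovered_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS jobs (
    hash TEXT PRIMARY KEY,              -- cross-source dedupe key
    source TEXT NOT NULL,
    source_id TEXT,
    source_url TEXT,
    title TEXT,
    company TEXT,
    location_norm TEXT,
    first_seen_at TEXT NOT NULL,
    last_seen_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS crawl_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at TEXT NOT NULL,
    finished_at TEXT,
    pages_fetched INTEGER DEFAULT 0,
    errors INTEGER DEFAULT 0
);
"""

conn = sqlite3.connect(":memory:")  # use a file path for a real pipeline
conn.executescript(DDL)
```

Keeping `first_seen_at` and `last_seen_at` on `jobs` is what later makes expiry detection a single query.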
Network reliability: timeouts, retries, proxies
When you scale from 50 requests/day to 50,000, your failure rate goes from “rare” to “constant.”
So you need a boring-but-solid network layer:
- timeouts (connect + read)
- bounded retries
- backoff on 403/429
- proxy rotation for bursty workloads
Here’s a practical ProxiesAPI-powered fetch helper you can reuse.
```python
import os
import time
import urllib.parse

import requests

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY", "")
TIMEOUT = (10, 30)  # (connect, read) seconds

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
})

def proxiesapi_url(target_url: str) -> str:
    qs = urllib.parse.urlencode({"auth_key": PROXIESAPI_KEY, "url": target_url})
    return f"https://api.proxiesapi.com/?{qs}"

def fetch(url: str, retries: int = 5) -> str:
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
            if r.status_code in (403, 429, 500, 502, 503, 504):
                # Back off on blocks and transient server errors, then retry.
                time.sleep(min(2 ** attempt, 30))
                continue
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_exc = e
            time.sleep(min(2 ** attempt, 30))
    raise RuntimeError(f"Failed to fetch: {url}") from last_exc
```
Parsing strategy (don’t fight the whole DOM)
For job pages, your objective is not “extract everything.”
It’s:
- title
- company
- location
- posted date (if available)
- description text
If you can extract those reliably, you can build a useful board.
Indeed: selectors and JSON-LD
Many job pages include structured data (`application/ld+json`). When present, it's more stable than CSS classes.
```python
import json

from bs4 import BeautifulSoup

def extract_jsonld(soup: BeautifulSoup) -> list[dict]:
    """Collect every JSON-LD object embedded in the page."""
    out = []
    for tag in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(tag.get_text(strip=True))
            if isinstance(data, dict):
                out.append(data)
            elif isinstance(data, list):
                out.extend([x for x in data if isinstance(x, dict)])
        except Exception:
            continue
    return out
```
Then you can map fields:
- `title` → `title`
- `hiringOrganization.name` → `company`
- `jobLocation.address.addressLocality` → `location_raw`
- `description` → `description_text`
LinkedIn: be realistic
LinkedIn is notorious for anti-bot and logged-in gating.
A practical approach is:
- treat LinkedIn as a discovery source (URLs + metadata) unless you can reliably fetch details
- if you need full descriptions at scale, consider:
- a browser automation worker (Playwright) with careful pacing
- an official API or data provider
You can still build a job board that links out to LinkedIn for the details.
Deduping: how to detect the “same job” across sources
Cross-site dedupe is the difference between a clean board and a spammy one.
1) Normalize company + title
- lowercase
- strip punctuation
- collapse whitespace
- remove seniority prefixes that are noisy (“Sr”, “Senior”, “Lead”) depending on your needs
```python
import hashlib
import re

def norm_text(s: str) -> str:
    s = (s or "").lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def job_hash(title: str, company: str, location: str) -> str:
    base = "|".join([norm_text(title), norm_text(company), norm_text(location)])
    return hashlib.sha1(base.encode("utf-8")).hexdigest()
```
This hash won’t be perfect, but it’s a strong baseline.
2) Add URL canonicalization
Sometimes two URLs point to the same job. Store a canonical form:
- strip tracking params (`utm_*`)
- normalize scheme/host
- keep the job ID when present
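Those three rules fit in one small helper. This is a sketch: extend `TRACKING_PREFIXES` for whatever params you actually see (`gclid`, `fbclid`, ...):

```python
import urllib.parse

TRACKING_PREFIXES = ("utm_",)  # extend as needed

def canonical_url(url: str) -> str:
    """Normalize scheme/host, drop tracking params, keep the job ID params."""
    parts = urllib.parse.urlsplit(url)
    query = [
        (k, v)
        for k, v in urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    return urllib.parse.urlunsplit((
        parts.scheme.lower() or "https",
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        urllib.parse.urlencode(query),
        "",  # drop fragments
    ))
```

Store the canonical form alongside the raw URL so dedupe compares canonical-to-canonical while links still point at the original.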
3) Use fuzzy matching as a second pass
For high quality, you can run a second pass:
- same company
- similar title (token similarity)
- same city/remote type
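A simple token-overlap (Jaccard) score is enough for a first version of that second pass; the 0.6 threshold below is a starting point to tune, not a magic number:

```python
def title_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase title tokens (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def likely_same_job(j1: dict, j2: dict, threshold: float = 0.6) -> bool:
    """Second-pass check: same company, same location, similar title."""
    return (
        j1["company"] == j2["company"]
        and j1["location_norm"] == j2["location_norm"]
        and title_similarity(j1["title"], j2["title"]) >= threshold
    )
```

Run this only within (company, location) buckets so the pairwise comparison stays cheap.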
Refresh cadence: keep it current without hammering the sites
A job board dies when it shows stale listings.
Typical cadence:
- Search pages: every 1–6 hours (depending on niche)
- Detail pages: fetch on first discovery, then refresh every 1–3 days
- Expiry: if a job hasn’t been seen for 7–30 days, mark as expired
Key principle: don’t re-fetch everything.
Make your pipeline incremental:
- only fetch details for URLs that are new or need refresh
- track `last_seen_at` by re-crawling search pages
Comparison: Indeed vs LinkedIn (for a job board)
| Dimension | Indeed | LinkedIn |
|---|---|---|
| Coverage | Broad, many roles | Strong for white-collar / tech |
| Anti-bot difficulty | High | Very high |
| Structured data | Often JSON-LD | Less consistently accessible |
| Best use | Core acquisition + details | Discovery + outbound linking (unless you have a browser worker) |
Legal + compliance notes (practical, not scary)
- Read each site’s terms.
- Respect robots.txt where it makes sense for your risk tolerance.
- Don’t collect personal data you don’t need.
- If you get cease-and-desist, stop and reassess.
If you’re building a serious business, consider using licensed job data providers.
Where ProxiesAPI fits (honestly)
Proxies don’t “solve” scraping.
But once you have a good pipeline (timeouts, retries, pacing), ProxiesAPI helps with:
- rotating IPs so you don’t burn one address
- smoothing occasional 403/429 spikes
- keeping scheduled jobs from failing randomly
Think of it as an infrastructure layer: not a hack.
A practical next step
If you’re implementing this, do it in phases:
- Indeed search → URL queue → detail scrape → store
- Add LinkedIn as discovery-only (URLs + minimal metadata)
- Add dedupe + enrichment
- Add refresh scheduling + monitoring
Once the pipeline is stable, you can obsess over UI.