How to Build a Job Board by Scraping Indeed + LinkedIn (Pipeline + Deduping)
Building a job board sounds simple: “scrape jobs, show them on a website.”
In practice, job sites are some of the most aggressively protected pages on the internet:
- heavy anti-bot
- frequent HTML changes
- dynamic rendering
- strict rate limiting
So the goal of this post is not to give you a brittle script.
It’s to give you a pipeline design you can actually run:
- Collect jobs from Indeed and LinkedIn (at least to a URL + metadata level)
- Normalize them into a single schema
- Deduplicate (same job appears in both places)
- Enrich with company + location standardization
- Refresh on a cadence so your board stays current
And we’ll be honest about the constraints: for certain LinkedIn surfaces, you’ll likely need either a logged-in session, a browser automation layer, or a third-party data provider.
The hard part of job scraping isn’t parsing HTML—it’s running the pipeline every day without getting rate-limited. ProxiesAPI helps by rotating IPs and smoothing intermittent blocks.
The first-principles approach
A job board is a data product. Treat it like one:
- Acquire job records reliably
- Normalize into a single schema
- De-duplicate and score for quality
- Store with history (so you can detect changes)
- Serve via an API + frontend
- Refresh with careful scheduling
Scraping is only step 1—and it’s rarely the hardest part long-term.
What you should collect (minimum viable job schema)
Start with a schema you can safely populate from both sites, even if fields are missing:
```json
{
  "source": "indeed|linkedin",
  "source_id": "string",
  "source_url": "string",
  "title": "string",
  "company": "string",
  "location_raw": "string",
  "location_norm": "string",
  "remote_type": "onsite|hybrid|remote|unknown",
  "posted_at": "ISO8601|unknown",
  "description_text": "string",
  "employment_type": "full-time|contract|intern|unknown",
  "seniority": "string|unknown",
  "salary_raw": "string|unknown",
  "tags": ["string"],
  "first_seen_at": "ISO8601",
  "last_seen_at": "ISO8601",
  "hash": "string"
}
```
Two fields matter a lot:
- `source_id` (unique per source)
- `hash` (your cross-source dedupe key)
Collection layer: crawl “search → job detail”
Both Indeed and LinkedIn follow a similar shape:
- a search results page (many jobs)
- a job detail page (one job)
A robust pipeline uses a two-step crawl:
- Search crawl: collect job URLs + lightweight metadata
- Detail crawl: fetch each job page to extract the full record
Why two steps?
- search pages are cheap and allow incremental discovery
- details are expensive—only fetch what’s new
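The search side of this split can be sketched as a small incremental-discovery step. `discover_new_urls` and the in-memory `seen` set below are illustrative stand-ins for a persistent URL queue:

```python
# Incremental discovery: only surface job URLs we haven't queued before.
# In production, `seen` would be backed by a persistent store, not a set.

def discover_new_urls(search_results: list[dict], seen: set[str]) -> list[str]:
    """Return the URLs from one search page that are new, marking them seen."""
    new_urls = []
    for item in search_results:
        url = item.get("url")
        if url and url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls

seen: set[str] = set()
page1 = [{"url": "https://example.com/job/1"}, {"url": "https://example.com/job/2"}]
page2 = [{"url": "https://example.com/job/2"}, {"url": "https://example.com/job/3"}]

first_batch = discover_new_urls(page1, seen)   # both URLs are new
second_batch = discover_new_urls(page2, seen)  # only job/3 is new
```

Only `second_batch` goes to the expensive detail crawl; job/2 was already discovered on the first pass.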
Data structures to persist
You want at least 3 tables (or collections):
- `job_urls` (discovery queue)
- `jobs` (canonical job records)
- `crawl_runs` (run status, errors, stats)
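Here's a minimal sketch of those three tables in SQLite; the column names are assumptions, not a prescribed schema:

```python
import sqlite3

# Illustrative minimal schema for the three tables.
DDL = """
CREATE TABLE IF NOT EXISTS job_urls (
    url TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',  -- pending | fetched | failed
    discovered_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS jobs (
    hash TEXT PRIMARY KEY,              -- cross-source dedupe key
    source TEXT NOT NULL,
    source_id TEXT,
    source_url TEXT,
    title TEXT,
    company TEXT,
    location_norm TEXT,
    first_seen_at TEXT NOT NULL,
    last_seen_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS crawl_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at TEXT NOT NULL,
    finished_at TEXT,
    pages_fetched INTEGER DEFAULT 0,
    errors INTEGER DEFAULT 0
);
"""

conn = sqlite3.connect(":memory:")  # use a file path for a real pipeline
conn.executescript(DDL)
```

Keeping `first_seen_at` and `last_seen_at` on `jobs` is what later makes expiry detection a single query.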
Network reliability: timeouts, retries, proxies
When you scale from 50 requests/day to 50,000, your failure rate goes from “rare” to “constant.”
So you need a boring-but-solid network layer:
- timeouts (connect + read)
- bounded retries
- backoff on 403/429
- proxy rotation for bursty workloads
Here’s a practical ProxiesAPI-powered fetch helper you can reuse.
```python
import os
import time
import urllib.parse

import requests

PROXIESAPI_KEY = os.environ.get("PROXIESAPI_KEY", "")
TIMEOUT = (10, 30)  # (connect, read) seconds

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
})

def proxiesapi_url(target_url: str) -> str:
    qs = urllib.parse.urlencode({"auth_key": PROXIESAPI_KEY, "url": target_url})
    return f"https://api.proxiesapi.com/?{qs}"

def fetch(url: str, retries: int = 5) -> str:
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            r = session.get(proxiesapi_url(url), timeout=TIMEOUT)
            if r.status_code in (403, 429, 500, 502, 503, 504):
                # Back off on blocks and transient server errors, then retry.
                time.sleep(min(2 ** attempt, 30))
                continue
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_exc = e
            time.sleep(min(2 ** attempt, 30))
    raise RuntimeError(f"Failed to fetch: {url}") from last_exc
```
Parsing strategy (don’t fight the whole DOM)
For job pages, your objective is not “extract everything.”
It’s:
- title
- company
- location
- posted date (if available)
- description text
If you can extract those reliably, you can build a useful board.
Indeed: selectors and JSON-LD
Many job pages include structured data (`application/ld+json`). When present, it's more stable than CSS classes.
```python
import json

from bs4 import BeautifulSoup

def extract_jsonld(soup: BeautifulSoup) -> list[dict]:
    """Collect every JSON-LD object embedded in the page."""
    out = []
    for tag in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(tag.get_text(strip=True))
            if isinstance(data, dict):
                out.append(data)
            elif isinstance(data, list):
                out.extend([x for x in data if isinstance(x, dict)])
        except Exception:
            continue
    return out
```
Then you can map fields:
- `title` → `title`
- `hiringOrganization.name` → `company`
- `jobLocation.address.addressLocality` → `location_raw`
- `description` → `description_text`
LinkedIn: be realistic
LinkedIn is notorious for anti-bot and logged-in gating.
A practical approach is:
- treat LinkedIn as a discovery source (URLs + metadata) unless you can reliably fetch details
- if you need full descriptions at scale, consider:
- a browser automation worker (Playwright) with careful pacing
- an official API or data provider
You can still build a job board that links out to LinkedIn for the details.
Deduping: how to detect the “same job” across sources
Cross-site dedupe is the difference between a clean board and a spammy one.
1) Normalize company + title
- lowercase
- strip punctuation
- collapse whitespace
- remove seniority prefixes that are noisy (“Sr”, “Senior”, “Lead”) depending on your needs
```python
import hashlib
import re

def norm_text(s: str) -> str:
    s = (s or "").lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def job_hash(title: str, company: str, location: str) -> str:
    base = "|".join([norm_text(title), norm_text(company), norm_text(location)])
    return hashlib.sha1(base.encode("utf-8")).hexdigest()
```
This hash won’t be perfect, but it’s a strong baseline.
2) Add URL canonicalization
Sometimes two URLs point to the same job. Store a canonical form:
- strip tracking params (`utm_*`)
- normalize scheme/host
- keep the job ID when present
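Those three rules fit in one small helper. This is a sketch: extend `TRACKING_PREFIXES` for whatever params you actually see (`gclid`, `fbclid`, ...):

```python
import urllib.parse

TRACKING_PREFIXES = ("utm_",)  # extend as needed

def canonical_url(url: str) -> str:
    """Normalize scheme/host, drop tracking params, keep the job ID params."""
    parts = urllib.parse.urlsplit(url)
    query = [
        (k, v)
        for k, v in urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    return urllib.parse.urlunsplit((
        parts.scheme.lower() or "https",
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        urllib.parse.urlencode(query),
        "",  # drop fragments
    ))
```

Store the canonical form alongside the raw URL so dedupe compares canonical-to-canonical while links still point at the original.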
3) Use fuzzy matching as a second pass
For high quality, you can run a second pass:
- same company
- similar title (token similarity)
- same city/remote type
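A simple token-overlap (Jaccard) score is enough for a first version of that second pass; the 0.6 threshold below is a starting point to tune, not a magic number:

```python
def title_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase title tokens (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def likely_same_job(j1: dict, j2: dict, threshold: float = 0.6) -> bool:
    """Second-pass check: same company, same location, similar title."""
    return (
        j1["company"] == j2["company"]
        and j1["location_norm"] == j2["location_norm"]
        and title_similarity(j1["title"], j2["title"]) >= threshold
    )
```

Run this only within (company, location) buckets so the pairwise comparison stays cheap.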
Refresh cadence: keep it current without hammering the sites
A job board dies when it shows stale listings.
Typical cadence:
- Search pages: every 1–6 hours (depending on niche)
- Detail pages: fetch on first discovery, then refresh every 1–3 days
- Expiry: if a job hasn’t been seen for 7–30 days, mark as expired
Key principle: don’t re-fetch everything.
Make your pipeline incremental:
- only fetch details for URLs that are new or need refresh
- track `last_seen_at` by re-crawling search pages
Comparison: Indeed vs LinkedIn (for a job board)
| Dimension | Indeed | LinkedIn |
|---|---|---|
| Coverage | Broad, many roles | Strong for white-collar / tech |
| Anti-bot difficulty | High | Very high |
| Structured data | Often JSON-LD | Less consistently accessible |
| Best use | Core acquisition + details | Discovery + outbound linking (unless you have a browser worker) |
Legal + compliance notes (practical, not scary)
- Read each site’s terms.
- Respect robots.txt where it makes sense for your risk tolerance.
- Don’t collect personal data you don’t need.
- If you get cease-and-desist, stop and reassess.
If you’re building a serious business, consider using licensed job data providers.
Where ProxiesAPI fits (honestly)
Proxies don’t “solve” scraping.
But once you have a good pipeline (timeouts, retries, pacing), ProxiesAPI helps with:
- rotating IPs so you don’t burn one address
- smoothing occasional 403/429 spikes
- keeping scheduled jobs from failing randomly
Think of it as an infrastructure layer: not a hack.
A practical next step
If you’re implementing this, do it in phases:
- Indeed search → URL queue → detail scrape → store
- Add LinkedIn as discovery-only (URLs + minimal metadata)
- Add dedupe + enrichment
- Add refresh scheduling + monitoring
Once the pipeline is stable, you can obsess over UI.