How to Scrape LinkedIn Job Postings (Public Jobs) with Python + ProxiesAPI
LinkedIn has a public-facing jobs surface where individual job pages can be viewed without authentication.
That makes it tempting to scrape. The catch is that job pages:
- vary by region and experiment cohort
- may include dynamic fragments
- can intermittently return different markup
In this tutorial we’ll build a practical LinkedIn public job scraper that:
- fetches job pages with a session + timeouts
- retries with exponential backoff
- extracts title, company, location, and posted date
- exports JSONL so you can run it nightly

Public job pages can be inconsistent at scale: timeouts, transient failures, and soft blocks. ProxiesAPI gives you a clean proxy layer so retries don’t collapse your pipeline.
Important notes (read this before you run a crawler)
- This guide focuses on public pages (no login).
- Don’t attempt to scrape private areas behind authentication.
- Keep request rates modest, and build backoff.
If your use case is “alerts for a handful of companies,” you probably don’t need a massive crawler.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
Step 1: Build a fetch layer that survives the real world
At small scale, you can call requests.get() and move on.
At any real scale, you want:
- explicit connect/read timeouts
- retry on transient failures
- headers that look like a normal browser request
```python
import os
import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 35)

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def build_session() -> requests.Session:
    s = requests.Session()
    # If you use ProxiesAPI as an HTTP proxy, wire it here.
    # Example pattern (adjust to your ProxiesAPI docs/account):
    # PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
    # if PROXY_URL:
    #     s.proxies.update({"http": PROXY_URL, "https": PROXY_URL})
    s.headers.update({
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
        "Upgrade-Insecure-Requests": "1",
    })
    return s


session = build_session()


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch(url: str) -> FetchResult:
    # Small random delay keeps request pacing polite and less bursty.
    time.sleep(random.uniform(0.2, 0.7))
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.linkedin.com/jobs/",
    }
    r = session.get(url, headers=headers, timeout=TIMEOUT)
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)
```
Step 2: Understand LinkedIn public job URL patterns
Common public job URLs include:
- https://www.linkedin.com/jobs/view/ROLE-TITLE-at-COMPANY-<JOB_ID>
- https://www.linkedin.com/jobs/view/<JOB_ID>
In practice, you’ll collect these URLs from:
- your own curated list
- search result pages (which change frequently)
- sitemap-like sources (if available)
This post focuses on parsing the job detail page once you have the URL.
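Since both public URL shapes end in a numeric job id, it's worth normalizing your URL list to ids early (useful for dedup and later for change detection). A minimal sketch, assuming the trailing digits are always the job id:

```python
import re
from typing import Optional


def extract_job_id(url: str) -> Optional[str]:
    # Matches both /jobs/view/<JOB_ID> and /jobs/view/role-at-company-<JOB_ID>,
    # with or without a trailing slash.
    m = re.search(r"/jobs/view/(?:[^/]*-)?(\d+)", url)
    return m.group(1) if m else None
```

Keying your storage on the job id rather than the full URL means the two URL shapes for the same posting collapse into one record.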
Step 3: Parse job fields robustly
LinkedIn’s markup changes. Avoid relying on one brittle selector.
A robust strategy:
- Try high-signal elements first (h1, company link, location blocks)
- Fall back to scanning for known labels
- Keep the parsing functions small and testable
```python
import re
from typing import Optional

from bs4 import BeautifulSoup


def text_or_none(el) -> Optional[str]:
    if not el:
        return None
    t = el.get_text(" ", strip=True)
    return t if t else None


def clean_whitespace(s: Optional[str]) -> Optional[str]:
    if not s:
        return None
    return re.sub(r"\s+", " ", s).strip()


def parse_linkedin_job(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Title
    title = clean_whitespace(text_or_none(soup.select_one("h1")))

    # Company: commonly a topcard link pointing at /company/
    company_el = (
        soup.select_one('a[href*="/company/"]')
        or soup.select_one('a[href*="linkedin.com/company/"]')
    )
    company = clean_whitespace(text_or_none(company_el))

    # Location: often near the top card; try a few patterns
    loc_el = (
        soup.select_one('[class*="topcard"] [class*="location"]')
        or soup.select_one('[data-test="job-location"]')
    )
    location = clean_whitespace(text_or_none(loc_el))
    if not location:
        # Fallback: best-effort scan for something resembling "City, State"
        header_text = clean_whitespace(text_or_none(soup.select_one("main")))
        if header_text:
            m = re.search(r"\b([A-Z][a-zA-Z .'-]+,\s*[A-Z]{2,}(?:\s*[A-Z][a-z]+)?)\b", header_text)
            if m:
                location = m.group(1)

    # Posted date: some pages have <time datetime="..."> or text like "2 days ago"
    posted = clean_whitespace(text_or_none(soup.select_one("time")))
    if not posted:
        # Fallback: search the page text for relative-date phrases
        txt = soup.get_text("\n", strip=True)
        m = re.search(r"\b(\d+\s+(?:minute|hour|day|week|month)s?\s+ago)\b", txt, re.I)
        if m:
            posted = m.group(1)

    return {
        "url": url,
        "title": title,
        "company": company,
        "location": location,
        "posted": posted,
    }
```
Selector philosophy (why this is written this way)
Instead of guessing one CSS class (which will break), we:
- use semantic tags like h1 and time
- use URL patterns like /company/
- implement a final text-based fallback
This won’t be perfect, but it will fail gracefully and keep your pipeline moving.
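Because the parsing helpers are small, you can exercise the selectors against a synthetic HTML fixture with no network involved. A minimal sketch (the fixture markup and class names are invented for illustration, and the stdlib html.parser is used so the check doesn't require lxml):

```python
from bs4 import BeautifulSoup

# Invented fixture approximating a public job page's top card.
FIXTURE = """
<html><body><main>
  <h1>Senior Data Engineer</h1>
  <a href="https://www.linkedin.com/company/acme">Acme Corp</a>
  <div class="topcard__content"><span class="topcard__location">Austin, TX</span></div>
  <time datetime="2024-05-01">2 weeks ago</time>
</main></body></html>
"""

soup = BeautifulSoup(FIXTURE, "html.parser")
title = soup.select_one("h1").get_text(" ", strip=True)
company = soup.select_one('a[href*="/company/"]').get_text(" ", strip=True)
location = soup.select_one('[class*="topcard"] [class*="location"]').get_text(" ", strip=True)
posted = soup.select_one("time").get_text(" ", strip=True)
```

When LinkedIn's markup shifts, fixtures like this tell you which selector broke before your production run does.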
Step 4: Run it on a list of job URLs (JSONL export)
```python
import json

URLS = [
    # Replace with real LinkedIn public job URLs
    "https://www.linkedin.com/jobs/view/1234567890/",
]


def run(urls: list[str]) -> None:
    with open("linkedin_jobs.jsonl", "w", encoding="utf-8") as f:
        for url in urls:
            try:
                res = fetch(url)
                item = parse_linkedin_job(res.text, url=url)
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
                print("ok", item.get("title"), "|", item.get("company"))
            except Exception as e:
                print("fail", url, type(e).__name__, str(e)[:200])


if __name__ == "__main__":
    run(URLS)
```
Making it “scale-ready”
When you go from 20 job pages to 20,000, add these upgrades:
- Queue + worker model (e.g., multiprocessing or a job queue)
- Persistent storage for seen URLs + last-scraped timestamps
- Structured logs (error rate, retry count, status code distribution)
- Backoff policy that slows down when failures rise
And yes: once you see repeated 429/403 patterns, a proxy layer becomes important.
ProxiesAPI is a good fit as the network layer, because you can route requests through a consistent endpoint and keep your scraper code unchanged.
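For the persistent seen-URL storage mentioned above, a single SQLite table goes a long way. A minimal sketch (the table name and the 24-hour re-scrape window are arbitrary choices, not part of any standard):

```python
import sqlite3
import time


def open_store(path: str = "scrape_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY, last_scraped REAL)"
    )
    return conn


def should_scrape(conn: sqlite3.Connection, url: str, max_age_s: float = 86400) -> bool:
    # Scrape if we've never seen the URL, or if the last visit is stale.
    row = conn.execute("SELECT last_scraped FROM seen_urls WHERE url = ?", (url,)).fetchone()
    return row is None or (time.time() - row[0]) > max_age_s


def mark_scraped(conn: sqlite3.Connection, url: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO seen_urls (url, last_scraped) VALUES (?, ?)",
        (url, time.time()),
    )
    conn.commit()
```

Filtering your URL list through should_scrape before fetching means a crashed run can resume without re-hitting every page.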
QA checklist
- Title matches the visible role name on the page
- Company is not None for most URLs
- Location is extracted for most URLs
- Posted is populated (either absolute or relative)
- JSONL contains one object per line
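Most of this checklist can be automated against the JSONL export. A small sketch that reports what fraction of records have each field populated (the field names match the parser output above):

```python
import json


def field_coverage(jsonl_path: str) -> dict[str, float]:
    """Return the fraction of records with a truthy value for each field."""
    total = 0
    filled = {"title": 0, "company": 0, "location": 0, "posted": 0}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total += 1
            for key in filled:
                if item.get(key):
                    filled[key] += 1
    return {key: (count / total if total else 0.0) for key, count in filled.items()}
```

If company coverage suddenly drops across a run, the company selector has probably drifted and deserves a look before you trust the rest of the data.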
Next upgrades
- Extract job description and normalize it (remove boilerplate)
- Add a canonical job id extractor from the URL
- Add “change detection” so you only notify when a job meaningfully changes
- Export to a DB (Postgres/SQLite) instead of JSONL
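For the change-detection upgrade, one lightweight approach is to fingerprint only the fields whose changes you care about. A sketch (which fields count as "meaningful" is a policy decision; here the volatile relative "posted" text is deliberately excluded):

```python
import hashlib
import json


def job_fingerprint(item: dict) -> str:
    # Hash only fields whose change should trigger a notification.
    stable = {key: item.get(key) for key in ("title", "company", "location")}
    blob = json.dumps(stable, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Store the fingerprint alongside each job id, and notify only when the stored and freshly computed values differ.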