How to Scrape LinkedIn Job Postings (Public Jobs) with Python + ProxiesAPI

LinkedIn has a public-facing jobs surface where individual job pages can be viewed without authentication.

That makes it tempting to scrape. The catch is that job pages:

  • vary by region and experiment cohort
  • may include dynamic fragments
  • can intermittently return different markup

In this tutorial we’ll build a practical LinkedIn public job scraper that:

  • fetches job pages with a session + timeouts
  • retries with exponential backoff
  • extracts title, company, location, and posted date
  • exports JSONL so you can run it nightly

Screenshot: LinkedIn public job page (we'll extract title, company, location, and posted date)

Keep job-page crawls stable with ProxiesAPI

Public job pages can be inconsistent at scale: timeouts, transient failures, and soft blocks. ProxiesAPI gives you a clean proxy layer so retries don’t collapse your pipeline.


Important notes (read this before you run a crawler)

  • This guide focuses on public pages (no login).
  • Don’t attempt to scrape private areas behind authentication.
  • Keep request rates modest, and build backoff.

If your use case is “alerts for a handful of companies,” you probably don’t need a massive crawler.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity

Step 1: Build a fetch layer that survives the real world

At small scale, you can call requests.get() and move on.

At any real scale, you want:

  • explicit connect/read timeouts
  • retry on transient failures
  • headers that look like a normal browser request

import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

TIMEOUT = (10, 35)  # (connect timeout, read timeout) in seconds

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def build_session() -> requests.Session:
    s = requests.Session()

    # If you use ProxiesAPI as an HTTP proxy, wire it here.
    # Example pattern (adjust to your ProxiesAPI docs/account):
    # import os
    # PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
    # if PROXY_URL:
    #     s.proxies.update({"http": PROXY_URL, "https": PROXY_URL})

    s.headers.update({
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
        "Upgrade-Insecure-Requests": "1",
    })
    return s


session = build_session()


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch(url: str) -> FetchResult:
    time.sleep(random.uniform(0.2, 0.7))  # small jitter so requests don't fire in lockstep

    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.linkedin.com/jobs/",
    }

    r = session.get(url, headers=headers, timeout=TIMEOUT)
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)

Step 2: Understand LinkedIn public job URL patterns

Common public job URLs include:

  • https://www.linkedin.com/jobs/view/ROLE-TITLE-at-COMPANY-<JOB_ID>
  • https://www.linkedin.com/jobs/view/<JOB_ID>

In practice, you’ll collect these URLs from:

  • your own curated list
  • search result pages (which change frequently)
  • sitemap-like sources (if available)

This post focuses on parsing the job detail page once you have the URL.
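Both URL shapes end in a numeric job ID, which makes a good canonical key for deduping. A small, best-effort helper can normalize either form (the regex below is an assumption about the slug format, not a documented contract):

```python
import re
from typing import Optional

# Matches both ".../jobs/view/<slug>-<id>" and ".../jobs/view/<id>/"
JOB_ID_RE = re.compile(r"/jobs/view/(?:[^/]*-)?(\d+)/?")


def extract_job_id(url: str) -> Optional[str]:
    """Return the numeric job ID from a public job URL, or None."""
    m = JOB_ID_RE.search(url)
    return m.group(1) if m else None
```

Two URLs with different slugs but the same ID are the same posting, so this is also a handy dedupe key.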


Step 3: Parse job fields robustly

LinkedIn’s markup changes. Avoid relying on one brittle selector.

A robust strategy:

  1. Try high-signal elements first (h1, company link, location blocks)
  2. Fall back to scanning for known labels
  3. Keep the parsing functions small and testable

import re
from typing import Optional

from bs4 import BeautifulSoup


def text_or_none(el) -> Optional[str]:
    if not el:
        return None
    t = el.get_text(" ", strip=True)
    return t if t else None


def clean_whitespace(s: Optional[str]) -> Optional[str]:
    if not s:
        return None
    return re.sub(r"\s+", " ", s).strip()


def parse_linkedin_job(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Title
    title = clean_whitespace(text_or_none(soup.select_one("h1")))

    # Company
    # Common pattern: company name in a topcard link.
    # (Any href containing "linkedin.com/company/" also contains "/company/",
    # so a single substring selector covers both absolute and relative links.)
    company_el = soup.select_one('a[href*="/company/"]')
    company = clean_whitespace(text_or_none(company_el))

    # Location
    # Often near the top card; try a few patterns
    loc_el = (
        soup.select_one('[class*="topcard"] [class*="location"]')
        or soup.select_one('[data-test="job-location"]')
    )
    location = clean_whitespace(text_or_none(loc_el))

    if not location:
        # fallback: look for a common "·" separated line near top
        header_text = clean_whitespace(text_or_none(soup.select_one("main")))
        if header_text:
            # best-effort: find something that resembles "City, State" or "Area"
            m = re.search(r"\b([A-Z][a-zA-Z .'-]+,\s*[A-Z]{2,}(?:\s*[A-Z][a-z]+)?)\b", header_text)
            if m:
                location = m.group(1)

    # Posted date
    # Some pages have <time datetime="..."> or visible text like "2 days ago";
    # prefer the machine-readable datetime attribute when present
    posted = None
    time_el = soup.select_one("time")
    if time_el is not None:
        posted = clean_whitespace(time_el.get("datetime")) or clean_whitespace(text_or_none(time_el))

    if not posted:
        # fallback: search for phrases
        txt = soup.get_text("\n", strip=True)
        m = re.search(r"\b(\d+\s+(?:minute|hour|day|week|month)s?\s+ago)\b", txt, re.I)
        if m:
            posted = m.group(1)

    return {
        "url": url,
        "title": title,
        "company": company,
        "location": location,
        "posted": posted,
    }
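The posted field often comes back as a relative phrase like "2 days ago". If you run the scraper nightly, a hedged normalizer can convert those phrases to ISO dates anchored at crawl time (names here are illustrative, and months are approximated):

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_UNIT_KWARGS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}


def relative_to_iso(posted: Optional[str], now: Optional[datetime] = None) -> Optional[str]:
    """Turn '2 days ago' into an ISO date; return None if unrecognized."""
    if not posted:
        return None
    m = re.match(r"(\d+)\s+(minute|hour|day|week|month)s?\s+ago", posted.strip(), re.I)
    if not m:
        return None
    n, unit = int(m.group(1)), m.group(2).lower()
    now = now or datetime.now(timezone.utc)
    if unit == "month":
        delta = timedelta(days=30 * n)  # rough: a month approximated as 30 days
    else:
        delta = timedelta(**{_UNIT_KWARGS[unit]: n})
    return (now - delta).date().isoformat()
```

Keeping the raw string alongside the normalized date is a good idea, so you can re-derive dates if the approximation ever matters.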

Selector philosophy (why this is written this way)

Instead of guessing one CSS class (which will break), we:

  • use semantic tags like h1 and time
  • use URL patterns like /company/
  • implement a final text-based fallback

This won’t be perfect, but it will fail gracefully and keep your pipeline moving.


Step 4: Run it on a list of job URLs (JSONL export)

import json

URLS = [
    # Replace with real LinkedIn public job URLs
    "https://www.linkedin.com/jobs/view/1234567890/",
]


def run(urls: list[str]) -> None:
    with open("linkedin_jobs.jsonl", "w", encoding="utf-8") as f:
        for url in urls:
            try:
                res = fetch(url)
                item = parse_linkedin_job(res.text, url=url)
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
                print("ok", item.get("title"), "|", item.get("company"))
            except Exception as e:
                print("fail", url, type(e).__name__, str(e)[:200])


if __name__ == "__main__":
    run(URLS)

Making it “scale-ready”

When you go from 20 job pages to 20,000, add these upgrades:

  • Queue + worker model (e.g., multiprocessing or a job queue)
  • Persistent storage for seen URLs + last-scraped timestamps
  • Structured logs (error rate, retry count, status code distribution)
  • Backoff policy that slows down when failures rise
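One minimal sketch of the "persistent storage for seen URLs + last-scraped timestamps" item, using only stdlib sqlite3 (table and function names here are made up for illustration):

```python
import sqlite3
import time


def open_state(path: str = "crawl_state.db") -> sqlite3.Connection:
    """Open (or create) the crawl-state database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY, last_scraped REAL)"
    )
    return conn


def should_fetch(conn: sqlite3.Connection, url: str, max_age_s: float = 86400) -> bool:
    """Skip URLs scraped within the last max_age_s seconds."""
    row = conn.execute("SELECT last_scraped FROM seen WHERE url = ?", (url,)).fetchone()
    return row is None or (time.time() - row[0]) > max_age_s


def mark_fetched(conn: sqlite3.Connection, url: str) -> None:
    """Record a successful fetch, updating the timestamp on repeat visits."""
    conn.execute(
        "INSERT INTO seen (url, last_scraped) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET last_scraped = excluded.last_scraped",
        (url, time.time()),
    )
    conn.commit()
```

Gate each URL with should_fetch() before calling fetch(), and call mark_fetched() on success; nightly runs then only touch pages that are stale.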

And yes: once you see repeated 429/403 patterns, a proxy layer becomes important.

ProxiesAPI is a good fit as the network layer, because you can route requests through a consistent endpoint and keep your scraper code unchanged.


QA checklist

  • Title matches the visible role name on the page
  • Company is not None for most URLs
  • Location is extracted for most URLs
  • Posted is populated (either absolute or relative)
  • JSONL contains one object per line
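Most of this checklist can be automated. A small report function (hypothetical names, stdlib only) that reads the JSONL export and measures how often each field is populated:

```python
import json

REQUIRED = ("url", "title", "company", "location", "posted")


def qa_report(jsonl_lines) -> dict:
    """Return the fill rate (0.0-1.0) for each required field across a JSONL export."""
    total = 0
    filled = {k: 0 for k in REQUIRED}
    for line in jsonl_lines:
        item = json.loads(line)  # raises if a line isn't valid JSON (checklist item)
        total += 1
        for k in REQUIRED:
            if item.get(k):
                filled[k] += 1
    return {k: (filled[k] / total if total else 0.0) for k in REQUIRED}
```

Run it as qa_report(open("linkedin_jobs.jsonl")); a sudden drop in, say, the company fill rate is usually the first sign that a selector broke.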

Next upgrades

  • Extract job description and normalize it (remove boilerplate)
  • Add a canonical job id extractor from the URL
  • Add “change detection” so you only notify when a job meaningfully changes
  • Export to a DB (Postgres/SQLite) instead of JSONL
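For the change-detection upgrade, one common approach is to hash only the fields you care about, so noisy fields (like a relative posted string that shifts every day) don't trigger alerts. A sketch, with the field choice as an assumption to adjust for your alerting needs:

```python
import hashlib
import json


def job_fingerprint(item: dict) -> str:
    """Hash only the fields that matter for alerting, ignoring noise like 'posted'."""
    significant = {k: item.get(k) for k in ("title", "company", "location")}
    blob = json.dumps(significant, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Store the fingerprint next to the job ID; notify only when the stored and freshly computed values differ.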
