How to Scrape LinkedIn Job Postings (Public Jobs) with Python + ProxiesAPI
LinkedIn has a public-facing jobs surface where individual job pages can be viewed without authentication.
That makes it tempting to scrape. The catch is that job pages:
- vary by region and experiment cohort
- may include dynamic fragments
- can intermittently return different markup
In this tutorial we’ll build a practical LinkedIn public job scraper that:
- fetches job pages with a session + timeouts
- retries with exponential backoff
- extracts title, company, location, and posted date
- exports JSONL so you can run it nightly

Public job pages can be inconsistent at scale: timeouts, transient failures, and soft blocks. ProxiesAPI gives you a clean proxy layer so retries don’t collapse your pipeline.
Important notes (read this before you run a crawler)
- This guide focuses on public pages (no login).
- Don’t attempt to scrape private areas behind authentication.
- Keep request rates modest, and build backoff.
If your use case is “alerts for a handful of companies,” you probably don’t need a massive crawler.
Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
```
Step 1: Build a fetch layer that survives the real world
At small scale, you can call requests.get() and move on.
At any real scale, you want:
- explicit connect/read timeouts
- retry on transient failures
- headers that look like a normal browser request
```python
import os
import random
import time
from dataclasses import dataclass

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

# (connect timeout, read timeout) in seconds
TIMEOUT = (10, 35)

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str


def build_session() -> requests.Session:
    s = requests.Session()
    # If you use ProxiesAPI as an HTTP proxy, wire it here.
    # Example pattern (adjust to your ProxiesAPI docs/account):
    # PROXY_URL = os.getenv("PROXIESAPI_PROXY_URL")
    # if PROXY_URL:
    #     s.proxies.update({"http": PROXY_URL, "https": PROXY_URL})
    s.headers.update({
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
        "Upgrade-Insecure-Requests": "1",
    })
    return s


session = build_session()


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException,)),
)
def fetch(url: str) -> FetchResult:
    # Small random delay keeps request pacing polite and less bursty.
    time.sleep(random.uniform(0.2, 0.7))
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.linkedin.com/jobs/",
    }
    r = session.get(url, headers=headers, timeout=TIMEOUT)
    r.raise_for_status()
    return FetchResult(url=url, status_code=r.status_code, text=r.text)
```
Step 2: Understand LinkedIn public job URL patterns
Common public job URLs include:
- https://www.linkedin.com/jobs/view/ROLE-TITLE-at-COMPANY-<JOB_ID>
- https://www.linkedin.com/jobs/view/<JOB_ID>
In practice, you’ll collect these URLs from:
- your own curated list
- search result pages (which change frequently)
- sitemap-like sources (if available)
This post focuses on parsing the job detail page once you have the URL.
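Since both public URL shapes end in a numeric job id, it's worth normalizing your URL list to ids early (useful for dedup and later for change detection). A minimal sketch, assuming the trailing digits are always the job id:

```python
import re
from typing import Optional


def extract_job_id(url: str) -> Optional[str]:
    # Matches both /jobs/view/<JOB_ID> and /jobs/view/role-at-company-<JOB_ID>,
    # with or without a trailing slash.
    m = re.search(r"/jobs/view/(?:[^/]*-)?(\d+)", url)
    return m.group(1) if m else None
```

Keying your storage on the job id rather than the full URL means the two URL shapes for the same posting collapse into one record.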
Step 3: Parse job fields robustly
LinkedIn’s markup changes. Avoid relying on one brittle selector.
A robust strategy:
- Try high-signal elements first (h1, company link, location blocks)
- Fall back to scanning for known labels
- Keep the parsing functions small and testable
```python
import re
from typing import Optional

from bs4 import BeautifulSoup


def text_or_none(el) -> Optional[str]:
    if not el:
        return None
    t = el.get_text(" ", strip=True)
    return t if t else None


def clean_whitespace(s: Optional[str]) -> Optional[str]:
    if not s:
        return None
    return re.sub(r"\s+", " ", s).strip()


def parse_linkedin_job(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "lxml")

    # Title
    title = clean_whitespace(text_or_none(soup.select_one("h1")))

    # Company: commonly a topcard link pointing at /company/
    company_el = (
        soup.select_one('a[href*="/company/"]')
        or soup.select_one('a[href*="linkedin.com/company/"]')
    )
    company = clean_whitespace(text_or_none(company_el))

    # Location: often near the top card; try a few patterns
    loc_el = (
        soup.select_one('[class*="topcard"] [class*="location"]')
        or soup.select_one('[data-test="job-location"]')
    )
    location = clean_whitespace(text_or_none(loc_el))
    if not location:
        # Fallback: best-effort scan for something resembling "City, State"
        header_text = clean_whitespace(text_or_none(soup.select_one("main")))
        if header_text:
            m = re.search(r"\b([A-Z][a-zA-Z .'-]+,\s*[A-Z]{2,}(?:\s*[A-Z][a-z]+)?)\b", header_text)
            if m:
                location = m.group(1)

    # Posted date: some pages have <time datetime="..."> or text like "2 days ago"
    posted = clean_whitespace(text_or_none(soup.select_one("time")))
    if not posted:
        # Fallback: search the page text for relative-date phrases
        txt = soup.get_text("\n", strip=True)
        m = re.search(r"\b(\d+\s+(?:minute|hour|day|week|month)s?\s+ago)\b", txt, re.I)
        if m:
            posted = m.group(1)

    return {
        "url": url,
        "title": title,
        "company": company,
        "location": location,
        "posted": posted,
    }
```
Selector philosophy (why this is written this way)
Instead of guessing one CSS class (which will break), we:
- use semantic tags like h1 and time
- use URL patterns like /company/
- implement a final text-based fallback
This won’t be perfect, but it will fail gracefully and keep your pipeline moving.
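Because the parsing helpers are small, you can exercise the selectors against a synthetic HTML fixture with no network involved. A minimal sketch (the fixture markup and class names are invented for illustration, and the stdlib html.parser is used so the check doesn't require lxml):

```python
from bs4 import BeautifulSoup

# Invented fixture approximating a public job page's top card.
FIXTURE = """
<html><body><main>
  <h1>Senior Data Engineer</h1>
  <a href="https://www.linkedin.com/company/acme">Acme Corp</a>
  <div class="topcard__content"><span class="topcard__location">Austin, TX</span></div>
  <time datetime="2024-05-01">2 weeks ago</time>
</main></body></html>
"""

soup = BeautifulSoup(FIXTURE, "html.parser")
title = soup.select_one("h1").get_text(" ", strip=True)
company = soup.select_one('a[href*="/company/"]').get_text(" ", strip=True)
location = soup.select_one('[class*="topcard"] [class*="location"]').get_text(" ", strip=True)
posted = soup.select_one("time").get_text(" ", strip=True)
```

When LinkedIn's markup shifts, fixtures like this tell you which selector broke before your production run does.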
Step 4: Run it on a list of job URLs (JSONL export)
```python
import json

URLS = [
    # Replace with real LinkedIn public job URLs
    "https://www.linkedin.com/jobs/view/1234567890/",
]


def run(urls: list[str]) -> None:
    with open("linkedin_jobs.jsonl", "w", encoding="utf-8") as f:
        for url in urls:
            try:
                res = fetch(url)
                item = parse_linkedin_job(res.text, url=url)
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
                print("ok", item.get("title"), "|", item.get("company"))
            except Exception as e:
                print("fail", url, type(e).__name__, str(e)[:200])


if __name__ == "__main__":
    run(URLS)
```
Making it “scale-ready”
When you go from 20 job pages to 20,000, add these upgrades:
- Queue + worker model (e.g., multiprocessing or a job queue)
- Persistent storage for seen URLs + last-scraped timestamps
- Structured logs (error rate, retry count, status code distribution)
- Backoff policy that slows down when failures rise
And yes: once you see repeated 429/403 patterns, a proxy layer becomes important.
ProxiesAPI is a good fit as the network layer, because you can route requests through a consistent endpoint and keep your scraper code unchanged.
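For the persistent seen-URL storage mentioned above, a single SQLite table goes a long way. A minimal sketch (the table name and the 24-hour re-scrape window are arbitrary choices, not part of any standard):

```python
import sqlite3
import time


def open_store(path: str = "scrape_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY, last_scraped REAL)"
    )
    return conn


def should_scrape(conn: sqlite3.Connection, url: str, max_age_s: float = 86400) -> bool:
    # Scrape if we've never seen the URL, or if the last visit is stale.
    row = conn.execute("SELECT last_scraped FROM seen_urls WHERE url = ?", (url,)).fetchone()
    return row is None or (time.time() - row[0]) > max_age_s


def mark_scraped(conn: sqlite3.Connection, url: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO seen_urls (url, last_scraped) VALUES (?, ?)",
        (url, time.time()),
    )
    conn.commit()
```

Filtering your URL list through should_scrape before fetching means a crashed run can resume without re-hitting every page.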
QA checklist
- Title matches the visible role name on the page
- Company is not None for most URLs
- Location is extracted for most URLs
- Posted is populated (either absolute or relative)
- JSONL contains one object per line
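Most of this checklist can be automated against the JSONL export. A small sketch that reports what fraction of records have each field populated (the field names match the parser output above):

```python
import json


def field_coverage(jsonl_path: str) -> dict[str, float]:
    """Return the fraction of records with a truthy value for each field."""
    total = 0
    filled = {"title": 0, "company": 0, "location": 0, "posted": 0}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total += 1
            for key in filled:
                if item.get(key):
                    filled[key] += 1
    return {key: (count / total if total else 0.0) for key, count in filled.items()}
```

If company coverage suddenly drops across a run, the company selector has probably drifted and deserves a look before you trust the rest of the data.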
Next upgrades
- Extract job description and normalize it (remove boilerplate)
- Add a canonical job id extractor from the URL
- Add “change detection” so you only notify when a job meaningfully changes
- Export to a DB (Postgres/SQLite) instead of JSONL
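For the change-detection upgrade, one lightweight approach is to fingerprint only the fields whose changes you care about. A sketch (which fields count as "meaningful" is a policy decision; here the volatile relative "posted" text is deliberately excluded):

```python
import hashlib
import json


def job_fingerprint(item: dict) -> str:
    # Hash only fields whose change should trigger a notification.
    stable = {key: item.get(key) for key in ("title", "company", "location")}
    blob = json.dumps(stable, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Store the fingerprint alongside each job id, and notify only when the stored and freshly computed values differ.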