How to Scrape arXiv Papers (Search + Metadata + PDFs) with Python + ProxiesAPI

arXiv is a rare gift for scrapers.

Unlike most sites in 2026, arXiv has:

  • stable URLs
  • predictable IDs
  • an official Atom feed for search results
  • direct PDF endpoints

That means you can build a scraper that is fast, reliable, and polite: rate-limited, cached, and limited to narrow queries.

In this guide we’ll build a Python pipeline that:

  1. searches arXiv for a query (via Atom)
  2. extracts metadata (title, authors, categories, abstract, published date)
  3. downloads PDFs for each paper
  4. saves everything to disk + JSONL
  5. uses a network layer that can optionally route requests through ProxiesAPI

Make PDF downloads reliable with ProxiesAPI

Metadata is easy; downloading hundreds of PDFs is where timeouts and throttles show up. ProxiesAPI can help keep long runs stable with consistent retries and IP rotation as you scale.


arXiv endpoints: the three URLs you need

You can do almost everything with these:

1) Search feed (Atom)

arXiv exposes a search API that returns Atom XML.

Base:

  • https://export.arxiv.org/api/query

Common parameters:

  • search_query= (e.g. all:transformer or cat:cs.CL)
  • start= offset
  • max_results=
  • sortBy= (submittedDate, lastUpdatedDate, relevance)
  • sortOrder= (ascending, descending)

Example:

https://export.arxiv.org/api/query?search_query=cat:cs.CL+AND+all:retrieval&start=0&max_results=25&sortBy=submittedDate&sortOrder=descending

2) Abstract page

  • https://arxiv.org/abs/<id>

3) PDF

  • https://arxiv.org/pdf/<id>.pdf

You don’t need a headless browser for this pipeline.
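
Because the abstract and PDF patterns above are fixed, both URLs can be derived from an ID with plain string formatting. The sketch below assumes new-style IDs such as 2501.01234v2; the helper names abs_url_for and pdf_url_for are made up for this example and are not part of any arXiv library.

```python
ARXIV_SITE = "https://arxiv.org"


def abs_url_for(arxiv_id: str) -> str:
    """Abstract page URL for an ID such as '2501.01234' or '2501.01234v2'."""
    return f"{ARXIV_SITE}/abs/{arxiv_id}"


def pdf_url_for(arxiv_id: str) -> str:
    """PDF URL for the same ID, following the pattern listed above."""
    return f"{ARXIV_SITE}/pdf/{arxiv_id}.pdf"


if __name__ == "__main__":
    print(abs_url_for("2501.01234v2"))
    print(pdf_url_for("2501.01234v2"))
```

In the pipeline itself you rarely need these, because the Atom feed already carries both links for every entry; they are mostly useful when you start from a list of IDs instead of a search.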


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests tenacity feedparser

We’ll use:

  • requests for HTTP
  • feedparser to parse Atom cleanly
  • tenacity for retries

Step 1: Build a robust HTTP client (optional ProxiesAPI)

You’ll make two kinds of requests:

  • small metadata fetches (Atom)
  • large binary downloads (PDF)

Both benefit from:

  • timeouts
  • retries
  • streaming downloads

from __future__ import annotations

import os
import time
import random
from dataclasses import dataclass
from typing import Optional

import requests
from requests import Response
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type


# requests timeouts are (connect, read) tuples, in seconds.
META_TIMEOUT = (10, 30)
PDF_TIMEOUT = (10, 120)


@dataclass
class HttpConfig:
    proxiesapi_url: Optional[str] = None
    user_agent: str = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    )


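class TransientHTTPError(requests.RequestException):
    """Raised for status codes worth retrying (429 and common 5xx)."""


# Retry only failures that can plausibly succeed on a later attempt.
# A 403 or 404 raises requests.HTTPError, which is deliberately not listed.
RETRYABLE = (
    requests.ConnectionError,
    requests.Timeout,
    requests.exceptions.ChunkedEncodingError,
    TransientHTTPError,
)


class HttpClient:
    def __init__(self, cfg: HttpConfig):
        self.cfg = cfg
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": cfg.user_agent,
            "Accept": "*/*",
        })

    def _via_proxiesapi(self, target_url: str) -> str:
        """Wrap a URL through ProxiesAPI if configured.

        Keep this explicit: ProxiesAPI deployments vary.
        If your ProxiesAPI expects a different parameter name or path,
        change it here.
        """
        if not self.cfg.proxiesapi_url:
            return target_url
        from urllib.parse import urlencode
        return self.cfg.proxiesapi_url.rstrip("/") + "?" + urlencode({"url": target_url})

    @retry(
        reraise=True,
        stop=stop_after_attempt(5),
        wait=wait_exponential_jitter(initial=1, max=20),
        retry=retry_if_exception_type(RETRYABLE),
    )
    def get(self, url: str, *, params: dict | None = None, timeout=META_TIMEOUT, stream: bool = False) -> Response:
        # Fold params into the target URL first; otherwise they would be
        # appended to the ProxiesAPI URL instead of the arXiv URL.
        target = requests.Request("GET", url, params=params).prepare().url
        fetch_url = self._via_proxiesapi(target)
        r = self.session.get(fetch_url, timeout=timeout, stream=stream)
        if r.status_code in (429, 500, 502, 503, 504):
            r.close()  # release the connection before the next attempt
            raise TransientHTTPError(f"Transient status {r.status_code} for {url}")
        r.raise_for_status()  # permanent errors surface immediately, no retry
        return r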


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> None:
    time.sleep(random.uniform(min_s, max_s))
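
To make the wrapping step concrete, here is a minimal standalone sketch of the URL that _via_proxiesapi produces. The base URL https://proxy.example.com/fetch is a placeholder rather than a documented ProxiesAPI endpoint, the url parameter name simply mirrors the code above, and PROXIESAPI_URL is an assumed environment variable name; check your own account's documentation for the real values.

```python
import os
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def wrap_url(proxy_base: Optional[str], target_url: str) -> str:
    """Return target_url unchanged, or nested inside the proxy URL as ?url=..."""
    if not proxy_base:
        return target_url
    return proxy_base.rstrip("/") + "?" + urlencode({"url": target_url})


# Assumed variable name; when it is unset the client connects directly.
configured_base = os.environ.get("PROXIESAPI_URL")

target = "https://export.arxiv.org/api/query?search_query=cat:cs.CL&max_results=5"
wrapped = wrap_url("https://proxy.example.com/fetch", target)  # placeholder base

# urlencode percent-escapes the whole target, so its own "?" and "&"
# cannot be confused with the proxy URL's query string.
recovered = parse_qs(urlsplit(wrapped).query)["url"][0]

print(wrapped)
print(recovered == target)
```

Reading the base URL from the environment keeps credentials out of the script and lets the same code run with or without a proxy.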

Step 2: Search arXiv via Atom (no brittle HTML parsing)

We’ll build the query URL with search_arxiv, fetch it with the HTTP client, and turn the Atom response into a list of paper records with parse_atom.

import feedparser


ARXIV_API = "https://export.arxiv.org/api/query"


def search_arxiv(query: str, start: int = 0, max_results: int = 25, sort_by: str = "submittedDate") -> str:
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": "descending",
    }
    # return the URL for visibility/debugging
    from urllib.parse import urlencode
    return ARXIV_API + "?" + urlencode(params)


def parse_atom(atom_xml: str) -> list[dict]:
    feed = feedparser.parse(atom_xml)
    out: list[dict] = []

    for e in feed.entries:
        # e.id looks like: http://arxiv.org/abs/2501.01234v2
        abs_url = e.get("id")
        # Titles and abstracts are hard-wrapped in the feed; collapse every whitespace run.
        title = " ".join((e.get("title") or "").split())
        summary = " ".join((e.get("summary") or "").split())

        authors = [a.get("name") for a in e.get("authors", []) if a.get("name")]

        # arXiv-specific tags
        categories = [t.get("term") for t in e.get("tags", []) if t.get("term")]

        published = e.get("published")
        updated = e.get("updated")

        # find pdf link
        pdf_url = None
        for link in e.get("links", []):
            if link.get("type") == "application/pdf":
                pdf_url = link.get("href")
                break

        out.append({
            "abs_url": abs_url,
            "title": title,
            "abstract": summary,
            "authors": authors,
            "categories": categories,
            "published": published,
            "updated": updated,
            "pdf_url": pdf_url,
        })

    return out


http = HttpClient(HttpConfig(proxiesapi_url=None))

url = search_arxiv("cat:cs.CL AND all:retrieval", start=0, max_results=10)
xml = http.get(url, timeout=META_TIMEOUT).text
papers = parse_atom(xml)

print("papers", len(papers))
if papers:  # a narrow query can legitimately return zero entries
    print(papers[0]["title"])
    print(papers[0]["pdf_url"])

This approach is more reliable than scraping the HTML search page because:

  • Atom schema changes slowly
  • fields like author/title/abstract are first-class
  • you avoid brittle DOM selectors
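
To collect more than one page, advance start in steps of max_results. The generator below is a minimal sketch of that idea; page_urls is a hypothetical helper, and it assumes you decide up front how many results you want in total rather than reading the total from the feed.

```python
from typing import Iterator
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def page_urls(query: str, total: int, page_size: int = 25) -> Iterator[str]:
    """Yield one query URL per page until `total` results are covered."""
    for start in range(0, total, page_size):
        params = {
            "search_query": query,
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        yield ARXIV_API + "?" + urlencode(params)


if __name__ == "__main__":
    for u in page_urls("cat:cs.CL AND all:retrieval", total=60, page_size=25):
        print(u)
```

In the pipeline you would fetch each URL with http.get, pass the text to parse_atom, call polite_sleep between pages, and stop early as soon as a page comes back with no entries.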

Step 3: Normalize arXiv IDs (so you can name files)

arXiv abstract URLs often include a version suffix (v1, v2).

For filenames, you usually want both:

  • base id: 2501.01234
  • version: v2

import re


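def extract_arxiv_id(abs_url: str) -> dict:
    # abs_url: http(s)://arxiv.org/abs/2501.01234v2
    # Older IDs contain a slash, e.g. .../abs/hep-th/9901001v1, so capture
    # everything after /abs/ instead of stopping at the first "/".
    m = re.search(r"/abs/(.+)$", abs_url or "")
    raw = m.group(1) if m else ""

    mv = re.match(
        r"^(?P<base>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?P<ver>v\d+)?$",
        raw,
    )
    if not mv:
        # Unrecognized shape: fall back to the raw string as the base id.
        return {"raw": raw, "base": raw, "ver": None}

    return {
        "raw": raw,
        "base": mv.group("base"),
        "ver": mv.group("ver"),
    }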


print(extract_arxiv_id("https://arxiv.org/abs/2501.01234v2"))

Step 4: Download PDFs safely (streaming + retries)

PDF downloads are where scraping gets flaky:

  • large responses
  • occasional resets
  • transient 5xx

We’ll implement a streaming downloader that:

  • writes to a temp file
  • renames when complete
  • skips if already downloaded

import os
from pathlib import Path


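def download_pdf(http: HttpClient, pdf_url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Quick skip if the file is already there and non-trivial. The 50 KB floor
    # is a heuristic: a genuinely tiny PDF will be re-fetched on every run.
    if out_path.exists() and out_path.stat().st_size > 50_000:
        return

    tmp_path = out_path.with_suffix(out_path.suffix + ".part")

    # http.get retries the initial request only. If the connection drops while
    # the body is streaming, the exception propagates and out_path is never created.
    with http.get(pdf_url, timeout=PDF_TIMEOUT, stream=True) as r:
        with open(tmp_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 128):
                if chunk:
                    f.write(chunk)

    # Atomic rename: out_path only ever holds a fully written download.
    os.replace(tmp_path, out_path)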


def download_many(http: HttpClient, papers: list[dict], out_dir: str = "arxiv_pdfs", limit: int = 25) -> list[dict]:
    out = []
    for i, p in enumerate(papers[:limit], start=1):
        pdf_url = p.get("pdf_url")
        abs_url = p.get("abs_url")
        if not pdf_url or not abs_url:
            continue

        aid = extract_arxiv_id(abs_url)
        file_name = aid["raw"].replace("/", "_") + ".pdf"
        out_path = Path(out_dir) / file_name

        download_pdf(http, pdf_url, out_path)

        out.append({**p, "arxiv_id": aid, "pdf_path": str(out_path)})
        print(f"{i}/{min(limit, len(papers))} downloaded {out_path.name}")

        polite_sleep(0.8, 2.2)

    return out
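
One failure the downloader above does not catch is a 200 response whose body is not a PDF, for example an HTML error page relayed by a proxy. A cheap guard, sketched here, is to check for the %PDF- signature at the start of the file before trusting it. looks_like_pdf is a hypothetical helper name, and the two files in the demo are tiny stand-ins written locally, not real downloads.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """True if the file begins with the PDF signature bytes.

    This is a strict check: it expects the signature at offset 0.
    """
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


with tempfile.TemporaryDirectory() as d:
    good = Path(d) / "good.pdf"
    bad = Path(d) / "bad.pdf"
    good.write_bytes(b"%PDF-1.7\n% stand-in content")
    bad.write_bytes(b"<html>Too many requests</html>")
    results = (looks_like_pdf(good), looks_like_pdf(bad))

print(results)
```

A natural place to call it is on the .part file just before the rename, deleting the temp file and raising if the check fails so the paper is retried on the next run.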

Step 5: Export metadata as JSONL (easy to pipeline)

import json


def write_jsonl(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


url = search_arxiv("cat:cs.CL AND all:retrieval", start=0, max_results=25)
xml = http.get(url, timeout=META_TIMEOUT).text
papers = parse_atom(xml)

enriched = download_many(http, papers, out_dir="arxiv_pdfs", limit=10)
write_jsonl(enriched, "arxiv_papers.jsonl")
print("wrote arxiv_papers.jsonl", len(enriched))
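
Note that write_jsonl opens the file in "w" mode, so every run overwrites the previous export. For scheduled runs it can help to read the old file first and skip papers you already recorded. The sketch below shows one way to do that; read_jsonl and seen_urls are hypothetical helpers, and the two records in the demo are made-up sample rows.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: str) -> list[dict]:
    """Load one JSON object per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def seen_urls(path: str) -> set[str]:
    """abs_url values already recorded, or an empty set if the file is missing."""
    if not Path(path).exists():
        return set()
    return {r["abs_url"] for r in read_jsonl(path) if r.get("abs_url")}


with tempfile.TemporaryDirectory() as d:
    path = str(Path(d) / "arxiv_papers.jsonl")
    before = seen_urls(path)  # file does not exist yet
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"abs_url": "http://arxiv.org/abs/2501.01234v2"}) + "\n")
        f.write(json.dumps({"abs_url": "http://arxiv.org/abs/2501.05678v1"}) + "\n")
    after = seen_urls(path)

print(len(before), len(after))
```

With that set in hand, you can filter papers before calling download_many and open the export in "a" mode to append only the new rows.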

Rate limiting and etiquette

Even with a friendly target like arXiv, be a good citizen:

  • keep max_results small and paginate slowly
  • cache results locally
  • avoid parallelizing PDF downloads aggressively
  • prefer a narrow category query (cat:cs.CL) over scraping the entire corpus
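
arXiv's API documentation asks clients to leave roughly three seconds between consecutive calls, which is longer than the default range in polite_sleep. One way to enforce a floor like that is a small limiter that remembers when the last request went out. The class below is a sketch under that assumption; MinIntervalLimiter is a hypothetical name, and the clock and sleep arguments exist only so the demo can run instantly with a fake clock.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Ensure consecutive wait() calls are at least `interval` seconds apart."""

    def __init__(
        self,
        interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_now = [100.0]
slept: list = []


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    fake_now[0] += seconds


limiter = MinIntervalLimiter(3.0, clock=lambda: fake_now[0], sleep=fake_sleep)
limiter.wait()        # first call never waits
fake_now[0] += 1.0    # one second of "work"
limiter.wait()        # waits the remaining two seconds
fake_now[0] += 5.0    # slow request, already past the interval
limiter.wait()        # no wait needed
print(slept)
```

In real use you would construct it once with the defaults, for example MinIntervalLimiter(3.0), and call wait() immediately before each http.get.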

Where ProxiesAPI fits (honestly)

If you’re downloading a handful of papers, you may not need proxies.

But when you run:

  • hundreds of Atom pages (many queries)
  • thousands of PDFs
  • scheduled nightly updates

…your failures will come from network volatility more than parsing.

ProxiesAPI can help as a stable network layer: consistent retry behavior and IP rotation when you’re rate-limited.


QA checklist

  • Atom query returns XML (not HTML)
  • Parsed papers contain title, authors, pdf_url
  • PDFs download and open in a PDF viewer
  • JSONL contains one record per downloaded paper
  • Reruns skip already-downloaded PDFs
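
The second item on this checklist is easy to automate. The following sketch flags records with missing or empty required fields; REQUIRED and missing_fields are names chosen for this example, and the two sample records are made up to show one passing and one failing row.

```python
REQUIRED = ("title", "authors", "pdf_url")


def missing_fields(record: dict) -> list:
    """Names of required fields that are absent or empty in one paper record."""
    return [key for key in REQUIRED if not record.get(key)]


# Made-up sample rows: the first is complete, the second is not.
sample = [
    {"title": "A", "authors": ["X"], "pdf_url": "https://arxiv.org/pdf/2501.01234v2"},
    {"title": "B", "authors": [], "pdf_url": None},
]

problems = {}
for index, record in enumerate(sample):
    missing = missing_fields(record)
    if missing:
        problems[index] = missing

print(problems)
```

Running the same loop over the output of parse_atom before downloading anything gives an early signal that the feed changed shape or that the response was not Atom at all.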

Next upgrades

  • store metadata in SQLite with updated timestamps
  • detect new versions (v3) and re-download
  • add a checksum validation step
  • build a simple search UI over your local corpus
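
As a starting point for the first two upgrades, here is a minimal SQLite sketch that stores one row per base ID and reports when a newer version appears. The table name papers, its columns, and the helpers needs_download and record are all assumptions made for this example; the version is stored as an integer, which you could derive from the ver field with int(ver[1:]).

```python
import sqlite3


def needs_download(conn: sqlite3.Connection, base_id: str, version: int) -> bool:
    """True if base_id is unknown or the stored version is older than `version`."""
    row = conn.execute(
        "SELECT version FROM papers WHERE base_id = ?", (base_id,)
    ).fetchone()
    return row is None or row[0] < version


def record(conn: sqlite3.Connection, base_id: str, version: int, updated: str) -> None:
    """Insert the row for base_id, replacing any older one."""
    conn.execute(
        "INSERT OR REPLACE INTO papers (base_id, version, updated) VALUES (?, ?, ?)",
        (base_id, version, updated),
    )
    conn.commit()


# In-memory database for the demo; use a file path for a real corpus.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers ("
    "base_id TEXT PRIMARY KEY, version INTEGER NOT NULL, updated TEXT)"
)

first = needs_download(conn, "2501.01234", 2)   # unknown paper
record(conn, "2501.01234", 2, "2025-01-10T00:00:00Z")
same = needs_download(conn, "2501.01234", 2)    # v2 already stored
newer = needs_download(conn, "2501.01234", 3)   # v3 has appeared
print(first, same, newer)
```

Calling needs_download before download_pdf and record after a successful download keeps the database and the files on disk in step.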