How to Scrape arXiv Papers (Search + Metadata + PDFs) with Python + ProxiesAPI

Mar 22, 2026 · tutorial · #python, #arxiv, #web-scraping, #requests, #feedparser, #xml, #pdf, #proxies

arXiv is a rare gift for scrapers.

Unlike most sites in 2026, arXiv has:

stable URLs
predictable IDs
an official Atom feed for search results
direct PDF endpoints

That means you can build a scraper that’s fast and reliable while staying polite: rate-limited requests, local caching, and narrow queries.

In this guide we’ll build a Python pipeline that:

searches arXiv for a query (via Atom)
extracts metadata (title, authors, categories, abstract, published date)
downloads PDFs for each paper
saves everything to disk + JSONL
uses a network layer that can optionally route requests through ProxiesAPI

Make PDF downloads reliable with ProxiesAPI

Metadata is easy; downloading hundreds of PDFs is where timeouts and throttles show up. ProxiesAPI can help keep long runs stable with consistent retries and IP rotation as you scale.


arXiv endpoints: the three URLs you need

You can do almost everything with these:

1) Search feed (Atom)

arXiv exposes a search API that returns Atom XML.

Base:

https://export.arxiv.org/api/query

Common parameters:

search_query= (e.g. all:transformer or cat:cs.CL)
start= offset
max_results=
sortBy= (submittedDate, lastUpdatedDate, relevance)
sortOrder= (ascending, descending)

Example:

https://export.arxiv.org/api/query?search_query=cat:cs.CL+AND+all:retrieval&start=0&max_results=25&sortBy=submittedDate&sortOrder=descending
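
To illustrate how start and max_results combine for pagination, here is a minimal sketch that generates one query URL per page. The helper name page_urls and the page sizes are assumptions made for this example; only the parameter names come from the list above.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def page_urls(query: str, total: int, page_size: int = 25) -> list[str]:
    """Build one Atom query URL per page, covering `total` results."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": query,
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        urls.append(ARXIV_API + "?" + urlencode(params))
    return urls


for u in page_urls("cat:cs.CL AND all:retrieval", total=60, page_size=25):
    print(u)
```

Each URL advances start by the page size, so 60 results at 25 per page yields three requests, the last one asking for only 10.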

2) Abstract page

https://arxiv.org/abs/<id>

3) PDF

https://arxiv.org/pdf/<id>.pdf
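
Because both URLs are derived from the ID alone, a tiny helper can build them. This is only a sketch: arxiv_urls is a hypothetical name, and the ID in the usage line is a placeholder.

```python
def arxiv_urls(arxiv_id: str) -> dict:
    """Return the abstract and PDF URLs for an arXiv ID such as '2501.01234v2'."""
    arxiv_id = arxiv_id.strip()
    return {
        "abs": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}.pdf",
    }


print(arxiv_urls("2501.01234v2"))
```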

You don’t need a headless browser for this pipeline.

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests tenacity feedparser

We’ll use:

requests for HTTP
feedparser to parse Atom cleanly
tenacity for retries

Step 1: Build a robust HTTP client (optional ProxiesAPI)

You’ll make two kinds of requests:

small metadata fetches (Atom)
large binary downloads (PDF)

Both benefit from:

timeouts
retries
streaming downloads

from __future__ import annotations

import os
import time
import random
from dataclasses import dataclass
from typing import Optional

import requests
from requests import Response
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type


META_TIMEOUT = (10, 30)
PDF_TIMEOUT = (10, 120)


@dataclass
class HttpConfig:
    proxiesapi_url: Optional[str] = None
    user_agent: str = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    )


class TransientHTTPError(requests.RequestException):
    """Raised for status codes worth retrying (429 and common 5xx)."""


class HttpClient:
    def __init__(self, cfg: HttpConfig):
        self.cfg = cfg
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": cfg.user_agent,
            "Accept": "*/*",
        })

    def _via_proxiesapi(self, target_url: str) -> str:
        """Wrap a URL through ProxiesAPI if configured.

        Keep this explicit: ProxiesAPI deployments vary.
        If your ProxiesAPI expects a different parameter name or path,
        change it here.
        """
        if not self.cfg.proxiesapi_url:
            return target_url
        from urllib.parse import urlencode
        return self.cfg.proxiesapi_url.rstrip("/") + "?" + urlencode({"url": target_url})

    @retry(
        reraise=True,
        stop=stop_after_attempt(5),
        wait=wait_exponential_jitter(initial=1, max=20),
        # Retry network failures and transient statuses only; a 404 should fail fast.
        retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout, TransientHTTPError)),
    )
    def get(self, url: str, *, params: dict | None = None, timeout=META_TIMEOUT, stream: bool = False) -> Response:
        if params:
            # Fold params into the target URL so they survive proxy wrapping.
            url = requests.Request("GET", url, params=params).prepare().url
        fetch_url = self._via_proxiesapi(url)
        r = self.session.get(fetch_url, timeout=timeout, stream=stream)
        if r.status_code in (429, 500, 502, 503, 504):
            r.close()
            raise TransientHTTPError(f"Transient status {r.status_code} for {url}")
        r.raise_for_status()
        return r


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> None:
    time.sleep(random.uniform(min_s, max_s))
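
To make the wrapping step concrete, here is a standalone sketch of the same idea using only the standard library. The endpoint https://proxy.example/fetch is a made-up placeholder, not a real ProxiesAPI address, and wrap_url is a hypothetical stand-in for the _via_proxiesapi method.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def wrap_url(proxy_base: Optional[str], target_url: str) -> str:
    """Return target_url unchanged, or wrapped as ?url=<encoded target>."""
    if not proxy_base:
        return target_url
    return proxy_base.rstrip("/") + "?" + urlencode({"url": target_url})


# Hypothetical endpoint, for illustration only.
PROXY_BASE = "https://proxy.example/fetch"
target = "https://export.arxiv.org/api/query?search_query=cat:cs.CL&max_results=5"

wrapped = wrap_url(PROXY_BASE, target)
print(wrapped)

# The target's own query string survives as a single encoded `url` value.
recovered = parse_qs(urlsplit(wrapped).query)["url"][0]
print(recovered == target)
```

The round trip shows why the target URL should be complete, query string included, before it is wrapped: anything appended afterwards would be read as a parameter of the proxy endpoint rather than of arXiv.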

Step 2: Search arXiv via Atom (no brittle HTML parsing)

We’ll build two helpers: search_arxiv builds the query URL, and parse_atom turns the Atom response into a list of paper records.

import feedparser


ARXIV_API = "https://export.arxiv.org/api/query"


def search_arxiv(query: str, start: int = 0, max_results: int = 25, sort_by: str = "submittedDate") -> str:
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": "descending",
    }
    # return the URL for visibility/debugging
    from urllib.parse import urlencode
    return ARXIV_API + "?" + urlencode(params)


def parse_atom(atom_xml: str) -> list[dict]:
    feed = feedparser.parse(atom_xml)
    out: list[dict] = []

    for e in feed.entries:
        # e.id looks like: http://arxiv.org/abs/2501.01234v2
        abs_url = e.get("id")
        # collapse wrapped lines and runs of spaces into single spaces
        title = " ".join((e.get("title") or "").split())
        summary = " ".join((e.get("summary") or "").split())

        authors = [a.get("name") for a in e.get("authors", []) if a.get("name")]

        # arXiv-specific tags
        categories = [t.get("term") for t in e.get("tags", []) if t.get("term")]

        published = e.get("published")
        updated = e.get("updated")

        # find pdf link
        pdf_url = None
        for link in e.get("links", []):
            if link.get("type") == "application/pdf":
                pdf_url = link.get("href")
                break

        out.append({
            "abs_url": abs_url,
            "title": title,
            "abstract": summary,
            "authors": authors,
            "categories": categories,
            "published": published,
            "updated": updated,
            "pdf_url": pdf_url,
        })

    return out


http = HttpClient(HttpConfig(proxiesapi_url=None))

url = search_arxiv("cat:cs.CL AND all:retrieval", start=0, max_results=10)
xml = http.get(url, timeout=META_TIMEOUT).text
papers = parse_atom(xml)

print("papers", len(papers))
print(papers[0]["title"])
print(papers[0]["pdf_url"])

This approach is more reliable than scraping the HTML search page because:

Atom schema changes slowly
fields like author/title/abstract are first-class
you avoid brittle DOM selectors
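
To show what "first-class fields" means in practice, the sketch below reads the same fields with the standard library’s xml.etree instead of feedparser. The SAMPLE feed is a made-up placeholder that mirrors the general shape of an Atom entry; it is not real arXiv output, and parse_entries is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Placeholder entry for illustration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2501.01234v2</id>
    <title>An Example
      Title</title>
    <summary>Example abstract text.</summary>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <category term="cs.CL"/>
    <link rel="related" type="application/pdf" href="http://arxiv.org/pdf/2501.01234v2"/>
  </entry>
</feed>"""


def parse_entries(xml_text: str) -> list[dict]:
    """Extract one dict per <entry>, addressing fields by element name."""
    rows = []
    for entry in ET.fromstring(xml_text).findall(f"{ATOM}entry"):
        links = entry.findall(f"{ATOM}link")
        pdf_links = [l.get("href") for l in links if l.get("type") == "application/pdf"]
        rows.append({
            "abs_url": entry.findtext(f"{ATOM}id"),
            # Titles can wrap across lines, so collapse whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "authors": [a.findtext(f"{ATOM}name") for a in entry.findall(f"{ATOM}author")],
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
            "pdf_url": pdf_links[0] if pdf_links else None,
        })
    return rows


print(parse_entries(SAMPLE)[0])
```

Every field is looked up by a namespaced element name, so nothing here depends on page layout or CSS classes.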

Step 3: Normalize arXiv IDs (so you can name files)

arXiv abstract URLs often include a version suffix (v1, v2).

For filenames, you usually want both:

base id: 2501.01234
version: v2

import re


def extract_arxiv_id(abs_url: str) -> dict:
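    # abs_url: http(s)://arxiv.org/abs/2501.01234v2
    # (old-style IDs contain a slash, e.g. hep-th/9901001v1, so capture everything after /abs/)
    m = re.search(r"/abs/(.+)$", abs_url or "")
    raw = m.group(1) if m else ""

    # new-style (2501.01234) or old-style (hep-th/9901001), with an optional version suffix
    mv = re.match(r"^(?P<base>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?P<ver>v\d+)?$", raw)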
    if not mv:
        return {"raw": raw, "base": raw, "ver": None}

    return {
        "raw": raw,
        "base": mv.group("base"),
        "ver": mv.group("ver"),
    }


print(extract_arxiv_id("https://arxiv.org/abs/2501.01234v2"))

Step 4: Download PDFs safely (streaming + retries)

PDF downloads are where scraping gets flaky:

large responses
occasional resets
transient 5xx

We’ll implement a streaming downloader that:

writes to a temp file
renames when complete
skips if already downloaded

from pathlib import Path


def download_pdf(http: HttpClient, pdf_url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)

    if out_path.exists() and out_path.stat().st_size > 50_000:
        # quick skip if file is already there and non-trivial
        # (note: PDFs under ~50 KB will be re-fetched on every run)
        return

    tmp_path = out_path.with_suffix(out_path.suffix + ".part")

    # the context manager releases the connection even if streaming fails midway
    with http.get(pdf_url, timeout=PDF_TIMEOUT, stream=True) as r:
        with open(tmp_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 128):
                if chunk:
                    f.write(chunk)

    # rename only after the full body is written, so out_path never holds a partial file
    os.replace(tmp_path, out_path)


def download_many(http: HttpClient, papers: list[dict], out_dir: str = "arxiv_pdfs", limit: int = 25) -> list[dict]:
    out = []
    for i, p in enumerate(papers[:limit], start=1):
        pdf_url = p.get("pdf_url")
        abs_url = p.get("abs_url")
        if not pdf_url or not abs_url:
            continue

        aid = extract_arxiv_id(abs_url)
        file_name = aid["raw"].replace("/", "_") + ".pdf"
        out_path = Path(out_dir) / file_name

        download_pdf(http, pdf_url, out_path)

        out.append({**p, "arxiv_id": aid, "pdf_path": str(out_path)})
        print(f"{i}/{min(limit, len(papers))} downloaded {out_path.name}")

        polite_sleep(0.8, 2.2)

    return out

Step 5: Export metadata as JSONL (easy to pipeline)

import json


def write_jsonl(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


url = search_arxiv("cat:cs.CL AND all:retrieval", start=0, max_results=25)
xml = http.get(url, timeout=META_TIMEOUT).text
papers = parse_atom(xml)

enriched = download_many(http, papers, out_dir="arxiv_pdfs", limit=10)
write_jsonl(enriched, "arxiv_papers.jsonl")
print("wrote arxiv_papers.jsonl", len(enriched))
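
Reading the file back is symmetrical: one json.loads per line. The sketch below is a hypothetical read_jsonl helper, exercised on placeholder rows written to a temporary file rather than on real output.

```python
import json
import os
import tempfile


def read_jsonl(path: str):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Placeholder rows, for illustration only.
rows = [
    {"title": "Placeholder A", "categories": ["cs.CL"]},
    {"title": "Placeholder B", "categories": ["cs.IR", "cs.CL"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    loaded = list(read_jsonl(path))

print(len(loaded), loaded[1]["categories"])
```

Because each line is an independent JSON document, downstream tools can stream the file without loading it all into memory.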

Rate limiting and etiquette

Even with a friendly target like arXiv, be a good citizen:

keep max_results small and paginate slowly (arXiv’s API documentation asks for a pause of roughly 3 seconds between requests)
cache results locally
avoid parallelizing PDF downloads aggressively
prefer a narrow category query (cat:cs.CL) over scraping the entire corpus
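
Two of these habits are easy to sketch in code: a minimum gap between requests and a stable cache filename per URL. Throttle and cache_path are hypothetical helpers, and the 3-second interval is an assumption you should check against arXiv’s current API terms.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


class Throttle:
    """Enforce a minimum gap between successive calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, once there is one

    def wait(self) -> float:
        """Sleep just long enough to honor min_interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def cache_path(url: str, cache_dir: str = "cache") -> Path:
    """Map a URL to a stable filename so repeated queries can reuse a saved response."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.xml"


# Demo with a fake clock so the example runs instantly.
fake_now = [0.0]
slept = []


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(3.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()        # first call never sleeps
fake_now[0] += 1.0     # pretend the request took 1 second
throttle.wait()        # sleeps the remaining 2 seconds
print(slept)
print(cache_path("https://export.arxiv.org/api/query?search_query=cat:cs.CL"))
```

In a real run you would call throttle.wait() before each request with the default clock and sleep, and check whether cache_path(url) already exists before fetching.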

Where ProxiesAPI fits (honestly)

If you’re downloading a handful of papers, you may not need proxies.

But when you run:

hundreds of Atom pages (many queries)
thousands of PDFs
scheduled nightly updates

…your failures will come from network volatility more than parsing.

ProxiesAPI can help as a stable network layer: consistent retry behavior and IP rotation when you’re rate-limited.

QA checklist

Atom query returns XML (not HTML)
Parsed papers contain title, authors, pdf_url
PDFs download and open in a PDF viewer
JSONL contains one record per downloaded paper
Reruns skip already-downloaded PDFs
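
Some of this checklist can be automated. The sketch below checks each JSONL record for the expected fields and confirms that each saved file starts with the %PDF- signature, which catches error pages saved with a .pdf name. The function names and the sample rows are placeholders for illustration.

```python
import json
import tempfile
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """True if the file starts with the PDF magic bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def check_run(jsonl_path: Path) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    with open(jsonl_path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue
            row = json.loads(line)
            for field in ("title", "authors", "pdf_url"):
                if not row.get(field):
                    problems.append(f"line {n}: missing {field}")
            pdf_path = Path(row.get("pdf_path") or "")
            if not pdf_path.is_file():
                problems.append(f"line {n}: pdf_path does not exist")
            elif not looks_like_pdf(pdf_path):
                problems.append(f"line {n}: pdf_path is not a PDF")
    return problems


# Demo on placeholder files: one valid-looking PDF, one HTML page saved as .pdf.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pdf"
    good.write_bytes(b"%PDF-1.7\n% placeholder\n")
    bad = Path(tmp) / "bad.pdf"
    bad.write_bytes(b"<html>error page</html>")
    rows = [
        {"title": "Placeholder A", "authors": ["A. Author"],
         "pdf_url": "https://arxiv.org/pdf/0000.00000", "pdf_path": str(good)},
        {"title": "Placeholder B", "authors": [],
         "pdf_url": "https://arxiv.org/pdf/0000.00001", "pdf_path": str(bad)},
    ]
    jsonl = Path(tmp) / "run.jsonl"
    jsonl.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    problems = check_run(jsonl)

print(problems)
```

The signature check is only a cheap first filter; opening a few files in a real PDF viewer is still worthwhile.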

Next upgrades

store metadata in SQLite with updated timestamps
detect new versions (v3) and re-download
add a checksum validation step
build a simple search UI over your local corpus
