How to Scrape ArXiv Papers (Search + Metadata + PDFs) with Python + ProxiesAPI
arXiv is a rare gift for scrapers.
Unlike most sites in 2026, arXiv has:
- stable URLs
- predictable IDs
- an official Atom feed for search results
- direct PDF endpoints
That means you can build a scraper that’s fast, reliable, and polite: rate-limited, cached, and restricted to narrow queries.
In this guide we’ll build a Python pipeline that:
- searches arXiv for a query (via Atom)
- extracts metadata (title, authors, categories, abstract, published date)
- downloads PDFs for each paper
- saves everything to disk + JSONL
- uses a network layer that can optionally route requests through ProxiesAPI

Metadata is easy; downloading hundreds of PDFs is where timeouts and throttles show up. ProxiesAPI can help keep long runs stable with consistent retries and IP rotation as you scale.
arXiv endpoints: the three URLs you need
You can do almost everything with these:
1) Search feed (Atom)
arXiv exposes a search API that returns Atom XML.
Base:
https://export.arxiv.org/api/query
Common parameters:
- search_query= (e.g. all:transformer or cat:cs.CL)
- start= offset of the first result
- max_results= number of results per request
- sortBy= (submittedDate, lastUpdatedDate, relevance)
- sortOrder= (ascending, descending)
Example:
https://export.arxiv.org/api/query?search_query=cat:cs.CL+AND+all:retrieval&start=0&max_results=25&sortBy=submittedDate&sortOrder=descending
2) Abstract page
https://arxiv.org/abs/<id>
3) PDF
https://arxiv.org/pdf/<id>.pdf
You don’t need a headless browser for this pipeline.
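As a quick illustration of the three endpoints, here is a minimal sketch that builds each URL from its parts. The helper names (search_url, abs_url, pdf_url) are made up for this example; only the base URLs and parameter names come from the section above.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"
ARXIV_ABS = "https://arxiv.org/abs/"
ARXIV_PDF = "https://arxiv.org/pdf/"


def search_url(query: str, start: int = 0, max_results: int = 25) -> str:
    """Build an Atom search URL (hypothetical helper for illustration)."""
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-encodes ":" and turns spaces into "+"
    return ARXIV_API + "?" + urlencode(params)


def abs_url(arxiv_id: str) -> str:
    """Abstract page for an ID such as 2501.01234 or 2501.01234v2."""
    return ARXIV_ABS + arxiv_id


def pdf_url(arxiv_id: str) -> str:
    """Direct PDF endpoint for the same ID."""
    return ARXIV_PDF + arxiv_id + ".pdf"


print(search_url("cat:cs.CL AND all:retrieval"))
print(abs_url("2501.01234v2"))
print(pdf_url("2501.01234v2"))
```

Nothing here touches the network, so you can check the URLs by eye before you fetch anything.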
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests tenacity feedparser
We’ll use:
- requests for HTTP
- feedparser to parse Atom cleanly
- tenacity for retries
Step 1: Build a robust HTTP client (optional ProxiesAPI)
You’ll make two kinds of requests:
- small metadata fetches (Atom)
- large binary downloads (PDF)
Both benefit from:
- timeouts
- retries
- streaming downloads
from __future__ import annotations

import os
import time
import random
from dataclasses import dataclass
from typing import Optional

import requests
from requests import Response
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type

# (connect timeout, read timeout) in seconds
META_TIMEOUT = (10, 30)
PDF_TIMEOUT = (10, 120)


@dataclass
class HttpConfig:
    proxiesapi_url: Optional[str] = None
    user_agent: str = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    )


class HttpClient:
    def __init__(self, cfg: HttpConfig):
        self.cfg = cfg
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": cfg.user_agent,
            "Accept": "*/*",
        })

    def _via_proxiesapi(self, target_url: str) -> str:
        """Wrap a URL through ProxiesAPI if configured.

        Keep this explicit: ProxiesAPI deployments vary.
        If your ProxiesAPI expects a different parameter name or path,
        change it here.
        """
        if not self.cfg.proxiesapi_url:
            return target_url
        from urllib.parse import urlencode
        return self.cfg.proxiesapi_url.rstrip("/") + "?" + urlencode({"url": target_url})

    @retry(
        reraise=True,
        stop=stop_after_attempt(5),
        wait=wait_exponential_jitter(initial=1, max=20),
        retry=retry_if_exception_type(requests.RequestException),
    )
    def get(self, url: str, *, params: dict | None = None, timeout=META_TIMEOUT, stream: bool = False) -> Response:
        fetch_url = self._via_proxiesapi(url)
        r = self.session.get(fetch_url, params=params, timeout=timeout, stream=stream)
        # treat throttling and server errors as retryable
        if r.status_code in (429, 500, 502, 503, 504):
            raise requests.RequestException(f"Transient status {r.status_code} for {url}")
        r.raise_for_status()
        return r


def polite_sleep(min_s: float = 0.8, max_s: float = 2.0) -> None:
    time.sleep(random.uniform(min_s, max_s))
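One detail to watch: if your proxy endpoint already carries a query string (an API key, for example), appending "?" a second time yields a malformed URL. The sketch below merges the target into whatever query string is already there. The function name wrap_with_proxy, the example host, and the auth_key and url parameter names are assumptions for illustration; check your ProxiesAPI account for the exact format.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def wrap_with_proxy(proxy_endpoint: Optional[str], target_url: str) -> str:
    """Return target_url routed through proxy_endpoint, or unchanged if no proxy is set.

    Assumes the proxy takes the target in a "url" query parameter (hypothetical).
    """
    if not proxy_endpoint:
        return target_url
    parts = urlsplit(proxy_endpoint)
    # keep any parameters already on the endpoint, e.g. an API key
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("url", target_url))
    return urlunsplit(parts._replace(query=urlencode(query)))


# hypothetical endpoint and key, for illustration only
print(wrap_with_proxy(
    "https://proxy.example/?auth_key=KEY",
    "https://export.arxiv.org/api/query?search_query=cat:cs.CL",
))
```

You could read the endpoint from an environment variable (say, a PROXIESAPI_URL of your own choosing) and pass None when it is unset, so the same script runs with or without a proxy.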
Step 2: Search arXiv via Atom (no brittle HTML parsing)
We’ll build a search function that returns a list of paper records.
import feedparser

ARXIV_API = "https://export.arxiv.org/api/query"


def search_arxiv(query: str, start: int = 0, max_results: int = 25, sort_by: str = "submittedDate") -> str:
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": "descending",
    }
    # return the URL for visibility/debugging
    from urllib.parse import urlencode
    return ARXIV_API + "?" + urlencode(params)


def parse_atom(atom_xml: str) -> list[dict]:
    feed = feedparser.parse(atom_xml)
    out: list[dict] = []
    for e in feed.entries:
        # e.id looks like: http://arxiv.org/abs/2501.01234v2
        abs_url = e.get("id")
        title = (e.get("title") or "").strip().replace("\n", " ")
        summary = (e.get("summary") or "").strip().replace("\n", " ")
        authors = [a.get("name") for a in e.get("authors", []) if a.get("name")]
        # arXiv-specific tags
        categories = [t.get("term") for t in e.get("tags", []) if t.get("term")]
        published = e.get("published")
        updated = e.get("updated")
        # find pdf link
        pdf_url = None
        for link in e.get("links", []):
            if link.get("type") == "application/pdf":
                pdf_url = link.get("href")
                break
        out.append({
            "abs_url": abs_url,
            "title": title,
            "abstract": summary,
            "authors": authors,
            "categories": categories,
            "published": published,
            "updated": updated,
            "pdf_url": pdf_url,
        })
    return out


http = HttpClient(HttpConfig(proxiesapi_url=None))
url = search_arxiv("cat:cs.CL AND all:retrieval", start=0, max_results=10)
xml = http.get(url, timeout=META_TIMEOUT).text
papers = parse_atom(xml)
print("papers", len(papers))
print(papers[0]["title"])
print(papers[0]["pdf_url"])
This approach is more reliable than scraping the HTML search page because:
- Atom schema changes slowly
- fields like author/title/abstract are first-class
- you avoid brittle DOM selectors
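Because the feed takes start and max_results, pagination is just a matter of stepping the offset. Here is a minimal sketch that generates one URL per page; page_urls is a made-up helper name, and the three-second pause in the usage comment follows the delay that arXiv's API documentation asks for between consecutive calls.

```python
from typing import Iterator
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def page_urls(query: str, total: int, page_size: int = 25) -> Iterator[str]:
    """Yield search URLs covering the first `total` results, one page at a time."""
    for start in range(0, total, page_size):
        params = {
            "search_query": query,
            "start": start,
            # the last page may be shorter than page_size
            "max_results": min(page_size, total - start),
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        yield ARXIV_API + "?" + urlencode(params)


for u in page_urls("cat:cs.CL", total=60, page_size=25):
    print(u)

# Usage with the client from Step 1 (left as a comment so nothing is fetched here):
#   for u in page_urls("cat:cs.CL", total=200):
#       batch = parse_atom(http.get(u).text)
#       if not batch:
#           break          # ran out of results early
#       time.sleep(3)      # pause between API calls
```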
Step 3: Normalize arXiv IDs (so you can name files)
arXiv abstract URLs often include a version suffix (v1, v2).
For filenames, you usually want both:
- base id: 2501.01234
- version: v2
import re


def extract_arxiv_id(abs_url: str) -> dict:
    # abs_url: http(s)://arxiv.org/abs/2501.01234v2
    # capture everything after /abs/ so IDs that contain a slash stay intact
    m = re.search(r"/abs/(.+)$", abs_url or "")
    raw = m.group(1) if m else ""
    mv = re.match(r"^(?P<base>\d{4}\.\d{4,5})(?P<ver>v\d+)?$", raw)
    if not mv:
        # unrecognized format: fall back to the raw ID with no version
        return {"raw": raw, "base": raw, "ver": None}
    return {
        "raw": raw,
        "base": mv.group("base"),
        "ver": mv.group("ver"),
    }


print(extract_arxiv_id("https://arxiv.org/abs/2501.01234v2"))
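If your queries reach back before April 2007, you will also meet old-style identifiers such as hep-th/9901001v1, which put an archive name and a slash in front of the number. The sketch below is one way to split both styles into base and version. parse_arxiv_id is a made-up name, and the old-style pattern is an approximation that may miss unusual archive names.

```python
import re

# new style: 2501.01234 or 2501.01234v2
NEW_STYLE = re.compile(r"^(?P<base>\d{4}\.\d{4,5})(?P<ver>v\d+)?$")
# old style: hep-th/9901001v1 or math.GT/0309136 (approximate pattern)
OLD_STYLE = re.compile(r"^(?P<base>[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?P<ver>v\d+)?$")


def parse_arxiv_id(abs_url: str) -> dict:
    """Split an abstract URL into raw ID, base ID, and version suffix."""
    m = re.search(r"/abs/(.+)$", abs_url or "")
    raw = m.group(1) if m else ""
    for pattern in (NEW_STYLE, OLD_STYLE):
        mv = pattern.match(raw)
        if mv:
            return {"raw": raw, "base": mv.group("base"), "ver": mv.group("ver")}
    # unknown format: keep the raw string so nothing is lost
    return {"raw": raw, "base": raw, "ver": None}


print(parse_arxiv_id("http://arxiv.org/abs/hep-th/9901001v1"))
print(parse_arxiv_id("https://arxiv.org/abs/2501.01234"))
```

Old-style IDs contain a slash, so replace it (for example with an underscore) before using the ID as a filename.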
Step 4: Download PDFs safely (streaming + retries)
PDF downloads are where scraping gets flaky:
- large responses
- occasional resets
- transient 5xx
We’ll implement a streaming downloader that:
- writes to a temp file
- renames when complete
- skips if already downloaded
import os
from pathlib import Path


def download_pdf(http: HttpClient, pdf_url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if out_path.exists() and out_path.stat().st_size > 50_000:
        # quick skip if file is already there and non-trivial
        return
    tmp_path = out_path.with_suffix(out_path.suffix + ".part")
    # the context manager releases the connection even if the stream breaks midway
    with http.get(pdf_url, timeout=PDF_TIMEOUT, stream=True) as r, open(tmp_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 128):
            if chunk:
                f.write(chunk)
    # rename only after the whole body is on disk
    os.replace(tmp_path, out_path)


def download_many(http: HttpClient, papers: list[dict], out_dir: str = "arxiv_pdfs", limit: int = 25) -> list[dict]:
    out = []
    for i, p in enumerate(papers[:limit], start=1):
        pdf_url = p.get("pdf_url")
        abs_url = p.get("abs_url")
        if not pdf_url or not abs_url:
            continue
        aid = extract_arxiv_id(abs_url)
        file_name = aid["raw"].replace("/", "_") + ".pdf"
        out_path = Path(out_dir) / file_name
        download_pdf(http, pdf_url, out_path)
        out.append({**p, "arxiv_id": aid, "pdf_path": str(out_path)})
        print(f"{i}/{min(limit, len(papers))} downloaded {out_path.name}")
        polite_sleep(0.8, 2.2)
    return out
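The size check above is a heuristic: an HTML error page or a truncated body can pass it, and a short but genuine PDF can fail it. A cheap complement is to check the file's magic bytes and record a checksum. This is a minimal sketch; looks_like_pdf and sha256_of are made-up helper names, and the 1 KB floor is an arbitrary assumption.

```python
import hashlib
from pathlib import Path


def looks_like_pdf(path: Path, min_bytes: int = 1024) -> bool:
    """True if the file exists, is not tiny, and starts with the PDF signature."""
    if not path.is_file() or path.stat().st_size < min_bytes:
        return False
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def sha256_of(path: Path, chunk_size: int = 1024 * 128) -> str:
    """Hash the file in chunks so large PDFs are never held in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

You could call looks_like_pdf on the .part file before renaming it, and store the sha256_of value next to the metadata so that later runs can detect corrupted files.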
Step 5: Export metadata as JSONL (easy to pipeline)
import json


def write_jsonl(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


url = search_arxiv("cat:cs.CL AND all:retrieval", start=0, max_results=25)
xml = http.get(url, timeout=META_TIMEOUT).text
papers = parse_atom(xml)
enriched = download_many(http, papers, out_dir="arxiv_pdfs", limit=10)
write_jsonl(enriched, "arxiv_papers.jsonl")
print("wrote arxiv_papers.jsonl", len(enriched))
Rate limiting and etiquette
Even with a friendly target like arXiv, be a good citizen:
- keep max_results small and paginate slowly
- cache results locally
- avoid parallelizing PDF downloads aggressively
- prefer a narrow category query (cat:cs.CL) over scraping the entire corpus
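One way to make "paginate slowly" hard to forget is to put a throttle in front of every request. The sketch below enforces a minimum gap between calls. MinIntervalThrottle is a made-up class, and the three-second default mirrors the delay arXiv's API documentation asks for; adjust it to your own needs.

```python
import time


class MinIntervalThrottle:
    """Block until at least min_interval_s has passed since the previous call."""

    def __init__(self, min_interval_s: float = 3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval_s - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Usage (comment only, so nothing sleeps or fetches here):
#   throttle = MinIntervalThrottle(3.0)
#   for u in urls:
#       throttle.wait()
#       xml = http.get(u).text
```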
Where ProxiesAPI fits (honestly)
If you’re downloading a handful of papers, you may not need proxies.
But when you run:
- hundreds of Atom pages (many queries)
- thousands of PDFs
- scheduled nightly updates
…your failures will come from network volatility more than parsing.
ProxiesAPI can help as a stable network layer: consistent retry behavior and IP rotation when you’re rate-limited.
QA checklist
- Atom query returns XML (not HTML)
- Parsed papers contain title, authors, pdf_url
- PDFs download and open in a PDF viewer
- JSONL contains one record per downloaded paper
- Reruns skip already-downloaded PDFs
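Part of this checklist can be automated. The following sketch reads the JSONL file and reports records that lack a required field or point at a PDF that is not on disk. check_jsonl is a made-up helper, and the field names assume the records written in Step 5.

```python
import json
from pathlib import Path

REQUIRED = ("title", "authors", "pdf_url", "pdf_path")


def check_jsonl(path: str) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {lineno}: invalid JSON")
                continue
            if not isinstance(rec, dict):
                problems.append(f"line {lineno}: not a JSON object")
                continue
            for key in REQUIRED:
                if not rec.get(key):
                    problems.append(f"line {lineno}: missing {key}")
            pdf_path = rec.get("pdf_path")
            if pdf_path and not Path(pdf_path).is_file():
                problems.append(f"line {lineno}: PDF not found at {pdf_path}")
    return problems


# Usage:
#   for problem in check_jsonl("arxiv_papers.jsonl"):
#       print(problem)
```

Whether a PDF opens in a viewer still needs a manual spot check; this only confirms the bookkeeping.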
Next upgrades
- store metadata in SQLite with updated timestamps
- detect new versions (v3) and re-download
- add a checksum validation step
- build a simple search UI over your local corpus
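As a starting point for the first two upgrades, here is a minimal sketch that tracks the latest downloaded version of each paper in SQLite and tells you when a newer one has appeared. The table layout and the function names (open_db, needs_download, record_download) are assumptions for illustration, not part of the pipeline above.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    base_id  TEXT PRIMARY KEY,   -- e.g. 2501.01234
    version  INTEGER NOT NULL,   -- 2 for "v2"
    updated  TEXT,               -- the feed's updated timestamp
    pdf_path TEXT
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def version_number(ver: Optional[str]) -> int:
    """Turn "v3" into 3; treat a missing suffix as version 1."""
    return int(ver[1:]) if ver else 1


def needs_download(conn: sqlite3.Connection, base_id: str, ver: Optional[str]) -> bool:
    """True if the paper is unknown or the feed shows a newer version than we hold."""
    row = conn.execute(
        "SELECT version FROM papers WHERE base_id = ?", (base_id,)
    ).fetchone()
    return row is None or version_number(ver) > row[0]


def record_download(conn: sqlite3.Connection, base_id: str, ver: Optional[str],
                    updated: str, pdf_path: str) -> None:
    """Insert the paper, or overwrite the row if it already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO papers (base_id, version, updated, pdf_path) "
        "VALUES (?, ?, ?, ?)",
        (base_id, version_number(ver), updated, pdf_path),
    )
    conn.commit()
```

In a nightly run you would call needs_download with the base and ver values from Step 3 before fetching each PDF, then record_download once the file is safely on disk.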