Scrape Google Scholar Search Results with Python (Authors, Citations, and Year)
Google Scholar is incredibly useful for:
- finding papers for a topic
- monitoring new publications
- building a literature dataset for analysis
…but it’s also one of the more automation-sensitive Google properties.
In this tutorial we’ll build a careful, repeatable Scholar scraper in Python that extracts:
- title
- link
- authors
- publication venue / snippet
- year (best-effort)
- citation count
We’ll also paginate results for a query.
Important: you should keep your crawl volume reasonable and expect occasional blocks. This guide focuses on a defensive approach and an exportable dataset.

Scholar is sensitive to automation. When you need repeatable runs, ProxiesAPI can help keep your request success rate stable with better network hygiene and fewer hard blocks.
What we’re scraping (Scholar structure)
A Scholar search URL looks like:
```
https://scholar.google.com/scholar?q=graph+neural+networks
```
Pagination is controlled by the start parameter:
- page 1: start=0
- page 2: start=10
- page 3: start=20
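In other words, the offset is a simple function of the page number (assuming Scholar's default of 10 results per page). A quick sketch; `start_for_page` is a helper name introduced here for illustration:

```python
def start_for_page(page_number: int) -> int:
    # Scholar pages are offset in steps of 10 results (its default page size).
    return (page_number - 1) * 10

print([start_for_page(p) for p in (1, 2, 3)])  # [0, 10, 20]
```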
Scholar's results typically live inside a container with the id gs_res_ccl_mid, and each individual result is a div.gs_r block.
We’ll parse what’s visible in the HTML rather than guessing hidden APIs.
Setup
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
```
Step 1: Fetch HTML safely (headers + timeouts + backoff)
```python
import time
import random
from typing import Optional

import requests

TIMEOUT = (10, 30)  # (connect, read) seconds

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()


def fetch(url: str, *, proxy_url: Optional[str] = None, max_retries: int = 5) -> str:
    proxies = None
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}

    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, headers=HEADERS, timeout=TIMEOUT, proxies=proxies)
            # Scholar often responds with 429/503 when it dislikes automation;
            # treat transient 5xx codes the same way and back off.
            if r.status_code in (429, 500, 502, 503, 504):
                last_err = RuntimeError(f"HTTP {r.status_code}")
                time.sleep(min(30, (2 ** attempt) + random.random()))
                continue
            r.raise_for_status()
            return r.text
        except requests.RequestException as e:
            last_err = e
            time.sleep(min(30, (2 ** attempt) + random.random()))
    raise RuntimeError(f"fetch failed after {max_retries} attempts: {last_err}")
```
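The backoff schedule above is exponential with random jitter, capped at 30 seconds. A self-contained sketch of just that delay logic (`backoff_delay` is a name introduced here, mirroring the expression inside `fetch`):

```python
import random


def backoff_delay(attempt: int, cap: float = 30.0) -> float:
    # Exponential backoff (2, 4, 8, 16, ...) plus up to 1s of jitter, capped.
    return min(cap, (2 ** attempt) + random.random())


delays = [backoff_delay(a) for a in range(1, 6)]
# Attempts 1..5 give roughly 2, 4, 8, 16 seconds; the cap kicks in at attempt 5.
```

The jitter keeps multiple workers from retrying in lockstep; the cap keeps a long retry chain from stalling the whole run.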
Where ProxiesAPI fits
If you use a ProxiesAPI endpoint that behaves like an outbound HTTP proxy, set proxy_url.
Be conservative with frequency even with proxies: Scholar may still challenge or block.
Step 2: Parse results (title, authors, year, citations)
Each result has a few consistent pieces:
- a title link
- a metadata line with authors and venue
- a “Cited by N” link
We’ll parse these fields with BeautifulSoup.
```python
import re

from bs4 import BeautifulSoup


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def extract_year(text: str) -> int | None:
    # Scholar snippets often contain a 4-digit year.
    m = re.search(r"\b(19\d{2}|20\d{2})\b", text or "")
    return int(m.group(1)) if m else None


def parse_scholar_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    out = []
    for r in soup.select("div.gs_r"):
        title_a = r.select_one("h3.gs_rt a")
        title = title_a.get_text(" ", strip=True) if title_a else None
        link = title_a.get("href") if title_a else None

        meta = r.select_one("div.gs_a")
        meta_text = meta.get_text(" ", strip=True) if meta else ""
        # gs_a usually looks like:
        # "A Author, B Author - Venue, 2021 - publisher.com"
        year = extract_year(meta_text)

        # Best-effort author split: authors come before the first " - " separator.
        authors = None
        if meta_text and " - " in meta_text:
            authors = meta_text.split(" - ", 1)[0].strip()

        # Citations: look for the "Cited by N" link in the footer row.
        cited_by = 0
        for a in r.select("div.gs_fl a"):
            t = a.get_text(" ", strip=True)
            if t.lower().startswith("cited by"):
                cited_by = parse_int(t) or 0
                break

        snippet = None
        snip = r.select_one("div.gs_rs")
        if snip:
            snippet = snip.get_text(" ", strip=True)

        # Skip wrappers that match div.gs_r but carry no actual result.
        if title is None and not meta_text:
            continue

        out.append({
            "title": title,
            "link": link,
            "authors": authors,
            "meta": meta_text or None,
            "year": year,
            "cited_by": cited_by,
            "snippet": snippet,
        })
    return out
```
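A quick sanity check of the regex helpers against a typical gs_a line (the helpers are re-stated here so the snippet runs standalone; the sample line matches the format shown in the code comment above):

```python
import re


def parse_int(text):
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def extract_year(text):
    m = re.search(r"\b(19\d{2}|20\d{2})\b", text or "")
    return int(m.group(1)) if m else None


meta = "A Author, B Author - Venue, 2021 - publisher.com"
authors = meta.split(" - ", 1)[0].strip()
print(authors)                       # A Author, B Author
print(extract_year(meta))            # 2021
print(parse_int("Cited by 128"))     # 128
```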
Step 3: Paginate with start= (0, 10, 20…)
```python
from urllib.parse import urlencode

BASE = "https://scholar.google.com/scholar"


def build_search_url(query: str, start: int = 0) -> str:
    qs = urlencode({"q": query, "start": start})
    return f"{BASE}?{qs}"


def crawl_scholar(query: str, *, pages: int = 3, proxy_url: str | None = None) -> list[dict]:
    all_rows = []
    for p in range(pages):
        start = p * 10
        url = build_search_url(query, start=start)
        html = fetch(url, proxy_url=proxy_url)
        rows = parse_scholar_page(html)
        print(f"page {p + 1}: {len(rows)} rows")
        all_rows.extend(rows)
        if not rows:
            break
        # Pacing matters on Scholar: 6-9 seconds between pages.
        time.sleep(6 + random.random() * 3)
    return all_rows
```
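Scholar occasionally repeats a result across adjacent pages, so it can be worth deduplicating by link before export. A small sketch (`dedupe_rows` is a helper name introduced here):

```python
def dedupe_rows(rows: list[dict]) -> list[dict]:
    # Keep the first occurrence per link; rows without a link pass through.
    seen = set()
    out = []
    for row in rows:
        key = row.get("link")
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        out.append(row)
    return out


rows = [
    {"link": "https://a.example", "title": "x"},
    {"link": "https://a.example", "title": "x (dup)"},
    {"link": None, "title": "no link"},
]
print(len(dedupe_rows(rows)))  # 2
```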
Step 4: Export to CSV
```python
import csv


def write_csv(rows: list[dict], path: str = "scholar_results.csv") -> None:
    fields = ["title", "link", "authors", "year", "cited_by", "meta", "snippet"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


if __name__ == "__main__":
    proxy_url = None  # set a ProxiesAPI proxy endpoint here if you have one
    rows = crawl_scholar("graph neural networks", pages=2, proxy_url=proxy_url)
    print("total rows:", len(rows))
    write_csv(rows)
    print("wrote scholar_results.csv")
```
How to make this work in practice (without pain)
Scholar scraping breaks for predictable reasons:
- Too fast → 429 / captcha interstitial
- Too many pages for a query → block
- Same IP/user-agent pattern repeatedly → block
A pragmatic playbook:
- Keep runs small (10–50 results)
- Cache results so you don’t re-fetch the same pages every run
- Use randomized delays
- Use ProxiesAPI for better network hygiene when you need repeatability
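Caching is the cheapest win on that list: if a page is already on disk, you never pay a request for it again. A minimal sketch of a file-based cache keyed by a hash of the URL (`cached_fetch` and `CACHE_DIR` are names introduced here; it wraps whatever fetch function you pass in):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".scholar_cache")


def cache_path(url: str) -> Path:
    # One file per URL, named by a hash of the URL.
    return CACHE_DIR / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")


def cached_fetch(url: str, fetch_fn) -> str:
    # Serve from the on-disk cache when possible; fetch and store otherwise.
    CACHE_DIR.mkdir(exist_ok=True)
    p = cache_path(url)
    if p.exists():
        return p.read_text(encoding="utf-8")
    html = fetch_fn(url)
    p.write_text(html, encoding="utf-8")
    return html
```

Usage would be `cached_fetch(url, fetch)` with the `fetch` function from Step 1. Note this never expires entries; delete the cache directory when you want fresh results.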
QA checklist
- Your parser returns ~10 results per page (this varies)
- cited_by matches the “Cited by” value shown on the page
- year is present for most results
- CSV opens cleanly in Excel/Sheets
Next upgrades
- Add a SQLite store keyed by link so you can track citations over time
- For each result, visit the “Cited by” page to build a citation network (carefully)
- Add error snapshots: save HTML when you get blocked so you can recognize interstitial pages
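That last upgrade can start as a tiny helper. A sketch, with the caveat that the marker strings below are assumptions based on typical Google interstitial pages; verify them against real HTML you save from your own blocked runs:

```python
# Assumed marker phrases from typical Google block/captcha pages; confirm
# against saved snapshots before relying on them.
BLOCK_MARKERS = (
    "please show you're not a robot",
    "unusual traffic from your computer network",
)


def looks_blocked(html: str) -> bool:
    low = (html or "").lower()
    return any(marker in low for marker in BLOCK_MARKERS)


def save_snapshot(html: str, path: str = "blocked_snapshot.html") -> None:
    # Keep the raw HTML so you can inspect what the block page looked like.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
```

Call `looks_blocked` on each fetched page; when it fires, save a snapshot and stop the run instead of hammering Scholar further.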
Scholar will always be automation-sensitive, so keep runs small and pacing polite. And when you need repeatable runs, ProxiesAPI can help keep your request success rate stable with better network hygiene and fewer hard blocks.