Scrape Google Scholar Search Results with Python (Authors, Citations, and Year)

Google Scholar is incredibly useful for:

  • finding papers for a topic
  • monitoring new publications
  • building a literature dataset for analysis

…but it’s also one of the more automation-sensitive Google properties.

In this tutorial we’ll build a careful, repeatable Scholar scraper in Python that extracts:

  • title
  • link
  • authors
  • publication venue / snippet
  • year (best-effort)
  • citation count

We’ll also paginate results for a query.

Important: you should keep your crawl volume reasonable and expect occasional blocks. This guide focuses on a defensive approach and an exportable dataset.

Screenshot: the Google Scholar homepage (we'll scrape search results pages like these).

Make Scholar crawling more reliable with ProxiesAPI

Scholar is sensitive to automation. When you need repeatable runs, ProxiesAPI can help keep your request success rate stable with better network hygiene and fewer hard blocks.


What we’re scraping (Scholar structure)

A Scholar search URL looks like:

  • https://scholar.google.com/scholar?q=graph+neural+networks

Pagination is controlled by the start parameter:

  • page 1: start=0
  • page 2: start=10
  • page 3: start=20
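The page-to-offset mapping is simple arithmetic; here is a tiny sketch (`page_to_start` is our own helper name, not a Scholar concept):

```python
# Map a 1-based page number to Scholar's `start` offset (10 results per page).
def page_to_start(page: int, page_size: int = 10) -> int:
    return (page - 1) * page_size

print(page_to_start(1), page_to_start(2), page_to_start(3))  # 0 10 20
```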

Scholar result blocks typically live inside the container div#gs_res_ccl_mid, with each individual result in a div.gs_r (often also carrying the gs_or and gs_scl classes).

We’ll parse what’s visible in the HTML rather than guessing hidden APIs.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML safely (headers + timeouts + backoff)

import time
import random
from typing import Optional

import requests

TIMEOUT = (10, 30)

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()


def fetch(url: str, *, proxy_url: Optional[str] = None, max_retries: int = 5) -> str:
    proxies = None
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}

    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, headers=HEADERS, timeout=TIMEOUT, proxies=proxies)

            # Scholar often responds with 429 or a 5xx when it dislikes automation.
            if r.status_code in (429, 503, 500, 502, 504):
                time.sleep(min(30, (2 ** attempt) + random.random()))
                continue

            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            time.sleep(min(30, (2 ** attempt) + random.random()))

    raise RuntimeError(f"fetch failed: {last_err}")

Where ProxiesAPI fits

If you use a ProxiesAPI endpoint that behaves like an outbound HTTP proxy, set proxy_url.

Be conservative with frequency even with proxies: Scholar may still challenge or block.
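One defensive habit worth layering on top of fetch: detect challenge pages before parsing, so a run stops early instead of silently writing empty rows. The marker strings below are assumptions based on commonly reported Google interstitials; verify them against a real blocked response you've captured before relying on them.

```python
def looks_blocked(html: str) -> bool:
    """Heuristic check for a Scholar challenge/captcha interstitial.

    The markers are assumptions (not an official list); confirm them
    against an actual blocked response before trusting this check.
    """
    lowered = html.lower()
    markers = (
        "unusual traffic",                 # classic Google rate-limit page
        "please show you're not a robot",  # captcha prompt wording
        'id="gs_captcha"',                 # captcha container seen on Scholar
    )
    return any(m in lowered for m in markers)
```

Call it right after fetch and back off (or stop the run) when it returns True.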


Step 2: Parse results (title, authors, year, citations)

Each result has a few consistent pieces:

  • a title link
  • a metadata line with authors and venue
  • a “Cited by N” link

We’ll parse these fields with BeautifulSoup.

import re
from bs4 import BeautifulSoup


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def extract_year(text: str) -> int | None:
    # Scholar snippets often contain a 4-digit year.
    m = re.search(r"\b(19\d{2}|20\d{2})\b", text or "")
    return int(m.group(1)) if m else None


def parse_scholar_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []
    for r in soup.select("div.gs_r"):
        title_a = r.select_one("h3.gs_rt a")
        title = title_a.get_text(" ", strip=True) if title_a else None
        link = title_a.get("href") if title_a else None

        meta = r.select_one("div.gs_a")
        # normalize non-breaking spaces so the " - " author split below works
        meta_text = meta.get_text(" ", strip=True).replace("\xa0", " ") if meta else ""

        # gs_a usually looks like:
        # "A Author, B Author - Venue, 2021 - publisher.com"
        year = extract_year(meta_text)

        # Best-effort author split: authors are before the first '-' separator
        authors = None
        if meta_text and " - " in meta_text:
            authors = meta_text.split(" - ", 1)[0].strip()

        # Citations
        cited_by = 0
        for a in r.select("div.gs_fl a"):
            t = a.get_text(" ", strip=True)
            if t.lower().startswith("cited by"):
                cited_by = parse_int(t) or 0

        snippet = None
        snip = r.select_one("div.gs_rs")
        if snip:
            snippet = snip.get_text(" ", strip=True)

        out.append({
            "title": title,
            "link": link,
            "authors": authors,
            "meta": meta_text or None,
            "year": year,
            "cited_by": cited_by,
            "snippet": snippet,
        })

    return out

Step 3: Paginate with start= (0, 10, 20…)

from urllib.parse import urlencode

BASE = "https://scholar.google.com/scholar"


def build_search_url(query: str, start: int = 0) -> str:
    qs = urlencode({"q": query, "start": start})
    return f"{BASE}?{qs}"


def crawl_scholar(query: str, *, pages: int = 3, proxy_url: str | None = None) -> list[dict]:
    all_rows = []

    for p in range(pages):
        start = p * 10
        url = build_search_url(query, start=start)

        html = fetch(url, proxy_url=proxy_url)
        rows = parse_scholar_page(html)
        print(f"page {p+1}: rows {len(rows)}")

        all_rows.extend(rows)

        # an empty page usually means the end of results (or a block)
        if not rows:
            break

        # pacing matters on Scholar
        time.sleep(6 + random.random() * 3)

    return all_rows
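To confirm the URL builder produces what Scholar expects (urlencode turns spaces into + and appends the start offset):

```python
from urllib.parse import urlencode

BASE = "https://scholar.google.com/scholar"

def build_search_url(query: str, start: int = 0) -> str:
    return f"{BASE}?{urlencode({'q': query, 'start': start})}"

print(build_search_url("graph neural networks", start=10))
# https://scholar.google.com/scholar?q=graph+neural+networks&start=10
```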

Step 4: Export to CSV

import csv


def write_csv(rows: list[dict], path: str = "scholar_results.csv"):
    fields = ["title", "link", "authors", "year", "cited_by", "meta", "snippet"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


if __name__ == "__main__":
    proxy_url = None  # set ProxiesAPI proxy endpoint if you have one

    rows = crawl_scholar("graph neural networks", pages=2, proxy_url=proxy_url)
    print("total rows:", len(rows))

    write_csv(rows)
    print("wrote scholar_results.csv")

How to make this work in practice (without pain)

Scholar scraping breaks for predictable reasons:

  1. Too fast → 429 / captcha interstitial
  2. Too many pages for a query → block
  3. Same IP/user-agent pattern repeatedly → block

A pragmatic playbook:

  • Keep runs small (10–50 results)
  • Cache results so you don’t re-fetch the same pages every run
  • Use randomized delays
  • Use ProxiesAPI for better network hygiene when you need repeatability
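The caching point deserves a concrete sketch. One minimal approach (the file layout and function names here are our own, not part of any library) is to key cached HTML by a hash of the URL:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".scholar_cache")

def cache_path(url: str) -> Path:
    """One file per URL, named by the SHA-256 of the URL."""
    CACHE_DIR.mkdir(exist_ok=True)
    return CACHE_DIR / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")

def fetch_cached(url: str, fetch_fn) -> str:
    """Return cached HTML for url if present; otherwise call fetch_fn and store the result."""
    p = cache_path(url)
    if p.exists():
        return p.read_text(encoding="utf-8")
    html = fetch_fn(url)
    p.write_text(html, encoding="utf-8")
    return html
```

Wire it in by replacing fetch(url, ...) calls with fetch_cached(url, fetch) inside crawl_scholar; delete .scholar_cache when you want fresh data.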

QA checklist

  • Your parser returns ~10 results per page (varies)
  • cited_by matches the “Cited by” value
  • year is present for most results
  • CSV opens cleanly in Excel/Sheets

Next upgrades

  • Add a SQLite store keyed by link so you can track citations over time
  • For each result, visit the “Cited by” page to build a citation network (carefully)
  • Add error snapshots: save HTML when you get blocked so you can recognize interstitial pages
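For the SQLite upgrade, here is a minimal sketch of a store keyed by link plus a crawl timestamp, so repeated runs accumulate citation history (the schema and function names are illustrative, not prescriptive):

```python
import sqlite3
import time

def init_db(path: str = "scholar.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS citations (
               link        TEXT NOT NULL,
               observed_at INTEGER NOT NULL,   -- unix timestamp of the crawl
               cited_by    INTEGER,
               PRIMARY KEY (link, observed_at)
           )"""
    )
    return conn

def record_rows(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Insert one observation per result; rows without a link are skipped."""
    now = int(time.time())
    conn.executemany(
        "INSERT OR REPLACE INTO citations (link, observed_at, cited_by) VALUES (?, ?, ?)",
        [(r["link"], now, r.get("cited_by")) for r in rows if r.get("link")],
    )
    conn.commit()
```

Querying the history for one link then gives you cited_by over time, which is the raw material for tracking citation growth.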