Scrape Google Scholar Search Results with Python (Authors, Citations, and Year)

Google Scholar is incredibly useful for:

  • finding papers for a topic
  • monitoring new publications
  • building a literature dataset for analysis

…but it’s also one of the more automation-sensitive Google properties.

In this tutorial we’ll build a careful, repeatable Scholar scraper in Python that extracts:

  • title
  • link
  • authors
  • publication venue / snippet
  • year (best-effort)
  • citation count

We’ll also paginate results for a query.

Important: you should keep your crawl volume reasonable and expect occasional blocks. This guide focuses on a defensive approach and an exportable dataset.

Screenshot: the Google Scholar homepage (we'll scrape search results pages like these).

Make Scholar crawling more reliable with ProxiesAPI

Scholar is sensitive to automation. When you need repeatable runs, ProxiesAPI can help keep your request success rate stable with better network hygiene and fewer hard blocks.


What we’re scraping (Scholar structure)

A Scholar search URL looks like:

  • https://scholar.google.com/scholar?q=graph+neural+networks

Pagination is controlled by the start parameter:

  • page 1: start=0
  • page 2: start=10
  • page 3: start=20
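The page-to-offset mapping is simple arithmetic; here is a tiny sketch (`page_to_start` is our own helper name, not a Scholar concept):

```python
# Map a 1-based page number to Scholar's `start` offset (10 results per page).
def page_to_start(page: int, page_size: int = 10) -> int:
    return (page - 1) * page_size

print(page_to_start(1), page_to_start(2), page_to_start(3))  # 0 10 20
```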

Scholar result blocks typically live inside the container div#gs_res_ccl_mid, with each individual result in a div.gs_r (often also carrying the gs_or and gs_scl classes).

We’ll parse what’s visible in the HTML rather than guessing hidden APIs.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

Step 1: Fetch HTML safely (headers + timeouts + backoff)

import time
import random
from typing import Optional

import requests

TIMEOUT = (10, 30)

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()


def fetch(url: str, *, proxy_url: Optional[str] = None, max_retries: int = 5) -> str:
    proxies = None
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}

    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, headers=HEADERS, timeout=TIMEOUT, proxies=proxies)

            # Scholar often responds with 429 or a 5xx when it dislikes automation.
            if r.status_code in (429, 503, 500, 502, 504):
                time.sleep(min(30, (2 ** attempt) + random.random()))
                continue

            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            time.sleep(min(30, (2 ** attempt) + random.random()))

    raise RuntimeError(f"fetch failed: {last_err}")

Where ProxiesAPI fits

If you use a ProxiesAPI endpoint that behaves like an outbound HTTP proxy, set proxy_url.

Be conservative with frequency even with proxies: Scholar may still challenge or block.
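One defensive habit worth layering on top of fetch: detect challenge pages before parsing, so a run stops early instead of silently writing empty rows. The marker strings below are assumptions based on commonly reported Google interstitials; verify them against a real blocked response you've captured before relying on them.

```python
def looks_blocked(html: str) -> bool:
    """Heuristic check for a Scholar challenge/captcha interstitial.

    The markers are assumptions (not an official list); confirm them
    against an actual blocked response before trusting this check.
    """
    lowered = html.lower()
    markers = (
        "unusual traffic",                 # classic Google rate-limit page
        "please show you're not a robot",  # captcha prompt wording
        'id="gs_captcha"',                 # captcha container seen on Scholar
    )
    return any(m in lowered for m in markers)
```

Call it right after fetch and back off (or stop the run) when it returns True.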


Step 2: Parse results (title, authors, year, citations)

Each result has a few consistent pieces:

  • a title link
  • a metadata line with authors and venue
  • a “Cited by N” link

We’ll parse these fields with BeautifulSoup.

import re
from bs4 import BeautifulSoup


def parse_int(text: str) -> int | None:
    m = re.search(r"(\d+)", text or "")
    return int(m.group(1)) if m else None


def extract_year(text: str) -> int | None:
    # Scholar snippets often contain a 4-digit year.
    m = re.search(r"\b(19\d{2}|20\d{2})\b", text or "")
    return int(m.group(1)) if m else None


def parse_scholar_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    out = []
    for r in soup.select("div.gs_r"):
        title_a = r.select_one("h3.gs_rt a")
        title = title_a.get_text(" ", strip=True) if title_a else None
        link = title_a.get("href") if title_a else None

        meta = r.select_one("div.gs_a")
        # normalize non-breaking spaces so the " - " author split below works
        meta_text = meta.get_text(" ", strip=True).replace("\xa0", " ") if meta else ""

        # gs_a usually looks like:
        # "A Author, B Author - Venue, 2021 - publisher.com"
        year = extract_year(meta_text)

        # Best-effort author split: authors are before the first '-' separator
        authors = None
        if meta_text and " - " in meta_text:
            authors = meta_text.split(" - ", 1)[0].strip()

        # Citations
        cited_by = 0
        for a in r.select("div.gs_fl a"):
            t = a.get_text(" ", strip=True)
            if t.lower().startswith("cited by"):
                cited_by = parse_int(t) or 0

        snippet = None
        snip = r.select_one("div.gs_rs")
        if snip:
            snippet = snip.get_text(" ", strip=True)

        out.append({
            "title": title,
            "link": link,
            "authors": authors,
            "meta": meta_text or None,
            "year": year,
            "cited_by": cited_by,
            "snippet": snippet,
        })

    return out

Step 3: Paginate with start= (0, 10, 20…)

from urllib.parse import urlencode

BASE = "https://scholar.google.com/scholar"


def build_search_url(query: str, start: int = 0) -> str:
    qs = urlencode({"q": query, "start": start})
    return f"{BASE}?{qs}"


def crawl_scholar(query: str, *, pages: int = 3, proxy_url: str | None = None) -> list[dict]:
    all_rows = []

    for p in range(pages):
        start = p * 10
        url = build_search_url(query, start=start)

        html = fetch(url, proxy_url=proxy_url)
        rows = parse_scholar_page(html)
        print(f"page {p+1}: rows {len(rows)}")

        all_rows.extend(rows)

        # an empty page usually means the end of results (or a block)
        if not rows:
            break

        # pacing matters on Scholar
        time.sleep(6 + random.random() * 3)

    return all_rows
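To confirm the URL builder produces what Scholar expects (urlencode turns spaces into + and appends the start offset):

```python
from urllib.parse import urlencode

BASE = "https://scholar.google.com/scholar"

def build_search_url(query: str, start: int = 0) -> str:
    return f"{BASE}?{urlencode({'q': query, 'start': start})}"

print(build_search_url("graph neural networks", start=10))
# https://scholar.google.com/scholar?q=graph+neural+networks&start=10
```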

Step 4: Export to CSV

import csv


def write_csv(rows: list[dict], path: str = "scholar_results.csv"):
    fields = ["title", "link", "authors", "year", "cited_by", "meta", "snippet"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k) for k in fields})


if __name__ == "__main__":
    proxy_url = None  # set ProxiesAPI proxy endpoint if you have one

    rows = crawl_scholar("graph neural networks", pages=2, proxy_url=proxy_url)
    print("total rows:", len(rows))

    write_csv(rows)
    print("wrote scholar_results.csv")

How to make this work in practice (without pain)

Scholar scraping breaks for predictable reasons:

  1. Too fast → 429 / captcha interstitial
  2. Too many pages for a query → block
  3. Same IP/user-agent pattern repeatedly → block

A pragmatic playbook:

  • Keep runs small (10–50 results)
  • Cache results so you don’t re-fetch the same pages every run
  • Use randomized delays
  • Use ProxiesAPI for better network hygiene when you need repeatability
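The caching point deserves a concrete sketch. One minimal approach (the file layout and function names here are our own, not part of any library) is to key cached HTML by a hash of the URL:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".scholar_cache")

def cache_path(url: str) -> Path:
    """One file per URL, named by the SHA-256 of the URL."""
    CACHE_DIR.mkdir(exist_ok=True)
    return CACHE_DIR / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")

def fetch_cached(url: str, fetch_fn) -> str:
    """Return cached HTML for url if present; otherwise call fetch_fn and store the result."""
    p = cache_path(url)
    if p.exists():
        return p.read_text(encoding="utf-8")
    html = fetch_fn(url)
    p.write_text(html, encoding="utf-8")
    return html
```

Wire it in by replacing fetch(url, ...) calls with fetch_cached(url, fetch) inside crawl_scholar; delete .scholar_cache when you want fresh data.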

QA checklist

  • Your parser returns ~10 results per page (varies)
  • cited_by matches the “Cited by” value
  • year is present for most results
  • CSV opens cleanly in Excel/Sheets

Next upgrades

  • Add a SQLite store keyed by link so you can track citations over time
  • For each result, visit the “Cited by” page to build a citation network (carefully)
  • Add error snapshots: save HTML when you get blocked so you can recognize interstitial pages
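For the SQLite upgrade, here is a minimal sketch of a store keyed by link plus a crawl timestamp, so repeated runs accumulate citation history (the schema and function names are illustrative, not prescriptive):

```python
import sqlite3
import time

def init_db(path: str = "scholar.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS citations (
               link        TEXT NOT NULL,
               observed_at INTEGER NOT NULL,   -- unix timestamp of the crawl
               cited_by    INTEGER,
               PRIMARY KEY (link, observed_at)
           )"""
    )
    return conn

def record_rows(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Insert one observation per result; rows without a link are skipped."""
    now = int(time.time())
    conn.executemany(
        "INSERT OR REPLACE INTO citations (link, observed_at, cited_by) VALUES (?, ?, ?)",
        [(r["link"], now, r.get("cited_by")) for r in rows if r.get("link")],
    )
    conn.commit()
```

Querying the history for one link then gives you cited_by over time, which is the raw material for tracking citation growth.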