How to Scrape Google Search Results with Python
If you want to scrape Google with Python, the hard part is not writing requests.get(). The hard part is handling all the ways Google SERPs differ by country, query, consent state, result modules, and anti-bot checks.
So the right goal is not a parser that works perfectly forever. The right goal is a defensive workflow that:
- fetches result pages with retries
- detects obvious blocks and interstitials
- extracts organic titles, URLs, and snippets
- validates output before you trust it
- keeps the proxy layer separate from the parser
Important: review Google’s terms for your use case. If you need guaranteed, high-volume SERP data, a dedicated SERP provider is usually a better operational fit than DIY scraping.
Google results pages shift often and block aggressively. ProxiesAPI will not solve parser quality for you, but it does make retries and IP rotation much easier once you are testing this at meaningful volume.
What changed in 2026
Google now mixes classic organic results with more AI-heavy modules and richer answer surfaces. That means the page can contain:
- ads
- AI or answer modules
- “People also ask”
- videos
- local packs
- traditional organic links
If your script grabs the first a[href] from each block, it will produce junk. The parser has to be selective.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We will use:
- requests for HTTP
- BeautifulSoup for parsing
- CSV/JSON export for inspection
Step 1: Fetch pages and detect obvious blocking
import os
import random
import time
from urllib.parse import quote_plus

import requests

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
MAX_RETRIES = 5
UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/126.0 Safari/537.36"
)

session = requests.Session()
session.headers.update(
    {
        "User-Agent": UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def proxiesapi_url(target_url: str) -> str:
    # Without PROXIESAPI_KEY set, the request goes straight to the target URL.
    key = os.environ.get("PROXIESAPI_KEY")
    if not key:
        return target_url
    return f"http://api.proxiesapi.com/?auth_key={key}&url={quote_plus(target_url)}"


def looks_blocked(html: str) -> bool:
    # Cheap text check for Google's block page and verification interstitials.
    text = (html or "").lower()
    return any(
        phrase in text
        for phrase in [
            "our systems have detected unusual traffic",
            "/sorry/index",
            "to continue, please verify",
            "captcha",
        ]
    )


def fetch(url: str) -> str:
    last_error = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = session.get(proxiesapi_url(url), timeout=TIMEOUT)
            if response.status_code in (429, 503):
                raise RuntimeError(f"transient status {response.status_code}")
            response.raise_for_status()
            html = response.text or ""
            if looks_blocked(html):
                raise RuntimeError("google block or interstitial detected")
            return html
        except Exception as exc:
            last_error = exc
            if attempt == MAX_RETRIES:
                break
            # Exponential backoff capped at 30 seconds, plus a little jitter.
            time.sleep(min(30, 2 ** (attempt - 1)) + random.uniform(0, 0.7))
    raise RuntimeError(f"failed to fetch SERP: {last_error}")
The point of this function is not to brute-force Google. It is to fail clearly when you are blocked, instead of silently parsing garbage.
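To make the retry timing concrete, here is a small sketch that reproduces only the backoff arithmetic from fetch. The helper names backoff_delay and backoff_schedule are illustrative and not part of the original script; the cap, jitter range, and retry count are copied from the code above.

```python
import random

MAX_RETRIES = 5  # same value as in the fetch code


def backoff_delay(attempt: int, cap: float = 30.0, jitter: float = 0.7) -> float:
    """Seconds to sleep after a failed attempt: capped exponential plus jitter."""
    base = min(cap, 2 ** (attempt - 1))
    return base + random.uniform(0, jitter)


def backoff_schedule(max_retries: int = MAX_RETRIES) -> list[int]:
    """Base delays (without jitter) slept after every failed attempt but the last."""
    return [min(30, 2 ** (attempt - 1)) for attempt in range(1, max_retries)]


if __name__ == "__main__":
    # With MAX_RETRIES = 5 there are four sleeps between the five attempts.
    print(backoff_schedule())
```

With five attempts the base delays are 1, 2, 4, and 8 seconds, so a fully blocked query gives up after roughly 15 seconds of waiting plus request time. The 30-second cap only matters if you raise MAX_RETRIES above 6.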
Step 2: Generate a predictable search URL
def google_search_url(query: str, start: int = 0, hl: str = "en", gl: str = "us", num: int = 10) -> str:
    return (
        "https://www.google.com/search?"
        f"q={quote_plus(query)}&start={start}&num={num}&hl={hl}&gl={gl}&pws=0"
    )
These parameters help keep tests more stable:
- hl=en sets the interface language
- gl=us nudges geography
- pws=0 reduces personalization
They do not make SERPs perfectly deterministic, but they reduce some noise.
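If you prefer not to assemble the query string by hand, the same URL can be built with urllib.parse.urlencode, which escapes every value and not only the query. This is a sketch of an equivalent builder; the name build_search_url is hypothetical, and the parameters are the ones used above.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_url(query: str, start: int = 0, hl: str = "en", gl: str = "us", num: int = 10) -> str:
    """Build the search URL from a dict so every value is URL-encoded."""
    params = {"q": query, "start": start, "num": num, "hl": hl, "gl": gl, "pws": 0}
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    url = build_search_url("best web scraping tools", start=10)
    print(url)
    # Round-trip the query string to confirm what will actually be sent.
    print(parse_qs(urlparse(url).query))
```

Printing the parsed parameters before sending anything is a cheap way to catch a wrong start offset or a missing hl/gl value while you are still testing.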
Step 3: Parse organic results defensively
from bs4 import BeautifulSoup
from urllib.parse import urlparse

GOOGLE_DOMAINS = ("google.com", "googleusercontent.com")


def is_google_internal(href: str | None) -> bool:
    if not href:
        return True
    if href.startswith("/"):
        return True
    host = urlparse(href).netloc.lower()
    # Match the domain itself or a subdomain, so "notgoogle.com" is not caught.
    return any(host == d or host.endswith("." + d) for d in GOOGLE_DOMAINS)


def parse_serp(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    scope = soup.select_one("div#search") or soup
    rows = []
    seen = set()
    for block in scope.select("div"):
        # A candidate result needs both a link and an h3 title.
        link = block.select_one("a[href]")
        title_el = block.select_one("h3")
        if not link or not title_el:
            continue
        href = link.get("href")
        if is_google_internal(href):
            continue
        title = title_el.get_text(" ", strip=True)
        # Snippet class names drift over time; treat the snippet as optional.
        snippet_el = block.select_one("div.VwiC3b") or block.select_one("span.aCOpRe")
        snippet = snippet_el.get_text(" ", strip=True) if snippet_el else None
        # Nested divs match the same result repeatedly, so dedupe by URL.
        if not title or href in seen:
            continue
        seen.add(href)
        rows.append(
            {
                "title": title,
                "url": href,
                "snippet": snippet,
            }
        )
    return rows
The two most important filters are:
- require an h3
- ignore internal Google URLs
Those two checks remove a surprising amount of junk.
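One case the internal-URL filter can hide: some Google response variants have wrapped outbound links in a relative redirect of the form /url?q=<target>, which a "starts with /" check discards. Whether you see this depends on the markup variant you are served, so treat the following as an optional sketch under that assumption. The helper names unwrap_redirect and normalize_href are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def unwrap_redirect(href: str) -> str:
    """Return the target of a '/url?q=...' style link, else href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        params = parse_qs(parsed.query)
        target = params.get("q") or params.get("url")
        if target:
            return target[0]
    return href


def normalize_href(href: str) -> Optional[str]:
    """Unwrap a redirect link and keep it only if it is an absolute http(s) URL."""
    target = unwrap_redirect(href)
    parsed = urlparse(target)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return target
    return None


if __name__ == "__main__":
    print(normalize_href("/url?q=https://example.com/page&sa=U"))
    print(normalize_href("/search?q=more+results"))
```

If you adopt something like this, run it on the href before the internal-URL check, so that unwrapped targets are still filtered when they point back at Google.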
Step 4: Paginate and export
import csv
import json


def crawl_query(query: str, pages: int = 2) -> list[dict]:
    all_rows = []
    seen = set()
    for page in range(pages):
        url = google_search_url(query, start=page * 10)
        html = fetch(url)
        batch = parse_serp(html)
        for row in batch:
            if row["url"] in seen:
                continue
            seen.add(row["url"])
            all_rows.append(row)
    return all_rows


def write_csv(path: str, rows: list[dict]) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url", "snippet"])
        writer.writeheader()
        for row in rows:
            writer.writerow(row)


def write_json(path: str, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    rows = crawl_query("best web scraping tools", pages=2)
    write_csv("google_serp.csv", rows)
    write_json("google_serp.json", rows)
    print(f"wrote {len(rows)} results")
Always inspect a sample of the output manually before scaling. SERP scraping is one of those jobs where “ran without crashing” does not mean “data is correct.”
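Manual inspection can be backed by a few automated sanity checks on the rows. The sketch below assumes the row shape produced by the parser (title, url, snippet). The function name validate_rows and the thresholds (at least 5 results, at most half of the snippets missing) are arbitrary illustrative choices, not recommendations from any library.

```python
from urllib.parse import urlparse


def validate_rows(rows: list[dict], min_results: int = 5) -> list[str]:
    """Return a list of human-readable problems; an empty list means no red flags."""
    problems = []
    if len(rows) < min_results:
        problems.append(f"only {len(rows)} results, expected at least {min_results}")
    for i, row in enumerate(rows):
        title = (row.get("title") or "").strip()
        url = row.get("url") or ""
        parsed = urlparse(url)
        if not title:
            problems.append(f"row {i}: empty title")
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"row {i}: not an absolute http(s) URL: {url!r}")
    # Many missing snippets usually means the snippet selectors have drifted.
    missing = sum(1 for row in rows if not row.get("snippet"))
    if rows and missing / len(rows) > 0.5:
        problems.append(f"{missing}/{len(rows)} rows have no snippet")
    return problems


if __name__ == "__main__":
    sample = [{"title": "", "url": "/search?q=x", "snippet": None}]
    for problem in validate_rows(sample):
        print(problem)
```

Calling this on the output of each crawl and refusing to write files when the list is non-empty turns "ran without crashing" into a slightly stronger signal.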
DIY scraping vs using a SERP API
The real choice is operational, not ideological.
| Approach | Best for | Main downside |
|---|---|---|
| DIY Python scraper | experiments, low-volume research, learning | brittle selectors and blocks |
| SERP API/provider | production SEO pipelines, scale, geo variation | extra cost |
If you only need occasional snapshots, Python is fine. If your business depends on stable SERP data every day, a provider is usually cheaper than babysitting breakages.
Practical advice that saves time
- Cache HTML while debugging selectors.
- Keep request volume low while testing.
- Validate that URLs are external and titles are plausible.
- Expect markup drift and write parser fallbacks early.
- Treat ProxiesAPI as transport help, not a substitute for clean parsing.
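The first tip, caching HTML while debugging selectors, can be sketched as a thin wrapper around whatever fetch function you use. The names cached_fetch, cache_path, and the .serp_cache directory are hypothetical. The fetcher is passed in as a callable so the cache stays independent of the transport.

```python
import hashlib
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(".serp_cache")  # illustrative location


def cache_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Map a URL to a stable file name based on a hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.html"


def cached_fetch(url: str, fetcher: Callable[[str], str], cache_dir: Path = CACHE_DIR) -> str:
    """Return cached HTML if present; otherwise call fetcher once and store the result."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetcher(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html


if __name__ == "__main__":
    # Stand-in fetcher for demonstration; swap in your real fetch function.
    html = cached_fetch("https://example.com/", lambda u: "<html>demo</html>")
    print(len(html))
```

While iterating on selectors, every parser run then reads from disk and sends no new requests. Delete the cache directory when you want fresh pages.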
That is the honest version of Google scraping. It is possible, but it rewards defensive engineering much more than clever one-liners.