Scrape Flight Prices from Google Flights (Python + ProxiesAPI)
Google Flights is one of those pages everyone wants to scrape:
- fast market research (route demand, typical price bands)
- alerts + monitoring (price drops)
- building a routes → prices dataset for analysis
It’s also one of the places that will punish naive scraping quickly.
In this tutorial, we’ll take an honest, production-first approach:
- verify the target page in a real browser first (so you know what you’re scraping)
- fetch the HTML reliably (timeouts, retries, stable headers)
- parse whatever the server returns (and fail loudly when it’s not parseable)
- export a clean dataset
- show where ProxiesAPI fits in (network stability + IP rotation)

Flight pricing pages are high-friction targets (rate limits, bot detection, and location variance). ProxiesAPI helps you rotate egress IPs and keep your crawl’s network layer consistent as volume grows.
Important reality check (Google Flights is JS-heavy)
Google Flights is largely rendered client-side, and its HTML can vary by:
- geo / locale
- device hints (headers, viewport)
- cookies / consent
- bot detection
That means there are two common scraping paths:
- Path A (HTML parsing): works sometimes for lightweight extraction when the server returns usable HTML.
- Path B (browser automation): Playwright/Selenium, extracting DOM after JS runs.
This guide focuses on Path A (requests + parsing) because it’s cheaper, faster, and good for many datasets.
If you consistently get empty HTML / interstitials, jump to the “When to switch to Playwright” section.
What we’re scraping
A typical Google Flights “explore / search results” view shows cards with:
- airline / itinerary summary
- departure/arrival times
- duration and stops
- price (the key)
Our goal is a dataset like:
```json
{
  "from": "BOM",
  "to": "DEL",
  "depart_date": "2026-05-05",
  "return_date": null,
  "currency": "INR",
  "price": 6123,
  "raw_price_text": "₹6,123",
  "scraped_at": "2026-04-18T16:00:00Z",
  "source_url": "https://www.google.com/travel/flights?..."
}
```
Setup
```shell
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dateutil
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` with the `lxml` parser (more forgiving than `html.parser`)
- `tenacity` for retries (with backoff)
Step 1: Build a stable fetch() (timeouts, retries, headers)
Even before proxies, do the basics:
- timeouts so you don’t hang
- retries with exponential backoff
- a realistic User-Agent
- consistent Accept-Language
```python
import os
import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

TIMEOUT = (10, 40)  # (connect, read) seconds

USER_AGENTS = [
    # Keep a short rotation of real desktop UAs.
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str
    final_url: str

def make_session() -> requests.Session:
    s = requests.Session()
    s.headers.update(
        {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "no-cache",
            "Pragma": "no-cache",
        }
    )
    return s

@retry(stop=stop_after_attempt(4), wait=wait_exponential_jitter(initial=1, max=12))
def fetch_html(session: requests.Session, url: str, proxies: Optional[dict] = None) -> FetchResult:
    # Light UA rotation per request.
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)
    return FetchResult(url=url, status_code=r.status_code, text=r.text, final_url=str(r.url))
```
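To build intuition for the retry schedule, here is a stdlib-only sketch that roughly approximates what `wait_exponential_jitter(initial=1, max=12)` produces — exponential growth, a cap, and a little randomness. This is an illustration for intuition, not tenacity’s exact algorithm:

```python
import random

def backoff_delays(attempts: int, initial: float = 1.0, cap: float = 12.0) -> list[float]:
    # Exponential base delay (initial * 2**n), capped at `cap`, plus up to
    # 1 second of jitter -- a rough stand-in for wait_exponential_jitter.
    delays = []
    for n in range(attempts):
        base = min(initial * (2 ** n), cap)
        delays.append(base + random.uniform(0, 1))
    return delays

print(backoff_delays(4))  # e.g. roughly [1.x, 2.x, 4.x, 8.x]
```

The jitter matters at scale: without it, a fleet of workers that all failed at the same moment retries at the same moment too.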
Step 2: Construct a Google Flights URL (practical approach)
Google’s flight URLs are not a stable public API.
The most reliable “engineering” workflow is:
- open Google Flights in a browser
- perform your search (route + date)
- copy the resulting URL
- parameterize the parts you control (origin/destination/dates) in your own code
For demo purposes, we’ll keep it simple: you provide a template URL for each route/date.
Example (yours will differ):
https://www.google.com/travel/flights?hl=en&gl=US&curr=USD#flt=BOM.DEL.2026-05-05;c:INR;e:1;sd:1;t:f
Notes:
- `hl` affects language
- `gl` affects region
- `curr` affects currency display
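One way to parameterize the copied URL is a tiny builder function. The fragment format below mirrors the example URL above as observed in a browser session — it is not a stable contract, and Google can change it at any time, so treat this as a hypothetical sketch:

```python
def build_flights_url(
    origin: str,
    dest: str,
    depart_date: str,  # "YYYY-MM-DD"
    hl: str = "en",
    gl: str = "US",
    curr: str = "USD",
) -> str:
    # Template captured from a manual browser search; NOT a stable public API.
    # If Google changes the fragment format, re-copy a fresh URL and update this.
    return (
        f"https://www.google.com/travel/flights?hl={hl}&gl={gl}&curr={curr}"
        f"#flt={origin}.{dest}.{depart_date};c:{curr};e:1;sd:1;t:f"
    )

print(build_flights_url("BOM", "DEL", "2026-05-05"))
```

Keeping the template in one function means a format change breaks exactly one place in your code.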
Step 3: Parse prices from the returned HTML
When Google returns parseable HTML, you’ll often see price text in the response.
Instead of betting on one brittle selector, we use a layered strategy:
- look for common “₹123” / “$123” price-like strings in visible text
- optionally, try a few selectors (if present)
- keep the raw extraction evidence so you can debug quickly
```python
import re

from bs4 import BeautifulSoup

PRICE_RE = re.compile(r"(?:(₹|\$|€|£)\s?)([0-9][0-9,\.]+)")

def parse_prices(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Quick sanity check: if we got an interstitial, bail loudly.
    title = (soup.title.get_text(strip=True) if soup.title else "").lower()
    if "unusual traffic" in title or "sorry" in title:
        raise RuntimeError(f"Blocked/interstitial detected: title={title!r}")

    text = soup.get_text("\n", strip=True)

    matches = []
    for m in PRICE_RE.finditer(text):
        currency = m.group(1)
        raw = f"{currency}{m.group(2)}"
        # Normalize "6,123" -> 6123.
        num = m.group(2).replace(",", "")
        try:
            value = int(float(num))
        except ValueError:
            continue
        matches.append({"currency": currency, "raw_price_text": raw, "price": value})

    # De-dupe while preserving order.
    seen = set()
    out = []
    for x in matches:
        key = (x["currency"], x["price"])
        if key in seen:
            continue
        seen.add(key)
        out.append(x)
    return out
```
This is intentionally conservative: Google Flights can include many prices on the page (filters, “typical prices”, etc.).
So in production you usually refine extraction by scoping to a section of the DOM or by using browser automation.
For an MVP dataset, you can take the lowest observed price as the “from price” signal.
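It is worth sanity-checking the price regex in isolation before pointing it at real HTML. This standalone snippet runs it against a made-up text sample; note that the pattern requires at least two characters after the symbol, so a single-digit price like “$5” will not match:

```python
import re

# Same pattern as in parse_prices above.
PRICE_RE = re.compile(r"(?:(₹|\$|€|£)\s?)([0-9][0-9,\.]+)")

# Invented sample text, loosely imitating visible page copy.
sample = "Nonstop from ₹6,123 · 1 stop from $89 · Typical price €120"

found = [(m.group(1), m.group(2)) for m in PRICE_RE.finditer(sample)]
print(found)  # [('₹', '6,123'), ('$', '89'), ('€', '120')]
```

If a page uses a currency format this pattern misses (e.g. “INR 6,123” or “6 123 kr”), extend the regex rather than post-processing around it.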
Step 4: End-to-end scrape for a single search
```python
from datetime import datetime, timezone

def scrape_search(url: str, route: dict, proxies: dict | None = None) -> dict:
    session = make_session()
    res = fetch_html(session, url, proxies=proxies)
    prices = parse_prices(res.text)
    if not prices:
        return {
            **route,
            "source_url": url,
            "final_url": res.final_url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "ok": False,
            "error": "No prices found in HTML. This is common on JS-rendered pages.",
        }
    best = min(prices, key=lambda x: x["price"])
    return {
        **route,
        "source_url": url,
        "final_url": res.final_url,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "ok": True,
        "currency": best["currency"],
        "price": best["price"],
        "raw_price_text": best["raw_price_text"],
        "samples": prices[:25],
    }
```
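One gap between the parser and the target schema: `parse_prices` yields a currency symbol (“₹”), while the dataset example stores an ISO code (“INR”). A minimal mapping closes that gap — the table below is my assumption, not a complete one, and “$” is ambiguous across USD/CAD/AUD and others, so only trust it when the request also pinned `gl`/`curr`:

```python
# Minimal symbol -> ISO 4217 mapping; an assumption for this tutorial, not a
# complete table. "$" is ambiguous (USD/CAD/AUD/...), so only rely on it when
# the request URL also pinned gl/curr.
SYMBOL_TO_ISO = {"₹": "INR", "$": "USD", "€": "EUR", "£": "GBP"}

def to_iso_currency(symbol: str) -> str:
    # Fall back to the raw symbol so unknown currencies stay visible in the data.
    return SYMBOL_TO_ISO.get(symbol, symbol)

print(to_iso_currency("₹"))  # INR
```

Falling back to the raw symbol (rather than raising) keeps unknown currencies in the dataset where you can spot and triage them.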
Step 5: Add ProxiesAPI (honestly)
ProxiesAPI is useful here for one reason: Google will rate-limit / block by IP once you scale beyond casual browsing.
What ProxiesAPI does not do:
- it doesn’t magically turn a JS app into server-rendered HTML
- it doesn’t bypass all bot checks
What it can do:
- rotate egress IPs
- reduce correlation between requests
- keep your crawler from dying when one IP gets throttled
Using ProxiesAPI with requests
You’ll typically configure a proxy endpoint (HTTP/HTTPS) and pass it via proxies=.
Example pattern (adjust to your ProxiesAPI credentials and endpoint):
```python
import os

PROXIESAPI_PROXY = os.getenv("PROXIESAPI_PROXY_URL")

def proxiesapi_dict() -> dict | None:
    if not PROXIESAPI_PROXY:
        return None
    return {
        "http": PROXIESAPI_PROXY,
        "https": PROXIESAPI_PROXY,
    }

route = {"from": "BOM", "to": "DEL", "depart_date": "2026-05-05", "return_date": None}
url = "https://www.google.com/travel/flights?hl=en&gl=US&curr=USD#flt=BOM.DEL.2026-05-05;c:INR;e:1;sd:1;t:f"

row = scrape_search(url, route, proxies=proxiesapi_dict())
print(row["ok"], row.get("price"), row.get("error"))
```
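To run the snippet above, export the proxy URL first. The endpoint shape and filename below are placeholders — substitute the host, port, and credentials from your own ProxiesAPI dashboard:

```shell
# Placeholder values -- use your actual ProxiesAPI endpoint and credentials.
export PROXIESAPI_PROXY_URL="http://USERNAME:PASSWORD@proxy.example.com:8080"
python scrape_flights.py
```

Keeping the credential in an environment variable (rather than in code) means it stays out of version control and can differ per environment.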
Crawl multiple routes/dates + export CSV
```python
import csv
import json

def export_csv(rows: list[dict], path: str = "flights_prices.csv"):
    if not rows:
        return
    keys = sorted({k for r in rows for k in r.keys() if k not in {"samples"}})
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=keys)
        w.writeheader()
        for r in rows:
            rr = dict(r)
            rr.pop("samples", None)
            w.writerow(rr)

def export_json(rows: list[dict], path: str = "flights_prices.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

routes = [
    {
        "from": "BOM",
        "to": "DEL",
        "depart_date": "2026-05-05",
        "return_date": None,
        "url": "https://www.google.com/travel/flights?hl=en&gl=US&curr=USD#flt=BOM.DEL.2026-05-05;c:INR;e:1;sd:1;t:f",
    },
    # Add more rows here.
]

rows = []
for r in routes:
    row = scrape_search(
        r["url"],
        {k: r[k] for k in ["from", "to", "depart_date", "return_date"]},
        proxies=proxiesapi_dict(),
    )
    rows.append(row)
    time.sleep(random.uniform(2.0, 5.0))

export_csv(rows)
export_json(rows)
print("done", len(rows))
```
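The CSV exporter makes two small decisions worth understanding: headers are the sorted union of keys across all rows (so rows with different error/success shapes still line up), and the bulky `samples` list is dropped. You can check both in memory with `io.StringIO` before writing real files — the row below is invented for illustration:

```python
import csv
import io

# Invented row with the same shape scrape_search produces.
rows = [{"from": "BOM", "to": "DEL", "price": 6123, "ok": True, "samples": [1, 2, 3]}]

# Same key-selection logic as export_csv: union of keys, minus "samples", sorted.
keys = sorted({k for r in rows for k in r.keys() if k not in {"samples"}})

buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=keys)
w.writeheader()
for r in rows:
    rr = dict(r)
    rr.pop("samples", None)
    w.writerow(rr)

print(buf.getvalue().splitlines()[0])  # from,ok,price,to
```

If a later row carries a key the header lacks (or vice versa), `DictWriter` raises or writes blanks, which is exactly why the header is computed from all rows up front.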
When to switch to Playwright (and still use ProxiesAPI)
If most of your requests produce:
- empty pages
- consent pages
- “unusual traffic” interstitials
- HTML with no price content
…then you need a browser automation layer.
A pragmatic setup is:
- Playwright to render the page and query the DOM
- ProxiesAPI to provide stable proxy routing per browser context
(That’s a separate guide, but this is the escalation path that works.)
QA checklist
- You can fetch the URL with realistic headers + timeouts
- You detect interstitials and fail loudly
- You persist raw evidence (`final_url`, sample prices)
- You rate-limit and jitter requests
- You use ProxiesAPI only as the network stability layer — not as a “magic bypass”