Scrape Flight Prices from Google Flights (Python + ProxiesAPI)

Apr 17, 2026 · tutorial · #python, #google-flights, #web-scraping, #playwright, #proxies, #csv

Google Flights is one of the best places to see real-world airfare pricing — and one of the harder places to scrape reliably.

Why it’s harder:

it’s heavily JavaScript-driven (you won’t get useful HTML with plain requests)
the UI changes and can A/B test selectors
it’s sensitive to aggressive crawling (you’ll hit blocks faster than “friendly HTML” sites)

In this tutorial we’ll build a practical Google Flights scraper that:

opens a route (origin → destination)
selects departure date (optionally return)
extracts the visible price cards (airline, times, stops, price)
retries safely
exports to CSV/JSON
shows where ProxiesAPI fits in (without pretending it magically bypasses everything)

We’ll use Playwright + Python to render the page and scrape from the DOM.

Google Flights results page (we’ll scrape the flight result cards)

Keep flight price crawls stable with ProxiesAPI

Google Flights is a dynamic, high-change target. ProxiesAPI helps you spread load, rotate IPs, and keep your crawling layer reliable as you scale routes, dates, and refresh frequency.

Get 1,000 free API calls View pricing

Important note (read this first)

Google’s properties can enforce bot detection and rate limits. Scraping may violate a site’s Terms of Service depending on your use case.

This guide is for educational purposes and for building robust scraping engineering skills (timeouts, retries, data models, pacing). If you need guaranteed access for a production use-case, consider official/partner APIs.

What we’re scraping

We’ll target a Google Flights results URL pattern that opens a route + date.

In practice, Google Flights URLs can look like:

https://www.google.com/travel/flights?hl=en#flt=DEL.BOM.2026-05-10;c:INR;e:1;sd:1;t:f

The exact hash structure can vary, but you don’t need to fully reverse-engineer it if you automate the UI steps (fill origin/destination/date) and then scrape cards.

Data we’ll extract per flight

airline name
departure time
arrival time
number of stops
duration
price (currency + amount)

Setup

python -m venv .venv
source .venv/bin/activate
pip install playwright pandas
playwright install

Why Playwright:

it renders the JS app reliably
it gives you stable waiting primitives
it can take screenshots (mandatory for A-track)

Step 1: Launch a browser and open Google Flights

Create scrape_google_flights.py:

from __future__ import annotations

import csv
import json
import re
import time
from dataclasses import asdict, dataclass
from typing import Optional

from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout


@dataclass
class FlightResult:
    airline: Optional[str]
    depart_time: Optional[str]
    arrive_time: Optional[str]
    stops: Optional[str]
    duration: Optional[str]
    price: Optional[str]


def norm_space(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())


def safe_text(el) -> Optional[str]:
    if not el:
        return None
    try:
        return norm_space(el.inner_text())
    except Exception:
        return None


def goto_with_retries(page, url: str, tries: int = 3, base_sleep: float = 2.0) -> None:
    last = None
    for i in range(1, tries + 1):
        try:
            page.goto(url, wait_until="domcontentloaded", timeout=60_000)
            return
        except Exception as e:
            last = e
            time.sleep(base_sleep * i)
    raise last


def main():
    # Example route/date; you can parameterize these.
    url = "https://www.google.com/travel/flights?hl=en"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1400, "height": 900},
            locale="en-US",
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        )
        page = context.new_page()

        goto_with_retries(page, url)

        # Wait for the Flights UI shell.
        page.wait_for_timeout(2000)

        # Screenshot the landing page (useful for debugging)
        page.screenshot(path="google-flights-home.jpg", full_page=True)

        context.close()
        browser.close()


if __name__ == "__main__":
    main()

Run:

python scrape_google_flights.py

At this point you should have google-flights-home.jpg (a quick sanity check).

Step 2: Automate the UI (origin, destination, date)

Google Flights is a component-heavy UI. The safest approach is:

click the “Where from?” field and type an airport/city code (e.g. DEL)
select the first suggestion
do the same for “Where to?” (BOM)
open the date picker and choose a departure date

Selectors can change, so we’ll rely on accessible labels/roles as much as possible.

from playwright.sync_api import expect


def set_route_and_date(page, origin: str, dest: str, depart_yyyy_mm_dd: str):
    # Open origin field
    # Note: labels can vary by locale; keep your locale en-US.
    page.get_by_role("combobox").first.click()

    # Google Flights often has multiple comboboxes; we search by placeholder-ish text.
    # Fallback: click text.
    try:
        page.get_by_text("Where from?").click(timeout=2000)
    except Exception:
        pass

    # Fill origin
    page.keyboard.press("Control+A")
    page.keyboard.type(origin, delay=50)
    page.wait_for_timeout(800)
    page.keyboard.press("Enter")

    # Fill destination
    try:
        page.get_by_text("Where to?").click(timeout=2000)
    except Exception:
        # Tab often moves focus from origin to destination
        page.keyboard.press("Tab")

    page.keyboard.press("Control+A")
    page.keyboard.type(dest, delay=50)
    page.wait_for_timeout(800)
    page.keyboard.press("Enter")

    # Open date picker
    # In many builds there’s a button with text like "Departure".
    page.get_by_text("Departure").click(timeout=5000)

    # Choose date. Google uses aria-labels like "May 10, 2026".
    y, m, d = depart_yyyy_mm_dd.split("-")
    # Build a best-effort label; you may need to adapt for month names.
    # For reliability, you can instead navigate the calendar UI.
    target = depart_yyyy_mm_dd

    # Try clicking a cell containing the day number after ensuring correct month is visible.
    # Minimal approach: click the day number; for real production use, implement month nav.
    page.get_by_role("grid").get_by_text(str(int(d))).first.click(timeout=10_000)

    # Close date picker
    page.get_by_role("button", name="Done").click(timeout=5000)

    # Trigger search if required
    try:
        page.get_by_role("button", name=re.compile("Search", re.I)).click(timeout=3000)
    except Exception:
        pass

    # Wait for results to load
    page.wait_for_timeout(4000)

This is intentionally a “good enough” automation scaffold. In production you’d:

implement calendar month navigation
harden selectors using page.get_by_role(..., name=...)
add detection for “no flights found” and for block/interstitial pages

Step 3: Extract flight cards from the results

Once results load, you’ll typically see a scrollable list of cards.

We’ll use CSS selectors that work on many builds:

container list: div[role="list"] or similar
card: often div[role="listitem"]

Then read text blocks and parse.


def parse_results(page, limit: int = 25) -> list[FlightResult]:
    out: list[FlightResult] = []

    # Best-effort: wait for list items
    try:
        page.wait_for_selector("div[role='listitem']", timeout=20_000)
    except PWTimeout:
        return out

    cards = page.locator("div[role='listitem']")
    n = min(cards.count(), limit)

    for i in range(n):
        card = cards.nth(i)

        # These subselectors are heuristic; verify with Playwright inspector.
        airline = safe_text(card.locator("div[aria-label*='Airline']").first)
        if not airline:
            # Fallback: first strong-ish text line
            airline = safe_text(card.locator("span").first)

        times = safe_text(card.locator("span").filter(has_text=re.compile(r"\d{1,2}:\d{2}"))).first)
        price = safe_text(card.locator("span").filter(has_text=re.compile(r"₹|\$|€"))).first)

        # Duration/stops often show like "2 hr 5 min" and "Nonstop" / "1 stop"
        duration = safe_text(card.locator("span").filter(has_text=re.compile(r"hr|min"))).first)
        stops = safe_text(card.locator("span").filter(has_text=re.compile(r"Nonstop|stop", re.I))).first)

        out.append(FlightResult(
            airline=airline,
            depart_time=times,  # keep raw; you can split later
            arrive_time=None,
            stops=stops,
            duration=duration,
            price=price,
        ))

    return out

Because the UI changes, treat these selectors as a starting point, not a promise.

The key engineering pattern is stable:

wait for an element that means “results exist”
locate cards
extract a small set of fields
log failures and keep going

Step 4: End-to-end script (route → scrape → export)

Here’s a full script you can run and adapt:

from __future__ import annotations

import csv
import json
from dataclasses import asdict

from playwright.sync_api import sync_playwright


def export_csv(rows, path: str):
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        w.writeheader()
        for r in rows:
            w.writerow(r)


def export_json(rows, path: str):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def run(origin: str, dest: str, depart: str):
    url = "https://www.google.com/travel/flights?hl=en"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(viewport={"width": 1400, "height": 900}, locale="en-US")
        page = context.new_page()

        page.goto(url, wait_until="domcontentloaded", timeout=60_000)
        page.wait_for_timeout(2000)

        set_route_and_date(page, origin=origin, dest=dest, depart_yyyy_mm_dd=depart)

        # Mandatory screenshot (A-track)
        page.screenshot(path="google-flights-results.jpg", full_page=True)

        results = parse_results(page, limit=30)
        rows = [asdict(r) for r in results]

        export_csv(rows, "flights.csv")
        export_json(rows, "flights.json")

        context.close()
        browser.close()

    return rows


if __name__ == "__main__":
    data = run("DEL", "BOM", "2026-05-10")
    print("scraped", len(data), "results")

You now have:

google-flights-results.jpg
flights.csv
flights.json

Where ProxiesAPI fits

For a JS-heavy site like Google Flights, you typically use proxies at one of two layers:

Browser traffic (Playwright running through a proxy)
Pre-/post-flight requests (e.g. downloading auxiliary HTML/JSON endpoints)

Option A: Run Playwright through a proxy

Playwright supports proxy settings at browser launch.

If ProxiesAPI gives you an HTTP proxy endpoint, you can pass it like:

browser = p.chromium.launch(
    headless=True,
    proxy={
        "server": "http://YOUR_PROXIESAPI_PROXY:PORT",
        "username": "YOUR_USER",
        "password": "YOUR_PASS",
    },
)

Keep it honest:

proxies don’t remove the need for pacing
rotating IPs helps reduce correlated failures
if you slam the site with browser sessions, you’ll still get blocked

Option B: Use ProxiesAPI as the fetch layer (when HTML is available)

On sites where requests works, a proxy-backed fetch is simple:

import requests

r = requests.get(
    "https://example.com",
    proxies={"http": "http://YOUR_PROXY", "https": "http://YOUR_PROXY"},
    timeout=(10, 30),
)

Google Flights specifically is not a great candidate for plain-HTML scraping — hence Playwright.

Reliability checklist

Use a real viewport + locale (reduce UI variability)
Use timeouts everywhere (no hanging)
Add retries around navigation and selector waits
Keep request rate low (sleep between routes)
Take screenshots when selectors fail (debug faster)

Next upgrades

make the calendar selection deterministic (month navigation + exact date labels)
scroll the results list and collect more than the first screen
run multiple routes/dates with a queue (and a backoff policy)
store results in SQLite so refresh runs only update deltas

Keep flight price crawls stable with ProxiesAPI

Google Flights is a dynamic, high-change target. ProxiesAPI helps you spread load, rotate IPs, and keep your crawling layer reliable as you scale routes, dates, and refresh frequency.

Get 1,000 free API calls View pricing

Show how to extract Vinted listing titles, prices, brands, sizes, and image URLs from the public catalog with real selectors and a screenshot.

tutorial#python#vinted#web-scraping

Scrape IMDb Top Box Office and Release Data with Python

Collect the live IMDb Top Box Office chart into a clean dataset with title URLs, weekend gross, total gross, and weeks released. Includes a real screenshot and a Playwright scraper wired for ProxiesAPI.

tutorial#python#imdb#box-office

Scrape Trustpilot Company Search Results and Ratings with Python

Pull company names, star ratings, review counts, profile URLs, and visible location text from Trustpilot search results. Includes a real screenshot and a Playwright scraper that can run through ProxiesAPI.

tutorial#python#trustpilot#reviews

Scrape Numbeo Quality of Life Index by City with Python

Extract Numbeo's city-level quality-of-life scores, safety, traffic, pollution, and climate indicators into a clean dataset with Python and ProxiesAPI.

tutorial#python#numbeo#web-scraping

Scrape Flight Prices from Google Flights (Python + ProxiesAPI)

Related guides