How to Scrape Google Flights Prices with Python (Routes, Dates, and Price Quotes)
Google Flights is one of the best “real world” scraping targets because it’s a high-value dataset (prices change constantly) and it’s also a site that will punish sloppy scrapers.
In this tutorial we’ll build a production-minded Python scraper that:
- captures a shareable Google Flights results URL (you choose the route + dates)
- fetches HTML safely (timeouts, retries, and a session)
- parses flight result cards into structured data (airline, times, duration, stops, price)
- exports JSON you can use for alerts, dashboards, or analysis
- shows where ProxiesAPI fits when you scale beyond “a few manual checks”
We’ll also walk through the anatomy of the results page so you can match the selectors against what you see in your own browser.

Google surfaces anti-bot defenses quickly when you scale beyond a handful of requests. ProxiesAPI gives you a clean proxy layer (rotation + reputation) so your scraper can keep running without burning a single IP.
Important note (what we are and aren’t doing)
Google Flights is heavily dynamic and personalized. There are many ways to “scrape Google Flights”, and some are brittle or cross lines you might not want to cross.
This guide focuses on a pragmatic, ethical approach:
- You generate a results page (route + dates) in your browser.
- You use a share URL that loads a results page.
- We fetch the HTML and extract the visible quote cards.
If you need deep automation (searching thousands of date combinations), treat this as the baseline and then add:
- caching
- queueing + backoff (a minimal sketch follows this list)
- incremental refresh
- stronger fingerprinting defenses (often via a real browser)
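As a flavor of the queueing piece, here is a minimal sketch: a FIFO of result URLs where a blocked fetch is pushed back with a growing delay. It leans on the fetch_html() helper we build in Step 2; the round limit and delay values are arbitrary choices, not a recommendation.

import time
from collections import deque

def drain_queue(urls, max_rounds=3):
    queue = deque((u, 0) for u in urls)  # (url, attempt)
    results = {}
    while queue:
        url, attempt = queue.popleft()
        try:
            results[url] = fetch_html(url)  # defined in Step 2
        except Exception:
            # tenacity has already retried this URL; re-queue it with a wider gap
            if attempt + 1 < max_rounds:
                time.sleep(2 ** attempt * 5)  # 5s, 10s, 20s, ...
                queue.append((url, attempt + 1))
    return results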
What we’re scraping (page anatomy)
When you open Google Flights results, you’ll typically see a list of options. Each option contains:
- a price (e.g. “₹24,531”)
- departure/arrival times
- airline(s)
- duration and stops
Google’s HTML structure changes often and its class names are machine-generated, so instead of hardcoding one brittle selector, we’ll:
- Locate result “cards” by looking for repeating blocks that contain a price
- Extract fields using relative selectors within each card
- Keep the parser tolerant of missing fields
This is the same approach you’ll use on most complex sites: identify a repeated item container, then parse inside it.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
We’ll use:
- requests for HTTP
- BeautifulSoup (lxml) for parsing
- tenacity for retries with backoff
Step 1: Get a Google Flights share URL
- Go to https://www.google.com/travel/flights
- Enter your origin, destination, and dates
- Apply any filters you care about (e.g. “1 stop or fewer”)
- Copy the URL from the address bar
Tip: If the URL is extremely long, that’s fine. We’ll store it in a config file.
Create config.py:
# config.py
FLIGHTS_URL = "PASTE_YOUR_GOOGLE_FLIGHTS_RESULTS_URL_HERE"
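If you plan to monitor several routes later, one option (our convention; nothing downstream requires it) is to keep a dict of labeled share URLs alongside the single URL:

# config.py (multi-route variant)
FLIGHTS_URL = "PASTE_YOUR_GOOGLE_FLIGHTS_RESULTS_URL_HERE"

# Optional: label -> share URL, for multi-route polling later
FLIGHT_URLS = {
    "route-1": "PASTE_URL_HERE",
    "route-2": "PASTE_URL_HERE",
}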
Step 2: Fetch HTML reliably (timeouts + retries)
Google will sometimes return:
- an interstitial
- an error page
- truncated HTML
So we want:
- connect/read timeouts
- retries with exponential backoff
- a stable session (cookies)
import random
import time

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds

BASE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}

session = requests.Session()

def polite_sleep(min_s=0.7, max_s=1.6):
    time.sleep(random.uniform(min_s, max_s))

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=20))
def fetch_html(url: str) -> str:
    r = session.get(url, headers=BASE_HEADERS, timeout=TIMEOUT, allow_redirects=True)
    r.raise_for_status()
    text = r.text

    # lightweight sanity checks: raising here makes tenacity retry with backoff
    if "captcha" in text.lower() or "unusual traffic" in text.lower():
        raise RuntimeError("Blocked (captcha/unusual traffic)")
    if len(text) < 50_000:
        # Results pages are typically much larger; small HTML often means an interstitial.
        raise RuntimeError(f"Suspiciously small HTML: {len(text)} bytes")
    return text
Step 3: Parse results cards into structured quotes
Instead of assuming exact class names, we’ll:
- extract all price-like strings
- walk upward to find a container node
- then parse times/airlines/duration inside that container
This is not perfect, but it’s surprisingly effective when the page is mostly server-rendered.
import re

from bs4 import BeautifulSoup

PRICE_RE = re.compile(r"(₹|\$|€|£)\s?\d[\d,\.]*")
# Non-capturing (?:...) group so findall() returns whole time strings, not just "AM"/"PM".
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\s?(?:AM|PM)?\b", re.IGNORECASE)

def clean_text(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())

def find_price_nodes(soup: BeautifulSoup):
    # any element whose text looks like a price
    out = []
    for el in soup.find_all(string=True):
        t = str(el)
        if PRICE_RE.search(t):
            out.append(el.parent)
    return out

def parse_quote_from_container(container) -> dict:
    text = clean_text(container.get_text(" ", strip=True))

    # price
    m = PRICE_RE.search(text)
    price = m.group(0) if m else None

    # times (often two per result)
    times = TIME_RE.findall(text)

    # heuristic fields
    airline = None
    duration = None
    stops = None

    # try to capture common tokens
    dur_m = re.search(r"\b(\d+\s?h(?:\s?\d+\s?m)?|\d+\s?m)\b", text, re.IGNORECASE)
    if dur_m:
        duration = dur_m.group(1)

    stops_m = re.search(r"\b(nonstop|\d+\s?stop(?:s)?)\b", text, re.IGNORECASE)
    if stops_m:
        stops = stops_m.group(1)

    # airline guess: take first capitalized word sequence before duration/stops/price
    # (keeps this tolerant; you can refine once you inspect your target HTML)
    airline_m = re.search(r"\b([A-Z][A-Za-z&\-\.]+(?:\s+[A-Z][A-Za-z&\-\.]+){0,3})\b", text)
    if airline_m:
        airline = airline_m.group(1)

    return {
        "price": price,
        "times": times,
        "duration": duration,
        "stops": stops,
        "airline_guess": airline,
        "raw": text[:500],
    }

def parse_google_flights(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    # find many candidate price nodes, then dedupe by container identity
    price_nodes = find_price_nodes(soup)

    quotes = []
    seen = set()
    for node in price_nodes:
        # climb up a few levels to get a stable “card”-ish container
        container = node
        for _ in range(5):
            if container.parent:
                container = container.parent
        key = id(container)
        if key in seen:
            continue
        seen.add(key)

        q = parse_quote_from_container(container)
        if q.get("price"):
            quotes.append(q)

    # light cleanup: keep only the best-looking quotes
    # (cards that contain at least one time token)
    quotes = [q for q in quotes if len(q.get("times") or []) >= 1]
    return quotes
This parser is intentionally conservative. After a first run, inspect the raw field and tighten the heuristics for your route.
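For example, during development you can dump a few raw fields straight from the parser output (assumes html came from fetch_html() above):

quotes = parse_google_flights(html)
for q in quotes[:5]:
    print(q["price"], "|", q["raw"][:120])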
Step 4: Put it together (fetch → parse → export)
Save the code from Steps 2 and 3 plus the snippet below in one file, scrape_google_flights.py:
import json

from config import FLIGHTS_URL

def main():
    html = fetch_html(FLIGHTS_URL)
    polite_sleep()
    quotes = parse_google_flights(html)

    out = {
        "url": FLIGHTS_URL,
        "count": len(quotes),
        "quotes": quotes[:50],
    }
    with open("google_flights_quotes.json", "w", encoding="utf-8") as f:
        json.dump(out, f, ensure_ascii=False, indent=2)

    print("quotes:", len(quotes))
    if quotes:
        print("example:", quotes[0])

if __name__ == "__main__":
    main()
Run:
python scrape_google_flights.py
Where ProxiesAPI fits (honestly)
If you run this once or twice from your laptop, you may be fine.
But price monitoring is rarely “one request”. You typically want to:
- poll multiple routes
- poll multiple dates
- refresh daily/hourly
That’s where blocks and throttling appear.
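Concretely, a polling pass over the optional FLIGHT_URLS dict from config.py might look like this (the filename pattern and the 3–8 second gap are our choices):

import json
from datetime import date

from config import FLIGHT_URLS

def poll_all():
    for label, url in FLIGHT_URLS.items():
        try:
            html = fetch_html(url)
        except Exception as e:
            print(f"[skip] {label}: {e}")
            continue
        quotes = parse_google_flights(html)
        path = f"quotes_{label}_{date.today().isoformat()}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(quotes, f, ensure_ascii=False, indent=2)
        polite_sleep(3, 8)  # longer gaps between routes than within a run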
With ProxiesAPI, you route your requests through a stable proxy layer and rotate IPs so:
- bursts don’t come from one IP
- retries don’t look like a tight bot loop
- you avoid burning a single home/office IP
Minimal integration pattern
You can integrate ProxiesAPI at the network layer by sending requests via a proxy URL.
PROXIES = {
    "http": "http://YOUR_PROXIESAPI_PROXY",
    "https": "http://YOUR_PROXIESAPI_PROXY",
}

r = session.get(FLIGHTS_URL, headers=BASE_HEADERS, proxies=PROXIES, timeout=TIMEOUT)
(Use the proxy endpoint and auth details from your ProxiesAPI dashboard. Keep them in env vars, not hardcoded.)
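A minimal env-var pattern (the variable name PROXIESAPI_PROXY_URL is our placeholder; use whatever your dashboard calls it):

import os

proxy_url = os.environ["PROXIESAPI_PROXY_URL"]  # e.g. "http://user:pass@host:port"
PROXIES = {"http": proxy_url, "https": proxy_url}

r = session.get(FLIGHTS_URL, headers=BASE_HEADERS, proxies=PROXIES, timeout=TIMEOUT)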
Practical tips to avoid getting blocked
- Use a session (requests.Session()) so cookies persist.
- Add jitter (random delays) between runs.
- Cache results so you don’t re-fetch the same URL too often (a sketch follows this list).
- Fail fast on interstitials/captcha pages and back off.
- If HTML is inconsistent, switch to a browser-based fetch for the initial capture.
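For the caching tip, a minimal file-based sketch that wraps the fetch_html() helper from Step 2 (the .cache directory and 6-hour TTL are arbitrary):

import hashlib
import time
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_TTL = 6 * 3600  # seconds

def fetch_html_cached(url: str) -> str:
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists() and time.time() - path.stat().st_mtime < CACHE_TTL:
        return path.read_text(encoding="utf-8")
    html = fetch_html(url)  # from Step 2
    path.write_text(html, encoding="utf-8")
    return html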
QA checklist
- You can open your FLIGHTS_URL in a normal browser and see results
- fetch_html() returns large HTML (not an interstitial)
- Parser returns at least 5–20 quotes for a busy route
- Exported JSON is valid and contains price plus some time tokens
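You can turn that checklist into a quick script to run after main() (the thresholds are suggestions; tune them for your route):

import json

with open("google_flights_quotes.json", encoding="utf-8") as f:
    data = json.load(f)

assert data["count"] >= 5, f"only {data['count']} quotes; tighten the parser"
for q in data["quotes"]:
    assert q["price"], "quote missing price"
    assert q["times"], "quote missing time tokens"
print("QA passed:", data["count"], "quotes")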
Next upgrades
- parse fields with stronger selectors after inspecting HTML for your route
- store results in SQLite (dedupe by itinerary; see the sketch below)
- add alert rules (e.g. notify when price drops below threshold)
- use Playwright for a “rendered HTML snapshot” when server HTML is insufficient
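As a starting point for the SQLite upgrade, here is a sketch that dedupes on a rough itinerary key built from the parsed fields (the key definition is our assumption; refine it once your fields are reliable):

import sqlite3

def save_quotes(quotes: list[dict], db_path: str = "flights.db") -> None:
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            itinerary TEXT,  -- rough key: airline + times + stops
            price TEXT,
            fetched_at TEXT DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (itinerary, price)
        )
    """)
    for q in quotes:
        key = "|".join([
            str(q.get("airline_guess")),
            "/".join(q.get("times") or []),
            str(q.get("stops")),
        ])
        con.execute(
            "INSERT OR IGNORE INTO quotes (itinerary, price) VALUES (?, ?)",
            (key, q.get("price")),
        )
    con.commit()
    con.close()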