Scrape Flight Prices from Google Flights (Python + ProxiesAPI)

Google Flights is one of those pages everyone wants to scrape:

  • fast market research (route demand, typical price bands)
  • alerts + monitoring (price drops)
  • building a routes → prices dataset for analysis

It’s also one of the places that will punish naive scraping quickly.

In this tutorial, we’ll take an honest, production-first approach:

  1. fetch the HTML reliably (timeouts, retries, stable headers)
  2. construct search URLs you can parameterize per route/date
  3. parse whatever the server returns (and fail loudly when it’s not parseable)
  4. export a clean dataset
  5. show where ProxiesAPI fits in (network stability + IP rotation)

Google Flights results page (we’ll extract the visible price cards)

Keep price crawls stable with ProxiesAPI

Flight pricing pages are high-friction targets (rate limits, bot detection, and location variance). ProxiesAPI helps you rotate egress IPs and keep your crawl’s network layer consistent as volume grows.


Important reality check (Google Flights is JS-heavy)

Google Flights is largely rendered client-side, and its HTML can vary by:

  • geo / locale
  • device hints (headers, viewport)
  • cookies / consent
  • bot detection

That means there are two common scraping paths:

  • Path A (HTML parsing): works sometimes for lightweight extraction when the server returns usable HTML.
  • Path B (browser automation): Playwright/Selenium, extracting DOM after JS runs.

This guide focuses on Path A (requests + parsing) because it’s cheaper, faster, and good for many datasets.

If you consistently get empty HTML / interstitials, jump to the “When to switch to Playwright” section.


What we’re scraping

A typical Google Flights “explore / search results” view shows cards with:

  • airline / itinerary summary
  • departure/arrival times
  • duration and stops
  • price (the key)

Our goal is a dataset like:

{
  "from": "BOM",
  "to": "DEL",
  "depart_date": "2026-05-05",
  "return_date": null,
  "currency": "INR",
  "price": 6123,
  "raw_price_text": "₹6,123",
  "scraped_at": "2026-04-18T16:00:00Z",
  "source_url": "https://www.google.com/travel/flights?..."
}
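
If you want to pin that record shape down in code, a small dataclass works (field names here simply mirror the JSON above; nothing about them is required by Google, and the rest of this guide keeps plain dicts):

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PriceRow:
    # Mirrors the target record above. "from" is a Python keyword,
    # so we store it as from_ and rename it on export.
    from_: str
    to: str
    depart_date: str
    return_date: Optional[str]
    currency: str
    price: int
    raw_price_text: str
    scraped_at: str
    source_url: str

    def to_record(self) -> dict:
        d = asdict(self)
        d["from"] = d.pop("from_")
        return d


row = PriceRow("BOM", "DEL", "2026-05-05", None, "INR", 6123, "₹6,123",
               "2026-04-18T16:00:00Z", "https://www.google.com/travel/flights?...")
print(row.to_record()["from"])  # -> BOM
```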

Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity python-dateutil

We’ll use:

  • requests for HTTP
  • BeautifulSoup with the lxml parser (more forgiving than html.parser)
  • tenacity for retries (with backoff)

Step 1: Build a stable fetch() (timeouts, retries, headers)

Even before proxies, do the basics:

  • timeouts so you don’t hang
  • retries with exponential backoff
  • a realistic User-Agent
  • consistent Accept-Language

import os
import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

TIMEOUT = (10, 40)  # connect, read

USER_AGENTS = [
    # Keep a short rotation of real desktop UAs.
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]


@dataclass
class FetchResult:
    url: str
    status_code: int
    text: str
    final_url: str


def make_session() -> requests.Session:
    s = requests.Session()
    s.headers.update(
        {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "no-cache",
            "Pragma": "no-cache",
        }
    )
    return s


@retry(stop=stop_after_attempt(4), wait=wait_exponential_jitter(initial=1, max=12))
def fetch_html(session: requests.Session, url: str, proxies: Optional[dict] = None) -> FetchResult:
    # Light UA rotation
    session.headers["User-Agent"] = random.choice(USER_AGENTS)

    r = session.get(url, timeout=TIMEOUT, allow_redirects=True, proxies=proxies)
    return FetchResult(url=url, status_code=r.status_code, text=r.text, final_url=str(r.url))

Step 2: Construct a Google Flights URL (practical approach)

Google’s flight URLs are not a stable public API.

The most reliable “engineering” workflow is:

  1. open Google Flights in a browser
  2. perform your search (route + date)
  3. copy the resulting URL
  4. parameterize the parts you control (origin/destination/dates) in your own code

For demo purposes, we’ll keep it simple: you provide a template URL for each route/date.

Example (yours will differ):

https://www.google.com/travel/flights?hl=en&gl=US&curr=USD#flt=BOM.DEL.2026-05-05;c:INR;e:1;sd:1;t:f

Notes:

  • hl affects language
  • gl affects region
  • curr affects currency display
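
If you go the template route, a tiny helper keeps the parameterization in one place. The template string below is copied from the example URL above and is an assumption, not a stable API; re-capture it from your browser if results stop loading:

```python
def build_flights_url(origin: str, dest: str, depart_date: str,
                      hl: str = "en", gl: str = "US", curr: str = "USD") -> str:
    # Fill a URL template captured from a real browser session.
    # The fragment format (#flt=...) is undocumented and can change.
    template = (
        "https://www.google.com/travel/flights"
        "?hl={hl}&gl={gl}&curr={curr}"
        "#flt={origin}.{dest}.{depart};c:INR;e:1;sd:1;t:f"
    )
    return template.format(hl=hl, gl=gl, curr=curr,
                           origin=origin, dest=dest, depart=depart_date)


print(build_flights_url("BOM", "DEL", "2026-05-05"))
```

This keeps route/date substitution in one function, so when Google changes the URL shape you only update the template.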

Step 3: Parse prices from the returned HTML

When Google returns parseable HTML, you’ll often see price text in the response.

Instead of betting on one brittle selector, we use a layered strategy:

  1. look for common “₹123” / “$123” price-like strings in visible text
  2. optionally, try a few selectors (if present)
  3. keep the raw extraction evidence so you can debug quickly

import re
from bs4 import BeautifulSoup

PRICE_RE = re.compile(r"(?:(₹|\$|€|£)\s?)([0-9][0-9,\.]+)")


def parse_prices(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Quick sanity: if we got an interstitial, bail.
    title = (soup.title.get_text(strip=True) if soup.title else "").lower()
    if "unusual traffic" in title or "sorry" in title:
        raise RuntimeError(f"Blocked/interstitial detected: title={title!r}")

    text = soup.get_text("\n", strip=True)
    matches = []

    for m in PRICE_RE.finditer(text):
        currency = m.group(1)
        raw = f"{currency}{m.group(2)}"
        # Normalize
        num = m.group(2).replace(",", "")
        try:
            value = int(float(num))
        except ValueError:
            continue

        matches.append({"currency": currency, "raw_price_text": raw, "price": value})

    # De-dupe while preserving order
    seen = set()
    out = []
    for x in matches:
        key = (x["currency"], x["price"])
        if key in seen:
            continue
        seen.add(key)
        out.append(x)

    return out
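
The regex and normalization logic can be sanity-checked in isolation, with no network involved:

```python
import re

# Same pattern as parse_prices above.
PRICE_RE = re.compile(r"(?:(₹|\$|€|£)\s?)([0-9][0-9,\.]+)")


def normalize(match: re.Match) -> tuple:
    # Same normalization as parse_prices: strip thousands separators,
    # truncate any decimal part to an int.
    currency, num = match.group(1), match.group(2).replace(",", "")
    return currency, int(float(num))


sample = "From ₹6,123 · typical price $89.50"
found = [normalize(m) for m in PRICE_RE.finditer(sample)]
print(found)  # -> [('₹', 6123), ('$', 89)]
```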

This is intentionally conservative: Google Flights can include many prices on the page (filters, “typical prices”, etc.).

So in production you usually refine extraction by scoping to a section of the DOM or by using browser automation.

For an MVP dataset, you can take the lowest observed price as the “from price” signal.
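
One way to refine extraction is to run the same regex only inside a chosen DOM container instead of the whole page. The selector below is a placeholder, not a real Google Flights selector; inspect the live DOM and substitute whatever wraps the result cards for you:

```python
import re

from bs4 import BeautifulSoup  # pip install beautifulsoup4

PRICE_RE = re.compile(r"(?:(₹|\$|€|£)\s?)([0-9][0-9,\.]+)")


def parse_prices_scoped(html: str, container_selector: str) -> list:
    # Extract price-like strings only from text inside one container,
    # so sidebar filters / "typical price" widgets don't pollute results.
    soup = BeautifulSoup(html, "html.parser")
    scope = soup.select_one(container_selector)
    if scope is None:
        return []
    return [m.group(0) for m in PRICE_RE.finditer(scope.get_text(" ", strip=True))]


# Synthetic demo: the footer price is outside the scoped container.
html = '<div id="results"><span>₹6,123</span></div><footer>$999</footer>'
print(parse_prices_scoped(html, "#results"))  # -> ['₹6,123']
```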


Step 4: Combine fetch + parse into scrape_search()

from datetime import datetime, timezone


def scrape_search(url: str, route: dict, proxies: dict | None = None) -> dict:
    session = make_session()
    res = fetch_html(session, url, proxies=proxies)

    prices = parse_prices(res.text)
    if not prices:
        return {
            **route,
            "source_url": url,
            "final_url": res.final_url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "ok": False,
            "error": "No prices found in HTML. This is common on JS-rendered pages.",
        }

    best = min(prices, key=lambda x: x["price"])

    return {
        **route,
        "source_url": url,
        "final_url": res.final_url,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "ok": True,
        "currency": best["currency"],
        "price": best["price"],
        "raw_price_text": best["raw_price_text"],
        "samples": prices[:25],
    }

Step 5: Add ProxiesAPI (honestly)

ProxiesAPI is useful here for one reason: Google will rate-limit / block by IP once you scale beyond casual browsing.

What ProxiesAPI does not do:

  • it doesn’t magically turn a JS app into server-rendered HTML
  • it doesn’t bypass all bot checks

What it can do:

  • rotate egress IPs
  • reduce correlation between requests
  • keep your crawler from dying when one IP gets throttled

Using ProxiesAPI with requests

You’ll typically configure a proxy endpoint (HTTP/HTTPS) and pass it via proxies=.

Example pattern (adjust to your ProxiesAPI credentials and endpoint):

import os

PROXIESAPI_PROXY = os.getenv("PROXIESAPI_PROXY_URL")


def proxiesapi_dict() -> dict | None:
    if not PROXIESAPI_PROXY:
        return None
    return {
        "http": PROXIESAPI_PROXY,
        "https": PROXIESAPI_PROXY,
    }


route = {"from": "BOM", "to": "DEL", "depart_date": "2026-05-05", "return_date": None}
url = "https://www.google.com/travel/flights?hl=en&gl=US&curr=USD#flt=BOM.DEL.2026-05-05;c:INR;e:1;sd:1;t:f"

row = scrape_search(url, route, proxies=proxiesapi_dict())
print(row["ok"], row.get("price"), row.get("error"))
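
One concrete pattern: because parse_prices raises RuntimeError on an interstitial, you can wrap the whole fetch-and-parse attempt so a blocked response triggers a fresh attempt — and with a rotating proxy endpoint, each attempt typically exits from a different IP. This helper is a sketch, not part of ProxiesAPI itself:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_block_retries(fn: Callable[[], T], attempts: int = 3,
                       base_delay: float = 2.0) -> T:
    # Retry a callable that raises RuntimeError when blocked.
    # Backs off exponentially with jitter between attempts; behind a
    # rotating proxy, each retry usually leaves from a fresh egress IP.
    for i in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i) + random.uniform(0, base_delay))
    raise AssertionError("unreachable")
```

Usage: `row = with_block_retries(lambda: scrape_search(url, route, proxies=proxiesapi_dict()))`.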

Crawl multiple routes/dates + export CSV

import csv
import json


def export_csv(rows: list[dict], path: str = "flights_prices.csv"):
    if not rows:
        return
    keys = sorted({k for r in rows for k in r.keys() if k not in {"samples"}})

    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=keys)
        w.writeheader()
        for r in rows:
            rr = dict(r)
            rr.pop("samples", None)
            w.writerow(rr)


def export_json(rows: list[dict], path: str = "flights_prices.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


routes = [
    {
        "from": "BOM",
        "to": "DEL",
        "depart_date": "2026-05-05",
        "return_date": None,
        "url": "https://www.google.com/travel/flights?hl=en&gl=US&curr=USD#flt=BOM.DEL.2026-05-05;c:INR;e:1;sd:1;t:f",
    },
    # Add more rows here.
]

rows = []
for r in routes:
    route = {k: r[k] for k in ["from", "to", "depart_date", "return_date"]}
    try:
        row = scrape_search(r["url"], route, proxies=proxiesapi_dict())
    except RuntimeError as e:
        # parse_prices raises on interstitials; record the failure
        # instead of letting one blocked page kill the whole crawl.
        row = {**route, "source_url": r["url"], "ok": False, "error": str(e)}
    rows.append(row)
    time.sleep(random.uniform(2.0, 5.0))

export_csv(rows)
export_json(rows)
print("done", len(rows))

When to switch to Playwright (and still use ProxiesAPI)

If most of your requests produce:

  • empty pages
  • consent pages
  • “unusual traffic” interstitials
  • HTML with no price content

…then you need a browser automation layer.

A pragmatic setup is:

  • Playwright to render the page and query the DOM
  • ProxiesAPI to provide stable proxy routing per browser context

(That’s a separate guide, but this is the escalation path that works.)


QA checklist

  • You can fetch the URL with realistic headers + timeouts
  • You detect interstitials and fail loudly
  • You persist raw evidence (final_url, sample prices)
  • You rate-limit and jitter requests
  • You use ProxiesAPI only as the network stability layer — not as a “magic bypass”