Scrape Weather Data for Any City (Open-Meteo)
Sometimes the easiest “scraping” project isn’t HTML at all — it’s turning a public API into a repeatable dataset pipeline.
Open-Meteo is a great example: you can fetch detailed hourly/daily weather forecasts (as JSON) without an API key.
In this guide, you’ll build a small but production-shaped pipeline:
- take a city name ("Mumbai", "Berlin", "Austin")
- geocode it to latitude/longitude
- call Open-Meteo’s forecast API
- add retries, timeouts, and on-disk caching
- export the result to JSON and a tidy CSV
We’ll also show how to route requests through ProxiesAPI when you want a consistent fetch layer across many jobs.
Even when you’re calling “friendly” APIs, network flakiness and rate limits show up at scale. ProxiesAPI gives you a single fetch interface you can standardize across scrapers and data jobs.
What we’re fetching
We’ll call two endpoints:
- Geocoding (Open-Meteo Geocoding API)
- used to turn a city name into coordinates
- returns multiple matches, so you can choose the best
- Forecast (Open-Meteo Forecast API)
- takes latitude+longitude
- returns time series for hourly/daily variables
We’ll keep it simple and fetch:
- hourly: temperature, precipitation, wind
- daily: max/min temp
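Concretely, the request we'll end up building looks like this (a sketch with illustrative Mumbai coordinates; the real pipeline gets latitude/longitude from the geocoder in Step 2):

```python
import urllib.parse

# Illustrative coordinates; in the pipeline these come from the geocoder
params = {
    "latitude": 19.07,
    "longitude": 72.88,
    "hourly": "temperature_2m,precipitation,wind_speed_10m",
    "daily": "temperature_2m_max,temperature_2m_min",
    "timezone": "auto",
}
url = "https://api.open-meteo.com/v1/forecast?" + urllib.parse.urlencode(params)
print(url)
```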
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install requests
We’ll use the standard library for caching and CSV export.
Step 1: A fetch function (direct + ProxiesAPI)
Even for JSON APIs, you want:
- timeouts (no hanging requests)
- retries (transient failures happen)
- consistent headers
Direct fetch
import requests

TIMEOUT = (10, 30)

def fetch_json_direct(url: str, params: dict | None = None) -> dict:
    r = requests.get(
        url,
        params=params,
        timeout=TIMEOUT,
        headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"},
    )
    r.raise_for_status()
    return r.json()
Fetch via ProxiesAPI
ProxiesAPI works for any URL. The simplest mental model is: you pass a URL, you get the response body back.
The target URL has to be URL-encoded; otherwise everything after the first "&" (here, count=1) is parsed as a parameter of the ProxiesAPI request itself. curl -G --data-urlencode handles the encoding for you:

curl -G "http://api.proxiesapi.com/" \
  --data-urlencode "key=API_KEY" \
  --data-urlencode "url=https://geocoding-api.open-meteo.com/v1/search?name=Mumbai&count=1" | head
In Python, we URL-encode the full target URL (including querystring):
import urllib.parse
import requests

PROXIESAPI_KEY = "API_KEY"
TIMEOUT = (10, 60)

def fetch_json_via_proxiesapi(url: str, params: dict | None = None) -> dict:
    if params:
        url = url + ("&" if "?" in url else "?") + urllib.parse.urlencode(params)
    api = "http://api.proxiesapi.com/"
    req_url = api + "?" + urllib.parse.urlencode({"key": PROXIESAPI_KEY, "url": url})
    r = requests.get(
        req_url,
        timeout=TIMEOUT,
        headers={"User-Agent": "Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)"},
    )
    r.raise_for_status()
    return r.json()
Step 2: Geocode a city name
Open-Meteo’s geocoder returns a results list.
We’ll request up to 5 matches and pick the first one.
GEOCODE_URL = "https://geocoding-api.open-meteo.com/v1/search"

def geocode_city(name: str) -> dict:
    data = fetch_json_direct(GEOCODE_URL, params={
        "name": name,
        "count": 5,
        "language": "en",
        "format": "json",
    })
    results = data.get("results") or []
    if not results:
        raise ValueError(f"No geocoding results for: {name}")
    r0 = results[0]
    return {
        "name": r0.get("name"),
        "country": r0.get("country"),
        "admin1": r0.get("admin1"),
        "latitude": r0.get("latitude"),
        "longitude": r0.get("longitude"),
        "timezone": r0.get("timezone"),
    }
loc = geocode_city("Mumbai")
print(loc)
Typical output:
{'name': 'Mumbai', 'country': 'India', 'admin1': 'Maharashtra', 'latitude': 19.07283, 'longitude': 72.88261, 'timezone': 'Asia/Kolkata'}
Step 3: Fetch a forecast for that location
Now we call the forecast endpoint.
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"

def fetch_forecast(lat: float, lon: float, tz: str = "auto") -> dict:
    return fetch_json_direct(FORECAST_URL, params={
        "latitude": lat,
        "longitude": lon,
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
        "daily": "temperature_2m_max,temperature_2m_min",
        "timezone": tz,
    })
forecast = fetch_forecast(loc["latitude"], loc["longitude"], tz=loc["timezone"])
print("keys:", forecast.keys())
print("hourly points:", len((forecast.get("hourly") or {}).get("time") or []))
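The hourly block is a set of parallel arrays, and downstream steps (like the CSV export in Step 6) assume they're all the same length. A small sanity check, using a validate_hourly helper of our own, catches truncated responses early:

```python
def validate_hourly(hourly: dict) -> int:
    # Every hourly variable should be exactly as long as the "time" axis
    n = len(hourly.get("time") or [])
    for key, values in hourly.items():
        if key != "time" and isinstance(values, list) and len(values) != n:
            raise ValueError(f"length mismatch for {key}: {len(values)} != {n}")
    return n

# Fake response shaped like Open-Meteo's hourly section
sample = {"time": ["t0", "t1"], "temperature_2m": [20.1, 19.8], "precipitation": [0.0, 0.3]}
print(validate_hourly(sample))  # → 2
```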
Step 4: Add retries (so your pipeline doesn’t fall over)
Even with public APIs, you’ll sometimes see:
- a timeout
- a transient 5xx
- a short-lived network error
Here’s a lightweight retry wrapper with exponential backoff:
import time
import random

def with_retries(fn, *args, attempts: int = 4, **kwargs):
    last = None
    for i in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            last = e
            if i == attempts:
                break  # no point sleeping after the final attempt
            sleep = min(20, (2 ** i) + random.random())
            print(f"failed attempt {i}/{attempts}: {e}; sleeping {sleep:.1f}s")
            time.sleep(sleep)
    raise last
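You can exercise the wrapper without touching the network. This standalone demo repeats the wrapper (with the sleep capped very low so it finishes instantly) and feeds it a function that fails twice before succeeding:

```python
import time
import random

def with_retries(fn, *args, attempts: int = 4, **kwargs):
    last = None
    for i in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            last = e
            if i < attempts:
                time.sleep(min(0.01, (2 ** i) + random.random()))  # tiny cap, demo only
    raise last

calls = {"n": 0}

def flaky():
    # Simulates a transient failure: errors on the first two calls, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = with_retries(flaky)
print(result, "after", calls["n"], "attempts")
```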
Use it like:
loc = with_retries(geocode_city, "Berlin")
forecast = with_retries(fetch_forecast, loc["latitude"], loc["longitude"], loc["timezone"])
Step 5: Add on-disk caching (so reruns are fast)
If you run the same job repeatedly (daily dashboards, refreshes, tests), caching saves you time and reduces unnecessary calls.
We’ll cache responses keyed by a safe filename.
import json
import time
import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache_openmeteo")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(url: str, params: dict | None) -> str:
    raw = url + "?" + ("" if not params else json.dumps(params, sort_keys=True))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def fetch_json_cached(url: str, params: dict | None = None, ttl_seconds: int = 3600) -> dict:
    key = cache_key(url, params)
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        age = time.time() - path.stat().st_mtime
        if age < ttl_seconds:
            return json.loads(path.read_text(encoding="utf-8"))
    data = fetch_json_direct(url, params=params)
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return data
Now swap fetch_json_direct for fetch_json_cached wherever you want cached responses; the call sites don't need to change otherwise.
Step 6: Export a tidy hourly CSV
Open-Meteo returns arrays (parallel lists). We’ll turn the hourly section into row-wise data.
import csv

def export_hourly_csv(forecast: dict, out_path: str = "hourly.csv"):
    hourly = forecast.get("hourly") or {}
    times = hourly.get("time") or []
    cols = {
        "temperature_2m": hourly.get("temperature_2m") or [],
        "precipitation": hourly.get("precipitation") or [],
        "wind_speed_10m": hourly.get("wind_speed_10m") or [],
    }
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["time", *cols.keys()])
        w.writeheader()
        for i, t in enumerate(times):
            row = {"time": t}
            for k, arr in cols.items():
                row[k] = arr[i] if i < len(arr) else None
            w.writerow(row)
    print("wrote", out_path, "rows", len(times))
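The column-to-row transform is easy to sanity-check on fake data. This standalone sketch uses the same column names as above but writes to an in-memory buffer instead of a file:

```python
import csv
import io

# Fake hourly block shaped like Open-Meteo's parallel arrays
hourly = {
    "time": ["2024-01-01T00:00", "2024-01-01T01:00"],
    "temperature_2m": [21.3, 20.8],
    "precipitation": [0.0, 0.2],
}

buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=["time", "temperature_2m", "precipitation"])
w.writeheader()
for t, temp, rain in zip(hourly["time"], hourly["temperature_2m"], hourly["precipitation"]):
    w.writerow({"time": t, "temperature_2m": temp, "precipitation": rain})

lines = buf.getvalue().strip().splitlines()
print(lines)
```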
Run end-to-end:
import json

loc = geocode_city("Austin")
forecast = fetch_forecast(loc["latitude"], loc["longitude"], tz=loc["timezone"])

with open("forecast.json", "w", encoding="utf-8") as f:
    json.dump(forecast, f, ensure_ascii=False, indent=2)

export_hourly_csv(forecast, "austin_hourly.csv")
Where ProxiesAPI fits (honestly)
Open-Meteo is easy to use directly.
But if you’re building multiple data jobs (HTML scrapers + JSON APIs + enrichment steps), a consistent network layer helps:
- same timeout strategy
- same retry strategy
- one place to standardize headers
That’s where ProxiesAPI is useful: you treat every target as “a URL that returns content”, and your pipeline stays uniform.
Checklist
- geocoder returns multiple matches; you pick one deterministically
- forecast call returns hourly arrays with the same length
- retries prevent flaky failures
- caching makes reruns fast
- CSV export is row-wise (not column-wise)