GraphQL Scraping: How to Extract Clean Data from Modern Web Apps

Jun 19, 2026 · tutorial · #graphql scraping, #graphql, #python, #playwright, #httpx, #web-scraping

GraphQL scraping is often easier than traditional scraping, but only after you stop thinking like a DOM parser.

Modern React, Next.js, and app-shell frontends frequently render UI from GraphQL responses that are already structured, paginated, and closer to the business data you actually want.

That makes GraphQL attractive for scraping because it can replace:

brittle CSS selectors
multi-step pagination clicks
nested client-side hydration parsing

In this guide, we will cover:

how to find GraphQL requests in DevTools
how to replay them with Python
how persisted queries change the workflow
when GraphQL scraping beats DOM scraping

Why GraphQL is often the cleaner data source

With plain HTML scraping, you usually need to reconstruct the data model from visible markup.

With GraphQL, the data model often arrives already organized into:

nodes / edges
product or listing objects
cursors for pagination
filters and sort options in one request body

That means less guesswork and fewer selector failures when frontend teams redesign the page.
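
As a rough illustration of why this shape is convenient, the sketch below flattens a connection into plain rows. The response literal and its field names (`search`, `edges`, `node`, `name`, `price`) are invented for the example; a real site will expose its own schema.

```python
from __future__ import annotations

from typing import Any


def connection_to_rows(connection: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the item objects from a connection that uses either `nodes` or `edges`."""
    if "nodes" in connection:
        return list(connection["nodes"])
    # Relay-style connections wrap each item as {"cursor": ..., "node": {...}}.
    return [edge["node"] for edge in connection.get("edges", [])]


# Hypothetical response body, shaped like a typical product search.
sample_response = {
    "data": {
        "search": {
            "pageInfo": {"hasNextPage": True, "endCursor": "YXJyYXk6MQ=="},
            "edges": [
                {"cursor": "YXJyYXk6MA==", "node": {"id": "1", "name": "Trail Runner", "price": 89.0}},
                {"cursor": "YXJyYXk6MQ==", "node": {"id": "2", "name": "Road Racer", "price": 120.0}},
            ],
        }
    }
}

rows = connection_to_rows(sample_response["data"]["search"])
print(rows)
```

No selectors are involved: the rows come straight out of the JSON, and the `pageInfo` block already says where the next page starts.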

Step 1: Confirm the app is really using GraphQL

Open DevTools and check the Network tab while the page loads or while you apply a filter.

Common clues:

requests to /graphql
POST requests with operationName, query, or variables
GET requests with extensions={"persistedQuery": ...}
JSON responses under data

The GraphQL-over-HTTP specification describes GraphQL as running over regular HTTP: servers accept POST and may also accept GET, and which one you see depends on how the client is configured. That matters because your replay script should match the real method instead of assuming everything is a POST.
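
The clues above can be turned into a quick heuristic for sorting through exported network logs. This is only a sketch: the function name `looks_like_graphql` is made up here, and the checks will miss unusual setups or occasionally flag a non-GraphQL JSON API.

```python
from __future__ import annotations

import json
from urllib.parse import parse_qs, urlsplit

# Keywords a GraphQL document usually starts with ("{" covers shorthand queries).
_DOCUMENT_PREFIXES = ("query", "mutation", "subscription", "fragment", "{")


def _is_operation(op: object) -> bool:
    if not isinstance(op, dict):
        return False
    extensions = op.get("extensions")
    if isinstance(extensions, dict) and "persistedQuery" in extensions:
        return True
    query = op.get("query")
    return isinstance(query, str) and query.lstrip().startswith(_DOCUMENT_PREFIXES)


def looks_like_graphql(method: str, url: str, body: str | None = None) -> bool:
    """Heuristic check based on the request method, URL, and body."""
    parts = urlsplit(url)
    if method.upper() == "GET":
        params = parse_qs(parts.query)
        # A bare `query` parameter is too common on search pages to count alone.
        return "extensions" in params or (
            "query" in params and "graphql" in parts.path.lower()
        )
    if method.upper() == "POST" and body:
        try:
            parsed = json.loads(body)
        except ValueError:
            return False
        # Batched requests arrive as a JSON list of operations.
        operations = parsed if isinstance(parsed, list) else [parsed]
        return any(_is_operation(op) for op in operations)
    return False


print(looks_like_graphql("POST", "https://example.com/api", '{"query": "query A { x }"}'))
print(looks_like_graphql("GET", "https://example.com/search?query=shoes"))
```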

Step 2: Capture the operation name, variables, and headers

The most important fields are usually:

request method
URL
operationName
variables
request headers that are actually required

Do not cargo-cult the entire browser header set. Start with the minimum viable replay:

content-type
accept
auth or session headers if the page genuinely needs them

For public pages, that is often enough.
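
One way to apply this is to filter a captured header set through a small allowlist and only widen it when a replay fails. The sketch below assumes the headers were captured as a plain dict; the allowlist contents, including `x-csrf-token`, are illustrative guesses and not a rule for any particular site.

```python
from __future__ import annotations

# Headers to keep on a first replay attempt.
BASE_ALLOWLIST = {"content-type", "accept"}
# Only needed when the page genuinely requires a session (names are examples).
AUTH_HEADERS = {"authorization", "cookie", "x-csrf-token"}


def minimal_headers(captured: dict[str, str], include_auth: bool = False) -> dict[str, str]:
    """Keep only allowlisted headers, normalising names to lowercase."""
    allowed = BASE_ALLOWLIST | (AUTH_HEADERS if include_auth else set())
    return {
        name.lower(): value
        for name, value in captured.items()
        if name.lower() in allowed
    }


# Hypothetical header set copied from a browser request.
captured = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Sec-Fetch-Mode": "cors",
    "Cookie": "session=abc",
    "User-Agent": "Mozilla/5.0",
}

print(minimal_headers(captured))
print(minimal_headers(captured, include_auth=True))
```

If the trimmed request is rejected, add headers back one at a time so you learn which ones the server actually checks.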

Step 3: Replay a standard GraphQL POST with Python

Here is the simplest useful pattern.

from __future__ import annotations

import httpx


GRAPHQL_URL = "https://example.com/graphql"

payload = {
    "operationName": "SearchProducts",
    "variables": {
        "query": "running shoes",
        "first": 24,
        "after": None,
    },
    "query": """
    query SearchProducts($query: String!, $first: Int!, $after: String) {
      search(query: $query, first: $first, after: $after) {
        pageInfo { hasNextPage endCursor }
        nodes {
          id
          name
          price
          brand
          inStock
        }
      }
    }
    """,
}

headers = {
    "content-type": "application/json",
    "accept": "application/json",
    "user-agent": "Mozilla/5.0",
}

with httpx.Client(timeout=30.0, headers=headers) as client:
    response = client.post(GRAPHQL_URL, json=payload)
    response.raise_for_status()
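    data = response.json()

# GraphQL servers often return HTTP 200 even when the operation failed,
# so check the "errors" key before trusting "data".
if data.get("errors"):
    raise RuntimeError(f"GraphQL errors: {data['errors']}")

search = data["data"]["search"]
items = search["nodes"]
page_info = search["pageInfo"]

print(items[0] if items else "no results")
print(page_info)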

This is already better than scraping cards out of a heavily hydrated product grid.

Step 4: Handle cursor pagination instead of clicking "next"

One of the biggest GraphQL advantages is that pagination is usually explicit.

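def paginate_search(client: httpx.Client, query: str) -> list[dict]:
    """Follow endCursor until hasNextPage is false, collecting every node."""
    after = None
    all_nodes: list[dict] = []

    while True:
        # Build a fresh body per page instead of mutating the shared payload.
        page_payload = {
            **payload,
            "variables": {**payload["variables"], "query": query, "after": after},
        }

        response = client.post(GRAPHQL_URL, json=page_payload)
        response.raise_for_status()
        result = response.json()["data"]["search"]

        all_nodes.extend(result["nodes"])
        if not result["pageInfo"]["hasNextPage"]:
            break

        # The cursor of the last item becomes the starting point of the next page.
        after = result["pageInfo"]["endCursor"]

    return all_nodes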

That is much cleaner than clicking pagination controls in a browser and hoping nothing lazy-loads in a different order.
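
A cursor loop that trusts the server completely can spin forever if the endpoint keeps returning the same cursor. The sketch below is one possible guard, not part of any library: `iter_nodes`, `fetch_page`, and `max_pages` are names chosen for this example, and the fake fetcher stands in for a real HTTP call so the snippet runs offline.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator


def iter_nodes(
    fetch_page: Callable[[str | None], dict[str, Any]],
    max_pages: int = 100,
) -> Iterator[dict[str, Any]]:
    """Yield nodes page by page, stopping on a repeated cursor or a page cap."""
    seen_cursors: set[str] = set()
    after: str | None = None

    for _ in range(max_pages):
        connection = fetch_page(after)
        yield from connection["nodes"]

        page_info = connection["pageInfo"]
        after = page_info.get("endCursor")
        # Stop when the server says so, or when the cursor is missing or repeated.
        if not page_info.get("hasNextPage") or after is None or after in seen_cursors:
            return
        seen_cursors.add(after)


# Fake two-page dataset keyed by the `after` cursor (illustrative only).
PAGES = {
    None: {"nodes": [{"id": "1"}, {"id": "2"}], "pageInfo": {"hasNextPage": True, "endCursor": "c1"}},
    "c1": {"nodes": [{"id": "3"}], "pageInfo": {"hasNextPage": False, "endCursor": "c2"}},
}


def fake_fetch(after: str | None) -> dict[str, Any]:
    return PAGES[after]


ids = [node["id"] for node in iter_nodes(fake_fetch)]
print(ids)
```

With a real client, `fetch_page` would wrap the POST from Step 3 and return the `search` object for a given `after` value.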

Step 5: Understand persisted queries

Some apps do not send the raw GraphQL query text every time.

Instead, they send an operationName and a hash of the query, often in an extensions.persistedQuery object. This is usually called a persisted query; Apollo's variant is known as automatic persisted queries (APQ).

A typical request body:

{
  "operationName": "SearchProducts",
  "variables": {"query": "running shoes"},
  "extensions": {
    "persistedQuery": {
      "version": 1,
      "sha256Hash": "abc123..."
    }
  }
}

What this means for scraping:

you may not see the raw query text in the request
the hash must match a server-known query
replay usually still works if you keep the same operation name and hash and send variables in the shape that query expects

That is why saving full request payloads during discovery is so useful.
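
A persisted query sent over GET carries the same fields as URL parameters, with `variables` and `extensions` serialised as JSON strings. The sketch below builds those parameters; `persisted_get_params` and `apq_hash` are hypothetical helper names, and the hash placeholder must be replaced with a value captured from the real site. In Apollo-style APQ the hash is the SHA-256 hex digest of the query text, but recomputing it only helps if you have the exact string the client hashed.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def apq_hash(query_text: str) -> str:
    """SHA-256 hex digest of the query text, as used by Apollo-style APQ."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


def persisted_get_params(
    operation_name: str,
    variables: dict[str, Any],
    sha256_hash: str,
) -> dict[str, str]:
    """Build query-string parameters for a persisted-query GET request."""
    extensions = {"persistedQuery": {"version": 1, "sha256Hash": sha256_hash}}
    return {
        "operationName": operation_name,
        # Both fields travel as compact JSON; the HTTP client URL-encodes them.
        "variables": json.dumps(variables, separators=(",", ":")),
        "extensions": json.dumps(extensions, separators=(",", ":")),
    }


params = persisted_get_params(
    "SearchProducts",
    {"query": "running shoes"},
    sha256_hash="<hash copied from DevTools>",  # placeholder, not a real hash
)
print(params)
# With httpx this could be sent as: client.get(GRAPHQL_URL, params=params)
```

Servers that implement Apollo-style APQ typically answer an unknown hash with a `PersistedQueryNotFound` error. When that appears, the saved hash is probably stale and it is time to capture a fresh request.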

Step 6: Capture requests directly with Playwright

For modern apps, Playwright is an excellent discovery tool even if your production scraper eventually uses httpx.

from playwright.sync_api import sync_playwright


def capture_graphql_calls(url: str) -> list[dict]:
    calls = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def on_request(req):
            if "graphql" in req.url and req.method in {"GET", "POST"}:
                calls.append(
                    {
                        "url": req.url,
                        "method": req.method,
                        "headers": req.headers,
                        "post_data": req.post_data,
                    }
                )

        page.on("request", on_request)
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(3000)
        browser.close()

    return calls

Use this to answer three questions fast:

which endpoint matters?
which operation names are stable?
which variables control filters and pagination?
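
To answer those questions from a capture, it helps to group the recorded calls by endpoint and operation name and list the variable names each one used. The sketch below assumes call dicts shaped like the ones `capture_graphql_calls` returns (`url`, `method`, `post_data`); `summarize_calls` is a name invented for this example and the sample data is made up.

```python
from __future__ import annotations

import json
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def _operations(call: dict) -> list:
    """Extract the operation objects from one captured call (POST body or GET params)."""
    parts = urlsplit(call["url"])
    try:
        if call.get("post_data"):
            body = json.loads(call["post_data"])
        else:
            params = parse_qs(parts.query)
            body = {
                "operationName": params.get("operationName", [None])[0],
                "variables": json.loads(params.get("variables", ["{}"])[0]),
            }
    except ValueError:
        return []  # not JSON, so not something we can summarise
    return body if isinstance(body, list) else [body]


def summarize_calls(calls: list[dict]) -> dict[tuple[str, str], set[str]]:
    """Map (endpoint, operationName) to the set of variable names seen."""
    summary: dict[tuple[str, str], set[str]] = defaultdict(set)
    for call in calls:
        parts = urlsplit(call["url"])
        endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
        for op in _operations(call):
            if not isinstance(op, dict):
                continue
            name = op.get("operationName") or "<anonymous>"
            summary[(endpoint, name)].update((op.get("variables") or {}).keys())
    return dict(summary)


# Made-up capture: two POSTs for the same operation and one persisted-style GET.
sample_calls = [
    {"url": "https://example.com/graphql", "method": "POST",
     "post_data": '{"operationName": "SearchProducts", "variables": {"query": "shoes", "first": 24}}'},
    {"url": "https://example.com/graphql", "method": "POST",
     "post_data": '{"operationName": "SearchProducts", "variables": {"query": "shoes", "after": "abc"}}'},
    {"url": "https://example.com/graphql?operationName=Cart&variables=%7B%22id%22%3A1%7D",
     "method": "GET", "post_data": None},
]

summary = summarize_calls(sample_calls)
for (endpoint, name), variable_names in sorted(summary.items()):
    print(endpoint, name, sorted(variable_names))
```

Variables that change between calls (here `after`) are the likely pagination or filter controls.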

GraphQL vs DOM scraping

Task	GraphQL replay	DOM scraping
product / listing data	usually excellent	often brittle
pagination	explicit cursors	click/load-more logic
field selection	precise	whatever the UI exposes
resilience to redesigns	often better	usually worse
auth-heavy private data	still difficult	still difficult