Scrape Pinterest Images and Pins (Search + Board URLs) with Python + ProxiesAPI

Pinterest is one of those sites where “just curl the HTML” works for quick exploration… until it doesn’t.

  • some pages are server-rendered enough to parse
  • some content is hydrated by JS
  • anti-bot can kick in when you paginate or hit multiple boards quickly

In this guide we’ll build a practical Pinterest scraper in Python that supports two common workflows:

  1. Search pages (e.g. “kitchen design”)
  2. Board pages (e.g. https://www.pinterest.com/<user>/<board>/)

We’ll extract:

  • pin title / alt text
  • best image URL we can find (often i.pinimg.com)
  • pin URL
  • outbound “destination” URL when available
  • board metadata (name, followers if visible)

We’ll also add:

  • pagination / continuation (best-effort)
  • retries + backoff
  • dedupe
  • JSONL export

Pinterest search UI (we’ll extract pin cards + image URLs)

Keep Pinterest scrapes stable with ProxiesAPI

Pinterest can throttle aggressively once you scale beyond a few requests. ProxiesAPI helps you keep a consistent network layer (retries, rotation, higher success rates) while your parser stays the same.


Important note (what’s realistic)

Pinterest changes its markup frequently and uses heavy client-side rendering.

This tutorial focuses on a robust, best-effort HTML approach:

  • It works when the response contains enough HTML / embedded JSON
  • It may require tweaks when Pinterest ships UI changes

If you need guaranteed long-term stability, you typically move to:

  • a browser pipeline (Playwright) with careful throttling, or
  • a data partner / official API

But for many use cases (collecting inspiration pins from specific boards, quick monitoring, internal datasets), HTML + defensive parsing is still useful.


Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup(lxml) for parsing

You’ll plug ProxiesAPI into a single fetch() function. Everything else (parsing, pagination, export) stays the same.

Below is a template you can adapt to your ProxiesAPI account.

import os
import time
import random
import requests

TIMEOUT = (15, 45)  # connect, read

# Put your key in env: export PROXIESAPI_KEY="..."
PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

session = requests.Session()


def fetch(url: str, *, use_proxiesapi: bool = True, max_retries: int = 5) -> str:
    """Fetch a URL with retries + exponential backoff.

    If you use ProxiesAPI, keep this function as the only place that knows about it.
    """
    last_err = None

    for attempt in range(1, max_retries + 1):
        try:
            if use_proxiesapi:
                if not PROXIESAPI_KEY:
                    raise RuntimeError("Missing PROXIESAPI_KEY env var")

                # Example pattern: call ProxiesAPI with the upstream URL.
                # Replace endpoint/params with the exact ProxiesAPI format you use.
                r = session.get(
                    "https://api.proxiesapi.com",
                    params={
                        "auth_key": PROXIESAPI_KEY,
                        "url": url,
                        # Optional knobs (names depend on your ProxiesAPI plan):
                        # "country": "US",
                        # "render": "false",
                    },
                    timeout=TIMEOUT,
                    headers={
                        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0 Safari/537.36",
                        "Accept-Language": "en-US,en;q=0.9",
                    },
                )
            else:
                r = session.get(
                    url,
                    timeout=TIMEOUT,
                    headers={
                        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0 Safari/537.36",
                        "Accept-Language": "en-US,en;q=0.9",
                    },
                )

            r.raise_for_status()
            return r.text

        except Exception as e:
            last_err = e
            if attempt == max_retries:
                break  # don't sleep after the final failed attempt
            # Backoff with jitter before the next attempt
            sleep_s = min(30, (2 ** (attempt - 1)) + random.random())
            time.sleep(sleep_s)

    raise RuntimeError(f"Failed to fetch after {max_retries} retries: {url}") from last_err
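One refinement worth considering (a sketch, not part of the tutorial's fetch()): retry only on transient failures like 429/5xx and network errors, and fail fast on permanent ones like 404, so a bad URL doesn't burn the whole retry budget.

```python
import requests

# Transient statuses worth retrying; 4xx like 404 are usually permanent
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def should_retry(exc: Exception) -> bool:
    """Decide whether an exception from a fetch attempt is worth retrying."""
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        return exc.response.status_code in RETRYABLE_STATUSES
    # Connection errors, timeouts, etc. are transient by nature
    return isinstance(exc, requests.RequestException)
```

Inside the except block above you would check should_retry(e) and re-raise immediately when it returns False.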

What Pinterest pages look like (what to parse)

Pinterest pages often contain:

  • visible HTML with some pin cards
  • embedded JSON blobs in <script> tags that contain richer pin data

We’ll implement two extraction strategies:

  1. HTML image + link scraping (fast, sometimes enough)
  2. Embedded JSON discovery (more stable when present)
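Both strategies produce overlapping results, so they share one dedupe rule: keep the first occurrence, keyed by pin id, then pin URL, then image URL. The tutorial inlines this in each function; factored out, it looks like this (a sketch you can swap in if you prefer one helper):

```python
def dedupe_pins(pins: list[dict]) -> list[dict]:
    """Keep the first occurrence of each pin.

    Key priority: id, then pin_url, then image_url.
    Items with no usable key are dropped.
    """
    seen = set()
    out = []
    for p in pins:
        key = p.get("id") or p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(p)
    return out
```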

Step 1: Parse pins from HTML (cards → images)

On many Pinterest pages, you can find images that look like:

  • https://i.pinimg.com/.../xxx.jpg

We’ll treat each candidate image as a “pin-ish” item and then try to locate a nearby link.

from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.pinterest.com"


def normalize_pin_url(href: str | None) -> str | None:
    if not href:
        return None
    if href.startswith("http"):
        return href
    return urljoin(BASE, href)


def parse_pins_from_html(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    pins = []

    # Pinterest images typically come from i.pinimg.com
    for img in soup.select('img[src*="i.pinimg.com"]'):
        src = img.get("src")
        alt = (img.get("alt") or "").strip() or None

        # Heuristic: closest anchor up the tree
        a = img.find_parent("a")
        href = a.get("href") if a else None

        pins.append({
            "title": alt,
            "image_url": src,
            "pin_url": normalize_pin_url(href),
        })

    # Dedupe by (image_url or pin_url)
    seen = set()
    out = []
    for p in pins:
        key = p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(p)

    return out

This alone can give you a usable dataset (title + image URL + pin URL).
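One useful trick with those image URLs: i.pinimg.com embeds the size bucket in the path (e.g. /236x/, /564x/, /736x/), so you can often rewrite a thumbnail URL to request a larger rendition. This is best-effort; not every size exists for every pin, so keep the original URL as a fallback.

```python
import re


def upscale_pinimg(url: str, size: str = "736x") -> str:
    """Rewrite the size segment of an i.pinimg.com URL (best-effort).

    If the URL has no size segment, it is returned unchanged.
    """
    return re.sub(r"/\d+x/", f"/{size}/", url, count=1)


# upscale_pinimg("https://i.pinimg.com/236x/ab/cd/ef.jpg")
# -> "https://i.pinimg.com/736x/ab/cd/ef.jpg"
```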

But if you want outbound destination URLs (the link a pin points to), you’ll usually need embedded JSON.


Step 2: Extract embedded JSON (best-effort)

Pinterest often embeds JSON data in script tags.

We’ll:

  • scan scripts
  • find JSON-ish blobs
  • parse objects that look like pins

Because this varies, we’ll implement a conservative extractor that doesn’t assume a single schema.

import json
import re


def iter_json_blobs(html: str):
    # crude but effective: look for large JSON blobs in <script> tags
    for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        s = m.group(1).strip()
        if not s:
            continue
        # Quick filter
        if "{" not in s and "[" not in s:
            continue

        # Sometimes Pinterest assigns JSON to a variable; try to locate the first { ... } block
        # This won't catch everything, but it avoids overfitting.
        start = s.find("{")
        if start == -1:
            start = s.find("[")
        if start == -1:
            continue

        candidate = s[start:]

        # raw_decode parses the first valid JSON value and ignores any
        # trailing text (e.g. the ";" after a variable assignment), which
        # plain json.loads would reject
        try:
            obj, _ = json.JSONDecoder().raw_decode(candidate)
            yield obj
        except Exception:
            continue


def walk(obj):
    if isinstance(obj, dict):
        yield obj
        for v in obj.values():
            yield from walk(v)
    elif isinstance(obj, list):
        for it in obj:
            yield from walk(it)


def parse_pins_from_embedded_json(html: str) -> list[dict]:
    pins = []

    for blob in iter_json_blobs(html):
        for node in walk(blob):
            # Heuristic: nodes that look like pin objects may have an id + images
            pin_id = node.get("id") or node.get("pin_id")
            images = node.get("images") or node.get("image")

            # Try to locate an image URL in common shapes
            image_url = None
            if isinstance(images, dict):
                # Pinterest sometimes offers multiple sizes
                for key in ["orig", "original", "736x", "564x", "474x"]:
                    if key in images and isinstance(images[key], dict):
                        image_url = images[key].get("url")
                        if image_url:
                            break
                if not image_url:
                    # any dict value with url
                    for v in images.values():
                        if isinstance(v, dict) and v.get("url"):
                            image_url = v.get("url")
                            break

            title = node.get("title") or node.get("grid_title") or node.get("description")
            if not isinstance(title, str):
                title = None  # guard: these fields are occasionally nested objects
            link = node.get("link") or node.get("url")

            # Destination URL is often in fields like "link" or "destination_url"
            destination = node.get("destination_url") or node.get("outbound_link")

            if (pin_id or image_url) and (image_url or link):
                pins.append({
                    "id": str(pin_id) if pin_id else None,
                    "title": (title or "").strip() or None,
                    "image_url": image_url,
                    "pin_url": link if isinstance(link, str) else None,
                    "destination_url": destination if isinstance(destination, str) else None,
                })

    # Dedupe
    seen = set()
    out = []
    for p in pins:
        key = p.get("id") or p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(p)

    return out

You’ll notice we didn’t hardcode a single schema. That’s intentional.


Step 3: Scrape Pinterest search results

Search URLs look like:

  • https://www.pinterest.com/search/pins/?q=kitchen%20design

Pinterest pagination can be complex. For a tutorial that stays maintainable, we’ll:

  • fetch the first page
  • parse pins from HTML + embedded JSON
  • optionally attempt to fetch “more” by using ?rs=typed or additional parameters (best-effort)

from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    qs = urlencode({"q": query})
    return f"https://www.pinterest.com/search/pins/?{qs}"


def scrape_search(query: str, *, pages: int = 1) -> list[dict]:
    all_pins = []
    seen = set()

    url = build_search_url(query)

    for page in range(1, pages + 1):
        html = fetch(url)

        batch = []
        batch.extend(parse_pins_from_embedded_json(html))
        batch.extend(parse_pins_from_html(html))

        for p in batch:
            key = p.get("id") or p.get("pin_url") or p.get("image_url")
            if not key or key in seen:
                continue
            seen.add(key)
            all_pins.append(p)

        print("page", page, "pins", len(batch), "total unique", len(all_pins))

        # Best-effort pagination: Pinterest often needs continuation tokens.
        # For a production pipeline, you’d capture their internal next-page JSON calls.
        # Here we stop after the first page unless you extend this section.
        break

    return all_pins

For many use cases, you can run the search repeatedly (different keywords) instead of deep-paginating one keyword.
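That keyword-sweep approach can be sketched like this (scrape_fn is any callable such as the scrape_search above; the delay range is a guess, so tune it to your rate limits):

```python
import random
import time


def scrape_many(keywords: list[str], scrape_fn, delay_range=(2.0, 5.0)) -> dict:
    """Run a shallow scrape per keyword with a polite, jittered delay between them."""
    results = {}
    for i, kw in enumerate(keywords):
        results[kw] = scrape_fn(kw)
        if i < len(keywords) - 1:  # no need to sleep after the last keyword
            time.sleep(random.uniform(*delay_range))
    return results
```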


Step 4: Scrape a Board URL

Board URLs are commonly:

  • https://www.pinterest.com/<username>/<board>/

Boards are a great target because the intent is clear: “pins in this collection.”


def scrape_board(board_url: str) -> list[dict]:
    html = fetch(board_url)

    pins = []
    pins.extend(parse_pins_from_embedded_json(html))
    pins.extend(parse_pins_from_html(html))

    # Board metadata (best-effort)
    soup = BeautifulSoup(html, "lxml")
    h1 = soup.select_one("h1")
    board_name = h1.get_text(" ", strip=True) if h1 else None

    for p in pins:
        p["board_url"] = board_url
        p["board_name"] = board_name

    # Dedupe
    seen = set()
    out = []
    for p in pins:
        key = p.get("id") or p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(p)

    return out

Export to JSONL

import json

def write_jsonl(path: str, rows: list[dict]):
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    pins = scrape_search("kitchen design", pages=1)
    write_jsonl("pinterest_search_kitchen_design.jsonl", pins)
    print("wrote", len(pins))

    # Example board (replace with a real board you own / is public)
    # board_pins = scrape_board("https://www.pinterest.com/<user>/<board>/")
    # write_jsonl("pinterest_board.jsonl", board_pins)

QA checklist

  • First page returns 20+ pins (varies)
  • Image URLs are i.pinimg.com and load in a browser
  • Pin URLs look like Pinterest URLs (not None)
  • Dedupe reduces duplicates from HTML + JSON extraction overlap
  • Your fetch layer uses timeouts + retries
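That checklist can be partially automated (a sketch; the 20-pin threshold and URL checks mirror the bullets above, so adjust them to what your pages actually return):

```python
def validate_pins(pins: list[dict], min_count: int = 20) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks sane."""
    problems = []
    if len(pins) < min_count:
        problems.append(f"only {len(pins)} pins (expected {min_count}+)")
    for p in pins:
        img = p.get("image_url")
        if img and "i.pinimg.com" not in img:
            problems.append(f"unexpected image host: {img}")
        purl = p.get("pin_url")
        if purl and not purl.startswith("https://www.pinterest.com"):
            problems.append(f"non-Pinterest pin URL: {purl}")
    return problems
```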

Where ProxiesAPI helps (honestly)

Pinterest is one of the sites where the network layer is your main pain:

  • throttling increases with pagination
  • intermittent 403/429 responses
  • inconsistent HTML/JS payloads

ProxiesAPI helps keep fetches reliable while your parser stays focused on structure and data quality.

If you extend this guide, the next step is to capture the internal continuation requests (tokens) and implement deep pagination — ProxiesAPI makes that significantly less flaky.

