Scrape Pinterest Images and Pins (Search + Board URLs) with Python + ProxiesAPI
Pinterest is one of those sites where “just curl the HTML” works for quick exploration… until it doesn’t.
- some pages are server-rendered enough to parse
- some content is hydrated by JS
- anti-bot can kick in when you paginate or hit multiple boards quickly
In this guide we’ll build a practical Pinterest scraper in Python that supports two common workflows:
- Search pages (e.g. “kitchen design”)
- Board pages (e.g. https://www.pinterest.com/<user>/<board>/)
We’ll extract:
- pin title / alt text
- best image URL we can find (often i.pinimg.com)
- pin URL
- outbound “destination” URL when available
- board metadata (name, followers if visible)
We’ll also add:
- pagination / continuation (best-effort)
- retries + backoff
- dedupe
- JSONL export

Pinterest can throttle aggressively once you scale beyond a few requests. ProxiesAPI helps you keep a consistent network layer (retries, rotation, higher success rates) while your parser stays the same.
Important note (what’s realistic)
Pinterest changes its markup frequently and uses heavy client-side rendering.
This tutorial focuses on a robust, best-effort HTML approach:
- It works when the response contains enough HTML / embedded JSON
- It may require tweaks when Pinterest ships UI changes
If you need guaranteed long-term stability, you typically move to:
- a browser pipeline (Playwright) with careful throttling, or
- a data partner / official API
But for many use cases (collecting inspiration pins from specific boards, quick monitoring, internal datasets), HTML + defensive parsing is still useful.
Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 lxml
We’ll use:
- requests for HTTP
- BeautifulSoup (with lxml) for parsing
ProxiesAPI request wrapper (recommended)
You’ll plug ProxiesAPI into a single fetch() function. Everything else (parsing, pagination, export) stays the same.
Below is a template you can adapt to your ProxiesAPI account.
import os
import time
import random
import requests

TIMEOUT = (15, 45)  # connect, read

# Put your key in env: export PROXIESAPI_KEY="..."
PROXIESAPI_KEY = os.getenv("PROXIESAPI_KEY")

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()

def fetch(url: str, *, use_proxiesapi: bool = True, max_retries: int = 5) -> str:
    """Fetch a URL with retries + exponential backoff.

    If you use ProxiesAPI, keep this function as the only place that knows about it.
    """
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            if use_proxiesapi:
                if not PROXIESAPI_KEY:
                    raise RuntimeError("Missing PROXIESAPI_KEY env var")
                # Example pattern: call ProxiesAPI with the upstream URL.
                # Replace endpoint/params with the exact ProxiesAPI format you use.
                r = session.get(
                    "https://api.proxiesapi.com",
                    params={
                        "auth_key": PROXIESAPI_KEY,
                        "url": url,
                        # Optional knobs (names depend on your ProxiesAPI plan):
                        # "country": "US",
                        # "render": "false",
                    },
                    timeout=TIMEOUT,
                    headers=HEADERS,
                )
            else:
                r = session.get(url, timeout=TIMEOUT, headers=HEADERS)
            r.raise_for_status()
            return r.text
        except Exception as e:
            last_err = e
            if attempt < max_retries:
                # Backoff with jitter; no sleep after the final attempt
                sleep_s = min(30, (2 ** (attempt - 1)) + random.random())
                time.sleep(sleep_s)
    raise RuntimeError(f"Failed to fetch after {max_retries} retries: {url}") from last_err
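For intuition, here is the delay schedule fetch() produces, shown in a standalone sketch with the random jitter term removed:

```python
def backoff_schedule(max_retries: int = 5) -> list[int]:
    # Same formula as fetch(): doubling per attempt, capped at 30 seconds.
    return [min(30, 2 ** (attempt - 1)) for attempt in range(1, max_retries + 1)]

print(backoff_schedule())   # [1, 2, 4, 8, 16]
print(backoff_schedule(7))  # the 30-second cap kicks in from attempt 6
```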
What Pinterest pages look like (what to parse)
Pinterest pages often contain:
- visible HTML with some pin cards
- embedded JSON blobs in <script> tags that contain richer pin data
We’ll implement two extraction strategies:
- HTML image + link scraping (fast, sometimes enough)
- Embedded JSON discovery (more stable when present)
Step 1: Parse pins from HTML (cards → images)
On many Pinterest pages, you can find images that look like:
https://i.pinimg.com/.../xxx.jpg
We’ll treat each candidate image as a “pin-ish” item and then try to locate a nearby link.
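One practical note: i.pinimg.com URLs usually embed a size segment like /236x/ or /736x/, and swapping it for /originals/ often returns the full-size image. This relies on an undocumented URL convention, so keep the original URL as a fallback:

```python
import re

def upgrade_pinimg_url(url: str) -> str:
    # Best-effort: replace the first size segment (e.g. /236x/) with /originals/.
    # Undocumented convention; the upgraded URL can 404, so keep the original too.
    return re.sub(r"/\d+x/", "/originals/", url, count=1)

print(upgrade_pinimg_url("https://i.pinimg.com/236x/ab/cd/ef/pin.jpg"))
# https://i.pinimg.com/originals/ab/cd/ef/pin.jpg
```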
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.pinterest.com"

def normalize_pin_url(href: str | None) -> str | None:
    if not href:
        return None
    if href.startswith("http"):
        return href
    return urljoin(BASE, href)

def parse_pins_from_html(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    pins = []
    # Pinterest images typically come from i.pinimg.com
    for img in soup.select('img[src*="i.pinimg.com"]'):
        src = img.get("src")
        alt = (img.get("alt") or "").strip() or None
        # Heuristic: closest anchor up the tree
        a = img.find_parent("a")
        href = a.get("href") if a else None
        pins.append({
            "title": alt,
            "image_url": src,
            "pin_url": normalize_pin_url(href),
        })
    # Dedupe by (pin_url or image_url)
    seen = set()
    out = []
    for p in pins:
        key = p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(p)
    return out
This alone can give you a usable dataset (title + image URL + pin URL).
But if you want outbound destination URLs (the link a pin points to), you’ll usually need embedded JSON.
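Since this dedupe loop reappears in the later extractors, you could factor it into a shared helper with the same key preference order (a small refactor sketch, not required by the tutorial):

```python
def dedupe_pins(pins: list[dict]) -> list[dict]:
    """Keep the first occurrence of each pin, keyed by id, pin_url, or image_url."""
    seen = set()
    out = []
    for p in pins:
        key = p.get("id") or p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue  # drop exact duplicates and keyless rows
        seen.add(key)
        out.append(p)
    return out
```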
Step 2: Extract embedded JSON (best-effort)
Pinterest often embeds JSON data in script tags.
We’ll:
- scan scripts
- find JSON-ish blobs
- parse objects that look like pins
Because this varies, we’ll implement a conservative extractor that doesn’t assume a single schema.
import json
import re

def iter_json_blobs(html: str):
    # Crude but effective: look for large JSON blobs in <script> tags
    for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        s = m.group(1).strip()
        if not s:
            continue
        # Quick filter
        if "{" not in s and "[" not in s:
            continue
        # Sometimes Pinterest assigns JSON to a variable; try to locate the first { ... } block.
        # This won't catch everything, but it avoids overfitting.
        start = s.find("{")
        if start == -1:
            start = s.find("[")
        if start == -1:
            continue
        candidate = s[start:]
        # Try JSON parse directly
        try:
            yield json.loads(candidate)
        except Exception:
            continue

def walk(obj):
    if isinstance(obj, dict):
        yield obj
        for v in obj.values():
            yield from walk(v)
    elif isinstance(obj, list):
        for it in obj:
            yield from walk(it)
def parse_pins_from_embedded_json(html: str) -> list[dict]:
    pins = []
    for blob in iter_json_blobs(html):
        for node in walk(blob):
            # Heuristic: nodes that look like pin objects may have an id + images
            pin_id = node.get("id") or node.get("pin_id")
            images = node.get("images") or node.get("image")
            # Try to locate an image URL in common shapes
            image_url = None
            if isinstance(images, dict):
                # Pinterest sometimes offers multiple sizes
                for key in ["orig", "original", "736x", "564x", "474x"]:
                    if key in images and isinstance(images[key], dict):
                        image_url = images[key].get("url")
                        if image_url:
                            break
                if not image_url:
                    # Fall back to any dict value with a url
                    for v in images.values():
                        if isinstance(v, dict) and v.get("url"):
                            image_url = v.get("url")
                            break
            title = node.get("title") or node.get("grid_title") or node.get("description")
            if not isinstance(title, str):
                title = None  # guard: these fields are occasionally non-string nodes
            link = node.get("link") or node.get("url")
            # Destination URL is often in fields like "link" or "destination_url"
            destination = node.get("destination_url") or node.get("outbound_link")
            if (pin_id or image_url) and (image_url or link):
                pins.append({
                    "id": str(pin_id) if pin_id else None,
                    "title": (title or "").strip() or None,
                    "image_url": image_url,
                    "pin_url": link if isinstance(link, str) else None,
                    "destination_url": destination if isinstance(destination, str) else None,
                })
    # Dedupe
    seen = set()
    out = []
    for p in pins:
        key = p.get("id") or p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(p)
    return out
You’ll notice we didn’t hardcode a single schema. That’s intentional.
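As a quick sanity check, here is the script-scanning idea run against a synthetic page. This is a standalone, simplified version of iter_json_blobs that returns only the first parseable blob; the JSON payload is made up for illustration:

```python
import json
import re

def first_json_blob(html: str):
    # Simplified iter_json_blobs: return the first <script> body that parses as JSON.
    for m in re.finditer(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        s = m.group(1).strip()
        start = s.find("{")
        if start == -1:
            continue
        try:
            return json.loads(s[start:])
        except Exception:
            continue
    return None

page = ('<html><script type="application/json">'
        '{"pin_id": "123", "images": {"orig": {"url": "https://i.pinimg.com/originals/a/b/c.jpg"}}}'
        '</script></html>')
blob = first_json_blob(page)
print(blob["images"]["orig"]["url"])
# https://i.pinimg.com/originals/a/b/c.jpg
```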
Step 3: Scrape Pinterest search results
Search URLs look like:
https://www.pinterest.com/search/pins/?q=kitchen%20design
Pinterest pagination can be complex. For a tutorial that stays maintainable, we’ll:
- fetch the first page
- parse pins from HTML + embedded JSON
- optionally attempt to fetch “more” by using ?rs=typed or additional parameters (best-effort)
from urllib.parse import urlencode

def build_search_url(query: str) -> str:
    qs = urlencode({"q": query})
    return f"https://www.pinterest.com/search/pins/?{qs}"

def scrape_search(query: str, *, pages: int = 1) -> list[dict]:
    all_pins = []
    seen = set()
    url = build_search_url(query)
    for page in range(1, pages + 1):
        html = fetch(url)
        batch = []
        batch.extend(parse_pins_from_embedded_json(html))
        batch.extend(parse_pins_from_html(html))
        for p in batch:
            key = p.get("id") or p.get("pin_url") or p.get("image_url")
            if not key or key in seen:
                continue
            seen.add(key)
            all_pins.append(p)
        print("page", page, "pins", len(batch), "total unique", len(all_pins))
        # Best-effort pagination: Pinterest often needs continuation tokens.
        # For a production pipeline, you'd capture their internal next-page JSON calls.
        # Here we stop after the first page unless you extend this section.
        break
    return all_pins
For many use cases, you can run the search repeatedly (different keywords) instead of deep-paginating one keyword.
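That keyword fan-out might look like the sketch below. The scrape_fn parameter would be scrape_search from above; the stub here stands in for it so the dedupe behavior is visible without network calls:

```python
def scrape_many(queries: list[str], scrape_fn) -> list[dict]:
    """Run scrape_fn (e.g. scrape_search) per query and dedupe across all results."""
    seen = set()
    out = []
    for q in queries:
        for p in scrape_fn(q):
            key = p.get("id") or p.get("pin_url") or p.get("image_url")
            if not key or key in seen:
                continue
            seen.add(key)
            out.append(p)
    return out

# Stub for illustration: every query "returns" the same pin id,
# so only the first occurrence survives the cross-query dedupe.
stub = lambda q: [{"id": "1", "title": q}]
print(scrape_many(["kitchen design", "kitchen ideas"], stub))
```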
Step 4: Scrape a Board URL
Board URLs are commonly:
https://www.pinterest.com/<username>/<board>/
Boards are a great target because the intent is clear: “pins in this collection.”
def scrape_board(board_url: str) -> list[dict]:
    html = fetch(board_url)
    pins = []
    pins.extend(parse_pins_from_embedded_json(html))
    pins.extend(parse_pins_from_html(html))
    # Board metadata (best-effort)
    soup = BeautifulSoup(html, "lxml")
    h1 = soup.select_one("h1")
    board_name = h1.get_text(" ", strip=True) if h1 else None
    for p in pins:
        p["board_url"] = board_url
        p["board_name"] = board_name
    # Dedupe
    seen = set()
    out = []
    for p in pins:
        key = p.get("id") or p.get("pin_url") or p.get("image_url")
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(p)
    return out
Export to JSONL
import json

def write_jsonl(path: str, rows: list[dict]):
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    pins = scrape_search("kitchen design", pages=1)
    write_jsonl("pinterest_search_kitchen_design.jsonl", pins)
    print("wrote", len(pins))

    # Example board (replace with a real, public board)
    # board_pins = scrape_board("https://www.pinterest.com/<user>/<board>/")
    # write_jsonl("pinterest_board.jsonl", board_pins)
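A matching reader makes it easy to load an export back for analysis (one dict per line, blank lines skipped):

```python
import json

def read_jsonl(path: str) -> list[dict]:
    # Inverse of write_jsonl: parse one JSON object per non-blank line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```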
QA checklist
- First page returns 20+ pins (varies)
- Image URLs are i.pinimg.com and load in a browser
- Pin URLs look like Pinterest URLs (not None)
- Dedupe reduces duplicates from HTML + JSON extraction overlap
- Your fetch layer uses timeouts + retries
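You can automate part of this checklist with a few assertions over the exported rows. This is a sketch, and the thresholds and host checks are assumptions to adjust for your use case:

```python
def validate_pins(pins: list[dict], min_count: int = 1) -> list[str]:
    """Return a list of human-readable QA problems (empty list = pass)."""
    problems = []
    if len(pins) < min_count:
        problems.append(f"only {len(pins)} pins (expected >= {min_count})")
    for p in pins:
        img = p.get("image_url")
        if img and "i.pinimg.com" not in img:
            problems.append(f"unexpected image host: {img}")
        pin_url = p.get("pin_url")
        if pin_url and "pinterest.com" not in pin_url:
            problems.append(f"unexpected pin URL: {pin_url}")
    return problems

ok_row = {"image_url": "https://i.pinimg.com/236x/a.jpg",
          "pin_url": "https://www.pinterest.com/pin/1/"}
print(validate_pins([ok_row]))  # []
```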
Where ProxiesAPI helps (honestly)
Pinterest is one of the sites where the network layer is your main pain:
- throttling increases with pagination
- intermittent 403/429 responses
- inconsistent HTML/JS payloads
ProxiesAPI helps keep fetches reliable while your parser stays focused on structure and data quality.
If you extend this guide, the next step is to capture the internal continuation requests (tokens) and implement deep pagination — ProxiesAPI makes that significantly less flaky.