Web Scraping with PHP: cURL + DOMDocument Tutorial (2026)
If you’ve been searching for PHP web scraping tutorials, here’s the straight path in 2026:
- fetch HTML reliably (timeouts, headers)
- parse it with a real HTML parser (DOMDocument)
- extract with XPath (more precise than regex)
- add retries + polite pacing
- scale the network layer (this is where ProxiesAPI helps)
This guide is hands-on: you’ll build a small scraper that grabs titles and links from a page, then expands to multiple pages.
Once your PHP scraper grows from one page to hundreds or thousands, stability becomes the problem. ProxiesAPI gives you a simple fetch URL so you can keep your PHP parsing code focused on extraction — not network weirdness.
What we’ll scrape (a simple target)
To keep the focus on PHP mechanics, we’ll scrape a “friendly” example page that’s server-rendered HTML:
https://news.ycombinator.com/ (Hacker News)
It has consistent markup and doesn’t require JavaScript.
Important: The exact same PHP technique works on most HTML pages. The difference is scale and reliability.
Step 1: Fetch HTML with cURL (the right way)
PHP’s cURL is the workhorse. Use:
- a real User-Agent
- connect + total timeouts
- follow redirects
- sane error handling
Create fetch.php:
<?php
declare(strict_types=1);
function fetch_html(string $url): string {
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
'Accept: text/html,application/xhtml+xml',
],
]);
$body = curl_exec($ch);
if ($body === false) {
$err = curl_error($ch);
curl_close($ch);
throw new RuntimeException("cURL error: {$err}");
}
$status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);
if ($status >= 400) {
throw new RuntimeException("HTTP {$status} for {$url}");
}
return $body;
}
$html = fetch_html('https://news.ycombinator.com/');
echo "bytes=" . strlen($html) . PHP_EOL;
echo substr($html, 0, 200) . PHP_EOL;
Run it:
php fetch.php
Typical output:
bytes=180000
<!doctype html>
<html lang="en" op="news">
<head>
<meta name="referrer" content="origin">
Step 2: Parse HTML with DOMDocument (and ignore warnings)
Real-world HTML is messy. DOMDocument::loadHTML() can emit warnings.
We’ll:
- suppress warnings during load
- set encoding
<?php
declare(strict_types=1);
function parse_dom(string $html): DOMDocument {
$dom = new DOMDocument();
// Many pages aren't perfect HTML; suppress warnings while parsing.
libxml_use_internal_errors(true);
// Ensure UTF-8 handling
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
libxml_clear_errors();
return $dom;
}
Step 3: Extract data with XPath
XPath is the killer feature in PHP scraping.
On Hacker News, titles are inside:
span.titleline > a
XPath for that is:
//span[@class='titleline']/a
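To see that selector-to-XPath mapping in isolation, here’s a tiny offline demo. The sample markup below is made up for illustration (it mimics Hacker News’s structure but isn’t real HN HTML); the XPath expression is the same one the full script uses.

```php
<?php
declare(strict_types=1);

// Assumed sample markup mimicking HN's span.titleline > a structure.
$sample = '<html><body>'
    . '<span class="titleline"><a href="https://example.com/post">Example post</a></span>'
    . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Same expression as in the full script below.
$nodes = $xpath->query("//span[@class='titleline']/a");

foreach ($nodes as $a) {
    echo trim($a->textContent) . ' -> ' . $a->getAttribute('href') . PHP_EOL;
}
// prints: Example post -> https://example.com/post
```

No network needed, which makes this a handy way to sanity-check an XPath expression before pointing it at a live page.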
Full script scrape_hn.php:
<?php
declare(strict_types=1);
function fetch_html(string $url): string {
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
'Accept: text/html,application/xhtml+xml',
],
]);
$body = curl_exec($ch);
if ($body === false) {
$err = curl_error($ch);
curl_close($ch);
throw new RuntimeException("cURL error: {$err}");
}
$status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);
if ($status >= 400) {
throw new RuntimeException("HTTP {$status} for {$url}");
}
return $body;
}
function parse_dom(string $html): DOMDocument {
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
libxml_clear_errors();
return $dom;
}
$url = 'https://news.ycombinator.com/';
$html = fetch_html($url);
$dom = parse_dom($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//span[@class='titleline']/a");
$items = [];
foreach ($nodes as $a) {
/** @var DOMElement $a */
$title = trim($a->textContent);
$href = $a->getAttribute('href');
$items[] = [
'title' => $title,
'url' => $href,
];
}
echo "items=" . count($items) . PHP_EOL;
echo json_encode(array_slice($items, 0, 3), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . PHP_EOL;
Run:
php scrape_hn.php
Typical output:
items=30
[
{"title":"...","url":"https://..."},
{"title":"...","url":"https://..."},
{"title":"...","url":"item?id=..."}
]
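Note the third item: HN’s own stories use relative hrefs like item?id=.... If you want every URL absolute, a small normalizer helps. This helper (hn_absolute_url is my name, not part of the tutorial’s code) prefixes the site root when the href has no scheme:

```php
<?php
declare(strict_types=1);

// Hacker News emits relative hrefs like "item?id=123" for its own pages.
// Illustrative helper: prefix the site root when the href has no http(s) scheme.
function hn_absolute_url(string $href, string $base = 'https://news.ycombinator.com/'): string {
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    return $base . ltrim($href, '/');
}

echo hn_absolute_url('item?id=42') . PHP_EOL;       // https://news.ycombinator.com/item?id=42
echo hn_absolute_url('https://example.com/') . PHP_EOL; // unchanged
```

You’d apply it where the loop builds `$items`, e.g. `'url' => hn_absolute_url($href)`. For scraping arbitrary sites you’d want full RFC 3986 resolution, but for HN’s flat URL space this simple prefix is enough.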
Step 4: Add pagination (scrape multiple pages)
HN supports ?p=N.
<?php
declare(strict_types=1);
// Reuses fetch_html() and parse_dom() from the earlier steps
// (e.g. require them from a shared file).
function hn_url(int $page): string {
return $page <= 1 ? 'https://news.ycombinator.com/' : "https://news.ycombinator.com/?p={$page}";
}
for ($p = 1; $p <= 3; $p++) {
$url = hn_url($p);
$html = fetch_html($url);
$dom = parse_dom($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//span[@class='titleline']/a");
echo "page={$p} items=" . $nodes->length . PHP_EOL;
// polite sleep so you don't hammer the site
usleep(500000);
}
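The intro also promised retries; Step 4 only covers polite pacing. A minimal retry wrapper with exponential backoff could look like this sketch. The function name fetch_with_retries and its callable-based design are my additions, not part of the tutorial’s code; the callable would typically be `fn() => fetch_html($url)` from Step 1, which throws RuntimeException on failure.

```php
<?php
declare(strict_types=1);

// Retry a fetch callable with exponential backoff (sketch).
// $fetch is any callable returning HTML or throwing RuntimeException.
function fetch_with_retries(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500): string {
    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return $fetch();
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up after the last attempt
            }
            // 500ms, 1000ms, 2000ms, ... on top of normal polite pacing
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Demo with a fake flaky fetcher: fails twice, then succeeds.
$calls = 0;
$flaky = function () use (&$calls): string {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('simulated timeout');
    }
    return '<html>ok</html>';
};
echo fetch_with_retries($flaky, 5, 1) . PHP_EOL; // <html>ok</html>
```

Keeping the retry logic outside fetch_html() means the same wrapper works unchanged for direct fetches and for ProxiesAPI-wrapped URLs in Step 5.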
Step 5: Use ProxiesAPI from PHP
When you scale scraping, your hard problem isn’t XPath.
It’s stability:
- intermittent timeouts
- inconsistent responses
- connection resets
With ProxiesAPI, you fetch through a simple URL wrapper:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://news.ycombinator.com/" | head
In PHP, build the wrapped URL and then call your same fetch_html():
<?php
declare(strict_types=1);
function proxiesapi_wrap(string $targetUrl, string $apiKey): string {
$base = 'http://api.proxiesapi.com/';
$query = http_build_query([
'key' => $apiKey,
'url' => $targetUrl,
]);
return $base . '?' . $query;
}
$apiKey = 'API_KEY';
$target = 'https://news.ycombinator.com/';
$wrapped = proxiesapi_wrap($target, $apiKey);
$html = fetch_html($wrapped);
echo "bytes=" . strlen($html) . PHP_EOL;
Notice the win: your parser does not change.
You’re still doing DOMDocument + XPath. Only the fetch URL changes.
Common mistakes (PHP scraping)
1) Regex-based HTML parsing
Don’t. Use DOMDocument + XPath.
2) Missing timeouts
A single hung request can stall a whole batch job.
3) Not setting a User-Agent
Some servers treat “empty UA” requests as suspicious or low priority.
4) Not handling encoding
If you see weird characters, ensure you load HTML with UTF-8:
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
Where ProxiesAPI fits (honestly)
You can scrape many friendly sites directly.
But if you’re building “real” scraping infrastructure (multiple targets, frequent runs, more pages), you’ll care about stability and operational simplicity.
ProxiesAPI helps by giving you one consistent way to fetch content — while your PHP code stays focused on extraction.
Quick checklist
- cURL uses timeouts
- DOMDocument parsing works without noisy warnings
- XPath queries match the right elements
- pagination loop works
- ProxiesAPI wrapper swaps in without changing parsing code