Web Scraping with PHP: cURL + DOMDocument Tutorial (2026)
If you’ve been searching for PHP web scraping tutorials, here’s the straight path in 2026:
- fetch HTML reliably (timeouts, headers)
- parse it with a real HTML parser (DOMDocument)
- extract with XPath (more precise than regex)
- add retries + polite pacing
- scale the network layer (this is where ProxiesAPI helps)
This guide is hands-on: you’ll build a small scraper that grabs titles and links from a page, then expands to multiple pages.
Once your PHP scraper grows from one page to hundreds or thousands, stability becomes the problem. ProxiesAPI gives you a simple fetch URL so you can keep your PHP parsing code focused on extraction — not network weirdness.
What we’ll scrape (a simple target)
To keep the focus on PHP mechanics, we’ll scrape a “friendly” example page that’s server-rendered HTML:
https://news.ycombinator.com/ (Hacker News)
It has consistent markup and doesn’t require JavaScript.
Important: The exact same PHP technique works on most HTML pages. The difference is scale and reliability.
Step 1: Fetch HTML with cURL (the right way)
PHP’s cURL is the workhorse. Use:
- a real User-Agent
- connect + total timeouts
- follow redirects
- sane error handling
Create fetch.php:
<?php
declare(strict_types=1);
function fetch_html(string $url): string {
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
'Accept: text/html,application/xhtml+xml',
],
]);
$body = curl_exec($ch);
if ($body === false) {
$err = curl_error($ch);
curl_close($ch);
throw new RuntimeException("cURL error: {$err}");
}
$status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);
if ($status >= 400) {
throw new RuntimeException("HTTP {$status} for {$url}");
}
return $body;
}
$html = fetch_html('https://news.ycombinator.com/');
echo "bytes=" . strlen($html) . PHP_EOL;
echo substr($html, 0, 200) . PHP_EOL;
Run it:
php fetch.php
Typical output:
bytes=180000
<!doctype html>
<html lang="en" op="news">
<head>
<meta name="referrer" content="origin">
Step 2: Parse HTML with DOMDocument (and ignore warnings)
Real-world HTML is messy. DOMDocument::loadHTML() can emit warnings.
We’ll:
- suppress warnings during load
- set encoding
<?php
declare(strict_types=1);
function parse_dom(string $html): DOMDocument {
$dom = new DOMDocument();
// Many pages aren't perfect HTML; suppress warnings while parsing.
libxml_use_internal_errors(true);
// Ensure UTF-8 handling
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
libxml_clear_errors();
return $dom;
}
Step 3: Extract data with XPath
XPath is the killer feature in PHP scraping.
On Hacker News, titles are inside:
span.titleline > a
XPath for that is:
//span[@class='titleline']/a
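To see that selector-to-XPath mapping in isolation, here’s a tiny offline demo. The sample markup below is made up for illustration (it mimics Hacker News’s structure but isn’t real HN HTML); the XPath expression is the same one the full script uses.

```php
<?php
declare(strict_types=1);

// Assumed sample markup mimicking HN's span.titleline > a structure.
$sample = '<html><body>'
    . '<span class="titleline"><a href="https://example.com/post">Example post</a></span>'
    . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Same expression as in the full script below.
$nodes = $xpath->query("//span[@class='titleline']/a");

foreach ($nodes as $a) {
    echo trim($a->textContent) . ' -> ' . $a->getAttribute('href') . PHP_EOL;
}
// prints: Example post -> https://example.com/post
```

No network needed, which makes this a handy way to sanity-check an XPath expression before pointing it at a live page.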
Full script scrape_hn.php:
<?php
declare(strict_types=1);
function fetch_html(string $url): string {
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => [
'User-Agent: Mozilla/5.0 (compatible; ProxiesAPI-Guides/1.0)',
'Accept: text/html,application/xhtml+xml',
],
]);
$body = curl_exec($ch);
if ($body === false) {
$err = curl_error($ch);
curl_close($ch);
throw new RuntimeException("cURL error: {$err}");
}
$status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);
if ($status >= 400) {
throw new RuntimeException("HTTP {$status} for {$url}");
}
return $body;
}
function parse_dom(string $html): DOMDocument {
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
libxml_clear_errors();
return $dom;
}
$url = 'https://news.ycombinator.com/';
$html = fetch_html($url);
$dom = parse_dom($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//span[@class='titleline']/a");
$items = [];
foreach ($nodes as $a) {
/** @var DOMElement $a */
$title = trim($a->textContent);
$href = $a->getAttribute('href');
$items[] = [
'title' => $title,
'url' => $href,
];
}
echo "items=" . count($items) . PHP_EOL;
echo json_encode(array_slice($items, 0, 3), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . PHP_EOL;
Run:
php scrape_hn.php
Typical output:
items=30
[
{"title":"...","url":"https://..."},
{"title":"...","url":"https://..."},
{"title":"...","url":"item?id=..."}
]
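Note the third item: HN’s own stories use relative hrefs like item?id=.... If you want every URL absolute, a small normalizer helps. This helper (hn_absolute_url is my name, not part of the tutorial’s code) prefixes the site root when the href has no scheme:

```php
<?php
declare(strict_types=1);

// Hacker News emits relative hrefs like "item?id=123" for its own pages.
// Illustrative helper: prefix the site root when the href has no http(s) scheme.
function hn_absolute_url(string $href, string $base = 'https://news.ycombinator.com/'): string {
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    return $base . ltrim($href, '/');
}

echo hn_absolute_url('item?id=42') . PHP_EOL;       // https://news.ycombinator.com/item?id=42
echo hn_absolute_url('https://example.com/') . PHP_EOL; // unchanged
```

You’d apply it where the loop builds `$items`, e.g. `'url' => hn_absolute_url($href)`. For scraping arbitrary sites you’d want full RFC 3986 resolution, but for HN’s flat URL space this simple prefix is enough.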
Step 4: Add pagination (scrape multiple pages)
HN supports ?p=N.
<?php
declare(strict_types=1);
// Reuses fetch_html() and parse_dom() from the earlier steps
// (e.g. require them from a shared file).
function hn_url(int $page): string {
return $page <= 1 ? 'https://news.ycombinator.com/' : "https://news.ycombinator.com/?p={$page}";
}
for ($p = 1; $p <= 3; $p++) {
$url = hn_url($p);
$html = fetch_html($url);
$dom = parse_dom($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//span[@class='titleline']/a");
echo "page={$p} items=" . $nodes->length . PHP_EOL;
// polite sleep so you don't hammer the site
usleep(500000);
}
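The intro also promised retries; Step 4 only covers polite pacing. A minimal retry wrapper with exponential backoff could look like this sketch. The function name fetch_with_retries and its callable-based design are my additions, not part of the tutorial’s code; the callable would typically be `fn() => fetch_html($url)` from Step 1, which throws RuntimeException on failure.

```php
<?php
declare(strict_types=1);

// Retry a fetch callable with exponential backoff (sketch).
// $fetch is any callable returning HTML or throwing RuntimeException.
function fetch_with_retries(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500): string {
    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return $fetch();
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up after the last attempt
            }
            // 500ms, 1000ms, 2000ms, ... on top of normal polite pacing
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Demo with a fake flaky fetcher: fails twice, then succeeds.
$calls = 0;
$flaky = function () use (&$calls): string {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('simulated timeout');
    }
    return '<html>ok</html>';
};
echo fetch_with_retries($flaky, 5, 1) . PHP_EOL; // <html>ok</html>
```

Keeping the retry logic outside fetch_html() means the same wrapper works unchanged for direct fetches and for ProxiesAPI-wrapped URLs in Step 5.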
Step 5: Use ProxiesAPI from PHP
When you scale scraping, your hard problem isn’t XPath.
It’s stability:
- intermittent timeouts
- inconsistent responses
- connection resets
With ProxiesAPI, you fetch through a simple URL wrapper:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://news.ycombinator.com/" | head
In PHP, build the wrapped URL and then call your same fetch_html():
<?php
declare(strict_types=1);
function proxiesapi_wrap(string $targetUrl, string $apiKey): string {
$base = 'http://api.proxiesapi.com/';
$query = http_build_query([
'key' => $apiKey,
'url' => $targetUrl,
]);
return $base . '?' . $query;
}
$apiKey = 'API_KEY';
$target = 'https://news.ycombinator.com/';
$wrapped = proxiesapi_wrap($target, $apiKey);
$html = fetch_html($wrapped);
echo "bytes=" . strlen($html) . PHP_EOL;
Notice the win: your parser does not change.
You’re still doing DOMDocument + XPath. Only the fetch URL changes.
Common mistakes (PHP scraping)
1) Regex-based HTML parsing
Don’t. Use DOMDocument + XPath.
2) Missing timeouts
A single hung request can stall a whole batch job.
3) Not setting a User-Agent
Some servers treat “empty UA” requests as suspicious or low priority.
4) Not handling encoding
If you see weird characters, ensure you load HTML with UTF-8:
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
Where ProxiesAPI fits (honestly)
You can scrape many friendly sites directly.
But if you’re building “real” scraping infrastructure (multiple targets, frequent runs, more pages), you’ll care about stability and operational simplicity.
ProxiesAPI helps by giving you one consistent way to fetch content — while your PHP code stays focused on extraction.
Quick checklist
- cURL uses timeouts
- DOMDocument parsing works without noisy warnings
- XPath queries match the right elements
- pagination loop works
- ProxiesAPI wrapper swaps in without changing parsing code