With the PHP DOM extension, parsing HTML is as straightforward as working with the DOM in JavaScript. This post demonstrates how to use DOMDocument and DOMXPath to extract the data we are interested in from an ordinary HTML file, one that is not strictly structured as XML.

The sample file looks like this:

<!DOCTYPE html>
<html>
    <head>
        <title>王小五 - Profile</title>
        <meta charset="utf-8" />
        <link href="/css/profile.css" rel="stylesheet" type="text/css"/>
    </head>
    <body>
        <div id="wrapper">
            <div id="header">
                <h1>王小五</h1>
            </div>
            <div id="content">
                <div id="profile-box">
                    <div id="profile-img">
                        <img id="profile-img" src="/profile_img.php?uid=32153" alt="" />
                    </div>
                    <ul class="ulist">
                        <li>
                            <span class="field-name">Age:</span> 
                            <span class="field-value">31</span>
                        </li>
                        <li>
                            <span class="field-name">Gender:</span> 
                            <span class="field-value">Female</span>
                        </li>
                        <li>
                            <span class="field-name">Location:</span> 
                            <span class="field-value">Guangzhou, China</span>
                        </li>
                    </ul>
                </div>
            </div>
            <div id="footer">
                <div class="x13">© 2012 cctv.</div>
            </div>
        </div>
    </body>
</html>

1. Load HTML

The simplest code looks like this:

$html = file_get_contents('data.html');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

Notice that two elements in the HTML share id="profile-img", so the page is not valid HTML. When you run the code, you will encounter the following PHP warning:

PHP Warning:  DOMDocument::loadHTML(): ID profile-img already defined in Entity, line: 16 in /home/james/projects/mixedlab/linux/php/xml/dom/demo/demo.php on line 6

Most of the time we would like to ignore such warnings, because we are unlikely to be able to change the source HTML file, which may have been generated by another PHP script written by a negligent programmer. Fortunately, this problem can be solved with just a little additional work:

libxml_use_internal_errors(true);

$html = file_get_contents('data.html');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

libxml_clear_errors();

Setting libxml_use_internal_errors(true) suppresses the warnings, but the errors are still recorded, so you can retrieve them with libxml_get_errors() if you want to do more than just ignore them.
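
If you would rather inspect the collected problems than discard them, libxml_get_errors() returns them as LibXMLError objects until libxml_clear_errors() empties the buffer. Here is a minimal sketch; it uses a short inline HTML string with a duplicated id instead of data.html, and the string and variable names are only for illustration:

```php
// Collect libxml parse errors instead of letting them surface as PHP warnings.
libxml_use_internal_errors(true);

// Illustrative input: two elements share the same id, as in the sample file.
$html = '<html><body><div id="a"></div><img id="a" src="x.png" alt=""></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError with level, code, line and message properties.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Empty the buffer so the errors do not pile up across later loads.
libxml_clear_errors();

foreach ($messages as $message) {
    echo $message . "\n";
}
```

Which messages show up depends on the libxml version installed, so treat the output as diagnostic information rather than something to match exactly.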

Another problem we need to solve is the file encoding. The demo HTML is encoded in UTF-8, as its meta tag shows, but DOMDocument does not recognize that declaration. DOMDocument only honors the http-equiv form of the meta tag, so we have to preprocess the HTML before passing it to loadHTML():

$html = str_replace(
    '<meta charset="utf-8" />',
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
    $html
);
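
The str_replace() call only matches one exact spelling of the tag. A slightly more tolerant variant could use a regular expression instead. This is only a sketch: normalizeMetaCharset is a hypothetical helper name, and the pattern assumes that charset is the only attribute on the meta tag.

```php
// Hypothetical helper: rewrite <meta charset="..."> into the http-equiv form.
// Assumes charset is the tag's only attribute; other layouts are left untouched.
function normalizeMetaCharset($html)
{
    return preg_replace(
        '#<meta\s+charset\s*=\s*["\']?([\w-]+)["\']?\s*/?>#i',
        '<meta http-equiv="Content-Type" content="text/html; charset=$1" />',
        $html
    );
}

// Illustrative input; this source file itself must be saved as UTF-8.
$html = '<html><head><meta charset="utf-8" /></head>'
      . '<body><h1>王小五</h1></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(normalizeMetaCharset($html));
libxml_clear_errors();

$xpath = new DOMXPath($dom);
echo $xpath->query('//h1')->item(0)->nodeValue . "\n";
```

With the http-equiv declaration in place, the heading text should come back as the same UTF-8 string that went in.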

2. Extract Data

DOMDocument provides several methods for locating elements, such as DOMDocument::getElementsByTagName() and DOMDocument::getElementById(), but I find them less useful and more cumbersome than XPath.

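DOMXPath::query(string $expression [, DOMNode $contextnode [, bool $registerNodeNS = true ]]) returns a DOMNodeList object whenever $expression is well formed and $contextnode is valid or NULL (not set); the list is simply empty when nothing matches. If the expression is malformed or the context node is invalid, it returns false instead.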
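
Because the result can be false, an empty list, or a list with matches, it can be convenient to wrap the check once. The following is a minimal sketch, not part of the DOM extension: firstNodeValue is a hypothetical helper name and the inline markup is only for illustration.

```php
// Hypothetical helper: return the text of the first match, or null when the
// query fails (false) or matches nothing (empty DOMNodeList).
function firstNodeValue(DOMXPath $xpath, $expression, $contextNode = null)
{
    $nodes = $contextNode === null
        ? $xpath->query($expression)
        : $xpath->query($expression, $contextNode);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div id="header"><h1>Profile</h1></div></body></html>');
$xpath = new DOMXPath($dom);

var_dump(firstNodeValue($xpath, '//*[@id="header"]/h1'));  // a matching node
var_dump(firstNodeValue($xpath, '//*[@id="missing"]/h1')); // no match
```

Returning null for both failure cases keeps the calling code short; if you need to tell a malformed expression apart from an empty result, check for false separately.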

Extract the user name from the header (h1):

$nodes = $xpath->query('//*[@id="header"]/h1');
$name = $nodes->item(0)->nodeValue;
echo "Name: " . $name . "\n";

In most cases the HTML structure is more complicated than in this demo, and $contextnode helps us restrict the search to one section and keep the XPath query concise.

Extract a sub-node by specifying the context node (for demonstration only in this case):

$nodes = $xpath->query('//div[@id="header"]');
$headerNode = $nodes->item(0);
$nodes = $xpath->query('h1', $headerNode);
$name = $nodes->item(0)->nodeValue;
echo "Name: " . $name . "\n";
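
One detail is worth keeping in mind when passing a context node: an expression that starts with // is still evaluated from the document root, so the context node is effectively ignored, while .// searches only below the context node. Here is a small sketch with an inline document (the markup is only for illustration):

```php
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body>'
    . '<div id="header"><h1>Title</h1></div>'
    . '<div id="content"><span>a</span><span>b</span></div>'
    . '</body></html>'
);
$xpath = new DOMXPath($dom);
$headerNode = $xpath->query('//div[@id="header"]')->item(0);

// Absolute path: finds both spans, although neither is inside #header.
$absolute = $xpath->query('//span', $headerNode);

// Relative path: looks only at descendants of #header and finds none.
$relative = $xpath->query('.//span', $headerNode);

echo "//span:  " . $absolute->length . "\n";
echo ".//span: " . $relative->length . "\n";
```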

Extract user properties:

$nodes = $xpath->query('//*[@id="profile-box"]/ul/li');
foreach ($nodes as $node) {
    $childNodes = $xpath->query('span', $node);
    $key = $childNodes->item(0)->nodeValue;
    $value = $childNodes->item(1)->nodeValue;
    echo $key . ' ' . $value . "\n";
}
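
If the properties are needed later rather than just printed, the same loop can fill an associative array. This sketch assumes every li follows the field-name / field-value layout of the sample file; the inline HTML and the $profile variable are only for illustration.

```php
$html = '<html><body><div id="profile-box"><ul class="ulist">'
      . '<li><span class="field-name">Age:</span> <span class="field-value">31</span></li>'
      . '<li><span class="field-name">Gender:</span> <span class="field-value">Female</span></li>'
      . '</ul></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$profile = array();
foreach ($xpath->query('//*[@id="profile-box"]/ul/li') as $node) {
    $name = $xpath->query('span[@class="field-name"]', $node);
    $value = $xpath->query('span[@class="field-value"]', $node);
    if ($name->length === 0 || $value->length === 0) {
        continue; // skip list items that do not follow the name/value layout
    }
    // Drop the trailing colon from labels such as "Age:".
    $key = rtrim(trim($name->item(0)->nodeValue), ':');
    $profile[$key] = trim($value->item(0)->nodeValue);
}

print_r($profile);
```

Selecting the spans by class instead of by position makes the loop a little less sensitive to extra markup inside a list item.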

3. innerHTML function

Here is a useful function for extracting the inner HTML of a node:

function innerHTML($node)
{
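    // Stub document whose http-equiv meta makes DOMDocument treat the content as UTF-8.
    $meta = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head></html>';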
    $dom = new DOMDocument();
    $dom->loadHTML($meta);
    $dom->appendChild($dom->importNode($node, true));
    $html = preg_replace(
        '#^.*</html><' . $node->nodeName . '[^>]*>(.*)</' . $node->nodeName . '>.*$#s', 
        '\1', $dom->saveHTML());
    return $html;
}
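
The function above relies on a regular expression over the serialized document. Since PHP 5.3.6, DOMDocument::saveHTML() also accepts a node argument, so a simpler variant can serialize each child of the node directly. This is a sketch of that alternative rather than the function used in the demo, and innerHTMLFromChildren is a hypothetical name:

```php
// Hypothetical alternative: concatenate the serialized children of $node.
function innerHTMLFromChildren(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just that child and its descendants.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div id="box"><b>bold</b> text</div></body></html>');
$xpath = new DOMXPath($dom);
$box = $xpath->query('//*[@id="box"]')->item(0);

echo innerHTMLFromChildren($box) . "\n";
```

This version needs no temporary document and no regular expression, so it also copes with nodes that have no children.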

4. Reference

PHP DOM manual: http://php.net/manual/en/book.dom.php

Full Demo for this Post: https://github.com/fwso/mixedlab/tree/master/linux/php/xml/dom/demo