Friday, June 8, 2012

How to parse and process HTML with PHP?


How can one parse HTML and extract information from it? What libraries exist for that purpose? What are their strengths and drawbacks?




This is a General Reference question for the tag



Source: Tips4all

14 comments:

  1. I prefer using one of the native XML extensions, like


    DOM or
    XMLReader.
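
    As a rough illustration of the native DOM extension, the sketch below loads a string and collects link targets. The sample markup and the variable names are made up for this example; a real page would come from a file or an HTTP response.

```php
<?php
// Hypothetical sample markup with an unclosed <p>, as often found in the wild.
$html = '<html><body><p>Intro <a href="/one">One</a> and <a href="/two">Two</a><p>Unclosed paragraph</body></html>';

// Collect libxml warnings about sloppy markup instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the href attribute of every <a> element.
$links = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print_r($links);
```

    The call to libxml_use_internal_errors() is a design choice: real-world HTML is rarely valid, and without it every parser warning ends up in the output.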


    If you prefer a 3rd party lib, I'd suggest not using SimpleHtmlDom, but rather a lib that actually uses DOM/libxml underneath instead of string parsing:


    phpQuery,
    Zend_Dom,
    QueryPath,
    FluentDom or
    fDOMDocument


    You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like


    html5lib


    Or use a WebService like


    YQL or
    ScraperWiki.


    If you want to spend some money, have a look at


    PHP Architects Guide to Webscraping with PHP


    Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged. The snippets you will usually find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Once the markup changes, the regex fails.

    You can write more reliable parsers, but writing a complete and reliable custom parser with Regular Expressions is a waste of time when the aforementioned libraries already exist and do a much better and likely faster job on this.
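
    To make the brittleness concrete, here is a small sketch. The pattern, the two renderings of the link and the helper names hrefsByRegex() and hrefsByDom() are all invented for this example. The regex only matches the first rendering, while the DOM extension handles both.

```php
<?php
// A typical narrow pattern: expects href as the only attribute, in double quotes.
$pattern = '/<a href="([^"]+)">/';

// Two hypothetical renderings of the same link.
$before = '<a href="/download">Get it</a>';
$after  = "<a class='btn' href='/download'>Get it</a>";

// Return every href the pattern captures.
function hrefsByRegex(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Return every href found by walking the parsed DOM tree.
function hrefsByDom(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

var_dump(hrefsByRegex($pattern, $after)); // the regex misses the changed markup
var_dump(hrefsByDom($after));             // the parser still finds the link
```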

    Also see Parsing Html The Cthulhu Way

  2. Try the Simple HTML Dom Parser:

    // Load the library (assumes simple_html_dom.php is on the include path)
    require_once 'simple_html_dom.php';

    // Create DOM from URL or file
    $html = file_get_html('http://www.example.com/');

    // Find all images
    foreach ($html->find('img') as $element) {
        echo $element->src . '<br>';
    }

    // Find all links
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }

  3. Why you shouldn't and when you should use regular expressions?

    First off, HTML cannot be properly parsed using regular expressions. Regexes can, however, extract data. Extracting is what they're made for. The major drawbacks of regex HTML extraction compared to proper SGML toolkits or basic XML parsers are its syntactic cumbersomeness and meager reliability.

    Consider that a somewhat reliable HTML extraction regex:

    <a\s+class="?playbutton\d?[^>]+id="(\d+)".+? <a\s+class="[\w\s]*title
    [\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?


    is way less readable than a simple phpQuery or QueryPath equivalent:

    $div->find(".stationcool a")->attr("title");


    There are, however, specific use cases where they can help. Most XML parsers cannot see HTML document comments (<!--), which are sometimes more useful anchors for extraction purposes. Occasionally regular expressions can save post-processing. And lastly, for extremely simple tasks like extracting <img src= URLs, they are in fact a viable tool. The speed advantage over SGML/XML parsers mostly only comes into play for these very basic extraction procedures.

    It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser methods.
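
    A minimal sketch of that two-step approach, assuming a page that really carries such comment markers. The sample page and variable names are hypothetical, and the native DOM extension stands in for whichever parser you prefer.

```php
<?php
// Hypothetical page with comment markers around the interesting part.
$page = <<<'HTML'
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<!--CONTENT-->
<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>
<!--END-->
<div id="footer"><a href="/imprint">Imprint</a></div>
</body></html>
HTML;

$links = [];

// The s modifier lets "." match newlines, so the snippet may span several lines.
if (preg_match('/<!--CONTENT-->(.+?)<!--END-->/s', $page, $match)) {
    // Hand only the pre-extracted snippet to the HTML parser.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($match[1]);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
}

print_r($links);
```

    The navigation and footer links never reach the parser, so no selector is needed to exclude them afterwards.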

    Note: I actually have an app where I employ XML parsing and regular expressions alternately. Just last week the PyQuery parsing broke, and the regex still worked. Yes, weird, and I can't explain it myself. But so it happened.
    So please don't vote real-world considerations down, just because it doesn't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.

  4. phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

    Examples for QueryPath

    Basically, you first create a queryable DOM tree from an HTML string:

    $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL


    The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

    $qp->find("div.classname")->children()->...;

    foreach ($qp->find("p img") as $img) {
        print qp($img)->attr("src");
    }


    Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which are sometimes faster. Typical jQuery methods like ->children() and ->text() and particularly ->attr() also simplify extracting the right HTML snippets. (And they already have their SGML entities decoded.)

    $qp->xpath("//div/p[1]"); // get first paragraph in a div


    QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

    $qp->find("a[target=_blank]")->toggleClass("usability-blunder");



    phpQuery or QueryPath?

    Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because it has fewer features overall).
    For further information on the differences, see this comparison:
    http://www.tagbytag.org/articles/phpquery-vs-querypath

    And here's a comprehensive QueryPath introduction: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP

    Advantages


    Simplicity and Reliability
    Simple to use alternatives ->find("a img, a object, div a")
    Proper data unescaping (in comparison to regular expression grepping)
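
    The unescaping point can be seen without installing anything. The sketch below uses PHP's native DOM extension rather than QueryPath itself, on the assumption that a DOM/libxml-based library returns the same decoded values. The markup and variable names are made up for this example.

```php
<?php
// Hypothetical markup; the ampersand in the URL is escaped as &amp;, as valid HTML requires.
$html = '<a href="search.php?q=php&amp;page=2">Tom &amp; Jerry</a>';

// Regex extraction returns the raw, still-escaped source text.
preg_match('/href="([^"]+)"/', $html, $m);
$rawHref = $m[1];

// The DOM parser decodes entities while building the tree.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$anchor  = $dom->getElementsByTagName('a')->item(0);
$domHref = $anchor->getAttribute('href');
$domText = $anchor->textContent;

echo $rawHref, "\n"; // still contains &amp;
echo $domHref, "\n"; // contains a plain &
echo $domText, "\n";
```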

  5. One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
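
    A hedged sketch of the Tidy route, assuming PHP's tidy extension is installed (the code skips the work if it is not). The sloppy sample markup and the variable names are invented. The 'output-xhtml' and 'numeric-entities' settings are Tidy configuration options, chosen here so that a strict XML parser can read the result.

```php
<?php
// Hypothetical sloppy markup: unclosed <p> and <li> tags, no doctype.
$sloppy = '<title>Demo</title><p>First<p>Second <a href="/x">link</a><ul><li>one<li>two</ul>';

$hrefs = null; // stays null when the tidy extension is not installed

if (extension_loaded('tidy')) {
    $xhtml = tidy_repair_string($sloppy, [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // avoid named entities an XML parser may not know
    ], 'utf8');

    // The repaired markup is well-formed, so a strict XML parser accepts it.
    $dom = new DOMDocument();
    $dom->loadXML($xhtml);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
}

var_dump($hrefs);
```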

    But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.

  6. This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.

  7. This sounds like a good task description of W3C XPath technology. It's easy to express queries like "return all href attributes in img tags that are nested in <foo><bar><baz> elements." Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file you should be able to use a command line version of XPath.
    For a quick intro, see http://en.wikipedia.org/wiki/XPath.
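
    In PHP, XPath is available through the DOMXPath class of the DOM extension. The sketch below runs a query in the spirit of the one described above. The sample document is invented to match it, and it is loaded as XML so that the made-up foo/bar/baz elements are accepted as-is.

```php
<?php
// Hypothetical document mirroring the query described above.
$xml = <<<'XML'
<root>
  <foo><bar><baz>
    <img href="inside-1.png"/>
    <div><img href="inside-2.png"/></div>
  </baz></bar></foo>
  <img href="outside.png"/>
</root>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// All href attributes of img elements nested anywhere below foo/bar/baz.
$hrefs = [];
foreach ($xpath->query('//foo/bar/baz//img/@href') as $attr) {
    $hrefs[] = $attr->value;
}

print_r($hrefs);
```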

  8. For 1a and 2: I would vote for the new Symfony Component class DomCrawler ( http://github.com/symfony/symfony/tree/master/src/Symfony/Component/DomCrawler/ ).
    This class allows queries similar to CSS Selectors. Take a look at this presentation for real-world examples: http://www.slideshare.net/fabpot/news-of-the-symfony2-world.

    The component is designed to work standalone and can be used without Symfony.

    The only drawback is that it will only work with PHP 5.3 or newer.

  9. We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good at what they were created for, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle invalid HTML/XHTML structures, which would fail if loaded via most of the parsers.

  10. Third-party alternatives to SimpleHtmlDom that use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.

  11. With PHP I would advise you to use the Simple HTML Dom Parser. The best way to learn more about it is to look for samples on the ScraperWiki website.

  12. Yes, you can use simple_html_dom for this purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too vulnerable. It does the basic job, but I wouldn't recommend it anyway.

    I have never used curl for this purpose, but what I have learned is that curl can do the job much more efficiently and is much more solid.

    Kindly check out this link: http://spyderwebtech.wordpress.com/2008/08/07/scraping-websites-with-curl/

  13. QueryPath is good, but be careful of "tracking state": if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.

    What it means is that each call on the result set modifies the result set in the object. It's not chainable like in jQuery, where each link is a new set; you have a single set, which holds the results of your query, and each function call modifies that single set.

    In order to get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that way it mirrors what happens in jQuery much more closely.

    $results = qp("div p");
    $forename = $results->find("input[name='forename']");

    "$results" now contains the result set for "input[name='forename']", NOT the original query "div p". This tripped me up a lot. What I found was that QueryPath tracks the filters and finds and everything else that modifies your results, and stores them in the object. You need to do this instead:

    $forename = $results->branch()->find("input[name='forename']");

    Then $results won't be modified and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but it's basically like this from what I've found.

  14. Try it once..

    <html>
    <head>
    <script type="text/javascript">
    function showRSS(str)
    {
        if (str.length==0)
        {
            document.getElementById("rssOutput").innerHTML="";
            return;
        }
        if (window.XMLHttpRequest)
        {// code for IE7+, Firefox, Chrome, Opera, Safari
            xmlhttp=new XMLHttpRequest();
        }
        else
        {// code for IE6, IE5
            xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
        }
        xmlhttp.onreadystatechange=function()
        {
            if (xmlhttp.readyState==4 && xmlhttp.status==200)
            {
                document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
            }
        }
        xmlhttp.open("GET","getrss.php?q="+str,true);
        xmlhttp.send();
    }
    </script>
    </head>
    <body>

    <form>
        <select onchange="showRSS(this.value)">
            <option value="">Select an RSS-feed:</option>
            <option value="Google">Street Easy</option>
            <option value="MSNBC">MSNBC News</option>
        </select>
    </form>
    <br />
    <div id="rssOutput">RSS-feed will be listed here...</div>
    </body>
    </html>
