Ccna final exam - java, php, javascript, ios, cshap all in one. This is a collaboratively edited question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.
Friday, June 8, 2012
How to parse and process HTML with PHP?
How can one parse HTML and extract information from it? What libraries exist for that purpose? What are their strengths and drawbacks?
I prefer using one of the native XML extensions, like
DOM or XMLReader.
If you prefer a 3rd party lib, I'd suggest not to use SimpleHtmlDom, but a lib that actually uses DOM/libxml underneath instead of String Parsing:
phpQuery, Zend_Dom, QueryPath, FluentDom or fDOMDocument
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like
html5lib
Or use a WebService like
YQL or ScraperWiki.
If you want to spend some money, have a look at
PHP Architects Guide to Webscraping with PHP
Last and least recommended, you can extract data from HTML with Regular Expressions. In general using Regular Expressions on HTML is discouraged. The snippets you will usually find on the web to match markup are brittle. In most cases they are only working for a very particular piece of HTML. Once the markup changes, the Regex fails.
You can write more reliable parsers, but writing a complete and reliable custom parser with Regular Expressions is a waste of time when the aforementioned libraries already exist and do a much better and likely faster job on this.
One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.
we have created quite a few crawlers for our needs before. at the end of the day, it is usually simple regular expressions that do the thing best. while libraries listed above are good for the reason they are created, if you know what you are looking for, regular expressions is more safe way to go, as you can handle also non-valid html/xhtml structures, which would fail, if loaded via most of the parsers.
QueryPath is good, but be careful of "tracking state" cause if you didnt realise what it means, it can mean you waste a lot of debugging time trying to find out what happened and why the code doesn't work.
what it means is that each call on the result set modifies the result set in the object, it's not chainable like in jquery where each link is a new set, you have a single set which is the results from your query and each function call modifies that single set.
in order to get jquery-like behaviour, you need to branch before you do a filter/modify like operation, that means it'll mirror what happens in jquery much more closely.
"$results" now contains the result set for "input[name='forename']" NOT the original query "div p" this tripped me up a lot, what I found was that QueryPath tracks the filters and finds and everything which modifies your results and stores them in the object. you need to do this instead
then $results won't be modified and you can reuse the result set again and again, perhaps somebody with much more knowledge can clear this up a bit, but it's basically like this from what I've found.
I prefer using one of the native XML extensions, like
ReplyDeleteDOM or
XMLReader.
If you prefer a 3rd party lib, I'd suggest not to use SimpleHtmlDom, but a lib that actually uses DOM/libxml underneath instead of String Parsing:
phpQuery,
Zend_Dom,
QueryPath,
FluentDom or
fDOMDocument
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like
html5lib
Or use a WebService like
YQL or
ScraperWiki.
If you want to spend some money, have a look at
PHP Architects Guide to Webscraping with PHP
Last and least recommended, you can extract data from HTML with Regular Expressions. In general using Regular Expressions on HTML is discouraged. The snippets you will usually find on the web to match markup are brittle. In most cases they are only working for a very particular piece of HTML. Once the markup changes, the Regex fails.
You can write more reliable parsers, but writing a complete and reliable custom parser with Regular Expressions is a waste of time when the aforementioned libraries already exist and do a much better and likely faster job on this.
Also see Parsing Html The Cthulhu Way
Try the Simple HTML Dom Parser:
ReplyDelete// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
ReplyDeleteBut to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.
This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.
ReplyDeletewe have created quite a few crawlers for our needs before. at the end of the day, it is usually simple regular expressions that do the thing best. while libraries listed above are good for the reason they are created, if you know what you are looking for, regular expressions is more safe way to go, as you can handle also non-valid html/xhtml structures, which would fail, if loaded via most of the parsers.
ReplyDelete1.Third party alternatives to SimpleHtmlDom that use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
ReplyDeleteWith PHP I would advise you to use the Simple HTML Dom Parser, the best way to learn more about it is to look for samples on the ScraperWiki website.
ReplyDeleteQueryPath is good, but be careful of "tracking state" cause if you didnt realise what it means, it can mean you waste a lot of debugging time trying to find out what happened and why the code doesn't work.
ReplyDeletewhat it means is that each call on the result set modifies the result set in the object, it's not chainable like in jquery where each link is a new set, you have a single set which is the results from your query and each function call modifies that single set.
in order to get jquery-like behaviour, you need to branch before you do a filter/modify like operation, that means it'll mirror what happens in jquery much more closely.
$results = qp("div p");
$forename = $results->find("input[name='forename']");
"$results" now contains the result set for "input[name='forename']" NOT the original query "div p" this tripped me up a lot, what I found was that QueryPath tracks the filters and finds and everything which modifies your results and stores them in the object. you need to do this instead
$forename = $results->branch()->find("input[name='forname']")
then $results won't be modified and you can reuse the result set again and again, perhaps somebody with much more knowledge can clear this up a bit, but it's basically like this from what I've found.