Which Java HTML parsers have the following features?
- Fast
- Thread-safe
- Reliable and bug-free
- Parses HTML and XML
- Handles erroneous HTML
- Has a DOM implementation
- Supports HTML4, JavaScript, and CSS tags
- Relatively simple, object-oriented API
Which parser do you think is best?
Thank you.
Source: Tips4all
Check out Web Harvest. It's both a library you can use and a data extraction tool, which sounds like exactly what you want. You write XML script files that tell the scraper what information to extract and where to get it from. The provided GUI is very useful for quickly testing the scripts.
Check out the project's samples page to see if it's a good fit for what you are trying to do.
The best known are NekoHTML and JTidy.
NekoHTML is based on Xerces and provides a simple, adaptable SAXParser that implements the Java SE XMLReader interface.
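Since the parser implements the standard org.xml.sax.XMLReader interface, handler code written against that interface should carry over. Here is a minimal sketch, assuming you only want the href of each link. The class name LinkCollector and the sample markup are made up for illustration, and the JDK's built-in XML parser is used as a stand-in so the example runs without extra jars (unlike NekoHTML, it only accepts well-formed input).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library mentioned in this thread.
class LinkCollector {

    // Collects the href value of every <a> element reported by the given XMLReader.
    public static List<String> extractLinks(XMLReader reader, String markup) throws Exception {
        List<String> links = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Case-insensitive match, since an HTML parser may report tag names in another case.
                if ("a".equalsIgnoreCase(qName)) {
                    String href = attrs.getValue("href");
                    if (href != null) {
                        links.add(href);
                    }
                }
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's own SAX parser (well-formed input only).
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String page = "<html><body><a href=\"http://example.com/\">one</a>"
                + "<a href=\"/two\">two</a></body></html>";
        System.out.println(extractLinks(reader, page));
    }
}
```

With NekoHTML on the classpath you would presumably pass an instance of its org.cyberneko.html.parsers.SAXParser in place of the JDK reader; the handler itself should not need to change.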
JTidy is intended more for cleaning up your HTML into something XML-valid, but it is still very useful as a parser and can produce a DOM tree if needed.
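Once you have a DOM tree, you work with it through the standard org.w3c.dom interfaces regardless of which parser built it. A rough sketch of that, assuming you want each link's target and text: DomLinkLister and parseWellFormed are hypothetical names, and the tree is built here with the JDK's DocumentBuilder (well-formed input only) purely so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of JTidy or any other library in this thread.
class DomLinkLister {

    // Returns "href -> text" for every <a> element, in document order.
    public static List<String> linkTexts(Document doc) {
        List<String> out = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            out.add(a.getAttribute("href") + " -> " + a.getTextContent().trim());
        }
        return out;
    }

    // Stand-in for an HTML-aware parser: builds a DOM from well-formed markup only.
    public static Document parseWellFormed(String markup) throws Exception {
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWellFormed(
                "<html><body><a href=\"/x\">First</a><p><a href=\"/y\">Second</a></p></body></html>");
        for (String line : linkTexts(doc)) {
            System.out.println(line);
        }
    }
}
```

With JTidy, the Document would presumably come from its Tidy.parseDOM method instead of parseWellFormed, which is what lets it cope with messy real-world HTML.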
You could have a look at this list for other alternatives.
Another option could be to use Hpricot through JRuby.
Validator.nu's HTML parser, definitely. It's an implementation of the HTML5 parsing algorithm, and Gecko is in the process of replacing its own HTML parser with a C++ translation of this one.
Well, there aren't as many good HTML parsers in Java as you might need, but here are some alternatives:
http://java-source.net/open-source/html-parsers
Very few of them support JavaScript. I think you'll have to handle that part on your own using Rhino (http://www.mozilla.org/rhino/).
You probably want to look at something like running Mozilla in headless mode. Here is a link to get you started; I am sure you can use Google to find more information.
I think that HTML Cleaner is what you're looking for. See its announcement on TheServerSide to see how it compares to JTidy, TagSoup, and NekoHTML.
Apache Tika is the best choice. Apache has recently extracted many sub-projects out of its existing projects and made them public, and Tika is one of them: it was previously a component of Apache Lucene. Given Apache's support and reputation, and the widely used parent project Lucene, it should be a very good choice. It is also open source.
A brief introduction from the Apache Tika web site:
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
And the supported formats are:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format