Java XPath (Apache JAXP implementation) performance

Thursday, May 31, 2012


NOTE: If you experience this issue as well, please upvote it on Apache JIRA:

https://issues.apache.org/jira/browse/XALANJ-2540

I have come to the astonishing conclusion that this:

Element e = (Element) document.getElementsByTagName("SomeElementName").item(0);
String result = e.getTextContent();

seems to be an incredible 100x faster than this:

// Accounts for 30%, can be cached
XPathFactory factory = XPathFactory.newInstance();

// Negligible
XPath xpath = factory.newXPath();

// Negligible
XPathExpression expression = xpath.compile("//SomeElementName");

// Accounts for 70%
String result = (String) expression.evaluate(document, XPathConstants.STRING);

I'm using the JVM's default implementation of JAXP:

org.apache.xpath.jaxp.XPathFactoryImpl
org.apache.xpath.jaxp.XPathImpl

I'm really confused, because it's easy to see how JAXP could optimise the above XPath query to execute a simple getElementsByTagName() instead. But it doesn't seem to do that. This problem is limited to around 5-6 frequently used XPath calls that are abstracted and hidden by an API. Those queries involve only simple paths (e.g. /a/b/c, no variables, no conditions) against an always available DOM Document. So, if an optimisation can be done, it should be quite easy to achieve.
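
To make the idea concrete, here is a minimal sketch of what such a shortcut could look like when it lives in the API layer instead of in JAXP. The class name SimplePathLookup and its methods are made up for illustration, and it only handles plain absolute child paths such as /a/b/c (no //, no predicates, no namespaces). It is not what Xalan does internally.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: resolves simple absolute paths like /a/b/c with plain DOM calls. */
public class SimplePathLookup {

    /** Parses an XML string; checked exceptions are wrapped to keep the example short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Text content of the first element (in document order) matching the path, or "" if none. */
    static String textAt(Document document, String path) {
        if (!path.startsWith("/") || path.length() < 2) {
            throw new IllegalArgumentException("Expected an absolute path like /a/b/c: " + path);
        }
        String[] steps = path.substring(1).split("/");
        Element match = find(document, steps, 0);
        return match == null ? "" : match.getTextContent();
    }

    /** Depth-first search over child elements, backtracking when a branch does not match. */
    private static Element find(Node parent, String[] steps, int index) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE
                    || !steps[index].equals(child.getNodeName())) {
                continue;
            }
            if (index == steps.length - 1) {
                return (Element) child;
            }
            Element hit = find(child, steps, index + 1);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

Returning "" when nothing matches mirrors what an XPath evaluation with XPathConstants.STRING yields for an empty node-set, so callers of the API would not notice the swap for these simple paths.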

My question: Is XPath's slowness an accepted fact, or am I overlooking something? Is there a better (faster) implementation? Or should I just avoid XPath altogether, for simple queries?

Source: Tips4all

2 comments:

User, May 31, 2012 at 7:28 PM
I have debugged and profiled my test-case and Xalan/JAXP in general. I managed to identify the major problem in

org.apache.xml.dtm.ObjectFactory.lookUpFactoryClassName()

It can be seen that every one of the 10k test XPath evaluations led to the classloader trying to look up the DTMManager instance in some sort of default configuration. This configuration is not loaded into memory but accessed every time. Furthermore, this access seems to be protected by a lock on ObjectFactory.class itself. When the access fails (which it does by default), the configuration is loaded from the xalan.jar file's

META-INF/services/org.apache.xml.dtm.DTMManager

configuration file. Every time!

Fortunately, this behaviour can be overridden by specifying a JVM parameter like this:

-Dorg.apache.xml.dtm.DTMManager=org.apache.xml.dtm.ref.DTMManagerDefault

or

-Dcom.sun.org.apache.xml.internal.dtm.DTMManager=com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
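
If changing the JVM command line is not an option, the same system properties can presumably be set from code, as long as that happens before the first XPath evaluation. The following is a minimal sketch under that assumption. The class DtmManagerProperty and the method pinDtmManager are made-up names, and whether a given JDK or Xalan version still consults these properties is something to verify by profiling.

```java
/** Hypothetical startup hook that mirrors the two -D parameters above. */
public class DtmManagerProperty {

    // Key/value used by a standalone Xalan jar on the classpath.
    static final String XALAN_KEY = "org.apache.xml.dtm.DTMManager";
    static final String XALAN_VALUE = "org.apache.xml.dtm.ref.DTMManagerDefault";

    // Key/value used by the Xalan copy bundled inside the JDK.
    static final String JDK_KEY = "com.sun.org.apache.xml.internal.dtm.DTMManager";
    static final String JDK_VALUE = "com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault";

    /** Sets both properties unless the user already supplied a value via -D. */
    static void pinDtmManager() {
        setIfAbsent(XALAN_KEY, XALAN_VALUE);
        setIfAbsent(JDK_KEY, JDK_VALUE);
    }

    private static void setIfAbsent(String key, String value) {
        if (System.getProperty(key) == null) {
            System.setProperty(key, value);
        }
    }

    public static void main(String[] args) {
        pinDtmManager(); // call once, early, before any XPath work
        System.out.println(XALAN_KEY + "=" + System.getProperty(XALAN_KEY));
        System.out.println(JDK_KEY + "=" + System.getProperty(JDK_KEY));
    }
}
```

Setting both keys is harmless: the implementation that is not in use simply never reads its property.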

So here's a performance improvement overview for 10k consecutive XPath evaluations of //SomeNodeName against a 90k XML file (measured with System.nanoTime()):

measured library        | Xalan 2.7.0 | Xalan 2.7.1 | Saxon-HE 9.3 | jaxen 1.1.3
------------------------+-------------+-------------+--------------+------------
without optimisation    |     10400ms |      4717ms |              |     25500ms
reusing XPathFactory    |      5995ms |      2829ms |              |
reusing XPath           |      5900ms |      2890ms |              |
reusing XPathExpression |      5800ms |      2915ms |      16000ms |     25000ms
adding the JVM param    |      1163ms |       761ms |          n/a |
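
For anyone who wants to reproduce the first and fourth rows, a rough harness could look like the sketch below. It is not the code used for the numbers above: the class name XPathReuseTiming, the tiny inline document, and the lack of JIT warm-up are all simplifications, so expect different absolute figures.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical micro-benchmark comparing fresh vs. reused XPath objects. */
public class XPathReuseTiming {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** "without optimisation": new factory, XPath and compiled expression on every run. */
    static String evalFresh(Document doc, String expr, int runs) {
        try {
            String last = "";
            for (int i = 0; i < runs; i++) {
                last = (String) XPathFactory.newInstance().newXPath()
                        .compile(expr).evaluate(doc, XPathConstants.STRING);
            }
            return last;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** "reusing XPathExpression": compile once, evaluate many times. */
    static String evalReused(Document doc, String expr, int runs) {
        try {
            XPathExpression compiled = XPathFactory.newInstance().newXPath().compile(expr);
            String last = "";
            for (int i = 0; i < runs; i++) {
                last = (String) compiled.evaluate(doc, XPathConstants.STRING);
            }
            return last;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root><SomeNodeName>value</SomeNodeName></root>");
        int runs = 10_000;

        long t0 = System.nanoTime();
        evalFresh(doc, "//SomeNodeName", runs);
        long t1 = System.nanoTime();
        evalReused(doc, "//SomeNodeName", runs);
        long t2 = System.nanoTime();

        System.out.printf("fresh objects : %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("reused expr   : %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

Note that XPathExpression objects are not documented as thread-safe, so a cached expression should stay confined to one thread or be guarded accordingly.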

I have filed this as a bug to the Xalan guys at Apache:

https://issues.apache.org/jira/browse/XALANJ-2540
User, May 31, 2012 at 7:28 PM
Not a solution, but a pointer to the main problem:
The slowest part of evaluating an XPath relative to an arbitrary node is the time it takes the DTM manager to find the node handle:

http://javasourcecode.org/html/open-source/jdk/jdk-6u23/com/sun/org/apache/xml/internal/dtm/ref/dom2dtm/DOM2DTM.html#getHandleOfNode%28org.w3c.dom.Node%29

If the node in question is at the end of the Document, the DTM manager can end up walking the entire tree to find it, for each and every query.

This explains why the hack to orphan out the target node works.
There should be a way to cache these lookups, but at this point I can't see how.
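
For reference, one way to read the "orphan" hack is sketched below: evaluate the relative XPath against a node that has no parent, so the handle lookup has only a small subtree to walk. This version uses a deep clone so the original Document is left untouched; removing the node from its parent would avoid the copy but mutates the tree. The class name DetachedNodeXPath and its methods are made up for illustration, and only expressions that stay inside the subtree will give the same answer as before.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical illustration of evaluating a relative XPath on a parentless copy of a node. */
public class DetachedNodeXPath {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Deep-clones the context node (the clone has no parent) and evaluates against the clone. */
    static String evalOnDetachedCopy(Node context, String relativeExpr) {
        Node orphan = context.cloneNode(true);
        try {
            return (String) XPathFactory.newInstance().newXPath()
                    .evaluate(relativeExpr, orphan, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Small usage example: reads the name of the second item via its detached copy. */
    static String demo() {
        Document doc = parse(
                "<list><item><name>a</name></item><item><name>b</name></item></list>");
        Node secondItem = doc.getElementsByTagName("item").item(1);
        return evalOnDetachedCopy(secondItem, "name");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Whether the clone pays off depends on the subtree size: for large subtrees the copy itself may cost more than the tree walk it avoids.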