XMLDog is a dog trained to sniff XML documents. We give XMLDog a set of XPaths and ask it to sniff an XML document. It uses SAX and evaluates all the given XPaths in a single pass over the document.

Whether you use Xalan or XMLDog, you first need to define a javax.xml.namespace.NamespaceContext. This interface defines the binding from prefix to URI.

import jlibs.xml.DefaultNamespaceContext;
import jlibs.xml.Namespaces;
DefaultNamespaceContext nsContext = new DefaultNamespaceContext(); // an implementation of javax.xml.namespace.NamespaceContext
nsContext.declarePrefix("xs", Namespaces.URI_XSD); // the prefix must match the one used in the XPaths

Now create an instance of XMLDog and add the XPaths that need to be evaluated. Note that XMLDog can evaluate multiple XPaths in a single SAX parse of the given XML document.

import jlibs.xml.sax.dog.XMLDog;
import jlibs.xml.sax.dog.expr.Expression;
XMLDog dog = new XMLDog(nsContext);
Expression xpath1 = dog.addXPath("/xs:schema/@targetNamespace");
Expression xpath2 = dog.addXPath("/xs:schema/xs:complexType/@name");
Expression xpath3 = dog.addXPath("/xs:schema/xs:*/@name");

When you add an XPath to XMLDog, it returns an Expression object. This object is the compiled XPath. You can get the original XPath from the Expression using getXPath():

System.out.println(xpath1.getXPath()); // prints "/xs:schema/@targetNamespace"

You can ask an Expression about its result type:

import javax.xml.namespace.QName;
QName resultType = xpath1.resultType.qname;
System.out.println(resultType); // prints "{http://www.w3.org/1999/XSL/Transform}NODESET"

The QName returned will be one of the constants in javax.xml.xpath.XPathConstants.

To evaluate the given XPaths on an XML document:

import jlibs.xml.sax.dog.XPathResults;
import org.xml.sax.InputSource;
XPathResults results = dog.sniff(new InputSource("note.xml"));

The XPathResults object will contain the results of all XPath evaluations. To get the result of a particular XPath:

Object result = results.getResult(xpath1);

The return type of getResult(Expression) is java.lang.Object. Depending on the Expression's resultType, this result can be safely cast to a particular type. As the examples on this page show, a NODESET result is a List<NodeItem>, and a primitive result is one of String, Boolean, or Double.
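As a point of comparison only, the relationship between a result-type QName and the Java type you cast to can be sketched with the JDK's own javax.xml.xpath API, which uses the same XPathConstants. This is not XMLDog code: the class name, the sample XML, and the evaluate helper are made up for illustration, and the JDK returns an org.w3c.dom.NodeList for NODESET rather than a list of NodeItem.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ResultTypeDemo {
    // Hypothetical sample document, used only for this sketch.
    static final String XML =
        "<schema targetNamespace='urn:demo'><element name='a'/><element name='b'/></schema>";

    // Evaluates one XPath with the JDK engine and returns the untyped result.
    static Object evaluate(String xpath, QName resultType) throws Exception {
        XPath engine = XPathFactory.newInstance().newXPath();
        return engine.evaluate(xpath, new InputSource(new StringReader(XML)), resultType);
    }

    public static void main(String[] args) throws Exception {
        // The requested result type decides which cast is safe.
        String ns = (String) evaluate("/schema/@targetNamespace", XPathConstants.STRING);
        Double count = (Double) evaluate("count(/schema/element)", XPathConstants.NUMBER);
        Boolean hasB = (Boolean) evaluate("boolean(/schema/element[@name='b'])", XPathConstants.BOOLEAN);
        NodeList names = (NodeList) evaluate("/schema/element/@name", XPathConstants.NODESET);
        System.out.println(ns + " " + count + " " + hasB + " " + names.getLength());
    }
}
```

The point of the sketch is that the cast is chosen from the result type, never guessed from the XPath text.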
NodeItem represents an XML node in the XML document. NodeItem has the following properties:

NodeItem.type: returns the type of the XML node. It will be one of the following constants in NodeType: COMMENT, PI, DOCUMENT, ELEMENT, ATTRIBUTE, NAMESPACE, TEXT.
NodeItem.location: returns the unique XPath to the XML node, e.g. /xs:schema[1]/xs:complexType[1]/@name
NodeItem.value, NodeItem.localName, NodeItem.namespaceURI, NodeItem.qualifiedName: return the value/localName/namespaceURI/qualifiedName of the XML node it represents.

NodeItem.toString() simply returns its location.

XPathResults has a handy print method to print the results to a given java.io.PrintStream:

results.print(dog.getExpressions(), System.out);

will print:

XPath: /xs:schema/@targetNamespace
1: /xs:schema[1]/@targetNamespace
XPath: /xs:schema/xs:complexType/@name
1: /xs:schema[1]/xs:complexType[1]/@name
XPath: /xs:schema/xs:*/@name
1: /xs:schema[1]/xs:element[1]/@name
2: /xs:schema[1]/xs:element[2]/@name
3: /xs:schema[1]/xs:element[3]/@name
4: /xs:schema[1]/xs:element[4]/@name
5: /xs:schema[1]/xs:complexType[1]/@name

Multi Threading:

XMLDog supports multi-threading. You can add multiple XPaths once,

XPath Support:

XMLDog supports a subset of XPath 1.0. Axes supported are:

All functions except id() are supported. It supports predicates and all operators. XMLDog will tell you clearly if a given XPath is not supported; for example:

Expression xpath = dog.addXPath("/xs:schema/../@targetNamespace");

throws the following exception:

java.lang.UnsupportedOperationException: unsupported axis: parent

This is very useful: for example, you can first try using XMLDog and if it throws UnsupportedOperationException,

DOM Results

By default XMLDog does not construct DOM nodes for results. To have it build them, configure the Event with a DOMBuilder:

import jlibs.xml.sax.dog.sniff.Event;
Event event = dog.createEvent();
results = new XPathResults(event);
event.setListener(results);
event.setXMLBuilder(new DOMBuilder());
dog.sniff(event, new InputSource("note.xml"));
List<NodeItem> items = (List<NodeItem>)results.getResult(xpath1);

You can get the DOM node for a given NodeItem as follows:

NodeItem item = ...
org.w3c.dom.Node domNode = (org.w3c.dom.Node)item.xml;

Note that DOM nodes are created only for the portions of the XML which are hit by XPaths. Event.setXMLBuilder(...) takes an argument of type jlibs.xml.sax.dog.sniff.XMLBuilder.

Instant Results

The XPathResults object holds the results of all XPaths in memory. This might not always be feasible. Let us say you are searching employees.xml for employees with more than 5 years of experience. To solve this problem, you register your own InstantEvaluationListener with Event. Your listener is then notified of each hit as it happens:

import jlibs.xml.sax.dog.expr.InstantEvaluationListener;
Event event = dog.createEvent();
event.setXMLBuilder(new DOMBuilder());
event.setListener(new InstantEvaluationListener(){
    @Override
    public void onNodeHit(Expression expression, NodeItem nodeItem){
        org.w3c.dom.Node node = (org.w3c.dom.Node)nodeItem.xml;
        System.out.println("XPath: "+expression.getXPath()+" has hit: "+node);
    }
    @Override
    public void finishedNodeSet(Expression expression){
        System.out.println("Finished Nodeset: "+expression.getXPath());
    }
    @Override
    public void onResult(Expression expression, Object result){
        // this method is called only for xpaths which return a primitive result
        // i.e. result will be one of String, Boolean, Double
        System.out.println("XPath: "+expression.getXPath()+" result: "+result);
    }
});
dog.sniff(event, new InputSource("note.xml"), false/*useSTAX*/); // this version of the sniff method returns void

You can use variables and custom functions in XPaths. For this you have to use the following constructor:

import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPathVariableResolver;
import javax.xml.xpath.XPathFunctionResolver;
NamespaceContext nsContext = ...;
XPathVariableResolver variableResolver = ...;
XPathFunctionResolver functionResolver = ...;
XMLDog dog = new XMLDog(nsContext, variableResolver, functionResolver);

Note that functions are not supposed to expect arguments of type NodeSet.

Command Line Utility

You can find xmldog.sh/xmldog.bat in the $JLIBS_HOME/bin directory. This is useful for playing with XMLDog on various XML documents and XPaths.

Conformance

The XMLDog results conform to the XPath spec. This is covered by jlibs.xml.sax.dog.tests.XPathConformanceTest. You can look here to see the kinds of XPaths it has been tested with. You can find xmldog-conformance.sh/xmldog-conformance.bat in the jlibs installation.

Performance:

You can find xmldog-performance.sh/xmldog-performance.bat in the jlibs installation. Here is sample output of this performance test:

Average Execution Time over 20 runs:
--------------------------------------------------------------------------------
File | XPaths XMLDog SAXON Diff Percentage
--------------------------------------------------------------------------------
resources/xmlFiles/note.xml | 290 35 84 -49 -2.42
resources/xmlFiles/simple.xml | 29 4 13 -8 -3.04
resources/xmlFiles/positions.xml | 110 16 27 -10 -1.64
resources/xmlFiles/sample.xml | 2197 195 176 18 +1.11
resources/xmlFiles/sample1.xml | 2197 25 76 -51 -3.04
resources/xmlFiles/sample2.xml | 2197 29 77 -47 -2.59
resources/xmlFiles/sample3.xml | 2197 28 72 -44 -2.53
resources/xmlFiles/numbers.xml | 83 1 3 -2 -2.22
resources/xmlFiles/underscore.xml | 80 2 3 -1 -1.41
resources/xmlFiles/contents.xml | 160 4 9 -4 -1.99
resources/xmlFiles/pi.xml | 31 0 1 -1 -2.63
resources/xmlFiles/evaluate.xml | 40 1 2 0 -1.32
resources/xmlFiles/web.xml | 431 7 23 -16 -3.30
resources/xmlFiles/fibo.xml | 94 4 12 -7 -2.58
resources/xmlFiles/defaultNamespace.xml | 80 0 2 -1 -3.28
resources/xmlFiles/namespaces.xml | 150 3 6 -3 -1.78
resources/xmlFiles/text.xml | 35 0 1 0 -2.16
resources/xmlFiles/organization.xml | 110 4 6 -2 -1.53
resources/xmlFiles/moreover.xml | 130 10 16 -6 -1.61
resources/xmlFiles/id.xml | 40 0 2 -1 -3.45
resources/xmlFiles/much_ado.xml | 78 14 17 -3 -1.24
resources/xmlFiles/sum.xml | 17 0 1 -1 -5.00
resources/xmlFiles/purchase_order.xml | 510 7 18 -11 -2.61
resources/xmlFiles/roof.xml | 20 0 1 -1 -2.64
resources/xmlFiles/nitf.xml | 60 1 3 -1 -2.28
resources/xmlFiles/message.xml | 10 0 1 -1 -3.42
resources/xmlFiles/lang.xml | 80 1 3 -2 -2.96
resources/xmlFiles/testNamespaces.xml | 22 0 1 -1 -3.65
resources/xmlFiles/test.xml | 20 0 1 -1 -3.06
resources/xmlFiles/jaxen3.xml | 10 0 1 0 -2.73
resources/xmlFiles/jaxen24.xml | 30 0 1 0 -3.01
resources/xmlFiles/pi2.xml | 10 0 1 -1 -5.00
resources/xmlFiles/library.xml | 20 1 1 0 -1.17
resources/xmlFiles/axis.xml | 32 0 1 0 -2.30
resources/xmlFiles/t.xml | 10 0 0 0 -2.58
--------------------------------------------------------------------------------
Total | 11610 410 683 -273 -1.67

It shows that XMLDog is faster than Saxon9 (1.67 times). The source code of the testcase is here.

Future:

I would like to know who is interested in XMLDog, and why and where you are using it. This will give me a boost to add more features, because it takes most of my free time. Your comments are welcome.
Have you looked at vtd-xml? It is the fastest!!
Hi, we are evaluating XMLDog as a fast xpath evaluator, in order to use it in an esb as a way to implement content based routing of web services, we need a fast evaluation of xpath to minimize the overhead time added by the esb.
Regards, Martin
Hi Martin,
Hi, I am currently trying to perform updates on (very large) XML documents being read on the fly (therefore through a StreamSource) before writing them to an output stream.
The elements to update are defined by XPath expressions (obviously a subset of XPath that would require access to the parent, ancestor and forward axes only). The update would typically be performed using a custom function (embedding a stateful object). Do you know whether your library could be used to perform such processing?
You can get a notification as soon as a particular XPath result is evaluated. This can be done as follows:

Expression expr = xmldog.addXPath(xpathStr);
Event event = xmldog.createEvent();
event.addListener(new EvaluationListener(){
    @Override
    public void finished(Evaluation evaluation){
        Expression expr = evaluation.expression;
        Object result = evaluation.getResult();
        System.out.println("Result is: "+result);
    }
});
xmldog.sniff(event, inputSource);

For some XPaths, the result might be notified after the actual node has passed. For example, if the XPath is "/a[b]", then when the result is notified you will be either in the <b> startElement or the <a> endElement.
If your XPaths don't need forward lookup, then you can use this to update the XML in a streaming fashion.
Hi Santhosh, I needed to get the result of an xpath which returns a node either as an XMLElement object or an XML string but the node objects only return location paths strings. I have tried to modify your code to enable this but beyond the Event class, which seems to hold the current element data, I couldn't understand your classes. How the xpath query is able to produce an accurate location string. The one from the Event class doesn't seem to go beyond the current class. Regards, John.
This can be done using NodeSetListener.
NodeSetListener.mayHit() is called if the current SAX event might be a possible outcome of the XPath expression to which it is attached. You can start populating XML objects after this call. If the XPath engine finds that it is not a hit, then NodeSetListener.discard() is called. It is not so straightforward to implement this. Currently NodeSetListener is used internally, and SAX events have to be multicast so that the XML object can be populated. I will try this at my end and let you know the status...
I am executing the default bin/xmldog.sh script with the 3rd party dependencies downloaded. However, I do not get a result as expected from XMLDog
asankha@asankha:~/java/jlibs/bin$ ./xmldog.sh /home/asankha/code/XMLPerf/src/test/resources/test1.xml
Namespaces:
soapenv = http://schemas.xmlsoap.org/soap/envelope/
z = http://somez
y = http://someothery
y1 = http://somey
x = http://somex
m = http://services.samples/xsd
XPaths:
//order[1]/symbol
| XPath-Results |
XPath: //order[1]/symbol
Evaluated in 40 milliseconds

The XML input file is:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:z="http://somez" xmlns:y="http://someothery">
<soapenv:Header xmlns:y="http://somey">
</soapenv:Header>
<soapenv:Body>
<m:buyStocks xmlns:m="http://services.samples/xsd">
<order><symbol>IBM</symbol><buyerID>asankha</buyerID><price>140.34</price><volume>2000</volume></order>
<order><symbol>MSFT</symbol><buyerID>ruwan</buyerID><price>23.56</price><volume>8030</volume></order>
<order><symbol>SUN</symbol><buyerID>indika</buyerID><price>14.56</price><volume>500</volume></order>
<order><symbol>GOOG</symbol><buyerID>chathura</buyerID><price>60.24</price><volume>40000</volume></order>
<order><symbol>IBM</symbol><buyerID>asankha</buyerID><price>140.34</price><volume>2000</volume></order>
<order><symbol>MSFT</symbol><buyerID>ruwan</buyerID><price>23.56</price><volume>803000</volume></order>
<order><symbol>SUN</symbol><buyerID>indika</buyerID><price>14.56</price><volume>50000</volume></order>
<order><symbol>GOOG</symbol><buyerID>saliya</buyerID><price>60.24</price><volume>400000</volume></order>
</m:buyStocks>
</soapenv:Body>
</soapenv:Envelope>

NodeItem.value will be non-null only for NodeTypes COMMENT, PI, ATTRIBUTE, NAMESPACE, TEXT, i.e. NodeItem.value will be null for NodeTypes DOCUMENT, ELEMENT.
XMLDog doesn't create dom for the elements or document if they are hit. I am currently working on creating partial dom nodes (i,e create dom nodes only for those which are results of xpaths)
So if you are trying to evaluate xpaths whose resulting nodes are elements, as results you will get only the exact location of elements hit (not the entire element data)
So what should I do to obtain the element's text data - which is what I am after? If XMLDog parses the XML once, I'd like it to be able to give me the result too from this single parse
ashank,
use xpath: //order[1]/symbol/text()
Hi M. Santhosh Kumar,
I am trying to parse a GPX file of about 150 MB. I would like not to load all of it into memory. I understood that your XMLDog would parse such files in a streaming way (i.e. without holding them in memory). So I wrote a little program, but I get a "java.lang.OutOfMemoryError".
Can you tell me where the problem is ?
I give you the code.
public class XMLDogParser {
    public static void main(String[] s) throws Exception {
        boolean useSTAX = true;
        String file = "/file.gpx";
        DefaultNamespaceContext nsContext = new DefaultNamespaceContext();
        nsContext.declarePrefix(Namespaces.URI_XSI);
        DefaultNamespaceContext resultNSContext = new DefaultNamespaceContext();
        List<Object> dogResult;
        dogResult = new ArrayList<Object>(3);
        XMLDog dog = new XMLDog(nsContext);
        Expression xpath1 = dog.addXPath("/gpx/wpt");
        Expression xpath2 = dog.addXPath("/gpx/rte/rtept");
        Expression xpath3 = dog.addXPath("gpx/trk/trkseg/trkpt");
        Event event = dog.createEvent();
        event.setXMLBuilder(new DOMBuilder());
        XPathResults dogResults = new XPathResults(event, dog.getDocumentXPathsCount(), dog.getXPaths());
        System.out.println("Initialisation succeded.");
        dog.sniff(event, file, useSTAX);
        System.out.println("'snif' succeded.");
        dogResult.add(dogResults.getResult(xpath1));
        dogResult.add(dogResults.getResult(xpath2));
        dogResult.add(dogResults.getResult(xpath3));
        System.out.println("Result : " + dogResult);
        System.out.println("End of program.");
    }
}

Moreover, XMLDog doesn't succeed in "sniffing" the file because of the content of the first markup. The first markup is:
In fact, the bug is generated by only one line :
Have you an idea to solve my problem ?
Thank you in advance, Antonin
Comment out the following line and try again. This might solve the OutOfMemoryError:
What is the error you got for the line:
Can you post the stack trace?
When I comment out the line, I don't get the OutOfMemoryError.
But when I do, the 'dogresult' is different: I can't access the data contained in the markup.
For instance,
Is it possible to get the same informations without using the line ?
When the GPX file contains the line
I don't get an error. I only get nothing, as if the GPX file were empty. I get :
> I don't get an error. I only get nothing, as if the GPX file were empty
do the following:
nsContext.declarePrefix("ns", "http://www.topografix.com/GPX/1/1");

and then change the xpaths as below:

Expression xpath1 = dog.addXPath("/ns:gpx/ns:wpt");
Expression xpath2 = dog.addXPath("/ns:gpx/ns:rte/ns:rtept");
Expression xpath3 = dog.addXPath("ns:gpx/ns:trk/ns:trkseg/ns:trkpt");

Regarding the OutOfMemoryError:

Currently XMLDog doesn't support notifying intermediate results. Let us say /gpx/wpt hits 1000 elements: XMLDog will create 1000 DOM elements and then give you the result. If XMLDog supported intermediate results, then as each element is hit, it could give you the DOM element for it, and you could process it and discard it. This would make it possible to get huge results without an OutOfMemoryError.
I can try if notifying intermediate results is possible on week end and let you know the status...
Thank you very much, I changed the Xpaths expressions and the prefix declaration as you said and it works now.
It would be great if you can see if notifying intermediate results is possible.
Best regards,
Antonin
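The namespace fix in this exchange can be reproduced with the JDK's own javax.xml.xpath API, which follows the same XPath 1.0 rule: an unprefixed name in an XPath never matches an element in a default namespace. The sketch below is not XMLDog code; the class name, the tiny GPX-like document, and the count helper are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DefaultNamespaceDemo {
    static final String GPX_NS = "http://www.topografix.com/GPX/1/1";
    // Hypothetical two-waypoint document that declares GPX as its default namespace.
    static final String XML =
        "<gpx xmlns='" + GPX_NS + "'><wpt lat='1' lon='2'/><wpt lat='3' lon='4'/></gpx>";

    // Counts the nodes selected by the XPath, with the prefix "ns" bound to the GPX namespace.
    static int count(String xpath) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this the DOM carries no namespace information
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));

        XPath engine = XPathFactory.newInstance().newXPath();
        engine.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? GPX_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return GPX_NS.equals(uri) ? "ns" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return GPX_NS.equals(uri)
                    ? Collections.singletonList("ns").iterator()
                    : Collections.<String>emptyIterator();
            }
        });
        NodeList hits = (NodeList) engine.evaluate(xpath, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count("/gpx/wpt"));       // prints 0: unprefixed names match no namespace
        System.out.println(count("/ns:gpx/ns:wpt")); // prints 2: both waypoints are found
    }
}
```

The prefix used in the XPath ("ns" here) is arbitrary. Only the URI it is bound to has to match the one declared in the document.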
Is there any way to iterate over the matches instead of buffering them into these result objects? The XML document I'm dealing with is large and has many matching nodes.
No. Currently, iterating over results without buffering is not supported.
Hi Antonin,
with revision@1570 XMLDog supports notifying intermediate results without buffering. So now it can evaluate dom results for large documents with less memory.
you can see how to configure XMLDog for intermediate results in XMLDogTest.java
Line 1 : <?xml version="1.0"?>
Line 2 : <catalog>
Line 3 : <book id="bk101">
Line 4 : <author>Gambardella, Matthew</author>
Line 5 : <title>XML Developer's Guide</title>
Line 6 : <genre>Computer</genre>
Line 7 : <price>44.95</price>
Line 8 : <publish_date>2000-10-01</publish_date>
Line 9 : <description>An in-depth look at creating applications
Line 10 : with XML.</description>
Line 11 : </book>
Line 12 : </catalog>
how can I use XMLDog to extract Line 3 to Line 11 from above XML ( i.e book section ).
try using path: /catalog/book[@id='bk101']
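In the catalog document above, id is an attribute of book, so the predicate has to test @id. A predicate written as [id='bk101'] looks for a child element named id instead. The difference can be checked with the JDK's javax.xml.xpath API. This is a sketch rather than XMLDog code; the class name, the shortened catalog, and the count helper are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttributePredicateDemo {
    // Shortened, hypothetical version of the catalog; the second book is invented for contrast.
    static final String XML =
        "<catalog>"
        + "<book id='bk101'><title>XML Developer's Guide</title></book>"
        + "<book id='bk102'><title>Another Title</title></book>"
        + "</catalog>";

    // Returns how many nodes the XPath selects in the sample catalog.
    static int count(String xpath) throws Exception {
        XPath engine = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) engine.evaluate(
            xpath, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count("/catalog/book[id='bk101']"));  // prints 0: there is no <id> child element
        System.out.println(count("/catalog/book[@id='bk101']")); // prints 1: the attribute test matches
    }
}
```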
Hi Santosh,
Great stuff. I was looking for a single-pass XPath engine and found yours here. Almost everything works great except this kind of construct:
Let's say I have xml:
Now I want to select one of /Root/Text or /Root/Number whichever comes first. Normally I would do that using the following XPath:
Using standard DOM-based XPathFactory I get single result as expected which is String 'abc'
Using event-based XMLDog however, I am getting two calls of onNodeHit() one for /Root/Text another for /Root/Number. This is not expected and not correct. Instead onNodeHit() should be called only once for the first position in the context of the temporary result node list that is {'abc', '123'}[1] = 'abc' .
Also Expression argument passed to the onNodeHit should match one of the objects returned by XMLDog.addXPath() whereas currently for this kind of expression it does not. So there is no way to figure out which requested XPath is evaluated.
Thanks!
Hi yurgis,
Yes. the xpath that you specified is giving wrong results. will verify and let you know.
thanks santhosh
Hi yurgis,
you can use the following workaround till it gets resolved:
listener = new InstantEvaluationListener(){
    int nodeCounts[] = new int[dog.getDocumentXPathsCount()];
    @Override
    public void onNodeHit(Expression expression, NodeItem nodeItem){
        if(expression.getXPath()==null)
            return;
        // now we can use the result
    }
    @Override
    public void finished(Evaluation evaluation){
        if(evaluation.expression.getXPath()==null)
            return;
        Object result = evaluation.getResult();
        if(printResults){
            if(result!=null){
                if(result instanceof List){
                    for(NodeItem nodeItem: (List<NodeItem>)result)
                        onNodeHit(evaluation.expression, nodeItem);
                }else{
                    // we reach here if xpath result is not dom node
                    // now we can use result
                }
            }
        }
    }
};

Use the above implementation of the listener. You can place your code where you see the comment "now we can use result".
Let me know if this works for you.
Fantastic! The workaround worked for me.
Hi Yurgis,
the mentioned issue is fixed with revision 1620;
now finished(Evaluation evaluation) in InstantEvaluationListener is made final;
instead, the following two new abstract methods are introduced:

public abstract void finishedNodeSet(Expression expression);
public abstract void onResult(Expression expression, Object result);
Any plans to support XPath 2.0 ?
I found out that incorrect XML (without closed tag like "<root>") can be parsed by XMLDog without any exception. Is it correct?
> Any plans to support XPath 2.0 ?
No. Frankly speaking, I haven't used XPath 2.0 yet. No plans in the near future...
> Any plans to support XPath 2.0 ?
XMLDog supports a for loop, which somewhat mimics XPath 2.0; see XMLDog.forEach(String forEach, String xpath).
This feature is not yet documented in this wiki page.
Hi, I am using this with InstantEvaluationListener
but I am not getting any result for any XPath.
My XML is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<!DOCTYPE mdc SYSTEM "MeasDataCollection.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<mfh>
<ffv>32.401 V6.2</ffv>
<sn>SubNetwork=ONRM_RootMo_R,SubNetwork=RABD201,MeContext=RABD201</sn>
<st></st>
<vn></vn>
<cbt>20110714171500Z</cbt>
</mfh>

and MeasDataCollection.xsl is:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

Please guide me on how to get values for XPaths.
Hi Rohit Saxena,
XMLDog is not an XSLT engine. It seems you are expecting an XSLT transformation to be performed because your input XML has: <?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
but that is not the case.
Looks like your example for DOM parsing isn't up to date. It goes more like this:
Expression xpath1 = dog.addXPath("/path");
Event event = dog.createEvent();
results = new XPathResults(event);
event.setXMLBuilder(new DOMBuilder());
event.setListener(results);
dog.sniff(event, new InputSource("note.xml"));
List<NodeItem> items = (List<NodeItem>)results.getResult(xpath1);

Also note: if you add more XPaths to the dog after creating the event, it will throw ArrayIndexOutOfBoundsException when parsing, because the results array doesn't grow after creation (bug?).
thanks karlkfi,
corrected the DOM parsing code.
Regarding the ArrayIndexOutOfBoundsException: you have to create the event only after adding all XPaths, because the Event class allocates objects based on the XPaths added to XMLDog.
Hi Santosh, I have the below XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PurchaseOrderStatusNotification SYSTEM "MS_V02_02_PurchaseOrderStatusNotification.dtd">
<PurchaseOrderStatusNotification>

</PurchaseOrderStatusNotification>

While applying the XPath I will not have the DTD file, and by default the XPath engine will look for the DTD file.
In a DOM model we generally use an entity resolver to avoid this. Is there a way to do the same with XMLDog in the SAX way?
currently there is no api for this.
you can modify SAXEngine class with your entity resolver implementation.
I will be working to provide a way to set entity resolver.
Thanks. Also, does this allow a compressed (Zip/Gzip) file as input, applying the XPath without fully expanding it in memory?
You can use ZipInputStream/GZIPInputStream for that.
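A minimal sketch of the idea, using only JDK classes: the GZIPInputStream is wrapped in an org.xml.sax.InputSource, so a SAX parser decompresses and parses the data as it is read rather than expanding it first. The class and helper names are hypothetical, and the in-memory byte array only stands in for a real .gz file. Since sniff is shown on this page taking an InputSource, the same wrapping is presumably what XMLDog would be given, but that is an assumption and has not been checked here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class GzipSaxDemo {
    // Compresses a string in memory; stands in for a .gz file on disk.
    static byte[] gzip(String text) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Streams the compressed bytes through a SAX parser and counts start tags.
    static int countElements(byte[] gzipped) throws Exception {
        final int[] count = {0};
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            InputSource source = new InputSource(in); // the parser pulls decompressed bytes on demand
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    count[0]++;
                }
            });
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements(gzip("<a><b/><b/></a>"))); // prints 3
    }
}
```

For a zip archive, java.util.zip.ZipInputStream would additionally need getNextEntry() to be called before the stream is handed to the parser.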
Hi Santosh, I need some help with the earlier problem. I modified it as below: SAXUtil.newSAXParser(true, false, false).parse(new InputSource(file),
Also, in the SAXEngine class: public void start(InputSource is) throws XPathException {
With this change the error seems to have vanished, but I don't see the NodeItem output. If I make it parser.parse(is, this) then it works; otherwise it fails. Is there something I'm messing up in your code base?
you should download the jlibs sources and add following method to SAXEngine.java:
public org.xml.sax.InputSource resolveEntity(String publicId, String systemId) throws org.xml.sax.SAXException, java.io.IOException {
    System.out.println("Ignoring: " + publicId + ", " + systemId);
    return new org.xml.sax.InputSource(new java.io.StringReader(""));
}
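The effect of such a resolveEntity override can be tried outside jlibs with a plain JDK SAX parser. The sketch below is not SAXEngine code: the class name, the document, and the DTD file name are made up. It parses a document whose DOCTYPE points at a DTD that does not exist, and answers the parser's request for it with an empty InputSource so that parsing continues.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SkipDtdDemo {
    // Hypothetical document whose DOCTYPE refers to a DTD that is not available.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE note SYSTEM \"missing-file.dtd\">\n"
        + "<note><to>hello</to></note>";

    // Parses the XML, substituting an empty DTD for every external entity, and returns the text content.
    static String parseIgnoringDtd(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Hand the parser empty content instead of letting it fetch systemId.
                return new InputSource(new StringReader(""));
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseIgnoringDtd(XML)); // prints "hello"
    }
}
```

Note that entities declared only in the skipped DTD are then unknown to the parser, so this approach suits documents that do not rely on them.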