Software  
Existing software solutions in Java.
Updated Feb 19, 2010 by kkape...@gmail.com

Finding differences in Text

The diff problem for plain text is considered more or less solved. Several implementations offer high-quality results at minimal cost, and open-source solutions (notably GNU diff) are available.

It is also worth noting that in many cases we are not interested in word-level diffs but only in line changes. Typical examples include Source Code Management systems and Wiki applications. This makes the job of a text diff library much easier, because each line can be treated as a single unit that is either equal or not.
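
As an illustration of how simple the line-level case can be, here is a minimal sketch of a line diff built on a longest-common-subsequence (LCS) table. The class name LineDiffSketch and the "+ " / "- " output prefixes are assumptions made for this example; this is not the code of GNU diff or of any library listed on this page, and real tools use faster algorithms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: a naive line-level diff using an LCS table.
class LineDiffSketch {

    // Returns one entry per output line, prefixed with "  " (unchanged),
    // "- " (only in the old text) or "+ " (only in the new text).
    static List<String> diff(String oldText, String newText) {
        String[] a = oldText.split("\n", -1);
        String[] b = newText.split("\n", -1);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both inputs, keeping equal lines and emitting the rest as edits.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add("  " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a[i++]);
            } else {
                out.add("+ " + b[j++]);
            }
        }
        while (i < a.length) out.add("- " + a[i++]);
        while (j < b.length) out.add("+ " + b[j++]);
        return out;
    }

    public static void main(String[] args) {
        String before = "int a = 1;\nint b = 2;\nreturn a + b;";
        String after = "int a = 1;\nint b = 3;\nreturn a + b;";
        diff(before, after).forEach(System.out::println);
    }
}
```

The table costs time and memory proportional to the product of the two line counts, which is acceptable for a sketch but not for large files.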

Fine-grained differences at the character level are also possible. Google Diff, for example, can show differences between individual characters: it can tell that "horse" and "horses" differ in one character only.
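
The "horse" versus "horses" case can be sketched by trimming the common prefix and suffix of the two strings, which leaves only the region that differs. The class CharDiffSketch and its Change type are hypothetical names for this example. This is not Google Diff's actual implementation, and the sketch only isolates a single changed region; a full character diff would go on to compare the two middle parts.

```java
// Hypothetical example class: finds the single region where two strings differ
// by stripping their common prefix and common suffix.
class CharDiffSketch {

    // Describes the differing region.
    static final class Change {
        final int offset;      // index at which the strings start to differ
        final String removed;  // characters present only in the old string
        final String added;    // characters present only in the new string

        Change(int offset, String removed, String added) {
            this.offset = offset;
            this.removed = removed;
            this.added = added;
        }
    }

    static Change diff(String oldS, String newS) {
        int max = Math.min(oldS.length(), newS.length());

        // Length of the common prefix.
        int prefix = 0;
        while (prefix < max && oldS.charAt(prefix) == newS.charAt(prefix)) {
            prefix++;
        }

        // Length of the common suffix, not overlapping the prefix.
        int suffix = 0;
        while (suffix < max - prefix
                && oldS.charAt(oldS.length() - 1 - suffix)
                   == newS.charAt(newS.length() - 1 - suffix)) {
            suffix++;
        }

        return new Change(
                prefix,
                oldS.substring(prefix, oldS.length() - suffix),
                newS.substring(prefix, newS.length() - suffix));
    }

    public static void main(String[] args) {
        Change c = diff("horse", "horses");
        System.out.println("offset=" + c.offset
                + " removed=\"" + c.removed + "\" added=\"" + c.added + "\"");
    }
}
```

For "horse" and "horses" the common prefix covers all of "horse", so nothing is removed and the single character "s" is reported as added at offset 5.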

Finding differences in XML/HTML

Comparing two XHTML files is a completely different story. HTML holds tree-structured data, so the problem is no longer trivial. A diff library must be "smart" enough to understand what is an HTML tag and what is not. Changes can now happen in HTML attributes as well as in plain text. HTML also contains more complex constructs, such as lists and tables, which complicate the output code.
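
To make the difference from plain text concrete, here is a rough sketch that parses two well-formed XHTML fragments with the JDK's built-in DOM parser and walks both trees in parallel, reporting changed attributes and text nodes. The class TreeCompareSketch and its report format are assumptions for this example. It compares children strictly by position, so a single inserted node is reported only as a changed child count; real XML diff libraries use far more elaborate matching.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: naive positional comparison of two DOM trees.
class TreeCompareSketch {

    static List<String> compare(String oldXhtml, String newXhtml) {
        List<String> out = new ArrayList<>();
        compare(parse(oldXhtml), parse(newXhtml), "", out);
        return out;
    }

    // Parses well-formed markup only; real-world HTML would need cleaning first.
    private static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    private static void compare(Node a, Node b, String path, List<String> out) {
        String here = path + "/" + a.getNodeName();
        if (a.getNodeType() != b.getNodeType()
                || !a.getNodeName().equals(b.getNodeName())) {
            out.add(here + ": node replaced by " + b.getNodeName());
            return;
        }
        if (a.getNodeType() == Node.TEXT_NODE) {
            if (!a.getNodeValue().equals(b.getNodeValue())) {
                out.add(here + ": text changed");
            }
            return;
        }
        if (a.getNodeType() == Node.ELEMENT_NODE) {
            compareAttributes((Element) a, (Element) b, here, out);
        }
        NodeList ca = a.getChildNodes();
        NodeList cb = b.getChildNodes();
        if (ca.getLength() != cb.getLength()) {
            out.add(here + ": child count changed");
            return;
        }
        for (int i = 0; i < ca.getLength(); i++) {
            compare(ca.item(i), cb.item(i), here, out);
        }
    }

    private static void compareAttributes(Element a, Element b,
                                          String here, List<String> out) {
        NamedNodeMap attrsA = a.getAttributes();
        for (int i = 0; i < attrsA.getLength(); i++) {
            String name = attrsA.item(i).getNodeName();
            String oldValue = attrsA.item(i).getNodeValue();
            if (!b.hasAttribute(name)) {
                out.add(here + ": attribute " + name + " removed");
            } else if (!oldValue.equals(b.getAttribute(name))) {
                out.add(here + ": attribute " + name + " changed from "
                        + oldValue + " to " + b.getAttribute(name));
            }
        }
        NamedNodeMap attrsB = b.getAttributes();
        for (int i = 0; i < attrsB.getLength(); i++) {
            String name = attrsB.item(i).getNodeName();
            if (!a.hasAttribute(name)) {
                out.add(here + ": attribute " + name + " added");
            }
        }
    }

    public static void main(String[] args) {
        compare("<p>See <a href=\"a.html\">link</a></p>",
                "<p>See <a href=\"b.html\">link</a></p>")
                .forEach(System.out::println);
    }
}
```

A line-based text diff would only report that the line changed; here the change is attributed to the href attribute of the a element inside the p element.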

HTML found in the wild is often messy or malformed, which makes it hard for a diff library to handle. Some pre-processing code is needed to clean up the HTML before the actual comparison takes place.

There is a lot of research and literature on XML diffing methods. Unlike plain text, however, XML does not yet have a definitive solution.

Diff software in Java

The table below lists other solutions apart from Daisy Diff:

Library          Type  Version     Licence                 Last release
Darwin Diff      text  0.9         BSD                     2004
GNU Diff         text  1.7         GPL                     January 2009
JBDiff           text  0.1.1       BSD                     October 2007
VMTools          xml   0.5         VMtools Source Licence  February 2002
diffXML          xml   0.95Beta    GPL                     May 2009
XMLDiff          xml   2001        Alphaworks              March 2001
Jlibdiff         text  1.01        GPL                     February 2004
JDirDiff         text  0.67        GPL                     June 2004
Google Diff      text  20090804    Apache Licence          August 2009
Diff MK          text  3.0.a1      GPL                     March 2007
Java Diff        text  1.1.0       LGPL                    January 2009
XmlUNIT          xml   1.2         BSD                     June 2008
jxydiff          xml   2006        QPL                     Feb 2006
delta XML        xml   V2          Commercial              Oct 2009
Oracle XML Diff  xml   10g         Commercial              10g
FC XML           xml   0.1         MIT                     Jun 2009
3DM XML          xml   0.1.5beta1  LGPL                    March 2006
XOP              xml   1.3         Research                October 2009
Diff X           xml   0.7.1       Artistic/GPL            October 2009

What Daisy Diff offers

One of the most important features of Daisy Diff is that it "understands" HTML tags and will actually look into the text to decide whether a node is the same or not.

For example, assume that a user has changed a single word in a big paragraph. Most XML libraries would just mark the whole paragraph as different. Daisy Diff, however, will look into the inline text (the contents of the p tag node) and understand that only one word is different. It will therefore present only this word to the user as changed.
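
The idea of looking inside a paragraph can be illustrated with a small word-level comparison. The sketch below is not Daisy Diff's algorithm or API. It is a generic longest-common-subsequence diff over whitespace-separated words, and the class name WordDiffSketch and the [-removed-] / {+added+} markers are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: marks changed words inside one paragraph's text.
class WordDiffSketch {

    // Returns the merged text with removed words wrapped in [-...-]
    // and added words wrapped in {+...+}.
    static String diff(String oldText, String newText) {
        String[] a = oldText.trim().split("\\s+");
        String[] b = newText.trim().split("\\s+");

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Keep common words as-is and wrap everything else in markers.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add(a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("[-" + a[i++] + "-]");
            } else {
                out.add("{+" + b[j++] + "+}");
            }
        }
        while (i < a.length) out.add("[-" + a[i++] + "-]");
        while (j < b.length) out.add("{+" + b[j++] + "+}");
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(diff("The quick brown fox jumps",
                                "The quick red fox jumps"));
    }
}
```

A comparison at node level would flag the whole sentence as changed; comparing the words inside the node narrows the change down to "brown" being replaced by "red".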

Daisy Diff is also used in production (in Daisy CMS) and comes with a business-friendly licence.

