|
API
Introduction

This library is available in multiple languages. Regardless of the language used, the interface for using it is the same. This page describes the API for the public functions. For further examples, see the relevant test harness.

Initialization

The first step is to create a new diff_match_patch object. This object contains various properties which set the behaviour of the algorithms, as well as the following methods/functions:

diff_main(text1, text2) => diffs

An array of differences is computed which describes the transformation of text1 into text2. Each difference is an array (JavaScript, Lua) or tuple (Python) or Diff object (C++, C#, Objective C, Java). The first element specifies if it is an insertion (1), a deletion (-1) or an equality (0). The second element specifies the affected text.

diff_main("Good dog", "Bad dog") => [(-1, "Goo"), (1, "Ba"), (0, "d dog")]

Despite the large number of optimisations used in this function, diff can take a while to compute. The diff_match_patch.Diff_Timeout property is available to set how many seconds any diff's exploration phase may take. The default value is 1.0. A value of 0 disables the timeout and lets diff run until completion. Should diff time out, the return value will still be a valid difference, though probably non-optimal.

diff_cleanupSemantic(diffs) => null

A diff of two unrelated texts can be filled with coincidental matches. For example, the diff of "mouse" and "sofas" is [(-1, "m"), (1, "s"), (0, "o"), (-1, "u"), (1, "fa"), (0, "s"), (-1, "e")]. While this is the optimum diff, it is difficult for humans to understand. Semantic cleanup rewrites the diff, expanding it into a more intelligible format. The above example would become: [(-1, "mouse"), (1, "sofas")]. If a diff is to be human-readable, it should be passed to diff_cleanupSemantic.
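As a rough illustration of the diff format described above, the sketch below rebuilds both texts from a list of (operation, text) tuples without calling the library. The helper names diff_source_text and diff_target_text are hypothetical and chosen only for this example. It suggests why the optimal and the semantically cleaned diffs of "mouse" and "sofas" are interchangeable: both describe the same pair of texts.

```python
# Operation codes as described above: deletion (-1), equality (0), insertion (1).
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def diff_source_text(diffs):
    """Rebuild text1: keep equalities and deletions, skip insertions."""
    return "".join(text for op, text in diffs if op != DIFF_INSERT)


def diff_target_text(diffs):
    """Rebuild text2: keep equalities and insertions, skip deletions."""
    return "".join(text for op, text in diffs if op != DIFF_DELETE)


# The two diffs quoted in the diff_cleanupSemantic description.
optimal = [(-1, "m"), (1, "s"), (0, "o"), (-1, "u"), (1, "fa"), (0, "s"), (-1, "e")]
cleaned = [(-1, "mouse"), (1, "sofas")]

for diffs in (optimal, cleaned):
    print(diff_source_text(diffs), "->", diff_target_text(diffs))
```

Both loop iterations print the same pair of texts, so the cleanup changes how the edit is presented rather than what it does.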
diff_cleanupEfficiency(diffs) => null

This function is similar to diff_cleanupSemantic, except that instead of optimising a diff to be human-readable, it optimises the diff to be efficient for machine processing. The results of both cleanup types are often the same.

The efficiency cleanup is based on the observation that a diff made up of a large number of small edits may take longer to process (in downstream applications) or take more capacity to store or transmit than a smaller number of larger edits. The diff_match_patch.Diff_EditCost property sets what the cost of handling a new edit is in terms of handling extra characters in an existing edit. The default value is 4, which means if expanding the length of a diff by three characters can eliminate one edit, then that optimisation will reduce the total costs.

diff_levenshtein(diffs) => int

Given a diff, measure its Levenshtein distance in terms of the number of inserted, deleted or substituted characters. The minimum distance is 0, which means equality; the maximum distance is the length of the longer string.

diff_prettyHtml(diffs) => html

Takes a diff array and returns a pretty HTML sequence. This function is mainly intended as an example from which to write one's own display functions.

match_main(text, pattern, loc) => location

Given a text to search, a pattern to search for and an expected location in the text near which to find the pattern, return the location which matches closest. The function will search for the best match based on both the number of character errors between the pattern and the potential match, as well as the distance between the expected location and the potential match.

The following example is a classic dilemma.
There are two potential matches: one is close to the expected location but contains a one-character error, the other is far from the expected location but is exactly the pattern sought after:

match_main("abc12345678901234567890abbc", "abc", 26)

Which result is returned (0 or 24) is determined by the diff_match_patch.Match_Distance property. An exact letter match which is 'distance' characters away from the fuzzy location would score as a complete mismatch. For example, a distance of '0' requires the match be at the exact location specified, whereas a distance of '1000' would require a perfect match to be within 800 characters of the expected location to be found using a 0.8 threshold (see below). The larger Match_Distance is, the longer match_main() may take to compute. This variable defaults to 1000.

Another property is diff_match_patch.Match_Threshold, which determines the cut-off value for a valid match. If Match_Threshold is closer to 0, the requirements for accuracy increase. If Match_Threshold is closer to 1, then it is more likely that a match will be found. The larger Match_Threshold is, the longer match_main() may take to compute. This variable defaults to 0.5. If no match is found, the function returns -1.

patch_make(text1, text2) => patches
patch_make(diffs) => patches
patch_make(text1, diffs) => patches

Given two texts, or an already computed list of differences, return an array of patch objects. The third form (text1, diffs) is preferred; use it if you happen to have that data available, otherwise this function will compute the missing pieces.

patch_toText(patches) => text

Reduces an array of patch objects to a block of text which looks extremely similar to the standard GNU diff/patch format. This text may be stored or transmitted.

patch_fromText(text) => patches

Parses a block of text (which was presumably created by the patch_toText function) and returns an array of patch objects.
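Returning to match_main: the trade-off between Match_Distance and Match_Threshold described above can be modelled with a small scoring function. This is a sketch consistent with the description on this page (a score of 0 is a perfect match, a score of 1 is a complete mismatch), not a copy of the library's code. The function names are hypothetical, and the candidate positions and error counts are supplied by hand rather than found by a search.

```python
def match_score(errors, pattern_length, found_at, expected_loc, match_distance=1000):
    """Score a candidate: 0.0 is perfect, 1.0 or more is a complete mismatch."""
    accuracy = errors / pattern_length
    proximity = abs(expected_loc - found_at)
    if match_distance == 0:
        # A distance of 0 demands the exact location.
        return accuracy if proximity == 0 else 1.0
    return accuracy + proximity / match_distance


def best_candidate(candidates, pattern_length, expected_loc,
                   match_distance=1000, match_threshold=0.5):
    """Return the location with the lowest score within the threshold, or -1."""
    best_loc, best_score = -1, None
    for found_at, errors in candidates:
        score = match_score(errors, pattern_length, found_at,
                            expected_loc, match_distance)
        if score <= match_threshold and (best_score is None or score < best_score):
            best_loc, best_score = found_at, score
    return best_loc


# The dilemma above: an exact match at 0 and a one-error match at 24,
# with an expected location of 26 and a pattern of length 3.
candidates = [(0, 0), (24, 1)]
print(best_candidate(candidates, 3, 26, match_distance=1000))
print(best_candidate(candidates, 3, 26, match_distance=30))
print(best_candidate(candidates, 3, 26, match_distance=10))
```

Under this model a large Match_Distance favours the distant exact match, a moderate one favours the nearby fuzzy match, and a small one rejects both at the default threshold of 0.5.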
patch_apply(patches, text1) => [text2, results]

Applies a list of patches to text1. The first element of the return value is the newly patched text. The second element is an array of true/false values indicating which of the patches were successfully applied. [Note that this second element is not too useful since large patches may get broken up internally, resulting in a longer results list than the input with no way to figure out which patch succeeded or failed. A more informative API is in development.]

The previously mentioned Match_Distance and Match_Threshold properties are used to evaluate patch application on text which does not match exactly. In addition, the diff_match_patch.Patch_DeleteThreshold property determines how closely the text within a major (~64 character) delete needs to match the expected text. If Patch_DeleteThreshold is closer to 0, then the deleted text must match the expected text more closely. If Patch_DeleteThreshold is closer to 1, then the deleted text may contain anything. In most use cases Patch_DeleteThreshold should just be set to the same value as Match_Threshold. |
It's a very good API.
Could you tell me how to integrate it with GWT?
Although this is neither the place to ask user questions nor the place to answer them (Neil, feel free to delete both): @lopmuj: just call it using JSNI. Something like this will do the trick:
    private native String getDiffString(String original, String amendment) /*-{
      var dmp = new $wnd.diff_match_patch();
      var d = dmp.diff_main(original, amendment);
      dmp.diff_cleanupSemantic(d);
      return dmp.diff_prettyHtml(d);
    }-*/;

But this is so obvious that I'm not even sure if you tried finding it out by yourself in the first place.
This is an incredible piece of work, nice job Neil!
This is completely what I was looking for after getting fed up with SVN&Co. bogus APIs Thanx Neil!
Excellent work. Thanks!
Simply awesome!
I just built a wrapper around your calls, customizing it according to my needs! I was initially tempted to mimic the GNU "diff" behavior (dumping everything to stdout), but when I saw the results in HTML... I was wowed! This is great; it's really easy to spot the differences (inserts/deletes), as they are coded: insertions show up in the HTML with a green background and underlined font, whilst deletions show up with a red background and strike-through font.
Great job!
Thanks!
This is a great tool. However, I wasn't able to understand the matching algorithm. Is there a way to ignore the location and match based just on the pattern in match_main? Basically, I'm trying to identify whether a particular word or phrase (or a variant of it, such as one with a missing hyphen) appears in the given text.
Neil, Excellent!
One question: If I diff a big document against a small one, I get a big "DELETE" block in the resulting diff. Is this really necessary? I think a "delete XX chars" would be enough.
Bob
jjknopf, how you format/display/visualize the diff is entirely up to you. diff_prettyHtml(diffs) is one way to render a diff, it's a trivial function, you can roll your own. Take a look at diff_toDelta(diff), it's not listed in the above API, but it is very compact and may be similar to what you are looking for.
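To give an idea of what such a compact representation can look like, here is a minimal sketch of a delta-style encoding in which equalities and deletions are reduced to character counts and only inserted text is kept. The function name to_delta is hypothetical, and the escaping of inserted text is an approximation, so the output should not be assumed to be byte-for-byte identical to what diff_toDelta produces.

```python
from urllib.parse import quote

DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def to_delta(diffs):
    """Encode a diff as tab-separated tokens: =N keep, -N delete, +text insert."""
    tokens = []
    for op, text in diffs:
        if op == DIFF_INSERT:
            # Percent-encode the inserted text so tabs and newlines cannot
            # be confused with the token separator (assumed escaping rule).
            tokens.append("+" + quote(text, safe="!~*'();/?:@&=+$,# "))
        elif op == DIFF_DELETE:
            tokens.append("-%d" % len(text))
        else:
            tokens.append("=%d" % len(text))
    return "\t".join(tokens)


# A large deletion collapses to a short token regardless of its length.
big_delete = [(0, "Hello"), (-1, "x" * 1000)]
print(repr(to_delta(big_delete)))
print(repr(to_delta([(-1, "Goo"), (1, "Ba"), (0, "d dog")])))
```

Because deleted text is not stored, such a delta can only be expanded back into a full diff when the original text1 is available.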
I'm trying to match a piece of text against several "candidates". Is there some way to compare the results of two matches to see which is "best"?
Thanks!
I've added a new function to all language versions called diff_levenshtein(diff) which takes a diff and reports the number of character insertions, deletions or substitutions it represents. Thus one can diff your target against several options and compute the Levenshtein distance for each to find the option that's closest to the target.
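The approach described above can be sketched without the library, given diffs that have already been computed. The function below counts a deletion and an adjacent insertion as substitutions where they overlap, which is one common way to derive a Levenshtein distance from a diff. The name levenshtein_from_diff and the hand-written candidate diffs are assumptions made for this example.

```python
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def levenshtein_from_diff(diffs):
    """Count inserted, deleted or substituted characters in a diff."""
    distance = insertions = deletions = 0
    for op, text in diffs:
        if op == DIFF_INSERT:
            insertions += len(text)
        elif op == DIFF_DELETE:
            deletions += len(text)
        else:
            # An equality closes the current run of edits; overlapping
            # insertions and deletions count once, as substitutions.
            distance += max(insertions, deletions)
            insertions = deletions = 0
    return distance + max(insertions, deletions)


# Hand-written diffs from the target "Good dog" to three candidate texts.
candidate_diffs = {
    "Bad dog": [(-1, "Goo"), (1, "Ba"), (0, "d dog")],
    "Good dogs": [(0, "Good dog"), (1, "s")],
    "Good cat": [(0, "Good "), (-1, "dog"), (1, "cat")],
}

closest = min(candidate_diffs, key=lambda name: levenshtein_from_diff(candidate_diffs[name]))
print(closest)
```

In real use the diffs would come from diff_main, called once per candidate.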
Hello all,
I would like to know: based on this implementation, how can I create a list of diffs to simulate something like this: http://www.caffeinated.me.uk/kompare/
What I need is the following:
- be able to know "inserted" blocks of text
- be able to know "deleted" blocks of text
- be able to know "changed" blocks of text
Is it possible with this implementation? Thank you very much!
Jimmy
Jimmy, this would be better asked in the newsgroup: http://groups.google.com/group/diff-match-patch
But the simple answer is to run diff_main(text1, text2), then step through the resulting array of diffs. They will be insertions, deletions or equalities. Obviously a 'change' is a deletion and an insertion next to each other.
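Stepping through the diffs in this way might look like the sketch below, which labels each edit as inserted, deleted or changed. It works on a diff that has already been computed; the function name classify_blocks and the tuple layout of its result are hypothetical choices for this example.

```python
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def classify_blocks(diffs):
    """Return (kind, old_text, new_text) tuples for every non-equal block."""
    blocks = []
    i = 0
    while i < len(diffs):
        op, text = diffs[i]
        if op == DIFF_EQUAL:
            i += 1
            continue
        # A deletion and an insertion next to each other form a 'change'.
        if i + 1 < len(diffs) and diffs[i + 1][0] == -op:
            other = diffs[i + 1][1]
            old, new = (text, other) if op == DIFF_DELETE else (other, text)
            blocks.append(("changed", old, new))
            i += 2
        elif op == DIFF_DELETE:
            blocks.append(("deleted", text, ""))
            i += 1
        else:
            blocks.append(("inserted", "", text))
            i += 1
    return blocks


for block in classify_blocks([(-1, "Goo"), (1, "Ba"), (0, "d dog"), (1, "!")]):
    print(block)
```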
Nice work. However, how hard would it be to add unpatch(), that is, reversepatch()? The second issue is that I need to port this to PHP ...
Unpatching would just be a matter of looping through the diff, swapping DIFF_INSERT with DIFF_DELETE, then applying the patch. As for PHP, there's a partial translation which someone wrote; email me and I'll send you a copy.
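The swap itself is small enough to sketch without the library. Because the operation codes are 1 and -1, negating each code exchanges insertions and deletions and leaves equalities (0) alone. The helper names below are hypothetical; the inverted diff would then be turned into patches and applied with the library's own patch functions.

```python
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def invert_diff(diffs):
    """Swap insertions and deletions so the diff maps text2 back to text1."""
    return [(-op, text) for op, text in diffs]


def diff_source_text(diffs):
    """Text the diff starts from: equalities plus deletions."""
    return "".join(text for op, text in diffs if op != DIFF_INSERT)


def diff_target_text(diffs):
    """Text the diff produces: equalities plus insertions."""
    return "".join(text for op, text in diffs if op != DIFF_DELETE)


forward = [(-1, "Goo"), (1, "Ba"), (0, "d dog")]
backward = invert_diff(forward)
print(diff_source_text(backward), "->", diff_target_text(backward))
```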
great job, this is what I'm looking for! thanks a lot.
A terrific work! Thanks a lot!
good stuff, just what i was looking for.
I need the PHP implementation as well. Neil -- I don't have your email but could you email me it at mathews.kyle@gmail.com? Thanks, that'd be really helpful.
Great Stuff! Please send me the php-implementation as well. I will send it back if i extend it.
E-Mail: dmp@shwups.ch
Great job!
Hi, but what about patch conflicts? Standard GNU patch reports them. How can we recognize that a patching conflict occurred?
Thanks
Tom, look at the results from patch_apply. There are two returned values, one is the patched text, and the other is a list of booleans indicating which patches were applied and which failed.
I used the "Demo of Patch" example (http://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_patch.html).

1. I put "Old Version" text like this:
<appSettings>
</appSettings> <applicationSettings> </applicationSettings>
2. I put "New Version" text like this:
<appSettings>
</appSettings> <applicationSettings> </applicationSettings>
3. I computed a patch.
4. I put "Old Version" text like this:
<appSettings>
</appSettings> <applicationSettings> </applicationSettings>
5. I applied a patch. No conflicts were found. I got text like "New Version" text. Why weren't patch conflicts found?
What a remarkable piece of work! I'm impressed. :)
Would it be possible to add functionality to work with files, not just strings?
I compare 2 large texts and it gives me one big delete and one big insert even with "No cleanup". If I make texts smaller it shows differences nicely. Something is wrong.
student00x, nothing is wrong: your diff exceeded the time limit and it was forced to return a valid, but non-optimal diff. Try increasing diff_match_patch.Diff_Timeout, or setting it to zero.
I have two documents (.txt) with accented characters (è, ù). After computing the patches, I try to use the function "patch_toText(patches)" but I get a segmentation fault. Is there a solution? I'm working in C.
Impressive!
Nice work. I'm evaluating this for use in a larger project requiring client-side text operations. The project is built on the GWT-platform. Are there any recommendations as to: 1. Use the supplied javascript library via JSNI (GWT). 2. Compile the supplied java library to javascript using the GWT-compiler.
I'm considering option 2 to be the most feasible. I might have to verify the correctness of the result, but I will get highly optimized, cross-compiled JavaScript. And I don't have to include any JSNI code (which should be avoided when possible). Any thoughts, anyone?
Very cool. Is it possible to make it an option to have the comparison unit be a word instead of a char? An example where the output is confusing (to me at least) is Text1 = $200.00 abc $100.00 Text2 = $205.00 abc $102.50
Output is ... $20<strike>0</strike>5.00 abc $10<strike>0.0</strike>2.50
This is much less intuitive than <strike>200.00</strike>205.00 abc <strike>100.0</strike>102.50
Very nice and perfectly simple to use!
Very impressive piece of work. Thanks for sharing it with the world!
I too would like the php version, whatever state it's in.
Can I use this library for generating diffs of binary files?
Yes Mike, DMP works great on binary files. When I stress its suitability for 'text', that's meant as opposed to structured content such as XML; there's no problem with binary.
Hey Neil,
Great job on this package. Truly remarkable work. :) I have a question: I'd like to develop an editing app (non-realtime) that behaves like GNU diff3 (compare docs A and B to an older, parent copy C). I want to graphically represent the successful merges and conflicting segments. I was looking at using patch_apply for this but I noticed your comment in the function description:
"...A more informative API is in development..."
I'm assuming I probably won't be able to build a conflict resolver on the output of this API until the improvement you mentioned is ready?
Thanks again!
-Mike
Nice job. One correction should be made in the Java version though: the Diff subclass implements an equals() method but NO hashCode() method. A hashCode is needed if you e.g. put Diff instances into a HashMap to count occurrences of certain Diffs.
Nice Work! Would it be possible to provide a line-level diff? I'm looking for a tool that would provide a line-level diff between two text docs that can be stored in db and later recreate a particular version by either applying/removing patches.
for eg:
Thanks
Since the question of line and word level diffs is frequently asked, I've created a page describing how to do this:
Thanks Neil!
Is there an option to unpatch patches from the latest text in order to obtain the original?
I don't see one in the api list.
    @Test
    public void testDiffLineMode() {
      dmp = new diff_match_patch();
      String text1 = "The quick brown fox jumps over the lazy dog.";
      String text2 = "That quick brown fox jumped over a lazy dog.";
      LinesToCharsResult a = dmp.diff_linesToChars(text1, text2);
      List<String> lines = a.lineArray;
      // create the patches
      LinkedList<diff_match_patch.Diff> diffs = dmp.diff_main(a.chars1, a.chars2);
      dmp.diff_charsToLines(diffs, lines);
      LinkedList<Patch> patches = dmp.patch_make(diffs);
      // apply patches to text1 to get text2
      Object[] results = dmp.patch_apply(patches, text1);
      Assert.assertEquals("That quick brown fox jumped over a lazy dog.", results[0]);
      boolean[] boolArray = (boolean[]) results[1];
      Assert.assertTrue(boolArray[0]);
      // remove patches from text2 to get text1 ... something like this ...
      results = dmp.patch_remove(patches, text2); // --> remove the patch(es) from text2 to obtain text1
    }

Unpatching can be done by just looping through the diff, swapping DIFF_INSERT with DIFF_DELETE, then applying the patch.
This page is for documentation of the API. meena, could you move this discussion to the newsgroup? I'll delete these posts once you have done so. Thanks.
Please go ahead and remove them (including this one). Going forward I'll use the newsgroup. thanks
Can I use it to compare two HTML files?
Thanks Vishal
Great job, thank you. In my work I want to ignore HTML tags. For example, text1 = "<span class='some_class'>This is my Text</span>"; text2 = "<div>This is my Text</div>"; should compare as equal text (without checking the HTML tags <span>, <div>). How can I do this with diff_match_patch? Please suggest.
Neil, is it possible to see how the PHP version works? farhat.aminov@gmail.com
Hello!
One question: is it possible to know the line number where an INSERT diff appears?
Thanks for everything!
ANN: A Ruby implementation is available as a gem. It can be installed with RubyGems: "gem install diff_match_patch". Source code here: https://github.com/kalmbach/diff_match_patch
Big thx for this great work! It's just what I was looking for :)
Hi, I have used the following program to compare the data of 2 text files. I need to check whether the first file's data is present in the second file, comparing row-wise.
..........................................
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ComparingTextFiles {
..................................
It's working.
Can you unapply a patch?