My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
Tutorial  

Tutorial
Updated Oct 22, 2011 by cubicdaiya@gmail.com

Getting Started

It is easy to use dtl. All you have to do for using dtl is including dtl.hpp in the source archive.

#include "dtl/dtl.hpp"

Compare two strings

First of all, calculate the difference between two strings.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.compose();

When the above code is executed, dtl calculates the difference between A and B as Edit Distance and LCS and SES.

The meaning of these three terms is following.

Edit Distance Edit Distance is numerical value for declaring a difference between two sequences.
LCS LCS stands for Longest Common Subsequence.
SES SES stands for Shortest Edit Script. I mean SES is the shortest course of action for tranlating one sequence into another sequence.

If one sequence is "abc" and another sequence is "abd", Edit Distance and LCS and SES is following.

Edit Distance 2
LCS ab
SES C a C b D c A d

  • 「C」:Common
  • 「D」:Delete
  • 「A」:ADD

If you want to know in more detail, please see examples/strdiff.cpp in the source archive.

This calculates Edit Distance and LCS and SES of two strings received as command line arguments and prints each.

When one string is "abc" and another string "abd", the output of strdiff is following.

$ ./strdiff abc abd
editDistance:2
LCS:ab
SES
 a
 b
-c
+d
$

Compare two data has arbitrary type

dtl can compare data has aribtrary type because of the C++'s template.

But the compared data type must support the random access_iterator.

In the previous example, the string data compared,

dtl can also compare two int vectors like following example.

int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int b[10] = {3, 5, 1, 4, 5, 1, 7, 9, 6, 10};
std::vector<int> A(&a[0], &a[10]);
std::vector<int> B(&b[0], &b[10]);
dtl::Diff< int > d(A, B);
d.compose();

If you want to know in more detail, please see examples/intdiff.cpp in the source archive.

Merge three sequences

dtl has the diff3 function.

This function is that dtl merges three sequences.

Additionally dtl detects the confliction.

typedef char elem;
typedef std::string sequence;
sequence A("qqqabc");
sequence B("abc");
sequence C("abcdef");
dtl::Diff3<elem, sequence> diff3(A, B, C);
diff3.compose();
if (!diff3.merge()) {
  std::cerr << "conflict." << std::endl;
  return -1;
}
std::cout << "result:" << diff3.getMergedSequence() << std::endl;

When the above code is executed, the output is following.

result:qqqabcdef

If you want to know in more detail, please see examples/strdiff3.cpp in the source archive.

Patch function

dtl can also translates one sequence to another sequence with SES.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
string s1(A);
string s2 = d.patch(s1);

When the above code is executed, s2 becomes "abd". The SES of A("abc") and B("abd") is following.

Common a
Common b
Delete c
Add    d

The patch function translates a sequence as argument with SES. For this example, "abc" is translated to "abd" with above SES.

Please see dtl's header files about the data structure of SES.

Treat difference as Unified Format

dtl can also treat difference as Unified Format. See the following example.

typedef char elem;
typedef std::string sequence;
sequence A("acbdeaqqqqqqqcbed");
sequence B("acebdabbqqqqqqqabed");
dtl::Diff<elem, sequence > d(A, B);
d.compose();             // construct an edit distance and LCS and SES
d.composeUnifiedHunks(); // construct a difference as Unified Format with SES.
d.printUnifiedFormat();  // print a difference as Unified Format.

The difference as Unified Format of "acbdeaqqqqqqqcbed" and "acebdabbqqqqqqqabed" is following.

@@ -1,9 +1,11 @@
 a
 c
+e
 b
 d
-e
 a
+b
+b
 q
 q
 q
@@ -11,7 +13,7 @@
 q
 q
 q
-c
+a
 b
 e
 d

The data structure Unified Format is following.

/**
 * Structure of Unified Format Hunk
 */
template <typename sesElem>
struct uniHunk {
  int a, b, c, d;                   // @@ -a,b +c,d @@
  std::vector<sesElem> common[2];   // anteroposterior commons on changes
  std::vector<sesElem> change;      // changes
  int inc_dec_count;                // count of increace and decrease
};

The actual blocks of Unified Format is this structure's vector.

If you want to know in more detail, please see examples/unistrdiff.cpp and examples/unidiff.cpp and dtl's header files in the source archive.

In addtion, dtl has the function translates one sequence to another sequence with Unified Format.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
d.composeUnifiedHunks()
string s1(A);
string s2 = d.UniPatch(s1);

When the above code is executed, s2 becomes "abd". The uniPatch function translates a sequence as argument with Unified Format blocks.

For this example, "abc" is translated to "abd" with the following Unified Format block.

@@ -1,3 +1,3 @@
 a
 b
-c
+d

Other

Compare large sequence

When compare two large sequence, dtl can optimizes the calculation of difference with the onHuge function.

But this function is could use when the compared data type is std::vector.

When you use this function, you may call this function before calling compose function.

typedef char elem;
typedef  std::vector<elem> sequence;
sequence A;
sequence B;
/* ・・・ */
dtl::Diff< elem, sequence > d(A, B);
d.onHuge();
d.compose();

Unserious difference

The calculation of difference is very heavy. dtl uses An O(NP) Sequence Comparison Algorithm.

Though this Algorithm is sufficiently fast, when difference between two sequences is very large,

the calculation of LCS and SES needs massive amounts of memory.

dtl avoids above-described problem by dividing each sequence into plural subsequences and joining the difference of each subsequence finally.

As this way repeats allocating massive amounts of memory, dtl provides other way. It is the way of calculating unserious difference.

For example, The normal SES of "abc" and "abd" is following.

Common a
Common b
Delete c
Add    d

The unserious SES of "abc" and "abd" is following.

Delete a
Delete b
Delete c
Add    a
Add    b
Add    d

Of course, when "abc" and "abd" are compared with dtl, above difference is not derived.

dtl calculates the unserious difference when dtl judges the calculation of LCS and SES needs massive amounts of memory and unserious difference function is ON.

dtl joins the calculated difference before dtl judges it and unserious difference finally.

As a result, all difference is not unserious difference when unserious difference function is ON.

When you use this function, you may call this function before calling compose function.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onUnserious();
d.compose();

Calculate only Edit Distance

As using onOnlyEditDistance, dtl calculates the only edit distance.

If you need only edit distance, you may use this function, because the calculation of edit distance is lighter than the calculation of LCS and SES.

When you use this function, you may call this function before calling compose function.

typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onOnlyEditDistance();
d.compose();
Powered by Google Project Hosting