oracleofwoodyallen - ApproximateStringMatching.wiki

Approximate string matching

The Oracle is able to recommend some persons to the user if he or she misspells the name the a person it is looking for. This is a pretty cool feature that is not so easy to implement efficiently as it could seem.

For example, if we take as reference the (great) web page of The oracle of Bacon, one could check that if the first letter of the name is misspelled, the web page is not able to find a good candidate. However, in the (f*cking damn great) web page of The oracle of Woody Allen, no matter where you make the mistake, the oracle will recommend you the person that you are actually looking for (unless you type a completely different name).

This feature is one of the main topics that I am currently working on. I have learned lot of things implementing this feature but it can be improved, so probably this will change in the (I hope near) future. Anyway, if you want to learn more about approximate string matching and a good algorithm to solve it, keep reading this page.

Damerau-Levenshtein distance

The Damerau-Levenshtein distance measures the minimum number of operations needed to transform a string into another one. The valid operations are insertion, deletion, substitution and transposition of a single character.

For example, suppose that the string "hello" must be transformed into "hall". The minimum number of transformations is 2: delete the "o" and then substitute the "e" by an "a".

The classical way of computing this distance is using a dynamic programming algorithm very famous and explained in all the courses of introduction to algorithms in college.

The algorithm builds a table where each row represents a character of one word and each column a character from the other word. The idea is that the distance between the strings a_1 ... a_i and b_1 ... b_j is stored in the cell (i,j) of the table. An extra column and row are added to the table to represent the empty string. So, the actual distance between A and B is in the cell (|A|,|B|). Take a look at the table for the previous example.

| | | h | e | l | l | o | |:|:|:------|:------|:------|:------|:------| | |0|1 |2 |3 |4 |5 | | h |1|0 |1 |2 |3 |4 | | a |2|1 |1 |2 |3 |4 | | l |3|2 |2 |1 |2 |3 | | l |4|3 |3 |2 |1 |2 |

The formula to compute the value of each cell in the table is the next one.

Note that both time and space complexity to fill up this table is O(|A| |B|).

Ukkonen improvement

The algorithm can be improved if we only want to know if a string can be transformed into an other one using at least K operations. Esko Ukkonen noticed a couple of things about this algorithm: 1. If all elements in a row are greater than K, no elements in the following rows will be lower or equal to K. 1. If C(i) is the greatest column in row i such that D(i,j) <= K, all elements starting from column C(i)+2 in the next row (i+1) will be greater than K.

This two observations allow to compute the value of some cells that are not necessary in fact. The number of evaluated cells (time complexity) taking into account the improvements made by Ukkonen is O(K^2), which is significantly lower than using the basic algorithm and then checking if D(|A|, |B|) is lower or equal to K.

However, it is not a good idea to simply compute the Damerau-Levenshtein distance between the name typed by the user and all the names in the oracle. O(n K^2) is not a good asymptotic running time when you have millions of persons.

To solve this problem, the names of persons are organized in a trie (prefix tree) that has some properties that allow us to avoid a large number of calculus when the Damerau-Levenshtein distances between the name searched by the user and the names in the Oracle are computed. If you have not read the RadixTree page yet, please read it before going further.

Approximate string matching algorithm

First, one have to realize that there are two importants properties that will able us to avoid to do some extra work.

When we compute the distance between the strings X and YZ, the first |Y|+1 rows in the Damerau-Levenshtein matrix will be the same for any Z (also when Z is the empty string).
If all the elements in the row |Y|+1, in the Damerau-Lenvenshtein matrix for the strings X and Y, are greater than k, any string with the prefix Y will be at a distance greater than k from X.

Using a trie we can save time checking all the words that are at a distance lower or equal to k from a given input, since the Damerau-Levenshtein matrix can be calculated iteratively while we go down the tree (property 1). Also, we can prune our search on the tree when a prefix is at a distance greater than k, then we do not need to explore its children (property 2).

Some notes about the previous pseudocode: * update(D, C, x) updates the Damerau-Levenshtein matrix adding |x| rows and computing the value for the new elements, and updates the Ukkonen's vector C as it was explained before. * dist(D,C) return the value of the last element in the Damerau-Levenshtein matrix (row |v|, column |w|). It takes into account the fact that this value may be not computed. If so, it returns k+1.

The time complexity of this algorithm is O(k |A|^k), as it is explained in (1). k is the number of maximum allowed misspellings and |A| is the number of symbols in our alphabet. The great thing about this time complexity is that it does not depend on the number of words in the trie!

However, the complexity grows exponentially with the parameter k... Hopefully, the probability that a user makes lots of mistakes (a large k value) when he or she is typing a name is very low, so I limit k to 3. Well, in fact the formula that gives a value to k is min(3, |w|/4) where w is the word typed by the user.

This means that the user can misspell the 25% of the name if the name has less than 12 characters, otherwise, the maximum number of allowed errors is 3.

The next graphic shows the response time for misspelled names in the current implementation of the oracle.

The confidence intervals are calculated with a confidence level of 95%. The average response time with Damerau-Levenshtein distance and Ukkonen's improvement using the trie described in RadixTree is 0.208 seconds. This includes the time spent processing the request and building the response.

Bibliography

(1) H. Shang, T.H. Merrett (1995). Tries for approximate string matching.