sketchsortj


Fast all pairs similarity search for Jaccard distance

Introduction

SketchSortJ(1,2) is a software for all pairs similarity search. It takes as an input data points and outputs approximate neighbor pairs within a Jaccard distance (1.0 - Jaccard-similarity). First, the input data points are mapped to sketches by minwise independent permutations, also called minhash, and then neighbor pairs of sketches within a Hamming distance are enumerated by the multiple sorting method (3). Finally, the Jaccard distances for such neighbor pairs are calculated. If the Jaccard distance for a neighbor pair is no more than a user-specified threshold , the neighbor pair is outputted. One might worry about missed nearest neighbor pairs by our method. A theoretical bound of the expectation of missing edge ratio is derived. It enables us to set parameters so as to limit the empirical missing edge ratio as small as possible.

Quick Start

To compile SketchSort , please type the followings: tar -xjvf sketchsort-jaccard-X.X.X.tar.bz2 cd sketchsort-jaccard-X.X.X/src make ./sketchsortj -auto -missingratio 0.001 -jaccard 0.05 ../dat/fingerprints.txt outputfile where fingerprints.txt is an input file and outputfile is a result file. By using -auto option, the other parameters are automatically decided based on given Jaccard distance threshold (0.05) and missing edge ratio (0.001).

Usage

./sketchsortj [options] input-file output-file Option: -hamdist [d]: set the Hamming distance threshold (default: 2) -numblocks [b]: set the number of blocks (default: 5). Setting b to d + 3 is recommended. -jaccard [\epsilon]: set the Jaccard distance threshold (1-Jaccard similarity) (default: 0.05) -numchunks [Q]: set the number of chuncks (default: 3) -auto: given Jaccard distance threshold (-jaccard) and missing edge ratio (-missingratio), the other parameters (-hamdist -numblocks -numchunks) are automatically decided -missingratio[m]: set the desired missing edge ratio (default:0.0001)

A user who would like to know the meanings of these options can see our original paper (2).

Format of input file

Each line in an input file is a set of items in which each item is separated by space. Here is an example. 1 2 5 6 8 10 11 3 4 101 120 130 140 1 6 8 10 12 16 18 100 120 130 ...

Format of output file

Each line in an output file consists of a pair of identifiers of neighbor points and their Jaccard distance. An identifier corresponds to a line number in an input file and starts from "0". 100 102 0.05 1 1000 0.01 1285 1363 0.0153 ...

References

(1) Y.Tabei, T.Uno, M.Sugiyama, K. Tsuda, Single versus Multiple Sorting in All Pairs Similarity Search, The 2nd Asian Conference on Machine Learning (ACML), 2010

(2) Y.Tabei and K.Tsuda: SketchSort: Fast all pairs similarity search for large databases of molecular fingerprints, Molecular Informatics, 30, 801-807, 2011.

(3) T. Uno. Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data. Knowlege and Information Systems, 2009. published online.

(4) https://sites.google.com/site/yasuotabei/sachica

Project Information

The project was created on Mar 5, 2011.

Labels:
sketchsort jaccard-tanimoto minhash allpairssimilaritysearch