|
Project Information
|
Introductionfujimap is a library for a space-efficient associative map. It associates a key with a value, and supports "set" and "get" operations, which is fundamental for many applications. The characteristic of a fujimap is that its working space is surprisingly small; it does not depend on key information, and it only requires a space for storing values by allowing some false positive errors. For example, if the all values are less than 65536 (can be estored in 16 bit) and false positives are allowed with the probability 2^{-8}, the working space is exactly (16+8) bit per key plus some small overhead whatever keys are. Moreover, fujimap supports these operations in almost constant time. Therefore it can handle very large set of key/values (e.g. 1 billion) on memory. Fujimap uses several techniques of recent studies in data structures and compression. Quick StartTo install fujimap, please type the followings. tar xvjf fujimap-x.x.x.tar.bz2 cd fujimap-x.x.x ./ configure make sudo make install Command line programs are also provided. fujimap is a command line program. It supports to build a dictionary file from a key/value file and save it in a specified file. When you specify a key/value file, and index name, it creates an index from the key/value file and save it in the index. When you specify an index name only, it performs get operations in an interactive mode. First, you need to prepare a key/value file as follows. key1\tvalue1 key2\tvalue2 ... Each line consists of a pair of key and value, and each key is followed by a tab and a corresponding value. $ fujimap
usage: fujimap --index=string [options] ...
options:
-i, --index Index (string)
-d, --dic Key/Value file. when "-" is specified it reads from stdin (string [=])
-f, --fpwidth False positive rate 2^{-f} (0 <= f < 31) (int [=0])
-t, --tmpN TemporarySize (int [=0])
-e, --encode Code encoding (=binary, gamma) (string [=])
-w, --workingfile Working file for temporary data (string [=])
-l, --logvalue When specified, store a log of input value
-s, --stringvalue When specified, sotre a string of input value
-h, --help print this message
$ cat ./dat/input
...
Fuji 4785972
Fuji.. 1002
Fuji.jpg 388
Fuji8x 522
FujiBBS 224
FujiCEM 302
FujiChrome 645
FujiColor 1266
FujiCrys 264
...
$ fujimap -d ./dat/input -i ./dat/index -f 10
read 13588391 keys
build done.
keyNum: 13588391
fpLen: 10
encoding: binary
wsize(bytes): 67343960
bits/key: 52.6506
save done.
$ fujimap -i ./dat/index
load done.
keyNum: 13588391
fpLen: 0
encoding: binary
wsize(bytes): 67343960
bits/key: 52.6506
>FujiChrome
FOUND:645
>Fuji
FOUND:4785972
>Fujisoba
NOTFOUND:
>You can also use fujimap in your C++ program by including "<fujimap/fujimap.hpp>" and add an option -lfujimap at a compile time. To register key/value, use Fujimap::setString() or Fujimap::setInteger() #include <fujimap/fujimap.hpp>
...
fujimap_tool::Fujimap fm;
string key = string("fujiyama");
uint64_t value = 3775;
fm.setInteger(key.c_str(), key.size(), value;
// string value = string("Japan");
// fm.setString(key.c_str(), key.size(), value.c_str(), value.size()); You can also register a string keyTo get a value associated with a key, use Fujimap::getString() or Fujimap::getInteger(). If a key is not found, it returns fujimap_tool::NOTFOUND, and NULL respectively. string key = string("fujiyama");
uint64_t val = fm.getInteger(key.c_str(), key.size()); // val == 3775
uint64_t val2 = fm.getInteger("yama", 4); // val2 == fujimap_tool::NOTFOUND
// size_t retLen = 0;
// const char* sval = fm.getString(key.c_str(), key.size(), retLen); // sval == "Japan" If the same key is already registered, the old value is replaced with new value. You can save (load) Fujimap in (from) a file fm.save("index"); // save in a file "index"
fm.load("index"); // load from a file "indexAPI C++See fujimap.hpp Example Code#include <fujimap/fujimap.hpp>
...
fujimap_tool::Fujimap fm;
fm.initFP(30) // Allow false positives with the probability 2^{-30}
vector<pair<char*, uint64_t> > mountains;
mountains.push_back(make_pair("Fujiyama", 3775));
mountains.push_back(make_pair("Everest", 8844));
mountains.push_back(make_pair("McKinley", 6194));
mountains.push_back(make_pair("Kilima-Njaro", 5895));
for (size_t i = 0; i < mountains.size(); ++i){
const string& key = mountains[i].first;
const uint64_t val = mountains[i].second;
fm.setInteger(key.c_str(), key.size(), val, true); // register and ready to search immediately with some additional space
assert(fm.getInteger(key.c_str(), key.size()) == val); // the key is searchable immediately
}
string dummyKey = "Aconcagua"; // not registered in fm
if (fm.getInteger(dummyKey.c_str(), dummyKey.size()) != fujimap_tool::NOTFOUND){
cout << "your unlucky" << endl; // This would happen with the prob. 2^{-30}
}
if (fm.save("myindex") == -1){
cerr << fm.what() << endl; // error report
return -1;
}
fujimap_tool::Fujimap fm2;
if (fm2.load("myindex") == -1){
cerr << fm2.what() << endl; // error report
return -1;
}
Other example codes are found in fujimapMain.cpp fujimapTest.cpp Performance TestWeb 1T 5-gram Version 1
Store an approximate value (a log of input value)
Store an exact value
Random Key and Value
fujimapPerformanceTest 100000000
NoteHere, we use the notation following the Bloom filter; positive corresponds to the filtered out items, and "false positive" refers to the item which should be filtered out, are misclassified as registered items. |