My favorites | Sign in
Project Logo
                
Search
for
Updated Oct 31, 2008 by anmol.bhasin
Labels: Featured
Kamikaze  

Kamikaze

Kamikaze is a utility package wrapping set implementations on document lists. It also implements P4Delta compression algorithm for sorted integer segments to enable Inverted List compression for Search engines like Lucene.

Details

Kamikaze version 1.0.0 provides docset implementations on various underlying document id set representations for inverted lists in search engines. Currently the supported implementations include

  1. Integer Array representation : Document set based on Dynamic Integer Arrays
  2. OpenBitSet representation : Document Set based on OpenBitSet implementation from Lucene.
  3. P4Delta representation : Document Set for sorted Integer segments compressed using a variation of the P4Delta compression algorithm.
Ref : http://homepages.cwi.nl/~heman/downloads/msthesis.pdf
Ref : http://www2008.org/papers/pdf/p387-zhangA.pdf

The library also provides elementary set (AND|OR|NOT) operations on DocSets without materializing the final document set, this is extremely useful for large sorted integer segments.

Performance Comparison

This is a very elementary performance comparison in terms of compression ratios and iteration speeds for the different supported DocSet implementations. For a more comprehensive evaluation watch the performance comparison page in the future.

NOTE :

  1. These are numbers for iteration over the docidset using the iterator framework.
  2. P4Delta numbers represent iterating over the docsets in COMPRESSED form. We never actually decompress the P4Delta sets, as our motivation is to use it for caching inverted lists and hence the missing patching phase in the implementation.
  3. The size estimates are generated by employing the repeated GC with finalization technique and thus include Object Overheads with OpenBitSet and P4Delta implementations.(See class com.kamikaze.performance.TestSizeEstimates.java on the BR_DEV_1_0_4 branch)
  4. All Tests are run with the following VM args : -d64 -Xms2G -Xmx4G

Implementation Max Value Total Count Compression Ratio Time
IntArrayDocIdSet 30M 0.512M 1.0 8.78 ms
IntArrayDocIdSet 30M 1.024M 1.0 15.4 ms
IntArrayDocIdSet 30M 2.048M 1.0 31.9 ms
IntArrayDocIdSet 30M 2.048M 1.0 48.9 ms
IntArrayDocIdSet 30M 4.096M 1.0 62.4 ms
IntArrayDocIdSet 30M 8.192M 1.0 124 ms
IntArrayDocIdSet 40M 16.384M 1.0 250 ms
OpenBitSetDocIdSet 30M 0.512M 1.83 11.2 ms
OpenBitSetDocIdSet 30M 1.024M 0.91 20.2 ms
OpenBitSetDocIdSet 30M 2.048M 0.45 37.8 ms
OpenBitSetDocIdSet 30M 3.072M 0.305 52.4 ms
OpenBitSetDocIdSet 30M 4.096M 0.22 67 ms
OpenBitSetDocIdSet 30M 8.192M 0.11 130 ms
OpenBitSetDocIdSet 40M 16.384M 0.076 230 ms
P4DDocIdSet 30M 0.512M 0.278 21.9 ms
P4DDocIdSet 30M 1.024M 0.278 30.6 ms
P4DDocIdSet 30M 2.048M 0.278 57.6 ms
P4DDocIdSet 30M 3.072M 0.278 86.4 ms
P4DDocIdSet 30M 4.096M 0.278 115.0 ms
P4DDocIdSet 30M 8.192M 0.278 230 ms
P4DDocIdSet 40M 16.384M 0.278 457.0 ms

Skip Performance : (Skipping to every value incrementally in the set)

Implementation Max Value Total Count Compression Ratio Skip Time
IntArrayDocIdSet 30M 0.512M 1.0 109 ms
IntArrayDocIdSet 30M 1.024M 1.0 115 ms
IntArrayDocIdSet 30M 2.048M 1.0 193 ms
IntArrayDocIdSet 30M 2.048M 1.0 194 ms
IntArrayDocIdSet 30M 4.096M 1.0 193 ms
IntArrayDocIdSet 30M 8.192M 1.0 193 ms
IntArrayDocIdSet 40M 16.384M 1.0 183 ms
OpenBitSetDocIdSet 30M 0.512M 1.83 228 ms
OpenBitSetDocIdSet 30M 1.024M 0.91 207 ms
OpenBitSetDocIdSet 30M 2.048M 0.45 207 ms
OpenBitSetDocIdSet 30M 3.072M 0.305 211.4 ms
OpenBitSetDocIdSet 30M 4.096M 0.22 193 ms
OpenBitSetDocIdSet 30M 8.192M 0.11 182 ms
OpenBitSetDocIdSet 40M 16.384M 0.076 244 ms
P4DDocIdSet 30M 0.512M 0.278 268 ms
P4DDocIdSet 30M 1.024M 0.278 242 ms
P4DDocIdSet 30M 2.048M 0.278 257 ms
P4DDocIdSet 30M 3.072M 0.278 287 ms
P4DDocIdSet 30M 4.096M 0.278 310 ms
P4DDocIdSet 30M 8.192M 0.278 423 ms
P4DDocIdSet 40M 16.384M 0.278 715 ms


Sign in to add a comment
Hosted by Google Code