Kamikaze
Kamikaze is a utility package wrapping set implementations on document lists. It also implements P4Delta compression algorithm for sorted integer segments to enable Inverted List compression for Search engines like Lucene.
Details
Kamikaze version 1.0.0 provides docset implementations on various underlying document id set representations for inverted lists in search engines. Currently the supported implementations include
- Integer Array representation : Document set based on Dynamic Integer Arrays
- OpenBitSet representation : Document Set based on OpenBitSet implementation from Lucene.
- P4Delta representation : Document Set for sorted Integer segments compressed using a variation of the P4Delta compression algorithm.
The library also provides elementary set (AND|OR|NOT) operations on DocSets without materializing the final document set, this is extremely useful for large sorted integer segments.
Performance Comparison
This is a very elementary performance comparison in terms of compression ratios and iteration speeds for the different supported DocSet implementations. For a more comprehensive evaluation watch the performance comparison page in the future.
NOTE :
- These are numbers for iteration over the docidset using the iterator framework.
- P4Delta numbers represent iterating over the docsets in COMPRESSED form. We never actually decompress the P4Delta sets, as our motivation is to use it for caching inverted lists and hence the missing patching phase in the implementation.
- The size estimates are generated by employing the repeated GC with finalization technique and thus include Object Overheads with OpenBitSet and P4Delta implementations.(See class com.kamikaze.performance.TestSizeEstimates.java on the BR_DEV_1_0_4 branch)
- All Tests are run with the following VM args : -d64 -Xms2G -Xmx4G
| Implementation | Max Value | Total Count | Compression Ratio | Time |
| IntArrayDocIdSet | 30M | 0.512M | 1.0 | 8.78 ms |
| IntArrayDocIdSet | 30M | 1.024M | 1.0 | 15.4 ms |
| IntArrayDocIdSet | 30M | 2.048M | 1.0 | 31.9 ms |
| IntArrayDocIdSet | 30M | 2.048M | 1.0 | 48.9 ms |
| IntArrayDocIdSet | 30M | 4.096M | 1.0 | 62.4 ms |
| IntArrayDocIdSet | 30M | 8.192M | 1.0 | 124 ms |
| IntArrayDocIdSet | 40M | 16.384M | 1.0 | 250 ms |
| OpenBitSetDocIdSet | 30M | 0.512M | 1.83 | 11.2 ms |
| OpenBitSetDocIdSet | 30M | 1.024M | 0.91 | 20.2 ms |
| OpenBitSetDocIdSet | 30M | 2.048M | 0.45 | 37.8 ms |
| OpenBitSetDocIdSet | 30M | 3.072M | 0.305 | 52.4 ms |
| OpenBitSetDocIdSet | 30M | 4.096M | 0.22 | 67 ms |
| OpenBitSetDocIdSet | 30M | 8.192M | 0.11 | 130 ms |
| OpenBitSetDocIdSet | 40M | 16.384M | 0.076 | 230 ms |
| P4DDocIdSet | 30M | 0.512M | 0.278 | 21.9 ms |
| P4DDocIdSet | 30M | 1.024M | 0.278 | 30.6 ms |
| P4DDocIdSet | 30M | 2.048M | 0.278 | 57.6 ms |
| P4DDocIdSet | 30M | 3.072M | 0.278 | 86.4 ms |
| P4DDocIdSet | 30M | 4.096M | 0.278 | 115.0 ms |
| P4DDocIdSet | 30M | 8.192M | 0.278 | 230 ms |
| P4DDocIdSet | 40M | 16.384M | 0.278 | 457.0 ms |
Skip Performance : (Skipping to every value incrementally in the set)
| Implementation | Max Value | Total Count | Compression Ratio | Skip Time |
| IntArrayDocIdSet | 30M | 0.512M | 1.0 | 109 ms |
| IntArrayDocIdSet | 30M | 1.024M | 1.0 | 115 ms |
| IntArrayDocIdSet | 30M | 2.048M | 1.0 | 193 ms |
| IntArrayDocIdSet | 30M | 2.048M | 1.0 | 194 ms |
| IntArrayDocIdSet | 30M | 4.096M | 1.0 | 193 ms |
| IntArrayDocIdSet | 30M | 8.192M | 1.0 | 193 ms |
| IntArrayDocIdSet | 40M | 16.384M | 1.0 | 183 ms |
| OpenBitSetDocIdSet | 30M | 0.512M | 1.83 | 228 ms |
| OpenBitSetDocIdSet | 30M | 1.024M | 0.91 | 207 ms |
| OpenBitSetDocIdSet | 30M | 2.048M | 0.45 | 207 ms |
| OpenBitSetDocIdSet | 30M | 3.072M | 0.305 | 211.4 ms |
| OpenBitSetDocIdSet | 30M | 4.096M | 0.22 | 193 ms |
| OpenBitSetDocIdSet | 30M | 8.192M | 0.11 | 182 ms |
| OpenBitSetDocIdSet | 40M | 16.384M | 0.076 | 244 ms |
| P4DDocIdSet | 30M | 0.512M | 0.278 | 268 ms |
| P4DDocIdSet | 30M | 1.024M | 0.278 | 242 ms |
| P4DDocIdSet | 30M | 2.048M | 0.278 | 257 ms |
| P4DDocIdSet | 30M | 3.072M | 0.278 | 287 ms |
| P4DDocIdSet | 30M | 4.096M | 0.278 | 310 ms |
| P4DDocIdSet | 30M | 8.192M | 0.278 | 423 ms |
| P4DDocIdSet | 40M | 16.384M | 0.278 | 715 ms |