|
HowTo
Usage information.
Multi-Dimensional Space SpecificationBefore doing anything useful the library needs to know the specification of the multi-dimensional space, that is, how many dimensions there are, and also the precision (number of bits) for each dimension. All this information is captured in a MultiDimensionalSpec object. For compact Hilbert index and inverse calculation there are no limitations on the number of bits. For query building, the total precision must not exceed 62. In general it's a good idea to keep the precision of each dimension as small as possible. Fixed Size Bit VectorThe coordinates of a point in the multi-dimensional space need to be first represented as a BitVector[]. If the coordinates are of type long or BitSet, the corresponding bit vectors can be created using BitVectorFactories.OPTIMAL.apply(precision), and then using the copyFrom(long/BitSet) overloads. There is no direct support for other numerical types such as BigInteger right now, so in such situations one will most likely need to populate the bit vector bit by bit. To transform from a bit vector back into long use toExactLong(); for BitSet use toBitSet(). Compact Hilbert Index MappingsFor performance reasons some methods need a pre-allocated output parameter to be passed in. To compute an index, call SpaceFillingCurve.index(point, 0, index) with a pre-allocated index bit vector of size MultiDimensionalSpec.sumBitsPerDimension(). To extract the coordinates back from an index, use SpaceFillingCurve.index(index, point) with a pre-allocated point array. Queries in Read-Write TablesSpace filling curves can be used to index multidimensional data so that regions described mainly as range constraints on some or all of the dimensions can be queried more efficiently than if one indexed only one dimension. The challenge is to encode the data in such a way that small regions in the multi-dimensional space result in a small number of ranges for a plain-vanilla uni-dimensional relational database index. One way of achieving this is by storing the (big-endian) multi-dimensional index either as the primary key (recommended), or as an indexed column in the table. Then there are a few options to consider:
select columns from table where (hilbertIndex between lo_0 and hi_0 or ... hilbertIndex between lo_n and hi_n)
and x between xlo and xhi and y between ylo and yhi and z between zlo and zhi and t between tlo and thi ...select columns from (select hilbertIndex, columns from table where hilbertIndex between lo_0 and hi_0)
where hilbertIndex between sublo_00 and subhi_00 or hilbertIndex between sublo_01 and subhi_01 ...
UNION ALL
... (the same but replace lo_0 with lo_n, hi_0 with hi_n, sublo_0i with sublo_ni and subhi_0i with subhi_ii)Note that all sub-ranges of a range are fully covered by that range. In either case the query region is represented by the RegionInspector abstraction, with the plain-vanilla SimpleRegionInspector implementation provided out of the box for queries consisting of one or more hyper-rectangles. Note that the query doesn't have to look like a (small) set of hyper-rectangles, and can in general have any shape, with the inherent performance price and the need to implement a custom region inspector. Here's a snippet of what the code could look like: // int[] min, int[] max represent the query hyper-rectangle, both ends inclusive
List<LongRange> criterion = new ArrayList<LongRange>(min.length);
for (int i = 0; i < min.length; ++i) {
criterion.add(LongRange.of(min[i], max[i] + 1));
}
RegionInspector<Object> regionInspector = SimpleRegionInspector.create(
ImmutableList.<List<LongRange>>of(criterion), 1L,
Functions.constant(FILTER));
final int maxFilteredRanges = 20;
// PlainFilterCombiner since we're not using sub-ranges here
QueryBuilder<Object> queryBuilder = BacktrackingQueryBuilder.create(
regionInspector, PlainFilterCombiner.INSTANCE,
maxFilteredRanges, true);
// sfc is of type SpaceFillingCurve, such as an instance of CompactHilbertIndex
sfc.accept(new ZoomingSpaceVisitorAdapter(sfc, queryBuilder));
Query<Object> query = queryBuilder.get();
List<FilteredIndexRange<Object>> ranges = query.getFilteredIndexRanges();Then construct an SQL query that selects between ranges.get(i).getIndexRange().getStart() and ranges.get(i).getIndexRange().getEnd() - 1 inclusive at both ends, and for those ranges returning true from ranges.get(i).isPotentialOverSelectivity() add a where clause based on the coordinate values if stored redundantly in the database. Note that those ranges that return false from ranges.get(i).isPotentialOverSelectivity() are guaranteed to not select any extraneous points. Also, if maxFilteredRanges is made large enough then it is guaranteed that no ranges will have potential over-selectivity, but the number of ranges can easily grow too large. Queries in Mostly Read-Only TablesOptionally for a mostly read-only table, after each modification the table can be scanned and a limited rolled-up version of the table information extracted in memory and used to significantly speed up both query building and query execution, the latter as a result of better queries being produced. The idea is that the table is analysed to see which regions have points, and which regions have more points than others and the information is aggregated into a tree-shaped cache of specified size. Although not a complete example, this snippet of code shows the main steps involved in the analysis phase that involves one full table scan: int[] elementLengths =
PrimitiveArrays.toIntArray(new HilbertIndexMasks(sfc.getSpec()).cardinalities());
BitVector[] path = new BitVector[elementLengths.length];
for (int i = 0; i < path.length; ++i) {
path[i] = BitVectorFactories.OPTIMAL.apply(elementLengths[path.length - i - 1]);
}
StreamingRollup<BitVector, CountingDoubleArray> rollup = BoundedRollup.create(
CountingDoubleArray.newEmptyValue(0), cacheSize);
// for each row, ordered by the hilbertIndex column:
for (int i = 0; i < path.length; ++i) {
path[i] = path[i].clone();
}
BitVectorMath.split(hilbertIndex, path);
rollup.feedRow(Iterators.forArray(path, 0, path.length),
new CountingDoubleArray(ArrayUtils.EMPTY_DOUBLE_ARRAY));
MapNode<BitVector, CountingDoubleArray> rolledupTree = rollup.finish();
Pow2LengthBitSetRangeFactory<CountingDoubleArray> factory =
Pow2LengthBitSetRangeFactory.create(PrimitiveArrays.asList(elementLengths));
Map<Pow2LengthBitSetRange, NodeValue<CountingDoubleArray>> rolledupMap = factory.apply(rolledupTree);Then all that's left is to decorate the region inspector for query building purposes: RegionInspector regionInspector =
MapRegionInspector.create(rolledupMap, simpleRegionInspector, false);Other ConsiderationsAt this point very little performance testing has been done, and although the index and inverse computation routines are pretty stable, significant changes might be made in the query building mechanism in the future. Also, note that many relational databases provide built-in multi-indexing support based on R-trees and not only, which should be more efficient where available than the database independent approach employed here. From a stability point of view, although this version is marked as 0.1, there are extensive unit tests for most of the functionality with very good coverage. The unit tests should also serve as further usage examples. Dependencies: The library has only 2 dependencies: Google Collections 20080819 and Apache Commons Lang 2.4. The test code also depends on EasyMock 2.0 and JUnit 3.8.1. Useful Links |
Sign in to add a comment