New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BloomFilter<E> #12
Comments
Original comment posted by kevinb9n on 2007-10-30 at 11:06 PM Hypothetical use cases:
There may be practical reasons why a Bloom filter is or isn't a good choice in any of Note: a Bloom filter requires good hashing to work properly, not just a crummy 32-bit |
Original comment posted by kevinb9n on 2007-11-03 at 05:33 PM (No comment entered for this change.) Labels: |
Original comment posted by kevinb9n on 2008-06-02 at 05:48 PM (No comment entered for this change.) Labels: - |
Original comment posted by kevinb9n on 2009-09-17 at 06:02 PM (No comment entered for this change.) Labels: |
Original comment posted by tomgibara on 2010-03-17 at 10:34 AM I've just taken a look at the interface and thought I'd respond with some initial There is another method of the Collection interface that could be usefully adopted: boolean mightContainAll(Collection<?> c) //equivalent of containsAll() Implementing size() is a possibility, by counting the number of adds that mutate the Extending the Collection interface directly is of little benefit because the boolean addAll(BloomFilter<E> filter) boolean mightContainAll(BloomFilter<E> filter) //containsAll would be a more accurate These last two operations are clearly only defined over compatible filters, but are int getCapacity() And this is not sufficient because access to the actual bitset would also be Bitset getBits() Whether this Bitset would be an interface or concrete I find difficult to determine. * The BloomFilter 'union' operation (addAll) and 'subset' test (containsAll) become Whether this complexity is worthwhile is an open question, but my view is that, if |
Original comment posted by kevinb@google.com on 2010-07-30 at 03:53 AM (No comment entered for this change.) Labels: - |
Original comment posted by kevinb@google.com on 2010-07-30 at 03:56 AM (No comment entered for this change.) Labels: - |
Original comment posted by reachbach on 2010-08-11 at 11:50 AM Support for set operations (determining the union, intersection or difference without exposing bitsets) over bloom filters would be useful. |
Original comment posted by kevinb@google.com on 2011-01-27 at 01:54 PM (No comment entered for this change.) Owner: kev...@google.com |
Original comment posted by kevinb@google.com on 2011-01-27 at 01:59 PM (No comment entered for this change.) Owner: --- |
Original comment posted by jon.h.clark on 2011-02-07 at 05:31 PM A very useful data structure indeed. |
Original comment posted by kevinb@google.com on 2011-07-13 at 06:18 PM (No comment entered for this change.) Status: |
Original comment posted by kevinb@google.com on 2011-07-16 at 08:37 PM (No comment entered for this change.) Status: |
Original comment posted by kevinb@google.com on 2011-07-19 at 02:24 PM There has been a lot of progress on this and we may see it in release 12, to put a guess out there. |
Original comment posted by wasserman.louis on 2011-12-06 at 04:09 PM Any chance we could get this into Git, if not exposed, so I can play with it? |
Original comment posted by jim.andreou on 2011-12-06 at 05:10 PM Since recently it became more or less feature complete (i.e. serialization), I wouldn't mind exposing it as-is in @Beta. Kevin? |
Original comment posted by jim.andreou on 2011-12-07 at 02:01 AM Seems like there is no roadblock to making this public (@Beta of course), along with the new hashing abstractions. Just a matter of someone actually doing it |
Original comment posted by wasserman.louis on 2011-12-07 at 02:37 AM I get the impression folks are really busy, though, so I'd settle for open-sourcing it without exposing the API yet. |
Original comment posted by fry@google.com on 2011-12-10 at 03:12 PM (No comment entered for this change.) Labels: |
Original comment posted by wasserman.louis on 2011-12-31 at 05:17 PM (No comment entered for this change.) Status: |
Original comment posted by vayasya on 2012-04-07 at 12:05 AM Didn't want to create a separate ticket for this question. 'put' has Key-Value association semantics by convention and 'add' has Set-element-addition semantics, (both in Java at least). I am wondering why the BloomFilter implementation uses 'put' instead of 'add' even though the original API draft at the top of this page says 'add' a possible 'addAll'. |
Original comment posted by seharris on 2012-06-14 at 02:37 PM The proposed addAll(BloomFilter<T>) method (alternately called "union") is essential for my needs, which involve populating several compatible filters on separate computers and then combining the (deserialized) filters back into one. At present, there's no way build up one filter from one or more other filters (despite the copy() method being provided). |
Original comment posted by wasserman.louis on 2012-06-14 at 02:39 PM It's not clear to me that such a method could be defined, depending on the definition of "compatible" filters. |
Original comment posted by seharris on 2012-06-14 at 04:53 PM Comment #5 above touches on this challenge of deducing that two filters are compatible. I suspect that you won't buy this stance, but I'd be happy enough with trusting the caller. If the caller doesn't understand well enough the invalid consequences of pouring one bag of bits into another, there's not much we can do for him. I'd rather see it become /possible/ to perform this operation than continue to ban it, just because it's vulnerable to misuse. |
Original comment posted by wasserman.louis on 2012-06-14 at 04:58 PM Hmmm. How would you feel about throwing an exception if the target filter wasn't initialized with the same parameters? |
Original comment posted by seharris on 2012-06-14 at 05:44 PM Throwing an exception (IllegalArgumentException) is a reasonable response to trying to merge in a filter constructed with different parameters. The addAll (or merge() or union()) method would then involve first verifying equivalent configuration of the incoming filter and, upon confirming compatibility, ORin in its underlying bit vector, right? |
Original comment posted by wasserman.louis on 2012-06-14 at 05:46 PM Yep, that's what I'd expect. |
Original issue created by kevinb9n on 2007-10-23 at 04:02 AM
A BloomFilter is a useful data structure.
http://en.wikipedia.org/wiki/Bloom_filter
Proposed straw-man API follows. It shares only four methods in common with
Collection, and adds two of its own.
/**
* A probabilistic "shadow" of a set of elements, useful
* when the set itself would be too expensive to maintain in
* memory and query directly. A Bloom filter can give false
* positivites, but never false negatives. That is, adding
* an element to the filter guarantees that {@link
* #mightContain} will return {@code true}, but {@link
* #mightContain} returning {@code true} does not guarantee
* that this element was ever actually added to the filter.
/
public final class BloomFilter<E> implements Serializable {
/*
* Returns {@code true} if it is possible
* (probability nonzero) that {@code element} is contained
* in the set represented by this Bloom filter. If this
* method returns {@code false}, this element is
* definitely not present. If it {@code true}, the
* probability that this element has not actually
* been added is given by {@link
* #getFalsePositiveProbability()}.
*/
public boolean mightContain(@Nullable E element) { ... }
/**
* Returns the probability that {@link #mightContain} will
* return {@code true} for an element not actually
* contained in this set.
*/
public double getFalsePositiveProbability() { ... }
/**
* Adds {@code newElement} to this Bloom filter, so that
* {@code mightContain(newElement)} is now guaranteed to
* return {@code true}.
*
* @return true if the Bloom filter changed as a result of
* this call
*/
public boolean add(@Nullable E newElement) { ... }
// self-explanatory methods:
public boolean addAll(Iterable<? extends E> elements) { ... }
public boolean isEmpty() { ... }
public void clear() { ... }
}
The text was updated successfully, but these errors were encountered: