Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BloomFilter<E> #12

Closed
gissuebot opened this issue Oct 31, 2014 · 27 comments
Closed

BloomFilter<E> #12

gissuebot opened this issue Oct 31, 2014 · 27 comments
Labels

Comments

@gissuebot
Copy link

Original issue created by kevinb9n on 2007-10-23 at 04:02 AM


A BloomFilter is a useful data structure.

  http://en.wikipedia.org/wiki/Bloom_filter

Proposed straw-man API follows. It shares only four methods in common with
Collection, and adds two of its own.

/**
 * A probabilistic "shadow" of a set of elements, useful
 * when the set itself would be too expensive to maintain in
 * memory and query directly. A Bloom filter can give false
 * positivites, but never false negatives. That is, adding
 * an element to the filter guarantees that {@link
 * #mightContain} will return {@code true}, but {@link
 * #mightContain} returning {@code true} does not guarantee
 * that this element was ever actually added to the filter.
 /
public final class BloomFilter<E> implements Serializable {
  /
*
   * Returns {@code true} if it is possible
   * (probability nonzero) that {@code element} is contained
   * in the set represented by this Bloom filter. If this
   * method returns {@code false}, this element is
   * definitely not present. If it {@code true}, the
   * probability that this element has not actually
   * been added is given by {@link
   * #getFalsePositiveProbability()}.
   */
  public boolean mightContain(@Nullable E element) { ... }

  /**
   * Returns the probability that {@link #mightContain} will
   * return {@code true} for an element not actually
   * contained in this set.
   */
  public double getFalsePositiveProbability() { ... }

  /**
   * Adds {@code newElement} to this Bloom filter, so that
   * {@code mightContain(newElement)} is now guaranteed to
   * return {@code true}.
   *
   * @return true if the Bloom filter changed as a result of
   * this call
   */
  public boolean add(@Nullable E newElement) { ... }

  // self-explanatory methods:
  public boolean addAll(Iterable<? extends E> elements) { ... }
  public boolean isEmpty() { ... }
  public void clear() { ... }
}

@gissuebot
Copy link
Author

Original comment posted by kevinb9n on 2007-10-30 at 11:06 PM


Hypothetical use cases:

  1. Wikipedia uses an example of a spell-checker; if you put legal words in a Bloom
    filter, you will be sure not to call something an error that isn't, but you could
    miss flagging certain misspellings on occasion.
  2. A "screen" so that you can optimize away an expensive rpc/db query in case the
    Bloom filter gives you a definitive "no".
  3. Suppose you want to display a special message to customers that are in your
    special set of "most valued". But you don't mind if that message also gets shown to
    some other folks as well. You could put all your valued customers into a
    BloomFilter<Customer>, and whenever this filter "mightContain" the logged-in
    customer, show the message.

There may be practical reasons why a Bloom filter is or isn't a good choice in any of
these situations, but hopefully these are useful as illustrations.

Note: a Bloom filter requires good hashing to work properly, not just a crummy 32-bit
hash code from Object.hashCode(). So this enhancement request depends on us
providing some better support for hashing.

@gissuebot
Copy link
Author

Original comment posted by kevinb9n on 2007-11-03 at 05:33 PM


(No comment entered for this change.)


Labels: Post-1.0

@gissuebot
Copy link
Author

Original comment posted by kevinb9n on 2008-06-02 at 05:48 PM


(No comment entered for this change.)


Labels: -Post-1.0

@gissuebot
Copy link
Author

Original comment posted by kevinb9n on 2009-09-17 at 06:02 PM


(No comment entered for this change.)


Labels: Milestone-Post1.0

@gissuebot
Copy link
Author

Original comment posted by tomgibara on 2010-03-17 at 10:34 AM


I've just taken a look at the interface and thought I'd respond with some initial
thoughts.

There is another method of the Collection interface that could be usefully adopted:

boolean mightContainAll(Collection<?> c) //equivalent of containsAll()

Implementing size() is a possibility, by counting the number of adds that mutate the
filter, but probably not worthwhile; it should be possible (better actually) to
compute the falsePositiveProbability without this count and to use the number of bits
set within the filter instead.

Extending the Collection interface directly is of little benefit because the
iterator() method cannot be supported; such a Collection implementation would
interact very poorly with any of the standard collection classes. Where it would
provide a small benefit is that the BloomFilter interface could avoid overloading
'batch' operations with versions that take another bloom filter:

boolean addAll(BloomFilter<E> filter)

boolean mightContainAll(BloomFilter<E> filter) //containsAll would be a more accurate
in this case

These last two operations are clearly only defined over compatible filters, but are
important because they are very efficient. But the conditions under which two
BloomFilters are compatible is not a straightforward issue since the filters must
match in capacity (number of bits) and in their hashing. Accessors over these would
enable two different BloomFilter implementations to determine if they are compatible
to each other. For example:

int getCapacity()
Hash getHash() //serializablity requires hash generators be serializable also

And this is not sufficient because access to the actual bitset would also be
necessary, it might be necessary to replace the capacity accessor with a method that
exposes the filter's state through a specific bitset object:

Bitset getBits()

Whether this Bitset would be an interface or concrete I find difficult to determine.
What are the benefits of all this?

 * The BloomFilter 'union' operation (addAll) and 'subset' test (containsAll) become
possible.
 * Exposure of the filter bits allows an application to persist the filter outside of
standard java serialization.
 * Equality of BloomFilters becomes defined (as equality of bitset and hash).
 * BloomFilter.hashCode() can be specified too.

Whether this complexity is worthwhile is an open question, but my view is that, if
you are looking to introduce BloomFilter as an interface rather than a concrete
class, it's necessary to consider the conditions under which the instances of two
different BloomFilter implementations are equal (as the java.util package attempts to
do for collections). I also know that the first two capabilities in this list have
been vital in one of my applications.

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2010-07-30 at 03:53 AM


(No comment entered for this change.)


Labels: -Milestone-Post1.0

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2010-07-30 at 03:56 AM


(No comment entered for this change.)


Labels: -Priority-Medium

@gissuebot
Copy link
Author

Original comment posted by reachbach on 2010-08-11 at 11:50 AM


Support for set operations (determining the union, intersection or difference without exposing bitsets) over bloom filters would be useful.

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-01-27 at 01:54 PM


(No comment entered for this change.)


Owner: kev...@google.com

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-01-27 at 01:59 PM


(No comment entered for this change.)


Owner: ---

@gissuebot
Copy link
Author

Original comment posted by jon.h.clark on 2011-02-07 at 05:31 PM


A very useful data structure indeed.

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-07-13 at 06:18 PM


(No comment entered for this change.)


Status: Triaged

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-07-16 at 08:37 PM


(No comment entered for this change.)


Status: Accepted

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-07-19 at 02:24 PM


There has been a lot of progress on this and we may see it in release 12, to put a guess out there.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-12-06 at 04:09 PM


Any chance we could get this into Git, if not exposed, so I can play with it?

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-12-06 at 05:10 PM


Since recently it became more or less feature complete (i.e. serialization), I wouldn't mind exposing it as-is in @Beta. Kevin?

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-12-07 at 02:01 AM


Seems like there is no roadblock to making this public (@Beta of course), along with the new hashing abstractions. Just a matter of someone actually doing it

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-12-07 at 02:37 AM


I get the impression folks are really busy, though, so I'd settle for open-sourcing it without exposing the API yet.

@gissuebot
Copy link
Author

Original comment posted by fry@google.com on 2011-12-10 at 03:12 PM


(No comment entered for this change.)


Labels: Package-Collect

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-12-31 at 05:17 PM


(No comment entered for this change.)


Status: Fixed
Labels: -Package-Collect, Package-Hash

@gissuebot
Copy link
Author

Original comment posted by vayasya on 2012-04-07 at 12:05 AM


Didn't want to create a separate ticket for this question.

'put' has Key-Value association semantics by convention and 'add' has Set-element-addition semantics, (both in Java at least).

I am wondering why the BloomFilter implementation uses 'put' instead of 'add' even though the original API draft at the top of this page says 'add' a possible 'addAll'.

@gissuebot
Copy link
Author

Original comment posted by seharris on 2012-06-14 at 02:37 PM


The proposed addAll(BloomFilter<T>) method (alternately called "union") is essential for my needs, which involve populating several compatible filters on separate computers and then combining the (deserialized) filters back into one. At present, there's no way build up one filter from one or more other filters (despite the copy() method being provided).

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-06-14 at 02:39 PM


It's not clear to me that such a method could be defined, depending on the definition of "compatible" filters.

@gissuebot
Copy link
Author

Original comment posted by seharris on 2012-06-14 at 04:53 PM


Comment #5 above touches on this challenge of deducing that two filters are compatible. I suspect that you won't buy this stance, but I'd be happy enough with trusting the caller. If the caller doesn't understand well enough the invalid consequences of pouring one bag of bits into another, there's not much we can do for him. I'd rather see it become /possible/ to perform this operation than continue to ban it, just because it's vulnerable to misuse.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-06-14 at 04:58 PM


Hmmm. How would you feel about throwing an exception if the target filter wasn't initialized with the same parameters?

@gissuebot
Copy link
Author

Original comment posted by seharris on 2012-06-14 at 05:44 PM


Throwing an exception (IllegalArgumentException) is a reasonable response to trying to merge in a filter constructed with different parameters. The addAll (or merge() or union()) method would then involve first verifying equivalent configuration of the incoming filter and, upon confirming compatibility, ORin in its underlying bit vector, right?

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-06-14 at 05:46 PM


Yep, that's what I'd expect.

@gissuebot gissuebot added status=fixed migrated type=enhancement Make an existing feature better labels Oct 31, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant