Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-blocking ConcurrentBloomFilter? #1090

Closed
gissuebot opened this issue Oct 31, 2014 · 11 comments
Closed

non-blocking ConcurrentBloomFilter? #1090

gissuebot opened this issue Oct 31, 2014 · 11 comments
Labels

Comments

@gissuebot
Copy link

Original issue created by ishaaq on 2012-07-31 at 02:57 AM


Is it possible to create a Concurrent version of BloomFilter?

Semantics I am looking for:

  1. Ability for multiple threads to call put() simultaneously without blocking
  2. A put(x) will ensure that all mightContain(x) calls that occur after it will return true
@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-07-31 at 10:45 AM


I was thinking that an AtomicLongArray-based variant should work pretty smoothly here, but I'm not positive I've thought through all the issues. (Also, this seems like another instance in which a ProbabilisticSet interface might be appropriate.)

@gissuebot
Copy link
Author

Original comment posted by kurt.kluever on 2012-08-07 at 05:32 PM


Can you describe your use-case a bit more? This sounds like a reasonable request.


Status: Research
Owner: andreou@google.com
Labels: Package-Hash, Type-Enhancement
CC: kak@google.com

@gissuebot
Copy link
Author

Original comment posted by ishaaq on 2012-08-08 at 06:56 AM


Hmm, not sure how much more to elaborate than what I initially suggested.

Point 2 in my requirements above is no different from what is expected from a bloom filter anyway, so nothing new there.

Point 1 is basically allowing multiple threads to do simultaneous non-blocking put calls - ideally using atomic CAS (Compare And Swap) operations and not resorting to lock synchronisation.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-08-09 at 04:31 PM


By "describe your use case," we mean "tell us why you need a concurrent Bloom filter in the first place."

@gissuebot
Copy link
Author

Original comment posted by ishaaq on 2012-08-10 at 09:34 AM


Right, sorry about the confusion.

I have a resource that multiple threads simultaneously read from and write records to. Writes are inexpensive but reads are not.

The most expensive use-case for this resource is for a reader to ask for a record that does not exist. I can reduce this load significantly by protecting the resource with a bloom filter, i.e. the readers only go on to retrieve the record if the bloom filter indicates that the record probably exists. Obviously, for this to work, the writers need to write to the bloom filter before writing to the resource.

Implementing synchronized locking around the bloom filter would lead to to excessive contention, a CAS-based solution would be perfect - allowing both readers and writers to access the bloom-filter simultaneously.

@gissuebot
Copy link
Author

gissuebot commented Nov 1, 2014

Original comment posted by nto...@maginatics.com on 2012-08-10 at 04:37 PM


FWIW, we also have a use case that is similar to the one described above. We have a number of threads that insert new data into an underlying database but work very hard to prevent duplicate entries. Currently, this involves an expensive round-trip to the database to lookup if entries exist but we would like to abort early in case we know the entry is not present. Of course, all worker threads would want to update the bloom filter once they have insert a new entry.

As a side note, in this proposed use of a ConcurrentBloomFilter, it could be possible for a duplicate to occur but, as long as it is not common, a few duplicates would be fine and would be handled by other means in our system.

@gissuebot
Copy link
Author

Original comment posted by andreou@google.com on 2012-08-10 at 05:44 PM


I'd suggest to try synchronization first, and see what kind of contention you get.

Worst case, you can stripe the bloomfilter and reduce contention accordingly: List<BloomFilter<E>>, and do:

BloomFilter<E> bf = list.get(IntMath.mod(e.getHashCode(), bfs.size());
synchronized (bf) {
  bf.put(e);
}

This could also be modeled as a Striped<BloomFilter<E>>, if we ever opened Striped up, accepting an arbitrary Supplier<L>, instead of just the three types it supports now.

(I say this because implementing special concurrent versions of each structure as needed is too much work!)

@gissuebot
Copy link
Author

gissuebot commented Nov 1, 2014

Original comment posted by mjohns...@capsaicin.ca on 2013-01-10 at 02:29 PM


I'm also interested in a version supporting concurrent puts. My use case is one where I have a bunch of threads processing inputs and updating a single bloom filter. Alternatively if multiple blooms could be merged, that would allow me to update ThreadLocal blooms and merge each of them in a batch.

My use case does not require feedback as to whether or not the put modified the filter, in case that simplifies things.

@gissuebot
Copy link
Author

Original comment posted by argaul on 2013-01-10 at 05:10 PM


Issue #1134 tracks merging Bloom filters of the same size:

#1134

@MGunlogson
Copy link

My use-case would also benefit from thread safety on Bloom filters. I'm rate limiting connections and the underlying framework (Netty) is multithreaded. Merging filters doesn't help in this case because the threads need to see the same data. Having stale data violates the "no false negatives" aspect of Bloom filters and makes Guava's Bloom unsuitable for my application

I could use synchronization, but having profiled Bloom it is apparent that most of the processing on my system is generating the hash. The way Bloom works now- hashing is done on a single core. I find this silly since the Guava's hash functions are specifically designed to be stateless and thread-safe.

Since ~90% of CPU time is tied up in hash calculation, Bloom should scale nicely using striped locks. I'm fairly familiar with the Bloom code and can submit a patch for thread safety if desired.

@lowasser
Copy link
Contributor

Dupe of #2761 that we've now done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants