
Performance of ImmutableSet.contains #1155

Closed
gissuebot opened this issue Oct 31, 2014 · 13 comments

@gissuebot

Original issue created by Maaartinus on 2012-09-27 at 10:48 AM


If an ImmutableSet is filled with objects whose hashCodes fill a contiguous range, the performance of an unsuccessful lookup degenerates badly (milliseconds for a single lookup). While every hash-based algorithm degenerates for some input, this worst case is triggered far too easily.

http://stackoverflow.com/questions/12573463/performance-of-guavas-immutableset-contains

The problem is caused by (1) a large contiguous filled area of the table and (2) linear reprobing. Any lookup starting near the beginning of such a contiguous run takes a very long time, as it has to traverse to its end.
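For reference, a minimal reproduction along the lines of the linked question might look like this (the set size and timing loop are illustrative, not taken from the report):

import com.google.common.collect.ImmutableSet;

public class ContainsRepro {
    public static void main(String[] args) {
        // Consecutive Integers have consecutive hashCodes (Integer.hashCode()
        // returns the value itself), so they fill a contiguous run of slots.
        ImmutableSet.Builder<Integer> builder = ImmutableSet.builder();
        int n = 1 << 20;
        for (int i = 0; i < n; i++) {
            builder.add(i);
        }
        ImmutableSet<Integer> set = builder.build();

        // Unsuccessful lookups have to walk to the end of the filled run.
        long start = System.nanoTime();
        for (int i = n; i < n + 1000; i++) {
            set.contains(i);
        }
        System.out.println("avg ns per miss: " + (System.nanoTime() - start) / 1000);
    }
}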

Preventing a contiguous filled area

The hash code gets processed in Hashing.smear by a function that propagates bits to the right only. This helps to prevent collisions when the higher bits get masked away.

But there are no collisions here. What happens in the extreme case (a set containing N consecutive integers, where N is a power of two) is that all entries land in the same half of the table, resulting in a run of N slots without any hole to terminate the search.

With a smearing which propagates bits also to the left this could be avoided. A modification like

hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12); // original smear: spreads high bits downward
hashCode ^= (hashCode >>> 7) ^ (hashCode >>> 4);
hashCode *= MULTIPLIER;                            // new: also propagates bits to the left
return hashCode;

helps here a lot. Funnily enough, MULTIPLIER=2 seems to work perfectly, as it makes every other slot empty. Obviously, an even multiplier is wrong in general, as it increases the number of collisions in other cases.

For any given set size some "optimal" MULTIPLIER can be found, e.g. 0x80001 works nicely for set size = 500000, but not for the other two. Randomly chosen odd values work reasonably well on average. Instead of the multiplication, some XORs and/or additions with left shifts could be used, but the multiplication seems to be faster.

Obviously, the number of collisions for the expression hashCode & (table.length-1) is not changed by the final multiplication, as multiplication by an odd number is a bijective operation, even when restricted to some number of least significant bits. So it doesn't undo the improvement achieved by the original smear.
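A quick way to convince yourself of the bijectivity, restricted to 16 low bits (illustrative only):

import java.util.BitSet;

public class OddMultiplierCheck {
    public static void main(String[] args) {
        int bits = 16;
        int mask = (1 << bits) - 1;
        int multiplier = 0x80001; // any odd multiplier works
        BitSet seen = new BitSet(1 << bits);
        for (int x = 0; x <= mask; x++) {
            // int multiplication wraps mod 2^32, so masking afterwards
            // gives exactly multiplication mod 2^16
            seen.set((x * multiplier) & mask);
        }
        // A bijection hits every residue exactly once.
        System.out.println(seen.cardinality()); // prints 65536
    }
}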

Better reprobing

IMHO, linear reprobing will always amplify any problems caused by hash collisions or a bad distribution. I'd suggest combining it with double hashing: after some number of failed attempts to find a free slot, jump to some other part of the table. This has absolutely no impact when everything works fine and can be a huge win when there are problems. It's quite simple (maybe 50 lines).

@gissuebot

Original comment posted by lowasser@google.com on 2012-09-27 at 10:58 PM


This is legit, and while I know a lot of work went into ImmutableSet, I do think this demonstrates that it's still too easy to trigger the worst case of linear probing.

I'm currently experimenting with the following smearing function:

hashCode ^= Integer.rotateRight(hashCode, 20) ^ Integer.rotateRight(hashCode, 12);
hashCode ^= Integer.rotateRight(hashCode, 7) ^ Integer.rotateRight(hashCode, 4);
hashCode *= 0x5BD1E995;
return hashCode;

(The multiplier constant is borrowed from murmur hash, which I suspected would probably yield good results.)

I'm experimenting with quadratic probing at the moment with this smear function, just because I want to avoid additional branches, and quadratic probing retains pretty good locality.
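For concreteness, quadratic probing with triangular increments looks roughly like this -- a sketch only, assuming a power-of-two table where null marks an empty slot (and a load factor below 1, so the loop terminates):

static boolean contains(Object[] table, int mask, Object target, int smearedHash) {
    // Offsets are the triangular numbers 0, 1, 3, 6, 10, ...; for a
    // power-of-two table size this sequence visits every slot exactly
    // once per cycle, and the first few probes stay local.
    for (int i = smearedHash, step = 1; ; i += step, step++) {
        Object candidate = table[i & mask];
        if (candidate == null) {
            return false; // hit a hole: the target is absent
        }
        if (candidate.equals(target)) {
            return true;
        }
    }
}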


Status: Accepted
Owner: wasserman.louis
Labels: Type-Performance, Package-Collect

@gissuebot

Original comment posted by kevinb@google.com on 2012-09-27 at 11:30 PM


What's the latest rehash function being used in ConcurrentHashMapV8.java and so forth? Doug changes it often and we might improve just by trying whatever he's doing nowadays.

I also think double hashing sounds very worth trying -- all those bits sitting there in the upper reaches of the "smeared" hash code, unused. (Maybe double hashing isn't the right term here, not sure, perhaps this is cuckoo hashing with N=2.)

@gissuebot

Original comment posted by lowasser@google.com on 2012-09-27 at 11:42 PM


From http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/ConcurrentHashMapV8.java?revision=1.62&view=markup :

/**
 * Spreads higher bits to lower, and also forces top 2 bits to 0.
 * Because the table uses power-of-two masking, sets of hashes
 * that vary only in bits above the current mask will always
 * collide. (Among known examples are sets of Float keys holding
 * consecutive whole numbers in small tables.)  To counter this,
 * we apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed across bits (so don't benefit
 * from spreading), and because we use trees to handle large sets
 * of collisions in bins, we don't need excessively high quality.
 */
private static final int spread(int h) {
    h ^= (h >>> 18) ^ (h >>> 12);
    // HASH_BITS masks off the top bits, which are reserved for special node hashes
    return (h ^ (h >>> 10)) & HASH_BITS;
}

But honestly, part of the issue is that linear or quadratic open addressing is vulnerable not just to exact hash collisions, but to hashes that differ only in relatively low bits. The smearing methods don't do anything at all to help there, and that's the issue that needs addressing.

(I'm pretty strongly inclined to keep linear or quadratic probing for the first couple of probes at least, just to take advantage of memory locality.)

The experimental smearing method I proposed above is doing pretty well in benchmarks at the moment, which is nice -- and it's doing well with both linear and quadratic probing. Will keep y'all updated.

@gissuebot

Original comment posted by Maaartinus on 2012-09-29 at 07:24 AM


@Louis: Note that unlike the right shifts, the rotations give you no guarantee that "hashCodes that differ only by constant multiples at each bit position have a bounded number of collisions". Or maybe they do, but I can't see how.

and quadratic probing retains pretty good locality.

I don't think so. It has good locality in the first few steps (which is good, as you quite often stop there), but on average it's quite bad. Assuming the probe sequence
0, 1, 3, 6, 10, 15
the distances are
1, 2, 3, 4, ...
which rarely hits the same cache line on the next step.

The classical double hashing has distances
N, N, N, ...
(where N is derived from the hash code and must be odd for power-of-two table sizes)
which for most values of N never hits the same cache line on the next step.

What I'd propose is using distances
1, 1, 1, ..., 1, N, 1, 1, 1, ..., 1, N, ....
which most of the time hits the same cache line on the next step. Moreover, in the beginning it's as efficient as the linear reprobing.

In order for the probe sequence to cover the whole table it's sufficient that

  • the block length K is a power of two
  • (K-1) + N is an odd multiple of K

Explanation: A block consists of K-1 ones and a single N, so the distance travelled after one block is (K-1)*1 + 1*N. Moving by an odd multiple of the block length guarantees that all blocks get visited (as an odd number is co-prime to the number of blocks, which is a power of two).
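A sketch of this probe sequence, with hypothetical parameters (K = 8, and N derived from the otherwise-unused high bits of the smeared hash):

static boolean contains(Object[] table, int mask, Object target, int smearedHash) {
    final int K = 8; // block length, a power of two
    // Choose N so that (K-1) + N is an odd multiple of K:
    // N = odd*K - (K-1), with the odd factor taken from the high bits.
    int n = ((smearedHash >>> 16) | 1) * K - (K - 1);
    int i = smearedHash;
    for (int probes = 1; ; probes++) {
        Object candidate = table[i & mask];
        if (candidate == null) {
            return false; // a hole terminates the unsuccessful search
        }
        if (candidate.equals(target)) {
            return true;
        }
        // K-1 linear steps, then one jump of N.
        i += (probes % K == 0) ? n : 1;
    }
}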

@Kevin:

I also think double hashing sounds very worth trying -- all those bits sitting there in the upper reaches of the "smeared" hash code, unused.
(Maybe double hashing isn't the right term here, not sure, perhaps this is cuckoo hashing with N=2.)

Agreed, while most hashing algorithms use only log2(table.length) bits, both double and cuckoo hashing can use nearly twice as many.
This means that for tables bigger than 2**16, all bits get used -- somehow.

I think you meant double hashing, as cuckoo hashing works quite differently.

Actually, I've just got a different idea, very simple and maybe optimal for the issue I described. I'll try it out and report later.

@gissuebot

Original comment posted by Maaartinus on 2012-10-02 at 01:41 PM


Here is the idea I announced: during construction, count the maximum number of probes needed and store it in a field maxProbes. As there are no collisions in the case I reported, maxProbes equals 1, and thus RegularImmutableSet.contains can return false after having inspected a single slot, so it works perfectly.
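A sketch of what I mean (maxProbes, table, mask and smear stand in for the corresponding RegularImmutableSet internals):

// During construction, remember the longest probe sequence any element needed:
//   int probes = 1;
//   while (table[index] != null) { index = (index + 1) & mask; probes++; }
//   maxProbes = Math.max(maxProbes, probes);

// Then contains() can give up after maxProbes slots: if the target were
// present, it would have been found within its own probe count, which is
// at most maxProbes (the table is immutable, so probe sequences don't change).
boolean contains(Object target) {
    int i = smear(target.hashCode());
    for (int remaining = maxProbes; remaining > 0; remaining--) {
        Object candidate = table[i & mask];
        if (candidate == null) {
            return false;
        }
        if (candidate.equals(target)) {
            return true;
        }
        i++;
    }
    return false;
}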

However, there's a similar problem involving a lot of collisions, and then it can't help at all (see the attachment). The benchmark is here:

https://dl.dropbox.com/u/4971686/published/maaartin/guava/ImmutableSetBenchmark1.java

Note that with collide==true there are exactly 2 items per slot, but maxProbes gets big due to the linear reprobing. You may want to make absent shorter in case it aborts due to the Caliper time limit.

I think I've got a nice solution for all these problems (soon to come).

@gissuebot

Original comment posted by lowasser@google.com on 2012-10-02 at 11:35 PM


Here is the idea I announced: during construction, count the maximum number of probes needed and store it in a field maxProbes. As there are no collisions in the case I reported, maxProbes equals 1, and thus RegularImmutableSet.contains can return false after having inspected a single slot, so it works perfectly.

I'm not seeing how this would be doable without significantly increased complication in the construction method -- we don't even remember hash codes at all, at the moment.

Note that unlike the right shifts, the rotations give you no guarantee that "hashCodes that differ only by constant multiples at each bit position have a bounded number of collisions". Or maybe they do, but I can't see how.

Hrrrrm. I know you're quoting the HashMap source, but I don't really understand what "differ only by constant multiples at each bit position" could possibly mean. Do the bits differ by only a constant multiple? Of what?

I'm going to poke around Google for advice, maybe.

@gissuebot

Original comment posted by lowasser@google.com on 2012-10-02 at 11:49 PM


I observe that the JDK's IdentityHashMap uses open addressing, and its "smearing" function looks like

    int h = System.identityHashCode(x);
    // Multiply by -127, and left-shift to use least bit as part of hash
    return ((h << 1) - (h << 8)) & (length - 1);

Worth exploring, perhaps?

@gissuebot

Original comment posted by Maaartinus on 2012-10-03 at 01:08 PM


I'm not seeing how this would be doable without significantly increased complication in the construction method...

It's trivial and very cheap, just counting the iterations in the inner loop, see the patch.

But in general, there's a tradeoff with unknown parameters: storing the hashes would speed up reprobing, and so could spending more time in smear and/or in the construction. Is there any chance to find out, e.g., how many contains calls per ImmutableSet there are on average in your real-world programs?

Note that unlike the right shifts, the rotations...

Hrrrrm. I know you're quoting the HashMap source, but I don't really understand what "differ only by constant multiples at each bit position" could possibly mean. Do the bits differ by only a constant multiple? Of what?

I can only guess what they mean, too. But whatever property it is, it can't be destroyed by a final multiplication, as there's no additional flow from the higher bits to the lower ones. An initial multiplication (i.e. before the right shifts) is a different case, as the two together create such a flow. It's hard to explain what I mean, as it's based on my fuzzy understanding of their stated goal.

I'm going to poke around Google for advice, maybe.

You mean in person? I couldn't find anything by googling.

I observe that the JDK's IdentityHashMap... Worth exploring, perhaps?

I don't think so. The multiplication puts nearby hashes far apart, which is a good thing because of the linear reprobing, but that's about all. The hash distribution is different, and the cost of reprobing is much lower.

@gissuebot

Original comment posted by wasserman.louis on 2012-10-03 at 08:49 PM


I'm going to poke around Google for advice, maybe.

Perhaps a more accurate statement would be

I'm going to poke around Googlers for advice, maybe.

In any event, I'm getting the impression that we agree that an initial multiplication strictly helps -- it's still possible to cause the worst case, of course, but it requires being more deliberate about it. Should I go ahead and push that change while we continue the discussion on the other changes?

@gissuebot

Original comment posted by Maaartinus on 2012-10-04 at 09:57 AM


In any event, I'm getting the impression that we agree that an initial multiplication strictly helps...

Yes. Currently, I know of no better way to find the cases where the multiplication is detrimental than brute-forcing it.

It's very cheap on modern CPUs:

   StandardSmearer 1.71
MultiplyingSmearer 1.97

As it is very cheap and improves the distribution, it might speed up the search even in commonly occurring cases.

Should I go ahead and push that change while we continue the discussion on the other changes?

It depends on the probability that someone gets hit by this problem in the meantime. That's something I can't tell.

@gissuebot

Original comment posted by Maaartinus on 2012-10-11 at 09:55 AM


This small program illustrates how I understand what "differ only by constant multiples at each bit position" means:
https://dl.dropbox.com/u/4971686/published/maaartin/guava/SmearerDemo.java

According to it:

  • final multiplication is good
  • multiplication before the shifts is bad
  • replacing the second two shifts by rotations is good
  • using rotations only is bad
  • the simplified shifting from ConcurrentHashMapV8 is bad

But the program only counts collisions assuming closed hashing. Secondary collisions due to rehashing are a very different beast.
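For clarity, the compared variants as I'd write them (reconstructed here, not copied from SmearerDemo.java; M is some odd multiplier such as 0x5BD1E995):

static final int M = 0x5BD1E995;

static int shiftsThenMultiply(int h) {   // good
    h ^= (h >>> 20) ^ (h >>> 12);
    h ^= (h >>> 7) ^ (h >>> 4);
    return h * M;
}

static int multiplyThenShifts(int h) {   // bad
    h *= M;
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int shiftsThenRotations(int h) {  // good
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ Integer.rotateRight(h, 7) ^ Integer.rotateRight(h, 4);
}

static int rotationsOnly(int h) {        // bad
    h ^= Integer.rotateRight(h, 20) ^ Integer.rotateRight(h, 12);
    return h ^ Integer.rotateRight(h, 7) ^ Integer.rotateRight(h, 4);
}

static int chmV8Spread(int h) {          // bad
    h ^= (h >>> 18) ^ (h >>> 12);
    return h ^ (h >>> 10);
}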

@gissuebot

Original comment posted by lowasser@google.com on 2012-10-30 at 07:20 PM


I'm putting this down as fixed by 69ad96b.

Even if the new smearing function can be defeated, it's at least much more difficult to make it blow up accidentally. (Incidentally, it also beats the old Hashing.smear in benchmarks.)
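For the record, the new smear is a single round of Murmur3-style mixing, roughly:

static int smear(int hashCode) {
    // Murmur3 C1/C2 constants; sketch -- see the commit for the exact code.
    return 0x1b873593 * Integer.rotateLeft(hashCode * 0xcc9e2d51, 15);
}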


Status: Fixed

@gissuebot

Original comment posted by lowasser@google.com on 2012-10-30 at 07:21 PM


(No comment entered for this change.)


Labels: Milestone-Release14
