performance problem in HashBasedTable #1085
Comments
Original comment posted by kevinb@google.com on 2012-07-25 at 11:19 PM

(No comment entered for this change.)

Owner: wasserman.louis
Original comment posted by wasserman.louis on 2012-07-26 at 09:35 AM

I honestly think this goes too far, and I think the original behavior is appropriate. As a rule, I only expect containsAll to be linear time on a collection that is based on a hash table containing the elements to be checked for containment -- e.g. HashSet or HashMultiset, not ArrayList or Map.values(). The JDK set that precedent, and I feel totally comfortable following it.

I also strongly suspect there exist classes used only as the values in a map or table that don't have a properly implemented hashCode() because they're never expected to be used as keys, which this change would break.

In any event, if any users really desire the linear-time behavior that constructs a HashSet, they can freely substitute the one-liner:

Sets.newHashSet(table.values()).containsAll(collection);
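The trade-off Louis describes can be sketched with plain JDK collections (assumption: a HashMap stands in for the Guava table here, since its values() view has the same scan-per-contains behavior):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Sketch of the one-liner workaround, using a plain JDK Map in place
// of a Guava Table; the values() view behaves the same way.
public class ContainsAllWorkaround {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);

        List<Integer> toContain = Arrays.asList(1, 3);

        // Quadratic: each contains() on the values() view is a linear scan.
        boolean slow = map.values().containsAll(toContain);

        // Linear: one pass to build the HashSet, then O(1) lookups.
        boolean fast = new HashSet<>(map.values()).containsAll(toContain);

        System.out.println(slow + " " + fast); // true true
    }
}
```

The HashSet copy trades one extra allocation and a full pass over the values for constant-time lookups afterwards, which is exactly the cost Kevin weighs later in the thread.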
Original comment posted by adi175652q on 2012-07-26 at 03:13 PM

Hi Louis,

There are two options here. One, we can have a method that blocks …

If we can make a slow method fast, why not do it? Especially when the …

Do you think a programmer thinks of the complexity of the method …

Best,
Adrian
Original comment posted by cpovirk@google.com on 2012-07-26 at 09:21 PM

We have a long-standing internal bug to "create a table of our collection implementations and their operations showing the big-O performance of each." That might help here (though the JDK precedent is probably still more important), but it doesn't answer the question "Why not make it faster?"

Well, there's always the potential for problems in weird edge cases. If anyone implements hashCode() wrong, as Louis points out, he would have problems. He is arguably getting what he deserves, but if there's a chance that we're going to break his app, we'd better be able to demonstrate a clear improvement over what users can do on their own. Most affected users, as you point out, can call "newHashSet" themselves, and their problem is gone. If the performance impact is severe, they'll likely identify this problem and fix it quickly. For some other users, the performance impact will be negligible. For still others, the HashSet version may even perform worse. (For example, my understanding is that object allocation on Android is still expensive in comparison to server JVMs.)

Should programmers have to worry about time complexity? When we can hide it from them, we should, but I don't think this is one of those cases.
Original comment posted by em...@soldal.org on 2012-07-27 at 06:21 AM

Random thought, but should we try to document certain methods and classes with their big-O performance? Like /** … */

At least in a few places this could work as a gentle warning to the developer.
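The convention being suggested might look something like the following hypothetical Javadoc (the interface name, wording, and O(n * m) figure are illustrative, not an actual Guava doc comment):

```java
import java.util.Collection;

// Hypothetical example of the suggested Javadoc convention; the
// complexity figures and wording are illustrative, not real Guava docs.
interface ValuesView<E> {
    /**
     * Returns {@code true} if this collection contains every element of
     * {@code c}.
     *
     * <p><b>Performance note:</b> this view does not hash its elements,
     * so this method runs in O(n * m) time, where n is the size of this
     * view and m is the size of {@code c}. If {@code c} is large,
     * consider copying the view into a {@code HashSet} first.
     */
    boolean containsAll(Collection<?> c);
}
```

A note like this would address Louis's concern below: for views, the complexity depends on the backing collection, so the documentation has to describe it per-view rather than once per class.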
Original comment posted by wasserman.louis on 2012-07-27 at 03:30 PM

For "views" like the values() collections, it's much less obvious how to do this cleanly.
Original comment posted by adi175652q on 2012-07-27 at 07:11 PM

Hi,

I agree with most discussion in Comment 5 and I would like to add two …

First, I agree it is easy to find performance problems after they …

Second, we are missing many serious performance improvements because …

Personally, I feel Guava is better than JDK because Guava evolves …

Best,
Adrian
Original comment posted by kevinb@google.com on 2012-07-27 at 07:24 PM

I'm not fussed about what happens to people who implement hashCode wrong, or whose hashCode changes while in the collection. I'd say let's set that argument aside.

So we have a quadratic operation that can be made linear by allocating a bunch of temporary objects, is that right? It's a huge win for your case, but it sounds like a loss for most usages. It seems better to me to document the complexity of our stuff clearly, and let users do the newHashSet themselves when they need to.
Original comment posted by wasserman.louis on 2012-07-27 at 08:34 PM

I guess I feel that it's just common sense that containsAll() should take time proportional to the cost of a single contains(), times the size of the target collection. This also feels like an extreme edge case that people who care about asymptotics already expect to be a quadratic-time operation. I certainly can't remember the last time, in any context, when I called values().containsAll(). Did the OP encounter an actual, real-world case where this caused a performance bug?

I have to admit I don't consider this a "major performance improvement" at all, compared to e.g. a significant constant-factor improvement to add, remove, or contains in a "named" collection like HashMultiset.
Original comment posted by adi175652q on 2012-07-31 at 02:23 PM

Hi Kevin,

I agree with Comment 9 that we should also consider other cases. The …

I wrote a test (attached TestWorstScenariosForProposedPatch.java) to …

These are the results on my machine (ORIGINAL is the existing code, …):

MILLISECONDS when table.values().containsAll(collection):

                                                         ORIGINAL  PROPOSED  CALLER_RESPONSIBILITY
...returns FALSE for the 0th element of collection:         1.534     4.366     4.364
...returns TRUE for the 60th element of table.values():     0.260     0.074     4.276
...returns TRUE for the 0th element of table.values():      0.014     0.070     4.276

The above quantitative data shows the following:
(1) the proposed patch is faster than the existing containsAll() in …
(1.1) when containsAll() returns false, the proposed patch is …
(1.2) when containsAll() returns true, the absolute times are …
(2) the proposed patch is significantly faster than the one-line …

I am running JDK 1.6.0_24 on Ubuntu 11.10. If possible, I would like …

I am also attaching patchFullImproved.java, which has several changes …

The above numbers are the worst cases for the patch and the best cases …

Best,
Adrian
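The attached benchmark itself is not reproduced in this thread; a minimal stand-in along the same lines might look like the sketch below (assumptions: a plain JDK HashMap instead of HashBasedTable, and single-shot System.nanoTime timing, so the numbers are only indicative, unlike the Caliper version attached later):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Minimal timing harness comparing the inherited quadratic
// containsAll() on a values() view against the HashSet-copy approach.
public class ValuesContainsAllBench {
    public static void main(String[] args) {
        int n = 2000;
        Map<Integer, Integer> map = new HashMap<>();
        List<Integer> toContain = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            map.put(i, i);
            toContain.add(i);
        }

        long t0 = System.nanoTime();
        // Each contains() on the view is a linear scan: O(n * n) total.
        boolean quadratic = map.values().containsAll(toContain);
        long t1 = System.nanoTime();
        // One pass to build the set, then O(1) lookups: O(n) total.
        boolean linear = new HashSet<>(map.values()).containsAll(toContain);
        long t2 = System.nanoTime();

        System.out.printf("quadratic=%b in %d us, linear=%b in %d us%n",
                quadratic, (t1 - t0) / 1000, linear, (t2 - t1) / 1000);
    }
}
```

A single-shot harness like this is vulnerable to JIT warm-up effects, which is presumably why the follow-up comment rewrites the benchmark with Caliper.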
Original comment posted by cpovirk@google.com on 2012-08-02 at 10:54 PM

I still fear for GWT and Android performance, which are less convenient to measure, and I have vague reservations, but that's not much reason to close this.
Original comment posted by adi175652q on 2012-09-25 at 10:23 PM

I am attaching the benchmark in Comment 11 written with Caliper. The …
Original issue created by adi175652q on 2012-07-25 at 10:56 PM
Hi,
I am encountering a performance problem in HashBasedTable. It appears
in Guava 12.0.1 and in the latest revision. I attached a test that
exposes this problem and a patch that fixes it. For this test, the
patch provides a 1397× speedup. Note that this test uses a very
small table (300x300), but for medium size tables (e.g., 1000x1000),
the test does not finish even after 30 minutes (I killed the process),
while the patched version finishes in 368 milliseconds.
To run the test, just do:
$ java Test
The output for the un-patched version is:
Time is 65679
The output for the patched version is:
Time is 47
The attached test shows that, for a "HashBasedTable table" object, the
following operation is very slow:
table.values().containsAll(toContain);
"HashBasedTable.values()" returns a "StandardTable.Values" object,
which inherits containsAll() from "java.util.AbstractCollection",
which has quadratic complexity.
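The inherited behavior can be demonstrated with a small self-contained sketch (hypothetical class names; the ScanOnlyCollection below mimics StandardTable.Values by extending AbstractCollection and overriding nothing but iterator() and size()):

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Why the inherited method is quadratic: AbstractCollection implements
// containsAll() as one contains() call per element, and contains() is
// itself a linear scan over iterator() -- as in StandardTable.Values.
public class QuadraticDemo {
    static class ScanOnlyCollection<E> extends AbstractCollection<E> {
        private final List<E> backing;
        ScanOnlyCollection(List<E> backing) { this.backing = backing; }
        @Override public Iterator<E> iterator() { return backing.iterator(); }
        @Override public int size() { return backing.size(); }
        // contains() and containsAll() are inherited, so a call to
        // containsAll(c) costs O(size() * c.size()).
    }

    public static void main(String[] args) {
        ScanOnlyCollection<Integer> values =
                new ScanOnlyCollection<>(new ArrayList<>(Arrays.asList(1, 2, 3)));
        System.out.println(values.containsAll(Arrays.asList(2, 3))); // true
    }
}
```

With a 1000×1000 table, the view holds a million values, so the inherited containsAll() performs up to a million scans of a million elements, which matches the 30-minute runtime reported above.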
I attached two patches. Both patches override containsAll() and
implement a linear time algorithm. patchSmall.diff populates a
HashSet eagerly, and patchFull.diff populates a HashSet lazily.
patchFull.diff is faster than patchSmall.diff when containsAll()
returns false after inspecting only a few elements, though in most
cases they are equally fast. I am including patchSmall.diff just in
case you prefer smaller code.
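The actual patches are attachments to this issue and are not shown here; a lazy-population strategy along the lines described for patchFull.diff might look like this sketch (the method and class names are hypothetical):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the lazy strategy: values are copied into a
// HashSet only as far as needed, so a containsAll() that fails on the
// first element of c returns without scanning the whole view.
public class LazyContainsAll {
    static <E> boolean containsAll(Iterable<E> values, Collection<?> c) {
        Set<E> seen = new HashSet<>();
        Iterator<E> it = values.iterator();
        for (Object target : c) {
            // Pull values into the set until the target shows up.
            while (!seen.contains(target)) {
                if (!it.hasNext()) {
                    return false; // exhausted values without finding target
                }
                seen.add(it.next());
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4);
        System.out.println(containsAll(values, Arrays.asList(2, 4))); // true
        System.out.println(containsAll(values, Arrays.asList(5)));    // false
    }
}
```

The eager variant (in the spirit of patchSmall.diff) simply builds the whole HashSet up front; both are O(n + m), but the lazy version wins when the answer is false early.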
Note that this problem also exists for the other "Values" classes
(LocalCache.Values, ArrayTable.Values, ImmutableMultimap.Values,
MapMakerInternalMap.Values, AbstractFilteredMap.Values, Maps.Values,
FilteredMultimap.AsMap.Values, FilteredMultimap.Values,
Multimaps.Values, StandardTable.Column.Values).
Is this truly a performance problem, or am I misunderstanding the
intended behavior? If it is a real problem, can you please confirm
that the patch is correct?
Thanks,
Adrian