Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multisets method request - highest/lowest frequency / element collection sorted by frequency #356

Closed
gissuebot opened this issue Oct 31, 2014 · 38 comments
Labels

Comments

@gissuebot
Copy link

Original issue created by blank101 on 2010-04-29 at 03:28 PM


I've been using Multiset recently for frequency counting of objects, then
various tasks sorting by frequency. Perhaps there's a better way to do
that in general (I'm no expert when it comes to algorithms), but the
Multiset seems a convenient basic data structure to use.

That in mind, it would be useful for the static Multisets had methods for
getting highest/lowest frequency elements, or returning an
ascending/descending collection of the elements (or elements + counts).

An alternative would be some specialized version of TreeMultiset that
sorted by element count first, then individual element comparison. I tried
that briefly, but couldn't seem to get the comparator right.

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2010-04-29 at 11:32 PM


Here are some static methods we can work on releasing:

public static <E> List<Entry<E>> sortedEntries(
    Multiset<E> multiset, Order order) {}

public static <E> List<E> mostFrequent(
    Multiset<E> multiset, int n) {}

public static <E> List<E> leastFrequent(
    Multiset<E> multiset, int n) {}

Sounds like these would help you?

Next, I have often mused about the idea of a "leaderboard" type of Multiset which is
just like any other except always iterates in (increasing/decreasing) count order. I
think I can take this as a request for that?

These two should probably be split.

Thanks for the requests!

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2010-04-29 at 11:58 PM


Yes, I would definitely like to see both those static methods added, and the leaderboard style multiset. I'm not
sure which I'd use for the particular case in front of me, but I could see both cleaning up a lot of the boilerplate I
have.

I suppose I just get excitable as to choice-like to have my options open.

I agree, this request should probably be split into the two different options.

@gissuebot
Copy link
Author

Original comment posted by theycallmeroger on 2010-08-07 at 05:57 PM


Instead of returning List<E> for the most/leastFrequent(n), what about returning List<Multiset.Entry<E>>, as is done for sortedEntries? This would provide both the elements and counts directly, without requiring a subsequent call to count() to retrieve the count. Or is the thinking that most callers would want just the elements, and the counts can easily be retrieved with the additional call to count() ?

@gissuebot
Copy link
Author

Original comment posted by dfrankow on 2010-08-08 at 03:45 PM


I don't understand why the methods are static.

@gissuebot
Copy link
Author

Original comment posted by dfrankow on 2010-08-08 at 03:46 PM


That is, shouldn't they be methods of Multiset?

public <E> List<Entry<E>> sortedEntries(Order order)
public static <E> List<E> mostFrequent(int n)
public static <E> List<E> leastFrequent(int n)

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2010-08-08 at 06:17 PM


The methods as part of the general Multiset utility class would be static obviously. In general, Multisets aren't sorted, so it seems inappropriate to require interface implementers to make them so. And I don't see it as API feature everyone using the class would want (though obviously, I think enough people would want it to warrant inclusion in the utility class).

For a class like what Kevin mentions re a Multiset sorted by element frequency, specific instance methods might be appropriate - but I'd rather they weren't and such a class instead simply followed a NavigableSet style API, where the "comparator" compares by element count.

@gissuebot
Copy link
Author

Original comment posted by rottenwindfish on 2010-09-24 at 07:52 PM


I think this would be very useful. I use the mostcommon() function from this bag class for Python very often: http://code.activestate.com/recipes/259174/

@gissuebot
Copy link
Author

Original comment posted by fry@google.com on 2011-01-28 at 04:06 PM


(No comment entered for this change.)


Status: Accepted
Labels: Type-Enhancement

@gissuebot
Copy link
Author

Original comment posted by hein.meling on 2011-02-23 at 10:53 PM


I'm very very interested in such a LeaderboardMultiset. If such exists in some svn or elsewhere, I'd be very keen to experiment and play with it and provide suggestions...

I've got a custom impl. of something similar to the static methods; it is using a TreeSet to copy/store the entries to provide sorted order... And now I'm struggling with out of memory errors due to the large volume entries I have in my multiset.

Would the leaderboard implementation provide sorted order of counts without keeping two copies around?? I've tried to think how I would implement this, but so far I've only come up with solutions that would involve keeping two separate collections, one mapping to the count, and another keeping the count in sorted order (along with a reference back to the element).

@gissuebot
Copy link
Author

Original comment posted by finnw1 on 2011-02-24 at 08:51 PM


Here's an alternative. If Multiset had a method

Map<E, Integer> asMap();

Then you could use Multimaps.invertFrom() to construct a new map sorted by frequencies in the multiset.

@gissuebot
Copy link
Author

Original comment posted by hein.meling on 2011-02-24 at 10:28 PM


Interesting idea. But I guess, my main issue is that keeping two copies of the same collection data is difficult due to the large number of entries in my multiset, and this approach does not help solve that root problem. (invertFrom() makes a copy of every element in the provided Map).

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-02-25 at 02:56 AM


You effectively want to maintain the elements sorted in two different orders. You will need two containers. (But hey, these are merely just pointer indexes, not that big to copy).

@gissuebot
Copy link
Author

Original comment posted by finnw1 on 2011-02-25 at 01:14 PM


If you only want the top N values you could copy the entries to a limited-capacity MinMaxPriorityQueue instead of a Multimap.

@gissuebot
Copy link
Author

Original comment posted by cgdecker on 2011-04-14 at 01:45 AM


The Multisets.countFunction() from issue #419 would go a long way toward solving this, though the methods seem useful enough to be worth having anyway.

  List<E> topElements = Ordering.natural()
      .onResultOf(Multisets.countFunction(multiset))
      .greatestOf(multiset.elementSet(), k);

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-07-13 at 06:18 PM


(No comment entered for this change.)


Status: Triaged

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-07-16 at 08:37 PM


(No comment entered for this change.)


Status: Accepted

@gissuebot
Copy link
Author

@gissuebot
Copy link
Author

Original comment posted by kevinb@google.com on 2011-10-08 at 03:01 PM


I'm interested in naming suggestions for this method; I find it extraordinarily hard to name well.

@gissuebot
Copy link
Author

Original comment posted by neveue on 2011-10-08 at 04:03 PM


Multisets.copyOrderedByFrequency() (shorter, but users might think it's actually starting with least frequent elements...)
Multisets.copyOrderedByDecreasingFrequency() (longer, but clearer)
Multisets.immutableCopyOrderedByDecreasingFrequency() (too long IMO...)

I also like finn's asMap() idea, which lets you use the Map<E, Integer> to build a Multimap<Integer, E> with the appropriate order.

Another idea would be to expose an "asFrequencyMultimap()" that could be ordered from least frequent to most frequent.

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2011-10-08 at 11:36 PM


"count" seems better than frequency - shorter and already used elsewhere in the API. "rank" might also make sense, but it's perhaps muddled relative to the formal mathematical definition of rank. So, using count:

Multisets.countSortedCopy()

I don't think there's a need for a ...Desc() version, since whatever the result is should be traversable in reverse. That is - seems "straightforward" to return something like a ImmutableTreeMultiset that offers a descending iterator.

That also provides a nice baseline for, say, .countSortedView(), if modifiable view is someday desirable.

@gissuebot
Copy link
Author

Original comment posted by neveue on 2011-10-09 at 05:13 AM


Good point about "count" being shorter than "frequency" and already used elsewhere in the API.

I'd like to change my suggestion to:
Multisets.copyOrderedByCount()
Multisets.copyOrderedByDecreasingCount()

The "decreasing" part is there because, according to the method's javadoc, it returns "an ImmutableMultiset whose iteration order is highest count first". I thought it would be wise to make this explicit in the method name.

Of course, we could get rid of it and provide a method returning a Multiset ordered by ascending count instead. This set could potentially be traversed in reverse... But I think ascending count is the least useful order, and I'm not sure privileging it is the right approach...

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-10-16 at 11:35 PM


I'm still pretty attached to the current name, copyHighestCountFirst. Let me restate the arguments for it here:

Multisets.copyOrderedByFrequency() (shorter, but users might think it's actually starting with least frequent elements...)

copyHighestCountFirst is completely unambiguous, while being actually slightly shorter.

Multisets.copyOrderedByCount()
I think this still has the problem that it's not clear which direction counts are being ordered in.

@gissuebot
Copy link
Author

gissuebot commented Oct 31, 2014

Original comment posted by j...@nwsnet.de on 2011-10-17 at 07:40 AM


copyHighestCountFirst is completely unambiguous, while being actually slightly shorter.
Well, the name only refers to the first element(s). I personally find copyHigh[est]ToLow[est]Counts to be even more clear.

But then again, I'm unsure about the "copy" prefix and highestToLowestCounts might do it as well.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-10-17 at 05:42 PM


Hrrrrm. That's not how I read it -- I read it as "highest counts come first, lower counts come later."

I'm not sure I like "copyHighestToLowestCounts" though, just because it doesn't say what "highest to lowest" refers to.

How do you feel about "copyHigherCountsFirst"?

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2011-10-18 at 04:09 PM


To take a completely different tack:

Why not have a "ImmutableMultiset<T> Multisets.sortedCopy(Multiset<T> src, Comparator<T> order)" method, and a "Ordering<T> Multisets.higherCountFirstOrdering(Class<T> cls)" method?

I think it's also reasonable to define that the "natural" ordering of a multiset is by count (not the natural ordering of it's contents), such that a method

ImmutableMultiset<T> Multisets.sortedCopy(Multiset<T> src)

would behave as

ImmutableMultiset<T> Multisets.sortedCopy(Multiset<T> src, Multisets.higherCountFirstOrdering(T.class)).

Provides a lot more flex for only a little more code.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-10-18 at 04:35 PM


This is, I'm afraid, not a well-defined ordering at all. There is no implementation for

Ordering<T> Multisets.higherCountFirstOrdering(Class<T> cls)

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2011-10-18 at 05:31 PM


Ah, yes - that ordering must be in reference to some particular multiset - I was thinking too hopefully that there might be someway to get away from that. No problem: Ordering<T> Multisets.higherCountFirstOrdering(final Multiset<T> ref), implemented as, essentially:

int compare(T left, T right) {
 return ref.count(one) - ref.count(theOther);
}

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-10-18 at 05:45 PM


Except, of course, this ordering isn't compatible with equals().

I've thought about this direction, I don't believe it can be made to work, I'm afraid.

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2011-10-18 at 06:21 PM


Can you elaborate on incompatibility with equals()? I also spaced on that earlier, but:

left == right seems to obviously compare to 0, given the way equality works with Multiset.

compare = 0 -> left == right seems to be resolvable by either the T being already comparable, or by supplying a second comparator (perhaps a bit clunky).

i.e.,

ImmutableMultiset<T> Multisets.countSortedCopy(Multiset<T extends Comparable<T>> src)
ImmutableMultiset<T> Multisets.countSortedCopy(Multiset<T> src, Comparator<T> nonCountOrder)

and then for the count ordering

private final Comparator<T> ifZero = //...either natural or supplied comparator/ordering
int compare(T left, T right) {
  int comp = ref.count(left) - ref.count(right);
  return comp !=0 ? comp : ifZero.compare(left, right);
}

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2011-10-18 at 06:25 PM


or maybe those signatures need to be applied to the method that returns the Ordering, but the same idea applies: require either a natural or supplied comparator to handle the count(A) == count(B) && A != B case to maintain compatibility with equals().

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-10-18 at 06:33 PM


The point is, it's so awkward to make this approach work that it's no longer worthwhile.

@gissuebot
Copy link
Author

Original comment posted by cgdecker on 2011-10-18 at 06:35 PM


Has anything like this been considered?

  Multisets.countOrderedCopy(multiset).highestToLowest()
  Multisets.countOrderedCopy(multiset).lowestToHighest()

Good:

  • separates "ordered by count" from "in which order?"
  • reads fairly nicely

Bad:

  • requires adding an interface to support it
  • possibly confusing since "countOrderedCopy(multiset)" doesn't clearly indicate that another call is required to get the final result

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-10-18 at 06:40 PM


Hmmmm.

We considered extending the ImmutableMultiset builder to support an option like this. I think we concluded, though, that it wasn't really ImmutableMultiset's job to provide count-ordering.

That said, I like this...and I can't 100% recall why we rejected something along these lines.


CC: kevinb@google.com

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2011-10-20 at 06:02 PM


caveat: I haven't hooked these up to tests, this is basically just porting what I did to solve the problem back when I proposed it into something like the solution that was linked above. No compilation problems at least, right?

The attached is an implementation that I think is not so awkward as to be untenable.

It provides:

descCountOrdering() - doc explicitly states it is inconsistent with equals and indicates easy ways to make consistent with equals. I left in comments the options to provide methods to do that (countOrderingWithEquals()) as an alternative to exposing some autocast singletons.

I've got some commented out rough implementations for the API for the typical use case coverage:

countOrderedCopy(Multiset) - just uses descCountOrdering
countOrderedCopy(Multiset, Comparator) - adds secondary ordering on T, easy to use with Ordering.<T>natural() or any custom comparator

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-10-20 at 06:48 PM


This is some pretty good code -- you made excellent use of the Ordering features -- but I don't think this is the right approach.

First off, we don't want to expose to the user any comparators on multiset entries. Specifically, we don't want to expose countOrdering(), descCountOrdering(), or anything of the sort. These are just a means to an end -- sorting multisets by element frequency -- and if we're already exposing something like countOrderedCopy(Multiset), there's no reason to additionally expose countOrdering() or descCountOrdering().

Secondly, we don't want to expose any new tools that take extra orderings on the multiset elements. The only feature we're planning on offering is something that lets you take a multiset and get back a copy with the highest counts coming first, with ties broken by whichever elements came first in the original multiset. (If users really want ties to be broken according to some Comparator<E>, they can just do an intermediate call to ImmutableSortedMultiset.copyOf(multiset, myComparator).)

For reference, please remember that sending us code doesn't make it more likely that your feature will be accepted to Guava. Writing code is maybe 10-20% of the work we do for each feature we accept: the real work is discussing, arguing, and naming things, and even so we reject most feature requests. In the future, please work out an API with us, and make sure everybody's satisfied, before you spend time writing patches.

@gissuebot
Copy link
Author

Original comment posted by blank101 on 2011-10-20 at 07:27 PM


Ah, I suppose I'm still pining for the TreeMultiset that sorts on count per my original request, (or LeaderboardMultiset like in comment #9), hence the continuous pestering.

As for the rest, my own solution to the problem I faced a year ago did expose those things (and they've found some use) but not exposing them has pros as well (e.g., plenty of ways to misuse), and I find actually writing code a better way to think and communicate about any API - not from a misunderstanding about what gets features into Guava.

Anyway, thanks for the detailed rejection, and I look forward to the removal of @Beta, however the method is finally named/signatured!

@gissuebot
Copy link
Author

Original comment posted by fry@google.com on 2011-12-10 at 03:45 PM


(No comment entered for this change.)


Labels: Package-Collect

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-12-28 at 11:54 PM


Done as of release 11 with copyHighestCountFirst.


Status: Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant