Export to GitHub

morelinq - issue #6

Allow caller to specify value for padding shorter of Zip sequences


Posted on Feb 16, 2009 by Grumpy Bird

I think it would be good to allow the caller to control what is being used to pad the shorter collection if desired, rather than relying on default(TResult). Especially for numerics, where 0 may very well have a meaning for the caller, other than "not there".

Since this would change the method signature, new method overrides would probably be required. These methods would probably have no need for a zip strategy argument either.

Here is the method signature I would envision: private static IEnumerable<TResult> ZipImpl<TFirst, TSecond, TResult>( IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> resultSelector, TFirst padFirst, TSecond padSecond)

Comment #1

Posted on Feb 16, 2009 by Grumpy Bird

"overrides" in my original note should be "override". Didn't mean to imply existing methods should be replaced, but rather just a new one should be added.

Comment #2

Posted on Feb 16, 2009 by Happy Bear

I can't see how you get away without the overload taking a strategy - you still need to differentiate between Pad, Truncate and Fail. How would you make the above signature handle truncation and failure?

Comment #3

Posted on Feb 16, 2009 by Grumpy Bird

My thought was that adding the pad argument to the existing methods would overcomplicate the signature because it's only necessary for one of the three strategies. The signature I specified above would be a separate overload from the ones already there.

Comment #4

Posted on Feb 16, 2009 by Massive Ox

When I originally submitted the Zip implementation, I intentionally left a comment that referenced a similar discussion regarding zip in Python 3000 as food for thought. Since the comment has been lost in translation, :) I'll re-iterate it here:

[Python-3000] have zip() raise exception for sequences of different lengths http://mail.python.org/pipermail/python-3000/2006-August/003338.html

The discussion is very interesting and worth reading up on before settling here.

The current use of an enumeration (ImbalancedZipStartegy) to express the required behavior seems awkward because it's not something you want to be able to vary externally, which is usually why you would make a parameter or argument. Plus, it's a little cumbersome to write and read back in code and not very LINQ-ish in spirit. The enumeration could certainly be used internally for sharing the implementation logic but the public API should use distinct and clear operator names that imply the strategy. For example:

  • EquiZip: Sequences must be equal in length otherwise throw error

  • Zip: Truncates to shorter of the two sequences

  • ZipLongest: Uses longer of the two sequences while padding shorter of the two

Comment #5

Posted on Feb 16, 2009 by Massive Ox

(No comment was entered for this change.)

Comment #6

Posted on Feb 17, 2009 by Grumpy Bird

I agree with Aziz that using the strategy enum seems like it might be more cumbersome with little added value for the API consumer... This doesn't mean we wouldn't want to expose it or use it internally, just that simpler methods might be warranted as well.

I think we could conceivably condense all behavior into two methods both named Zip...

One method would either truncate or throw, if the sequences aren't of the same length: void Zip(this IEnumerable src, IEnumerable other, Boolean throwIfUneven)

The other would take generators for each sequence in case one should be shorter, and if the generator is null, use the "default" for that type: void Zip(this IEnumerable src, IEnumerable other, Func srcPadGen, Func otherPadGen)

Comment #7

Posted on Feb 17, 2009 by Happy Bear

I removed the comment about Python on the grounds that I'd implemented the optional functionality, and referring to Python probably wouldn't actually help our users :)

Personally I still think the enumeration is the simplest way to go - using a boolean for throw or not feels like the kind of thing that reduces readability: when looking at the code in 6 months time, you'd be pretty much required to check what that parameter was.

I can see the benefit in the padding generators in that it allows a non-default padding, but that's all - and I suspect you'd rarely want to actually express both, because you'd probably know which is going to be longer.

Don't forget that in the current setup the client doesn't have to specify the enum at all - it defaults to Truncate.

Comment #8

Posted on Feb 17, 2009 by Grumpy Bird

Fair point on readability for the boolean argument.

It's anecdotal, but in my personal use I far more often want to pad the short side than truncate the long side. I use the generators all the time. I have no qualms about making truncate the default behavior, but I think the generator versions would also get a good amount of use.

Especially when working through the IEnumerable interface rather than directly with List, Array, etc., it's not at all uncommon to not know which side is longer before zipping. In fact, needing to know ahead of time I think sort of defeats the purpose of the deferred execution.

If we want to support padding generators, we'd pretty much have to require both to be provided (at least filled in with a null to indicate default), because the way generics works would prevent us from specifying one overload for each side. (They'd conflict with each other when the element types are the same.)

Comment #9

Posted on Feb 17, 2009 by Massive Ox

Comment deleted

Comment #10

Posted on Feb 17, 2009 by Massive Ox

referring to Python probably wouldn't actually help our users :)

I was more thinking of MoreLINQ developers and maintainers until we polish (docs) for 1.0. :) Meanwhile, is the MoreLINQ audience Mort or Elvis? ;)

Personally I still think the enumeration is the simplest way to go

We risk going on forever with personal preferences and subjectivity. The right solution may be simple, but we may not be there yet. :)

The argument is no longer whether we should provide the functionality or not, but rather how to expose it and using what signature(s). There's already agreement that a Boolean definitely reduces readability. The next thing that should be avoided is an explosion of overloads to do and mean everything. As for the ImbalancedZipStrategy, I've pointed out before that making the strategy an argument of the operator via an enumeration (or even polymorphic type) would be interesting if the problem that needed solving was that the strategy is a variable of the run-time. I think we can also agree that's not the case here. If that's not convincing by itself, then there is a very subtle yet major idea behind extension methods and LINQ that the enumeration works against. If the enumerable source type wants to provide an optimization for an operator, it can do so by providing a method by the same name and signature as the operator in question. If we use an enumeration that comes from MoreLINQ then the other type has to take in a dependency that won't be looked upon lightly. If we take the approach of embodying the strategy in the name (assuming this is a compile-time decision), then we have simple signatures with Zip, EquiZip and ZipLongest being distinct and clear names. Now, if I have a type called SuperDuperList that wants to provide an optimization for Zip, EquiZip and ZipLongest (or any one of the three) then I can do so using simply base and generic types. Right now, with the enumeration, I have no choice but to support all strategies and take a hit on MoreLINQ! Think about it. This is how Enumerable.Contains works. When using Contains on variant typed as List or IList, the LINQ extension method is not used! If one wants to force use of the LINQ's Contains implementation then one has to hop over AsEnumerable first. So to stay objective, we need to see which approach doesn't take us to far away from how LINQ operators should be designed, taking built-in ones as guidance.

Comment #11

Posted on Mar 10, 2009 by Grumpy Horse

I don't want to resurrect any painful discussions here, but unless you envisage more than the three strategies already specified, I prefer the idea of the strategy being inferred by the method name.

  • Zip
  • PaddedZip

I don't think the truncated version is required because if you lazily evaluate the results of Zip and PaddedZip you can just do PaddedZip.Take or Zip.Take to get the shorter of the two.

Comment #12

Posted on Mar 10, 2009 by Massive Ox

@jeffmagic: You've got my vote on strategy being part of the method name. ;)

Comment #13

Posted on Apr 7, 2009 by Massive Ox

I've taken sub-issue about distinct method names instead of a strategy enumeration and created issue #24 out of it.

Comment #14

Posted on Jun 30, 2013 by Swift Cat

Great to see some progress on MoreLinq again. I have implemented a patch for this and pushed it to my Bitbucket fork (it seems I do no longer have commit access to this repo and I don't know how GoogleCode supports pull requests). Code is here https://bitbucket.org/johannesrudolph/morelinq/commits/b607e78b404bd86ff83a7f9ae85c511397b06a7a

Please pull it in the trunk if it's acceptable.

Comment #15

Posted on Jul 1, 2013 by Massive Ox

@jojo.rudolph: While I can imagine rare cases (though I have yet to come across it) where the last value could be handy, it's better to keep the simpler overload where the default is given instead of the one accepting providers. When case for providers builds stronger, it can always be added back in a future version.

Defaults can be implemented as shown below but it could become verbose so providing defaults can be a bit more palatable, especially in the midst of long query.

var ns = MoreEnumerable.GenerateByIndex(n => (double?) n * 0.5);

var xy = ns.Take(5).ZipLongest(ns.Take(10), (x, y) => new { X = x ?? double.NaN, Y = y ?? double.NaN });

foreach (var e in xy) Console.WriteLine(e.ToString());

// prints: // { X = 0, Y = 0 } // { X = 0.5, Y = 0.5 } // { X = 1, Y = 1 } // { X = 1.5, Y = 1.5 } // { X = 2, Y = 2 } // { X = NaN, Y = 2.5 } // { X = NaN, Y = 3 } // { X = NaN, Y = 3.5 } // { X = NaN, Y = 4 } // { X = NaN, Y = 4.5 }

Comment #16

Posted on Aug 21, 2015 by Massive Ox

This issue has been migrated to: https://github.com/MoreLINQ/morelinq/issues/6 The conversation continues there. DO NOT post any further comments to the issue tracker on Google Code as it is shutting down. You can also just subscribe to the issue on GitHub to receive notifications of any further development.

Status: Started

Labels:
Type-Enhancement Priority-Medium