Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zip-like version of skipmissing for multiple iterators? #30596

Open
stevengj opened this issue Jan 4, 2019 · 10 comments
Open

zip-like version of skipmissing for multiple iterators? #30596

stevengj opened this issue Jan 4, 2019 · 10 comments
Labels
domain:missing data Base.missing and related functionality

Comments

@stevengj
Copy link
Member

stevengj commented Jan 4, 2019

As suggested on discourse, a method like

skipmissing(itr, itrs...) = skipmissing(mapfoldl(ismissing, |, tuple) ? missing : tuple for tuple in zip(itr, itrs...))

might be useful.

(Not using any here to exploit the optimized mapfoldl for tuples from #30471.)

@stevengj stevengj added the domain:missing data Base.missing and related functionality label Jan 4, 2019
@stevengj
Copy link
Member Author

stevengj commented Jan 5, 2019

One incongruity with the above API is that skipmissing(itr) produces a sequence of elements whereas skipmissing(itr1, itr2) produces a sequence of tuples. Makes me think that the tuple version should be called something different, maybe zipmissing.

@pdeffebach
Copy link
Contributor

Thanks for making this issue!

I think the behavior I'm looking for is actually OR, so you get complete cases for everything.

skipmissing(itr, itrs...) = skipmissing(mapfoldl(ismissing, |, tuple) ? missing : tuple for tuple in zip(itr, itrs...))

definitely separate objects rather than an iterator of pairs is nice. The use-case I imagine is when you have arguments that need to be synced by index and inputted into functions.

@tkf
Copy link
Member

tkf commented Jan 5, 2019

FYI @galenlynch implemented skipoftype in #30549 (thx!). I haven't tried it, but I think it let us write skipoftype(Tuple{Vararg{Missing}}, zip(itrs...)) and should give us correct eltype for free.

@nalimilan
Copy link
Member

zipmissing sounds like the appropriate name. But is this really what people need? For example, if you want to compute the correlation between two vectors with missing values, cor(skipmissing(zipmissing(x, y))...) won't work. What's needed is a function returning a tuple of iterators which return entries for complete cases in x and y.

@piever
Copy link
Contributor

piever commented Jan 5, 2019

What's needed is a function returning a tuple of iterators which return entries for complete cases in x and y.

Maybe skipmissing(a, b) could return two iterators, one for a and one for b. Then one would simply use zip(skipmissing(a, b)...) to have the iterator of tuples, or we could define zipmissing(itr...) = zip(skipmissing(itr...)...).

OTOH the tuple iterator as implemented in the first post seems more natural as it has to traverse the data only once, so an alternative would be to go with zipmissing and for the cor use case do something like unzip(zipmissing(a, b)) (provided unzip gets implemented).

@nalimilan
Copy link
Member

Maybe skipmissing(a, b) could return two iterators, one for a and one for b. Then one would simply use zip(skipmissing(a, b)...) to have the iterator of tuples, or we could define zipmissing(itr...) = zip(skipmissing(itr...)...).

The problem @stevengj highlighted is that skipmissing(a) currently returns an iterator, but skipmissing(a, b) would have to return a tuple of iterators, which would be inconsistent with the former method (which should return a single-iterator tuple for consistency).

OTOH the tuple iterator as implemented in the first post seems more natural as it has to traverse the data only once, so an alternative would be to go with zipmissing and for the cor use case do something like unzip(zipmissing(a, b)) (provided unzip gets implemented).

If unzip(zipmissing(a, b)) worked without collecting the iterator first (not sure how that could be done), I don't see why a function wouldn't be able to do that directly, with a single pass. Anyway in the end everything depends on the order in which you go over each iterator.

@pdeffebach
Copy link
Contributor

I'm curious if this idea makes any sense.

Now that we can index a skipmissing object, we could define zip(x::Skipmissing, y::Vector) to only include the elements of y that where x is not missing.

Then we could mandate that functions that require arrays of equal length to zip them, guaranteeing two matching iterators.

@nalimilan
Copy link
Member

That would break the current behavior of zip, as this pattern already works. Also it would be weird not to follow the general behavior implemented for other iterables.

@pdeffebach
Copy link
Contributor

Based on the above conversation, a good way forward would be a multiskipmissing function that returns a tuple of iterators. It will not be called zipmissing because zip returns an iterator of tuples, not a tuple of iterators.

Should I submit a PR to Missings.jl for this?

@pdeffebach
Copy link
Contributor

I've implemented a multiskipmissing iterator in Missings.jl here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests

5 participants