Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-propogation, skipmissing-related improvements to Missing handling. #35050

Open
pdeffebach opened this issue Mar 9, 2020 · 10 comments
Open
Labels
domain:missing data Base.missing and related functionality

Comments

@pdeffebach
Copy link
Contributor

pdeffebach commented Mar 9, 2020

There is broad consensus that missing handling could be improved. Many discussions focus on making propagation of missings easier, and those discussions are worth having, but I also want to focus on how skipmissing handling could be improved. Here are my suggestions. There is a lot of overlap with #30596 here but this discussion should focus more on building ideas and a roadmap rather than a specific implementation.

  1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.
  2. Make way more functions in Base like cor, accept any iterator and not vectors so we don't need to collect skipmissings.
  3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike skipmissings above in (1), this returns an iterator of tuples. both are useful.
  4. Change broadcasting so that we can go from a skipmissing back to a vector with missings in the same locations. It would be nice for skipmissing to have some kind of persistence so that you don't lose the location of missings when you collect. This allows you to, say, de-mean (or a more complicated function) elements of a vector with respect to non-missing entries.
  5. Use dispatch for DataFrames (or Tables or NamedTuples etc.) to simulate Stata's if syntax, where the new dataframe is a view into the non-missing elements uses 4. (above) to fill in missings where needed. R doesn't have this feature so I don't think its obvious to everyone that this is an option. Stata is really great with this, you can do
egen x = y - mean(z) if !missing(v)

And it will apply a filter on everything at the start of the function.

These are concrete changes that can be made without using relying on propagation of missings. They would lead to a workflow where one is able to take a vector, filter it to remove missings in whatever way you like, do things with the vector (the hard part), and then keep the missings in the correct locations.

@pdeffebach
Copy link
Contributor Author

With regards to point 4. map on skipmissing should return an object of the original type, but with missings where they are supposed to be. i.e.

x = [1, 2, missing, 4]
y = map(x) do xx
    xx - 1
end 
# [0, 1, missing, 3]

@ChrisRackauckas
Copy link
Member

No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:

https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1

The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions

But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that missing will not necessarily propogate, and if you want to propogate missings, you should do things like what @KristofferC suggested:

#26631 (comment)

@propagatemissing f
where that expands to
g = x -> x === missing ? x : f(x)

So if the missing debate is as settled as @StefanKarpinski is saying, we should fix the docs to signal in the same way.

@KristofferC
Copy link
Sponsor Member

For documentation I would just use the higher-order function

propagate_missing(f) = x -> x === missing ? x : f(x)

instead of a macro.

@DilumAluthge
Copy link
Member

For documentation I would just use the higher-order function

propagate_missing(f) = x -> x === missing ? x : f(x)

instead of a macro.

Would it be worth adding a type parameter to force specialization on the type of f?

@DilumAluthge
Copy link
Member

For documentation I would just use the higher-order function

propagate_missing(f) = x -> x === missing ? x : f(x)

instead of a macro.

Would it be worth adding a type parameter to force specialization on the type of f?

E.g.

propagate_missing(f::F) where {F} = x -> x === missing ? x : f(x)

@pdeffebach
Copy link
Contributor Author

These proposals do not sound very different from the current passmissing function already defined in Missings.jl.

@ChrisRackauckas
Copy link
Member

Alright, so should the docs just say functions don't propagate missing and point to using passmissing?

@pdeffebach
Copy link
Contributor Author

Alright, so should the docs just say functions don't propagate missing and point to using passmissing?

Yes, but there are still a lot of improvements to skipmissing-type workflows that should be considered as well.

@tkf
Copy link
Member

tkf commented Mar 10, 2020

1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.

3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike skipmissings above in (1), this returns an iterator of tuples. both are useful.

Just FYI, I think we can and should just make (t for t in zip(xs, ys, zs, ...) if any(ismissing, t)) and (... if all(ismissing, t)) fast to support these idioms. It's already kind of true as of #33526 (e.g., sum(x for x in xs if x !== missing) is even faster than sum(skipmissing(xs)) though this is partially because the former doesn't use pairwise summation). #33526 doesn't work with reduce yet so it doesn't cover the whole story. But it is straightforward to support reduce at least when there is no dims. (I want to have a go at it at some point but there is another reduce related improvement #31020 waiting for a review and I don't want to create a patch that would introduce a large conflict.)

I think just improving vanilla iterator transformations (filter etc.) is better as it would not only make missings faster but also make small Union of user-defined types faster.

@nalimilan
Copy link
Member

No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:

https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1

The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions

But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that missing will not necessarily propogate, and if you want to propogate missings, you should do things like what @KristofferC suggested:

This should probably have been "when passed to standard mathematical operators and functions". See #35264. (Do note that there's a paragraph not visible in the diff which explicitly says that most functions do not propagate.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests

7 participants