Non-propagation, `skipmissing`-related improvements to Missing handling. #35050

pdeffebach · 2020-03-09T12:24:14Z

There is broad consensus that missing handling could be improved. Many discussions focus on making propagation of missings easier, and those discussions are worth having, but I also want to focus on how skipmissing handling could be improved. Here are my suggestions. There is a lot of overlap with #30596 here but this discussion should focus more on building ideas and a roadmap rather than a specific implementation.

1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values, which is especially useful for plotting.
2. Make many more functions in Base, like cor, accept any iterator rather than only vectors, so we don't need to collect skipmissings.
3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike skipmissings in (1) above, this returns an iterator of tuples. Both are useful.
4. Change broadcasting so that we can go from a skipmissing back to a vector with missings in the same locations. It would be nice for skipmissing to have some kind of persistence so that you don't lose the location of missings when you collect. This allows you to, say, de-mean (or apply a more complicated function to) elements of a vector with respect to non-missing entries.
5. Use dispatch for DataFrames (or Tables or NamedTuples etc.) to simulate Stata's if syntax, where the new dataframe is a view into the non-missing elements and uses (4) above to fill in missings where needed. R doesn't have this feature, so I don't think it's obvious to everyone that this is an option. Stata is really great with this; you can do

egen x = y - mean(z) if !missing(v)

And it will apply a filter on everything at the start of the function.

These are concrete changes that can be made without relying on propagation of missings. They would lead to a workflow where one is able to take a vector, filter it to remove missings in whatever way you like, do things with the vector (the hard part), and then keep the missings in the correct locations.
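The workflow described above (filter out missings, compute on the remaining values, then put the results back in their original positions) can be sketched roughly as follows. The helper name `apply_keeping_missings` is hypothetical and exists only for this illustration; it is not part of Base or Missings.jl, and it assumes `f` returns one value per non-missing input.

```julia
using Statistics

# Hypothetical helper for illustration only (not in Base or Missings.jl).
# Applies `f` to the non-missing values of `x`, then writes the results
# back into a vector that keeps `missing` in the original locations.
function apply_keeping_missings(f, x::AbstractVector)
    keep = findall(!ismissing, x)          # positions of non-missing entries
    vals = f(collect(skipmissing(x)))      # compute on non-missing values only
    out = Vector{Union{Missing, eltype(vals)}}(missing, length(x))
    out[keep] = vals                       # restore results at original positions
    return out
end

x = [1.0, 2.0, missing, 6.0]

# De-mean with respect to the non-missing entries (their mean is 3.0).
demeaned = apply_keeping_missings(v -> v .- mean(v), x)
# demeaned is [-2.0, -1.0, missing, 3.0]
```

This collects the non-missing values eagerly, which is simpler than the persistent, broadcast-aware `skipmissing` proposed in point 4 but shows the intended result.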


pdeffebach · 2020-03-09T12:44:59Z

With regard to point 4: map on a skipmissing should return an object of the original type, but with missings where they are supposed to be, i.e.

x = [1, 2, missing, 4]
y = map(skipmissing(x)) do xx
    xx - 1
end
# proposed result: [0, 1, missing, 3]

ChrisRackauckas · 2020-03-09T17:06:15Z

No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:

https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1

The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions

But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that missing will not necessarily propagate, and if you want to propagate missings, you should do things like what @KristofferC suggested:

#26631 (comment)

@propagatemissing f
where that expands to
g = x -> x === missing ? x : f(x)

So if the missing debate is as settled as @StefanKarpinski is saying, we should fix the docs to signal in the same way.

KristofferC · 2020-03-09T19:48:36Z

For documentation I would just use the higher-order function

propagate_missing(f) = x -> x === missing ? x : f(x)

instead of a macro.
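As a small sketch of how such a wrapper would be used, the snippet below applies the `propagate_missing` definition from this comment to `uppercase`, which has no method for `missing` and would otherwise throw a `MethodError`. The name `safe_uppercase` is made up for the example.

```julia
# Definition quoted from the comment above.
propagate_missing(f) = x -> x === missing ? x : f(x)

# `uppercase(missing)` throws a MethodError; the wrapped version
# passes `missing` through instead of calling `uppercase`.
safe_uppercase = propagate_missing(uppercase)

a = safe_uppercase("abc")     # calls uppercase("abc")
b = safe_uppercase(missing)   # short-circuits and returns missing

# Standard arithmetic operators already propagate on their own.
c = 1 + missing
```

The wrapper handles a single positional argument only; a multi-argument version would need to check every argument.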

DilumAluthge · 2020-03-09T19:58:58Z

For documentation I would just use the higher-order function
propagate_missing(f) = x -> x === missing ? x : f(x)
instead of a macro.

Would it be worth adding a type parameter to force specialization on the type of f?

DilumAluthge · 2020-03-09T20:08:45Z

For documentation I would just use the higher-order function
propagate_missing(f) = x -> x === missing ? x : f(x)
instead of a macro.
Would it be worth adding a type parameter to force specialization on the type of f?

E.g.

propagate_missing(f::F) where {F} = x -> x === missing ? x : f(x)

pdeffebach · 2020-03-09T20:24:34Z

These proposals do not sound very different from the current passmissing function already defined in Missings.jl.

ChrisRackauckas · 2020-03-09T20:27:31Z

Alright, so should the docs just say functions don't propagate missing and point to using passmissing?

pdeffebach · 2020-03-09T22:17:44Z

Alright, so should the docs just say functions don't propagate missing and point to using passmissing?

Yes, but there are still a lot of improvements to skipmissing-type workflows that should be considered as well.

tkf · 2020-03-10T03:44:23Z

1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.

3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike skipmissings above in (1), this returns an iterator of tuples. both are useful.

Just FYI, I think we can and should just make (t for t in zip(xs, ys, zs, ...) if !any(ismissing, t)) and (... if !all(ismissing, t)) fast to support these idioms. It's already kind of true as of #33526 (e.g., sum(x for x in xs if x !== missing) is even faster than sum(skipmissing(xs)), though this is partially because the former doesn't use pairwise summation). #33526 doesn't work with reduce yet, so it doesn't cover the whole story. But it is straightforward to support reduce, at least when there is no dims. (I want to have a go at it at some point, but there is another reduce-related improvement, #31020, waiting for a review and I don't want to create a patch that would introduce a large conflict.)

I think just improving vanilla iterator transformations (filter etc.) is better, as it would not only make missings faster but also make small Unions of user-defined types faster.
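For reference, a minimal sketch of the plain-generator idioms discussed here, using made-up data. Note that keeping only the complete tuples requires negating the `any(ismissing, t)` predicate. No performance claim is made by this example; it only shows what the expressions compute.

```julia
xs = [1, missing, 3, 4]
ys = [10, 20, missing, 40]

# Keep only the tuples in which no element is missing.
complete_pairs = [t for t in zip(xs, ys) if !any(ismissing, t)]

# Keep the tuples in which at least one element is non-missing.
partial_pairs = [t for t in zip(xs, ys) if !all(ismissing, t)]

# Filtering generator versus skipmissing: both sum the non-missing values.
total_generator = sum(x for x in xs if x !== missing)
total_skipmissing = sum(skipmissing(xs))
```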

nalimilan · 2020-03-26T12:25:04Z

No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:

https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1

The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions

But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that missing will not necessarily propagate, and if you want to propagate missings, you should do things like what @KristofferC suggested:

This should probably have been "when passed to standard mathematical operators and functions". See #35264. (Do note that there's a paragraph not visible in the diff which explicitly says that most functions do not propagate.)

ararslan added the domain:missing data Base.missing and related functionality label Mar 12, 2020

pdeffebach mentioned this issue Mar 13, 2020

Make covariance and correlation work for any iterators JuliaStats/Statistics.jl#30

Closed

pdeffebach mentioned this issue Mar 21, 2020

Functions applied to an empty skipmissing should return missing. #35194

Closed

nalimilan mentioned this issue Mar 26, 2020

Fix manual about propagation of missing values #35264

Merged

pdeffebach mentioned this issue Apr 30, 2020

Make covariance and correlation work for iterators, skipmissing in particular. JuliaStats/Statistics.jl#34

Open
