Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coalesce for mixed nothing/missing: bug and design issue #26927

Closed
bkamins opened this issue Apr 28, 2018 · 23 comments
Closed

Coalesce for mixed nothing/missing: bug and design issue #26927

bkamins opened this issue Apr 28, 2018 · 23 comments
Labels
design Design of APIs or of the language itself domain:missing data Base.missing and related functionality
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Apr 28, 2018

First let me report a bug:

julia> coalesce.([missing, nothing, Some(nothing), Some(missing)], "replacement")
ERROR: MethodError: convert(::Type{Union{Missing, Nothing, Some{Union{Missing, Nothing}}}}, ::Some{Nothing}) is ambiguous. Candidates:
  convert(::Type{Union{Missing, T}}, x) where T in Base at missing.jl:39
  convert(::Type{Union{Nothing, T}}, x) where T in Base at some.jl:21
Possible fix, define
  convert(::Type{Union{Missing, Nothing}}, ::Any)

But I have tracked it because I want to get back to #26661 and I have second thoughts about the fact that coalesce mixes handling missing and Some/nothing. It would help me if I understood why (maybe e.g. @nalimilan could explain or give a reference to an earlier discussion - I could not find it unfortunately).

The problem I have is that I feel that current coalesce is not very useful for handling missing values because coalesce(Some(missing), "replacement") returns missing, so e.g. this pattern to get rid of missings in a vector does not work (it worked in old Missings.jl package): coalesce.([Some(missing), "a", "b", missing], "replacement").

My first reaction is to have separate function for missing and separate for Some/nothing but maybe there is some good reason to mix it.

@nalimilan
Copy link
Member

Good catch about the convert bug. It's a pretty unlikely situation but we should handle it correctly. I'm not sure how to fix that though. These ::Type{Union{...}} methods are terrible to manage. FWIW, defining the following function makes Julia crash when creating the array:

 Base.convert(::Type{Union{Missing, Nothing, T}}, x::Any) where {T} = convert(T, x)

Regarding the behavior of coalesce, the main reason it handles both missing, nothing and Some is that it's useful for all three cases, and it's not clear how we would name separate functions if we distinguished these behaviors. In practice I'm not sure it's a problem. What's the issue with coalesce(Some(missing), "replacement") returning missing? Some(missing) is definitely not a missing value, it's precisely a missing value protected by the Some wrapper to be able to distinguish it from standard missing values. Anyway I doubt actual use cases would mix missing and Some like that.

@bkamins
Copy link
Member Author

bkamins commented Apr 28, 2018

The problem I have is that I would not like to have Some(missing) unwrapped when I use coalesce in "remove missing" mode. In particular, if I understand it correctly Some is designed to work with Nothing (and this is how it is documented). Unwrapping Some(missing) is a side effect only of the fact that now coalesce works on both missing and nothing.

I agree that the situation that we mix missing with nothing/Some will be rare. In the worst case we can write a clear documentation warning not to mix them, but I feel it is not clean. I would prefer two functions. One handling missing (it could have another name e.g. fuse) only and the other handling nothing/Some (this could keep coalesce name).

Alternatively coalesce could get a keyword argument mode with three traits:

  • missing (only works with missing);
  • nothing (only works with nothing and unwraps Some);
  • some third option (default - this would be current behavior).
    (or some other similar solution)

@nalimilan
Copy link
Member

There's nothing special about Some(missing), it's just like Some(1). AFAICT the question is rather whether coalesce should replace missing given that it also replaces nothing and unwraps Some. We could obviously allow tweaking the behavior with a keyword argument as you suggest, but that would be really verbose, when the point of coalesce is to be short and convenient.

coalesce is the standard SQL (and dplyr) term for this operation, so I'd rather not use another one like fuse. unwrap would be a good term, but it only applies to Union{Some{T}, Nothing}, not to Union{T, Nothing}.

I guess one possibility would be to introduce ?? as the null-coalescing operator, and reserve coalesce to missing. That would be consistent with C# and Swift (for ??), and with SQL and dplyr (for coalesce).

I've filed #26929 for the crash mentioned above.

@bkamins
Copy link
Member Author

bkamins commented Apr 29, 2018

AFAICT the question is rather whether coalesce should replace missing given that it also replaces nothing and unwraps Some.

This is exactly my point. I am flexible to whatever choice is made - given it is properly documented 😄 (I agree that Some(missing) is totally normal and should be unwrapped if unwrapping is wanted).

If we wanted to use R as a reference you have that dplyr uses coalesce for NA BUT NOT FOR NULL (which is an equivalent of nothing in Julia). And in SQL NULL is missing in Julia. So based on these terms I would argue that current coalesce will actually confuse people coming from R/SQL (as it will not ensure invariants they expect - namely that e.g. coalesce(x, 1) is guaranteed not to be missing no matter what x is).

I am pushing this so much now as for me this function is an important part of data science targeted ecosystem in base Julia so I would prefer to have a well thought of API here in Julia 1.0.

@ararslan ararslan added the domain:missing data Base.missing and related functionality label Apr 30, 2018
@bkamins
Copy link
Member Author

bkamins commented May 1, 2018

To show a situation that bothers me this is a simplified situation I had today:

julia> x = ["a", "b", missing]
3-element Array{Union{Missing, String},1}:
 "a"
 "b"
 missing

julia> y = [ismissing(v) ? missing : findfirst("a", v) for v in x]
3-element Array{Any,1}:
 1:1
 nothing
 missing

and now you cannot use coalesce meaningfully on such a vector.

My point is that for interactive use current coalesce is probably not a problem (as user will know what and why is being replaced) but for production code (e.g. writing a general purpose package) it is very easy to forget some use-case of coalesce and end up with a hard to catch bug in the code. Consider eg. some NLP processing package that would do processing similar to what I have given on the top.

@nalimilan nalimilan added the status:triage This should be discussed on a triage call label May 3, 2018
@JeffBezanson
Copy link
Sponsor Member

I think coalesce should handle exactly one value specially; a different value requires a different function.

@nalimilan
Copy link
Member

@JeffBezanson So how about adding ?? as the nothing-coalescing operator? See #26303.

@JeffBezanson
Copy link
Sponsor Member

@vtjnash has proposed this patch:

diff --git a/base/some.jl b/base/some.jl
index f75a493d91..6b9e1c84b6 100644
--- a/base/some.jl
+++ b/base/some.jl
@@ -64,8 +64,8 @@ function coalesce end
 
 coalesce(x::Any) = x
 coalesce(x::Some) = x.value
-coalesce(x::Nothing) = nothing
-coalesce(x::Missing) = missing
+coalesce(x::Nothing) = error("no default value")
+coalesce(x::Missing) = error("no default value")

Basically, the idea is that coalesce can never return nothing or missing unless it was wrapped in Some. That effectively forces you to supply a "default value", e.g. coalesce(x, y, Some(missing)).

@Keno
Copy link
Member

Keno commented May 10, 2018

Another proposal in addition:
Have coalesce decide which is the sentinel value based on the first argument, e.g.

coalesce(nothing, nothing, 1) = 1
coalesce(nothing, missing, 1) = missing
coalesce(missing, missing, 1) = 1
coalesce(missing, nothing, 1) = nothing

That way if you have, just pure mixtures of either nothing/missing or a value, that'll work just
fine. And you if you can have both, you can use the first argument as the sentinel:

coalesce(nothing, f(x), g(x), 1) # Will use `nothing` as the sentinel value
coalesce(missing, f(x), g(x), 1) # Will use `missing` as the sentinel value

Additionally @omus had a proposal to have a version of coalesce that takes a predicate for which values to consider missing, but that seems like a separate generic function.

JeffBezanson added a commit that referenced this issue May 10, 2018
add promotion rule for `Union{Nothing, Missing, T}`

fixes the bug part of #26927
JeffBezanson added a commit that referenced this issue May 11, 2018
add promotion rule for `Union{Nothing, Missing, T}`

fixes the bug part of #26927
JeffBezanson added a commit that referenced this issue May 11, 2018
add promotion rule for `Union{Nothing, Missing, T}`

fixes the bug part of #26927
@nalimilan
Copy link
Member

Throwing an error when the result would be missing or nothing could make sense, but AFAICT it wouldn't fix the difficulty that both missing and nothing values are replaced when they are mixed. @Keno's proposal to specify the sentinel value as the first argument would be more to the point. To be clear: we would retain the current behavior, but allow a more restrictive one when passing missing or nothing as the first value?

@bkamins
Copy link
Member Author

bkamins commented May 12, 2018

I have two additional comments/questions:

  1. If we want to retain current functionality that coalesce unwraps Some(x) in missing-mode then I would document it in Some docstring (as now the documentation is unequivocal that Some is intended to be used with Nothing).
  2. @Keno's proposal highlights my earlier point - in general I agree with it, but the question is:
    • how does it differ from defining two functions (I am not trying for the best names - but to be clear what I mean) coalesce_missing and coalesce_nothing (i.e. - putting the sentinel into function name)
    • if we would use this approach what would be the use case of coalesce without sentinel (would anyone actually want this functionality? - of course if we disregard that this saves typing)

@Keno
Copy link
Member

Keno commented May 12, 2018

The idea of my proposal is that the sentinel value is optional. That way if you have a bunch of functions f,g,h that return either a value (other than missing) or nothing, coalesce(f(), g(), h()) would just work.

@bkamins
Copy link
Member Author

bkamins commented May 12, 2018

@Keno: yes - I understand this. But my question is: does there exist a realistic use case, when you use coalesce where you may get nothing or missing in the process and you want to skip over BOTH (effectively not knowing which one did happen).

@Keno
Copy link
Member

Keno commented May 12, 2018

This came up on the triage call, and I think the consensus was to create a separate function what takes an explicit predicate for this case instead. (Or just use first(filter(x->x === nothing || x === missing, args)))).

JeffBezanson added a commit that referenced this issue May 13, 2018
add promotion rule for `Union{Nothing, Missing, T}`

fixes the bug part of #26927
@JeffBezanson JeffBezanson added the design Design of APIs or of the language itself label May 17, 2018
@nalimilan
Copy link
Member

With #27157, coalesce throws an error when all arguments are nothing/missing. This strict behavior can always be made more permissive in 1.x since it wouldn't be breaking.

I've thought a bit more about how to handle the case where missing and nothing would be mixed, and it's not really possible to choose the sentinel value based on the first argument, unless you always require specifying it: if you don't know whether the first value is a "normal" value or whether it's supposed to indicate that the behavior should be strict, you can't change the behavior based on it. But coalesce(Missing, x, y...) or coalesce(ismissing, x, y...) would be fine.

@Keno
Copy link
Member

Keno commented May 18, 2018

I've thought a bit more about how to handle the case where missing and nothing would be mixed, and it's not really possible to choose the sentinel value based on the first argument, unless you always require specifying it: if you don't know whether the first value is a "normal" value or whether it's supposed to indicate that the behavior should be strict, you can't change the behavior based on it.

Yes, that is part of the design. The idea would be that the most common case is that function would return either a sentinel or a value guaranteed not to be either sentinel (e.g. Union{Missing, Int}). It would be odd for a function to return either sentinel or a value. If a function does that, I agree the behavior is confusing, but the though was to make things as simple as possible for cases where the user knows things to be unambiguous (i.e. not requiring a first argument for the most common cases).

@nalimilan
Copy link
Member

OK, got it. Basically we just need to add methods to handle cases where nothing and missing follow one another. I've added a commit to #27157, please tell me what you think. The main issue is that I find it hard to describe in simple terms.

@StefanKarpinski
Copy link
Sponsor Member

During triage, @JeffBezanson and I violently objected to coalesce's unwrapping behavior. There was some appeal to the approach of letting the first argument determine the sentinel value, but also concern that one could accidentally get missing as a value and have it be mistakenly taken as a sentinel (which would only be a problem if followed by a nothing). @JeffBezanson proposed the notion of making nothing "more missing" than missing—i.e. this sort of thing:

coalesce(nothing, 1) === 1
coalesce(missing, 1) === 1
coalesce(nothing, missing) === missing
coalesce(missing, nothing) === missing

We didn't come to a conclusion but there was significant support for separating the unwrapping behavior of coalesce out into an unwrap function and doing this sort of thing instead of auto-unwrapping:

identity_or_nothing(x) = rand(Bool) ? Some(x) : nothing

unwrap(coalesce(identity_or_nothing(x), Some(0)))

@Keno
Copy link
Member

Keno commented May 24, 2018

The other approach that was discussed was to go back to having two separate generic function. One that unwraps Some, skips nothing and errors if nothing is found and one that skips missing.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented May 24, 2018

To give a concrete example of the first-argument-as-sentinel problem, suppose both f() and g() return nothing to indicate no value but can otherwise sometimes return missing (say because they're functions that operate on data frames or whatever). Then you can have code like this:

coalesce(f(), g(), 0)

If f() returns missing as a value and g() returns nothing to indicate "no value", then you would get nothing when the intended behavior would be to missing. You could argue that the call should be coalesce(nothing, f(), g(), 0), but it's far too easy for people not to do that and if that's a known trap, it would be better to just always require the caller to explicitly indicate which sentinel they want. Note that the "nothing is more missing than missing" approach would not have this problem—missing would be returned as intended.

@JeffBezanson
Copy link
Sponsor Member

I think the key is to first be clear about what domain the function operates on. Currently there are two issues there:

  1. coalesce considers nothing and missing equivalently as non-values, which doesn't seem universally right.
  2. It unwraps Some, but it's not clear whether Some should be used alongside missing.

I think we should define two domains, and have a function for each:

  1. Union{Missing, T} and coalesce: Returns the first non-missing value, propagates missing if all values are missing. Propagation makes sense since that's how missing usually behaves.
  2. Union{Nothing, Some} and unwrap: Returns the first non-nothing value and unwraps it if it's a Some. Throws an error if every argument is nothing. To be really strict we could require non-nothing arguments to be Some, but that's probably not necessary. unwrap also might not be the best name for this but I don't have any other ideas yet.

If that proposal isn't appealing, I could also tolerate the following:

  1. Have coalesce prefer to return missing over nothing (regardless of argument order), and remove the Some-unwrapping behavior.
  2. Add unwrap(x::Some) for unwrapping.

@StefanKarpinski
Copy link
Sponsor Member

How about calling that unwrap function something? As in it's guaranteed to return something. It also makes it unsurprising that it has to do with both nothing and Some.

@bkamins
Copy link
Member Author

bkamins commented May 24, 2018

I think we should define two domains, and have a function for each

This is exactly what I think is best from the very start.

JeffBezanson added a commit that referenced this issue May 25, 2018
nalimilan pushed a commit that referenced this issue May 27, 2018
ViralBShah pushed a commit that referenced this issue May 28, 2018
@JeffBezanson JeffBezanson added this to the 0.7 milestone May 28, 2018
@JeffBezanson JeffBezanson removed the status:triage This should be discussed on a triage call label May 28, 2018
Keno pushed a commit that referenced this issue Jun 5, 2024
* make `coalesce` handle only `missing`; add `something` to handle `nothing`/`Some`

fixes #26927

* Move something docs with those for nothing and Some
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Design of APIs or of the language itself domain:missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests

6 participants