Docs of construction undef arrays with missing #31091

Merged: 4 commits into JuliaLang:master, Oct 30, 2020

bkamins
Member

@bkamins bkamins commented Feb 16, 2019

This PR explains the behavior I observe, but it is not documented, so maybe it is not guaranteed (therefore before merging it should be confirmed that Base wants to guarantee this contract).

The contract is that if `T` is a bits type, then creating an uninitialized array of type `Union{Missing, T}` initializes it to hold `missing` in all entries.

CC @nalimilan
@@ -254,6 +254,23 @@ julia> Array{Union{Missing, String}}(missing, 2, 3)
missing missing missing
```

For `T` that is bits type (`isbitstype(T)` returns `true`),
`Array{Union{Missing, T}}(undef, dims)` creates an array filled with
Member

Better to recommend Array{Union{Missing, T}}(undef, dims) (which should work even for non-isbits T)?

Member Author

This is the problem. If T is not a bits type (isbitstype(T) is false) then the array is truly uninitialized (#undef). If I understand correctly, this is for performance reasons.

Member

You mean Array{Union{Missing, T}}(missing, dims) @fredrikekre?

I'm not sure whether it's a good idea to develop this here given that we want people to use Array{Union{Missing, T}}(missing, dims) rather than Array{Union{Missing, T}}(undef, dims), and that it makes things more complex by introducing the concept of isbits type. Maybe make this a shorter `!!! note` without showing the REPL output? It would also be useful to mention that for non-isbits types the entries will be #undef.

Member

> You mean...

Yea...

@bkamins
Member Author

bkamins commented Feb 16, 2019

I also was not sure what to do. One thing that was clear to me is that we should document it, especially the behavior of similar (since similar calls the constructor with undef).

I am OK with changing the message (I just wanted to put something on the table), but first let us confirm that this is the intended behavior that will be supported (and that this result is a consequence of a conscious design decision), i.e.:

  • array constructor with undef for isbitstype fills the array with missing;
  • array constructor with undef for !isbitstype leaves the array with #undef.

(making this clear was the main reason why I have opened this PR)
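
A minimal sketch of the two cases listed above, meant as an illustration of the behavior being asked about and not as a statement of what is guaranteed. The variable names `a` and `b` are arbitrary, and the result for `a` is only the observed behavior that this PR asks to confirm.

```julia
# Case 1: T = Int is a bits type.
a = Array{Union{Missing, Int}}(undef, 3)
println(isbitstype(Int))        # true
# Observed behavior under discussion: every entry reads as `missing`.
println(all(ismissing, a))

# Case 2: T = String is not a bits type.
b = Array{Union{Missing, String}}(undef, 3)
println(isbitstype(String))     # false
# The entries are unassigned (#undef); reading b[1] would throw UndefRefError.
println(isassigned(b, 1))       # false
```

`isassigned` is the safe way to probe the second case, since indexing an unassigned entry raises an error.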

@andyferris
Member

The latter is intentional.

I didn’t realize the former was a guarantee... probably more of an interaction between ordering of unions and how the isbits union array constructor was defined - ie more of a coincidence. Generic code should never read from an undef array in any case, and I wouldn’t encourage people to rely on this for concrete types either, when the constructor with missing is explicit and user friendly and works in generic contexts like when T isn’t isbits.
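
The explicit constructor mentioned here can be sketched as follows. The helper name `allmissing` is hypothetical and used only for illustration; the constructor call itself is the `Array{Union{Missing, T}}(missing, dims)` form discussed in this thread.

```julia
# Hypothetical helper: build an all-missing array for any element type T,
# whether or not T is a bits type.
allmissing(::Type{T}, dims::Integer...) where {T} =
    Array{Union{Missing, T}}(missing, dims...)

x = allmissing(Int, 2, 3)       # bits-type element
y = allmissing(String, 2, 3)    # non-bits element

println(all(ismissing, x))      # true
println(all(ismissing, y))      # true
println(eltype(y))
```

Because the fill value is stated explicitly, the result does not depend on how the array storage happens to be initialized.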

@nalimilan
Member

Yes AFAICT both behaviors are guaranteed (since changing them would potentially break a lot of code), but the former is not really "intended": that's just because the type tag is filled with zeros (unlike the data part, we can't leave it filled with arbitrary values).

@bkamins
Member Author

bkamins commented Feb 17, 2019

> Yes AFAICT both behaviors are guaranteed

This was my assumption. I have rewritten the message to explain the behavior (because when more people start using Julia they will notice this so questions may arise) but discourage relying on it.

`Array{Union{Missing, T}}(undef, dims)` creates an array filled with
missing values. Also calling `similar` to create an uninitialized array that
has element type of the form `Union{Missing, T}` creates an array filled with
`missing` if `T` is bits type. This behavior is an implementation detail
Member

"Implementation detail that may change in the future" might be a bit strong, as it makes it sound like the public API cannot be relied on. Maybe just recommend using Array{Union{Missing, T}}(missing, dims), noting that it works for all types? Maybe also remove the leading "currently".

Member Author

I know it is strong, but from the discussion I understood that we want to be discouraging. The question is if in e.g. Julia 1.7 we will guarantee this behavior (I know we rely on it internally, and the probability that this will change is minimal, but theoretically it is possible). In general - I understand that this is only the current behavior but it is not a part of a public API.

Actually, I have written this PR exactly to clarify: do we want to make it a public API or just describe the current behavior and warn users that this is not a part of public API?

Member

AFAIK, that's part of the public API (anyway, any behavior that is stable like this will de facto end up as being part of the API since lots of code will rely on it). We can mark this for triage to make sure that's the case.

Member Author

OK - can you please mark it?

> any behavior that is stable like this will de facto end up as being part of the API since lots of code will rely on it

Yes, but I would prefer an explicit decision 😄.

Sponsor Member

@JeffBezanson JeffBezanson Mar 7, 2019

If you have Union{SingletonType, NonSingletonType} you are guaranteed to get the singleton type (e.g. Missing), and that won't change. For other combinations it's less predictable.
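
As a hedged illustration of the singleton/non-singleton case described here, the same pattern can be tried with `Nothing` in place of `Missing`. This is a sketch of the behavior as stated in the comment above; the variable name `v` is arbitrary.

```julia
# Nothing is a singleton type and Int is not, so this is a
# Union{SingletonType, NonSingletonType} in the sense of the comment above.
v = Array{Union{Nothing, Int}}(undef, 3)

# Per the comment above, each entry should read as the singleton instance.
println(all(x -> x === nothing, v))
```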

@nalimilan nalimilan added the status:triage This should be discussed on a triage call label Feb 18, 2019
@mbauman
Sponsor Member

mbauman commented Feb 20, 2019

This isn't true.

julia> module A
           struct NotMissing end
           export NotMissing
       end
       using .A

julia> Array{Union{NotMissing, Missing}}(undef, 5)
5-element Array{Union{NotMissing, Missing},1}:
 NotMissing()
 NotMissing()
 NotMissing()
 NotMissing()
 NotMissing()

In short, it's sorted by singleton-ness and module/type name. In most cases, though, folks are going to be interested in combinations of missing with non-singletons... and there it'll be true.

I say we document a very simplistic version of this behavior — but getting the wording right there is gonna be a pain.

@mbauman mbauman added the domain:arrays [a, r, r, a, y, s] label Feb 20, 2019
@bkamins
Member Author

bkamins commented Feb 20, 2019

My conclusion would be just to stress that the result is undefined (as documented elsewhere, but I think it is good to be explicit here). The reason is, as I think @nalimilan noted earlier, that if you want to fill a Union array with missing there is an easy way to do it and it should be used. Explaining the exact rules for when you can rely on it is not worth it.

@nalimilan - where do we rely on the behavior on similar for union with Missing (you have noted that we do somewhere in the code so maybe that part should be reviewed?)

@nalimilan
Member

Ah, good catch @mbauman. So I guess the note should be a warning saying that the result is undefined or at least varies depending on the type T. Maybe mentioning that it can happen with singleton types could be useful for people who doubt it.

> @nalimilan - where do we rely on the behavior on similar for union with Missing (you have noted that we do somewhere in the code so maybe that part should be reviewed?)

I don't think I said we rely on it in a particular place. It's just that inevitably some code somewhere will depend on it. As long as that code isn't exercised on a singleton type which breaks the apparent rule, it won't be caught -- which might be OK if e.g. T<:Real.

@bkamins
Member Author

bkamins commented Oct 24, 2020

I have updated the description following the discussion we had. Would it be acceptable now?

Member

@nalimilan nalimilan left a comment

Looks good to me. Just add an empty line after the note (with [ci skip] in the commit message).

Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins
Member Author

bkamins commented Oct 25, 2020

CI errors seem unrelated

@knuesel
Member

knuesel commented Oct 30, 2020

Shouldn't we altogether discourage people from relying on this behavior? I see only downsides here. We now have the semantically correct missing constructors. So why not be clear in the documentation:

  • Use a missing constructor to initialize to missing.
  • Use an undef constructor to defer initialization to a later stage, and don't read the values before you initialize them.

Otherwise we just further the confusion about the semantics of the undef constructor. Imagine new users reading code that uses this behavior: they will not understand what undef is really about, and they might write buggy code following this pattern (e.g. generic functions that will have wrong behavior when called with singleton types).

If we mention this behavior in the documentation, I think it should only be to note that it's an accident of the implementation and that the semantics of undef are not to initialize to missing, but to postpone initialization.
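
The "postpone initialization" reading of `undef` can be sketched like this. The function name `squares_or_missing` is hypothetical and exists only for this example; the point is that every slot is written before anything is read.

```julia
# Hypothetical example: allocate with `undef`, then assign every entry
# before the array is ever read.
function squares_or_missing(xs::AbstractVector)
    out = Vector{Union{Missing, Float64}}(undef, length(xs))  # contents unspecified
    for (i, x) in enumerate(xs)
        out[i] = ismissing(x) ? missing : Float64(x)^2
    end
    return out   # fully initialized at this point
end

r = squares_or_missing([1, missing, 3])
println(r)
```

Written this way, the code does not depend on what an `undef` array happens to contain, so it behaves the same for any element type.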

@vtjnash vtjnash merged commit f412222 into JuliaLang:master Oct 30, 2020
@bkamins bkamins deleted the patch-24 branch October 30, 2020 14:59
@bkamins
Member Author

bkamins commented Oct 30, 2020

I think that there are two purposes of the documentation:

  1. to provide a specification of the behaviour (this is this PR)
  2. to give recommendations (and for sure this makes sense to have them)

@rfourquet
Member

I'm also a bit confused by this update of the documentation... On the one hand, it helps to not rely on a behavior which is observed in particular cases (non isbits T), which is good; but on the other hand it "validates" using undef to mean fill with missing, which seems wrong (if you don't know this corner-case behavior, you misunderstand code using it). Is it also documented for non-Missing singleton types in general, e.g. for Nothing?

> that's just because the type tag is filled with zeros (unlike the data part, we can't leave it filled with arbitrary values).

Out of curiosity, why does the type tag have to be filled with zeros?

@bkamins
Member Author

bkamins commented Oct 30, 2020

> but on the other hand it "validates" using undef to mean fill with missing

So for me it is the opposite. It tells me not to rely on this when writing generic code (as in such cases if something is broken only rarely it is the same as if it were broken 50% of the time).

@knuesel
Member

knuesel commented Nov 1, 2020

> I think that there are two purposes of the documentation:
>
>   1. to provide a specification of the behaviour (this is this PR)

Yes but sometimes it's best to leave a particular behavior unspecified.

If we specify this, then the developers of new array types have to choose between performance and "compatibility" with Base types. This is a self-inflicted wound: it makes perfect sense to leave undefined the values returned by undef constructors.

As you know, this is already a problem for the developers of SentinelArrays.jl. The SentinelArray undef constructor sacrifices performance to reproduce the behavior of Base (see JuliaData/SentinelArrays.jl#40).

So I'm afraid we're missing an opportunity here. I think the Julia documentation should say something like this:

Due to implementation details, Array{Union{Missing, T}}(undef, dims) returns missing values in most cases. However this behavior is not reliable (it fails for certain singleton types) and the purpose of undef constructors is not to initialize values to missing (use a missing constructor for that).

Then it would be clear that SentinelArray is free to make a real undef constructor, with the performance advantage.

@bkamins
Member Author

bkamins commented Nov 1, 2020

> Then it would be clear that SentinelArray is free to make a real undef constructor, with the performance advantage.

OK, makes sense. I am not a Julia Base maintainer, so could you please open a PR with this change?

@knuesel
Member

knuesel commented Nov 1, 2020

Sure, PR filed at #38260 .

Labels
domain:arrays [a, r, r, a, y, s]

9 participants