Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve docstrings for unique() #20800

Merged
merged 1 commit into from
Jun 14, 2017
Merged

Improve docstrings for unique() #20800

merged 1 commit into from
Jun 14, 2017

Conversation

nalimilan
Copy link
Member

Remove any mention of ordering of the result, since for some custom array
types an order can be more natural, useful or efficient than the order of
appearance. Also fix the signature of the AbstractArray method, and add
a mention of isequal() to it. Use imperative form.

Returns an array containing only the unique elements of the iterable `itr`, in
the order that the first of each set of equivalent elements originally appears.
If `dim` is specified, returns unique regions of the array `itr` along `dim`.
Return an array containing only the unique elements of array `A`, as determined by
Copy link
Member

@Sacha0 Sacha0 Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A being identified as an AbstractArray in the docstring signature, perhaps nix the redundant "array" just before A?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it doesn't hurt, can still be useful if you didn't read the signature carefully.

the order that the first of each set of equivalent elements originally appears.
If `dim` is specified, returns unique regions of the array `itr` along `dim`.
Return an array containing only the unique elements of array `A`, as determined by
[`isequal`](@ref). If `dim` is specified, returns unique regions of `A` along
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the second sentence be imperative as well?

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Feb 25, 2017

I'm reluctant not to guarantee the ordering property since it is quite useful. cc @JeffBezanson

@ararslan ararslan added domain:collections Data structures holding multiple items, e.g. sets domain:docs This change adds or pertains to documentation labels Feb 25, 2017
@nalimilan
Copy link
Member Author

@StefanKarpinski Then I hadn't understood what you meant. I'm fine with keeping that guarantee (though I've yet to find a case where it's really useful), but then we need another solution to get the unique values ofPooledDataArray and CategoricalArray in their user-defined order. Cf. JuliaStats/DataArrays.jl#237.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Feb 26, 2017

How about this...


Elements in the collection returned by unique will be ordered in a manner associated with the input collection:

  1. For unordered collections like sets the result will also be a set with arbitrary ordering.
  2. For normal vectors, the associated ordering is the order in which elements appear in the collection, so accordingly the order of first appearance is preserved.
  3. For collections with an associated element ordering, like sorted vectors or categorical vectors, the result collection will be sorted according to that associated ordering. For sorted vectors, this means that the result will also be sorted and for categorical vectors, the elements (or levels) will be returned in their natural ordering as levels (not in the order of appearance in the input).

@nalimilan
Copy link
Member Author

That makes sense, but there's still something which bothers me: the order of appearance is also defined for categorical arrays, so if there are cases in which that ordering is indeed useful, not following it is a problem. Conversely, when working with categorical data (e.g. in StatsModels), we would like to get either the levels in the user-defined custom ordering (for categorical arrays), or the levels in their lexical ordering (for Vector{String}). Conflating these two definitions doesn't sound like a good API. (That's not an issue for sorted vectors, since the two definitions agree in that case.)

So I'd rather take a more consistent approach. If we think order of appearance can be useful, let's always follow it when it exists (i.e. all the time except for sets, where order is arbitrary); else let's just drop it completely.

Then, is the idea of a "natural ordering" of arrays something that should live in Base? If so, we could add a boolean argument to unique which would allow opting for the natural ordering instead of the order of appearance. This would default to sort. If we don't want this to live in Base, we can define a levels interface for this somewhere else.

@nalimilan
Copy link
Member Author

@StefanKarpinski Your final word? :-)

@nalimilan
Copy link
Member Author

FWIW Pandas's categorical type preserves the order of appearance with unique. So does R with factor.

Fix the signature of the AbstractArray method, add a mention of isequal()
and of ordering, and use imperative form.
@nalimilan
Copy link
Member Author

I've abandoned the idea of changing the definition of unique. We'll use levels for that.

Though, since we've spent some time reviewing the docstrings, I have pushed a commit with some minor changes, which should be uncontroversial.

@nalimilan nalimilan changed the title Rework docstrings for unique() Improve docstrings for unique() May 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:collections Data structures holding multiple items, e.g. sets domain:docs This change adds or pertains to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants