Logical DTypes #1262

mrocklin · 2017-02-12T01:26:23Z

tl;dr: Can XArray enable user-defined logical dtypes on top of physical NumPy arrays?

The Need for New Datatypes

NumPy's dtypes (int, float, etc.) are appropriate for many, but not all, cases. There are a variety of situations where we want numpy-like array semantics (broadcasting, memory layout) but with different element properties. Use cases include the following:

Datetimes with timezones
Categorical values (such as for land-use in climate data)
IPv4 or IPv6 addresses
...

Currently dtypes need to be added directly to the NumPy source code. This is a high barrier for many community members, requires general approval (there can be only one datetime implementation, which is both good and bad), and limits experimentation. There is value in supporting user-definable datatypes.

This is hard to do in NumPy

Ideally we would implement extensible user-defined dtypes within NumPy (and there may be long-standing plans to do just this). However, changing NumPy today is hard, both because it's hard to find developers who are comfortable operating at that level and because the backwards compatibility pressure on NumPy is large.

So as an alternative, we might consider lightly wrapping NumPy arrays in a new object that also includes extra dtype information. For example we might wrap an int64 numpy array with some datetime/timezone metadata to achieve a logical datetime array using a physical int64 array. We continue using NumPy as is but use this higher layer when necessary for more complex dtypes.
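A minimal sketch of that idea, assuming a fixed UTC offset as the only timezone metadata. The class name `TZDatetimeArray` and its attributes are hypothetical and are not part of NumPy or XArray:

```python
from datetime import datetime, timedelta, timezone

import numpy as np


class TZDatetimeArray:
    """Hypothetical logical datetime array (not an existing NumPy/XArray type)."""

    def __init__(self, seconds, utc_offset_hours=0):
        # Physical storage: plain int64 seconds since the Unix epoch (UTC).
        self.data = np.asarray(seconds, dtype=np.int64)
        # Logical metadata: a fixed UTC offset, assumed here for simplicity.
        self.utc_offset_hours = utc_offset_hours

    @property
    def shape(self):
        # Array semantics are borrowed directly from the physical array.
        return self.data.shape

    def __getitem__(self, key):
        picked = self.data[key]
        if np.ndim(picked) == 0:
            # Scalars are interpreted through the logical dtype.
            tz = timezone(timedelta(hours=self.utc_offset_hours))
            return datetime.fromtimestamp(int(picked), tz=tz)
        # Sub-arrays stay wrapped and keep their metadata.
        return TZDatetimeArray(picked, self.utc_offset_hours)


stamps = TZDatetimeArray([[0, 3600], [86400, 90000]], utc_offset_hours=1)
print(stamps.shape)
print(stamps[0, 0].isoformat())
```

Indexing is the only operation wrapped here; arithmetic, reductions and everything else would each need the same treatment.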

However "lightly wrapping" NumPy arrays is hard to do while still maintaining a closed system where all operations remain consistent (raw NumPy arrays inevitably leak through). Additionally, asking communities to switch to new libraries is socially quite challenging.
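To illustrate the leak, here is a sketch with a hypothetical `NaiveWrapper` class that exposes its data through `__array__` so that NumPy functions accept it. The class and its `logical_dtype` attribute are assumptions made for this example:

```python
import numpy as np


class NaiveWrapper:
    """Hypothetical wrapper carrying a logical dtype label next to the data."""

    def __init__(self, data, logical_dtype):
        self.data = np.asarray(data)
        self.logical_dtype = logical_dtype

    def __array__(self, dtype=None, copy=None):
        # Lets NumPy functions accept the wrapper, but only by unwrapping it.
        if dtype is None:
            return self.data
        return self.data.astype(dtype)


wrapped = NaiveWrapper([1, 2, 3], logical_dtype="category")
result = np.add(wrapped, 1)

# The result is a raw NumPy array; the logical dtype label is gone.
print(type(result).__name__)
print(hasattr(result, "logical_dtype"))
```

Every NumPy function that is not explicitly intercepted behaves this way, which is why keeping the system closed takes real effort.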

XArray is well placed

Fortunately XArray appears to have already solved some of these technical and social challenges. XArray lightly wraps NumPy arrays in a consistent manner. NumPy-like operations on XArrays remain XArrays. Interactions with other NumPy arrays are well defined. XArray has also attracted an active user/developer community and has attained general respect from the broader ecosystem. XArray seems to be hackable, benefits from a decently active community, and is not yet under as much backwards compatibility pressure.

So question: Is it sensible to add logical dtype information to XArray? Can this be done with only moderate effort and maintenance costs to the XArray project? If the answer is "yes, probably", then what is the right way to go about this?

shoyer · 2017-02-13T20:00:53Z

So question: Is it sensible to add logical dtype information to XArray?

Sure, this would be pretty sensible, especially if there is a nice story for wrapping upstream libraries providing alternate physical arrays such as dask.array and bolt (cc @freeman-lab).

There are certainly plenty of use-cases. A few more examples that would be particularly relevant for xarray:

a generic optional dtype for handling missing values (e.g., for integers)
a generic wrapper for 1D pandas dtypes into N-dimensional arrays
physical units (support for units #525)
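As a rough illustration of the first item, an optional integer dtype could pair an int64 array with a boolean validity mask. This is only a sketch; `OptionalIntArray` and its methods are hypothetical names, and only scalar or same-shape operands are handled:

```python
import numpy as np


class OptionalIntArray:
    """Hypothetical optional-integer array: int64 values plus a validity mask."""

    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int64)
        self.valid = np.asarray(valid, dtype=bool)
        if self.values.shape != self.valid.shape:
            raise ValueError("values and valid must have the same shape")

    def __add__(self, other):
        if isinstance(other, OptionalIntArray):
            # A result is valid only where both operands are valid.
            return OptionalIntArray(self.values + other.values,
                                    self.valid & other.valid)
        # Assumes `other` is a plain integer scalar.
        return OptionalIntArray(self.values + other, self.valid)

    def sum(self):
        # Missing entries are skipped rather than propagated.
        return int(self.values[self.valid].sum())

    def to_list(self):
        return [int(v) if ok else None
                for v, ok in zip(self.values.ravel(), self.valid.ravel())]


a = OptionalIntArray([1, 2, 3], [True, False, True])
b = OptionalIntArray([10, 20, 30], [True, True, False])
print((a + b).to_list())
print(a.sum())
```

The integers stay integers throughout, instead of being cast to float so that NaN can mark the gaps.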

Can this be done with only moderate effort and maintenance costs to the XArray project?

If we have a well defined interface that defines the right operations, my guess is indeed "yes, probably". See bolt-project/bolt#58 for a list of operations worth considering wrapping (obviously some of these, like arithmetic, are not needed for all dtypes).

If the answer is "yes, probably", then what is the right way to go about this?

I think it should start as a separate package to ensure a cleanly separated interface and because there are definitely other clients than xarray. We can quickly add it as an optional dependency to xarray for testing purposes.

I'm excited about this, but I'm unlikely to have much time available to work on this directly.

shoyer · 2017-02-13T20:18:42Z

One major API design challenge to solve with such a package (unresolved in NumPy) is how to handle dtype-specific methods/properties, e.g., year, month and day properties for a custom datetime dtype, or a .keys() method for a structured dtype (numpy/numpy#8615).

Fitting these into a generic NDArray type is not very natural. So perhaps the solution is to use subclasses (fixed for each dtype) with some very strict design constraints (e.g., only add new methods/properties, don't override functionality). The contract would still be that the dtype defines all valid extension points for overriding functionality.
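One way to picture that contract, as a sketch only: the names `LogicalArray`, `DayDType` and `DayArray` are hypothetical, and the dtype here stores int64 days since 1970-01-01.

```python
import numpy as np


class LogicalArray:
    """Hypothetical generic wrapper: physical data plus a logical dtype object."""

    def __init__(self, data, dtype):
        self.data = np.asarray(data)
        self.dtype = dtype

    def __getitem__(self, key):
        # type(self) keeps the dtype-specific subclass when slicing.
        return type(self)(self.data[key], self.dtype)


class DayDType:
    """Hypothetical logical dtype: integer days since 1970-01-01."""

    def year(self, data):
        # The dtype, not the array class, knows how to interpret the integers.
        days = data.astype(np.int64).astype("datetime64[D]")
        return days.astype("datetime64[Y]").astype(np.int64) + 1970


class DayArray(LogicalArray):
    """Subclass fixed for DayDType; adds a property and overrides nothing."""

    @property
    def year(self):
        return self.dtype.year(self.data)


days = DayArray([0, 365, 17000], DayDType())
print(days.year)
print(type(days[1:]).__name__)
```

The subclass only forwards to the dtype, so all behaviour that can be customised still lives in one place.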

mrocklin · 2017-02-13T20:41:50Z

To be clear, my original question was more ambitious. It may be interpreted as "should such a system be integrated directly into the XArray codebase?"

The answer of "No, it should be a standalone library that XArray wraps much in the same way it wraps around numpy or dask.array" is fine with me. Just asking. Benefits in favor would be that I suspect XArray already has mechanisms for coercion and such and it would reduce the number of total libraries. Argument against is that XArray is currently only focused on indexed and labeled arrays, and possibly it doesn't want to deal with the dtype mess. So, more broadly, the question is "What is the scope of XArray?"

shoyer · 2017-02-14T21:46:09Z

CC @pydata/xarray in case anyone else has opinions here

Benefits in favor would be that I suspect XArray already has mechanisms for coercion and such and it would reduce the number of total libraries.

We really don't have much existing machinery. Two things we have that might be useful:

a couple of mixin classes for easily defining custom array types. This could be a nice building block, but it's self-contained and only a few dozen lines of code.
some existing code for function dispatch to either numpy or dask.array. This is quite messy, somewhat xarray-specific and not worth copying.
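For readers unfamiliar with the mixin approach, NumPy itself ships a comparable building block, `numpy.lib.mixins.NDArrayOperatorsMixin`. The sketch below uses that NumPy mixin, not xarray's internal classes; `TaggedArray` and its `tag` attribute are hypothetical:

```python
import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin


class TaggedArray(NDArrayOperatorsMixin):
    """Hypothetical wrapper that re-wraps ufunc results so the tag survives."""

    def __init__(self, data, tag):
        self.data = np.asarray(data)
        self.tag = tag

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain single-output calls are handled in this sketch.
        if method != "__call__" or ufunc.nout != 1 or "out" in kwargs:
            return NotImplemented
        # Unwrap TaggedArray inputs, call the ufunc, then wrap the result.
        raw = [x.data if isinstance(x, TaggedArray) else x for x in inputs]
        return TaggedArray(ufunc(*raw, **kwargs), self.tag)


lengths = TaggedArray([1, 2, 3], tag="metres")
shifted = lengths + 1            # operator supplied by the mixin
doubled = np.multiply(2, lengths)  # direct ufunc call is intercepted too
print(type(shifted).__name__, shifted.tag, shifted.data)
print(type(doubled).__name__, doubled.tag, doubled.data)
```

The mixin supplies the arithmetic operators in terms of ufuncs, so a single `__array_ufunc__` method covers both `lengths + 1` and `np.multiply(2, lengths)`.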

Fewer libraries is definitely nice, but I see this as more of a secondary rather than primary goal.

More broadly, doing this project right will need strong separation of concerns from xarray's handling of labeled arrays. So there's not a huge amount to be gained by doing it in the same repository.

Argument against is that XArray is currently only focused on indexed and labeled arrays, and possibly it doesn't want to deal with the dtype mess.

I would love to see this project be successful and integrated with xarray. But better dtypes are tangential to our current focus, and project maintenance is already stretched pretty thin -- there's still a lot of core functionality to build out for manipulation of labeled arrays.

So I'm not comfortable with building this in xarray at this time. But I would be happy to revisit this decision when you have a design document, prototype and someone committed to developing and maintaining the module.

max-sixty · 2017-02-14T22:17:40Z

Worth considering pandas 2.0 discussions around types wesm/pandas2#24, and some of their rejected considerations, such as https://github.com/libdynd/libdynd

clarkfitzg · 2017-02-14T22:47:28Z

Other datatypes would be extremely useful. But I think it would be better to start as a separate project and build some confidence in a system first.

@MaximilianR I was just typing nearly the same thing... 👍

we might consider lightly wrapping NumPy arrays in a new object that also includes extra dtype information

Pandas seems to be moving away from this approach now.

Any other existing alternatives? datashape?

mhvk · 2017-02-22T14:11:14Z

Just as a heads-up, there is indeed the realisation within numpy that a subclassable dtype would be great -- see numpy/numpy#2899. If you have something like a design, I would certainly be interested (as maintainer of astropy's Quantity -- physical units should really be supported everywhere!), and I'd suggest sending a note to numpy-dev to get possible feedback/help.

stale · 2019-01-23T15:40:33Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

mhvk · 2019-01-23T16:01:02Z

See numpy/numpy#12630 for a numpy enhancement proposal that would end up making dtype more easily subclassable.

stale · 2020-12-25T11:49:53Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

keewis · 2020-12-26T14:25:57Z

There is now a series of NEPs starting with NEP-40 discussing this, so we should be able to wait until numpy releases a version that supports custom dtypes. Should we close this?

stale bot added the closed_as_stale label Jan 23, 2019

dcherian added upstream issue and removed closed_as_stale labels Jan 23, 2019

seberg mentioned this issue Jan 7, 2020

The epic dtype cleanup plan numpy/numpy#2899

Closed

2 tasks

stale bot added the stale label Dec 25, 2020

stale bot removed the stale label Dec 26, 2020
