
Do not convert subclasses of ndarray unless required #1118

Closed
wants to merge 8 commits into from

Conversation

burnpanck
Contributor

By changing a single np.asarray to np.asanyarray, it becomes possible to use ndarray subclasses within DataArray and Dataset. No guarantee is made at this point that their behaviour is retained under all circumstances, but for some use-cases it will work.

In particular, this allows storing physical quantities represented using the Quantity subclass from the python-quantities package. In that sense, it is a partial fix for #525. Tests are included to highlight some of the things that do work, and some that don't. It does not work for coordinates when they are used as indexes, since pandas does not support ndarray subclasses. Most likely, it will also not work for any higher-level operation on the data, such as those involving np.concatenate. Thus, any user of this feature should remain cautious.
Nonetheless, for me this change provides enough added value that I'd consider it worthwhile to keep, assuming it does no harm elsewhere. It passes the tests on my machine (no dask though); let's see what Travis says.
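For readers unfamiliar with the distinction, here is a minimal sketch of why the asanyarray change matters, using a hypothetical `Unit` subclass as a stand-in for quantities.Quantity (not code from this PR):

```python
import numpy as np

# A minimal stand-in for quantities.Quantity: an ndarray subclass
# carrying extra metadata (here, a unit string).
class Unit(np.ndarray):
    def __new__(cls, data, unit=""):
        obj = np.asarray(data).view(cls)
        obj.unit = unit
        return obj

    def __array_finalize__(self, obj):
        # Propagate the unit through views and slices.
        self.unit = getattr(obj, "unit", "")

q = Unit([1.0, 2.0, 3.0], unit="J")

# np.asarray downcasts to a base ndarray, losing the unit...
print(type(np.asarray(q)).__name__)     # ndarray
# ...while np.asanyarray passes ndarray subclasses through untouched.
print(type(np.asanyarray(q)).__name__)  # Unit
print(np.asanyarray(q).unit)            # J
```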

I expect that astropy's units would behave similarly, though since I have not worked with them yet, I did not include any tests.

@shoyer
Member

shoyer commented Nov 15, 2016

This does seem pretty reasonable to me.

But is this actually running in CI? I think you need to add quantities to requirements.yml files to get this to run in Travis-CI:
https://github.com/pydata/xarray/tree/master/ci

@burnpanck
Contributor Author

You are right. There seem to be quite a number of varying requirements files. Should I add it to all of them? Also, I'm not very well versed in Travis-CI .yml files: what repo are the requirements served from? I think python-quantities is in Debian. Or should I just go for pip? There hasn't been a release in a long time and GitHub master has progressed quite a bit, but the released version should be compatible too.

@shoyer
Member

shoyer commented Nov 15, 2016

It either uses conda (by default) or pip:
https://github.com/conda/conda-env

Please do add it to everything, except perhaps the build with minimal dependencies (py27-min).

has_quantities = False

def requires_quantities(test):
    return test if has_quantities else unittest.skip('requires dask')(test)
Contributor

Need to update the message here.

@burnpanck
Contributor Author

Travis succeeds, though there are lots of failures in the environments with allowed failures. They look unrelated to me, but I find it hard to tell. AppVeyor doesn't seem to run the quantities tests, so I guess the requirements there are missing too. Where would I add requirements for AppVeyor?

@shoyer
Member

shoyer commented Nov 16, 2016

Indeed, the allowed failures are unrelated: #1109

See here for the AppVeyor installation: https://github.com/pydata/xarray/blob/master/appveyor.yml

It would be good to switch it to use conda env for consistency, but I haven't gotten around to that yet (#1127).

Member

@shoyer shoyer left a comment

Can you add a few more integration tests to verify that basic operations work with ndarray subclasses, even if they don't preserve the subclass nature?

For example, it would be good to add such a test for xarray.concat.

self.y = np.arange(20)
self.xp = np.arange(10) * pq.J
self.v = np.arange(10*20).reshape(10,20) * pq.V
self.da = DataArray(self.v,dims=['x','y'],coords=dict(x=self.x,y=self.y,xp=(['x'],self.xp)))
Member

Please run a PEP8 check on this code -- we try to stick to the 80 character line width and use spaces after commas.

Contributor Author

Sure.


@unittest.expectedFailure
def test_sel(self):
    self.assertEqualWUnits(self.da.sel(y=self.y[0]).values, self.v[:, 0])
Member

Does this indicate that indexing does not preserve units?

I think we should be able to fix that, see NumpyIndexingAdapter in indexing.py.
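For context, plain numpy indexing already preserves subclasses, so the unit loss presumably comes from an explicit conversion somewhere in xarray's indexing path rather than from numpy itself. A quick sketch using a hypothetical `Tagged` subclass as a stand-in for Quantity:

```python
import numpy as np

# Hypothetical minimal ndarray subclass standing in for quantities.Quantity.
class Tagged(np.ndarray):
    pass

t = np.arange(10).view(Tagged)

# Both basic slicing and fancy indexing keep the subclass in plain numpy.
print(type(t[2:5]).__name__)        # Tagged
print(type(t[[0, 3, 7]]).__name__)  # Tagged
```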

Contributor Author

Apparently yes. I didn't get around to actually looking into this.


@unittest.expectedFailure
def test_mean(self):
    self.assertEqualWUnits(self.da.mean('x').values, self.v.mean(0))
Member

Let's add a comment here noting that xarray uses a wrapped version of np.nanmean, and I don't think there's an easy way to make that work with duck-typing.

Contributor Author

Ok

Contributor Author

Actually, I don't fully understand this yet: bare Quantities do work fine with nanmean. Are you saying that mean is wrapped in a more complicated way than other reduction methods, or would your comment apply to any reduction? I quickly tried finding the relevant code and converged at DataArrayGroupBy.reduce -> .apply, which seems to be the case for most reductions. Is that what is called when I do .mean() on a DataArray?

Member

Oh, interesting -- I did not realize that Quantities worked with np.nanmean.

We handle our wrapping of nanmean and other aggregation functions over in ops.py: https://github.com/pydata/xarray/blob/master/xarray/core/ops.py#L305

This dispatches to bottleneck.nanmean if available (which is much faster), falling back to use numpy.nanmean if necessary. It's quite possible that bottleneck doesn't handle subclasses even though NumPy does. If that's the case, then I think you could simply adjust the logic on this line to always use numpy instead of bottleneck for subclasses (e.g., if isinstance(values, np.ndarray) and type(values) != np.ndarray):

if isinstance(axis, tuple) or not values.dtype.isnative or no_bottleneck:

def test_units_in_data_and_coords(self):
    da = self.da
    self.assertEqualWUnits(da.xp.data, self.xp)
    self.assertEqualWUnits(da.data, self.v)
Member

Can we adjust things so that DataArray.values still always returns a base numpy.ndarray, but DataArray.data can return a subclass? That would be consistent with our current behavior for handling duck-type arrays with dask.

Contributor Author

Fine, but I am not so happy about this.

Member

can you elaborate? :)

Contributor Author

The question is what exact guarantee .values wants to give beyond .data. For me, units are an inherent part of the data (see my comment in #525). My view here is that the behaviour of an instance of Quantity is indeed very close to that of the bare ndarray, and thus it shouldn't be treated any differently. The problem, of course, is that this will not necessarily be the case for a general subclass. So the question is where to draw the line between .values and .data: should it be bare ndarray vs. everything else, or ndarray subclasses vs. things that behave like arrays but aren't? The best answer will depend both on what is out there in terms of subclasses and array-like objects, and on what code using xarray expects to be able to do with what it gets from .values. I don't want to judge that, because I don't think I know enough about the ecosystem xarray lives in (e.g. dask: is a dask array an ndarray subclass or not?). In any case, if you think .values should be bare ndarray only, then I'll respect that.

Member

dask arrays are not ndarray subclasses, so agree, it's not entirely the same.

Either way, some users will need to wrap their calls to .values or .data with np.asanyarray() (to ensure the result is an ndarray subclass and not possibly a dask array) or np.asarray() (to ensure the result is a base ndarray). Actually, it's probably worth checking if np.asanyarray(data_array) does the right thing in your patch -- possibly DataArray.__array__ needs to be updated to call np.asanyarray.

Just speaking personally -- and I'm happy for others to chime in with their opinions -- I find a shortcut for np.asarray more essential than np.asanyarray. There are lots of libraries not written to handle anything other than base numpy array objects, and being able to convert into something that is guaranteed to work is valuable.
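Whether np.asanyarray "does the right thing" on a wrapper object comes down to its __array__ method. A minimal sketch with a hypothetical `Tagged` subclass and `Container` wrapper (not xarray's actual classes):

```python
import numpy as np

# Hypothetical minimal ndarray subclass standing in for quantities.Quantity.
class Tagged(np.ndarray):
    pass

class Container:
    """Hypothetical wrapper standing in for DataArray; not xarray code."""
    def __init__(self, values):
        self._values = np.asarray(values).view(Tagged)

    def __array__(self, dtype=None, copy=None):
        # Hand back the subclass instance; the caller decides whether
        # to keep it (asanyarray) or downcast it (asarray).
        return self._values if dtype is None else self._values.astype(dtype)

c = Container([1, 2, 3])
# np.asarray forces a base-class result even though __array__
# returned a subclass instance...
print(type(np.asarray(c)).__name__)     # ndarray
# ...while np.asanyarray passes the subclass through.
print(type(np.asanyarray(c)).__name__)  # Tagged
```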

Contributor Author

Ok, so if you say .values is really just meant to be a shortcut over asarray or asanyarray, then I agree the former is more important.

@@ -0,0 +1,64 @@
import numpy as np
Member

Can you add a docstring explaining that this module uses quantities as a concrete example, but really it's just checking for subclass compatibility?

Also, I might call it something more generic, like test_numpy_subclasses.py.

Contributor Author

I see. I was rather thinking of this test-file as specialised towards physical units handling with quantities (i.e. as a regression-test database geared towards #525). I'd rather make a separate test-case for general subclasses, maybe with a dummy subclass.

Member

Sure, test_units_subclass.py also seems reasonable.

@@ -185,7 +185,7 @@ def _as_array_or_item(data):

TODO: remove this (replace with np.asarray) once these issues are fixed
"""
-    data = np.asarray(data)
+    data = np.asanyarray(data)
Member

This method should probably be renamed, e.g., _as_any_array_or_item, just to make it super clear what it's doing (also update the TODO in the docstring)

Contributor Author

Ok

def test_arithmetics(self):
    da = self.da
    f = DataArray(np.arange(10 * 20).reshape(10, 20) * pq.A,
                  dims=['x', 'y'], coords=dict(x=self.x, y=self.y))
    self.assertEqualWUnits((da * f).data, da.data * f.data)
Member

What about arithmetic that requires broadcasting? e.g., between arrays with different dimensions and/or different dimension order?

That will require ensuring that ops.transpose and ops.broadcast_to work on subclasses (e.g., xarray.Variable.transpose and xarray.Variable.expand_dims). I think the former should work already because of NumPy, but broadcast_to will probably require adding subok=True.
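A quick sketch of what subok=True changes, using a hypothetical `Tagged` subclass rather than an actual Quantity:

```python
import numpy as np

# Hypothetical minimal ndarray subclass standing in for quantities.Quantity.
class Tagged(np.ndarray):
    pass

t = np.arange(6).reshape(2, 3).view(Tagged)

# Transpose goes through numpy's regular view machinery and keeps
# the subclass...
print(type(t.T).__name__)  # Tagged
# ...but broadcast_to downcasts to a base ndarray unless subok=True
# is passed.
print(type(np.broadcast_to(t, (4, 2, 3))).__name__)              # ndarray
print(type(np.broadcast_to(t, (4, 2, 3), subok=True)).__name__)  # Tagged
```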

Contributor Author

I'll have a go.

def test_units_in_indexes(self):
    """
    Indexes are borrowed from Pandas, and Pandas does not support units.
    Therefore, we currently don't intend to support units on indexes either.
Member

usually we use comments instead of docstrings on test methods

Contributor Author

Why? I do not see how docstrings could be worse than comments under any circumstance... On the other hand, docstrings allow for introspection from unit-testing tools. But then, I don't really care.

Contributor

I agree, not important, but I'm really surprised by this.

Member

This is the convention for NumPy. To be honest I'm not sure why, but I recall being told so at one point by @charris.

I see now that this isn't followed for pandas, so I guess we can break it here. I don't feel too strongly either way.

Member

If I remember well, the test docstrings were changing the way nose printed out the test names at runtime, which was a bit annoying when not all tests had a consistent docstring. I don't know about pytest though.

Contributor

pytest doesn't seem to, with either -v or -vv.

@burnpanck
Contributor Author

With the new changes, this will now conflict with #1128, though that should be easy to resolve.

@shoyer
Member

shoyer commented Dec 23, 2016

#1128 has been merged -- can you try rebasing? Thanks

@lamorton

Is there anything I can do to help move this forward? I'd really like to have this capability.

@shoyer
Member

shoyer commented Mar 22, 2017

@lamorton This branch has conflicts with master (due to #1128) that need to be resolved with a merge or rebase. But otherwise, I think this is nearly ready to go in. Obviously it's not a complete fix, but it might be enough to be useful.

If you want to take a shot at resolving merge conflicts in a new PR, I think that would be welcome (I doubt @burnpanck is currently working on it).

@max-sixty
Collaborator

Would anyone like to take this up? I was going to close as stale, but it's fairly close!

@keewis
Collaborator

keewis commented May 12, 2019

Is this still relevant? If so, I'd be willing to take over. However, I am not really familiar with GitHub yet: would I open a new pull request for that?

@max-sixty
Collaborator

Hi @keewis - yes!

You can start a new PR by branching off this PR (https://help.github.com/en/articles/checking-out-pull-requests-locally). I think you'll find GH easy to use - feel free to ping here with any issues and we'll be happy to help

@keewis
Collaborator

keewis commented Dec 25, 2019

closing for the same reason as #2956

@keewis keewis closed this Dec 25, 2019