
Do not convert subclasses of ndarray unless required #1118

Closed
wants to merge 8 commits into from

Conversation

burnpanck
Contributor

By changing a single np.asarray to np.asanyarray, it becomes possible to use ndarray subclasses within DataArray and Dataset. No guarantee is made at this point that their behaviour is retained under all circumstances, but for some use-cases it will work.

In particular, this allows storing physical quantities represented using the Quantity subclass from the python-quantities package. In that sense, it is a partial fix for #525. Tests are included to highlight some of the things that do work, and some that don't. It does not work for coordinates when they are used as indexes, since pandas does not support ndarray subclasses. Most likely, it will also not work for any higher-level operation on the data, such as those involving np.concatenate. Thus, any user of this feature should remain cautious.
Nonetheless, for me this change provides enough added value that I'd consider it worthwhile to keep, assuming it does no harm elsewhere. It passes the tests on my machine (no dask though); let's see what Travis says.
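For readers unfamiliar with the distinction, here is a minimal sketch of why the asanyarray change matters, using a hypothetical `Unit` subclass as a stand-in for quantities.Quantity (not code from this PR):

```python
import numpy as np

# A minimal stand-in for quantities.Quantity: an ndarray subclass
# carrying extra metadata (here, a unit string).
class Unit(np.ndarray):
    def __new__(cls, data, unit=""):
        obj = np.asarray(data).view(cls)
        obj.unit = unit
        return obj

    def __array_finalize__(self, obj):
        # Propagate the unit through views and slices.
        self.unit = getattr(obj, "unit", "")

q = Unit([1.0, 2.0, 3.0], unit="J")

# np.asarray downcasts to a base ndarray, losing the unit...
print(type(np.asarray(q)).__name__)     # ndarray
# ...while np.asanyarray passes ndarray subclasses through untouched.
print(type(np.asanyarray(q)).__name__)  # Unit
print(np.asanyarray(q).unit)            # J
```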

I expect that astropy's units would behave similarly, though since I have not worked with them yet, I did not include any tests.

@shoyer
Member

shoyer commented Nov 15, 2016

This does seem pretty reasonable to me.

But is this actually running in CI? I think you need to add quantities to requirements.yml files to get this to run in Travis-CI:
https://github.com/pydata/xarray/tree/master/ci

@burnpanck
Contributor Author

You are right. There seem to be quite a number of varying requirements files. Should I add it to all of them? Also, I'm not very well versed in Travis-CI .yml files: what repo are the requirements served from? I think python-quantities is in Debian. Or should I just go for pip? There hasn't been a release in a long time and GitHub master has progressed quite a bit, but the released version should be compatible too.

@shoyer
Member

shoyer commented Nov 15, 2016

It either uses conda (by default) or pip:
https://github.com/conda/conda-env

Please do add it to everything, except perhaps the build with minimal dependencies (py27-min).

has_quantities = False

def requires_quantities(test):
    return test if has_quantities else unittest.skip('requires dask')(test)
Contributor

Need to update the message here.

@burnpanck
Contributor Author

Travis succeeds, though there are lots of failures in the environments with allowed failures. They look unrelated to me, but I find it hard to tell. AppVeyor doesn't seem to run the quantities tests, so I guess the requirements there are missing too. Where would I add requirements for AppVeyor?

@shoyer
Member

shoyer commented Nov 16, 2016

Indeed, the allowed failures are unrelated: #1109

See here for the AppVeyor installation: https://github.com/pydata/xarray/blob/master/appveyor.yml

It would be good to switch it to use conda env for consistency, but I haven't gotten around to that yet (#1127).

Member

@shoyer shoyer left a comment

Can you add a few more integration tests to verify that basic operations work with ndarray subclasses, even if they don't preserve the subclass nature?

For example, it would be good to add such a test for xarray.concat.

self.y = np.arange(20)
self.xp = np.arange(10) * pq.J
self.v = np.arange(10*20).reshape(10,20) * pq.V
self.da = DataArray(self.v,dims=['x','y'],coords=dict(x=self.x,y=self.y,xp=(['x'],self.xp)))
Member

Please run a PEP8 check on this code -- we try to stick to the 80 character line width and use spaces after commas.

Contributor Author

Sure.


@unittest.expectedFailure
def test_sel(self):
    self.assertEqualWUnits(self.da.sel(y=self.y[0]).values, self.v[:, 0])
Member

Does this indicate that indexing does not preserve units?

I think we should be able to fix that, see NumpyIndexingAdapter in indexing.py.
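For context, plain numpy indexing already preserves subclasses, so the unit loss presumably comes from an explicit conversion somewhere in xarray's indexing path rather than from numpy itself. A quick sketch using a hypothetical `Tagged` subclass as a stand-in for Quantity:

```python
import numpy as np

# Hypothetical minimal ndarray subclass standing in for quantities.Quantity.
class Tagged(np.ndarray):
    pass

t = np.arange(10).view(Tagged)

# Both basic slicing and fancy indexing keep the subclass in plain numpy.
print(type(t[2:5]).__name__)        # Tagged
print(type(t[[0, 3, 7]]).__name__)  # Tagged
```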

Contributor Author

Apparently yes. I didn't get around to actually looking into this.


@unittest.expectedFailure
def test_mean(self):
    self.assertEqualWUnits(self.da.mean('x').values, self.v.mean(0))
Member

Let's add a comment here noting that xarray uses a wrapped version of np.nanmean, and I don't think there's an easy way to make that work with duck-typing.

Contributor Author

Ok

Contributor Author

Actually, I don't fully understand this yet: bare Quantities do work fine with nanmean. Are you saying that mean is wrapped in a more complicated way than other reduction methods, or would your comment apply to any reduction? I quickly tried finding the relevant code and converged at DataArrayGroupBy.reduce -> .apply, which seems to be the case for most reductions. Is that what is called when I do .mean() on a DataArray?

Member

Oh, interesting -- I did not realize that Quantities worked with np.nanmean.

We handle our wrapping of nanmean and other aggregation functions over in ops.py: https://github.com/pydata/xarray/blob/master/xarray/core/ops.py#L305

This dispatches to bottleneck.nanmean if available (which is much faster), falling back to use numpy.nanmean if necessary. It's quite possible that bottleneck doesn't handle subclasses even though NumPy does. If that's the case, then I think you could simply adjust the logic on this line to always use numpy instead of bottleneck for subclasses (e.g., if isinstance(values, np.ndarray) and type(values) != np.ndarray):

if isinstance(axis, tuple) or not values.dtype.isnative or no_bottleneck:

def test_units_in_data_and_coords(self):
    da = self.da
    self.assertEqualWUnits(da.xp.data, self.xp)
    self.assertEqualWUnits(da.data, self.v)
Member

Can we adjust things so that DataArray.values still always returns a base numpy.ndarray, but DataArray.data can return a subclass? That would be consistent with our current behavior for handling duck-type arrays with dask.

Contributor Author

Fine, but I am not so happy about this.

Member

can you elaborate? :)

Contributor Author

The question is what exact guarantee .values wants to give beyond .data. For me, units are an inherent part of the data (see my comment in #525). My view here is that the behaviour of an instance of Quantity is indeed very close to that of the bare ndarray, and thus it shouldn't be treated any differently. The problem, of course, is that this will not necessarily be the case for a general subclass. So the question is where to draw the line between .values and .data: should it be bare ndarray vs. everything else, or ndarray subclasses vs. things that behave like arrays but aren't? The best answer will depend both on what is out there in terms of subclasses and array-like objects, and on what code using xarray expects to be able to do with what it gets from .values. I don't want to judge that, because I don't think I know enough about the ecosystem xarray lives in (e.g. dask: is a dask array an ndarray subclass or not?). In any case, if you think .values should be bare ndarray only, then I'll respect that.

Member

dask arrays are not ndarray subclasses, so agree, it's not entirely the same.

Either way, some users will need to wrap their calls to .values or .data with np.asanyarray() (to ensure the result is an ndarray subclass and not possibly a dask array) or np.asarray() (to ensure the result is a base ndarray). Actually, it's probably worth checking if np.asanyarray(data_array) does the right thing in your patch -- possibly DataArray.__array__ needs to be updated to call np.asanyarray.

Just speaking personally -- and I'm happy for others to chime in with their opinions -- I find a shortcut for np.asarray more essential than np.asanyarray. There are lots of libraries not written to handle anything other than base numpy array objects, and being able to convert into something that is guaranteed to work is valuable.
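Whether np.asanyarray "does the right thing" on a wrapper object comes down to its __array__ method. A minimal sketch with a hypothetical `Tagged` subclass and `Container` wrapper (not xarray's actual classes):

```python
import numpy as np

# Hypothetical minimal ndarray subclass standing in for quantities.Quantity.
class Tagged(np.ndarray):
    pass

class Container:
    """Hypothetical wrapper standing in for DataArray; not xarray code."""
    def __init__(self, values):
        self._values = np.asarray(values).view(Tagged)

    def __array__(self, dtype=None, copy=None):
        # Hand back the subclass instance; the caller decides whether
        # to keep it (asanyarray) or downcast it (asarray).
        return self._values if dtype is None else self._values.astype(dtype)

c = Container([1, 2, 3])
# np.asarray forces a base-class result even though __array__
# returned a subclass instance...
print(type(np.asarray(c)).__name__)     # ndarray
# ...while np.asanyarray passes the subclass through.
print(type(np.asanyarray(c)).__name__)  # Tagged
```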

Contributor Author

Ok, so if you say .values is really just meant to be a shortcut over asarray or asanyarray, then I agree the former is more important.

@@ -0,0 +1,64 @@
import numpy as np
Member

Can you add a docstring explaining that this module uses quantities as a concrete example, but really it's just checking for subclass compatibility?

Also, I might call it something more generic, like test_numpy_subclasses.py.

Contributor Author

I see. I was rather thinking of this test-file as specialised towards physical units handling with quantities (i.e. as a regression-test database geared towards #525). I'd rather make a separate test-case for general subclasses, maybe with a dummy subclass.

Member

Sure, test_units_subclass.py also seems reasonable.

@@ -185,7 +185,7 @@ def _as_array_or_item(data):

TODO: remove this (replace with np.asarray) once these issues are fixed
"""
-    data = np.asarray(data)
+    data = np.asanyarray(data)
Member

This method should probably be renamed, e.g., _as_any_array_or_item, just to make it super clear what it's doing (also update the TODO in the docstring)

Contributor Author

Ok

def test_arithmetics(self):
    da = self.da
    f = DataArray(np.arange(10 * 20).reshape(10, 20) * pq.A,
                  dims=['x', 'y'], coords=dict(x=self.x, y=self.y))
    self.assertEqualWUnits((da * f).data, da.data * f.data)
Member

What about arithmetic that requires broadcasting? e.g., between arrays with different dimensions and/or different dimension order?

That will require ensuring that ops.transpose and ops.broadcast_to work on subclasses (e.g., xarray.Variable.transpose and xarray.Variable.expand_dims). I think the former should work already because of NumPy, but broadcast_to will probably require adding subok=True.
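A quick sketch of what subok=True changes, using a hypothetical `Tagged` subclass rather than an actual Quantity:

```python
import numpy as np

# Hypothetical minimal ndarray subclass standing in for quantities.Quantity.
class Tagged(np.ndarray):
    pass

t = np.arange(6).reshape(2, 3).view(Tagged)

# Transpose goes through numpy's regular view machinery and keeps
# the subclass...
print(type(t.T).__name__)  # Tagged
# ...but broadcast_to downcasts to a base ndarray unless subok=True
# is passed.
print(type(np.broadcast_to(t, (4, 2, 3))).__name__)              # ndarray
print(type(np.broadcast_to(t, (4, 2, 3), subok=True)).__name__)  # Tagged
```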

Contributor Author

I'll have a go.

def test_units_in_indexes(self):
    """
    Indexes are borrowed from Pandas, and Pandas does not support units.
    Therefore, we currently don't intend to support units on indexes either.
Member

usually we use comments instead of docstrings on test methods

Contributor Author

Why? I do not see how docstrings could be worse than comments under any circumstance... On the other hand, docstrings allow for introspection from unit-testing tools. But then, I don't really care.

Contributor

I agree, not important, but I'm really surprised by this.

Member

This is the convention for NumPy. To be honest I'm not sure why, but I recall being told so at one point by @charris.

I see now that this isn't followed for pandas, so I guess we can break it here. I don't feel too strongly either way.

Member

If I remember well, the test docstrings were changing the way nose printed out the test names at runtime, which was a bit annoying when not all tests had a consistent docstring. I don't know about pytest though.

Contributor

pytest doesn't seem to, with either -v or -vv.

@burnpanck
Contributor Author

With the new changes, this will now conflict with #1128, though that should be easy to resolve.

@shoyer
Member

shoyer commented Dec 23, 2016

#1128 has been merged -- can you try rebasing? Thanks

@lamorton

Is there anything I can do to help move this forward? I'd really like to have this capability.

@shoyer
Member

shoyer commented Mar 22, 2017

@lamorton This branch has conflicts with master (due to #1128) that need to be resolved with a merge or rebase. But otherwise, I think this is nearly ready to go in. Obviously it's not a complete fix, but it might be enough to be useful.

If you want to take a shot at resolving merge conflicts in a new PR, I think that would be welcome (I doubt @burnpanck is currently working on it).

@max-sixty
Collaborator

Would anyone like to take this up? I was going to close as stale, but it's fairly close!

@keewis
Collaborator

keewis commented May 12, 2019

Is this still relevant? If so, I'd be willing to take over. However, I am not really familiar with GitHub yet: would I open a new pull request for that?

@max-sixty
Collaborator

Hi @keewis - yes!

You can start a new PR by branching off this PR (https://help.github.com/en/articles/checking-out-pull-requests-locally). I think you'll find GH easy to use - feel free to ping here with any issues and we'll be happy to help

@keewis
Collaborator

keewis commented Dec 25, 2019

closing for the same reason as #2956

@keewis keewis closed this Dec 25, 2019