Stateful user-defined accessors #3268

crusaderky · 2019-08-27T09:54:28Z

If anybody decorates a stateful class with @register_dataarray_accessor or @register_dataset_accessor, the instance will lose its state on any method that invokes _to_temp_dataset, as well as on a shallow copy.

In [1]: @xarray.register_dataarray_accessor('foo') 
   ...: class Foo: 
   ...:     def __init__(self, obj): 
   ...:         self.obj = obj 
   ...:         self.x = 1 
   ...:          
   ...:                                                                                                                                                                                                                                                        

In [2]: a = xarray.DataArray()                                                                                                                                                                                                                                 

In [3]: a.foo.x                                                                                                                                                                                                                                                
Out[3]: 1

In [4]: a.foo.x = 2                                                                                                                                                                                                                                            

In [5]: a.foo.x                                                                                                                                                                                                                                               
Out[5]: 2

In [6]: a.roll().foo.x                                                                                                                                                                                                                                        
Out[6]: 1

In [7]: a.copy(deep=False).foo.x                                                                                                                                                                                                                              
Out[7]: 1

While in the case of _to_temp_dataset it could be possible to spend (substantial) effort to retain the state, on the case of copy() it's impossible without modifying the accessor duck API, as one would need to tamper with the accessor instance in place and modify the pointer back to the DataArray/Dataset.

This issue is so glaring that it makes me strongly suspect that nobody saves any state in accessor classes. This kind of use would also be problematic in practical terms, as the accessor object would have a hard time realising when its own state is no longer coherent with the referenced DataArray/Dataset.

This design also carries the problem that it introduces a circular reference in the DataArray/Dataset. This means that, after someone invokes an accessor method on his DataArray/Dataset, then the whole object - including the numpy buffers! - won't be instantly collected when it's dereferenced by the user, and it will have to instead wait for the next gc pass. This could cause huge increases in RAM usage overnight in a user application, which would be very hard to logically link to a change that just added a custom method.

Finally, with #3250, this statefulness forces us to increase the RAM usage of all datasets and dataarrays by an extra slot, for all users, even if this feature is quite niche.

Proposed solution

Get rid of accessor caching altogether, and just recreate the accessor object from scratch every time it is invoked. In the documentation, clarify that the __init__ method should not perform anything computationally intensive.

The text was updated successfully, but these errors were encountered:

crusaderky · 2019-08-27T10:24:38Z

Demonstation on the circular reference issue:

import gc
import weakref
import xarray


class C:
    pass

@xarray.register_dataset_accessor('foo')
class Foo:
    def __init__(self, obj):
        self.obj = obj

ds = xarray.Dataset()
w = weakref.ref(ds)
print("No accessor, in scope:", w() is not None)
del ds
print("No accessor, descoped:", w() is not None)

ds = xarray.Dataset()
ds.foo
w = weakref.ref(ds)
print("with accessor, in scope:", w() is not None)
del ds
print("with accessor, descoped:", w() is not None)
gc.collect()
print("with accessor, after gc pass:", w() is not None)

Output:

No accessor, in scope: True
No accessor, descoped: False
with accessor, in scope: True
with accessor, descoped: True
with accessor, after gc pass: False

fmaussion · 2019-08-27T11:03:47Z

Interesting, thanks!

As an accessor maintainer, I can ensure that at least one accessor implementation is storing state ;-). But this state is based on the xarray object itself: for example, we derive georeferencing information and store the names of the coordinate variabless we know are going to be useful to us later. That is, every new call to __init__ based on a modified object will trigger a new parsing, and we don't come into the situation you describe above.

Getting rid of the caching logic would mean some performance loss to us, yes, but I don't know if it's "worse" than the circular reference issue you describe or not.

crusaderky · 2019-08-27T11:16:48Z

@fmaussion could you change your accessor code to store its state in Dataset.attrs instead?

crusaderky · 2019-08-27T11:20:50Z

store the names of the coordinate variabless we know are going to be useful to us later.
So you work on the assumption that no new potentially useful coords will be added after the first invocation of your accessor? Or do you have logic that invalidates your cache every time the state of the coords changes?

crusaderky · 2019-08-27T11:30:22Z

The circular reference issue could also be worked around in a user-friendly way by having the decorator automatically add methods to the decorated class, copying the design of @dataclass:

import weakref


class C:
    def __init__(self, owner):
        self._owner = weakref.ref(owner)
        if hasattr(self, "__post_init__"):
            self.__post_init__()

    @property
    def owner(self):
        out = self._owner()
        if out is None:
            raise AttributeError("Orphaned accessor")
        return out

This would also allow for shallow copies to change the pointer.

fmaussion · 2019-08-27T12:00:02Z

could you change your accessor code to store its state in Dataset.attrs instead?

Yes, although attrs would be bad as well because they are lost in many operations. The current "best practice" would be to use coords for this, as in #1271 (comment)

I never bothered to get this done (my library is quite niche and not mean to become widely used), but if we remove the caching mechanism in xarray, it would be one more incentive to make the switch.

So you work on the assumption that no new potentially useful coords will be added after the first invocation of your accessor?

Yes. In almost all use cases I know of this won't happen. Typically, the users have to create (or open) an xarray object that salem understands before calling the accessor anyway.

Or do you have logic that invalidates your cache every time the state of the coords changes?

No. I didn't think about that before today ;-). But you are right: if users of salem's accessor change the coordinates after calling the accessor, then bad things might happen. Experience shows that this rarely happens (never happened to me), but I can see how this can fire back.

Altogether, these are good arguments to remove the caching mechanism I believe.

shoyer · 2019-08-27T16:22:44Z

It isn't just methods that use _to_temp_dataset() that result in losing the accessor state -- any operation that creates a new DataArray will (by design) lose the accessor, which don't get propagated in any operations.

Accessors are also not preserved when indexing a DataArray out of a Dataset (#3205), cc @djhoese.

I had not contemplated the issue of circular references, which I agree is not ideal. If we had realized that when creating accessors in the first place we might have chosen a different design, but there are a number of users who rely upon it.

The circular reference issue could also be worked around in a user-friendly way by having the decorator automatically add methods to the decorated class

I like the look of this solution. It will still require users to update their code to avoid circular references (e.g., by removing their own __init__ method), but it will make the default behavior more sane.

dopplershift · 2019-08-27T17:32:32Z

I don't mind needing to update our accessor code. My only request is don't have a version that suddenly breaks it so that we only work on the old version or the new version. 😉

gmaze · 2019-10-04T16:07:21Z

Hi all,
I recently encountered an issue that look like this with accessor, but not sure.
Here is a peace of code that reproduces the issue.

Starting from a class with the core of the code and an accessor to implement the user API:

import xarray

class BaseEstimator():
    def fit(self, this_ds, x=None):
        # Do something with this_ds:
        x = x**2
        # and create a new array with results:
        da = xarray.DataArray(x).rename('fit_data')
        # Return results:
        return da
    
    def transform(self, this_ds, **kw):
        # Do something with this_ds:
        val = kw['y'] + this_ds['fit_data']
        # and create a new array with results:
        da = xarray.DataArray(val).rename('trf_data')
        # Return results:
        return da

@xarray.register_dataset_accessor('my_accessor')
class Foo:
    def __init__(self, obj):
        self.obj = obj
        self.added = list()
        
    def add(self, da):
        self.obj[da.name] = da
        self.added.append(da.name)
        return self.obj
    
    def clean(self):
        for v in self.added:
            self.obj = self.obj.drop(v)
            self.added.remove(v)
        return self.obj    
    
    def fit(self, estimator, **kw):
        this_da = estimator.fit(self, **kw)
        return self.add(this_da)
    
    def transform(self, estimator, **kw):
        this_da = estimator.transform(self.obj, **kw)
        return self.add(this_da)

Now if we consider this workflow:

ds = xarray.Dataset()
ds['ext_data'] = xarray.DataArray(1.)

my_estimator = BaseEstimator()
ds = ds.my_accessor.fit(my_estimator, x=2.)

print("Before clean:")
print("xr.DataSet var  :", list(ds.data_vars))
print("accessor.obj var:", list(ds.my_accessor.obj.data_vars))

print("\nAfter clean:")
# ds.my_accessor.clean() # This does nothing to ds but clean the accessor.obj
# ds = ds.my_accessor.clean() # Cleaning ok for both ds and accessor.obj
ds_clean = ds.my_accessor.clean() # Cleaning ok on new ds, does nothing to ds as expected but clean in accessor.obj
print("xr.DataSet var    :", list(ds.data_vars))
print("accessor.obj var  :", list(ds.my_accessor.obj.data_vars))
print("Cleaned xr.DataSet var:", list(ds_clean.data_vars))

We have the following output:

Before clean:
xr.DataSet var  : ['ext_data', 'fit_data']
accessor.obj var: ['ext_data', 'fit_data']

After clean:
xr.DataSet var    : ['ext_data', 'fit_data']
accessor.obj var  : ['ext_data']
Cleaned xr.DataSet var: ['ext_data']

The issue is clear here: the base space dataset has the 'fit_data' variable but not the accessor object: they've been "disconnected" and it's not apparent to users.

So if users later proceed to run the "transform":

ds.my_accessor.transform(my_estimator, y=2.)

they get an KeyError raised because the 'fit_data' is not in the accessor, although it still appears on the list of the ds variables, which is more than confusing.

Sorry for this long post, I'm not sure it's relevant to this issue but it seems so to me.
I don't see a solution to this from the accessor developer side, except for not "interfering" with the content of the accessed object.

crusaderky · 2019-10-04T22:14:38Z

@gmaze Dataset.drop does not mutate the state of the original object, so it's conceptually wrong for your clean() method to mutate the accessor state too. It should be:

    def clean(self):
        return self.obj.drop(self.added)

The new dataset returned will have no accessor cache, and will recreate an instance on the fly on first access.

gmaze · 2019-10-07T19:49:41Z

@crusaderky thanks for the explanation, that's a solution to my pb.

Although I understand that since accessor will be created from scratch, a dataset copy won't propagate the accessor properties (in this case the list of added variables):

ds = xarray.Dataset()
ds['ext_data'] = xarray.DataArray(1.)

my_estimator = BaseEstimator() # With "clean" method from @crusaderky 
ds.my_accessor.fit(my_estimator, x=2.)
ds.my_accessor.transform(my_estimator, y=3.)

ds2 = ds.copy()

ds = ds.my_accessor.clean()
ds2 = ds2.my_accessor.clean()

print(ds.data_vars)
print(ds2.data_vars)

gives:

Data variables:
    ext_data  float64 1.0
Data variables:
    ext_data  float64 1.0
    fit_data  float64 4.0
    trf_data  float64 7.0

"Cleaning" the dataset works as expected, but the copy (ds2) has en empty list of added variables so the "clean" method doesn't have the expected result. We have the same behavior for deep copy.

Would that make any sense that the xr.DataSet.copy() method also return a copy of the accessors ?

crusaderky · 2019-10-07T23:03:28Z

Would that make any sense that the xr.DataSet.copy() method also return a copy of the accessors ?

It's been discussed above in this same thread. It's impossible without breaking the accessor API, as it would require you (the accessor developer) to define a copy method.
The more high level discussion is that the statefulness of the accessor is something that is OK to use for caching and performance improvements, and not OK for storing functional information like yours.

Have you considered storing a flag in Variable.attrs instead?

    def add(self, da):
        da.attrs["cleanable"] = True
        self.obj[da.name] = da
        return self.obj

    def clean(self):
        return self.obj.drop([
            k for k, v in self.obj.variables.items()
            if v.attrs.get("cleanable")
        ])

gmaze · 2019-10-08T07:28:07Z

Ok, I get it. Probably the accessor is not the best solution in my case.
And yes, an attribute was in fact my first implementation of the add/clean idea. But I was afraid it would be less reliable than the internal list over a long term perspective (but that was before getting in the troubles described above).

But why is asking accessor developers to define a copy method an issue ? That wouldn't be mandatory but only required in situations where propagating functional informations could be useful. Sorry if that's a naive question for you guys.

crusaderky · 2019-10-08T11:04:06Z

Thing is, the whole thing is undefined.
What does the accessor state contain? As a xarray developer, I don't know.
Is it variable names? Is it references to objects that make up the Dataset, e.g. Variables or the attrs dict? Is it objects whose contents rely on the current state of the Dataset, e.g. aggregated measures? Is it objects whose contents rely on historical events (like in your case)?

Dataset.copy() will create a copy of everything up to and excluding the numpy arrays. In order to allow you to retain accessor state, we'd need to plant a hook in it and invoke some agreed duck-type API in your object that basically states, "I called copy(), and this is the new object I created, please create a copy of yourself accordingly making extra sure you don't retain references to components of the previous object".
And then there are all the other methods that currently nuke the accessor state - including many in-place ones - because they could potentially invalidate it. What should they do? Invoke a special API on the accessor? If not, why should copy() trigger special accessor API and e.g. roll() shouldn't?
Planting accessor-refresher hooks in every single method that currently just wipes it away is out of question as it would need to be almost everywhere and - more importantly - it would be born broken.

gmaze · 2019-10-08T11:13:25Z

Alright, I think I get it, thanks for the clarification @crusaderky

crusaderky mentioned this issue Aug 27, 2019

__slots__ #3250

Merged

crusaderky changed the title ~~Stateful accessors~~ Stateful user-defined accessors Aug 27, 2019

jthielen mentioned this issue Aug 27, 2019

Bug in vertical coordinate verification Unidata/MetPy#1124

Closed

crusaderky mentioned this issue Nov 28, 2019

New accessor for a biomechanics open source library #3582

Closed

TomNicholas mentioned this issue Apr 10, 2020

Extending Xarray for domain-specific toolkits #3959

Closed

dcherian mentioned this issue Mar 3, 2021

Flexible indexes refactoring notes #4979

Merged

tomvothecoder mentioned this issue Aug 6, 2021

Add wrappers for opening datasets and data variables xCDAT/xcdat#81

Merged

9 tasks

tomvothecoder mentioned this issue Aug 31, 2021

DataArray accessor bounds attributes don't persist for all xarray functions that return new DataArrays xCDAT/xcdat#99

Closed

TomNicholas mentioned this issue Oct 14, 2021

Cannot set dataarray accessor attributes if DataArray part of Dataset #5863

Closed

kdebrab mentioned this issue Jul 11, 2022

BUG: caching in CachedAccessor is problematic and (probably) unnecessary pandas-dev/pandas#47667

Closed

3 tasks

This was referenced Feb 17, 2023

Unable to use dataarray if its a variable of a dataset GenericMappingTools/pygmt#1984

Closed

Make a gmt xarray accessor to store metadata from grdinfo GenericMappingTools/pygmt#499

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stateful user-defined accessors #3268

Stateful user-defined accessors #3268

crusaderky commented Aug 27, 2019 •

edited

Loading

crusaderky commented Aug 27, 2019

fmaussion commented Aug 27, 2019 •

edited

Loading

crusaderky commented Aug 27, 2019

crusaderky commented Aug 27, 2019 •

edited

Loading

crusaderky commented Aug 27, 2019 •

edited

Loading

fmaussion commented Aug 27, 2019

shoyer commented Aug 27, 2019

dopplershift commented Aug 27, 2019

gmaze commented Oct 4, 2019 •

edited

Loading

crusaderky commented Oct 4, 2019 •

edited

Loading

gmaze commented Oct 7, 2019

crusaderky commented Oct 7, 2019 •

edited

Loading

gmaze commented Oct 8, 2019

crusaderky commented Oct 8, 2019

gmaze commented Oct 8, 2019

Stateful user-defined accessors #3268

Stateful user-defined accessors #3268

Comments

crusaderky commented Aug 27, 2019 • edited Loading

crusaderky commented Aug 27, 2019

fmaussion commented Aug 27, 2019 • edited Loading

crusaderky commented Aug 27, 2019

crusaderky commented Aug 27, 2019 • edited Loading

crusaderky commented Aug 27, 2019 • edited Loading

fmaussion commented Aug 27, 2019

shoyer commented Aug 27, 2019

dopplershift commented Aug 27, 2019

gmaze commented Oct 4, 2019 • edited Loading

crusaderky commented Oct 4, 2019 • edited Loading

gmaze commented Oct 7, 2019

crusaderky commented Oct 7, 2019 • edited Loading

gmaze commented Oct 8, 2019

crusaderky commented Oct 8, 2019

gmaze commented Oct 8, 2019

crusaderky commented Aug 27, 2019 •

edited

Loading

fmaussion commented Aug 27, 2019 •

edited

Loading

crusaderky commented Aug 27, 2019 •

edited

Loading

crusaderky commented Aug 27, 2019 •

edited

Loading

gmaze commented Oct 4, 2019 •

edited

Loading

crusaderky commented Oct 4, 2019 •

edited

Loading

crusaderky commented Oct 7, 2019 •

edited

Loading