
New function for applying vectorized functions for unlabeled arrays to xarray objects #964

Merged · 40 commits · Jan 6, 2017

Conversation

shoyer
Member

shoyer commented Aug 12, 2016

This PR creates a new public-facing function xarray.apply_ufunc, which handles all the logic of applying NumPy generalized universal functions to xarray's labeled arrays, including automatic alignment, merging coordinates, broadcasting, and reapplying labels to the result.

Note that although we use the gufunc interface here, this works for far more than gufuncs. Any function that handles broadcasting in the usual numpy way will do. See below for examples.

Now that this logic is all in one place, we will even be able to (in a follow-up PR) include hooks for setting output array names and attributes based on input (e.g., to allow third party libraries to add unit support #525).

Xref #770

Examples

Calculate the vector magnitude of two arguments:

def magnitude(a, b):
    func = lambda x, y: np.sqrt(x ** 2 + y ** 2)
    return xr.apply_ufunc(func, a, b)

Compute the mean (.mean):

def mean(obj, dim):
    # note: apply_ufunc always moves core dimensions to the end
    sig = ([(dim,)], [()])
    kwargs = {'axis': -1}
    return xr.apply_ufunc(np.mean, obj, signature=sig, kwargs=kwargs)

Inner product over a specific dimension:

def gufunc_inner(x, y):
    result = np.matmul(x[..., np.newaxis, :], y[..., :, np.newaxis])
    return result[..., 0, 0]

def inner_product(a, b, dim):
    sig = ([(dim,), (dim,)], [()])
    return xr.apply_ufunc(gufunc_inner, a, b, signature=sig)

Stack objects along a new dimension (like xr.concat):

def stack(objects, dim, new_coord):
    sig = ([()] * len(objects), [(dim,)])
    new_coords = [{dim: new_coord}]
    func = lambda *x: np.stack(x, axis=-1)
    return xr.apply_ufunc(func, *objects, signature=sig,
                          new_coords=new_coords,
                          dataset_fill_value=np.nan)

Singular value decomposition:

def dim_shape(obj, dim):
    # TODO: make this unnecessary, see #921
    try:
        # DataArray/Variable: dims is a tuple of names, so look up the axis
        # position and return the corresponding size
        return obj.shape[obj.dims.index(dim)]
    except AttributeError:
        # Dataset: dims is a mapping from name to size
        return obj.dims[dim]

def svd(obj, dim0, dim1, new_dim='singular_values'):
    sig = ([(dim0, dim1)], [(dim0, new_dim), (new_dim,), (new_dim, dim1)])
    K = min(dim_shape(obj, dim0), dim_shape(obj, dim1))
    new_coords = [{new_dim: np.arange(K)}] * 3
    return xr.apply_ufunc(np.linalg.svd, obj, signature=sig,
                          new_coords=new_coords,
                          kwargs={'full_matrices': False})

Signature/Docstring

apply_ufunc(func, *args, signature=None, join='inner', new_coords=None,
            exclude_dims=frozenset(), dataset_fill_value=None, kwargs=None,
            dask_array='forbidden')

Apply a vectorized function for unlabeled arrays to xarray objects.

The input arguments will be handled using xarray's standard rules for
labeled computation, including alignment, broadcasting, looping over
GroupBy/Dataset variables, and merging of coordinates.

Parameters
----------
func : callable
    Function to call like ``func(*args, **kwargs)`` on unlabeled arrays
    (``.data``). If multiple arguments with non-matching dimensions are
    supplied, this function is expected to vectorize (broadcast) over
    axes of positional arguments in the style of NumPy universal
    functions [1]_.
*args : Dataset, DataArray, GroupBy, Variable, numpy/dask arrays or scalars
    Mix of labeled and/or unlabeled arrays to which to apply the function.
signature : string or triply nested sequence, optional
    Object indicating core dimensions that should not be broadcast on
    the input and output arguments. If omitted, inputs will be broadcast
    to share all dimensions in common before calling ``func`` on their
    values, and the output of ``func`` will be assumed to be a single array
    with the same shape as the inputs.

    Two forms of signatures are accepted:
    (a) A signature string of the form used by NumPy's generalized
        universal functions [2]_, e.g., '(),(time)->()' indicating a
        function that accepts two arguments and returns a single output,
        on which all dimensions should be broadcast except 'time' on the
        second argument.
    (b) A triply nested sequence providing lists of core dimensions for
        each variable, for both inputs and outputs, e.g.,
        ``([(), ('time',)], [()])``.

    Core dimensions are automatically moved to the last axes of any input
    variables, which facilitates using NumPy style generalized ufuncs (see
    the examples below).

    Unlike the NumPy gufunc signature spec, the names of all dimensions
    provided in signatures must be the names of actual dimensions on the
    xarray objects.
join : {'outer', 'inner', 'left', 'right'}, optional
    Method for joining the indexes of the passed objects along each
    dimension, and the variables of Dataset objects with mismatched
    data variables:
    - 'outer': use the union of object indexes
    - 'inner': use the intersection of object indexes
    - 'left': use indexes from the first object with each dimension
    - 'right': use indexes from the last object with each dimension
new_coords : list of dict-like, optional
    New coordinates to include on each output variable. Any core dimensions
    on outputs not found on the inputs must be provided here.
exclude_dims : set, optional
    Dimensions to exclude from alignment and broadcasting. Any input
    coordinates along these dimensions will be dropped. If you include
    these dimensions on any outputs, you must explicitly set them in
    ``new_coords``. Each excluded dimension must be a core dimension in the
    function signature.
dataset_fill_value : optional
    Value used in place of missing variables on Dataset inputs when the
    datasets do not share the exact same ``data_vars``. Only relevant if
    ``join != 'inner'``.
kwargs : dict, optional
    Optional keyword arguments passed directly on to ``func``.
dask_array : 'forbidden' or 'allowed', optional
    Whether or not to allow applying the ufunc to objects containing lazy
    data in the form of dask arrays. By default, this is forbidden, to
    avoid implicitly converting lazy data.

Returns
-------
Single value or tuple of Dataset, DataArray, Variable, dask.array.Array or
numpy.ndarray, the first type on that list to appear on an input.

coords = merge_coords_without_align(coord_variables)
name = result_name(args)

data_vars = [getattr(a, 'variable') for a in args]
Collaborator

Here & variables above could be normal a.variable rather than getattr...?

@max-sixty
Collaborator

This looks awesome! Would simplify a lot of the existing op stuff!

if not set(valid_core_dims_for_axis) <= axis_dims:
    raise ValueError('axis dimensions %r have overlap with core '
                     'dimensions %r, but do not appear at the start'
                     % (axis_dims, core_dims))
Collaborator

Why does the order of the passed dims matter? (i.e. why not transpose them all into the order that's needed)

@max-sixty
Collaborator

When this is done & we can do where, I wonder whether

da[bool_array] = 5

...could be sugar for...

da.where(bool_array, 5)

i.e. do we get multidimensional indexing for free?

@shoyer
Member Author

shoyer commented Aug 12, 2016

@MaximilianR Two issues come to mind with remapping da[condition] = other -> da = da.where(bool_array, other):

  1. What should da[bool_array] return? For consistency, I think we need to support both. We could alias that to where, too (da[bool_array] -> da.where(bool_array)), but this would be inconsistent with the existing behavior of da[bool_array] if da and bool_array are one-dimensional. This suggests maybe ds[bool_array] -> da.where(bool_array, drop=True).
  2. What if other is not a scalar value, but a DataArray? Currently, I don't think we align when indexing, but in this case, by necessity we would. For __getitem__ indexing, it's pretty obvious that alignment should preserve the indexes of the object being indexed. For __setitem__ indexing, I'm not sure. At the least, the rules would be a little different from the defaults for where (which does an inner join like most xarray operations by default). Maybe something like: left_join(da, inner_join(bool_array, other))? We need to make sure that every xarray assignment like obj[key] or obj[key] = value (and also .loc and .sel and so on) works sanely with these alignment rules.

@max-sixty
Collaborator

max-sixty commented Aug 12, 2016

Thanks for thinking through these

This suggests maybe ds[bool_array] -> da.where(bool_array, drop=True).

I think that makes sense. drop=False would be too confusing

Maybe something like: left_join(da, inner_join(bool_array, other))?

The way I was thinking about it: both other and bool_array need a value for every value in da. So they both need to be subsets. So something like:

assert set(other.dims) <= set(da.dims)
assert set(bool_array.dims) <= set(da.dims)

other, _ = xr.broadcast(other, da)
bool_array, _ = xr.broadcast(bool_array, da)

da.where(bool_array, other)

Is that consistent with the joins you were thinking of?

@shoyer
Member Author

shoyer commented Sep 13, 2016

This is now tested and ready for review. The API could particularly use feedback -- please take a look at the docstring and examples in the first comment. Long-desired operations, like a fill value for where (#576) and cumsum (#791), should now be writable in only a few lines.
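For example, here is a rough sketch of what cumsum might look like with this interface (mirroring the mean example above; treating the summed dimension as a core dimension on both input and output is an assumption, not code from this PR):

def cumsum(obj, dim):
    # the reduced dimension survives, so it is a core dim on input and output;
    # axis=-1 because apply_ufunc moves core dimensions to the end
    sig = ([(dim,)], [(dim,)])
    return xr.apply_ufunc(np.cumsum, obj, signature=sig, kwargs={'axis': -1})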

I have not yet hooked this up to the rest of xarray's code base, both because the set of changes we will be able to make with this is quite large, and because I'd like to give other contributors a chance to help/test. Note that the general version of apply_ufunc can include some significant overhead for doing the dispatch. For binary operations, we will probably want to use the pre-specialized versions (e.g., apply_dataset_ufunc).

Finally, given the generality of this operation, I'm considering renaming it from xr.apply_ufunc to simply xr.apply.

@shoyer shoyer changed the title WIP: rewrite the guts of the code for labeled computation New function: xarray.apply_ufunc Sep 13, 2016
@shoyer shoyer changed the title New function: xarray.apply_ufunc apply_ufunc for applying NumPy universal functions to xarray objects Sep 13, 2016
@shoyer
Member Author

shoyer commented Sep 13, 2016

# type: List[object] -> Any
# use the same naming heuristics as pandas:
# https://github.com/blaze/blaze/issues/458#issuecomment-51936356
names = set(getattr(obj, 'name', None) for obj in objects)
Collaborator

FWIW this could be a set comprehension
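Something like this, presumably:

names = {getattr(obj, 'name', None) for obj in objects}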

@max-sixty
Collaborator

Would it be possible to write something like np.einsum with xarray named dimensions?

I think it's possible, by supplying the dimensions to sum over, and broadcasting the others. Similar to the inner_product example, but taking *dims rather than dims. Is that right?
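Roughly like this sketch, I think (handling multiple core dimensions per argument this way is an assumption about the API in this PR):

def multi_inner_product(a, b, *dims):
    sig = ([tuple(dims), tuple(dims)], [()])
    def func(x, y):
        # core dims are moved to the last axes, so reduce over them
        axes = tuple(range(-len(dims), 0))
        return (x * y).sum(axis=axes)
    return xr.apply_ufunc(func, a, b, signature=sig)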

return tuple(x)


class Signature(object):
Collaborator

@max-sixty max-sixty Sep 13, 2016

There is already a Signature class in Python 3, FYI

Member Author

Switched to UFuncSignature to minimize potential confusion.

@shoyer
Member Author

shoyer commented Sep 23, 2016

One of the tricky things with apply is that there are a lot of similar but distinct use cases to disambiguate. I'll outline a few of these below.

I'd appreciate feedback on which cases are most essential and which can wait until later (this PR is already getting pretty big).

Also, I'd appreciate ideas for how to make the API more easily understood. We will have extensive docs either way, but xarray.apply is probably already in the realm of "too many arguments for one function". The last thing I want to do is to make a swiss army knife so flexible (like [numpy.nditer](https://docs.scipy.org/doc/numpy/reference/generated/numpy.nditer.html)) that nobody uses it because they don't understand how it works.

How func vectorizes

There are two main cases here:

  1. Functions already written to vectorize their arguments:

    1. Scalar functions built out of NumPy primitives (e.g., a + b + c). These work by default.
    2. Functions that use core dimensions referred to by axis (e.g., np.mean). These work if you set axis=-1 and put the dimension in the signature, but the API is kind of awkward. You'd rather that the wrapper just convert an argument like dim='time' automatically into axis=2. Transposing these core dimensions to the end also feels unnecessary, though maybe not a serious concern given that transposing NumPy arrays involves no memory copies.
    3. Functions that work mostly like gufuncs, but aren't actually (e.g., np.linalg.svd). This is pretty common, because NumPy ufuncs have some serious limitations (e.g., they can't handle non-vectorized arguments). These work about as well as we could hope, modulo possible improvements to the signature spec.
    4. True gufuncs, most likely written with numba.guvectorize. For these functions, we'd like a way to extract/use the signature automatically.
  2. Functions for which you only have the inner loop (e.g., np.polyfit or scipy.stats.pearsonr). Running these is going to entail large Python overhead, but often that's acceptable.

    One option for these is to wrap them into something that broadcasts like a gufunc, e.g., via a new function numpy.guvectorize (ENH: add signature argument to vectorize for vectorizing like generalized ufuncs numpy/numpy#8054). But as a user, this is a lot of wrappers to write. You'd rather just add something like vectorize=True and let xarray handle all the automatic broadcasting, e.g.,

    def poly_fit(x, y, dim='time', deg=1):
        return xr.apply(np.polyfit, x, y,
                        signature=([(dim,), (dim,)], [('poly_order',)]),
                        new_coords={'poly_order': range(deg + 1)},
                        kwargs={'deg': deg}, vectorize=True)

Whether func applies to "data only" or "everything"

Most "computation" functions/methods in xarray (e.g., arithmetic and reduce methods) follow the rule of merging coordinates, and only applying the core function to data variables. Coordinates that are no longer valid with new dimensions are dropped. This is currently what we do in apply.

On the other hand, there are also functions/methods that we might refer to as "organizing" (e.g., indexing methods, concat, stack/unstack, transpose), which generally apply to every variable, including coordinates. It seems like there are definitely use cases for applying these sorts of functions, too, e.g., to wrap Cartopy's add_cyclic_point utility (#1005). So, I think we might need another option to toggle what happens to coordinates (e.g., variables='data' vs variables='all').

How to handle mismatched core dimensions

Xarray methods often have fallbacks to handle data with different dimensions. For example, if you write ds.mean(['x', 'y']), it matches on core dimensions to apply four different possible functions to each data variable:

  • mean over ('x', 'y'): for variables with both dimensions
  • mean over 'x': for variables with only 'x'
  • mean over 'y': for variables with only 'y'
  • identity: for variables with neither 'x' nor 'y'

Indexing is another example -- it applies to both data and coordinates, but only to matching dimensions for each variable. If a variable doesn't have the dimensions, it is ignored.

Writing something like mean with a single call to apply would entail the need for something like a dispatching system to pick which function to use, e.g., instead of a single func/signature pair you pass a dispatcher function that chooses func/signature based on the core dimensions of each passed variable. This feels like serious over-engineering.

Instead, we might support a few pre-canned options for how to deal with mismatched dimensions. For example:

  • missing_core_dims='drop': silently drop these variables in the output(s)
  • missing_core_dims='error': raise an error. This is the current default behavior, which is probably only useful with `variables='data'` -- otherwise some coordinate variables would always error.
  • missing_core_dims='keep': keep these variables unchanged in the output(s) (use merge_variables to check for conflicts)
  • missing_core_dims='broadcast': broadcast all inputs to have the necessary core dimensions if they don't have them already

Another option would be to consolidate this with the variables option to allow only two modes of operation:

  • variables='data': For "computation" functions. Apply only to data variables, and error if any data variables are missing a core dimension. Merge coordinates on the output(s), silently dropping conflicts for variables that don't label a dimension.
  • variables='matching': For "organizing" functions. Apply to every variable with matching core dimensions. Merge everything else on the output(s), silently dropping conflicts for variables that don't label a dimension.

@crusaderky
Contributor

crusaderky commented Oct 1, 2016

Any hope to get dask support? Even with the limitation of having 1:1 matching between input and output chunks, it would already be tremendously useful.

In other words, it should be easy to automatically call dask.array.map_blocks

@crusaderky
Contributor

I worked around the limitation. It would be nice if apply() did the below automatically!

from itertools import chain
from functools import wraps
import dask.array

def dask_kernel(func):
    """Invoke dask.array.map_blocks(func, *args, **kwds) if at least one
    of the arguments is a dask array; else invoke func(*args, **kwds)
    """
    @wraps(func)
    def wrapper(*args, **kwds):
        if any(isinstance(a, dask.array.Array) for a in chain(args, kwds.values())):
            return dask.array.map_blocks(func, *args, **kwds)
        else:
            return func(*args, **kwds)
    return wrapper
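Hypothetical usage (assuming bottleneck is installed and using this PR's dask_array flag) -- note that map_blocks fills each chunk independently, so values do not propagate across chunk boundaries:

import bottleneck as bn

filled = xr.apply(dask_kernel(bn.push), da, dask_array='allowed')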

@shoyer
Member Author

shoyer commented Oct 28, 2016

I'm thinking about making a few tweaks and merging this, but not exposing it to users yet as part of the public API. The public API is not quite there yet, but even as it is, I think it would be a useful building block for internal functionality (e.g., for #1065), and then other people could start to build on this as well.

@max-sixty
Collaborator

I'm thinking through how difficult it would be to add a back-fill method to DataArray (that could be an argument to fillna or a bfill method - that's a separate discussion).

Would this PR help? I'm trying to wrap my head around the options. Thanks

@shoyer
Member Author

shoyer commented Nov 30, 2016

I'm thinking through how difficult it would be to add a back-fill method to DataArray (that could be an argument to fillna or a bfill method - that's a separate discussion).

Would this PR help? I'm trying to wrap my head around the options. Thanks

Yes, quite likely. In the current state, it would depend on whether you want to back-fill all variables or just data variables (only the latter is currently supported).

Either way, the first step is probably to write a function backfill(values, axis) that acts on NumPy arrays.
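A rough sketch of such a kernel in pure NumPy (the name and the index trick here are an illustration, not code from this PR; np.take_along_axis needs NumPy >= 1.15):

def backfill(values, axis=-1):
    # back-fill NaNs with the next valid value along the given axis
    values = np.moveaxis(values, axis, -1)
    n = values.shape[-1]
    mask = np.isnan(values)
    # at NaN positions point at the last index, then take the running minimum
    # from the right so every position points at the next valid index
    idx = np.where(~mask, np.arange(n), n - 1)
    idx = np.minimum.accumulate(idx[..., ::-1], axis=-1)[..., ::-1]
    return np.moveaxis(np.take_along_axis(values, idx, axis=-1), -1, axis)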

@max-sixty
Collaborator

Either way, the first step is probably to write a function backfill(values, axis) that acts on NumPy arrays.

Right.
Surprisingly, I can't actually find something like this out there. The pandas code is good but highly 1-2 dimension specific.

Let me know if I'm missing (pun intended - long day) something. Is there a library of these sorts of functions over n-dims somewhere else (even R / Julia)? Or are we really the first people in the world to be doing this?

@shoyer
Member Author

shoyer commented Nov 30, 2016

Surprisingly, I can't actually find something like this out there. The pandas code is good but highly 1-2 dimension specific.

Let me know if I'm missing (pun intended - long day) something. Is there a library of these sorts of functions over n-dims somewhere else (even R / Julia)? Or are we really the first people in the world to be doing this?

Usually I check numpy and bottleneck. It actually looks like bottleneck.push is what you're looking for. I think this is a recent addition to bottleneck, though.

@max-sixty
Collaborator

Gave this a quick spin for filling. A few questions:

  • Is there an easy way of merely translating dims into axes? Maybe that already exists?
  • Is there an easy way to keep a dimension? Or should it be in the signature and a new_dim?
da = xr.DataArray(np.random.rand(10, 3), dims=('x', 'y'))
da = da.where(da > 0.5)
In [43]: da
Out[43]:
<xarray.DataArray (x: 10, y: 3)>
array([[        nan,  0.57243305,  0.84363016],
       [        nan,  0.90788156,         nan],
       [        nan,  0.50739189,  0.93701278],
       [        nan,         nan,  0.86804167],
       [        nan,  0.50883914,         nan],
       [        nan,         nan,         nan],
       [        nan,  0.91547763,         nan],
       [ 0.72920182,         nan,  0.6982745 ],
       [ 0.73033449,  0.950719  ,  0.73077113],
       [        nan,         nan,  0.72463932]])

In [44]: xr.apply(bn.push, da)  # already better than `bn.push(da)`!
Out[44]:
<xarray.DataArray (x: 10, y: 3)>
array([[        nan,  0.57243305,  0.84363016],
       [        nan,  0.90788156,  0.90788156],
       [        nan,  0.50739189,  0.93701278],
       [        nan,         nan,  0.86804167],
       [        nan,  0.50883914,  0.50883914],
       [        nan,         nan,         nan],
       [        nan,  0.91547763,  0.91547763],
       [ 0.72920182,  0.72920182,  0.6982745 ],
       [ 0.73033449,  0.950719  ,  0.73077113],
       [        nan,         nan,  0.72463932]])

# but changing the axis is verbose and transposes the array - are there existing tools for this?

In [48]: xr.apply(bn.push, da, signature='(x)->(x)', new_coords=[dict(x=da.x)])
Out[48]:
<xarray.DataArray (y: 3, x: 10)>
array([[        nan,         nan,         nan,         nan,         nan,
                nan,         nan,  0.72920182,  0.73033449,  0.73033449],
       [ 0.57243305,  0.90788156,  0.50739189,  0.50739189,  0.50883914,
         0.50883914,  0.91547763,  0.91547763,  0.950719  ,  0.950719  ],
       [ 0.84363016,  0.84363016,  0.93701278,  0.86804167,  0.86804167,
         0.86804167,  0.86804167,  0.6982745 ,  0.73077113,  0.72463932]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9
  o y        (y) -
  • The triple nested signature is pretty tough to write! Two kwargs?

@jhamman
Member

jhamman commented Dec 30, 2016

@shoyer - do we want to get this into 0.9 as a private api function and aim to complete it for the public api by 0.10 or so?

@crusaderky
Contributor

@shoyer - any plans to add dask support as suggested above?

@shoyer
Member Author

shoyer commented Dec 31, 2016

@crusaderky

any plans to add dask support as suggested above?

Yes, in fact I have a branch with some basic support for this that I was working on a few months ago. I haven't written tests yet but I can potentially push that WIP to another PR after merging this.

There are a couple of recent feature additions to dask.array.atop (dask/dask#1612 and dask/dask#1716) that should make this easier and more powerful. I have not built anything on top of these yet, so my prior work is somewhat outdated.

@jhamman

do we want to get this into 0.9 as a private api function and aim to complete it for the public api by 0.10 or so?

Yes, this seems like a good goal. I'll take another look over this next week when I have the chance, to remove any work-in-progress bits that have snuck in and remove the public-facing API.

@shoyer
Member Author

shoyer commented Jan 4, 2017

I removed the public-facing API and renamed the (now private) apply function back to apply_ufunc. I also removed the new_coords argument, in favor of encouraging the use of .coords or .assign_coords.

As discussed above, the current API with signature is difficult to use, but this is probably fine for an internal function and we can revisit the public facing API later. Any objections to merging this?

@shoyer shoyer merged commit 27d04a1 into pydata:master Jan 6, 2017
@shoyer
Member Author

shoyer commented Jan 6, 2017

OK, in it goes. Once again, there's no public API exposed yet.

@max-sixty
Collaborator

Congrats!

@max-sixty
Collaborator

FWIW the bn.push example still has some unanswered questions -- would be interested to know if there's an easier way of doing that, particularly if it's just a 'dim for axis' swap.

@shoyer
Member Author

shoyer commented Feb 2, 2017

#1245 replaces the unintuitive signature argument with separate input_core_dims and output_core_dims.
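For reference, the inner-product example from the top of this thread would look roughly like this with the newer keywords:

def inner_product(a, b, dim):
    return xr.apply_ufunc(gufunc_inner, a, b,
                          input_core_dims=[[dim], [dim]],
                          output_core_dims=[[]])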

@shoyer shoyer deleted the apply_ufunc branch September 17, 2017 05:04