fillna removes non-indexed coordinates in some cases #9124

Open
5 tasks done
JGuetschow opened this issue Jun 14, 2024 · 3 comments

@JGuetschow

What happened?

fillna (and other functions that require alignment) removes non-indexed extra coordinates along a dimension that also has an indexed coordinate.

Below is an example with fillna, but I have experienced similar behavior with combine_first and assume it will happen whenever alignment is needed. More on what I think is happening below.

What did you expect to happen?

I expected the extra coordinate to remain in place as it was defined consistently in both datasets. Alternatively an error message would also help to understand what's going on. Currently it's dropped silently.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

# create a dataset with a single dimension and constant data
area_iso3 = np.array(["COL", "ARG", "MEX", "BOL"])

test_ds = xr.Dataset(
    { "CO2": 
          xr.DataArray(data=np.ones(len(area_iso3)),
                       coords={
                           "area (ISO3)": area_iso3,
                       },
                       dims=["area (ISO3)"])
    }
) 

# attach an additional coordinate to the existing dimension
country_names = ["Colombia", "Argentina", "Mexico", "Bolivia"]
test_ds = test_ds.assign_coords(country_name=("area (ISO3)", country_names))

# select the fill values with .loc along the dimension that carries the additional coordinate
test_ds_loc = test_ds.loc[{'area (ISO3)': ['COL', 'ARG']}]
print(f"test_ds_loc coords: {test_ds_loc.coords}")

# set some values to nan to fill later
test_ds["CO2"].loc[{'area (ISO3)': ['COL', 'ARG']}] = np.nan

# fill
test_ds = test_ds.fillna(test_ds_loc)
print(f"test_ds coords after fillna: {test_ds.coords}\n")
# additional coordinate gone

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No error is raised and no log output is generated. The output of the example code is:

test_ds_loc coords: Coordinates:
  * area (ISO3)   (area (ISO3)) <U3 24B 'COL' 'ARG'
    country_name  (area (ISO3)) <U9 72B 'Colombia' 'Argentina'
test_ds coords after fillna: Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'

Anything else we need to know?

The problem only occurs when the .loc selection is made along the dimension that carries the additional coordinate. I think the reason for the problem is the following:

align pads the additional coordinate of the smaller dataset (the same happens for a DataArray) with np.nan to expand it to the size of the larger dataset. Later in the process, the aligned coordinates are passed to merge_coordinates_without_align, which drops the additional coordinate because it contains np.nan in one dataset where the other dataset has values.

So the non-index coordinates are neither combined like the indexed coordinates, nor filled like the data variables.

Below is a short example involving only align and merge_coordinates_without_align.

import xarray as xr
from xarray.core.merge import merge_coordinates_without_align
import numpy as np

# create a dataset with a single dimension and constant data
area_iso3 = np.array(["COL", "ARG", "MEX", "BOL"])

test_ds = xr.Dataset(
    { "CO2":
          xr.DataArray(data=np.ones(4),
                       coords={
                           "area (ISO3)": area_iso3,
                       },
                       dims=["area (ISO3)"])
      }
)

# attach an additional coordinate to one of the existing dimensions
country_names = ["Colombia", "Argentina", "Mexico", "Bolivia"]
test_ds = test_ds.assign_coords(country_name=("area (ISO3)", country_names))

test_ds_loc = test_ds.loc[{'area (ISO3)': ['COL', 'ARG']}]
print(f"test_ds_loc coords: {test_ds_loc.coords}")

aligned = xr.align(test_ds,test_ds_loc,join='outer')
merged = merge_coordinates_without_align(aligned)
print(f"merged coords: {merged[0]}")
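
The first step described above (align padding the extra coordinate with np.nan) can also be checked in isolation. The following is a minimal sketch on the same toy data; the names `full` and `subset` are chosen here for illustration only. It inspects nothing but the output of xr.align and makes no claim about what the later merge step does.

```python
import numpy as np
import xarray as xr

dim = "area (ISO3)"

# Full dataset: one indexed coordinate plus one non-indexed coordinate
# ("country_name") along the same dimension.
full = xr.Dataset(
    {"CO2": (dim, np.ones(4))},
    coords={
        dim: ["COL", "ARG", "MEX", "BOL"],
        "country_name": (dim, ["Colombia", "Argentina", "Mexico", "Bolivia"]),
    },
)

# Subset along the dimension that carries the extra coordinate.
subset = full.loc[{dim: ["COL", "ARG"]}]

# Outer alignment expands the subset to the full index.
aligned_full, aligned_subset = xr.align(full, subset, join="outer")

# Count missing entries in the non-indexed coordinate of each aligned object.
n_missing_full = int(aligned_full["country_name"].isnull().sum())
n_missing_subset = int(aligned_subset["country_name"].isnull().sum())
print(n_missing_full, n_missing_subset)
```

If the description above is right, the coordinate of the full dataset has no missing entries, while the expanded subset has one missing entry for each label it did not originally contain.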

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: None

xarray: 2024.6.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: None
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.3.8
dask: 2023.12.1
distributed: None
matplotlib: 3.9.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.6.0
cupy: None
pint: 0.24
sparse: None
flox: None
numpy_groupies: None
setuptools: 70.0.0
pip: 24.0
conda: None
pytest: 7.4.4
mypy: 1.10.0
IPython: 8.25.0
sphinx: 5.3.0

JGuetschow added the bug and needs triage labels Jun 14, 2024

welcome bot commented Jun 14, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@max-sixty
Collaborator

This does seem like confusing behavior. We'd def welcome a fix.

max-sixty removed the needs triage label Jun 23, 2024
@JGuetschow
Author

So far I only have a workaround for our use case, which merges the additional coordinates independently (treating them like variables). I looked into the xarray code, but without sufficient time to understand it, I think chances are high that I would break more than I fix. I probably won't have time to dig deeper before November, but if the issue is still open then, I'll take a look.
An easy workaround is actually to use merge where possible, as I could not reproduce the problem with merge (but I only tested the use cases in our code, so I might have missed something).
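
A minimal sketch of the first workaround, assuming that "treat them like variables" means demoting the extra coordinate to a data variable before filling and promoting it back afterwards. This is one possible reading of the comment, not the code from the original project; `make_ds`, `target`, `source`, and `subset` are hypothetical names introduced for the example.

```python
import numpy as np
import xarray as xr

dim = "area (ISO3)"
area_iso3 = ["COL", "ARG", "MEX", "BOL"]
country_names = ["Colombia", "Argentina", "Mexico", "Bolivia"]


def make_ds(values):
    """Build the toy dataset with an extra non-indexed coordinate (hypothetical helper)."""
    return xr.Dataset(
        {"CO2": (dim, values)},
        coords={dim: area_iso3, "country_name": (dim, country_names)},
    )


source = make_ds(np.ones(4))
subset = source.loc[{dim: ["COL", "ARG"]}]    # values used for filling
target = make_ds([np.nan, np.nan, 1.0, 1.0])  # dataset with gaps

extra = ["country_name"]

# Demote the extra coordinate to a data variable on both sides so that fillna
# fills it like any other variable, then promote it back to a coordinate.
filled = (
    target.reset_coords(extra)
    .fillna(subset.reset_coords(extra))
    .set_coords(extra)
)

print(filled.coords)
```

The extra coordinate goes through the same fill logic as the data variables, so it never reaches the coordinate merge step in a partially filled state. Note that its dtype may change to object along the way.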
