DataTree: Align from_dict and to_dict behaviours to their Dataset equivalents #9074
Comments
Reworked issue description with examples from the current implementation
Is your feature request related to a problem?
This feature request arises from the following use case: I rely on Dataset.to_dict and Dataset.from_dict to serialize Datasets to JSON and to load them back:
flowchart
subgraph Serialization
Dataset_in[Dataset] --> dict_in[dict]
dict_in[dict] --> JSON_out[JSON]
end
flowchart
subgraph Deserialization
JSON_in[JSON] --> dict_out[dict]
dict_out[dict] --> Dataset_out[Dataset]
end
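JSON can be useful for small datasets, containing configuration values for instance, that should be easy for a human to open and modify directly in a text editor, without using any external library or script.
While these capabilities already exist for DataArrays and Datasets, they do not exist yet for DataTree. This means that, currently, using xarray to read and write JSON is limited to flat structures.
Describe the solution you'd like
I would like DataTree.from_dict and DataTree.to_dict to behave like their Dataset counterparts. Currently, DataTree.from_dict expects a mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects, which means a JSON document cannot be loaded back. Currently, DataTree.to_dict does not attempt to serialize: the keys are paths and the values are Dataset instances.
In the following code example: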
Build an example DataTree
import json  # used by the JSON round-trip snippets further down

import pandas as pd
import numpy as np
import xarray as xr
from xarray.core import datatree as dt

xdt = dt.DataTree.from_dict(
    name="(root)",
    d={
        # Root node: only holds the shared 'time' coordinate.
        "/": xr.Dataset(
            coords={
                "time": xr.DataArray(
                    data=pd.date_range(start="2020-12-01", end="2020-12-02", freq="D")[
                        :2
                    ],
                    dims="time",
                    attrs={
                        "units": "date",
                        "long_name": "Time of acquisition",
                    },
                )
            },
            attrs={
                "description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
            },
        ),
        # Weather node: defines the 'station' coordinate.
        "/weather_data": xr.Dataset(
            coords={
                "station": xr.DataArray(
                    data=list("abcdef"),
                    dims="station",
                    attrs={
                        "units": "dl",
                        "long_name": "Station of acquisition",
                    },
                )
            },
            data_vars={
                "wind_speed": xr.DataArray(
                    np.ones((2, 6)) * 2,
                    dims=("time", "station"),
                    attrs={
                        "units": "meter/sec",
                        "long_name": "Wind speed",
                    },
                ),
                "pressure": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "hectopascals",
                        "long_name": "Time of acquisition",
                    },
                ),
            },
            attrs={"description": "Weather data node, inheriting the 'time' dimension"},
        ),
        # Temperature node: a child of the weather node.
        "/weather_data/temperature": xr.Dataset(
            data_vars={
                "air_temperature": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Air temperature",
                    },
                ),
                "dewpoint_temp": xr.DataArray(
                    np.ones((2, 6)) * 4,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Dew point temperature",
                    },
                ),
            },
            attrs={
                "description": (
                    "Temperature, subnode of the weather data node, "
                    "inheriting the 'time' dimension from root and 'station' "
                    "dimension from the Temperature group."
                )
            },
        ),
        # Satellite node: has its own 'x' and 'y' coordinates.
        "/satellite_image": xr.Dataset(
            coords={"x": [10, 20, 30], "y": [90, 80, 70]},
            data_vars={
                "infrared": xr.DataArray(
                    np.ones((2, 3, 3)) * 5, dims=("time", "y", "x")
                ),
                "true_color": xr.DataArray(
                    np.ones((2, 3, 3)) * 6, dims=("time", "y", "x")
                ),
            },
        ),
    },
)
print(xdt)
DataTree('(root)', parent=None)
│ Dimensions: (time: 2)
│ Coordinates:
│ * time (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│ Data variables:
│ *empty*
│ Attributes:
│ description: Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│ │ Dimensions: (station: 6, time: 2)
│ │ Coordinates:
│ │ * station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│ │ pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│ │ Attributes:
│ │ description: Weather data node, inheriting the 'time' dimension
│ └── DataTree('temperature')
│ Dimensions: (time: 2, station: 6)
│ Dimensions without coordinates: time, station
│ Data variables:
│ air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│ dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│ Attributes:
│ description: Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0
Convert to dict with the existing to_dict:
xdt.to_dict()
{'/': <xarray.Dataset> Size: 16B
Dimensions: (time: 2)
Coordinates:
* time (time) datetime64[ns] 16B 2020-12-01 2020-12-02
Data variables:
*empty*
Attributes:
description: Root Hypothetical DataTree with heterogeneous data: weather...,
'/weather_data': <xarray.Dataset> Size: 216B
Dimensions: (station: 6, time: 2)
Coordinates:
* station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
Dimensions without coordinates: time
Data variables:
wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
Attributes:
description: Weather data node, inheriting the 'time' dimension,
'/satellite_image': <xarray.Dataset> Size: 336B
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0,
'/weather_data/temperature': <xarray.Dataset> Size: 192B
Dimensions: (time: 2, station: 6)
Dimensions without coordinates: time, station
Data variables:
air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
Attributes:
description: Temperature, subnode of the weather data node, inheriting t...}
Convert to dict with the proposed to_dict_nested:
xdt.to_dict_nested()
{'coords': {'time': {'dims': ('time',),
'attrs': {'units': 'date', 'long_name': 'Time of acquisition'},
'data': [datetime.datetime(2020, 12, 1, 0, 0),
datetime.datetime(2020, 12, 2, 0, 0)]}},
'attrs': {'description': 'Root Hypothetical DataTree with heterogeneous data: weather and satellite'},
'dims': {'time': 2},
'data_vars': {},
'name': '(root)',
'children': {'weather_data': {'coords': {'station': {'dims': ('station',),
'attrs': {'units': 'dl', 'long_name': 'Station of acquisition'},
'data': ['a', 'b', 'c', 'd', 'e', 'f']}},
'attrs': {'description': "Weather data node, inheriting the 'time' dimension"},
'dims': {'station': 6, 'time': 2},
'data_vars': {'wind_speed': {'dims': ('time', 'station'),
'attrs': {'units': 'meter/sec', 'long_name': 'Wind speed'},
'data': [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]},
'pressure': {'dims': ('time', 'station'),
'attrs': {'units': 'hectopascals', 'long_name': 'Time of acquisition'},
'data': [[3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0]]}},
'name': 'weather_data',
'children': {'temperature': {'coords': {},
'attrs': {'description': "Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."},
'dims': {'time': 2, 'station': 6},
'data_vars': {'air_temperature': {'dims': ('time', 'station'),
'attrs': {'units': 'kelvin', 'long_name': 'Air temperature'},
'data': [[3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0]]},
'dewpoint_temp': {'dims': ('time', 'station'),
'attrs': {'units': 'kelvin', 'long_name': 'Dew point temperature'},
'data': [[4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
[4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]}},
'name': 'temperature',
'children': {}}}},
'satellite_image': {'coords': {'x': {'dims': ('x',),
'attrs': {},
'data': [10, 20, 30]},
'y': {'dims': ('y',), 'attrs': {}, 'data': [90, 80, 70]}},
'attrs': {},
'dims': {'x': 3, 'y': 3, 'time': 2},
'data_vars': {'infrared': {'dims': ('time', 'y', 'x'),
'attrs': {},
'data': [[[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]],
[[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]]},
'true_color': {'dims': ('time', 'y', 'x'),
'attrs': {},
'data': [[[6.0, 6.0, 6.0], [6.0, 6.0, 6.0], [6.0, 6.0, 6.0]],
[[6.0, 6.0, 6.0], [6.0, 6.0, 6.0], [6.0, 6.0, 6.0]]]}},
'name': 'satellite_image',
'children': {}}}}
print(json.dumps(xdt.to_dict_nested(), indent=4, default=str))
{
"coords": {
"time": {
"dims": [
"time"
],
"attrs": {
"units": "date",
"long_name": "Time of acquisition"
},
"data": [
"2020-12-01 00:00:00",
"2020-12-02 00:00:00"
]
}
},
"attrs": {
"description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
},
"dims": {
"time": 2
},
"data_vars": {},
"name": "(root)",
"children": {
"weather_data": {
"coords": {
"station": {
"dims": [
"station"
],
"attrs": {
"units": "dl",
"long_name": "Station of acquisition"
},
"data": [
"a",
"b",
"c",
"d",
"e",
"f"
]
}
},
"attrs": {
"description": "Weather data node, inheriting the 'time' dimension"
},
"dims": {
"station": 6,
"time": 2
},
"data_vars": {
"wind_speed": {
"dims": [
"time",
"station"
],
"attrs": {
"units": "meter/sec",
"long_name": "Wind speed"
},
"data": [
[
2.0,
2.0,
2.0,
2.0,
2.0,
2.0
],
[
2.0,
2.0,
2.0,
2.0,
2.0,
2.0
]
]
},
"pressure": {
"dims": [
"time",
"station"
],
"attrs": {
"units": "hectopascals",
"long_name": "Time of acquisition"
},
"data": [
[
3.0,
3.0,
3.0,
3.0,
3.0,
3.0
],
[
3.0,
3.0,
3.0,
3.0,
3.0,
3.0
]
]
}
},
"name": "weather_data",
"children": {
"temperature": {
"coords": {},
"attrs": {
"description": "Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."
},
"dims": {
"time": 2,
"station": 6
},
"data_vars": {
"air_temperature": {
"dims": [
"time",
"station"
],
"attrs": {
"units": "kelvin",
"long_name": "Air temperature"
},
"data": [
[
3.0,
3.0,
3.0,
3.0,
3.0,
3.0
],
[
3.0,
3.0,
3.0,
3.0,
3.0,
3.0
]
]
},
"dewpoint_temp": {
"dims": [
"time",
"station"
],
"attrs": {
"units": "kelvin",
"long_name": "Dew point temperature"
},
"data": [
[
4.0,
4.0,
4.0,
4.0,
4.0,
4.0
],
[
4.0,
4.0,
4.0,
4.0,
4.0,
4.0
]
]
}
},
"name": "temperature",
"children": {}
}
}
},
"satellite_image": {
"coords": {
"x": {
"dims": [
"x"
],
"attrs": {},
"data": [
10,
20,
30
]
},
"y": {
"dims": [
"y"
],
"attrs": {},
"data": [
90,
80,
70
]
}
},
"attrs": {},
"dims": {
"x": 3,
"y": 3,
"time": 2
},
"data_vars": {
"infrared": {
"dims": [
"time",
"y",
"x"
],
"attrs": {},
"data": [
[
[
5.0,
5.0,
5.0
],
[
5.0,
5.0,
5.0
],
[
5.0,
5.0,
5.0
]
],
[
[
5.0,
5.0,
5.0
],
[
5.0,
5.0,
5.0
],
[
5.0,
5.0,
5.0
]
]
]
},
"true_color": {
"dims": [
"time",
"y",
"x"
],
"attrs": {},
"data": [
[
[
6.0,
6.0,
6.0
],
[
6.0,
6.0,
6.0
],
[
6.0,
6.0,
6.0
]
],
[
[
6.0,
6.0,
6.0
],
[
6.0,
6.0,
6.0
],
[
6.0,
6.0,
6.0
]
]
]
}
},
"name": "satellite_image",
"children": {}
}
}
}
(minified version):
{"coords":{"time":{"dims":["time"],"attrs":{"units":"date","long_name":"Time of acquisition"},"data":["2020-12-01 00:00:00","2020-12-02 00:00:00"]}},"attrs":{"description":"Root Hypothetical DataTree with heterogeneous data: weather and satellite"},"dims":{"time":2},"data_vars":{},"name":"(root)","children":{"weather_data":{"coords":{"station":{"dims":["station"],"attrs":{"units":"dl","long_name":"Station of acquisition"},"data":["a","b","c","d","e","f"]}},"attrs":{"description":"Weather data node, inheriting the 'time' dimension"},"dims":{"station":6,"time":2},"data_vars":{"wind_speed":{"dims":["time","station"],"attrs":{"units":"meter/sec","long_name":"Wind speed"},"data":[[2,2,2,2,2,2],[2,2,2,2,2,2]]},"pressure":{"dims":["time","station"],"attrs":{"units":"hectopascals","long_name":"Time of acquisition"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]}},"name":"weather_data","children":{"temperature":{"coords":{},"attrs":{"description":"Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."},"dims":{"time":2,"station":6},"data_vars":{"air_temperature":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Air temperature"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]},"dewpoint_temp":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Dew point temperature"},"data":[[4,4,4,4,4,4],[4,4,4,4,4,4]]}},"name":"temperature","children":{}}}},"satellite_image":{"coords":{"x":{"dims":["x"],"attrs":{},"data":[10,20,30]},"y":{"dims":["y"],"attrs":{},"data":[90,80,70]}},"attrs":{},"dims":{"x":3,"y":3,"time":2},"data_vars":{"infrared":{"dims":["time","y","x"],"attrs":{},"data":[[[5,5,5],[5,5,5],[5,5,5]],[[5,5,5],[5,5,5],[5,5,5]]]},"true_color":{"dims":["time","y","x"],"attrs":{},"data":[[[6,6,6],[6,6,6],[6,6,6]],[[6,6,6],[6,6,6],[6,6,6]]]}},"name":"satellite_image","children":{}}}}
Load back to DataTree:
dt.DataTree.from_dict_nested(json.loads(json.dumps(xdt.to_dict_nested(), indent=4, default=str)))
DataTree('(root)', parent=None)
│ Dimensions: (time: 2)
│ Coordinates:
│ * time (time) <U19 152B '2020-12-01 00:00:00' '2020-12-02 00:00:00'
│ Data variables:
│ *empty*
│ Attributes:
│ description: Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│ │ Dimensions: (station: 6, time: 2)
│ │ Coordinates:
│ │ * station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│ │ pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│ │ Attributes:
│ │ description: Weather data node, inheriting the 'time' dimension
│ └── DataTree('temperature')
│ Dimensions: (time: 2, station: 6)
│ Dimensions without coordinates: time, station
│ Data variables:
│ air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│ dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│ Attributes:
│ description: Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0
Remark: the time coordinate is downgraded to str because datetime values are not JSON serializable. The scope of this feature request is to focus on the dict conversion. However, the round-trip through a dict alone keeps the datetime64 dtype:
dt.DataTree.from_dict_nested(xdt.to_dict_nested())
DataTree('(root)', parent=None)
│ Dimensions: (time: 2)
│ Coordinates:
│ * time (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│ Data variables:
│ *empty*
│ Attributes:
│ description: Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│ │ Dimensions: (station: 6, time: 2)
│ │ Coordinates:
│ │ * station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│ │ pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│ │ Attributes:
│ │ description: Weather data node, inheriting the 'time' dimension
│ └── DataTree('temperature')
│ Dimensions: (time: 2, station: 6)
│ Dimensions without coordinates: time, station
│ Data variables:
│ air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│ dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│ Attributes:
│ description: Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0
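The remark about the time coordinate becoming a str can be reproduced with the standard library alone. The following is a minimal sketch, assuming a hand-written dict shaped like one coordinate entry of the nested output; it does not call xarray, and the final parsing step is a hypothetical post-processing choice, not something the proposed methods are stated to do.

```python
import json
from datetime import datetime

# Hand-written dict shaped like one coordinate entry of the nested output
# (an illustration only; no xarray call is involved here).
coord = {
    "dims": ("time",),
    "attrs": {"units": "date", "long_name": "Time of acquisition"},
    "data": [datetime(2020, 12, 1), datetime(2020, 12, 2)],
}

# default=str is invoked for objects json cannot encode (here: datetime).
encoded = json.dumps(coord, default=str)
decoded = json.loads(encoded)

# After the JSON round trip the datetimes are plain strings,
# and the 'dims' tuple has become a list.
print(decoded["data"])  # ['2020-12-01 00:00:00', '2020-12-02 00:00:00']
print(decoded["dims"])  # ['time']

# Hypothetical post-processing step: parse the strings back explicitly.
restored = [datetime.fromisoformat(value) for value in decoded["data"]]
print(restored == coord["data"])  # True
```

The type information is lost at the JSON encoding step, not in the dict conversion, which is consistent with the dict-only round trip keeping the datetime64 dtype.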
Thanks for raising this @etienneschalk ! I generally agree that methods for going to/from JSON would be generally useful, and that the methods should be consistent across Instead I suggest we simply add new methods for your use case: Another option might be to add a
Hello @TomNicholas , I kept and renamed the existing The existing Rather than renaming the existing older Regarding a switch, the only issue I see with an argument like
Regarding
I saw it is possible to define the return type of a function based on a boolean flag (python/mypy#8634), so it might be possible to have both behaviours, with the same function name, only changing the flag. The default would remain the existing behaviour of datatree's from_dict and to_dict since it is already in use. I can propose Edit: I'm fine with just having
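A return type that depends on a boolean flag can be expressed with typing.overload and Literal. The sketch below is only an illustration under stated assumptions: the Tree class is a stand-in rather than xarray's DataTree, and the flag name nested is hypothetical. The flat layout maps paths to node objects to mimic the existing behaviour, and the nested layout returns plain dicts.

```python
from __future__ import annotations

from typing import Any, Literal, overload


class Tree:
    """Stand-in for illustration only; this is not xarray's DataTree."""

    def __init__(self, name: str, children: dict[str, Tree] | None = None) -> None:
        self.name = name
        self.children = children or {}

    # The overloads let a type checker pick the return type from the flag.
    @overload
    def to_dict(self, nested: Literal[False] = ...) -> dict[str, Tree]: ...
    @overload
    def to_dict(self, nested: Literal[True]) -> dict[str, Any]: ...

    def to_dict(self, nested: bool = False) -> dict[str, Any]:
        if nested:
            # Serializable layout: plain dicts all the way down.
            return {
                "name": self.name,
                "children": {
                    key: child.to_dict(nested=True)
                    for key, child in self.children.items()
                },
            }
        # Flat layout: path -> node object (the default, as in existing code).
        flat: dict[str, Any] = {"/": self}
        for key, child in self.children.items():
            for path, node in child.to_dict().items():
                flat["/" + key + path.rstrip("/")] = node
        return flat


tree = Tree("root", {"a": Tree("a", {"b": Tree("b")})})
print(sorted(tree.to_dict()))  # ['/', '/a', '/a/b']
print(tree.to_dict(nested=True)["children"]["a"]["name"])  # a
```

Keeping the flat layout as the default leaves existing callers untouched; the cost is that one method name then carries two unrelated return shapes.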
Is your feature request related to a problem?
This feature request arises from a "real-life" use case: I rely on Dataset.from_dict and Dataset.to_dict to convert Datasets to a dict before serializing them to JSON, and then to load the JSON back into a Dataset with xarray.
JSON can be useful for small datasets containing configuration with small values, which should be easy for a human to open and modify directly in a text editor, without using any library or script. Using xarray provides benefits, as it answers questions like "how do I represent an array with coordinates in JSON": there is no need to reinvent super-languages above JSON when the xarray serialization already does the job.
However, these capabilities do not exist (yet) for DataTree. This means that this "magic" method of using xarray as a way to dump to JSON is limited to flat structures.
Describe the solution you'd like
I would like DataTree.from_dict and DataTree.to_dict to behave similarly to their Dataset counterparts.
Currently, the DataTree.from_dict method (https://xarray-datatree.readthedocs.io/en/stable/generated/datatree.DataTree.from_dict.html) expects "a mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects". This means a JSON document cannot be loaded back.
Currently, the DataTree.to_dict method does not attempt to "serialize": the keys are paths and the values are Dataset instances. I would expect the Datasets to be replaced by their dictified version.
The solution I would like looks more like this:
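As a rough sketch of that idea (not the original example from this issue), a path-keyed mapping of already dictified nodes can be folded into one nested, JSON-friendly dict. The helper name nest_by_path is hypothetical, the node contents are reduced to attrs for brevity, and the children key follows the nested layout shown in the reworked description.

```python
import json
from typing import Any


def nest_by_path(flat: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Fold a {path: node_dict} mapping into nested dicts under 'children'."""
    root: dict[str, Any] = {"children": {}}
    for path, payload in flat.items():
        node = root
        # Walk down the tree, creating intermediate nodes when missing.
        for part in (p for p in path.split("/") if p):
            node = node["children"].setdefault(part, {"children": {}})
        # Store the node's own content next to its 'children' entry.
        node.update({k: v for k, v in payload.items() if k != "children"})
    return root


# Path-keyed input, in the spirit of the existing to_dict output, but with
# each Dataset already replaced by a plain dict.
flat = {
    "/": {"attrs": {"description": "root"}},
    "/weather_data": {"attrs": {"description": "weather"}},
    "/weather_data/temperature": {"attrs": {"units": "kelvin"}},
}

nested = nest_by_path(flat)
print(json.dumps(nested, indent=2))
```

Because every value is a plain dict, list, or scalar, the result goes through json.dumps without a custom encoder.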
Describe alternatives you've considered
Until now, I have been storing PurePosixPath-like variable names in Datasets. This helps organize the configuration data; however, it loses the benefit of the scoped dimension names that DataTree provides.
Note: I did not want to add any custom parsing logic of my own, which would be non-standard and potentially fragile. The whole point of from_dict and to_dict, to me, as I use them, is to be "universal one-liners": a guarantee that another xarray user can easily read the JSON I produced without writing new parsing logic of their own.
Example:
Root-level attrs are lost but can be added again.
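The path-like naming workaround can be illustrated with the standard library alone. This is a minimal sketch, assuming made-up variable names; it only shows how such names group by parent path, and it also shows why dimension names are not scoped: the grouping lives in the variable names, not in the data model.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Flat, path-like variable names as they might be stored in one Dataset
# (illustrative names only).
names = [
    "time",
    "weather_data/wind_speed",
    "weather_data/temperature/air_temperature",
    "satellite_image/infrared",
]

# Group the leaf names by their parent "group" path.
groups: dict[str, list[str]] = defaultdict(list)
for name in names:
    path = PurePosixPath(name)
    # Top-level names have '.' as their parent.
    groups[str(path.parent)].append(path.name)

for group, members in groups.items():
    print(group, members)
```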
Additional context
No response