DataTree: Align from_dict and to_dict behaviours to their Dataset equivalents #9074

Open
etienneschalk opened this issue Jun 7, 2024 · 4 comments · May be fixed by #9080

etienneschalk commented Jun 7, 2024

Is your feature request related to a problem?

This feature request arises from a "real-life" use case: I rely on Dataset.to_dict and Dataset.from_dict to convert Datasets to dicts before serializing them to JSON, and then to load the JSON back into Datasets with xarray.

JSON can be useful for small datasets, such as configuration with small values, that should be easy for a human to open and modify directly in a text editor, without any library or script. Using xarray is beneficial because it answers questions like "how do I represent an array with coordinates in JSON": there is no need to reinvent a super-language on top of JSON when the xarray serialization already does the job.

However, these capabilities do not exist (yet) for DataTree. It means that this "magic" method of using xarray as a way to dump to JSON is limited to flat structures.

Describe the solution you'd like

I would like the DataTree.from_dict and DataTree.to_dict to have a similar behaviour as their Dataset counterparts.

Currently the DataTree.from_dict method (https://xarray-datatree.readthedocs.io/en/stable/generated/datatree.DataTree.from_dict.html) expects "a mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects". This means that a dict loaded from JSON cannot be turned back into a DataTree.

Currently the DataTree.to_dict method does not attempt to "serialize": the keys are paths and the values are Dataset instances. I would expect the Datasets to be replaced by their dictified versions.

In [49]: xdt.to_dict()
Out[49]: 
{'/': <xarray.Dataset> 0B
 Dimensions:  ()
 Data variables:
     *empty*
 Attributes:
     top_level_attr:  Ho,
 '/parent': <xarray.Dataset> 96B
 Dimensions:  (dim_one: 3, dim_two: 2)
 Coordinates:
   * dim_one  (dim_one) int64 24B 10 20 30
 Dimensions without coordinates: dim_two
 Data variables:
     child_1  (dim_one) int64 24B 1 2 3
     child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0}

The solution I would like looks more like this:

In [55]: datatree_dict = {path: xds.to_dict() for path, xds in xdt.to_dict().items()}

In [56]: datatree_dict
Out[56]: 
{'/': {'coords': {},
  'attrs': {'top_level_attr': 'Ho'},
  'dims': {},
  'data_vars': {}},
 '/parent': {'coords': {'dim_one': {'dims': ('dim_one',),
    'attrs': {},
    'data': [10, 20, 30]}},
  'attrs': {},
  'dims': {'dim_one': 3, 'dim_two': 2},
  'data_vars': {'child_1': {'dims': ('dim_one',),
    'attrs': {},
    'data': [1, 2, 3]},
   'child_2': {'dims': ('dim_two', 'dim_one'),
    'attrs': {'units': 'm', 'long_name': 'Hey'},
    'data': [[5, 6, 9], [7, 8, 0]]}}}}

In [58]: print(json.dumps(datatree_dict, indent=4))
{
    "/": {
        "coords": {},
        "attrs": {
            "top_level_attr": "Ho"
        },
        "dims": {},
        "data_vars": {}
    },
    "/parent": {
        "coords": {
            "dim_one": {
                "dims": [
                    "dim_one"
                ],
                "attrs": {},
                "data": [
                    10,
                    20,
                    30
                ]
            }
        },
        "attrs": {},
        "dims": {
            "dim_one": 3,
            "dim_two": 2
        },
        "data_vars": {
            "child_1": {
                "dims": [
                    "dim_one"
                ],
                "attrs": {},
                "data": [
                    1,
                    2,
                    3
                ]
            },
            "child_2": {
                "dims": [
                    "dim_two",
                    "dim_one"
                ],
                "attrs": {
                    "units": "m",
                    "long_name": "Hey"
                },
                "data": [
                    [
                        5,
                        6,
                        9
                    ],
                    [
                        7,
                        8,
                        0
                    ]
                ]
            }
        }
    }
}
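
One detail is visible when comparing the Python dict with its JSON dump above: the 'dims' entries are tuples in the dict but arrays in the JSON. This is plain json module behaviour rather than anything specific to xarray, and the reload in In [41] further down shows Dataset.from_dict accepting the list form. A minimal standard-library sketch, where the `variable` dict is only an illustrative fragment shaped like one entry of the output above:

```python
import json

# Illustrative fragment shaped like one 'data_vars' entry of Dataset.to_dict().
variable = {
    "dims": ("dim_two", "dim_one"),
    "attrs": {"units": "m", "long_name": "Hey"},
    "data": [[5, 6, 9], [7, 8, 0]],
}

# JSON has no tuple type, so tuples are written as arrays and read back as lists.
round_tripped = json.loads(json.dumps(variable))

print(type(variable["dims"]).__name__)       # tuple
print(type(round_tripped["dims"]).__name__)  # list
print(round_tripped["data"] == variable["data"])  # True
```

So a dict that went through JSON is not strictly equal to the original one, even though it describes the same data.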

Describe alternatives you've considered

Until now, I have been storing PurePosixPath-like variable names in Datasets. This helps organize the configuration data, but it loses the benefit of the scoped dimension names that DataTree provides.

Note: I did not want to add any custom parsing logic of my own, which would be non-standard and potentially brittle. The whole point of from_dict and to_dict, as I use them, is to be "universal one-liners": a guarantee that another xarray user can easily read the JSON I produced without writing new parsing logic of their own.

Example:

  • Create a Dataset with all xarray features (top-level attrs, variable-level attrs, 1D and 2D arrays with a shared dimension, a dimension with coordinates and a dimension without coordinates), with PurePosixPath-like variable names
  • Convert it to dict then dump to a JSON string
  • Load back the JSON string to a Dataset
  • Convert it to a DataTree to benefit from the tree hierarchy permitted by the path-like variable names

In [31]: xds = xr.Dataset(
    ...:     {"parent/child_1": xr.DataArray([1, 2, 3], coords={"dim_one": [10, 20, 30]}),
    ...:      "parent/child_2": xr.DataArray([[5, 6, 9], [7, 8, 0]], dims=("dim_two", "dim_one"),
    ...:                                     attrs={"units": "m", "long_name": "Hey"})},
    ...:     attrs={"top_level_attr": "Ho"})

In [32]: xds.to_dict()
Out[32]: 
{'coords': {'dim_one': {'dims': ('dim_one',),
   'attrs': {},
   'data': [10, 20, 30]}},
 'attrs': {'top_level_attr': 'Ho'},
 'dims': {'dim_one': 3, 'dim_two': 2},
 'data_vars': {'parent/child_1': {'dims': ('dim_one',),
   'attrs': {},
   'data': [1, 2, 3]},
  'parent/child_2': {'dims': ('dim_two', 'dim_one'),
   'attrs': {'units': 'm', 'long_name': 'Hey'},
   'data': [[5, 6, 9], [7, 8, 0]]}}}

In [33]: print(json.dumps(xds.to_dict(), indent=4))
{
    "coords": {
        "dim_one": {
            "dims": [
                "dim_one"
            ],
            "attrs": {},
            "data": [
                10,
                20,
                30
            ]
        }
    },
    "attrs": {
        "top_level_attr": "Ho"
    },
    "dims": {
        "dim_one": 3,
        "dim_two": 2
    },
    "data_vars": {
        "parent/child_1": {
            "dims": [
                "dim_one"
            ],
            "attrs": {},
            "data": [
                1,
                2,
                3
            ]
        },
        "parent/child_2": {
            "dims": [
                "dim_two",
                "dim_one"
            ],
            "attrs": {
                "units": "m",
                "long_name": "Hey"
            },
            "data": [
                [
                    5,
                    6,
                    9
                ],
                [
                    7,
                    8,
                    0
                ]
            ]
        }
    }
}
In [41]: reloaded = xr.Dataset.from_dict(json.loads(json.dumps(xds.to_dict(), indent=4)))

In [42]: reloaded
Out[42]: 
<xarray.Dataset> 96B
Dimensions:         (dim_one: 3, dim_two: 2)
Coordinates:
  * dim_one         (dim_one) int64 24B 10 20 30
Dimensions without coordinates: dim_two
Data variables:
    parent/child_1  (dim_one) int64 24B 1 2 3
    parent/child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0
Attributes:
    top_level_attr:  Ho

In [43]: import xarray.core.datatree as dt

In [44]: xdt = dt.DataTree()

In [45]: for varname in reloaded: xdt[varname] = reloaded[varname]

In [46]: xdt
Out[46]: 
DataTree('None', parent=None)
└── DataTree('parent')
        Dimensions:  (dim_one: 3, dim_two: 2)
        Coordinates:
          * dim_one  (dim_one) int64 24B 10 20 30
        Dimensions without coordinates: dim_two
        Data variables:
            child_1  (dim_one) int64 24B 1 2 3
            child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0

Root-level attrs are lost but can be added again.

In [47]: xdt.attrs.update(xds.attrs)

In [48]: xdt
Out[48]: 
DataTree('None', parent=None)
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       top_level_attr:  Ho
└── DataTree('parent')
        Dimensions:  (dim_one: 3, dim_two: 2)
        Coordinates:
          * dim_one  (dim_one) int64 24B 10 20 30
        Dimensions without coordinates: dim_two
        Data variables:
            child_1  (dim_one) int64 24B 1 2 3
            child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0
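
The workaround above relies on variable names such as 'parent/child_1' being read as paths. As a rough illustration of that reading, using only the standard library, the sketch below groups path-like names by their parent path. The helper name `group_by_parent` and the extra 'top_level' variable are hypothetical and not part of xarray or of the session above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_parent(variable_names):
    """Group path-like variable names by parent path (hypothetical helper)."""
    groups = defaultdict(list)
    for name in variable_names:
        # Anchor every name at the root so that "a/b" and "/a/b" agree.
        path = PurePosixPath("/") / name
        groups[str(path.parent)].append(path.name)
    return dict(groups)


print(group_by_parent(["parent/child_1", "parent/child_2", "top_level"]))
# {'/parent': ['child_1', 'child_2'], '/': ['top_level']}
```

The resulting keys have the same shape as the path keys returned by DataTree.to_dict ('/', '/parent').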

Additional context

No response

@etienneschalk

Reworked issue description with examples from the current implementation

Is your feature request related to a problem?

This feature request arises from the following use case: I rely on Dataset.to_dict to convert Datasets to dicts before serializing them to JSON, and on Dataset.from_dict to then load the JSON files back into Datasets.

flowchart
subgraph Serialization
  Dataset_in[Dataset] --> dict_in[dict]
  dict_in[dict] --> JSON_out[JSON]
  end

flowchart
subgraph Deserialization
  JSON_in[JSON] --> dict_out[dict]
  dict_out[dict] --> Dataset_out[Dataset]
  end

JSON can be useful for small datasets, containing configuration values for instance, that should be easy for a human to open and modify directly in a text editor, without any external library or script. xarray's Dataset.from_dict and Dataset.to_dict methods provide an out-of-the-box answer to the question "How do I persist Datasets to JSON and reload them?". Using xarray also avoids storing configuration as "raw JSON", which is often error-prone and lacks structure. xarray thus provides more structure than raw JSON while keeping its flexibility: there is no need to define multiple schemas; the only rule is that the JSON must be readable by xarray.

While these capabilities already exist for DataArrays and Datasets, they do not exist yet for DataTree. This means that, currently, using xarray to read and write JSON is limited to flat structures.

Describe the solution you'd like

I would like the DataTree.from_dict and DataTree.to_dict to have a similar behaviour as their Dataset counterparts.

Currently the DataTree.from_dict method (https://xarray-datatree.readthedocs.io/en/stable/generated/datatree.DataTree.from_dict.html) expects:

A mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects.

It means a JSON cannot be reloaded back.

Currently the DataTree.to_dict method does not attempt to "serialize": the keys are paths and the values are instances of xarray data structures. I would expect the Datasets to be replaced by their dictified versions, following the same philosophy as the existing methods for Dataset. The existing DataTree methods are very useful, e.g. for creating test DataTrees, but their behaviour can be extended.

In the following code example:

  • An example DataTree is built using the existing dt.DataTree.from_dict
  • It is converted to a dict with the existing dt.DataTree.to_dict
  • It is converted to a dict with the proposed behaviour with a new method: dt.DataTree.to_dict_nested. It is then converted to JSON, and reloaded back to a DataTree with the complementary method DataTree.from_dict_nested

Build an example DataTree

import json

import numpy as np
import pandas as pd
import xarray as xr
from xarray.core import datatree as dt

xdt = dt.DataTree.from_dict(
    name="(root)",
    d={
        "/": xr.Dataset(
            coords={
                "time": xr.DataArray(
                    data=pd.date_range(start="2020-12-01", end="2020-12-02", freq="D")[
                        :2
                    ],
                    dims="time",
                    attrs={
                        "units": "date",
                        "long_name": "Time of acquisition",
                    },
                )
            },
            attrs={
                "description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
            },
        ),
        "/weather_data": xr.Dataset(
            coords={
                "station": xr.DataArray(
                    data=list("abcdef"),
                    dims="station",
                    attrs={
                        "units": "dl",
                        "long_name": "Station of acquisition",
                    },
                )
            },
            data_vars={
                "wind_speed": xr.DataArray(
                    np.ones((2, 6)) * 2,
                    dims=("time", "station"),
                    attrs={
                        "units": "meter/sec",
                        "long_name": "Wind speed",
                    },
                ),
                "pressure": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "hectopascals",
                        "long_name": "Time of acquisition",
                    },
                ),
            },
            attrs={"description": "Weather data node, inheriting the 'time' dimension"},
        ),
        "/weather_data/temperature": xr.Dataset(
            data_vars={
                "air_temperature": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Air temperature",
                    },
                ),
                "dewpoint_temp": xr.DataArray(
                    np.ones((2, 6)) * 4,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Dew point temperature",
                    },
                ),
            },
            attrs={
                "description": (
                    "Temperature, subnode of the weather data node, "
                    "inheriting the 'time' dimension from root and 'station' "
                    "dimension from the Temperature group."
                )
            },
        ),
        "/satellite_image": xr.Dataset(
            coords={"x": [10, 20, 30], "y": [90, 80, 70]},
            data_vars={
                "infrared": xr.DataArray(
                    np.ones((2, 3, 3)) * 5, dims=("time", "y", "x")
                ),
                "true_color": xr.DataArray(
                    np.ones((2, 3, 3)) * 6, dims=("time", "y", "x")
                ),
            },
        ),
    },
)
print(xdt)
DataTree('(root)', parent=None)
│   Dimensions:  (time: 2)
│   Coordinates:
│     * time     (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│   Data variables:
│       *empty*
│   Attributes:
│       description:  Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│   │   Dimensions:     (station: 6, time: 2)
│   │   Coordinates:
│   │     * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│   │   Dimensions without coordinates: time
│   │   Data variables:
│   │       wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│   │       pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│   │   Attributes:
│   │       description:  Weather data node, inheriting the 'time' dimension
│   └── DataTree('temperature')
│           Dimensions:          (time: 2, station: 6)
│           Dimensions without coordinates: time, station
│           Data variables:
│               air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│               dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│           Attributes:
│               description:  Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
        Dimensions:     (x: 3, y: 3, time: 2)
        Coordinates:
          * x           (x) int64 24B 10 20 30
          * y           (y) int64 24B 90 80 70
        Dimensions without coordinates: time
        Data variables:
            infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
            true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0

Convert to dict with the existing DataTree.to_dict method:

xdt.to_dict()
{'/': <xarray.Dataset> Size: 16B
 Dimensions:  (time: 2)
 Coordinates:
   * time     (time) datetime64[ns] 16B 2020-12-01 2020-12-02
 Data variables:
     *empty*
 Attributes:
     description:  Root Hypothetical DataTree with heterogeneous data: weather...,
 '/weather_data': <xarray.Dataset> Size: 216B
 Dimensions:     (station: 6, time: 2)
 Coordinates:
   * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
 Dimensions without coordinates: time
 Data variables:
     wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
     pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
 Attributes:
     description:  Weather data node, inheriting the 'time' dimension,
 '/satellite_image': <xarray.Dataset> Size: 336B
 Dimensions:     (x: 3, y: 3, time: 2)
 Coordinates:
   * x           (x) int64 24B 10 20 30
   * y           (y) int64 24B 90 80 70
 Dimensions without coordinates: time
 Data variables:
     infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
     true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0,
 '/weather_data/temperature': <xarray.Dataset> Size: 192B
 Dimensions:          (time: 2, station: 6)
 Dimensions without coordinates: time, station
 Data variables:
     air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
     dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
 Attributes:
     description:  Temperature, subnode of the weather data node, inheriting t...}

Convert to dict with the proposed DataTree.to_dict_nested method:

xdt.to_dict_nested()
{'coords': {'time': {'dims': ('time',),
   'attrs': {'units': 'date', 'long_name': 'Time of acquisition'},
   'data': [datetime.datetime(2020, 12, 1, 0, 0),
    datetime.datetime(2020, 12, 2, 0, 0)]}},
 'attrs': {'description': 'Root Hypothetical DataTree with heterogeneous data: weather and satellite'},
 'dims': {'time': 2},
 'data_vars': {},
 'name': '(root)',
 'children': {'weather_data': {'coords': {'station': {'dims': ('station',),
     'attrs': {'units': 'dl', 'long_name': 'Station of acquisition'},
     'data': ['a', 'b', 'c', 'd', 'e', 'f']}},
   'attrs': {'description': "Weather data node, inheriting the 'time' dimension"},
   'dims': {'station': 6, 'time': 2},
   'data_vars': {'wind_speed': {'dims': ('time', 'station'),
     'attrs': {'units': 'meter/sec', 'long_name': 'Wind speed'},
     'data': [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]},
    'pressure': {'dims': ('time', 'station'),
     'attrs': {'units': 'hectopascals', 'long_name': 'Time of acquisition'},
     'data': [[3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
      [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]]}},
   'name': 'weather_data',
   'children': {'temperature': {'coords': {},
     'attrs': {'description': "Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."},
     'dims': {'time': 2, 'station': 6},
     'data_vars': {'air_temperature': {'dims': ('time', 'station'),
       'attrs': {'units': 'kelvin', 'long_name': 'Air temperature'},
       'data': [[3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
        [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]]},
      'dewpoint_temp': {'dims': ('time', 'station'),
       'attrs': {'units': 'kelvin', 'long_name': 'Dew point temperature'},
       'data': [[4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
        [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]}},
     'name': 'temperature',
     'children': {}}}},
  'satellite_image': {'coords': {'x': {'dims': ('x',),
     'attrs': {},
     'data': [10, 20, 30]},
    'y': {'dims': ('y',), 'attrs': {}, 'data': [90, 80, 70]}},
   'attrs': {},
   'dims': {'x': 3, 'y': 3, 'time': 2},
   'data_vars': {'infrared': {'dims': ('time', 'y', 'x'),
     'attrs': {},
     'data': [[[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]],
      [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]]},
    'true_color': {'dims': ('time', 'y', 'x'),
     'attrs': {},
     'data': [[[6.0, 6.0, 6.0], [6.0, 6.0, 6.0], [6.0, 6.0, 6.0]],
      [[6.0, 6.0, 6.0], [6.0, 6.0, 6.0], [6.0, 6.0, 6.0]]]}},
   'name': 'satellite_image',
   'children': {}}}}
print(json.dumps(xdt.to_dict_nested(), indent=4, default=str))
{
    "coords": {
        "time": {
            "dims": [
                "time"
            ],
            "attrs": {
                "units": "date",
                "long_name": "Time of acquisition"
            },
            "data": [
                "2020-12-01 00:00:00",
                "2020-12-02 00:00:00"
            ]
        }
    },
    "attrs": {
        "description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
    },
    "dims": {
        "time": 2
    },
    "data_vars": {},
    "name": "(root)",
    "children": {
        "weather_data": {
            "coords": {
                "station": {
                    "dims": [
                        "station"
                    ],
                    "attrs": {
                        "units": "dl",
                        "long_name": "Station of acquisition"
                    },
                    "data": [
                        "a",
                        "b",
                        "c",
                        "d",
                        "e",
                        "f"
                    ]
                }
            },
            "attrs": {
                "description": "Weather data node, inheriting the 'time' dimension"
            },
            "dims": {
                "station": 6,
                "time": 2
            },
            "data_vars": {
                "wind_speed": {
                    "dims": [
                        "time",
                        "station"
                    ],
                    "attrs": {
                        "units": "meter/sec",
                        "long_name": "Wind speed"
                    },
                    "data": [
                        [
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0
                        ],
                        [
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0
                        ]
                    ]
                },
                "pressure": {
                    "dims": [
                        "time",
                        "station"
                    ],
                    "attrs": {
                        "units": "hectopascals",
                        "long_name": "Time of acquisition"
                    },
                    "data": [
                        [
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0
                        ],
                        [
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0
                        ]
                    ]
                }
            },
            "name": "weather_data",
            "children": {
                "temperature": {
                    "coords": {},
                    "attrs": {
                        "description": "Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."
                    },
                    "dims": {
                        "time": 2,
                        "station": 6
                    },
                    "data_vars": {
                        "air_temperature": {
                            "dims": [
                                "time",
                                "station"
                            ],
                            "attrs": {
                                "units": "kelvin",
                                "long_name": "Air temperature"
                            },
                            "data": [
                                [
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0
                                ],
                                [
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0
                                ]
                            ]
                        },
                        "dewpoint_temp": {
                            "dims": [
                                "time",
                                "station"
                            ],
                            "attrs": {
                                "units": "kelvin",
                                "long_name": "Dew point temperature"
                            },
                            "data": [
                                [
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0
                                ],
                                [
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0
                                ]
                            ]
                        }
                    },
                    "name": "temperature",
                    "children": {}
                }
            }
        },
        "satellite_image": {
            "coords": {
                "x": {
                    "dims": [
                        "x"
                    ],
                    "attrs": {},
                    "data": [
                        10,
                        20,
                        30
                    ]
                },
                "y": {
                    "dims": [
                        "y"
                    ],
                    "attrs": {},
                    "data": [
                        90,
                        80,
                        70
                    ]
                }
            },
            "attrs": {},
            "dims": {
                "x": 3,
                "y": 3,
                "time": 2
            },
            "data_vars": {
                "infrared": {
                    "dims": [
                        "time",
                        "y",
                        "x"
                    ],
                    "attrs": {},
                    "data": [
                        [
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ]
                        ],
                        [
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ]
                        ]
                    ]
                },
                "true_color": {
                    "dims": [
                        "time",
                        "y",
                        "x"
                    ],
                    "attrs": {},
                    "data": [
                        [
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ]
                        ],
                        [
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ]
                        ]
                    ]
                }
            },
            "name": "satellite_image",
            "children": {}
        }
    }
}

(minified version):

{"coords":{"time":{"dims":["time"],"attrs":{"units":"date","long_name":"Time of acquisition"},"data":["2020-12-01 00:00:00","2020-12-02 00:00:00"]}},"attrs":{"description":"Root Hypothetical DataTree with heterogeneous data: weather and satellite"},"dims":{"time":2},"data_vars":{},"name":"(root)","children":{"weather_data":{"coords":{"station":{"dims":["station"],"attrs":{"units":"dl","long_name":"Station of acquisition"},"data":["a","b","c","d","e","f"]}},"attrs":{"description":"Weather data node, inheriting the 'time' dimension"},"dims":{"station":6,"time":2},"data_vars":{"wind_speed":{"dims":["time","station"],"attrs":{"units":"meter/sec","long_name":"Wind speed"},"data":[[2,2,2,2,2,2],[2,2,2,2,2,2]]},"pressure":{"dims":["time","station"],"attrs":{"units":"hectopascals","long_name":"Time of acquisition"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]}},"name":"weather_data","children":{"temperature":{"coords":{},"attrs":{"description":"Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."},"dims":{"time":2,"station":6},"data_vars":{"air_temperature":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Air temperature"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]},"dewpoint_temp":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Dew point temperature"},"data":[[4,4,4,4,4,4],[4,4,4,4,4,4]]}},"name":"temperature","children":{}}}},"satellite_image":{"coords":{"x":{"dims":["x"],"attrs":{},"data":[10,20,30]},"y":{"dims":["y"],"attrs":{},"data":[90,80,70]}},"attrs":{},"dims":{"x":3,"y":3,"time":2},"data_vars":{"infrared":{"dims":["time","y","x"],"attrs":{},"data":[[[5,5,5],[5,5,5],[5,5,5]],[[5,5,5],[5,5,5],[5,5,5]]]},"true_color":{"dims":["time","y","x"],"attrs":{},"data":[[[6,6,6],[6,6,6],[6,6,6]],[[6,6,6],[6,6,6],[6,6,6]]]}},"name":"satellite_image","children":{}}}}
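
For reference, a compact form like the one above can be produced from Python with the `separators` argument of json.dumps. The sketch below uses a small illustrative `node` dict, not the real tree. Note that Python writes the float 2.0 as `2.0`, whereas the minified text above shows `2`, so that text was presumably produced by a different tool; an integer literal such as `2` would be read back by json.loads as an int rather than a float.

```python
import json

# Illustrative fragment shaped like one variable of the nested dict.
node = {"dims": ["time", "station"], "attrs": {}, "data": [[2.0, 2.0], [2.0, 2.0]]}

# separators=(",", ":") removes the spaces json.dumps inserts by default.
compact = json.dumps(node, separators=(",", ":"))
print(compact)
# {"dims":["time","station"],"attrs":{},"data":[[2.0,2.0],[2.0,2.0]]}

print(type(json.loads("2")).__name__, type(json.loads("2.0")).__name__)
# int float
```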

[Image: screenshot of the minified version]

Load back to DataTree:

dt.DataTree.from_dict_nested(json.loads(json.dumps(xdt.to_dict_nested(), indent=4, default=str)))
DataTree('(root)', parent=None)
│   Dimensions:  (time: 2)
│   Coordinates:
│     * time     (time) <U19 152B '2020-12-01 00:00:00' '2020-12-02 00:00:00'
│   Data variables:
│       *empty*
│   Attributes:
│       description:  Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│   │   Dimensions:     (station: 6, time: 2)
│   │   Coordinates:
│   │     * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│   │   Dimensions without coordinates: time
│   │   Data variables:
│   │       wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│   │       pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│   │   Attributes:
│   │       description:  Weather data node, inheriting the 'time' dimension
│   └── DataTree('temperature')
│           Dimensions:          (time: 2, station: 6)
│           Dimensions without coordinates: time, station
│           Data variables:
│               air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│               dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│           Attributes:
│               description:  Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
        Dimensions:     (x: 3, y: 3, time: 2)
        Coordinates:
          * x           (x) int64 24B 10 20 30
          * y           (y) int64 24B 90 80 70
        Dimensions without coordinates: time
        Data variables:
            infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
            true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0

Remark: the time dimension is downgraded to a str as it is not JSON serializable. The scope of this feature request is to focus on the DataTree -> dict and dict -> DataTree. Any serialization not supported by default by JSON is the responsibility of the user to deal with (as it is currently with Dataset.to_dict and Dataset.from_dict).

However, the round-trip DataTree -> dict -> DataTree is guaranteed:

dt.DataTree.from_dict_nested(xdt.to_dict_nested())
DataTree('(root)', parent=None)
│   Dimensions:  (time: 2)
│   Coordinates:
│     * time     (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│   Data variables:
│       *empty*
│   Attributes:
│       description:  Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│   │   Dimensions:     (station: 6, time: 2)
│   │   Coordinates:
│   │     * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│   │   Dimensions without coordinates: time
│   │   Data variables:
│   │       wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│   │       pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│   │   Attributes:
│   │       description:  Weather data node, inheriting the 'time' dimension
│   └── DataTree('temperature')
│           Dimensions:          (time: 2, station: 6)
│           Dimensions without coordinates: time, station
│           Data variables:
│               air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│               dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│           Attributes:
│               description:  Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
        Dimensions:     (x: 3, y: 3, time: 2)
        Coordinates:
          * x           (x) int64 24B 10 20 30
          * y           (y) int64 24B 90 80 70
        Dimensions without coordinates: time
        Data variables:
            infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
            true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0
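
The remark above about `time` being downgraded to `str` can be reproduced with the standard library alone. The following is a minimal sketch, not part of the proposed implementation: the `coord` dict is a hand-built stand-in shaped like the `time` coordinate entry shown earlier, and it assumes the timestamps arrive as `datetime.datetime` objects. Restoring the original type after loading is left to the caller.

```python
import json
from datetime import datetime

# Hand-built stand-in for one coordinate entry of a native dict (an
# assumption for illustration; it is not produced by xarray here).
coord = {
    "dims": ["time"],
    "attrs": {"units": "date", "long_name": "Time of acquisition"},
    "data": [datetime(2020, 12, 1), datetime(2020, 12, 2)],
}

# default=str is called for every object that json cannot encode natively,
# so each datetime is written out as its str() representation.
text = json.dumps(coord, default=str)

# After loading, the timestamps are plain strings, not datetimes.
loaded = json.loads(text)
as_strings = loaded["data"]

# Converting back is the caller's responsibility.
restored = [datetime.fromisoformat(s) for s in as_strings]
```
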

@TomNicholas
Contributor

Thanks for raising this @etienneschalk !

I agree that methods for going to/from JSON would be generally useful, and that the methods should be consistent across Dataset/DataTree, but the .to_dict and .from_dict methods on DataTree are quite important and IMO shouldn't be changed. (They matter because many internal operations are currently implemented by first turning the DataTree into a dict, manipulating it, then turning the altered dict back into a DataTree.)

Instead I suggest we simply add new methods for your use case: .to_json_dict (or perhaps some other name) and .from_json_dict. The existing .to_dict method on Dataset should be aliased to point to the new method, with a deprecation warning raised.

Another option might be to add a json=True kwarg to .to_dict or similar to switch between the two behaviours.

@etienneschalk
Contributor Author

Hello @TomNicholas ,

I kept the existing to_dict and from_dict methods of datatree and renamed them to to_paths_dict and from_paths_dict, as they are mappings of string paths to xarray's data structures; they can still be used in the internal code.

The existing Dataset.to_dict entirely removes any trace of xarray's data structures and converts them to native Python data structures (dicts and lists), which are more easily serializable to JSON. My implementation relies heavily on reusing the Dataset.to_dict method itself, so the added logic is fairly light.
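
To illustrate the difference between the two shapes, here is a rough sketch using plain dicts only. `paths_to_nested` is a hypothetical helper, not the code from the linked PR, and the per-node payloads are hand-written stand-ins for what Dataset.to_dict would return. It folds a flat path mapping into the nested `name` / `children` layout shown in the JSON above.

```python
from __future__ import annotations

from typing import Any


def paths_to_nested(paths: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: fold {"/a/b": payload} into nested "children" dicts."""
    root: dict[str, Any] = {"name": "(root)", "children": {}}
    for path, payload in paths.items():
        node = root
        # Walk down the tree, creating intermediate nodes when missing.
        for part in [p for p in path.split("/") if p]:
            node = node["children"].setdefault(part, {"name": part, "children": {}})
        # Copy the payload, but never overwrite the structural keys.
        node.update({k: v for k, v in payload.items() if k not in ("name", "children")})
    return root


# Stand-in payloads; a real implementation would obtain them per node.
paths = {
    "/": {"attrs": {"description": "root"}},
    "/weather_data": {"attrs": {}},
    "/weather_data/temperature": {"attrs": {"units": "kelvin"}},
}
nested = paths_to_nested(paths)
```
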

Rather than renaming the existing, older Dataset.to_dict method, would it be possible to change the datatree API while it is not yet fully integrated into xarray and changes like this are more acceptable?

Regarding a switch, the only issue I see with an argument like json=True would be for typing: the to_dict method would now return a union of two types, and this can be annoying for users (the burden of type narrowing is passed onto the user).

@etienneschalk
Contributor Author

etienneschalk commented Jun 22, 2024

Regarding

Another option might be to add a json=True kwarg to .to_dict or similar to switch between the two behaviours.

I saw it is possible to define the return type of a function based on a boolean flag (python/mypy#8634), so it might be possible to have both behaviours with the same function name, only changing the flag. The default would remain the existing behaviour of datatree's from_dict and to_dict since it is already in use. I can propose native as the flag name, as it really converts xarray data structures to native Python ones, which are more easily serializable to JSON (but it does not produce JSON directly).

Edit: I'm fine with just having to_native_dict and from_native_dict.
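
For reference, the flag-dependent return type discussed above can be sketched with `typing.overload` and `Literal`. `Tree` is a hypothetical stand-in class, not the real DataTree, and the `native` flag name is only the proposal from this comment. With these overloads a type checker should infer a path mapping for `to_dict()` and a plain nested dict for `to_dict(native=True)`, so callers do not have to narrow a union themselves.

```python
from __future__ import annotations

from typing import Any, Literal, overload


class Tree:
    """Hypothetical stand-in for a tree node; not the real DataTree class."""

    def __init__(self, name: str, children: dict[str, Tree] | None = None) -> None:
        self.name = name
        self.children = children or {}

    @overload
    def to_dict(self, native: Literal[False] = ...) -> dict[str, Tree]: ...
    @overload
    def to_dict(self, native: Literal[True]) -> dict[str, Any]: ...

    def to_dict(self, native: bool = False) -> dict[str, Tree] | dict[str, Any]:
        if native:
            # Nested layout made only of builtin Python types.
            return {
                "name": self.name,
                "children": {k: c.to_dict(native=True) for k, c in self.children.items()},
            }
        # Default: flat mapping from path to node object.
        out: dict[str, Tree] = {"/": self}
        for key, child in self.children.items():
            for path, node in child.to_dict().items():
                out["/" + key + path.rstrip("/")] = node
        return out


tree = Tree("(root)", {"a": Tree("a", {"b": Tree("b")})})
by_path = tree.to_dict()            # inferred as dict[str, Tree]
plain = tree.to_dict(native=True)   # inferred as dict[str, Any]
```
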
