Replace constants.py with data + region specification from yaml-file #23
Comments
Suggestion for how the projection and regional extent can be specified:

```yaml
projections:
  lambert_conformal_conic:
    proj_class: LambertConformal
    proj_params:
      central_longitude: 15.0
      central_latitude: 63.3
      standard_parallels: [63.3, 63.3]
  mercator:
    proj_class: Mercator
    proj_params:
      central_longitude: 0.0
      min_latitude: -80.0
      max_latitude: 84.0
  stereographic:
    proj_class: Stereographic
    proj_params:
      central_longitude: 0.0
      central_latitude: 90.0
      true_scale_latitude: 60.0
  rotated_pole:
    proj_class: RotatedPole
    proj_params:
      pole_longitude: 10.0
      pole_latitude: -43.0
  robinson:
    proj_class: Robinson
    proj_params:
      central_longitude: 0.0
  plate_carree:
    proj_class: PlateCarree
    proj_params:
      central_longitude: 0.0
limits:
  x_limits: [-6.82, 4.8]
  y_limits: [-4.42, 3.36]
```
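One way such a spec could be consumed is to look the `proj_class` name up on `cartopy.crs` and instantiate it with `proj_params` as keyword arguments. Below is a minimal sketch of that idea; the helper and the stand-in registry are hypothetical (in practice the registry would be `getattr(cartopy.crs, name)`), only the spec structure follows the suggestion above:

```python
def build_projection(spec, registry):
    """Instantiate a projection object from one entry under `projections:`.

    `registry` maps class names to classes; in neural-lam this would be
    the `cartopy.crs` module, looked up via getattr.
    """
    proj_cls = registry[spec["proj_class"]]
    return proj_cls(**spec["proj_params"])


# Stand-in for cartopy.crs.LambertConformal, for illustration only
class LambertConformal:
    def __init__(self, central_longitude, central_latitude, standard_parallels):
        self.central_longitude = central_longitude
        self.central_latitude = central_latitude
        self.standard_parallels = standard_parallels


spec = {
    "proj_class": "LambertConformal",
    "proj_params": {
        "central_longitude": 15.0,
        "central_latitude": 63.3,
        "standard_parallels": [63.3, 63.3],
    },
}
proj = build_projection(spec, {"LambertConformal": LambertConformal})
```

This keeps the yaml purely declarative: adding a new projection only means adding a new entry, not touching code.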
Hi all, thanks for adding me. Indeed, there is something that I didn't find in this PR (as it got big and scattered across many issues it may be there but I missed it) and that was causing me some trouble: the nature of the data (reforecast vs. reanalysis) and the variables that go with it (maximum lead time for forecasts only, time step, number of time steps in a single file...). Specifically, the number of time steps in a file is currently hard-coded (`N_t' = 65` in the `WeatherDataset` class) and I had to change it to read my data. I can give you more details on how I worked around it, but a new issue would probably be more appropriate. What do you think?
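For context, one way the hard-coded count could be avoided is to infer the number of time steps from the stored files themselves. A small sketch of that idea (the function name and the shapes are made up; this is not the actual `WeatherDataset` code):

```python
def infer_n_timesteps(sample_shapes):
    """Infer the number of time steps per file from the data itself,
    instead of a hard-coded constant like 65.

    `sample_shapes` is a list of per-file array shapes; axis 0 is
    assumed to be time, and all files are expected to agree.
    """
    n_ts = {shape[0] for shape in sample_shapes}
    if len(n_ts) != 1:
        raise ValueError(f"Inconsistent time dimension across files: {sorted(n_ts)}")
    return n_ts.pop()


# Example: three files, each with 65 time steps (shapes are illustrative)
shapes = [(65, 63784, 17), (65, 63784, 17), (65, 63784, 17)]
n_t = infer_n_timesteps(shapes)  # -> 65
```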
Hi @ThomasRieutord, that is a good point. We did not have reforecast support on the roadmap, but I agree that it should be. Yes, there are quite a few issues and PRs open right now. I have a suggestion: PS: Do you also require the
I agree that having a Dataset class also for training on (re-)forecasts would be nice. But at the same time I do not mind changing everything to assume (re-)analysis first. Then we can add back such handling of forecast data later. The current handling of
Regarding the boundary data configuration: how do we specify the grid for the host model if it is on a global grid? Most ERA5 versions are on a regular lat-lon grid (although the native grid is reduced Gaussian), while the default grid for the (A)IFS HRES (the probable choice in an operational setting) is, I guess, a (reduced) Gaussian grid (https://confluence.ecmwf.int/display/UDOC/Gaussian+grids). Should we start by only supporting (assuming) regular lat-lon? However, this causes the grid node distribution to be quite uneven towards the poles (probably quite noticeable for the MEPS domain). This will then have an effect on the g2m, where some nodes will have contributions from many more grid points than others. @joeloskarsson maybe already solved this for the global model?
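To illustrate the unevenness: on a regular lat-lon grid the area represented by each node shrinks with the cosine of the latitude, so nodes crowd together towards the poles. A small sketch of such a per-row weighting factor (illustrative, not code from the repo):

```python
import math


def latlon_row_weights(lats_deg):
    """Relative grid-cell area for rows of a regular lat-lon grid.

    Cell area on the sphere scales with cos(latitude), so rows near the
    poles represent much less area per node than rows near the equator,
    which is the uneven node distribution discussed above.
    """
    return [math.cos(math.radians(lat)) for lat in lats_deg]


weights = latlon_row_weights([0.0, 60.0, 80.0])
# At 60N each node covers half the area of an equatorial node,
# at 80N only about 17 %.
```

Such weights could, for instance, inform how many grid points each mesh node should aggregate in the g2m graph.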
Does splitting the "state" into surface, atmospheric and levels add value (e.g. what if not all variables are at all levels)? What about just having a single list of state variables defined by names from the ECMWF or CF name tables:
Hi @sadamov and @joeloskarsson, I think #31 is in good shape and answers a slightly different problem, so I would rather have it finished and remove the hard-coded 65 time steps in another issue. Just to let you know how far I got. In my opinion, the current I will raise a specific issue about the
I agree with everything you describe Thomas, but given what we are moving towards with the data loading I am not sure if it is necessary to fix these kinds of constants in the current Dataset class. If we look at what @sadamov has started working on in terms of a zarr-loading Dataset (https://github.com/mllam/neural-lam/blob/feature_dataset_yaml/neural_lam/weather_dataset.py) all of this is handled very differently, so the problems you describe should not be present. I'd much rather try to get this zarr-based (re)analysis Dataset class in place and then use this as a basis for an improved (re)forecast Dataset class.
**Summary**

This PR replaces the `constants.py` file with a `data_config.yaml` file. Dataset-related settings can be defined by the user in the new yaml file. Training-specific settings were added as additional flags to the `train_model.py` routine. All respective calls to the old files were replaced.

**Rationale**

- Using a yaml file for the data config gives much more flexibility for the various datasets used in the community. It also facilitates the future use of forcing and boundary datasets. In a follow-up PR the dataset paths will be defined in the yaml file, removing the dependency on a pre-structured `/data` folder.
- It is best practice to define user input in a yaml file; using python scripts for that purpose is not common.
- The old `constants.py` actually combined both constants and variables; many "constants" should rather be flags to `train_model.py`.
- The introduction of a new ConfigClass in `utils.py` allows for very specific queries of the yaml and calculations based thereon. This branch shows future possibilities of such a class: https://github.com/joeloskarsson/neural-lam/tree/feature_dataset_yaml

**Testing**

Both training and evaluation of the model were successfully tested with the `meps_example` dataset.

**Note**

@leifdenby Could you invite Thomas R. to this repo, in case he wants to give his input on the yaml file? This PR should mostly serve as a basis for discussion. Maybe we should add more information to the yaml file as you outline in https://github.com/mllam/mllam-data-prep. I think we should always keep in mind what the repository will look like with realistic boundary conditions and zarr archives as data input. This PR solves parts of #23

---------

Co-authored-by: Simon Adamov <[email protected]>
Sorry, I had forgotten about #24. Maybe the zarr-dataset discussion can continue there?
This issue was closed and superseded by #24, where the work on zarr archives for boundaries, normalization statistics and more can continue.
This supersedes #2 and #3.
Motivation
It is currently very hard to work with neural-lam on different regions, due to everything related to data and the forecast region being hard-coded in `constants.py`. It would be much better to specify this in a config file that you can then point to. Yaml seems like a suitable format for this.

Proposition
The main training/eval script takes a flag `--spec` that should be given a path to a yaml file. This yaml file specifies all the data that goes into the model and information about the region the model should be working with. Current options in `constants.py` that relate to what to plot should all be turned into flags.

The yaml file should be read in and turned into a single object that contains all useful information and can be passed around in the code (since this is needed almost everywhere). Having this as an object means that it can also compute things not directly in the yaml file, such as units of variables that can be retrieved from loaded xarray data.
Design
Here is an idea for how the yaml file could be laid out with examples:
Start of file:
```yaml
# Data config for MEPS dataset
---
```
Some comments to keep track of what this specification is about. We don't enforce any content in there.
Forecast area data configuration
This describes the data configuration for the actual limited area that you are training on. Explicitly, the "inner region", not the "boundary". What is specified is what zarrs to load state, forcing and static grid features from and which variables in these to use for each.
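A hypothetical sketch of what such a forecast area entry could look like (all paths, section names and variable names here are made-up illustrations, not a fixed schema):

```yaml
forecast_area:
  state:
    zarrs: [/data/meps/state.zarr]
    vars: [t2m, u10, v10]
  forcing:
    zarrs: [/data/meps/forcing.zarr]
    vars: [toa_radiation, land_sea_mask]
  static:
    zarrs: [/data/meps/static.zarr]
    vars: [surface_geopotential]
  dims:
    x: x
    y: y
```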
A zarr can have either a flattened `grid` dimension, or named `x` and `y` dimensions. If `x` and `y` are given, we will flatten them ourselves in the data-loading code. If only `grid` is specified, a `grid_shape` entry has to be given (see below).

Boundary data configuration
The boundary data configuration follows exactly the same structure as the forecast_area, with two differences:
- No `state` entry is allowed, as we do not forecast the boundary nodes atm.
- The boundary has its own list of zarrs, to avoid variable name clashes with the forecast area zarrs.

Note that we enforce no spatial structure of the boundary w.r.t. the forecast area. The boundary nodes can be placed anywhere.
Grid shape
If the zarrs already contain flattened grid dimensions, we need knowledge of the original 2D spatial shape in order to be able to plot data. For such cases this can be given by an optional `grid_shape` entry:

Subset splits
The train/val/test split is defined based on timestamps:
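For example, such a split section could look like (field names and dates are illustrative, not a fixed schema):

```yaml
splits:
  train:
    start: 1990-09-01T00
    end: 2022-09-30T23
  val:
    start: 2022-10-01T00
    end: 2023-03-31T23
  test:
    start: 2023-04-01T00
    end: 2023-09-30T23
```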
Used by the dataset class to `.sel` the different subsets.

Forecast area projection
In order to be able to plot data in the forecasting area we need to know what projection the area is defined in. By plotting in this projection we end up with a flat rectangular area where the data sits. This should be specified as a reference to a cartopy.crs object.
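For example, reusing the class name and parameters from the projection suggestion earlier in this thread (the yaml keys themselves are an assumption):

```yaml
projection:
  proj_class: LambertConformal  # instantiated as cartopy.crs.LambertConformal
  proj_params:
    central_longitude: 15.0
    central_latitude: 63.3
    standard_parallels: [63.3, 63.3]
```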
Normalization zarr
We also need information about the statistics (mean and std) of state, boundary and forcing variables for normalization. Additionally we need the inverse variances used in the loss computation. As we compute and save these in a pre-processing script we can enforce a specific format, so let's put all of those in their own zarr as well. Then we only need to specify a path here to load that zarr from.
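For example (the path and the layout noted in the comment are assumptions; the actual format would be fixed by the pre-processing script):

```yaml
normalization:
  zarr: /data/meps/normalization_stats.zarr
  # expected contents: per-variable mean and std for state, forcing
  # and boundary, plus the inverse variances used in the loss
```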