Replace constants.py with data + region specification from yaml-file #23
Comments
Suggestion for how the projection and regional extent can be specified:

```yaml
projections:
  lambert_conformal_conic:
    proj_class: LambertConformal
    proj_params:
      central_longitude: 15.0
      central_latitude: 63.3
      standard_parallels: [63.3, 63.3]
  mercator:
    proj_class: Mercator
    proj_params:
      central_longitude: 0.0
      min_latitude: -80.0
      max_latitude: 84.0
  stereographic:
    proj_class: Stereographic
    proj_params:
      central_longitude: 0.0
      central_latitude: 90.0
      true_scale_latitude: 60.0
  rotated_pole:
    proj_class: RotatedPole
    proj_params:
      pole_longitude: 10.0
      pole_latitude: -43.0
  robinson:
    proj_class: Robinson
    proj_params:
      central_longitude: 0.0
  plate_carree:
    proj_class: PlateCarree
    proj_params:
      central_longitude: 0.0
limits:
  x_limits: [-6.82, 4.8]
  y_limits: [-4.42, 3.36]
```
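One way such a spec could be consumed is to look the `proj_class` name up on `cartopy.crs` and instantiate it with `proj_params` as keyword arguments. Below is a minimal sketch of that idea; the helper and the stand-in registry are hypothetical (in practice the registry would be `getattr(cartopy.crs, name)`), only the spec structure follows the suggestion above:

```python
def build_projection(spec, registry):
    """Instantiate a projection object from one entry under `projections:`.

    `registry` maps class names to classes; in neural-lam this would be
    the `cartopy.crs` module, looked up via getattr.
    """
    proj_cls = registry[spec["proj_class"]]
    return proj_cls(**spec["proj_params"])


# Stand-in for cartopy.crs.LambertConformal, for illustration only
class LambertConformal:
    def __init__(self, central_longitude, central_latitude, standard_parallels):
        self.central_longitude = central_longitude
        self.central_latitude = central_latitude
        self.standard_parallels = standard_parallels


spec = {
    "proj_class": "LambertConformal",
    "proj_params": {
        "central_longitude": 15.0,
        "central_latitude": 63.3,
        "standard_parallels": [63.3, 63.3],
    },
}
proj = build_projection(spec, {"LambertConformal": LambertConformal})
```

This keeps the yaml purely declarative: adding a new projection only means adding a new entry, not touching code.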
Hi all, thanks for adding me. Indeed, there is something that I didn't find in this PR (as it got big and scattered across many issues it may be there but I missed it) and that was causing me some trouble: the nature of the data (reforecast vs. reanalysis) and the variables that go with it (maximum lead time for forecasts only, time step, number of time steps in a single file...). Specifically, the number of time steps in a file is currently hard-coded (`N_t' = 65` in the `WeatherDataset` class) and I had to change it to read my data. I can give you more details on how I worked around it, but a new issue would probably be more appropriate. What do you think?
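For context, one way the hard-coded count could be avoided is to infer the number of time steps from the stored files themselves. A small sketch of that idea (the function name and the shapes are made up; this is not the actual `WeatherDataset` code):

```python
def infer_n_timesteps(sample_shapes):
    """Infer the number of time steps per file from the data itself,
    instead of a hard-coded constant like 65.

    `sample_shapes` is a list of per-file array shapes; axis 0 is
    assumed to be time, and all files are expected to agree.
    """
    n_ts = {shape[0] for shape in sample_shapes}
    if len(n_ts) != 1:
        raise ValueError(f"Inconsistent time dimension across files: {sorted(n_ts)}")
    return n_ts.pop()


# Example: three files, each with 65 time steps (shapes are illustrative)
shapes = [(65, 63784, 17), (65, 63784, 17), (65, 63784, 17)]
n_t = infer_n_timesteps(shapes)  # -> 65
```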
Hi @ThomasRieutord, that is a good point. We did not have reforecast support on the roadmap, but I agree that it should be. Yes, there are quite a few issues and PRs open right now. I have a suggestion: PS: Do you also require the
I agree that having a Dataset class also for training on (re-)forecasts would be nice. But at the same time I do not mind changing everything to assume (re-)analysis first. Then we can add back such handling of forecast data later. The current handling of
Regarding the boundary data configuration: how do we specify the grid for the host model if it is on a global grid? Most ERA5 versions are on a regular lat-lon grid (although the native grid is reduced Gaussian), while the default grid for the (A)IFS HRES (the probable choice in an operational setting) is, I guess, a (reduced) Gaussian grid (https://confluence.ecmwf.int/display/UDOC/Gaussian+grids). Should we start by only supporting (assuming) regular lat-lon? However, this causes the grid node distribution to be quite uneven towards the poles (probably quite noticeable for the MEPS domain). This will then have an effect on the g2m, where some nodes will have contributions from many more grid points than others. @joeloskarsson maybe already solved this for the global model?
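To illustrate the unevenness: on a regular lat-lon grid the area represented by each node shrinks with the cosine of the latitude, so nodes crowd together towards the poles. A small sketch of such a per-row weighting factor (illustrative, not code from the repo):

```python
import math


def latlon_row_weights(lats_deg):
    """Relative grid-cell area for rows of a regular lat-lon grid.

    Cell area on the sphere scales with cos(latitude), so rows near the
    poles represent much less area per node than rows near the equator,
    which is the uneven node distribution discussed above.
    """
    return [math.cos(math.radians(lat)) for lat in lats_deg]


weights = latlon_row_weights([0.0, 60.0, 80.0])
# At 60N each node covers half the area of an equatorial node,
# at 80N only about 17 %.
```

Such weights could, for instance, inform how many grid points each mesh node should aggregate in the g2m graph.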
Does splitting the "state" into surface, atmospheric and levels add value (e.g. what if not all variables are at all levels)? What about just having a single list of state variables defined by names from the ECMWF or CF name tables:
Hi @sadamov and @joeloskarsson, I think #31 is in good shape and answers a slightly different problem, so I would rather have it finished and remove the hard-coded 65 time steps in another issue. Just to let you know how far I got. In my opinion, the current I will raise a specific issue about the
I agree with everything you describe Thomas, but given what we are moving towards with the data loading I am not sure if it is necessary to fix these kinds of constants in the current Dataset class. If we look at what @sadamov has started working on in terms of a zarr-loading Dataset (https://github.com/mllam/neural-lam/blob/feature_dataset_yaml/neural_lam/weather_dataset.py) all of this is handled very differently, so the problems you describe should not be present. I'd much rather try to get this zarr-based (re)analysis Dataset class in place and then use this as a basis for an improved (re)forecast Dataset class.
**Summary**

This PR replaces the `constants.py` file with a `data_config.yaml` file. Dataset-related settings can be defined by the user in the new yaml file. Training-specific settings were added as additional flags to the `train_model.py` routine. All respective calls to the old files were replaced.

**Rationale**

- Using a yaml file for the data config gives much more flexibility for the various datasets used in the community. It also facilitates the future use of forcing and boundary datasets. In a follow-up PR the dataset paths will be defined in the yaml file, removing the dependency on a pre-structured `/data` folder.
- It is best practice to define user input in a yaml file; using python scripts for that purpose is not common.
- The old `constants.py` actually combined both constants and variables; many "constants" should rather be flags to `train_model.py`.
- The introduction of a new ConfigClass in `utils.py` allows for very specific queries of the yaml and calculations based thereon. This branch shows future possibilities of such a class: https://github.com/joeloskarsson/neural-lam/tree/feature_dataset_yaml

**Testing**

Both training and evaluation of the model were successfully tested with the `meps_example` dataset.

**Note**

@leifdenby Could you invite Thomas R. to this repo, in case he wants to give his input on the yaml file? This PR should mostly serve as a basis for discussion. Maybe we should add more information to the yaml file as you outline in https://github.com/mllam/mllam-data-prep. I think we should always keep in mind what the repository will look like with realistic boundary conditions and zarr archives as data input. This PR solves parts of #23

---------

Co-authored-by: Simon Adamov <[email protected]>
Sorry, I had forgotten about #24. Maybe the zarr-dataset discussion can continue there?
This issue was closed and superseded by #24, where the work on zarr archives for boundaries, normalization statistics and more can continue.
This supersedes #2 and #3.
Motivation
It is currently very hard to work with neural-lam on different regions, due to everything related to data and the forecast region being hard-coded in `constants.py`. It would be much better to specify this in a config file that you can then point to. Yaml seems like a suitable format for this.

Proposition
The main training/eval script takes a flag `--spec` that should be given a path to a yaml file. This yaml file specifies all the data that goes into the model and information about the region the model should be working with. Current options in `constants.py` that relate to what to plot should all be turned into flags.

The yaml file should be read in and turned into a single object that contains all useful information and can be passed around in the code (since this is needed almost everywhere). Having this as an object means that it can also compute things not directly in the yaml file, such as units of variables that can be retrieved from loaded xarray data.
Design
Here is an idea for how the yaml file could be laid out with examples:
Start of file:
```yaml
# Data config for MEPS dataset
---
```
Some comments to keep track of what this specification is about. We don't enforce any content in there.
Forecast area data configuration
This describes the data configuration for the actual limited area that you are training on. Explicitly, the "inner region", not the "boundary". What is specified is what zarrs to load state, forcing and static grid features from and which variables in these to use for each.
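A hypothetical sketch of what such a forecast area entry could look like (all paths, section names and variable names here are made-up illustrations, not a fixed schema):

```yaml
forecast_area:
  state:
    zarrs: [/data/meps/state.zarr]
    vars: [t2m, u10, v10]
  forcing:
    zarrs: [/data/meps/forcing.zarr]
    vars: [toa_radiation, land_sea_mask]
  static:
    zarrs: [/data/meps/static.zarr]
    vars: [surface_geopotential]
  dims:
    x: x
    y: y
```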
A zarr can have either a flattened `grid` dimension, or named `x` and `y` dimensions. If `x` and `y` are given, we will flatten them ourselves in the data-loading code. If only `grid` is specified, a `grid_shape` entry has to be given (see below).

Boundary data configuration
The boundary data configuration follows exactly the same structure as the forecast_area, with two differences:
- No `state` entry is allowed, as we do not forecast the boundary nodes atm.
- The boundary has its own list of zarrs, to avoid variable name clashes with the forecast area zarrs.

Note that we enforce no spatial structure of the boundary w.r.t. the forecast area. The boundary nodes can be placed anywhere.
Grid shape
If the zarrs already contain flattened grid dimensions, we need knowledge of the original 2D spatial shape in order to be able to plot data. For such cases this can be given by an optional `grid_shape` entry:

Subset splits
The train/val/test split is defined based on timestamps:
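For example, such a split section could look like (field names and dates are illustrative, not a fixed schema):

```yaml
splits:
  train:
    start: 1990-09-01T00
    end: 2022-09-30T23
  val:
    start: 2022-10-01T00
    end: 2023-03-31T23
  test:
    start: 2023-04-01T00
    end: 2023-09-30T23
```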
Used by the dataset class to `.sel` the different subsets.

Forecast area projection
In order to be able to plot data in the forecasting area we need to know what projection the area is defined in. By plotting in this projection we end up with a flat rectangular area where the data sits. This should be specified as a reference to a cartopy.crs object.
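For example, reusing the class name and parameters from the projection suggestion earlier in this thread (the yaml keys themselves are an assumption):

```yaml
projection:
  proj_class: LambertConformal  # instantiated as cartopy.crs.LambertConformal
  proj_params:
    central_longitude: 15.0
    central_latitude: 63.3
    standard_parallels: [63.3, 63.3]
```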
Normalization zarr
We also need information about the statistics (mean and std) of state, boundary and forcing variables for normalization. Additionally we need the inverse variances used in the loss computation. As we compute and save these in a pre-processing script we can enforce a specific format, so let's put all of those in their own zarr as well. Then we only need to specify a path here to load that zarr from.
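For example (the path and the layout noted in the comment are assumptions; the actual format would be fixed by the pre-processing script):

```yaml
normalization:
  zarr: /data/meps/normalization_stats.zarr
  # expected contents: per-variable mean and std for state, forcing
  # and boundary, plus the inverse variances used in the loss
```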