data update 2023 #55

Open
VeruGHub opened this issue Apr 22, 2024 · 6 comments

@VeruGHub
Owner

Two options (@cpucher):

  • Only update the year 2023, adding to the current data. We would still only have v4 of the data.
  • Update the whole time series from 1950 to 2023, which will likely also lead to slightly different values for the previous years as the E-Obs data set is constantly updated (stations added/removed, wrong data corrected/removed etc.). We would have v4 (1950-2022) and v5 (1950-2023) of the data.
@VeruGHub
Owner Author

The change from v3 to v4 was because E-Obs data was updated with a different spatial resolution, right?
I don't have a strong opinion on which is better now, but I find it important to keep all versions stored and accessible to the users, so maybe we need to be conservative and not create new versions very often. One option could be to state in the documentation that slight changes can happen in the data with the yearly updates because of E-Obs updates.
This made me think that maybe we would need to define a reference period to calculate monthly and yearly values so values do not change every year. What do you think?

@cpucher

cpucher commented Apr 22, 2024

As we changed the resolution to 500 m, it was clear that we would need to re-calculate the whole time series again. However, in previous iterations (v2 and v3) we also always updated the whole time series, although the resolution didn't change.

These are the changes (apart from continuing the previous time series) in the two E-Obs versions released since our last calculation:
v28.0e: New series are included for Campania and Trentino in Italy, and the elevation is corrected for German precipitation stations.
v29.0e: Included new stations and updates for Ukraine, Portugal and Belgium. Included data from Global Summary of the Day for southeast Europe. Updated Polish precipitation series that were wrongly included. Included radiation series for Trentino in Italy.

These changes may warrant a re-calculation of the whole time series. I guess there is also a reason why E-Obs always releases a new version instead of just "updating" the old one. We could also decide on an update policy, e.g. a new version only every 3 years and, in between, just updating the current version.
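One way to judge whether an E-Obs release warrants a full re-calculation would be to compare the old and the re-calculated values over the years both versions share. The following is only a sketch with plain dictionaries of yearly values; the function names and the tolerance are assumptions, not part of easyclimate or of our processing code.

```python
def max_abs_change(old, new):
    """Largest absolute difference between two yearly series over their shared years."""
    common = sorted(set(old) & set(new))
    if not common:
        raise ValueError("the two series share no years")
    return max(abs(new[year] - old[year]) for year in common)


def needs_new_version(old, new, tolerance):
    """True if any shared year changed by more than `tolerance` (a hypothetical threshold)."""
    return max_abs_change(old, new) > tolerance


# Made-up values for one pixel: the new series adds 2022 and revises 2021.
old_series = {2020: 1.0, 2021: 2.0}
new_series = {2020: 1.0, 2021: 2.5, 2022: 3.0}
print(max_abs_change(old_series, new_series))
```

What counts as a "substantial" change would still have to be agreed on; the threshold here is just a parameter.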

"I find it important to keep all versions stored and accessible to the users"
I don't agree that we need to keep all versions stored and accessible to the users; having 2-3 versions available should be enough. When it comes to reproducibility, the users have to store the data they have used for their analysis, and it shouldn't depend on us still providing, for instance, v1 of the data.

"This made me think that maybe we would need to define a reference period to calculate monthly and yearly values so values do not change every year. What do you think?"
This comment is not clear to me :-)

@Pakillo
Collaborator

Pakillo commented May 1, 2024

Hi,

I agree that, ideally, we should store all data versions for the sake of reproducibility. Last time we talked about this we discarded the idea for lack of resources ($$). But it would be nice to secure some online hosting to save all data versions.

Alternatively, we could publish the source code that takes the E-Obs dataset and produces the rasters that are then hosted in the FTP server and served through easyclimate. Archiving the source code is trivial and free (e.g. in Zenodo), and would permit anyone to reproduce the rasters in case they needed to. We would just need to specify which version of the E-Obs dataset was used in each of our data versions.
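As a rough sketch of what "specifying which E-Obs version was used" could look like, a small manifest archived next to the source code might be enough. The class, the field names and the single entry below are hypothetical placeholders, not a record of what was actually used for any release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRelease:
    data_version: str  # our own version label
    eobs_version: str  # E-Obs release the rasters were derived from
    first_year: int
    last_year: int


# Hypothetical entry for illustration only; the pairing is an assumption.
RELEASES = [
    DataRelease("v5", "29.0e", 1950, 2023),
]


def eobs_version_for(data_version, releases=RELEASES):
    """Look up the E-Obs release behind one of our data versions."""
    for release in releases:
        if release.data_version == data_version:
            return release.eobs_version
    raise KeyError(f"unknown data version: {data_version}")


print(eobs_version_for("v5"))
```

Anyone wanting to rebuild an older set of rasters could then fetch the matching E-Obs release and run the archived code against it.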

That would free us from having to store all former data versions and let us serve only the most recent and updated rasters (perhaps storing the penultimate version too, just in case). It looks like users will often request the latest year to be added soon, and IMO it looks better to serve the most correct, updated version whenever possible, rather than waiting 2-3 years between releases.

So, I think we could publish the source code and update the dataset yearly, but storing only the latest and penultimate version in the server. Does that sound like a good option to you?

@VeruGHub
Owner Author

I would like to relaunch this discussion! We get lost in details I think. I propose:

  • Update the whole time series every year (not only the new year) after the E-Obs release.
  • State in the documentation of the database and the package that reproducibility issues are possible if the data is not stored, because of slight modifications in the E-Obs data every year.
  • Store all data versions that we create, but only create a new version if there are substantial changes, such as the step from v3 to v4, and keep only the last two versions on the BOKU servers (the older one for at least a couple of years, giving time for all research based on the data to be published). I will try to find a storage solution for older data versions with Paloma.

What do you think?

And in relation to this:
"This made me think that maybe we would need to define a reference period to calculate monthly and yearly values so values do not change every year."
What I wanted to raise here is whether the average monthly/annual values need to be updated every year with the E-Obs release, or whether we define a reference period (e.g. 1980-2010) to calculate the averages and keep them more fixed.

@Pakillo
Collaborator

Pakillo commented Jul 20, 2024

Sounds good to me!

When you say "only create a new version if there are substantial changes", if we "update the whole time series every year", that means we will have one new version every year, right?

So, according to this plan, the server would store the current and the previous year's versions.
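A minimal sketch of that retention rule, assuming version labels of the form "v<number>" (the function name and the `keep` parameter are made up for illustration):

```python
def split_by_retention(versions, keep=2):
    """Split version labels into (served, archived), newest first.

    Assumes labels look like "v4", "v5"; `keep` is how many stay on the server.
    """
    ordered = sorted(versions, key=lambda v: int(v.lstrip("v")), reverse=True)
    return ordered[:keep], ordered[keep:]


served, archived = split_by_retention(["v3", "v4", "v5"])
print(served, archived)
```

With three versions and `keep=2`, the two newest stay on the server and the oldest goes to whatever archive we find.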

"we define a reference period (e.g. 1980-2010) to calculate the averages"

I understand you want to include climatological averages besides monthly and yearly rasters. I'm fine with that, but if we update the whole series every year, the averages should be updated too, otherwise the data would be incoherent. But I'm fine with setting a reference period (maybe 1990-2020 would be more useful). This average would have to be recalculated every year with the yearly E-OBS update.
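To illustrate with made-up numbers (the helper `period_mean` and all values are assumptions, not easyclimate code): an average over a fixed reference period does not move when a new year is appended, but it does move once the underlying years are revised by a new E-OBS release.

```python
def period_mean(series, start, end):
    """Mean of the yearly values whose year falls within [start, end]."""
    values = [value for year, value in series.items() if start <= year <= end]
    if not values:
        raise ValueError("no data inside the reference period")
    return sum(values) / len(values)


# Made-up yearly values for a single pixel.
original = {1990: 10.0, 1991: 11.0, 1992: 12.0}
extended = {**original, 1993: 14.0}  # new year appended, old years untouched
revised = {**extended, 1991: 12.5}   # an old year changed by a data update

for label, series in [("original", original), ("extended", extended), ("revised", revised)]:
    print(label, period_mean(series, 1990, 1992))
```

So a reference period protects the averages from the yearly extension, but not from re-calculating the whole series.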

@cpucher

cpucher commented Jul 29, 2024

"When you say "only create a new version if there are substantial changes", if we "update the whole time series every year", that means we will have one new version every year, right?"

I understand it the same way, a new version each year.

"I understand you want to include climatological averages besides monthly and yearly rasters. I'm fine with that, but if we update the whole series every year, the averages should be updated too, otherwise the data would be incoherent. But I'm fine with setting a reference period (maybe 1990-2020 would be more useful). This average would have to be recalculated every year with the yearly E-OBS update."

I'm not sure we have to provide this. We already give them yearly data, so they can calculate their periodic averages for whatever period they like if needed.
