[Doc]: are there some xcdat test files (that can be predownloaded) ? #277

Open
jypeter opened this issue Jul 22, 2022 · 8 comments
Labels
good-first-issue Good first issue for new contributors type: docs Updates to documentation

Comments

@jypeter

jypeter commented Jul 22, 2022

Describe your documentation update

I wonder if there are xCDAT (or xarray) test files that can be (pre)downloaded and used for:

  • testing xCDAT with known (and local) files
  • examples and tutorials
  • having local data files that you can use when you have no network or low bandwidth

I'm thinking of (something like) the cdms2/vcs test data

I think these files are the ones listed in CDMS Sample Dataset and they are still online!

@jypeter jypeter added the type: docs Updates to documentation label Jul 22, 2022
@pochedls
Collaborator

I like this idea, but I'm wondering how this could be implemented in a way that is easy to maintain. Perhaps we could add some functionality to directly download example netCDF files (e.g., from ESGF), something like xcdat.get_test_data()?

I was curious about what xarray does – it seems like they generate toy data rather than providing data.

Should this be a discussion item?

@jypeter
Author

jypeter commented Jul 29, 2022

This is the up-to-date link for toy data you mentioned, but I'd rather have data coming from actual netCDF files than toy data generated in memory!

Some not-too-big test data files could come from ESGF, the way I've done it in #284, but we also need a way to get other static/known test data files:

  • a subset (e.g., a few time steps) of real ESGF data, because you don't want huge files when the data has lots of time steps or vertical levels. A script using xcdat to download and then save a subset of ESGF data (e.g., the first 10 time steps and just a few pressure or depth levels of the Northern Hemisphere) would be a useful example anyway
  • data with some known errors (e.g., [Bug]: open_dataset should handle missing bounds on ORCA grid more gracefully #284, incorrectly masked data, incorrect metadata, ...) that you want to be sure xcdat can handle, along with example scripts showing how to correct the files and save the corrected versions
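
The subsetting idea in the first bullet could be sketched roughly as follows. This is only an illustration on synthetic data, not an xCDAT feature: the function name `subset_for_testing`, the helper `make_demo_dataset`, and the dimension names `time`, `plev` and `lat` (typical of CMIP output) are assumptions. With a real file, the dataset would instead come from something like `xcdat.open_dataset(...)`.

```python
import numpy as np
import xarray as xr


def make_demo_dataset():
    """Build a small synthetic dataset standing in for real ESGF data (assumption)."""
    time = np.arange(24)  # plain integers keep the demo free of calendar handling
    plev = np.array([100000.0, 85000.0, 70000.0, 50000.0, 25000.0])
    lat = np.linspace(-87.5, 87.5, 36)
    lon = np.arange(0.0, 360.0, 10.0)
    ta = np.zeros((time.size, plev.size, lat.size, lon.size), dtype="float32")
    return xr.Dataset(
        {"ta": (("time", "plev", "lat", "lon"), ta)},
        coords={"time": time, "plev": plev, "lat": lat, "lon": lon},
    )


def subset_for_testing(ds, n_time=10, plevs=(85000.0, 50000.0), lat_range=(0.0, 90.0)):
    """Keep the first n_time steps, a few pressure levels and one latitude band."""
    # Positional selection of the first n_time time steps.
    small = ds.isel(time=slice(0, n_time))
    # Nearest-neighbour lookup, so approximate level values still match.
    small = small.sel(plev=list(plevs), method="nearest")
    # Integer positions from a mask, so ascending or descending lat both work.
    lat = small["lat"].values
    keep = np.flatnonzero((lat >= lat_range[0]) & (lat <= lat_range[1]))
    return small.isel(lat=keep)


small = subset_for_testing(make_demo_dataset())
print(dict(small.sizes))
```

The reduced dataset could then be written out with xarray's `to_netcdf` method, which needs a netCDF backend installed and is not exercised here.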

I have just checked that cartopy mostly generates toy data on the fly for its examples, but iris uses a directory with actual data files (the way vcs and cdms2 did)

>>> import iris
>>> help(iris.sample_data_path)
sample_data_path(*path_to_join)
    Given the sample data resource, returns the full path to the file.

    .. note::

        This function is only for locating files in the iris sample data
        collection (installed separately from iris). It is not needed or
        appropriate for general file access.

>>> iris.sample_data_path("E1_north_america.nc")
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/E1_north_america.nc'

ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/
total 24M
-rw-rw-r-- 2 jypeter lsce 110K Jun 25  2020 A1B.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25  2020 A1B_north_america.nc
-rw-rw-r-- 2 jypeter lsce  28K Jun 25  2020 air_temp.pp
-rw-rw-r-- 2 jypeter lsce  34K Jun 25  2020 atlantic_profiles.nc
-rw-rw-r-- 2 jypeter lsce 3.5M Jun 25  2020 colpex.pp
-rw-rw-r-- 2 jypeter lsce 110K Jun 25  2020 E1.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25  2020 E1_north_america.nc
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10  2021 GloSea4/
-rw-rw-r-- 2 jypeter lsce 662K Jun 25  2020 hybrid_height.nc
-rw-rw-r-- 2 jypeter lsce 7.5M Jun 25  2020 NAME_output.txt
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10  2021 NEMO/
-rw-rw-r-- 2 jypeter lsce 2.0M Jun 25  2020 orca2_votemper.nc
-rw-rw-r-- 2 jypeter lsce 1.7M Jun 25  2020 ostia_monthly.nc
-rw-rw-r-- 2 jypeter lsce  26K Jun 25  2020 polar_stereo.grib2
-rw-rw-r-- 2 jypeter lsce 110K Jun 25  2020 pre-industrial.pp
-rw-rw-r-- 2 jypeter lsce  19K Jun 25  2020 rotated_pole.nc
-rw-rw-r-- 2 jypeter lsce 163K Jun 25  2020 SOI_Darwin.nc
-rw-rw-r-- 2 jypeter lsce 243K Jun 25  2020 space_weather.nc
-rw-rw-r-- 2 jypeter lsce 514K Jun 25  2020 toa_brightness_stereographic.nc
-rw-rw-r-- 2 jypeter lsce 3.3M Jun 25  2020 uk_hires.pp
drwxr-xr-x 2 jypeter lsce  12K Sep 10  2021 UM/
-rw-rw-r-- 2 jypeter lsce 2.4K Jun 25  2020 wind_speed_lake_victoria.pp
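
An iris-style lookup over a fixed directory of sample files could be sketched as below. This is a minimal illustration using only the standard library; the names `sample_data_dir`, `list_samples`, `sample_data_path` (as written here) and the environment variable `XCDAT_SAMPLE_DATA_DIR` are hypothetical and do not exist in xCDAT or iris.

```python
import os
from pathlib import Path

# Hypothetical environment variable naming a fixed, shared sample-data directory.
SAMPLE_DIR_ENV = "XCDAT_SAMPLE_DATA_DIR"


def sample_data_dir(default_dir):
    """Return the sample-data directory: the env override if set, else default_dir."""
    return Path(os.environ.get(SAMPLE_DIR_ENV, default_dir))


def list_samples(data_dir):
    """Return the sorted names of the regular files directly inside data_dir."""
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_file())


def sample_data_path(data_dir, *path_to_join):
    """Return the full path of a sample file, failing early if it is missing."""
    target = Path(data_dir).joinpath(*path_to_join)
    if not target.is_file():
        raise FileNotFoundError(f"no sample file at {target}")
    return str(target)
```

Because the directory is passed in explicitly (or set once through the environment variable), an administrator installing for several users could point it at a read-only location inside the distribution rather than at a per-user cache.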

@tomvothecoder
Collaborator

tomvothecoder commented Aug 1, 2022

Thanks for this @jypeter. This has been discussed and has been on our minds, although a GH issue was never opened for it.

I explored a possible implementation similar to xarray. xarray uses a GH repo (https://github.com/pydata/xarray-data) to host test datasets, and provides xarray.tutorial methods to open up the test datasets using a package called pooch.

We didn't pursue this idea since xarray supports direct download of data using OpenDAP. However, I think this idea is worthwhile because it standardizes and streamlines the testing processes with easy access to the same real-world datasets.

@jypeter
Author

jypeter commented Aug 2, 2022

Hmmm, I had a quick look at the pooch GH page. It looks really nice and fancy but:

  • it may be overkill for our purpose, from the end user's point of view. But xCDAT could indeed use it behind the scenes! Or possibly just use requests
  • specifying the input files seems a bit complicated, but it's OK if it only happens behind the scenes. The end user should only have to specify a file name, and some xCDAT function should provide the path (either the directory where the file is located, or a full path)
  • you have to be careful where the data files are located! I'm not too sure about a cache that usually depends on the user login or something. When, like me, you install a Python distribution for multiple users (where the person installing can write, but other users can't), it's convenient to have files installed in a fixed subdirectory of the distribution's lib directory. And I hate default cache locations in hidden subdirectories of the users' home dir. We have nightly backups of the home dirs at LSCE, and we archive the interns' home dirs when they are finished. I don't want to have backups of hidden test files!
  • See also the ongoing issue Clarifying the 'under the hood' data download (and other GIS stuff) SciTools/cartopy#1325, about file location and cache problems
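
The "file name in, path out" behaviour described in these bullets, with an explicit download directory instead of a hidden per-user cache, could look roughly like this. It is a sketch using only the standard library (neither pooch nor requests); `fetch_sample` and `sha256_of` are hypothetical names, and the base URL and checksum would have to come from whoever hosts the files.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_sample(name, base_url, data_dir, expected_sha256=None):
    """Return the local path of a sample file, downloading it only if absent.

    data_dir is chosen by the caller, so nothing is written to a hidden
    per-user cache unless the caller asks for that location.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / name
    if not target.exists():
        # Download to a temporary name first so a partial file is never reused.
        partial = target.with_name(target.name + ".part")
        url = f"{base_url.rstrip('/')}/{name}"
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)
    if expected_sha256 is not None and sha256_of(target) != expected_sha256:
        raise ValueError(f"checksum mismatch for {target}")
    return target
```

A second call with the same `name` and `data_dir` returns the existing file without touching the network, which covers the no-network and low-bandwidth use case once the files have been fetched.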

Having a dedicated python package with just the data could also be an easy solution: e.g. basemap-data-hires

@jypeter
Author

jypeter commented Aug 2, 2022

Another example of packaged sample data, this time from xoa:

>>> import xoa

>>> xoa.show_data_samples()
gdp-6203641.csv hycom.gdp.u.nc hycom.gdp.v.nc hycom.gdp.h.nc croco.south-africa.surf.nc hycom.cfg croco.cfg gdp.cfg mercator.cfg argo.cfg croco.south-africa.zonal.nc croco.south-africa.meridional.nc ibi-argo-7900573.nc argo-7900573.nc

>>> xoa.get_data_sample('hycom.gdp.u.nc')
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples/hycom.gdp.u.nc'

> du -sh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
1.1M    /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples

> ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
total 1.1M
-rw-rw-r-- 2 jypeter lsce  92K Feb 25 09:56 argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce  305 Feb 25 09:56 argo.cfg
-rw-rw-r-- 2 jypeter lsce  714 Feb 25 09:56 croco.cfg
-rw-rw-r-- 2 jypeter lsce  61K Feb 25 09:56 croco.south-africa.meridional.nc
-rw-rw-r-- 2 jypeter lsce 190K Feb 25 09:56 croco.south-africa.surf.nc
-rw-rw-r-- 2 jypeter lsce  61K Feb 25 09:56 croco.south-africa.zonal.nc
-rw-rw-r-- 2 jypeter lsce  43K Feb 25 09:56 gdp-6203641.csv
-rw-rw-r-- 2 jypeter lsce   73 Feb 25 09:56 gdp.cfg
-rw-rw-r-- 2 jypeter lsce  487 Feb 25 09:56 hycom.cfg
-rw-rw-r-- 2 jypeter lsce 174K Feb 25 09:56 hycom.gdp.h.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.u.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.v.nc
-rw-rw-r-- 2 jypeter lsce  71K Feb 25 09:56 ibi-argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce  195 Feb 25 09:56 mercator.cfg

@durack1
Collaborator

durack1 commented Aug 10, 2022

@tomvothecoder was there a plan to have a test suite with just the kind of (few timesteps) data that @jypeter was describing? It seems that CDAT was using the sample_data subdir which enabled testing in the CI envs, similar to what iris appears to do (#277 (comment) above)

@jypeter
Author

jypeter commented Aug 12, 2022

Note: see example usage of vcs.sample_data + '/tas_mo.nc' in #310 (comment)

@jypeter
Author

jypeter commented Dec 14, 2023

I have added an Easy to use datasets section to my python page, with test/tutorials datasets from several packages

@tomvothecoder It seems that xarray uses xarray.tutorial.load_dataset. Maybe xcdat could have a similar xcdat.tutorial.load_dataset pointing to some useful sample CMIP6 data (and possibly the equivalent CMIP5 data, if somebody wants to make a CMIP5/CMIP6 comparison example)

@tomvothecoder tomvothecoder added the good-first-issue Good first issue for new contributors label Jun 5, 2024
Projects
Status: Todo
Development

No branches or pull requests

4 participants