Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

standardise metadata as Metadata{X,Dict{String,Any}} #304

Merged
merged 2 commits into from
Sep 25, 2022
Merged

Conversation

rafaqz
Copy link
Owner

@rafaqz rafaqz commented Sep 4, 2022

@ryofurue this PR standardise all RasterStack, Raster and LookupArray metadata to be Metadata{SOURCE,Dict{String,Any}}.

The intention is to allow more strings keys that CF standards allow that are either awkward or impossible with Symbol keys. It's also a breaking change that will require a minor version bump. See also: rafaqz/DimensionalData.jl#406

What is further needed is to define the lists of allowed keys for all metadata subtypes, and allow users to use a plain Dict to define keys, that we can check against allowed keys to warn on mistakes.

@ryoforue if you have any time to collate these allowed key/value combinations from the CF standards, GDAL and Rs GRI/GRI files, or for even some of these types, we could get started on implementing proper metadata checking, conversion, and writing (as outlined in #3 the oldest open issue here!). Seems I haven't found time to do it in the last 3 years 😅

@ryofurue
Copy link

ryofurue commented Sep 7, 2022

collate these allowed key/value combinations from the CF standards, GDAL and Rs GRI/GRI files

Checks to ensure that the metadata conforms to each particular standard would be nice to have, but I was wondering if they would be necessary . . .

  1. I imagine there are pieces of metadata which would cause error when the Julia library (Raster etc) tries to write them into the file. The user needs to be at least notified of such errors. Even then, the Julia library doesn't have to take any kind actions as long as the presence of the error gets known to the user. For example, Raster could write a key like "a#b" into a netCDF file, even though the netCDF standard doesn't allow for it. In such a case, the underlying netCDF library will complain and that would be sufficient.

  2. If the metadata simply breaks "conventions" without causing error, we could let it go. For example, a netCDF file is usable (and can be useful) even if it doesn't conform to the CF conventions. The "units" attribute can be useful even if it doesn't conform to the CF conventions. And so on.

So, most of such checks as you describe can wait, I would think.

@rafaqz
Copy link
Owner Author

rafaqz commented Sep 7, 2022

The reason they are necessary is we can already write a tiff to a netcdf file, and the reverse.

Before we go writing arbitrary metadata between file types we should know the consequences of that.

(One option is to write any metadata from Dict and Metadata{Nothing} and disallow it from Metadata{SOURCE} - then we can be surer the user is intending to write this metadata, rather than doing it unintentionally. We just need to be clear that is what we are doing, rather than just writing all metadata. Ultimately I would prefer you can write all metadata or only same source metadata - thus my idea to do this properly now)

@ryofurue
Copy link

ryofurue commented Sep 8, 2022

Before we go writing arbitrary metadata between file types we should know the consequences of that.

I agree. But, I was concerned about your mention of CF convensions and units. Violations of these conventions wouldn't have "serious" consequences. You don't need to implement checks against these. That was my point.

As to serious consequences for netCDF . . .

  1. Names of variables, dimensions, and attributes: In addition to the regular alphanum ASCII characters, several special characters are allowed, such as ".", "_" . . . (Is there a length limit?)

  2. Data types are precisely defined: Float64, Float32, String (but I don't know whether it's a C-string or not, or what charatersets are allowed), Int32, and so on.

  3. Endianness. I don't know if it's really possible to cause inconsistency in this area. But, if there is any chance of inconsistency, one must check against that, I think. For example, the metadata typically includes the missing value (the _FillValue attribute ), and what if the variable itself and the missing value are stored in different endiannesses?

I can imagine that violation of these would cause serious problems.

@mauro3
Copy link
Collaborator

mauro3 commented Sep 8, 2022

Regarding endianness: isn't that taken care of by the netcdf-library through which the reading-writing of the files goes?

@rafaqz
Copy link
Owner Author

rafaqz commented Sep 8, 2022

Yeah I doubt endianness will be an issue.

Another thing not mentioned here is the fields that NCDatasets.jl writes itself and how it interprets having those fields manually set. That may cause some problems with the current write of metadata as well.

@codecov-commenter
Copy link

codecov-commenter commented Sep 25, 2022

Codecov Report

Merging #304 (d6bba74) into main (5024776) will increase coverage by 0.05%.
The diff coverage is 81.81%.

@@            Coverage Diff             @@
##             main     #304      +/-   ##
==========================================
+ Coverage   80.02%   80.08%   +0.05%     
==========================================
  Files          38       38              
  Lines        2889     2897       +8     
==========================================
+ Hits         2312     2320       +8     
  Misses        577      577              
Impacted Files Coverage Δ
src/sources/smap.jl 0.00% <0.00%> (ø)
src/sources/ncdatasets.jl 87.55% <75.00%> (ø)
src/sources/gdal.jl 91.17% <100.00%> (ø)
src/sources/grd.jl 95.76% <100.00%> (ø)
src/utils.jl 79.72% <100.00%> (+2.45%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@rafaqz
Copy link
Owner Author

rafaqz commented Sep 25, 2022

@ryofurue what I meant by consequences was the ability to write e.g. a scale_factor field that changes the values that NCDatasets.jl will return (as it will automatically adjust the values). Or something similar to that.

It is not clear which of those netcdf attribute fields will actually result in reading different data to what is written, or if that is possible with NCDatasets.jl.

@rafaqz rafaqz merged commit f382455 into main Sep 25, 2022
@rafaqz rafaqz deleted the better_metadata branch September 25, 2022 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants