String-valued dimension incorrectly loaded as matrix of characters #237

Datseris · 2023-11-08T14:06:45Z

Describe the bug

A colleague of mine who uses Python and xarray sent me a .nc file. One of the dimensions of the file has string values (i.e., it is like a list of names). When I try to load this file I get:

Dimensions        
   time = 4000    
   diagnostic = 19
   ic = 101       
   string13 = 13  

Variables
  values   (101 × 19 × 4000)
    Datatype:    Union{Missing, Float64} (Float64)
    Dimensions:  ic × diagnostic × time
    Attributes:
     _FillValue           = NaN

  time   (4000)
    Datatype:    Union{Missing, Float64} (Float64)
    Dimensions:  time
    Attributes:
     _FillValue           = NaN

  ic   (101)
    Datatype:    Int32 (Int32)
    Dimensions:  ic

  diagnostic   (13 × 19)
    Datatype:    Char (Char)
    Dimensions:  string13 × diagnostic
    Attributes:
     _Encoding            = utf-8

and accessing the diagnostic variable gives:

julia> v = data["diagnostic"]; v[:]
13×19 Matrix{Char}:
 's'   's'   's'   't'   't'   …  's'  's'   'a'   'a'   'a'        
 'a'   'a'   'a'   'e'   'e'      'a'  'e'   'm'   'm'   'a'        
 'l'   'l'   'l'   'm'   'm'      'l'  'a'   'o'   'o'   'b'        
 't'   't'   't'   'p'   'p'      't'  'i'   'c'   'c'   'w'        
 '_'   '_'   '_'   '_'   '_'      '_'  'c'   '_'   '_'   '\0'       
 't'   's'   's'   's'   's'   …  'f'  'e'   'm'   'E'   '\0'       
 'o'   'u'   'u'   'u'   'u'      'o'  '\0'  'a'   'Q'   '\0'       
 't'   'b'   'b'   'b'   'b'      'r'  '\0'  'x'   '\0'  '\0'       
 '\0'  '_'   '_'   '_'   '_'      'c'  '\0'  '\0'  '\0'  '\0'       
 '\0'  'N'   'S'   'N'   'S'      '_'  '\0'  '\0'  '\0'  '\0'       
 '\0'  'A'   'A'   'A'   'A'   …  't'  '\0'  '\0'  '\0'  '\0'       
 '\0'  '\0'  '\0'  '\0'  '\0'     'o'  '\0'  '\0'  '\0'  '\0'       
 '\0'  '\0'  '\0'  '\0'  '\0'     't'  '\0'  '\0'  '\0'  '\0'

Each column here is a variable name, padded with '\0' up to 13 characters, so each column should have been read as a string.
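For readers who meet the same layout: a fixed-width character array stores one name per column and fills the unused cells with NUL characters ('\0'), so the names can be recovered by joining each column and dropping the padding. The following is only a minimal sketch in plain Python, assuming the matrix has been copied into a list of rows. The helper name `columns_to_strings` and the small sample matrix are made up for illustration and are not part of NCDatasets or xarray.

```python
def columns_to_strings(char_rows):
    """Join each column of a character matrix into one string.

    char_rows is a list of rows, every row holding one character per
    column. Trailing NUL padding ('\\0') is stripped from each result.
    """
    columns = zip(*char_rows)  # transpose: iterate over columns
    return ["".join(col).rstrip("\0") for col in columns]


# Hypothetical 8x3 sample laid out like the 13x19 matrix above:
# one name per column, padded with '\0' up to the fixed width.
names = ["salt_tot", "seaice", "aabw"]
width = max(len(n) for n in names)
padded = [n.ljust(width, "\0") for n in names]
char_rows = [list(row) for row in zip(*padded)]  # rows x columns

decoded = columns_to_strings(char_rows)
print(decoded)
```

The same idea (join along the fixed-width dimension, then strip the padding) should carry over to any language that exposes the variable as a character matrix.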

To Reproduce

Please give me an email address that I can grant access to the file, as it is not possible to share the data publicly on GitHub. Once the file is downloaded, simply run the following to reproduce:

using NCDatasets

data = NCDataset("filename.nc")
v = data["diagnostic"]
v[:]

Expected behavior

The dimension values for "diagnostic" should be a vector of strings instead of a matrix of chars.
I admit, I do not know where the problem comes from. My colleague insists that he saves the data "correctly" with xarray and once he loads the data he gets the dimension as a vector of strings.

Environment

operating system: Windows 10
Julia version: 1.9.3
NCDatasets version: ⌅ [85f8d34a] NCDatasets v0.12.17 (currently checking if problem persists in new version 0.13)


Alexander-Barth · 2023-11-08T15:53:22Z

What is the output of ncdump -h file.nc?

Datseris · 2023-11-09T12:36:24Z

Hi, I do not know where I have to run this command: my shell does not recognize the name ncdump, and it does not appear to be a Julia command.

Meanwhile, my colleague has given me a way to reproduce the problem. In Python's xarray do:

Ntime = 4000
Nobs = 19
N = 101
data = np.empty((Ntime, Nobs, N))

observables = ['salt_tot', 'salt_sub_NA', 'salt_sub_SA', 'temp_sub_NA', 'temp_sub_SA', 'sst_NA', 'sst_SA', 'sss_NA', 'sss_SA', 'rho_sub_NA', 'rho_sub_SA', 'rho_NA', 'rho_SA', 'salt_forc', 'salt_forc_tot', 'seaice', 'amoc_max', 'amoc_EQ', 'aabw']

time_vector = 5.*np.arange(Ntime)
initial_cond = np.arange(Nobs)

ds = xr.Dataset({'values': (['time', 'diagnostic', 'ic'], data)}, coords={'time': time_vector, 'diagnostic': observables, 'ic': initial_cond})

ds.to_netcdf('file.nc')

wobagi · 2023-11-09T13:21:48Z

Just stumbled upon this topic and checked the output on my machine. The Python code has a typo. It should have
initial_cond = np.arange(N)
and it is lacking imports:

import xarray as xr
import numpy as np

Anyway, everything looks good on OS X 12.6.6 and Linux CentOS 7 using NCDatasets 0.12.17:

julia> v = data["diagnostic"]
diagnostic (19)
  Datatype:    String
  Dimensions:  diagnostic

julia> v[:]
19-element Vector{String}:
 "salt_tot"
 "salt_sub_NA"
 "salt_sub_SA"
 "temp_sub_NA"
 "temp_sub_SA"
 "sst_NA"
 "sst_SA"
 "sss_NA"
 "sss_SA"
 "rho_sub_NA"
 "rho_sub_SA"
 "rho_NA"
 "rho_SA"
 "salt_forc"
 "salt_forc_tot"
 "seaice"
 "amoc_max"
 "amoc_EQ"
 "aabw"

ncdump gives correct data here as well (I lowered the numbers for time and ic dimensions):

dimensions:
	time = 24 ;
	diagnostic = 19 ;
	ic = 3 ;
variables:
	double values(time, diagnostic, ic) ;
		values:_FillValue = NaN ;
	double time(time) ;
		time:_FillValue = NaN ;
	string diagnostic(diagnostic) ;
	int64 ic(ic) ;
}

Hopefully that helps.

Alexander-Barth · 2023-11-09T18:51:58Z

@Datseris If you need ncdump in the future, here is some information for Windows users: https://docs.unidata.ucar.edu/netcdf-c/current/winbin.html
https://pjbartlein.github.io/REarthSysSci/install_netCDF.html

It helps me a lot when users provide this additional information, as ncdump is independent of NCDatasets (and xarray) and shows the metadata of the NetCDF file as it is stored. I know it can take some time to get these tools installed on Windows, but ncdump is really valuable for troubleshooting issues with NetCDF files. With the shell tool ncgen I can generate a NetCDF file with exactly the same metadata as your file (but with "blank" data). So you do not even need to share your data file and we can still have a reproducible issue report.
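The header round trip described here can be sketched from Python by calling the two command-line tools. This is only an illustration under assumptions: the function names `header_commands` and `make_blank_copy` and the default file names are hypothetical, both tools must already be on the PATH, and depending on the header ncgen may need an extra format option (see its documentation).

```python
import shutil
import subprocess


def header_commands(src, cdl, blank):
    """Return the two commands: dump the header of src, rebuild blank."""
    dump = ["ncdump", "-h", src]         # header (metadata) only, as CDL
    build = ["ncgen", "-o", blank, cdl]  # new file from that CDL text
    return dump, build


def make_blank_copy(src, cdl="header.cdl", blank="blank.nc"):
    """Write the header of src to cdl, then generate blank from it."""
    if shutil.which("ncdump") is None or shutil.which("ncgen") is None:
        raise RuntimeError("ncdump and ncgen must be on the PATH")
    dump, build = header_commands(src, cdl, blank)
    with open(cdl, "w") as f:
        subprocess.run(dump, stdout=f, check=True)
    subprocess.run(build, check=True)


dump, build = header_commands("file.nc", "header.cdl", "blank.nc")
print(" ".join(dump))
print(" ".join(build))
```

Only the small CDL text file needs to be shared in an issue report; the data itself never leaves the reporter's machine.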

@wobagi Thanks a lot for your input and for correcting @Datseris's example. I just installed xarray (2023.10.1) and I get:

abarth@GHERLaptop ~ $ ncdump -h file.nc 
netcdf file {
dimensions:
	time = 4000 ;
	diagnostic = 19 ;
	ic = 101 ;
	string13 = 13 ;
variables:
	double values(time, diagnostic, ic) ;
		values:_FillValue = NaN ;
	double time(time) ;
		time:_FillValue = NaN ;
	int ic(ic) ;
	char diagnostic(diagnostic, string13) ;
		diagnostic:_Encoding = "utf-8" ;
}

So the data is indeed stored as a matrix of chars. Also, the file is a NetCDF 3 file (NetCDF 3 does not support strings).
But if I install the package netCDF4 (pip install -U netCDF4), the data is stored by xarray as:

netcdf file {
dimensions:
	time = 4000 ;
	diagnostic = 19 ;
	ic = 101 ;
variables:
	double values(time, diagnostic, ic) ;
		values:_FillValue = NaN ;
	double time(time) ;
		time:_FillValue = NaN ;
	string diagnostic(diagnostic) ;
	int64 ic(ic) ;
}

Now the data is a vector of strings (as in @wobagi's case) and the format is NetCDF4 (note that the data type of ic also changed).
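When ncdump is not at hand, the file format can also be guessed from the first bytes of the file: classic NetCDF files start with the ASCII letters CDF followed by a version byte, while NetCDF-4 files are HDF5 files and start with the HDF5 signature. The sketch below uses only the Python standard library. The function name `netcdf_kind` is made up for illustration, it only looks at offset 0 (HDF5 allows the signature at later offsets), and an HDF5 signature alone does not prove that the file was written through the NetCDF-4 library.

```python
import os
import tempfile

# Leading bytes that identify the on-disk format.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # NetCDF-4 files are HDF5 files
CLASSIC_VERSIONS = {
    1: "NetCDF classic (CDF-1)",
    2: "NetCDF 64-bit offset (CDF-2)",
    5: "NetCDF 64-bit data (CDF-5)",
}


def netcdf_kind(path):
    """Guess the NetCDF flavour of a file from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_SIGNATURE):
        return "NetCDF-4 (HDF5)"
    if head[:3] == b"CDF" and len(head) >= 4:
        return CLASSIC_VERSIONS.get(head[3], "unknown CDF version")
    return "not recognised as NetCDF"


# Demonstration on a fake file that carries only a classic header.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "fake_classic.nc")
    with open(fake, "wb") as f:
        f.write(b"CDF\x01" + b"\x00" * 28)
    print(netcdf_kind(fake))
```

Calling `netcdf_kind("file.nc")` on the two files discussed in this thread would be expected to report a classic format for the first and the HDF5-based format for the second.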

I think that NCDatasets is correct to read a matrix as a matrix and a vector as a vector.

(Maybe xarray should give the user a warning when NetCDF 4 features are "approximated" because python-netCDF4 is not installed, as in this case.)

Datseris · 2023-11-10T08:08:22Z

Thank you very much, you have proven concretely that this is not an issue with NCDatasets.jl. I will ask my colleague to update to NetCDF4.
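On the Python side, one way to avoid a quiet fallback is to name the engine and format explicitly when writing. The sketch below is an assumption about how the colleague's script could be adapted, not a tested recipe for that setup. The wrapper name `save_as_netcdf4` is hypothetical, `format` and `engine` are keyword arguments of xarray's `Dataset.to_netcdf`, and the `demo` function (not called here) needs numpy, xarray and netCDF4 installed.

```python
def save_as_netcdf4(ds, path):
    """Write ds (expected to be an xarray.Dataset) as a NetCDF-4 file.

    Naming the engine and format explicitly means a missing netCDF4
    package should produce an error instead of a quiet fallback to a
    NetCDF 3 writer.
    """
    ds.to_netcdf(path, format="NETCDF4", engine="netcdf4")


def demo():
    # Needs numpy, xarray and netCDF4 installed; shapes are shrunk
    # compared with the example earlier in the thread.
    import numpy as np
    import xarray as xr

    observables = ["salt_tot", "seaice", "aabw"]
    data = np.zeros((4, len(observables), 2))
    ds = xr.Dataset(
        {"values": (["time", "diagnostic", "ic"], data)},
        coords={
            "time": 5.0 * np.arange(4),
            "diagnostic": observables,
            "ic": np.arange(2),
        },
    )
    save_as_netcdf4(ds, "file.nc")
```

Running `ncdump -h file.nc` on the result should then show `string diagnostic(diagnostic)` rather than a `char` variable with an extra `string13` dimension.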

Datseris closed this as completed Nov 10, 2023
