netcdf error on remote data #457

Closed · ryofurue opened this issue Jun 1, 2023 · 9 comments · Fixed by #458

ryofurue commented Jun 1, 2023

In #421 , @rafaqz said

Best to put this in a real issue so it's visible.

So I'm posting it here as a new issue.

The source code

using Rasters
url = "http:https://apdrc.soest.hawaii.edu:80/dods/public_data/" *
  "Reanalysis_Data/NCEP/NCEP2/daily/surface/mslp"
r = Raster(url; name="mslp", source=:netcdf)

results in the error shown at the end of this message. If you replace the url with a local netCDF filename, the code works.

I use Rasters v0.7.2 and NCDatasets v0.12.16 on Julia 1.9.0 on macOS 13.4.

ERROR: MethodError: no method matching _open(::Type{Array}, ::Type{Rasters.GDALsource}, ::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDatasets.NCDataset{Nothing}}, NCDatasets.Attributes{NCDatasets.NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Vararg{Nothing, 5}}}}; key::String)

Closest candidates are:
  _open(::Any, ::Type{Rasters.NCDsource}, ::CommonDataModel.CFVariable; kw...)
   @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:248
  _open(::Any, ::Type{Rasters.GDALsource}, ::AbstractString; write, kw...)
   @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/gdal.jl:128
  _open(::Any, ::Type{Rasters.GDALsource}, ::ArchGDAL.RasterDataset; kw...)
   @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/gdal.jl:144
  ...

Stacktrace:
  [1] Raster(ds::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDatasets.NCDataset{Nothing}}, NCDatasets.Attributes{NCDatasets.NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Vararg{Nothing, 5}}}}, filename::String, key::String; crs::Nothing, mappedcrs::Nothing, dims::Nothing, refdims::Tuple{}, name::Symbol, metadata::DimensionalData.Dimensions.LookupArrays.Metadata{Rasters.NCDsource, Dict{String, Any}}, missingval::Missing, source::Nothing, write::Bool, lazy::Bool, dropband::Bool)
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/array.jl:321
  [2] Raster(ds::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDatasets.NCDataset{Nothing}}, NCDatasets.Attributes{NCDatasets.NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Vararg{Nothing, 5}}}}, filename::String, key::String)
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/array.jl:308
  [3] Raster(ds::NCDatasets.NCDataset{Nothing}, filename::String, key::String; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:57
  [4] Raster(ds::NCDatasets.NCDataset{Nothing}, filename::String, key::String)
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:46
  [5] (::Rasters.var"#61#62"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, String})(ds::NCDatasets.NCDataset{Nothing})
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/array.jl:305
  [6] _open(f::Rasters.var"#61#62"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, String}, ::Type{Rasters.NCDsource}, ds::NCDatasets.NCDataset{Nothing}; key::Nothing, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:246
  [7] _open
    @ ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:244 [inlined]
  [8] #911
    @ ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:241 [inlined]
  [9] NCDatasets.NCDataset(::Rasters.var"#911#912"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Rasters.var"#61#62"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, String}}, ::String, ::Vararg{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ NCDatasets ~/.julia/packages/NCDatasets/wsGwu/src/dataset.jl:255
 [10] NCDatasets.NCDataset(::Function, ::String, ::Vararg{String})
    @ NCDatasets ~/.julia/packages/NCDatasets/wsGwu/src/dataset.jl:252
 [11] #_open#910
    @ ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:240 [inlined]
 [12] _open(f::Function, ::Type{Rasters.NCDsource}, filename::String)
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/ncdatasets.jl:237
 [13] _open(f::Function, filename::String; source::Type, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Rasters ~/.julia/packages/Rasters/0lwLA/src/sources/sources.jl:52
 [14] _open
    @ ~/.julia/packages/Rasters/0lwLA/src/sources/sources.jl:51 [inlined]
 [15] #Raster#60
    @ ~/.julia/packages/Rasters/0lwLA/src/array.jl:303 [inlined]
 [16] top-level scope
    @ REPL[3]:1
rafaqz (Owner) commented Jun 1, 2023

You can see the problem: GDALsource appears in your error.

Somehow NCDsource is not propagating when you specify it manually, although it gets most of the way there. On array.jl line 321 the source is now GDALsource, so we have probably missed passing the source keyword through somewhere.

This likely snuck through because the tests for this propagation use GDALsource; we clearly need to test it for every source separately.

rafaqz (Owner) commented Jun 3, 2023

This will be fixed in the next release. But just a note on your MWE:

You likely want to write this:

r = Raster(url; name=:mslp, source=:netcdf, lazy=true)

With lazy=true you can load just the slices you need. This variable is 700 MB and the download speed is pretty bad, but single time slices (or, better, chunks of 10 or so) are nearly instant because they're so small.
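For example, something like this (a minimal sketch, reusing the url from the original post and assuming the time dimension is named Ti, which is how Rasters.jl labels time dimensions by default):

r = Raster(url; name=:mslp, source=:netcdf, lazy=true)  # opens lazily, transfers no data
slice = r[Ti(1)]     # reads only the first time step
block = r[Ti(1:10)]  # or a small chunk of 10 steps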

ryofurue (Author) commented Jun 4, 2023

Thank you for the very fast fix!

With lazy=true you can load just the slices you need.

Would you elaborate on that? When you use the netCDF library directly, it's always lazy, whether reading a local file or fetching data from a remote OPeNDAP server.

rafaqz (Owner) commented Jun 4, 2023

Yes, maybe it's lazy with those libs...

But Rasters.jl will always load files eagerly unless you use lazy=true, to keep things standardised.

If you load a local file or a small file from a URL, you usually want it loaded into memory. You can do more things with an in-memory array, like index into it in a for loop.
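Roughly, the two modes look like this (a sketch with hypothetical file names):

r = Raster("small_local.tif")             # eager (the default): data is an in-memory Array
s = sum(skipmissing(r))                   # cheap, everything is already in memory

rl = Raster("huge_remote.nc"; lazy=true)  # lazy: nothing is read yet
a = read(rl)                              # explicitly load the whole array when you decide to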

We could check the size and switch modes automatically like some R packages do. But someone would have to write those heuristics, and it could lead to scripts behaving differently in different circumstances. So I'm personally happy with forcing people to be explicit about what they want.

ryofurue (Author) commented Jun 4, 2023

But Rasters.jl will always load files eagerly unless you use lazy=true, to keep things standardised.

I see! But I was wondering whether it would be more helpful to fall back to the backend's default when lazy is not specified.

The reason the netCDF library does things lazily is that it's designed to handle huge files. Its goal is to present an interface that lets the user treat a big file as if it were a multidimensional array.

This is because our datasets (meteorology, oceanography, climate science) are often hundreds of gigabytes and sometimes several terabytes or more.

Currently it's not common to have a single netCDF file that big; typically you have one netCDF file per time step (snapshot). But an OPeNDAP server presents the time series of snapshots to the remote user as if it were a single huge 4D netCDF variable.

To give you an idea, we sometimes use a time series (t) of a 3D field (x, y, z). Each snapshot is 2.14 GiB, and we sometimes have something like 10,000 snapshots in one dataset. We plan to serve this data on an OPeNDAP server in the near future.

Because this is our situation, no netCDF user expects that merely opening a netCDF file will read the whole array into memory. (I deliberately chose a time series of small 2D (x, y) arrays as the example in my initial post to save network traffic.)

rafaqz (Owner) commented Jun 4, 2023

lazy=true is not really the same as the backend's lazy. It's very lazy. Making it the default is probably not what you think it is.

The file is not even held open (so we don't hit open-file limits for, say, a decade-long time series where every day has its own file with hourly slices). And if you broadcast over the array it will be instant: the data is never touched until you index into it with getindex or call read on the whole raster. This is DiskArrays.jl underneath, and in the future it will be DiskArrayEngine.jl as an optimised lazy backend.
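As a sketch (hypothetical file name):

rl = Raster("huge.nc"; lazy=true)  # nothing is read; the file isn't even held open
scaled = rl .* 0.01                # broadcasting stays lazy, so this is instant
a = read(scaled)                   # only here is the data actually read from disk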

So Rasters.jl is an abstraction over the backends that makes them work the same way, though not necessarily the same way as the underlying source. My concern is that introducing caveats to that means code will not be transferable between tiff and netcdf files like it is currently, and the return object will not be guessable from the code. With your idea, if you wrote something for a tif with a for loop over the array using plain Raster, it would fail or take an hour if you swapped the file for a netcdf.

Basically, there is a cost to adding the caveats you suggest.

Everything used to be lazy by default, but that proved not to be the majority use case. Writing lazy=true is pretty easy, and incredibly readable.

rafaqz (Owner) commented Jun 4, 2023

Sorry, closing wasn't meant to be an end to the conversation! It was just automatic from the PR ;)

ryofurue (Author) commented Jun 6, 2023

lazy=true is not really the same as the backend's lazy. It's very lazy. Making it the default is probably not what you think it is.

Sorry for my denseness. I feel stupid, but I'm not sure if I understand you correctly.

Let me proceed slowly... by asking some random questions.

What happens if you open a GeoTIFF lazily?

# no data transfer from the file to memory:
r = Raster("somedata.tiff"; name="temp", lazy=true)
jmax = size(r, 2)
for j = 1:jmax
   a = r[:, j]  # slice j is transferred from the file to memory?
   # do something with a
end

Would this code be much faster with lazy=false?

Another question (or just confirmation): I think you said that this code

r = Raster("somedata.nc"; name = "temp")

will transfer the whole array into memory. Is that right?

If the answers to both questions are yes, then it would be helpful if lazy=true were the default for netCDF and lazy=false the default for tiff. Or perhaps use some heuristics, as you say, to determine lazy or eager depending on the total size to be transferred into memory.

if you wrote something for a tif with a for loop over the array using plain Raster, it would fail or take an hour if you swapped the file for a netcdf.

I agree, and your argument indicates that lazy=true should be the default for all formats! :-) With lazy=true, your code may be less than optimally fast for tiff, but that shouldn't be a big problem, because in that case we are talking about a "small" dataset that fits in main memory.

If you really want uniformity across different file formats, then defaulting to lazy=true is the way to go, for this reason: tiff files still work reasonably fast, and your program keeps working even if the file is huge. If you really need fast code, you add lazy=false for your tiff data as a performance tuning.

But I'm not sure what you buy with such uniformity. I don't see any real-world drawback to defaulting to lazy=true for netCDF and to lazy=false for tiff.


Off topic: currently I'm the only Julia user in my group. My colleagues are satisfied with Python's xarray. But there is a great opportunity for Julia, because looping over arrays is extremely slow in Python, and some of our algorithms are much easier to write with explicit looping. With Python, you look for an existing package that does what you want, or write very unnatural code using existing packages, or suffer the slowness, or write a C or Fortran program to be called from Python.

In this context, I have great hope that Rasters.jl will be (or already is) a great alternative to xarray, and a much better one.

rafaqz (Owner) commented Jun 6, 2023

I really get your argument, and the argument about a default value of lazy=true is a totally valid one.

Another key reason I changed it to false is that DiskArrays.jl is not totally mature, so it is less reliable as a default than Array, with some edge-case failures I would rather avoid where it's not actually necessary to be using it.

With time and stability we can maybe revert that decision.

Last, reducing complexity is important here, because this package is a small fraction of my spare time, while xarray has a full-time team and many contributors.

What I buy by not making this change is not doing work I personally don't need: more tests, a bunch of functions to set laziness based on source type, and documentation for the differences.

So for this to happen, DiskArrays.jl will need to get more stable (and I'm one of two very part-time devs there too), and someone will have to make a PR that does all of these things.
