netcdf error on remote data #457
Can you see the problem? Likely this snuck through because the tests for this propagation only cover local files.
This will be fixed in the next release. But just a note on your MWE: you likely want to write this: `r = Raster(url; name=:mslp, source=:netcdf, lazy=true)`. With `lazy=true`, the data won't be loaded eagerly.
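The suggested call can be sketched in context. This is a minimal sketch, not the original report's code: the URL below is a placeholder for an OPeNDAP endpoint, and the `Ti(1)` time-dimension selector is my assumption about the variable's layout.

```julia
using Rasters, NCDatasets

# Placeholder OPeNDAP URL — substitute a real endpoint.
url = "https://example.org/opendap/mslp.nc"

# `lazy=true` opens the dataset without transferring the array;
# `source=:netcdf` tells Rasters.jl which backend to use, since the
# format can't be reliably inferred from a URL.
r = Raster(url; name=:mslp, source=:netcdf, lazy=true)

# Data is only fetched when you index or `read` it:
first_slice = r[Ti(1)]   # transfers just one time step
whole = read(r)          # materialises the full array in memory
```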
Thank you for the very fast fix!
Would you elaborate on it? When you directly use the netCDF library, it's always lazy, whether reading a local file or fetching data from a remote OPeNDAP server.
Yes, maybe it's lazy with those libs... But Rasters.jl will always load files eagerly unless you use `lazy=true`. If you load a local file or a small file from a URL you usually want it loaded into memory. You can do more things with an in-memory array, like index it in a for loop.

We could check the size and switch modes automatically like some R packages do. But someone would have to write those heuristics, and it could lead to scripts behaving differently in different circumstances. So I'm personally happy with forcing people to be explicit about what they want.
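The two modes described here can be sketched side by side. This assumes a hypothetical local file `somedata.nc` containing a `temp` variable:

```julia
using Rasters, NCDatasets

# Eager (the default): the whole variable is read into memory.
eager = Raster("somedata.nc"; name=:temp)             # backed by a plain Array
eager .+ 1                                            # ordinary in-memory broadcast

# Lazy: only metadata is read; data stays on disk until indexed.
lazyr = Raster("somedata.nc"; name=:temp, lazy=true)  # disk-backed
chunk = lazyr[:, :, 1]                                # reads just this slice
full  = read(lazyr)                                   # explicit full load
```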
I see! But I was wondering whether it would be more helpful to default to the backend's default when `lazy` is not specified.

The reason the netCDF library does things lazily is that it's designed to handle huge files. Its goal is to present the user an interface that treats a big file as if it were a multidimensional array. This is because our datasets (meteorology, oceanography, climate science) are often hundreds of gigabytes and sometimes several terabytes or more.

Currently, it's not common to have a single netCDF file that big. Typically you have one netCDF file per time step (snapshot). But an OPeNDAP server presents the time series of snapshots to the remote user as if it were a single huge 4D netCDF variable.

To give you an idea, we sometimes use a time series (t) of a 3D field (x, y, z). Each snapshot is 2.14 GiB and we sometimes have something like 10,000 snapshots for one dataset. We plan to serve this data on an OPeNDAP server in the near future.

Because this is our situation, no netCDF user expects that merely opening a netCDF file will read the whole array into memory. (I deliberately chose a time series of small 2D (x, y) arrays as an example in my initial post to save network traffic.)
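For scale, the figures above imply a dataset far too large to load eagerly:

```julia
# Total size of the dataset described above:
snapshot_gib = 2.14      # one 3D (x, y, z) snapshot, in GiB
nsnapshots   = 10_000    # time steps in the series
total_tib = snapshot_gib * nsnapshots / 1024   # ≈ 20.9 TiB
```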
The file is not even open (so we don't hit open-file limits for a time series of a decade where every day has its own file with hourly slices). And if you broadcast over the array it will be instant. The data is never touched until you index into it.

So Rasters.jl is an abstraction over the backends to make them work the same way, but not actually that similarly to the source behaviour. My concern is that introducing various caveats to that means code will not be transferable between tiff and netcdf files like it is currently, and the return object will not be guessable from the code. With your idea, if you wrote something for a tif with a for loop over the array using plain indexing, it would be very slow when run on a lazily loaded netCDF file.

Basically, there is a cost to making caveats like you suggest. Everything used to be lazy by default, but that proved to not be the majority use case. Writing `lazy=true` is explicit and cheap.
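The lazy behaviour described here can be sketched as follows; this assumes a hypothetical disk-backed file and relies on the deferred-broadcast semantics described above:

```julia
using Rasters, NCDatasets

r = Raster("somedata.nc"; name=:temp, lazy=true)

b = r .* 2      # instant: builds a lazy broadcast, touches no data
b[100, 100]     # only now is the needed block read and multiplied
```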
Sorry, closing wasn't meant to be an end to the conversation! It was just automatic from the PR ;)
Sorry for my denseness. I feel stupid, but I'm not sure if I understand you correctly. Let me proceed slowly... by asking some random questions.

What happens if you open a GeoTIFF using `lazy`?

```julia
# no data transfer from the file to the memory:
r = Raster("somedata.tiff"; name="temp", lazy=true)
for j = 1:jmax
    a = r[:, j]  # is slice j transferred from the file to memory?
    # do something with a
end
```

Would this code be much faster with `lazy=false`?

Another question (or just confirmation): I think you said that this code

```julia
r = Raster("somedata.nc"; name = "temp")
```

will transfer the whole array to memory. Is that right?

If the answers to both questions are yes, then it would be helpful to default to the backend's behaviour.
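On the "faster" question, the eager and lazy loops can be sketched like this; file names are placeholders, and the performance claim follows the maintainer's point above that for-loop indexing wants an in-memory array:

```julia
using Rasters

# Eager: one disk read up front, then cheap in-memory slicing.
r = Raster("somedata.tiff")          # default lazy=false
for j in 1:size(r, 2)
    a = r[:, j]                      # plain in-memory indexing, fast
    # do something with a
end

# Lazy: with `lazy=true`, each `r[:, j]` above would instead trigger
# its own disk read, so a tight loop like this is usually faster eager.
```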
I agree, and your argument indicates that `lazy=false` is the right default for GeoTIFFs. If you really want uniformity across different file formats, then defaulting to `lazy=false` everywhere is the consistent choice.

But I'm not sure what you buy with such uniformity. I don't see any real-world drawback if we default to each backend's own behaviour.

Off topic: currently I'm the only Julia user in my group. My colleagues are satisfied with Python's xarray. In this context, I have great hope that Rasters.jl will become an attractive alternative for them.
I really get your argument, and the argument about a default value of `lazy` that depends on the backend. Another key reason I changed it to eager by default is that DiskArrays.jl is not yet fully stable. With time and stability we can maybe revert that decision.

Last, reducing complexity is important here because this package is a small fraction of my spare time, while xarray has a full-time team and many contributors. What I buy by not making this change is not doing work I personally don't need: adding more tests, a bunch of functions to set laziness based on source type, and documentation for the differences.

So for this to happen DiskArrays.jl will need to get more stable (and I'm one of two very part-time devs there too), and someone will have to make a PR that does all of these things.
In #421, @rafaqz said
So, here I post this new issue.
The source code results in the error shown at the end of this message. If you replace the URL with a local netCDF filename, the code works.
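The snippet itself was lost in this extract; based on the fix suggested earlier in the thread, it presumably resembled the following. Both the URL and the exact keyword arguments are my assumptions, not the original code:

```julia
using Rasters

url = "https://example.org/opendap/dataset"  # placeholder, not the original URL
r = Raster(url; name=:mslp)                  # failed on remote data in v0.7.2
```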
I use Rasters v0.7.2 and NCDatasets v0.12.16 on Julia 1.9.0 on macOS 13.4.