Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Where selection for Netcdf raster slower than for in Memory raster with same dimensions #594

Closed
felixcremer opened this issue Jan 17, 2024 · 9 comments

Comments

@felixcremer
Copy link
Contributor

When I am doing a Where Selector it seems to be quite slow compared to other selection methods.
I expect that the Where selector is slower because it has to check the function for every element of the axis.
But in my case the selection takes 2 seconds while the selection function takes 30 nanoseconds and when I multiply it with the number of elements I get 10 microseconds.
This selection is much faster when I do it with random data that I generate compared to the selection on a netcdf based Raster.
When I construct a Raster with the same dimensions as ras but a random array the timing is similar to the fully random based Raster.

# Selection on random data 
julia> r = Raster(rand(800,800, 289), (X(1:800), Y(1:800), Ti(DateTime(1995):Month(1):DateTime(2019))))
800×800×289 Raster{Float64,3} with dimensions: 
  X Sampled{Int64} 1:800 ForwardOrdered Regular Points,
  Y Sampled{Int64} 1:800 ForwardOrdered Regular Points,
  Ti Sampled{DateTime} DateTime("1995-01-01T00:00:00"):Month(1):DateTime("2019-01-01T00:00:00") ForwardOrdered Regular Points
extent: Extent(X = (1, 800), Y = (1, 800), Ti = (DateTime("1995-01-01T00:00:00"), DateTime("2019-01-01T00:00:00")))
missingval: missing
parent:


julia> @btime r[Ti=Where(x->year(x)==1995)]
  29.809 ms (19 allocations: 58.60 MiB)
800×800×12 Raster{Float64,3} with dimensions: 
  X Sampled{Int64} 1:800 ForwardOrdered Regular Points,
  Y Sampled{Int64} 1:800 ForwardOrdered Regular Points,
  Ti Sampled{DateTime} DateTime[1995-01-01T00:00:00, , 1995-12-01T00:00:00] ForwardOrdered Irregular Points
extent: Extent(X = (1, 800), Y = (1, 800), Ti = (DateTime("1995-01-01T00:00:00"), DateTime("1995-12-01T00:00:00")))
missingval: missing
parent:


# results of a NetCDF backed raster
julia> @btime ras[Ti=Where(x->year(x)==1995)]
  1.994 s (643532 allocations: 240.25 MiB)
848×824×12 Raster{Union{Missing, Float32},3} with dimensions: 
  Dim{:rlon} Sampled{Float64} -28.4029998779297:0.0550011811599821:18.18300056457514 ForwardOrdered Regular Points,
  Dim{:rlat} Sampled{Float64} -23.4029998779297:0.0550012159753107:21.863000869751005 ForwardOrdered Regular Points,
  Ti Sampled{DateTime} DateTime[1995-01-16T12:00:00, , 1995-12-16T12:00:00] ForwardOrdered Irregular Points
extent: Extent(rlon = (-28.4029998779297, 18.18300056457514), rlat = (-23.4029998779297, 21.863000869751005), Ti = (DateTime("1995-01-16T12:00:00"), DateTime("1995-12-16T12:00:00")))
missingval: missing

# Results of a random array with the same dimensions as ras 
ulia> ra = Raster(rand(size(ras)...), dims(ras))
848×824×288 Raster{Float64,3} with dimensions: 
  Dim{:rlon} Sampled{Float64} -28.4029998779297:0.0550011811599821:18.18300056457514 ForwardOrdered Regular Points,
  Dim{:rlat} Sampled{Float64} -23.4029998779297:0.0550012159753107:21.863000869751005 ForwardOrdered Regular Points,
  Ti Sampled{DateTime} DateTime[1995-01-16T12:00:00, , 2018-12-16T12:00:00] ForwardOrdered Irregular Points
extent: Extent(rlon = (-28.4029998779297, 18.18300056457514), rlat = (-23.4029998779297, 21.863000869751005), Ti = (DateTime("1995-01-16T12:00:00"), DateTime("2018-12-16T12:00:00")))
missingval: missing


julia> @btime ra[Ti=Where(x->year(x)==1995)]
  32.535 ms (19 allocations: 63.97 MiB)
848×824×12 Raster{Float64,3} with dimensions: 
  Dim{:rlon} Sampled{Float64} -28.4029998779297:0.0550011811599821:18.18300056457514 ForwardOrdered Regular Points,
  Dim{:rlat} Sampled{Float64} -23.4029998779297:0.0550012159753107:21.863000869751005 ForwardOrdered Regular Points,
  Ti Sampled{DateTime} DateTime[1995-01-16T12:00:00, , 1995-12-16T12:00:00] ForwardOrdered Irregular Points
extent: Extent(rlon = (-28.4029998779297, 18.18300056457514), rlat = (-23.4029998779297, 21.863000869751005), Ti = (DateTime("1995-01-16T12:00:00"), DateTime("1995-12-16T12:00:00")))
missingval: missing

[and 11 more slices...]
@felixcremer
Copy link
Contributor Author

This might be a Rasters issue in the end.
There is a difference between a YAXArrays and a Raster on the same data.

The above timings are from a Raster wrapping a YAXArray and this might contribute to the long timings.

But when I do the selection on a Raster compared to a YAXArray the selection is a factor four roughly slower. And the YAXArrays selection is also a factor two slower than the selection on an in memory based Raster.

urls = Dict(y=> "https://esgf1.dkrz.de/thredds/fileServer/cosmo-rea/reanalysis/EUR-6km/DWD/ECMWF-ERAINT/REA6/r1i1p1f1/COSMO/v1/mon/atmos/tas/v20230314/tas_EUR-6km_ECMWF-ERAINT_REA6_r1i1p1f1_COSMO_v1_mon_$(y)01-$(y)12.nc" for y in 1995:2018)


filenames = Dict(y=> joinpath("data", split(url, "/")[end]) for (y,url) in urls)
mkpath("data/")
#pathdict = Dict(y => download(url, filenames[y]) for (y, url) in urls)
# After download this should be this:
pathdict = [glob("tas_EUR*$y*.nc", "/home/fcremer/Documents/NFDI4Earth/deliverables/data")[1] for y in 1995:2018]

rlist = Raster.(pathdict, lazy=true, key="tas");
r = cat(rlist..., dims=Ti)
julia> dslist  = open_dataset.(pathdict)
taslist = getproperty.(dslist, :tas)
c = cat(taslist..., dims=Ti)

julia> @btime c[Ti=Where(x->year(x)==1995)];
  66.370 ms (26354 allocations: 1.98 MiB)

julia> @btime r[Ti=Where(x->year(x)==1995)];
  232.956 ms (7357 allocations: 232.82 MiB)

@rafaqz
Copy link
Owner

rafaqz commented Jan 30, 2024

This looks like the problem:

(643532 allocations: 240.25 MiB)

Which is probably because youre using lazy=true with NCDatasets.jl... maybe try main ?

@rafaqz rafaqz transferred this issue from rafaqz/DimensionalData.jl Jan 30, 2024
@rafaqz
Copy link
Owner

rafaqz commented May 19, 2024

Is this still true? the files dont seem to be there anymore to test it.

@felixcremer
Copy link
Contributor Author

The data was moved. I try to test it this week.

@felixcremer
Copy link
Contributor Author

The location of the files have been moved.
The new position should be this:

path = "https://esgf1.dkrz.de/thredds/fileServer/cosmo-rea/reanalysis/EUR-6km/DWD/ECMWF-ERAINT/REA6/r1i1p1f1/COSMO/v1/mon/atmos/tas/v20230918/tas_EUR-6km_ECMWF-ERAINT_REA6_r1i1p1f1_COSMO_v1_mon_199501-199512.nc"

I will try it again in the next days.

@felixcremer
Copy link
Contributor Author

I tested it again and these are the new timings:

julia> @btime c[Ti=Where(x->year(x)==1995)]; # YAXArrays with NetCDF backend
  71.301 ms (26373 allocations: 1.98 MiB)

julia> @btime r[Ti=Where(x->year(x)==1995)]; # Rasters with NCDatasets backend
  252.776 ms (6246 allocations: 232.31 MiB)

This might also be some difference between the backends.

This is with these package versions:

(Rasters) pkg> st
Project Rasters v0.11.1
Status `~/.julia/dev/Rasters/Project.toml`
  [79e6a3ab] Adapt v4.0.4
  [3da002f7] ColorTypes v0.11.5
  [1fbeeb36] CommonDataModel v0.3.6
  [187b0558] ConstructionBase v1.5.5
⌃ [0703355e] DimensionalData v0.27.2
⌅ [3c3547ce] DiskArrays v0.3.23
  [411431e0] Extents v0.1.2
  [1a297f60] FillArrays v1.11.0
  [4c728ea3] Flatten v0.4.3
  [68eda718] GeoFormatTypes v0.4.2
  [cf35fbd7] GeoInterface v1.3.4
  [e1d29d7a] Missings v1.2.0
  [30363a11] NetCDF v0.11.8
  [6fe1bfb0] OffsetArrays v1.14.0
  [92933f4c] ProgressMeter v1.10.0
  [3cdcf5f2] RecipesBase v1.3.4
  [189a3867] Reexport v1.2.2
  [efcf1570] Setfield v1.1.1
  [c21b50f5] YAXArrays v0.5.6
  [ade2ca70] Dates
  [a63ad114] Mmap

@rafaqz
Copy link
Owner

rafaqz commented May 20, 2024

Ok this not just the time to load the data?

getindex is not lazy in Rasters like in YAXArrays, it moves data to memory. But view is lazy!

@felixcremer
Copy link
Contributor Author

Ah yeah right. I forgot that. So this is a useless comparison. I will test the view but most likely this can be closed

@rafaqz
Copy link
Owner

rafaqz commented May 20, 2024

It is kind of annoying that the lazy syntax is harder to write in julia generally.

@rafaqz rafaqz closed this as completed May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants