
GDALarray doesn't load lazily #89

Closed

maxfreu opened this issue Nov 5, 2020 · 13 comments

@maxfreu
Contributor

maxfreu commented Nov 5, 2020

Hi! When I'm loading a large image via GDALarray, my RAM gets flooded. I expected it to read only the parts I index into with ranges.

using ArchGDAL
using GeoData
img = GeoData.GDALarray("large file");

# Result:
# ┌ Warning: No data value from GDAL -1.0e10 is not convertible to data type UInt16. `missingval` is probably incorrect.
# └ @ GeoData ~/.julia/packages/GeoData/Vnpza/src/sources/gdal.jl:269

# RAM flooded

julia v1.5.0
ArchGDAL v0.5.2
GeoData v0.2.1

Maybe the strange block size plays a role?

Size is 21171, 67760
...
Band 1 Block=21171x1 Type=UInt16, ColorInterp=Red
...

EDIT: With a different file that barely fits into memory, I just noticed that when I call show(img), the RAM usage drops temporarily, then I get the printout, and then it goes up again.

@rafaqz
Owner

rafaqz commented Nov 6, 2020

Ok, interesting. I don't actually test memory use anywhere yet (but we should), and I rarely use tiff files over a few hundred MB. But in theory that should work: GDALarray shouldn't load any array data, and if it does, that's a bug.

We have to load the ArchGDAL.RasterDataset to know anything about the file for GDALarray (like how big it is, how many dimensions it has, etc.). I assumed that it didn't actually load data until you index into it, but I may be wrong.

The file is opened here inside a function that closes the dataset at the end:
https://github.com/rafaqz/GeoData.jl/blob/master/src/sources/gdal.jl#L90

Which opens the dataset which is here in ArchGDAL:

https://github.com/yeesian/ArchGDAL.jl/blob/c22481854e3dfce16070d8b379de151dbbcf39a0/src/raster/array.jl#L27

and passes it to:

https://github.com/rafaqz/GeoData.jl/blob/master/src/sources/gdal.jl#L96

That warning is from the last thing that GeoData.jl does in that chain before building the struct (which has no data connected to it) and closing the dataset.

So it does get that far without loading anything...

Maybe show is being called in the REPL after all of that? We should probably write a custom show method that doesn't ever load the data.

Try to define:

Base.show(io::IO, ::GDALarray) = nothing

Maybe also try just getting something from the struct to demonstrate that it has completely loaded:

dims(GDALarray("large file"))

@rafaqz
Owner

rafaqz commented Nov 6, 2020

See #90

You can do:
] add GeoData#no_show
to see if it works for you.

@maxfreu
Contributor Author

maxfreu commented Nov 6, 2020

Setting show to nothing indeed helps! Now I can index into a 50 GB tiff without problems. However, I wonder what the fastest way to store and read the data is (I'm thinking of block size, pixel- or band-interleaved storage, compression algorithm & file type). But I think this is close :)

@rafaqz
Owner

rafaqz commented Nov 6, 2020

Great! I never actually tried that on a file that big, so I'm glad it just works! I'll get that PR merged too. You will be able to show the dims and filename without worrying about this - it just won't show the array data.

For blocking, DiskArrays.jl should handle all of that under the hood for ArchGDAL.jl. @meggart has done amazing work to make that happen, and maybe has some comments on how we should proceed.

The aim of GeoData.jl is to wrap spatial data in a standardised, easy-to-use interface on the user end, but to avoid low-level work where possible to keep things modular. So DiskArrays.jl standardises the disk interface - chunking etc.

But GeoData.jl should do more to make information from DiskArrays.jl available from the GDALarray object if you need it anywhere - like if you need to know chunk sizes. I always intended that, but there is only so much time for this... and I mostly work with huge time-series of small files, so I have different problems to handle. But if you let me know what you need from there, I can make sure it's supported and available from the GDALarray object.
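
For reference, here's a minimal sketch of how you can already inspect the chunk layout, assuming the underlying dataset is an ArchGDAL.RasterDataset (a DiskArrays.jl array) - the file name is hypothetical:

using ArchGDAL
using DiskArrays

ds = ArchGDAL.readraster("large file")   # disk-backed RasterDataset; no pixel data read yet
# DiskArrays exposes the natural block layout of the file:
for c in DiskArrays.eachchunk(ds)
    # `c` is a tuple of index ranges covering one block
    @show c
    break
end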

@rafaqz
Owner

rafaqz commented Nov 6, 2020

Also, writing to a file that big isn't supported yet in GeoData. We'll need a way to load, modify and write chunks one at a time - again, this will be about coordinating things properly with DiskArrays.

I can help with making it work in the context of GeoData once it's clear what you would need to do if you were just using ArchGDAL directly, but it's really not my field of expertise. @meggart will know a lot more.

I think we can probably just use setindex! on the RasterDataset object, but we probably want to know the chunk size to do that efficiently:
https://github.com/meggart/DiskArrays.jl/blob/7eec7761cb564a458b2d4b02f4c59b7b5ea8194e/src/DiskArrays.jl#L193

Writing with setindex! directly would also mean opening and closing the file every time. Instead we might want to do this inside a do block, like:

A = GDALarray("somefile.tif")
write(A) do a
    for chunk in chunks(a)
        a[chunk...] = somechunk
    end
end

Or we can wrap broadcast so you can do:

A .= somefunc.(A)

And it writes in a chunked way - I think DiskArrays.jl does that already, we just need to expose it.

Maybe give me an example of the kind of pipeline you would need to run between reading and writing from disk.

@maxfreu
Contributor Author

maxfreu commented Nov 6, 2020

Ok first things first: The initial issue is solved, so this can be closed.

Now the messy part:
Actually, I don't have a real application at the moment. In general I work on classifying satellite and aerial images via deep learning. Currently I'm using PyTorch for that and don't see that changing yet; I hope to call it via PyCall or use Torch.jl, but I still have a lot of work to do. The data loading and preprocessing could possibly be done much faster than in Python, so I'm experimenting with that a bit in Julia. The ability to process large files is important, because in the end I would like to seamlessly process the entirety of Germany at 20-30 cm resolution, so some TB. The output would be some other raster (or vector) file, so I don't really need chunked writes. I usually create a GDAL in-memory dataset and write that out. But of course one could make that easier too, making it possible to write files larger than RAM.

But GeoData.jl should do more to make information from DiskArrays.jl available from the GDALarray object if you need it anywhere - like if you need to know chunk sizes.

Yes - as I have lots of RAM available, the chunk sizes could be larger than what is used at the moment. I haven't found out how to change that yet, but I'm just at the start. Having loaded a big chunk, I can then process it in smaller chunks which fit onto the GPU - or just load the small chunks directly? I don't know yet. I'm looking at TiledIteration for that (which could maybe replace DiskArrays' internal chunking); see the sketch below.
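
Roughly what I have in mind (just a sketch - the tile size and the 3-d indexing are assumptions, and `img` is the lazily-loaded array from above):

using TiledIteration

# Tile over the two spatial axes; 4096 is an arbitrary example size.
for tileaxs in TileIterator(axes(img)[1:2], (4096, 4096))
    tile = img[tileaxs..., :]   # reads only this window from disk
    # ... split `tile` further into GPU-sized pieces and run inference ...
end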

Finally, another thing I noticed: timing getindex on a GDALarray and a RasterDataset gives different results.

using ArchGDAL
using GeoData

geoimg = GeoData.GDALarray("file")
archimg = ArchGDAL.readraster("file")

# access different regions
@time geoimg[1000:2000,  1000:2000, 2:8]
@time geoimg[2000:3000,  2000:3000, 2:8]
@time archimg[1000:2000, 1000:2000, 2:8]
@time archimg[2000:3000, 2000:3000, 2:8]

#  output of the first run:
#  1.055887 seconds (106 allocations: 13.382 MiB)
#  1.226506 seconds (192 allocations: 13.383 MiB, 0.21% gc time)
#  1.054638 seconds (22 allocations: 13.379 MiB)
#  1.261703 seconds (22 allocations: 13.379 MiB)

#  output of the second run:
#  1.065243 seconds (122 allocations: 13.382 MiB, 0.13% gc time)
#  1.221636 seconds (106 allocations: 13.382 MiB)
#  0.004448 seconds (22 allocations: 13.379 MiB)
#  0.004583 seconds (22 allocations: 13.379 MiB)

I can't nail it down, but the GDALarray load time remains constantly high, whereas the ArchGDAL load time decreases for successive accesses of the same region. I haven't profiled yet, but maybe it's using some slow fallback routine or not utilizing the buffered chunks. So there is still work for our poor souls :D

Edit: Looking at the type tree, it becomes apparent that AG.RasterDataset is a subtype of DiskArrays.AbstractDiskArray, with all its merits, while GDALarray is not.

@rafaqz
Owner

rafaqz commented Nov 6, 2020

Ok, it will be good to have workloads that big to test this package out. Fast transfer of big 100 GB/TB datasets to GPU memory is increasingly my main use-case too. But the time dimension breaks it into 1000s of separate slices, so it's a different problem.

The load time checks out. GeoData.jl doesn't leave files hanging open. Your archimg is an open file you need to close at some point. ArchGDAL discourages you from doing that too - it's better to do everything inside a do block. Maybe the second round is much faster for ArchGDAL because the file is still open and cached by the OS or something - it only hits disk the first time. Or maybe DiskArrays has caching? I don't know.

Mostly, the way I use GeoData, I only load part of the data to memory once, so this makes no real difference. But for multiple reads from an open file we will need a do-block syntax backed by the opened RasterDataset. That's not too hard to do - everything is already written like that underneath. I guess we could have a type you have to manually close too, but it's not a priority for me to write that.
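
For now, plain ArchGDAL already gives you that pattern (a sketch - the file name is made up); the dataset stays open for every read inside the block:

using ArchGDAL

ArchGDAL.read("somefile.tif") do dataset
    band = ArchGDAL.getband(dataset, 1)
    a = ArchGDAL.read(band)   # or read windows of it
    # ... more reads against the still-open dataset ...
end   # the dataset is closed for you here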

Edit: Looking at the type tree it becomes apparent that AG.RasterDataset is a subtype of DiskArrays.AbstractDiskArray with all its merits and GDALarray is not.

I don't think this is the reason though. We are just indexing the RasterDataset underneath. The difference is that the file opens and closes again. Also I'm not sure what you mean by "with all its merits" - what is there that is lost by wrapping it and indexing into it? You can wrap a CuArray with a GeoArray and it still broadcasts on the GPU.

GDALarray is an AbstractDimArray to get the indexing behaviour. It can't be two kinds of array. I tried hard to make DimensionalData.jl a trait-only interface, but it's hard to do for arrays.

@maxfreu
Contributor Author

maxfreu commented Nov 10, 2020

The load time checks out. GeoData.jl doesn't leave files hanging open. Your archimg is an open file you need to close at some point. ArchGDAL discourages you from doing that too - it's better to do everything inside a do block.

Yes, you are right. The overhead seems to come from opening and closing the file, and I was unaware that I should use the do block.

Maybe the second round is much faster for ArchGDAL because the file is still open and cached by the OS or something - it only hits disk the first time. Or maybe DiskArrays has caching?

DiskArrays has caching, but I'm unaware of how it works and have yet to find out how to use it best. I assume the speedup comes from there.

Also I'm not sure what you mean by "with all its merits" - what is there that is lost by wrapping it and indexing into it? You can wrap a CuArray with a GeoArray and it still broadcasts on the GPU.

By merits I meant the caching behavior, but as you said, it's lost through the opening/closing, which I was unaware of and didn't think through.

@meggart

meggart commented Nov 10, 2020

DiskArrays has caching, but I'm unaware of how it works and have yet to find out how to use it best. I assume the speedup comes from there.

In your example the caching does not come from DiskArrays; it rather comes from the underlying GDAL library. DiskArrays does not implement an explicit cache; all it does is read data in efficient blocks when you apply high-level functions like sum or broadcast to them. I agree with @rafaqz that GeoArrays should not inherit from AbstractDiskArray, but rather stay generic and be able to wrap any type of array (including DiskArrays, CuArrays, normal arrays, etc.).

So when you index repeatedly into a DiskArray, the package will not help you speed this up.
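
To illustrate the block-wise reading (a sketch - the file name is hypothetical):

using ArchGDAL

ds = ArchGDAL.readraster("large file")   # disk-backed; nothing read yet
# A reduction like this walks the file block by block instead of
# materialising the whole array in memory:
m = maximum(ds)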

@rafaqz
Owner

rafaqz commented Nov 11, 2020

In terms of chunking, AbstractDimArray already defers broadcast to the parent type (i.e. for CuArray), so I get the feeling that you may already be able to just broadcast over the OpenGeoArray and DiskArrays will handle chunking without loading the whole file. The issue will be how to break up the side you are writing to into the same chunks.

Does that sound right @meggart ?

@maxfreu
Contributor Author

maxfreu commented Nov 12, 2020

In your example the caching does not come from DiskArrays; it rather comes from the underlying GDAL library.

So when you index repeatedly into a DiskArray, the package will not help you speed this up.

Ahh ok! I didn't know that and was looking for the caching mechanism in DiskArrays without success.

GeoArrays should not inherit from AbstractDiskArray

Yes, I agree as well; I was just speculating about the reason for the time difference.

@meggart

meggart commented Nov 12, 2020

Yes, this is correct. If you already defer broadcast, it might just work. Note that DiskArrays defines broadcast lazily, so an output will be returned instantaneously; you must call copy, getindex, or friends to actually do the computation.

Regarding output chunks, currently there is no fallback, so all arrays participating in your broadcast call must either have no chunks defined or align with the other defined chunks; otherwise the broadcast call will error. So make sure to create the output arrays with the same chunks as your inputs.
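
A minimal sketch of the lazy behaviour (assuming a disk-backed RasterDataset as input; the file name is hypothetical):

using ArchGDAL

ds = ArchGDAL.readraster("large file")
lazy = 2 .* ds        # returns immediately - no I/O yet
result = copy(lazy)   # now the computation runs, chunk by chunk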

@rafaqz
Owner

rafaqz commented Nov 14, 2020

Oh right, broadcast will be chunk-wise if the source and dest align. That's great. I'll have to write some tests for doing this through an AbstractGeoArray wrapper.

It never ceases to amaze me how little we need to coordinate for this kind of thing to just work.
