Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rasterize issues #529

Closed
alex-s-gardner opened this issue Sep 29, 2023 · 16 comments
Closed

rasterize issues #529

alex-s-gardner opened this issue Sep 29, 2023 · 16 comments

Comments

@alex-s-gardner
Copy link
Contributor

alex-s-gardner commented Sep 29, 2023

I'm having issues rasterizing a shapefile using Rasters. The .shp file contains 18855 polygons. Part of the issue is out of memory errors and part of the issue is reprojection. Here's what I did:

using Rasters
using Shapefile
using ArchGDAL
using GeoFormatTypes
using Downloads
using ZipFile

Get and Unzip Data

the shapefiles can be downloaded from NSIDC: https://nsidc.org/data/nsidc-0770/versions/7
but you will require an Earth Data Login

for convenience I've temporarily put the files here

fn = "02_rgi60_WesternCanadaUS"
url2shp =  "https://its-live-data.s3.amazonaws.com/test/$fn.zip"

# download and unzip
Downloads.download(url2shp, "$fn.zip")
run(`unzip $fn`)

Load polygons

glacier = Shapefile.Handle("$fn/$fn.shp")

define a Raster to which the shapefile will be rasterized

epsg = GeoFormatTypes.EPSG("EPSG:32609")
gridsize = 25;
min_x = 273500;
max_x = 2507100;
min_y = 4077900;
max_y = 7234200;

x = X(min_x:gridsize:max_x);
y = Y(min_y:gridsize:max_y);
A = Raster(zeros(UInt8,y,x))
setcrs(A, epsg)

rasterize polygons

@time Rasters.rasterize!(last,A,glacier.shapes; fill=1, shape=:polygon)

This leads to an out of memory error

try a divide and concur approach to overcome memory limit

first let's reproject the polygons to the desired projection... once we do that then we loop over subsets of the full domain and rasterize

# reporject files
reproject(epsg, glacier)

reutrns: ERROR: StackOverflowError:

Alternatively we can us GDAL

at the command line this takes about 10s

run(`ogr2ogr -t_srs EPSG:3857 $(pwd())/$(fn)/$(fn)_32609.shp $(pwd())/$fn/$fn.shp`)# This takse a few seconds
run(`gdal_rasterize -burn 1.0 -tr 25 25 -a_nodata 0.0 -ot Byte  $(pwd())/$(fn)/$(fn)_32609.shp $(pwd())/$(fn)/$(fn)_32609.tif`)
@rafaqz
Copy link
Owner

rafaqz commented Sep 29, 2023

Ahh yes the problem is last which needs a lot if memory to work in parallel because we cant know which was last value for any pixel accross the threads.

Try sum instead, it doesnt have that problem and will be the same answer unless you have overlaps.

I guess we should not use threads for last, or order the blocks passed to each thread manually so we can merge them in order at the end.

(sum should beat gdal easily or there is somethong else wrong)

@alex-s-gardner
Copy link
Contributor Author

Try sum instead, it doesnt have that problem and will be the same answer unless you have overlaps.

switching to sum resolved the out of memory error. Now Rasters.rasterize! is about 10x slower than gdal_rasterize not counting the creation of A::Raster

Rasters.rasterize! takes 324.580515 seconds (1.60 M allocations: 336.273 GiB, 10.41% gc time, 1.50% compilation time)
gdal_rasterize takes 28s

@alex-s-gardner
Copy link
Contributor Author

Any idea why reproject(epsg, glacier) throws a StackOverflowError?

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

There is something very wrong to still be 10x slower on sum

No idea about reproject, you dont include a stack trace. Why not read the code matching your stack trace and see ;)

Also side note Im not sure what setcrs is meant to do... the return value is not used?

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

Finally at a computer.

Your raster is huuge!!!

126253×89345 Raster{UInt8,2} with dimensions:

Do you mean for it to be ?

If so great, nice test case for performance of huge things. I haven't tried rasterizing into anything within 2 orders of magnitude of this size. I am guessing some performance heuristics will need to change for it to be fast.

reproject

For reprojecting the shape file you are trying to use Rasters.reproject on a geointerface object. It doesn't do that, it just reprojects Raster and Dimension. I guess you can use Proj.jl or ArchGDAL, but actually in the unregistered GeometryOps.jl I wrote a nice generic reproject method that would be the most appropriate.

glacier_rep = GeometryOps.reproject(glacier; target_crs=EPSG(3857))

Unfortunately Proj.jl also has a bug where it wont accept shapefiles ESRIWellKnownText crs type 😭 but this branch works now:

JuliaGeo/Proj.jl#97

rasterize

First, try using fill=UInt8(1) for type stability. We should force that from the destination array type.

You may also want to use boundary=:touches so the lines count as your polygons seem pretty small.

But after that, we are just using too much RAM.
We can actually rasterize directly to disk by adding a filename argument, but that's also really slow because its writing to disk like 18000 times. We may need some intermediate step or a smarter way to do this.

But this is fun, lets make it beat GDAL by 10x even at this scale.

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

And finally, this already seems really fast for me at a quarter of your raster size:

@time Rasters.rasterize!(count, A, glacier_rep; shape=:polygon)

The full raster size kills my julia session 😅

count just skips the whole process of having a fill value and just counts occurances, which always the fastest rasterize method.

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

Ok, problem solved!! its threading. Each thread needs a raster to write to so at this scale that means using all your RAM. Its faster to just not thread at all.

@time Rasters.rasterize!(count, A, glacier_rep; threaded=false, shape=:polygon)

For me this is >6x faster than GDAL. (which is kinda suspicious actually)

We could look at memory and try to see how many threads we can safely use and maybe make this faster again.

There is also a worrying warning that the polygons are not actually affecting pixels. Probably they are too small... you may need boundary=:touches but it doesn't seem to help.

sum(A) == 0 so maybe the polygon locations don't match grid or the projections are wrong somewhere.

@alex-s-gardner
Copy link
Contributor Author

alex-s-gardner commented Sep 30, 2023

Wow.. this is lightning fast now (9s vs 28s but a bit of apples to oranges) ... but you're right none of the plygons are intersecting which is strange cuz they do... The polygons are in lat lon projection and the grid is in UTM 09N ... I thought from reading the docs that this could be handled internally but maybe it's necessary to reproject the shapefile polygons prior to rasterizing? I see above that you reprojected to 3857 when we need to reproject to 32609... I'll try that now.

@alex-s-gardner
Copy link
Contributor Author

For reprojecting the shape file you are trying to use Rasters.reproject on a geointerface object. It doesn't do that, it just reprojects Raster and Dimension

Maybe an informative error message could be added.

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

It doesn't handle crs conversion automatically - often there is no crs attached to polygons anyway. But we should in future when its possible and GeometryOps.jl is actually registered and useable. Its just slow getting all of these little checks and conversions working and perfect...

And yeah, thats the kind of performance you should expect... 6x GDAL without threading. Normally with smaller files and threading its 30x faster

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

Just wanted to point out that GDAL will just return zeros with no warning, but if you read the warnings Rasters checks that every polygon actually changes the raster and gives you a warning and a vector of misses in the raster metadata

@alex-s-gardner
Copy link
Contributor Author

alex-s-gardner commented Sep 30, 2023

@rafaqz I got it to work with:

epsg = GeoFormatTypes.EPSG("EPSG:32609")
glacier_rep = GeometryOps.reproject(glacier; target_crs=epsg);

and

foo = Rasters.rasterize!(count, A, glacier_rep.geom; threaded=false, shape=:polygon)
sum(.!(foo.data .== 0))
23343762

Thanks a ton for all of the help and patches

I noted that:

  1. GeometryOps.reproject(glacier; target_crs=epsg) returns a very different object then it was handed... I was not expecting that
  2. Rasters.rasterize! was unable to accept glacier_rep directly and needed to be handed glacier_rep.geom.. I guess this would need to change in future releases if you want Rasters.rasterize! to parse the crs

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

  1. Its not really possible to rebuild Shapefile.jl objects - we techically could but I deleted the convert code because it was buggy and they are not a good way to define geometries (the polygons are horrific) - so maintining it is a waste of our time. What you get back is an identical GeoInterface representation of the original structure.
  2. rasterize GeometryCollection #531 this issue hit an embarrassing number of bugs

@alex-s-gardner
Copy link
Contributor Author

  1. Its not really possible to rebuild Shapefile.jl objects - we techically could but I deleted the convert code because it was buggy and they are not a good way to define geometries (the polygons are horrific) - so maintining it is a waste of our time. What you get back is an identical GeoInterface representation of the original structure.

@rafaqz would it be more intuitive to have the user first convert the Shapefile.jl objects to GeoInterface then reproject... this would indicate to the user that to manipulate Shapefile.jl objects the first step is to convert it to a GeoInterface and abandon the Shapefile.jl object?

@rafaqz
Copy link
Owner

rafaqz commented Sep 30, 2023

Maybe... I've actually been making all the packages accept anything!! And just return their own types. LibGEOS.jl does this too.

To me needing a conversion step everywhere is a bit annoying to write, and takes more RAM - the current method only makes the destination copy, there is no intermediate.

If you use GeoInterface.jl geometry types this stuff barely matters - they can wrap any other geometry objects, even combining objects from different packages works.

@alex-s-gardner
Copy link
Contributor Author

me needing a conversion step everywhere is a bit annoying to write, and takes more RAM - the current method only makes the destination copy, there is no intermediate

Ya, makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants