-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comparing Julia's Rasters.zonal vs Python's Geopandas + Rasterio #312
Comments
Amazing! Thanks for putting that together.
Edit: I missed that But does that mean the comparison should be to Edit 2: Yes https://rasterio.readthedocs.io/en/latest/api/rasterio.mask.html We could compare |
Here you go:
py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
shp = gpd.read_file(shapefilepath)
raster = rasterio.open(rasterpath)
def geom_mask(geom, dataset=raster, **mask_kw):
mask_kw.setdefault('crop', True)
mask_kw.setdefault('all_touched', True)
mask_kw.setdefault('filled', False)
masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
**mask_kw)
return masked
return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""
python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")
function jlzonal(shapefilepath, rasterpath)
shp = Shapefile.Table(shapefilepath)
raster = Raster(rasterpath)
Int.(zonal(sum, raster; of=shp, boundary=:touches))
end
julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")
DataFrame(julia_result = julia_result, python_result = python_result) |> print
using Rasters
using Shapefile
using DataFrames
using PyCall
py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
shp = gpd.read_file(shapefilepath)
raster = rasterio.open(rasterpath)
def geom_mask(geom, dataset=raster, **mask_kw):
mask_kw.setdefault('crop', True)
mask_kw.setdefault('all_touched', False)
mask_kw.setdefault('filled', False)
masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
**mask_kw)
return masked
return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""
python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")
function jlzonal(shapefilepath, rasterpath)
shp = Shapefile.Table(shapefilepath)
raster = Raster(rasterpath)
Int.(zonal(sum, raster; of=shp, boundary=:center))
end
julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")
DataFrame(julia_result = julia_result, python_result = python_result) |> print
For 2, Julia and Python are indeed very close. @cmg777 |
Looking through rasterio it seems we are actually comparing with GDAL, which is a good benchmark. The second, very close version with The first The GDAL alg My alg uses the exact floating point start and end points from the shape file and steps along the line between those, rather than the line directly between two integer x/y positions. That could be the difference? although I am hesitant to say Rasters.jl is more accurate than GDAL, it is possible!! Seeing the work they have to do with so many variants of that code just to burn in different value types I can see why they might choose the simpler algorithm. It would also be faster than mine - but we get so much speed elsewhere we can afford that a little more. (if this is all true, we could add Edit - there is also this algorithm which keeps things floating point: Its a bit convoluted but I think this may be what we are comparing to. It is very likely as good or better than my algorithm, but it would be interesting to see what the exact differnces are... probably making masks with both and comparing them is the best way to see whats happening. Edit 2: It would also be good to plot the output |
Perhaps I can change the start and end points to Int just for testing. Can you tell me which line to change here? I had plotted the |
I was mistaken, GDAL is actually using Floating point for that - also pretty hard to understand that algorithm! I can't currently get a rasterio build working on my machine or via conda (another thing I dont miss about python!) but I might rewrite this to use ArchGDAL.jl/GDAL.jl and do the comparison there as it will be the same algorithm. (we can probably just multiply both output rasters together to find the common pixels then mask them separately with the negative of that to see which pixels are not shared) |
Do you know to do zonal statistics in ArchGDAL? Perhaps we can first compare ArchGDAL vs Rasterio. |
Yes it should be doable in ArchGDAL, it will just needs a lot more code than Rasters.jl. |
Ok, it seems there is a bug in |
Since there was a bug fix, I have computed the results again for
|
Thanks! That looks a lot closer. I will try tweaking it to see if can be the same, but Im not sure now if the difference is a real problem or just a floating point error or >/>= difference . |
The answers are close, but Rasters' results are consistently lower than rasterio's. |
Yes that's interesting. So the problem is occasionally not including a pixel that gdal includes. |
Ok I've solved this. The remaining problem is actually in The way DimensionalData.jl works selecting areas is not exactly what The remainder will be the differences on pixel corners in the middle of lines, as in the julia> @time python_zonal = convert(Array{Int64}, py"""pyzonal($shapefilepath, $rasterpath)""");
3.100880 seconds (156 allocations: 4.016 KiB)
julia> @time julia_zonal = jlzonal(shapefilepath, rasterpath);
0.820940 seconds (14.85 k allocations: 175.825 MiB, 0.88% gc time)
julia> difference = python_zonal .- julia_zonal;
julia> DataFrame(; julia_zonal, python_zonal, difference)
36×3 DataFrame
│ Row │ julia_zonal │ python_zonal │ difference │
│ │ Int64 │ Int64 │ Int64 │
├─────┼─────────────┼──────────────┼────────────┤
│ 1 │ 10271 │ 10271 │ 0 │
│ 2 │ 241023 │ 241023 │ 0 │
│ 3 │ 11267 │ 11267 │ 0 │
│ 4 │ 1330178 │ 1330178 │ 0 │
│ 5 │ 14818 │ 14818 │ 0 │
│ 6 │ 30311 │ 30311 │ 0 │
│ 7 │ 7748 │ 7748 │ 0 │
│ 8 │ 12991 │ 12996 │ 5 │
│ 9 │ 1021556 │ 1021565 │ 9 │
│ 10 │ 1562617 │ 1562609 │ -8 │
│ 11 │ 9977 │ 9977 │ 0 │
│ 12 │ 41918 │ 41918 │ 0 │
│ 13 │ 205642 │ 205642 │ 0 │
│ 14 │ 997366 │ 997366 │ 0 │
│ 15 │ 196811 │ 196811 │ 0 │
│ 16 │ 350714 │ 350714 │ 0 │
│ 17 │ 858119 │ 858115 │ -4 │
│ 18 │ 7483 │ 7483 │ 0 │
│ 19 │ 1200148 │ 1200153 │ 5 │
│ 20 │ 319 │ 319 │ 0 │
│ 21 │ 360163 │ 360163 │ 0 │
│ 22 │ 18325 │ 18325 │ 0 │
│ 23 │ 169326 │ 169326 │ 0 │
│ 24 │ 418850 │ 418850 │ 0 │
│ 25 │ 109069 │ 109069 │ 0 │
│ 26 │ 50640 │ 50640 │ 0 │
│ 27 │ 782535 │ 782535 │ 0 │
│ 28 │ 172874 │ 172874 │ 0 │
│ 29 │ 211326 │ 211326 │ 0 │
│ 30 │ 1461464 │ 1461464 │ 0 │
│ 31 │ 1147173 │ 1147173 │ 0 │
│ 32 │ 635311 │ 635311 │ 0 │
│ 33 │ 1087124 │ 1087124 │ 0 │
│ 34 │ 24501 │ 24501 │ 0 │
│ 35 │ 2008439 │ 2008439 │ 0 │
│ 36 │ 7741 │ 7741 │ 0 │ Also good to see it's 4x faster than rasterio. |
Brilliant! |
Just noting this isn't actually merged yet so reopening until its fixed and registered |
This should be fixed now. |
Hi,
I am doing a project on nighttime lights. I am moving my code from Python to Julia. I had performed zonal statistics in a specific case in both Julia and Python. I had the code handy, so I felt it's a good opportunity to compare the results.
I am not an expert on either of the two systems, just a user.
For ease of testing, please clone: https://github.com/xKDR/raster-experiments/ , where the data and the code has been pushed.
The code can also be found here:
The state level shapefile for India can be found here: (originally from datameet)
The raster file can be found here: (originally from LiEtal2020)
The text was updated successfully, but these errors were encountered: