Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparing Julia's Rasters.zonal vs Python's Geopandas + Rasterio #312

Closed
ayushpatnaikgit opened this issue Sep 21, 2022 · 16 comments
Closed

Comments

@ayushpatnaikgit
Copy link

Hi,
I am doing a project on nighttime lights. I am moving my code from Python to Julia. I had performed zonal statistics in a specific case in both Julia and Python. I had the code handy, so I felt it's a good opportunity to compare the results.

using Rasters
using Shapefile
using DataFrames
using PyCall

py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
    shp = gpd.read_file(shapefilepath)  
    raster = rasterio.open(rasterpath)
    def geom_mask(geom, dataset=raster, **mask_kw):
        mask_kw.setdefault('crop', True)
        mask_kw.setdefault('all_touched', False)
        mask_kw.setdefault('filled', False)
        masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
                                    **mask_kw)
        return masked
    return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""

python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")

function jlzonal(shapefilepath, rasterpath)
    shp = Shapefile.Table(shapefilepath)
    raster = Raster(rasterpath)
    Int.(zonal(sum, raster; of=shp, boundary=:inside))
end

julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")

DataFrame(julia_result = julia_result, python_result = python_result) |> print
36×2 DataFrame
 Row │ julia_result  python_result 
     │ Int64         Int64         
─────┼─────────────────────────────
   1 │         9599           9946
   2 │       235197         238343
   3 │         6338           9135
   4 │      1298732        1316207
   5 │        14545          14683
   6 │        28220          29342
   7 │         7542           7652
   8 │        10986          12186
   9 │       986312        1005796
  10 │      1531090        1547802
  11 │         9346           9723
  12 │        37824          40080
  13 │       197552         202026
  14 │       983131         991010
  15 │       191717         194372
  16 │       332144         344015
  17 │       844276         851621
  18 │         4291           5973
  19 │      1169259        1187453
  20 │            0            108
  21 │       353851         357619
  22 │        10876          15085
  23 │       161010         165474
  24 │       416945         418016
  25 │        95337         102516
  26 │        43565          47522
  27 │       739229         762352
  28 │       162653         168294
  29 │       204255         208089
  30 │      1424513        1444589
  31 │      1119402        1134418
  32 │       615313         626291
  33 │      1055784        1073378
  34 │         9791          17393
  35 │      1973589        1992820
  36 │         7741           7741

I am not an expert on either of the two systems, just a user.

For ease of testing, please clone: https://github.com/xKDR/raster-experiments/ , where the data and the code has been pushed.

The code can also be found here:
The state level shapefile for India can be found here: (originally from datameet)
The raster file can be found here: (originally from LiEtal2020)

@rafaqz
Copy link
Owner

rafaqz commented Sep 21, 2022

Amazing! Thanks for putting that together.

I think boundary=:touches in Rasters.jl may be a better comparison here? I assume thats what all_touched does in rasterio.

:inside should have less pixels than :touches as :inside does not include any pixels the line touches. So this difference isn't unexpected.

Edit: I missed that all_touched is false.

But does that mean the comparison should be to boundary=:center?

Edit 2: Yes all_touched, False in rasterio means the center line is used, so boundary=:center is the appropriate comparison here. I don't think it even has an option for boundary=:inside.

https://rasterio.readthedocs.io/en/latest/api/rasterio.mask.html

We could compare boundary=:center to all_touched, False, and boundary=:touches to all_touched, True

@ayushpatnaikgit
Copy link
Author

Here you go:

  1. boundary=:touches vs all_touched, True
py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
    shp = gpd.read_file(shapefilepath)  
    raster = rasterio.open(rasterpath)
    def geom_mask(geom, dataset=raster, **mask_kw):
        mask_kw.setdefault('crop', True)
        mask_kw.setdefault('all_touched', True)
        mask_kw.setdefault('filled', False)
        masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
                                    **mask_kw)
        return masked
    return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""

python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")

function jlzonal(shapefilepath, rasterpath)
    shp = Shapefile.Table(shapefilepath)
    raster = Raster(rasterpath)
    Int.(zonal(sum, raster; of=shp, boundary=:touches))
end

julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")

DataFrame(julia_result = julia_result, python_result = python_result) |> print
36×2 DataFrame
 Row │ julia_result  python_result 
     │ Int64         Int64         
─────┼─────────────────────────────
   1 │        10292          10271
   2 │       241542         241023
   3 │        10590          11267
   4 │      1332685        1330178
   5 │        14848          14818
   6 │        30446          30311
   7 │         7761           7748
   8 │        13210          12996
   9 │      1024225        1021565
  10 │      1566123        1562609
  11 │         9961           9977
  12 │        42033          41918
  13 │       206436         205642
  14 │       998901         997366
  15 │       197394         196811
  16 │       351144         350714
  17 │       859799         858115
  18 │         7650           7483
  19 │      1202572        1200153
  20 │          363            319
  21 │       360502         360163
  22 │        18675          18325
  23 │       170134         169326
  24 │       419006         418850
  25 │       110406         109069
  26 │        50955          50640
  27 │       787630         782535
  28 │       173302         172874
  29 │       212077         211326
  30 │      1464562        1461464
  31 │      1150695        1147173
  32 │       636513         635311
  33 │      1090006        1087124
  34 │        25326          24501
  35 │      2011179        2008439
  36 │         7741           7741
using Rasters
using Shapefile
using DataFrames
using PyCall

py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
    shp = gpd.read_file(shapefilepath)  
    raster = rasterio.open(rasterpath)
    def geom_mask(geom, dataset=raster, **mask_kw):
        mask_kw.setdefault('crop', True)
        mask_kw.setdefault('all_touched', False)
        mask_kw.setdefault('filled', False)
        masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
                                    **mask_kw)
        return masked
    return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""

python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")

function jlzonal(shapefilepath, rasterpath)
    shp = Shapefile.Table(shapefilepath)
    raster = Raster(rasterpath)
    Int.(zonal(sum, raster; of=shp, boundary=:center))
end

julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")

DataFrame(julia_result = julia_result, python_result = python_result) |> print
36×2 DataFrame
 Row │ julia_result  python_result 
     │ Int64         Int64         
─────┼─────────────────────────────
   1 │         9946           9946
   2 │       238339         238343
   3 │         8930           9135
   4 │      1316207        1316207
   5 │        14683          14683
   6 │        29306          29342
   7 │         7652           7652
   8 │        12183          12186
   9 │      1005791        1005796
  10 │      1547802        1547802
  11 │         9702           9723
  12 │        40080          40080
  13 │       202011         202026
  14 │       991006         991010
  15 │       194367         194372
  16 │       344010         344015
  17 │       851621         851621
  18 │         5968           5973
  19 │      1187453        1187453
  20 │          104            108
  21 │       357587         357619
  22 │        15057          15085
  23 │       165467         165474
  24 │       418016         418016
  25 │       102414         102516
  26 │        47517          47522
  27 │       762160         762352
  28 │       168294         168294
  29 │       208089         208089
  30 │      1444589        1444589
  31 │      1134418        1134418
  32 │       626291         626291
  33 │      1073373        1073378
  34 │        17393          17393
  35 │      1992748        1992820
  36 │         7741           7741

For 2, Julia and Python are indeed very close. @cmg777

@rafaqz
Copy link
Owner

rafaqz commented Sep 21, 2022

Looking through rasterio it seems we are actually comparing with GDAL, which is a good benchmark.

The second, very close version with :center is basically the output of PolygonInbounds.jl. I'm not sure what would account for the slight differences, it may be just the difference between a > and a >= somewhere in the algorithm - this seems unlikely but people tend to use round numbers so it actually gets hit a lot.

The first :touches example with bigger difference uses my custom line drawing algorithm to calculate which pixels touch (on top of which are chosen by PolygonInbounds.jl). I think the alg in GDAL is here: https://github.com/OSGeo/gdal/blob/c7e3e652c32773ab5627aafa8c3bd76e01d657b6/alg/gdalrasterize.cpp

The GDAL alg gvBurnScanline and its variants accept int start and end values. And rasterio suggests it uses a raw Bresenham algorithm. So it's rounding the start and end point of the line, which is an approximation of which pixels are touched.

My alg uses the exact floating point start and end points from the shape file and steps along the line between those, rather than the line directly between two integer x/y positions. That could be the difference? although I am hesitant to say Rasters.jl is more accurate than GDAL, it is possible!! Seeing the work they have to do with so many variants of that code just to burn in different value types I can see why they might choose the simpler algorithm. It would also be faster than mine - but we get so much speed elsewhere we can afford that a little more.

(if this is all true, we could add boundary=:bresenham to use to be consistent with GDAL)

Edit - there is also this algorithm which keeps things floating point:

https://github.com/OSGeo/gdal/blob/030ff40cf8340273bcc797e90c938cc32d14a34f/alg/llrasterize.cpp#L377

Its a bit convoluted but I think this may be what we are comparing to. It is very likely as good or better than my algorithm, but it would be interesting to see what the exact differnces are... probably making masks with both and comparing them is the best way to see whats happening.

Edit 2: It would also be good to plot the output boolmask for each polgon in the shapefile... there is probably just a bug in the line algorithm. Looking at the code I am actually half way through fixing a potential bug I found when writing another line related algorithm but isn't hit by the tests. That would make more sense than anything else I have suggested...

@ayushpatnaikgit
Copy link
Author

Perhaps I can change the start and end points to Int just for testing. Can you tell me which line to change here?
Or perhaps create a separate branch on Rasters.jl with that change? I can generate the zonal statistics using that branch.

I had plotted the boolmask for a few polygons where the difference was relatively large. I opened the same plot in QGIS. Hard to tell if there was any difference.

@rafaqz
Copy link
Owner

rafaqz commented Sep 21, 2022

I was mistaken, GDAL is actually using Floating point for that - also pretty hard to understand that algorithm!

I can't currently get a rasterio build working on my machine or via conda (another thing I dont miss about python!) but I might rewrite this to use ArchGDAL.jl/GDAL.jl and do the comparison there as it will be the same algorithm.

(we can probably just multiply both output rasters together to find the common pixels then mask them separately with the negative of that to see which pixels are not shared)

@ayushpatnaikgit
Copy link
Author

Do you know to do zonal statistics in ArchGDAL? Perhaps we can first compare ArchGDAL vs Rasterio.

@rafaqz
Copy link
Owner

rafaqz commented Sep 24, 2022

Yes it should be doable in ArchGDAL, it will just needs a lot more code than Rasters.jl. gdalrasterize to make a mask of the shapes (converted to GDAL polygons) then do the stats in raw julia on the resulting arrays. The result should be identical to rasterio as it's all just GDAL. I guess Rasters is a rare package to do this natively (mostly because I would like it to be 10x faster than rasterio)

@rafaqz
Copy link
Owner

rafaqz commented Oct 25, 2022

Ok, it seems there is a bug in :touches. Will have a look at fixing it next weekend.

@ayushpatnaikgit
Copy link
Author

Since there was a bug fix, I have computed the results again for boundary=:touches vs all_touched, True

36×2 DataFrame
 Row │ julia_result  python_result 
     │ Int64         Int64         
─────┼─────────────────────────────
   1 │        10271          10271
   2 │       241015         241023
   3 │        10443          11267
   4 │      1330156        1330178
   5 │        14818          14818
   6 │        30206          30311
   7 │         7744           7748
   8 │        12984          12996
   9 │      1021443        1021565
  10 │      1562556        1562609
  11 │         9922           9977
  12 │        41859          41918
  13 │       205614         205642
  14 │       997308         997366
  15 │       196792         196811
  16 │       350633         350714
  17 │       858087         858115
  18 │         7469           7483
  19 │      1200142        1200153
  20 │          307            319
  21 │       360106         360163
  22 │        18178          18325
  23 │       169262         169326
  24 │       418850         418850
  25 │       108623         109069
  26 │        50596          50640
  27 │       782223         782535
  28 │       172838         172874
  29 │       211326         211326
  30 │      1461293        1461464
  31 │      1147133        1147173
  32 │       635311         635311
  33 │      1087058        1087124
  34 │        24274          24501
  35 │      2008286        2008439
  36 │         7741           7741

@rafaqz
Copy link
Owner

rafaqz commented Nov 9, 2022

Thanks! That looks a lot closer.

I will try tweaking it to see if can be the same, but Im not sure now if the difference is a real problem or just a floating point error or >/>= difference .

@ayushpatnaikgit
Copy link
Author

The answers are close, but Rasters' results are consistently lower than rasterio's.

@rafaqz
Copy link
Owner

rafaqz commented Nov 10, 2022

Yes that's interesting. So the problem is occasionally not including a pixel that gdal includes.

@rafaqz
Copy link
Owner

rafaqz commented Nov 12, 2022

Ok I've solved this.

The remaining problem is actually in crop rather than rasterize. That's why the tests against gdal rasterize didn't catch this.

The way DimensionalData.jl works selecting areas is not exactly what crop should do when cells are Intervals - it's dropping cells that are only half inside an Extent. But for boundary=:touches and probably most use cases we really want anything that touches the area at all. Using the modified version, the results of zonal are almost identical. I'll PR soon and make a new minor version for DD as its a breaking change.

The remainder will be the differences on pixel corners in the middle of lines, as in the rasterize tests:

julia> @time python_zonal = convert(Array{Int64}, py"""pyzonal($shapefilepath, $rasterpath)""");
  3.100880 seconds (156 allocations: 4.016 KiB)

julia> @time julia_zonal = jlzonal(shapefilepath, rasterpath);
  0.820940 seconds (14.85 k allocations: 175.825 MiB, 0.88% gc time)

julia> difference = python_zonal .- julia_zonal;

julia> DataFrame(; julia_zonal, python_zonal, difference)
36×3 DataFrame
│ Row │ julia_zonal │ python_zonal │ difference │
│     │ Int64       │ Int64        │ Int64      │
├─────┼─────────────┼──────────────┼────────────┤
│ 110271102710          │
│ 22410232410230          │
│ 311267112670          │
│ 4133017813301780          │
│ 514818148180          │
│ 630311303110          │
│ 7774877480          │
│ 812991129965          │
│ 9102155610215659          │
│ 1015626171562609-8         │
│ 11997799770          │
│ 1241918419180          │
│ 132056422056420          │
│ 149973669973660          │
│ 151968111968110          │
│ 163507143507140          │
│ 17858119858115-4         │
│ 18748374830          │
│ 19120014812001535          │
│ 203193190          │
│ 213601633601630          │
│ 2218325183250          │
│ 231693261693260          │
│ 244188504188500          │
│ 251090691090690          │
│ 2650640506400          │
│ 277825357825350          │
│ 281728741728740          │
│ 292113262113260          │
│ 30146146414614640          │
│ 31114717311471730          │
│ 326353116353110          │
│ 33108712410871240          │
│ 3424501245010          │
│ 35200843920084390          │
│ 36774177410

Also good to see it's 4x faster than rasterio.

@ayushpatnaikgit
Copy link
Author

Brilliant!

@rafaqz
Copy link
Owner

rafaqz commented Nov 16, 2022

Just noting this isn't actually merged yet so reopening until its fixed and registered

@rafaqz
Copy link
Owner

rafaqz commented May 19, 2024

This should be fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants