Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GDAL Error when reading many files: GDALOpen() called with too many recursion levels #455

Closed
maxfreu opened this issue May 30, 2023 · 16 comments

Comments

@maxfreu
Copy link
Contributor

maxfreu commented May 30, 2023

Hi! I'm trying to implement a sanity check for image files downloaded from the web. I want to check ~600k images of size 2240x2240x4. After 105 corrupted files, I always get a LoadError: GDALError (CE_Failure, code 1): GDALOpen() called with too many recursion levels. I suspect the file descriptors are not properly closed. A minimum failing example looks like this:

using Rasters

function check()
    for (i,f) in enumerate(sort(readdir("/data_hdd/nrw_dop10/nonforest/"; join=true)))
        if i % 1000 == 0
            println(i)
        end

        try
            open(Raster(f; lazy=true)) do r
                size(r)
            end
        catch err
            @warn err.msg
            if contains(err.msg, "too many recursion levels")
                return i
            end
            #do sth
        end
    end
end

check()

Edit: Ok, I think this is an upstream issue with GDAL.jl not closing the file descriptors after the read function threw an error. If you have any further thoughts I'm happy to hear them!

@rafaqz
Copy link
Owner

rafaqz commented May 30, 2023

Oh it has no cleanup mechanism? That is concerning.

@maxfreu
Copy link
Contributor Author

maxfreu commented May 30, 2023

There is something in place here (aftercare and gdaljl_errorhandler), but I'm not sure whether they also close open files automatically; seems like no.

@rafaqz
Copy link
Owner

rafaqz commented May 30, 2023

Probably @visr will understand this better than I do

@maxfreu
Copy link
Contributor Author

maxfreu commented May 31, 2023

So, I managed to produce a similar error without reading with Rasters:

using Rasters
using ArchGDAL

s = 1000
r = Raster(rand(s,s,1); dims=(X(1:s), Y(1:s), Band(1:1)))
# this can jam your session but should produce a corrupted file when it works
Threads.@threads for _ in 1:8 write("corrupted.tif", r; force=true) end

for i in 1:3
    try
        ds = ArchGDAL.read("corrupted.tif")
    catch err
        @warn err
    end
end

┌ Warning: GDAL.GDALError(GDAL.CE_Failure, 1, "corrupted.tif: TIFFFetchDirectory:Sanity check on directory count failed, this is probably not a valid IFD offset")
┌ Warning: GDAL.GDALError(GDAL.CE_Failure, 1, "GDALOpen() called on corrupted.tif recursively")
┌ Warning: GDAL.GDALError(GDAL.CE_Failure, 1, "GDALOpen() called on corrupted.tif recursively")

@visr is this to be expected? If not, should I move this issue upstream to AG or GDAL.jl?

@visr
Copy link

visr commented May 31, 2023

So in this last example you read a bad GeoTIFF, and get a good error the first time, and a different error afterwards, right? And you suspect that it is because the file is not properly closed on such an error. Does the error also occur when you use the do-block syntax? Because then things can be cleaned up directly rather than relying on the GC to trigger the finalizer.

https://yeesian.com/ArchGDAL.jl/stable/memory/

I'm not sure how to best improve this for the interactive usecase. I guess it would be nice if we could directly call destroy before throwing a fatal error on gdalopen?

@maxfreu
Copy link
Contributor Author

maxfreu commented May 31, 2023

So in this last example you read a bad GeoTIFF, and get a good error the first time, and a different error afterwards, right?

right

Does the error also occur when you use the do-block syntax?

yes

I guess it would be nice if we could directly call destroy before throwing a fatal error on gdalopen?

Probably yes. I think as of now all gdal functions receive the same "aftercare". Maybe the IO functions should have their own "ioaftercare" which closes the dataset right away. Because closing them after they threw an error is almost impossible for the user. In my case, the system almost crashed, because I had too many files open.

@rafaqz
Copy link
Owner

rafaqz commented May 31, 2023

Rasters.jl is using the do block syntax in this case too. ioaftercare sounds like a solution

(also, kudos to the creative corrupted file generation...)

@visr
Copy link

visr commented May 31, 2023

It would be good to test the solution first with some manual edits in GDAL.jl, and if it works we can build it into the generator and regenerate the code.

I see that gdalclose, which ArchGDAL calls in destroy (part of the finalizer) also gets aftercare. It could also be that this is the source of the recursion. Perhaps that aftercare should be removed from that one.

https://github.com/JuliaGeo/GDAL.jl/blob/8e2ef58f694e6022301a0923f0cbf7ef4aaa6b81/src/libgdal.jl#L6997
https://github.com/JuliaGeo/GDAL.jl/blob/8e2ef58f694e6022301a0923f0cbf7ef4aaa6b81/gen/generator.jl#L131

@maxfreu
Copy link
Contributor Author

maxfreu commented Jun 1, 2023

So, another minimum failing example:

# run twice
ret = ccall(
    (:GDALOpenEx, GDAL.libgdal),
    GDAL.GDALDatasetH,
    (Cstring, Cuint, Ptr{Cstring}, Ptr{Cstring}, Ptr{Cstring}),
    "corrupted.tif",
    0x40,
    Ptr{Cvoid}(C_NULL),
    Ptr{Cvoid}(C_NULL),
    Ptr{Cvoid}(C_NULL),
)

I think the problem is unrelated to Rasters and AG, as it also occurs with plain GDAL. What I think happens is that the gdaljl_errorhandler gets triggered immediately when the call to gdalopenex fails and before the C side has closed the file. When I comment the error handler out, the recursion is gone. I think it boils down to that gdaljl_errorhandler should not throw.

@rafaqz
Copy link
Owner

rafaqz commented Jun 1, 2023

You mean plain GDAL.jl or gdal proper?

@maxfreu
Copy link
Contributor Author

maxfreu commented Jun 1, 2023

GDAL.jl

@visr
Copy link

visr commented Jun 1, 2023

Yes, right now with the error handler GDAL calls Julia when an error is encountered, and only if it is fatal we throw. Though since we also have aftercare, it could be that this issue would've been found anyway at that stage without the handler. There are some tests in https://github.com/JuliaGeo/GDAL.jl/blob/master/test/error.jl that we can use to check if it is fine to turn it off.

@maxfreu
Copy link
Contributor Author

maxfreu commented Jun 1, 2023

only if it is fatal we throw

You mean CE_Fatal? Currently it throws for CE_Failure.

Though since we also have aftercare, it could be that this issue would've been found anyway at that stage without the handler.

Yes, I tested it and the aftercare catches this error.

I noticed that the error handler actually runs twice during a single call to gdalopenex, each with a slightly different error message. So currently the gdaljl_errorhandler throws on the first error, while the aftercare would throw on the last, which should also be fine. So maybe the safest solution would be to switch the error handler to CE_Fatal and leave all the rest as it is. Switching to CE_Fatal passes the tests.

@visr
Copy link

visr commented Jun 1, 2023

Ah I meant CE_Failure. Is there a benefit to keep using it but only for CE_Fatal versus not registering the error handler at all? I'd probably be fine with either.

@maxfreu
Copy link
Contributor Author

maxfreu commented Jun 1, 2023

I think if you don't register an error handler all errors will be printed to stdout. But yes, a simple return C_NULL in the handler should do it, aftercare can also throw after CE_Fatal (if gdal didnt crash completely by this point).

@rafaqz
Copy link
Owner

rafaqz commented Jun 3, 2023

Closing in favour of JuliaGeo/GDAL.jl#153

@rafaqz rafaqz closed this as completed Jun 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants