
Finalizer error when rapidly writing to HDF5 #1048

Closed
TacHawkes opened this issue Feb 2, 2023 · 28 comments

@TacHawkes
TacHawkes commented Feb 2, 2023

Hi,

I have the following setup:

  1. A main thread which is idling
  2. A thread which writes to an HDF5 dataset with less than 1 ms between successive writes, i.e. (pseudo-MWE):
Threads.@spawn begin
    for i = 1:1_000_000
        lock(some_filelock) do
            hdf5_dataset[startidx:stopidx, 1] = somedata
        end
    end
end

If the writes are too fast, the task/thread crashes with the following trace:

error in running finalizer: ErrorException("val already in a list")
error at .\error.jl:35
push! at .\linked_list.jl:53 [inlined]
_wait2 at .\condition.jl:87
#wait#621 at .\condition.jl:127
wait at .\condition.jl:125 [inlined]
slowlock at .\lock.jl:156
lock at .\lock.jl:147 [inlined]
h5i_is_valid at C:\Users\user\.julia\packages\HDF5\TcavY\src\api\functions.jl:1960
isvalid at C:\Users\user\.julia\packages\HDF5\TcavY\src\objects.jl:2 [inlined]
close at C:\Users\user\.julia\packages\HDF5\TcavY\src\dataspaces.jl:25
unknown function (ip: 000002027f7c9b96)
run_finalizer at C:/workdir/src\gc.c:298
jl_gc_run_finalizers_in_list at C:/workdir/src\gc.c:388
run_finalizers at C:/workdir/src\gc.c:434
jl_mutex_unlock at C:/workdir/src\julia_locks.h:81 [inlined]
ijl_task_get_next at C:/workdir/src\partr.c:385
poptask at .\task.jl:963
wait at .\task.jl:972
#wait#621 at .\condition.jl:130
wait at .\condition.jl:125 [inlined]
take_buffered at .\channels.jl:456
take! at .\channels.jl:450 [inlined]
iterate at .\channels.jl:592
macro expansion at c:\Users\user\.vscode\extensions\julialang.language-julia-1.41.1\scripts\packages\JSONRPC\src\core.jl:103 [inlined]
#1 at .\task.jl:514
unknown function (ip: 000002027f6ffb33)
jl_apply at C:/workdir/src\julia.h:1874 [inlined]
start_task at C:/workdir/src\task.c:1092

Versioninfo:

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
@mkitti
Member

mkitti commented Feb 2, 2023

#1024 might be able to help with this.

It might help to call GC.gc() to manually trigger a garbage collection.

Another alternative would be to use GC.enable(false) or GC.enable_finalizers(false) before entering the tight loop (re-enabling afterwards).
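
A minimal sketch of that workaround, reusing the placeholder names from the pseudo-MWE above (some_filelock, hdf5_dataset, startidx, stopidx, somedata are stand-ins, not real API):

# Hedged sketch: suppress finalizers so the GC cannot run HDF5 close()
# finalizers in the middle of the tight loop, then restore them.
GC.enable_finalizers(false)
try
    for i = 1:1_000_000
        lock(some_filelock) do
            hdf5_dataset[startidx:stopidx, 1] = somedata
        end
    end
finally
    GC.enable_finalizers(true)  # always re-enable, even if a write throws
    GC.gc()                     # optionally trigger a collection once it is safe
end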

@simonbyrne
Collaborator

What version of HDF5.jl are you using?

@simonbyrne
Collaborator

Also, it would be helpful if you could give a full minimal reproducible example.

@TacHawkes
Author

This is with v0.16.13

@simonbyrne
Collaborator

Huh, we added locking around API calls in #1021, which should be in that version.

Do you still see the same issue if you remove the lock?

@TacHawkes
Author

Yes, here is an MWE. Run the whole block at once for it to happen. Adding GC.enable(false) / GC.enable(true) around the loop seemingly fixes the problem...

using HDF5

fid = h5open(tempname() * ".h5", "w")

# Extensible dataset: initial size (1_000_000, 1), max size (-1, 1) where -1 means unlimited, 1×1 chunks
d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

t = Threads.@spawn begin
    #GC.enable(false)
    for i=1:1_000_000
        d[i,1] = 1
    end
    #GC.enable(true)
end

wait(t)

close(fid)

@simonbyrne
Collaborator

I am unable to replicate it on my Mac. I don't have a Windows machine to try it on, unfortunately.

Do you see it if you run it outside VSCode? (i.e. if you just run it in a script from the command line)

@TacHawkes
Author

TacHawkes commented Feb 2, 2023

I'm also on a Mac at the moment. It also crashes from the command line, but with a slightly different trace:

error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")
ijl_error at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ijl_switch at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
try_yieldto at ./task.jl:899
wait at ./task.jl:973
#wait#621 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
h5i_is_valid at /Users/user/.julia/packages/HDF5/TcavY/src/api/functions.jl:1960
isvalid at /Users/user/.julia/packages/HDF5/TcavY/src/objects.jl:2 [inlined]
close at /Users/user/.julia/packages/HDF5/TcavY/src/dataspaces.jl:25
unknown function (ip: 0x115d4005f)
ijl_apply_generic at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizer at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_gc_run_finalizers_in_list at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizers at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ijl_task_get_next at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
poptask at ./task.jl:963
wait at ./task.jl:972
task_done_hook at ./task.jl:672
jfptr_task_done_hook_27918 at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
ijl_apply_generic at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_finish_task at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
start_task at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait(t::Task)
   @ Base ./task.jl:349
 [2] top-level scope
   @ ~/.julia/dev/HDF5Test/test/rectest.jl:41

    nested task error: schedule: Task not runnable
    Stacktrace:
      [1] error(s::String)
        @ Base ./error.jl:35
      [2] schedule(t::Task, arg::Any; error::Bool)
        @ Base ./task.jl:829
      [3] notify(c::Base.GenericCondition{Base.Threads.SpinLock}, arg::Any, all::Bool, error::Bool)
        @ Base ./condition.jl:154
      [4] #notify#622
        @ ./condition.jl:148 [inlined]
      [5] notify (repeats 2 times)
        @ ./condition.jl:148 [inlined]
      [6] (::Base.var"#notifywaiters#625")(rl::ReentrantLock)
        @ Base ./lock.jl:187
      [7] (::Base.var"#_unlock#624")(rl::ReentrantLock)
        @ Base ./lock.jl:183
      [8] unlock
        @ ./lock.jl:177 [inlined]
      [9] h5t_get_native_type(dtype_id::HDF5.Datatype, direction::Int64)
        @ HDF5.API ~/.julia/packages/HDF5/TcavY/src/api/functions.jl:7026
     [10] h5t_get_native_type
        @ ~/.julia/packages/HDF5/TcavY/src/api/helpers.jl:891 [inlined]
     [11] _memtype(filetype::HDF5.Datatype, #unused#::Type{Int64})
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/readwrite.jl:290
     [12] setindex!(::HDF5.Dataset, ::Matrix{Int64}, ::Int64, ::Int64)
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/datasets.jl:327
     [13] setindex!(::HDF5.Dataset, ::Int64, ::Int64, ::Int64)
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/datasets.jl:357
     [14] macro expansion
        @ ~/.julia/dev/HDF5Test/test/rectest.jl:36 [inlined]
     [15] (::var"#3#4")()
        @ Main ./threadingconstructs.jl:340

Versioninfo for this platform:

Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.4.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores

@simonbyrne
Collaborator

I still can't replicate it. What is the output of your full status -m?

Can you also try running it with --startup-file=no to make sure you're not picking up anything in your startup file?
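
For example, something like (assuming the MWE is saved as rectest.jl):

$ julia --startup-file=no -t auto rectest.jl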

@mkitti
Member

mkitti commented Feb 2, 2023

My guess is that we also need to be under low memory conditions for the GC to be attempting to finalize things in the middle of this loop.

Nonetheless, this points to a fundamental issue. We should not be locking and unlocking during finalization from a different task.
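
A hypothetical, non-HDF5 sketch of the hazard (demo_lock, Handle, and unsafe_close are invented for illustration; the error only fires if the lock happens to be contended at the moment the GC runs the finalizer):

const demo_lock = ReentrantLock()

mutable struct Handle
    id::Int
end

# Anti-pattern: a finalizer that takes a lock. If another task holds
# demo_lock when the GC runs this, lock() has to wait, waiting requires
# a task switch, and task switches are forbidden inside finalizers.
function unsafe_close(h::Handle)
    lock(demo_lock) do
        h.id = -1  # stand-in for the real h5*_close call
    end
end

h = Handle(1)
finalizer(unsafe_close, h)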

@simonbyrne
Collaborator

I guess I still don't understand how this can happen after #1021.

@mkitti
Member

mkitti commented Feb 2, 2023

I guess I still don't understand how this can happen after #1021.

The problem was caused by #1021: the finalizer tries to call close, and as of #1021 close tries to lock and unlock, which can force a task switch inside the finalizer, and that is not allowed.

I raised this and you responded in #1021 (comment). So either JuliaLang/julia#38487 does not actually address this, or there has been a regression.

@simonbyrne
Collaborator

My guess is that we also need to be under low memory conditions for the GC to be attempting to finalize things in the middle of this loop.

You're right: I was able to trigger it with --heap-size-hint=50M.
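
(That is, running the MWE above under julia -t auto --heap-size-hint=50M.)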

@mkitti
Member

mkitti commented Feb 2, 2023

Yes, here is a MWE. Run the whole block at once for it to happen. Adding GC.disable/enable fixes the problem seemingly...

Does GC.enable_finalizers(false) / GC.enable_finalizers(true) also fix the issue?

@simonbyrne
Collaborator

Does GC.enable_finalizers(false) / GC.enable_finalizers(true) also fix the issue?

Yes.

@mkitti
Member

mkitti commented Feb 2, 2023

So far I am unable to replicate on Windows.

@mkitti
Member

mkitti commented Feb 3, 2023

I can replicate on macOS but not consistently. It seemed to occur more frequently when I first logged in.

@simonbyrne
Collaborator

simonbyrne commented Feb 3, 2023

Possibly related: JuliaLang/julia#47612

@mkitti
Member

mkitti commented Feb 3, 2023

I think we need to use trylock in the finalizers.

function try_close_finalizer(x)
    # Test-and-test lock; see the trylock docstring.
    # Only proceed if the finalizer can acquire the lock without waiting.
    if !islocked(API.liblock) && trylock(API.liblock)
        try
            close(x)
        finally
            unlock(API.liblock)
        end
    else
        # Cannot close now without blocking; try again later.
        # TODO: consider exponential backoff
        finalizer(try_close_finalizer, x)
    end
end
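
Note the fallback branch: when the lock cannot be acquired without waiting, the finalizer re-registers itself via finalizer(try_close_finalizer, x), so the close is retried at a later GC pass rather than blocking inside the finalizer.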

@mkitti
Member

mkitti commented Feb 3, 2023

User code may want to acquire the API lock itself. Shouldn't holding the lock disable finalizers, per https://github.com/JuliaLang/julia/blob/72bebdf2715f94bad52c701cfb867744a5b070d6/base/lock.jl#L96 ?

edit: This makes the problem worse.

t = Threads.@spawn begin
    lock(HDF5.API.liblock)
    try
        for i=1:1_000_000
            d[i,1] = 1
        end
    finally
        unlock(HDF5.API.liblock)
    end
end

or maybe

t = Threads.@spawn begin
    @lock HDF5.API.liblock begin
        for i=1:1_000_000
            d[i,1] = 1
        end
    end
end

@mkitti
Member

mkitti commented Feb 3, 2023

On macOS, once I find a terminal session in which the error occurs, this reliably reproduces the problem:

$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
           using HDF5

           fid = h5open(tempname() * ".h5", "w")

           d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

           t = Threads.@spawn begin
               #GC.enable(false)
               for i=1:1_000_000
                   d[i,1] = 1
               end
               #GC.enable(true)
           end

           wait(t)

           close(fid)
       end
error in running finalizer: ErrorException("val already in a list")
error at ./error.jl:35
push! at ./linked_list.jl:53 [inlined]
_wait2 at ./condition.jl:87
#wait#621 at ./condition.jl:127
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
h5i_is_valid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/api/functions.jl:1960
isvalid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:19 [inlined]
close at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:11
unknown function (ip: 0x1170d405f)
ijl_apply_generic at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizer at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_gc_run_finalizers_in_list at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
...

Meanwhile, using h5open with a do block does not trigger the problem:

$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
           using HDF5

           h5open(tempname() * ".h5", "w") do fid

              d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

              t = Threads.@spawn begin
                  for i=1:1_000_000
                       d[i,1] = 1
                  end
              end
              wait(t)
           end
       end

@mkitti
Member

mkitti commented Feb 3, 2023

Hmm... acquiring the lock causes a lot of error message scrolling.

julia> begin
       using HDF5

       fid = h5open(tempname() * ".h5", "w")

       d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

       t = Threads.@spawn @lock HDF5.API.liblock begin
           #GC.enable(false)
           for i=1:1_000_000
               d[i,1] = 1
           end
           #GC.enable(true)
       end

       wait(t)

       close(fid)
       end

Using the do block does not seem to cause the problem:

julia> begin
       using HDF5

       h5open(tempname() * ".h5", "w") do fid

       d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

       t = Threads.@spawn @lock HDF5.API.liblock begin
           #GC.enable(false)
           for i=1:1_000_000
               d[i,1] = 1
           end
           #GC.enable(true)
       end

       wait(t)

       close(fid)
       end
       end

@mkitti
Member

mkitti commented Feb 3, 2023

Does GC.enable_finalizers(false) / GC.enable_finalizers(true) also fix the issue?

Yes.

This only avoids the error if I use GC.enable_finalizers outside of the Threads.@spawn. Using it inside causes a lot of error scrolling.

@mkitti
Member

mkitti commented Feb 3, 2023

Implementing try_close_finalizer above appears to resolve the issue.

diff --git a/src/HDF5.jl b/src/HDF5.jl
index fb95abf..9e29340 100644
--- a/src/HDF5.jl
+++ b/src/HDF5.jl
@@ -62,8 +62,24 @@ export @read,
 # H5DataStore, Attribute, File, Group, Dataset, Datatype, Opaque,
 # Dataspace, Object, Properties, VLen, ChunkStorage, Reference

+
+
 h5doc(name) = "[`$name`](https://portal.hdfgroup.org/display/HDF5/$(name))"

+function try_close_finalizer(x)
+    if !islocked(API.liblock) && trylock(API.liblock)
+        try
+            close(x)
+        finally
+            unlock(API.liblock)
+        end
+    else
+        finalizer(try_close_finalizer, x)
+    end
+end
+#const try_close_finalizer = Base.close
+
+
 include("api/api.jl")
 include("properties.jl")
 include("context.jl")
diff --git a/src/properties.jl b/src/properties.jl
index 1ca0033..88f16f9 100644
--- a/src/properties.jl
+++ b/src/properties.jl
@@ -104,7 +104,7 @@ macro propertyclass(name, classid)
             id::API.hid_t
             function $name(id::API.hid_t)
                 obj = new(id)
-                finalizer(close, obj)
+                finalizer(try_close_finalizer, obj)
                 obj
             end
         end
diff --git a/src/types.jl b/src/types.jl
index eea17a9..980d240 100644
--- a/src/types.jl
+++ b/src/types.jl
@@ -5,6 +5,7 @@
 # Supertype of HDF5.File, HDF5.Group, JldFile, JldGroup, Matlabv5File, and MatlabHDF5File.
 abstract type H5DataStore end

+
 # Read a list of variables, read(parent, "A", "B", "x", ...)
 function Base.read(parent::H5DataStore, name::AbstractString...)
     tuple((read(parent, x) for x in name)...)
@@ -41,7 +42,7 @@ mutable struct File <: H5DataStore
     function File(id, filename, toclose::Bool=true)
         f = new(id, filename)
         if toclose
-            finalizer(close, f)
+            finalizer(try_close_finalizer, f)
         end
         f
     end
@@ -55,7 +56,7 @@ mutable struct Group <: H5DataStore

     function Group(id, file)
         g = new(id, file)
-        finalizer(close, g)
+        finalizer(try_close_finalizer, g)
         g
     end
 end
@@ -69,7 +70,7 @@ mutable struct Dataset

     function Dataset(id, file, xfer=DatasetTransferProperties())
         dset = new(id, file, xfer)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end
@@ -84,14 +85,14 @@ mutable struct Datatype
     function Datatype(id, toclose::Bool=true)
         nt = new(id, toclose)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
     function Datatype(id, file::File, toclose::Bool=true)
         nt = new(id, toclose, file)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
@@ -106,7 +107,7 @@ mutable struct Dataspace

     function Dataspace(id)
         dspace = new(id)
-        finalizer(close, dspace)
+        finalizer(try_close_finalizer, dspace)
         dspace
     end
 end
@@ -119,7 +120,7 @@ mutable struct Attribute

     function Attribute(id, file)
         dset = new(id, file)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end

@TacHawkes
Author

#1049 fixes this. Big thanks to @mkitti for the fast fix!

@laborg
Contributor

laborg commented Feb 14, 2023

I've stumbled over this bug too. Any chance to land the fix and cut a quick bugfix release?

@mkitti
Member

mkitti commented Feb 14, 2023

Could you confirm that #1049 fixes the issue for you? If so, I will accelerate the merge.

cc: @musm

@laborg
Contributor

laborg commented Feb 14, 2023

Yes, the branch worked.

@mkitti mkitti closed this as completed in 2203de3 Feb 14, 2023