Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix close error due to race in d_closeall #248

Merged
merged 1 commit into from
Oct 3, 2023
Merged

fix close error due to race in d_closeall #248

merged 1 commit into from
Oct 3, 2023

Conversation

tanmaykm
Copy link
Member

@tanmaykm tanmaykm commented Oct 3, 2023

It seems possible that DistributedArrays.d_closeall() may encounter a condition where it finds a darray id in the registry, but the corresponding weakref value is nothing because the referenced darray got garbage collected. It has been enountered many times in CI and elsewhere, but is hard to replicate normally. Adding a check for the weakref value, before actually invoking close on it, to fix it.

fixes #246

@tanmaykm
Copy link
Member Author

tanmaykm commented Oct 3, 2023

The below code that adds some code to regulate GC behavior can recreate the situation.

julia> using Distributed

julia> addprocs(4);

julia> @everywhere begin
           using Distributed, DistributedArrays
       end

julia> A = rand(1:100, (100,100));

julia> DA = distribute(A);

julia> GC.enable(false)
true

julia> DA = nothing

julia> function DistributedArrays.d_from_weakref_or_d(id)
           d = get(DistributedArrays.registry, id, nothing)
           GC.enable(true)
           GC.gc()
           isa(d, WeakRef) && return d.value
           return d
       end

julia> DistributedArrays.d_closeall()
ERROR: MethodError: no method matching close(::Nothing)

Closest candidates are:
  close(::Union{Base.AsyncCondition, Timer})
   @ Base asyncevent.jl:162
  close(::Union{FileWatching.FileMonitor, FileWatching.FolderMonitor, FileWatching.PollingFileWatcher})
   @ FileWatching /data/Work/julia/binaries/julia-1.9.3/share/julia/stdlib/v1.9/FileWatching/src/FileWatching.jl:328
  close(::DArray)
   @ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:34
  ...

Stacktrace:
 [1] d_closeall()
   @ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:47
 [2] top-level scope
   @ REPL[9]:1

It seems possible that `DistributedArrays.d_closeall()` may encounter a condition where it finds a darray id in the `registry`, but the corresponding weakref value is `nothing` because the referenced darray got garbage collected. It has been enountered many times in CI and elsewhere, but is hard to replicate normally. Adding a check for the weakref value, before actually invoking `close` on it, to fix it.
@andreasnoack
Copy link
Member

Can we be sure that the remote memory is properly freed if the reference has been garbage collected?

@tanmaykm
Copy link
Member Author

tanmaykm commented Oct 3, 2023

Yes, I think so, because the finalizer invokes close. Did a small test to confirm that:

julia> using Distributed

julia> addprocs(4);

julia> @everywhere begin
           using Distributed, DistributedArrays

           function print_registry_entries()
               for k in keys(DistributedArrays.registry)
                   hasval = !(DistributedArrays.d_from_weakref_or_d(k) === nothing)
                   println("key $k hasval $hasval")
               end
           end
       end

julia> @everywhere print_registry_entries()

julia> A = rand(1:100, (100,100));

julia> DA = distribute(A);

julia> GC.enable(false)
true

julia> DA = nothing

julia> @everywhere print_registry_entries()
key (1, 1) hasval true
      From worker 5:	key (1, 1) hasval true
      From worker 3:	key (1, 1) hasval true
      From worker 2:	key (1, 1) hasval true
      From worker 4:	key (1, 1) hasval true

julia> function DistributedArrays.d_from_weakref_or_d(id)
           d = get(DistributedArrays.registry, id, nothing)
           GC.enable(true)
           GC.gc()
           isa(d, WeakRef) && return d.value
           return d
       end

julia> @everywhere print_registry_entries()
key (1, 1) hasval false
      From worker 2:	key (1, 1) hasval true
      From worker 3:	key (1, 1) hasval true
      From worker 5:	key (1, 1) hasval true
      From worker 4:	key (1, 1) hasval true

julia> DistributedArrays.d_closeall()

julia> @everywhere print_registry_entries()

Would it be good to add this as a test?

Copy link
Member

@andreasnoack andreasnoack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine as it is

@andreasnoack andreasnoack merged commit 4e82ecf into master Oct 3, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MethodError: no method matching close(::Nothing) with Julia 1.9
2 participants