Possible bug when using both processes and threads, and dynamic remote channels between them #78

Open
orenbenkiki opened this issue Jul 8, 2021 · 2 comments

Comments

@orenbenkiki

orenbenkiki commented Jul 8, 2021

I have code which:

  • Creates multiple worker processes
  • Uses multiple threads in each one
  • Has a "server" loop in the main process which receives requests on a globally-known channel
  • Each request contains a freshly-created response channel
  • The server uses that response channel to send back a single value, and then closes the response channel
  • All the threads in all the processes start hammering the server with requests
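The steps above can be sketched roughly as follows. This is not the code from the gist: it is a minimal illustration that uses one task per request and no extra threads, and the names `requests`, `request_once` and `serve` are hypothetical. The failures reported in this issue concern the multi-threaded variant of this pattern.

```julia
using Distributed

addprocs(2)                      # two local workers; the count is arbitrary
@everywhere using Distributed

# Globally known request channel, owned by the main process.
# Each request is itself a response channel (names here are hypothetical).
requests = RemoteChannel(() -> Channel{RemoteChannel{Channel{Int}}}(100))

# Client side: create a fresh response channel, send it, wait for the reply.
@everywhere function request_once(requests)
    response = RemoteChannel(() -> Channel{Int}(1))
    put!(requests, response)
    return take!(response)
end

# Server side: answer `n` requests with a running counter, closing each
# response channel after its single value has been sent.
function serve(requests, n)
    counter = 0
    for _ in 1:n
        response = take!(requests)
        counter += 1
        put!(response, counter)
        close(response)
    end
    return counter
end

total = 2 * nworkers()
server = @async serve(requests, total)

# Two requests per worker, issued concurrently from the main process.
targets = [w for w in workers() for _ in 1:2]
results = asyncmap(w -> remotecall_fetch(request_once, w, requests), targets)

served = fetch(server)
rmprocs(workers())
```

In this sketch every request receives a distinct counter value, so `results` should contain each of `1:total` exactly once, in some order.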

The motivation is to have one multi-threaded process on each server in a compute cluster, but in the code demonstrating the bug (see below), all processes run on the same (local) machine; the code doesn't make use of this fact (does not use shared memory or atomics, only remote channels).

Leaving aside whether this is a good idiom, the approach is legal and should work(?).
However, running this in Julia 1.6.1 produces nondeterministic failures:

  • Sometimes it works (not often)
  • Sometimes it deadlocks (often)
  • Sometimes (less often) it crashes with an error message complaining about conversion of the data type of the response channel (details below)
  • Sometimes (rarely) it crashes with error messages involving the GC and concurrency errors

EDIT: The conversion error is clearly "impossible", as the error message indicates that an object of an explicitly created type actually has a different type; not to mention that this works most of the time. If the types were incorrect, the code should have failed the first time it was run. The code is:

    response_channel = RemoteChannel(() -> Channel{Int}(1+100))
    put!(everywhere_counters_channel, response_channel)

And the error message complains that:

    nested task error: MethodError: Cannot `convert` an object of type
    RemoteChannel{Channel{Any}} to an object of type
    RemoteChannel{Channel{Int64}}

This seems to be a bug, unless the code does something "forbidden" (it doesn't seem to?). It was suggested the GC issues might be related to JuliaLang/julia#38180 but this doesn't seem to cover the concurrency errors in the crash traces.

The source code and output crash traces are available in https://gist.github.com/orenbenkiki/ac71f348d4915b394805656b142b33fe

To run it, type JULIA_NUM_THREADS=4 julia Bug.jl 4 1000 quiet. You can vary the number of threads, the number of processes (here, also 4), the number of requests sent by each thread of each process (here, 1000), and whether the code is quiet or verbose (the latter uses println and flush a lot, which will impact the behavior).

@jonas-schulze
Contributor

The put! in line 104, which is executed from threads other than thread 1, should cause failures similar to #73.
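One way to probe this hypothesis, assuming the failures really do come from calling into Distributed from threads other than thread 1, is to have the threads write to a local `Channel` and let a single `@async` task started from the main task forward the values to the `RemoteChannel`. The sketch below is illustrative only; `outbox` and `forwarder` are hypothetical names, it is not taken from the gist, and it is not a confirmed fix.

```julia
using Distributed
using Base.Threads

# Remote channel that, in this sketch, is only ever touched by the forwarder.
remote = RemoteChannel(() -> Channel{Int}(1000))

# Local channel; Base.Channel is safe to use from multiple threads.
outbox = Channel{Int}(1000)

# `@async` tasks are sticky: started from the main task, this one stays
# on the main thread, so every remote put! is issued from there.
forwarder = @async for v in outbox
    put!(remote, v)
end

# Producer tasks, possibly running on other threads, only use the local channel.
@sync for t in 1:4
    Threads.@spawn for i in 1:10
        put!(outbox, 100t + i)
    end
end

close(outbox)      # lets the forwarder loop end once the buffer is drained
wait(forwarder)

received = [take!(remote) for _ in 1:40]
```

The cost of this design is that all remote traffic is serialized through one task, which may matter if the server is being hammered with requests.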

@orenbenkiki
Author

@jonas-schulze - thanks for the pointer!

I have also tried to replicate the problem without using any worker threads - see https://discourse.julialang.org/t/are-julia-channels-futures-thread-safe-with-a-failing-code-example/64490 - is it possible that Distributed has a problem with Threads even in a single process? If so, that would explain the problem I am seeing, but it seems this would go beyond the scope of #73?

@vtjnash vtjnash transferred this issue from JuliaLang/julia Feb 11, 2024