
make nprocs report only fully connected workers #21347

Merged (2 commits, Apr 14, 2017)

Conversation

@tanmaykm (Member)

This changes `nprocs` (and therefore `nworkers`) to report only workers in the `W_CONNECTED` state.

This is to avoid failures with `@everywhere` or other similar methods that broadcast messages to all workers while some workers are still in the process of connecting.

To simulate:

- introduce an artificial delay at https://github.com/JuliaLang/julia/blob/2c4f6d74577a1b7606ed5e74e96158810f4f7af4/base/distributed/cluster.jl#L443 with `sleep(10)`
- start a master with:

```
using ClusterManagers

ElasticManager(;addr=IPv4("0.0.0.0"), port=9009, cookie="cookie", topology=:master_slave)

while nworkers() < 4
    sleep(1)
end

@everywhere println(myid())
```

- start 4 workers with:

```
using ClusterManagers
ClusterManagers.elastic_worker("cookie", "127.0.0.1", 9009; stdout_to_master=false)
```

Without this change, this will often result in:

```
ERROR: LoadError: peer 3 is not connected to 1. Topology : master_slave
check_worker_state(::Base.Distributed.Worker) at ./distributed/cluster.jl:115
send_msg_(::Base.Distributed.Worker, ::Base.Distributed.MsgHeader, ::Base.Distributed.CallMsg{:call_fetch}, ::Bool) at ./distributed/messages.jl:180
remotecall_fetch(::Function, ::Base.Distributed.Worker, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:346
remotecall_fetch(::Function, ::Int64, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:367
(::##1#3)() at ./distributed/macros.jl:84
```
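The idea behind the change can be sketched with a small standalone mock. This is not the actual `Base.Distributed` code; `MockWorker`, `PGRP`, `nprocs_old`, and `nprocs_new` are hypothetical names used here only to illustrate counting workers by connection state.

```julia
# Mock of the worker registry: each worker has an id and a connection state.
@enum WorkerState W_CREATED W_CONNECTED W_TERMINATING W_TERMINATED

struct MockWorker
    id::Int
    state::WorkerState
end

# All registered workers, including one that is still handshaking.
const PGRP = [MockWorker(1, W_CONNECTED),
              MockWorker(2, W_CONNECTED),
              MockWorker(3, W_CREATED)]    # still connecting

# Old behavior: every registered worker is counted, connected or not.
nprocs_old() = length(PGRP)

# New behavior: only fully connected workers are counted, so a broadcast
# like @everywhere never targets a peer that cannot yet receive messages.
nprocs_new() = count(w -> w.state == W_CONNECTED, PGRP)

nprocs_old()  # 3
nprocs_new()  # 2
```

With this filter, the `while nworkers() < 4` polling loop in the reproduction above only proceeds once all four workers are actually able to receive the `@everywhere` broadcast.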

@ararslan added the domain:parallelism (Parallel or distributed computation) label on Apr 11, 2017
@tkelman (Contributor) commented Apr 11, 2017

Does anything under the `TEST_FULL` flag mock the conditions where this would differ, for testing?

@amitmurthy (Contributor)

> Does anything under the `TEST_FULL` flag mock the conditions where this would differ, for testing?

No.

@amitmurthy (Contributor)

@tanmaykm - The change is fine. Could you also fix `procs()` and `workers()`, which suffer from the same issue, in this PR?

@tanmaykm (Member, Author)

Yes, will do that in a bit.
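The requested follow-up can be sketched the same way: apply the `W_CONNECTED` filter when building the process-id lists. Again a standalone mock with hypothetical names (`MockWorker`, `PGRP`, `procs_connected`, `workers_connected`), not the actual `Base.Distributed` implementation.

```julia
# Same mock registry as before.
@enum WorkerState W_CREATED W_CONNECTED W_TERMINATED

struct MockWorker
    id::Int
    state::WorkerState
end

const PGRP = [MockWorker(1, W_CONNECTED),
              MockWorker(2, W_CONNECTED),
              MockWorker(3, W_CREATED)]    # still connecting

# procs(): ids of fully connected processes only.
procs_connected() = [w.id for w in PGRP if w.state == W_CONNECTED]

# workers(): the connected processes minus the master (id 1).
workers_connected() = filter(id -> id != 1, procs_connected())

procs_connected()    # [1, 2]
workers_connected()  # [2]
```

Filtering in one place keeps `nprocs()`, `procs()`, and `workers()` mutually consistent, so code that iterates `workers()` sees exactly the peers that `nworkers()` counted.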
