
make nprocs report only fully connected workers #21347

Merged (2 commits, Apr 14, 2017)

Conversation

@tanmaykm (Member)

This changes `nprocs` (and therefore `nworkers`) to report only workers in the `W_CONNECTED` state.

This is to avoid failures with `@everywhere` or other similar methods that broadcast messages to all workers while some workers are still in the process of connecting.

To simulate:

- introduce an artificial delay at https://github.com/JuliaLang/julia/blob/2c4f6d74577a1b7606ed5e74e96158810f4f7af4/base/distributed/cluster.jl#L443 with `sleep(10)`
- start a master with:

```
using ClusterManagers

ElasticManager(;addr=IPv4("0.0.0.0"), port=9009, cookie="cookie", topology=:master_slave)

while nworkers() < 4
    sleep(1)
end

@everywhere println(myid())
```

- start 4 workers with:

```
using ClusterManagers
ClusterManagers.elastic_worker("cookie", "127.0.0.1", 9009; stdout_to_master=false)
```

Without this change, this will often result in:

```
ERROR: LoadError: peer 3 is not connected to 1. Topology : master_slave
check_worker_state(::Base.Distributed.Worker) at ./distributed/cluster.jl:115
send_msg_(::Base.Distributed.Worker, ::Base.Distributed.MsgHeader, ::Base.Distributed.CallMsg{:call_fetch}, ::Bool) at ./distributed/messages.jl:180
remotecall_fetch(::Function, ::Base.Distributed.Worker, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:346
remotecall_fetch(::Function, ::Int64, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:367
(::##1#3)() at ./distributed/macros.jl:84
```
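The idea behind the change can be sketched with a small standalone mock. This is not the actual `Base.Distributed` code; `MockWorker`, `PGRP`, `nprocs_old`, and `nprocs_new` are hypothetical names used here only to illustrate counting workers by connection state.

```julia
# Mock of the worker registry: each worker has an id and a connection state.
@enum WorkerState W_CREATED W_CONNECTED W_TERMINATING W_TERMINATED

struct MockWorker
    id::Int
    state::WorkerState
end

# All registered workers, including one that is still handshaking.
const PGRP = [MockWorker(1, W_CONNECTED),
              MockWorker(2, W_CONNECTED),
              MockWorker(3, W_CREATED)]    # still connecting

# Old behavior: every registered worker is counted, connected or not.
nprocs_old() = length(PGRP)

# New behavior: only fully connected workers are counted, so a broadcast
# like @everywhere never targets a peer that cannot yet receive messages.
nprocs_new() = count(w -> w.state == W_CONNECTED, PGRP)

nprocs_old()  # 3
nprocs_new()  # 2
```

With this filter, the `while nworkers() < 4` polling loop in the reproduction above only proceeds once all four workers are actually able to receive the `@everywhere` broadcast.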

@ararslan added the domain:parallelism (Parallel or distributed computation) label on Apr 11, 2017
@tkelman (Contributor) commented Apr 11, 2017

Does anything under the `TEST_FULL` flag mock the conditions where this would differ, for testing?

@amitmurthy (Contributor)

> Does anything under the `TEST_FULL` flag mock the conditions where this would differ, for testing?

No.

@amitmurthy (Contributor)

@tanmaykm - The change is fine. Could you also fix `procs()` and `workers()`, which suffer from the same issue, in this PR?

@tanmaykm (Member, Author)

Yes, will do that in a bit.
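The requested follow-up can be sketched the same way: apply the `W_CONNECTED` filter when building the process-id lists. Again a standalone mock with hypothetical names (`MockWorker`, `PGRP`, `procs_connected`, `workers_connected`), not the actual `Base.Distributed` implementation.

```julia
# Same mock registry as before.
@enum WorkerState W_CREATED W_CONNECTED W_TERMINATED

struct MockWorker
    id::Int
    state::WorkerState
end

const PGRP = [MockWorker(1, W_CONNECTED),
              MockWorker(2, W_CONNECTED),
              MockWorker(3, W_CREATED)]    # still connecting

# procs(): ids of fully connected processes only.
procs_connected() = [w.id for w in PGRP if w.state == W_CONNECTED]

# workers(): the connected processes minus the master (id 1).
workers_connected() = filter(id -> id != 1, procs_connected())

procs_connected()    # [1, 2]
workers_connected()  # [2]
```

Filtering in one place keeps `nprocs()`, `procs()`, and `workers()` mutually consistent, so code that iterates `workers()` sees exactly the peers that `nworkers()` counted.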
