-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactored asynchronous worker setup code. Speeds up cluster setup. #11520
Conversation
8fb9b2d
to
f2c200d
Compare
I added the functions required at worker startup to
|
f2c200d
to
5c68c30
Compare
send_msg_now(w, :identify_socket, myid()) | ||
|
||
# test connectivity with an echo | ||
assert(:ok == remotecall_fetch(rpid, ()->:ok)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be better to throw do the check and throw a better error here? Otherwise it will be caught and the printed error will be
Error: AssertionError("") connecting to peer. Exiting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Will correct.
5c68c30
to
a1ad093
Compare
I would like to merge this over the next couple of days if there are no issues. Any feedback on the additions to |
The way In the past, when I've wanted to investigate which functions should be added to Sounds like a nice optimization. I've noticed starting workers slows things down a lot, looking forward to see how this feels. |
precompile(Base.connect, (Base.ClusterManager, Int, Base.WorkerConfig)) | ||
precompile(Base.connect_w2w, (Int, Base.WorkerConfig)) | ||
precompile(Base.connect_to_worker, (AbstractString, Int)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I certainly don't see any problem with these.
@timholy thanks for the tips. |
The way I did it for #11359 was adding |
@@ -1,47 +1,5 @@ | |||
# This file is a part of Julia. License is MIT: https://julialang.org/license | |||
|
|||
## multi.jl - multiprocessing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this block of comments no longer accurate or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historical API documentation. Has been added to the doc files long since.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, minor differences with current API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay as long as there's a better version somewhere else and this isn't the only place this is listed, seems okay to delete then
a1ad093
to
8f8a388
Compare
After adding appropriate statements in @vtjnash , I have added @JeffBezanson The other change is addition of a boolean keyword arg I am seriously considering changing the above option to |
There should be a path here towards supporting other topologies, like hypercube. |
The other change that helped speed up @JeffBezanson, the master will still connect to all workers. Other topologies can be supported for worker-worker connections. |
8f8a388
to
34d7892
Compare
I have removed changes related to network setup completion from this PR. Will submit another PR to support different network topologies. Ready to merge if there are no more concerns. |
type Semaphore | ||
cnt::Int | ||
n_acq::Int | ||
c::Condition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these field names could stand to be clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, the new version is definitely better
34d7892
to
e554039
Compare
Travis errors are unrelated to this PR and also not the usual OOM errors. Are they known ones? Have also noticed that they show up in other PRs in the last 48 hours.
and
|
Refactored asynchronous worker setup code. Speeds up cluster setup.
@andreasnoack did you try this to see if it helps with the dreaded exascale 1+1 ? |
No, but I'll do that now. |
It is still quite fragile, but I've been able to connect with 1024 workers and compute Worker 618 terminated.ERROR (unhandled task failure): EOFError: read end of file
in read at ./iobuffer.jl:86
in read at ./stream.jl:753
in message_handler_loop at multi.jl:803
in process_tcp_streams at multi.jl:791
in anonymous at task.jl:13 at some random point. There is also still something weird sequentilly going on. The timings for the first
|
Can you try commenting out Lines 938 to 940 in 84f14d1
addprocs further and hence less waiting for Slurm. Those lines are there to guarantee that all workers have complete visibility to each other when addprocs returns on the master.
#11665 will provide user control for the same. |
I can try that, but I don't think the time spent in |
Unless slurm launches multiple workers per core, I don't see why each Speeding up addprocs was in context of your suspicion that slurm is On Fri, Jun 12, 2015 at 8:35 PM, Andreas Noack [email protected]
|
I don't think it is Slurm related. Here are similar numbers for julia.mit.edu. There is much more noise from the constant term because the machine doesn't have thousands of cores. The
|
nworkers | seconds |
---|---|
2 | 0.394349 |
4 | 0.393942 |
8 | 0.623837 |
16 | 0.796043 |
32 | 1.384 |
64 | 2.56447 |
Found the problem. Should have a PR soon. On Fri, Jun 12, 2015 at 9:31 PM, Andreas Noack [email protected]
|
This PR refactors code touched by an
addprocs
call. It also fixes the relatively slower addition of tunneled workers in the current code base.Local tests pass. I am yet to get timings w.r.t. expected speedups in adding workers. Have opened this PR to get any feedback at all.
@JeffBezanson , just fyi, I have used changes from #11503 in this PR.