
Refactored asynchronous worker setup code. Speeds up cluster setup. #11520

Merged
1 commit merged into master from amitm/multi_setup on Jun 10, 2015

Conversation

@amitmurthy (Contributor)

This PR refactors the code touched by an addprocs call. It also addresses the relatively slow addition of tunneled workers in the current code base.
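
For reference, the tunneled-worker path mentioned above is the ssh variant of addprocs; a minimal sketch, with a placeholder host string:

# Launch one worker on a remote machine over ssh, tunnelling the worker's
# socket connection back through the ssh session.
# "user@remote-host" is a placeholder, not a host from this PR.
addprocs(["user@remote-host"]; tunnel=true)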

Local tests pass. I have yet to collect timings for the expected speedups in adding workers; I have opened this PR early to get feedback.

@JeffBezanson , just fyi, I have used changes from #11503 in this PR.

@amitmurthy (Contributor Author)

addprocs now waits until all workers have established and tested their connections with their peers. In spite of these additional checks, addprocs(8) on my local machine is marginally faster: it takes 2.5 seconds now as opposed to 2.9 seconds previously.

I added the functions required at worker startup to precompile.jl. This reduced the time for addprocs(8) with this PR from 4.0 seconds (without the precompile.jl additions) to 2.5 seconds. How does one decide which functions to add to precompile.jl? Should we add all possible functions required at startup? I am not very clear on how exactly precompile.jl speeds things up.

addprocs from my local machine via ssh tunnelling to a 32-core AWS server was previously timing out since my local machine connects to the Internet via a relatively high latency 4G cellular network. With this PR it completes in approximately 30 seconds.
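
The numbers quoted above come from timing the call directly; a minimal sketch (run in a fresh session so that compilation cost is included):

# Wall-clock cost of bringing up 8 local workers, including the
# connection setup discussed above
@time addprocs(8)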

@amitmurthy changed the title from "WIP: Refactored asynchronous worker setup code" to "Refactored asynchronous worker setup code. Speeds up cluster setup." on Jun 3, 2015
send_msg_now(w, :identify_socket, myid())

# test connectivity with an echo
assert(:ok == remotecall_fetch(rpid, ()->:ok))
Member

Wouldn't it be better to do the check and throw a more descriptive error here? Otherwise it will be caught and the printed error will be

Error: AssertionError("") connecting to peer. Exiting

Contributor Author

Yes. Will correct.
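
A sketch of the more descriptive check being suggested; this matches the form the code takes later in the thread:

# test connectivity with an echo, failing with a readable message
# instead of a bare AssertionError
if remotecall_fetch(rpid, ()->:ok) != :ok
    throw("ping test with remote peer failed")
end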

@amitmurthy (Contributor Author)

I would like to merge this over the next couple of days if there are no issues. Any feedback on the additions to precompile.jl will be helpful.

@ViralBShah added the parallelism (Parallel or distributed computation) label on Jun 4, 2015
@timholy (Member) commented Jun 4, 2015

The way precompile.jl speeds things up is by generating native code for functions that haven't been called yet. It's as if you've called all those functions with the specified types before saving sys.so. Consequently, when a new julia session starts up, those compiled versions are already there waiting to be used, rather than needing to be JIT-compiled.
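
A minimal sketch of that mechanism with a throwaway function (f here is purely illustrative; in precompile.jl the same kind of call runs while the system image is built, so the compiled specializations end up in sys.so):

# Compile f for Int arguments before the first real call; the native
# code is cached, so the call below skips the JIT step.
f(x) = x + 1
precompile(f, (Int,))
f(1)   # uses the already-compiled specialization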

In the past, when I've wanted to investigate which functions should be added to precompile.jl, I've just "snooped" on codegen.cpp, for example having it print out both the name and types of the function it's compiling (and occasionally, the time required to complete the compilation). I don't remember exactly what I did to print the types, but jl_(li->specTypes) would be the first thing I'd try. (If that isn't useful, the ast surely would be.) Others may have better suggestions for how you go about picking which functions to add.

Sounds like a nice optimization. I've noticed starting workers slows things down a lot, looking forward to see how this feels.

precompile(Base.connect, (Base.ClusterManager, Int, Base.WorkerConfig))
precompile(Base.connect_w2w, (Int, Base.WorkerConfig))
precompile(Base.connect_to_worker, (AbstractString, Int))

Member

I certainly don't see any problem with these.

@amitmurthy (Contributor Author)

@timholy thanks for the tips.

@yuyichao (Contributor) commented Jun 4, 2015

The way I did it for #11359 was adding jl_(li) in to_function when nested_compile is false.

@@ -1,47 +1,5 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

## multi.jl - multiprocessing
Contributor

is this block of comments no longer accurate or something?

Contributor Author

Historical API documentation; it has long since been added to the doc files.

Contributor Author

Also, it has minor differences from the current API.

Contributor

Okay. As long as there's a better version somewhere else and this isn't the only place it's listed, it seems fine to delete.

@amitmurthy (Contributor Author)

After adding appropriate statements in precompile.jl (based on time taken by emit_function), addprocs(8) takes around 1 second and addprocs(4) around 700 ms.

@vtjnash, I have added start_worker to precompile.jl. However, it still shows up in my timing logs, i.e. emit_function is called for it. If I run julia --worker from a bash prompt, or run(`julia --worker`) from a REPL, it is not printed, which leads me to believe that the correct version has actually been precompiled. However, open(`julia --worker`, "r") does print it. Why does open not pick up the precompiled version of start_worker?

@JeffBezanson The other change is the addition of a boolean keyword argument, strict, to addprocs. strict=true means that addprocs will wait until all the workers have set up and tested their connections with each other. The default is false, i.e. addprocs returns once the master is connected to the workers, while the all-to-all connections are established in the background among the workers.

I am seriously considering changing the above option to connection=:master_slave and connection=:all_to_all. The former will result in connections being set up only from the master process to all workers, while the latter is the current all-to-all setup. Most parallel Julia usage I have seen only requires master-slave connections; worker-to-worker connections are not used at all. It will be a nice optimization, especially when we have to deal with hundreds of workers. The default will be connection=:master_slave, which will be a breaking change for code that expects a mesh network.
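
A sketch of what that could look like at the call site; the keyword and its values are the proposal above, not an existing API at the time of this PR:

# Proposed default: only master-to-worker connections are set up
addprocs(8; connection=:master_slave)

# Current behavior: full all-to-all mesh among master and workers
addprocs(8; connection=:all_to_all)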

@JeffBezanson (Member)

There should be a path here towards supporting other topologies, like hypercube.

@amitmurthy (Contributor Author)

The other change that helped speed up addprocs is replacing begin ... end blocks with functions that can be precompiled.
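
A minimal illustration of that pattern with made-up names (setup_connection and handle_setup are not functions from this PR):

setup_connection(id) = println("connecting worker $id")

# Before: the work runs in an anonymous `@async begin ... end` block,
# so there is no named function to list in precompile.jl.
t1 = @async begin
    setup_connection(1)
end

# After: the same work in a named function, which can be precompiled,
# e.g. precompile(handle_setup, (Int,)).
handle_setup(id) = setup_connection(id)
t2 = @async handle_setup(2)

wait(t1); wait(t2)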

@JeffBezanson, the master will still connect to all workers. Other topologies can be supported for worker-worker connections.

@amitmurthy (Contributor Author)

I have removed changes related to network setup completion from this PR. Will submit another PR to support different network topologies.

Ready to merge if there are no more concerns.

type Semaphore
    cnt::Int
    n_acq::Int
    c::Condition
end
Contributor

these field names could stand to be clearer

Contributor Author

OK.

Contributor

thanks, the new version is definitely better

@amitmurthy (Contributor Author)

The Travis errors are unrelated to this PR and are also not the usual OOM errors. Are they known issues? I have also noticed that they show up in other PRs over the last 48 hours.

exception on 2: ERROR: LoadError: LoadError: test error in expression: float(A) == float(full(A))
StackOverflowError:
 in convert at dates/periods.jl:22 (repeats 79996 times)
 in convert at dates/periods.jl:22 (repeats 79996 times)
while loading /tmp/julia/share/julia/test/sparsedir/sparse.jl, in expression starting on line 878
while loading /tmp/julia/share/julia/test/sparse.jl, in expression starting on line 3
    From worker 4:       * sorting              in  31.31 seconds
    From worker 3:       * bitarray             in  92.75 seconds
    From worker 5:       * subarray             in 181.53 seconds
ERROR: LoadError: LoadError: LoadError: test error in expression: float(A) == float(full(A))
StackOverflowError:
 in anonymous at task.jl:1378
while loading /tmp/julia/share/julia/test/sparsedir/sparse.jl, in expression starting on line 878
while loading /tmp/julia/share/julia/test/sparse.jl, in expression starting on line 3
while loading /tmp/julia/share/julia/test/runtests.jl, in expression starting on line 5

    From worker 2:       * sparse  

and

exception on 2: ERROR: LoadError: TypeError: _methods: in typeassert, expected Array{Any,1}, got Array{Any,1}
 in _methods at reflection.jl:143
 in _methods at reflection.jl:134
 in methods at reflection.jl:128
 in code_lowered at reflection.jl:127
 in include at ./boot.jl:253
 in runtests at /tmp/julia/share/julia/test/testdefs.jl:197
 in anonymous at multi.jl:822
 in run_work_thunk at multi.jl:575
 in anonymous at task.jl:822
while loading /tmp/julia/share/julia/test/meta.jl, in expression starting on line 59
WARNING: Module MetaTest not defined on process 1
Worker 2 terminated.

@amitmurthy added a commit that referenced this pull request on Jun 10, 2015: "Refactored asynchronous worker setup code. Speeds up cluster setup."
@amitmurthy merged commit 06bd8a8 into master on Jun 10, 2015
@tkelman deleted the amitm/multi_setup branch on June 10, 2015 at 18:41
@carnaval (Contributor)

@andreasnoack did you try this to see if it helps with the dreaded exascale 1+1?

@andreasnoack (Member)

No, but I'll do that now.

@andreasnoack (Member)

It is still quite fragile, but I've been able to connect 1024 workers and compute 1+1 in parallel. The fragility might be due to Slurm not liking to wait. The workers still die with

Worker 618 terminated.ERROR (unhandled task failure): EOFError: read end of file
 in read at ./iobuffer.jl:86
 in read at ./stream.jl:753
 in message_handler_loop at multi.jl:803
 in process_tcp_streams at multi.jl:791
 in anonymous at task.jl:13

at some random point.

There is also still something weird going on sequentially. The timings for the first @everywhere 1+1 are

nworkers    seconds
      64    3.57264
     128    5.35693
     256   10.8651
     512   22.5759
    1024   43.686

@amitmurthy (Contributor Author)

Can you try commenting out these lines

julia/base/multi.jl, lines 938 to 940 at 84f14d1:

if remotecall_fetch(rpid, ()->:ok) != :ok
    throw("ping test with remote peer failed")
end

and see if that helps? It should speed up your addprocs further, and hence mean less waiting for Slurm. Those lines are there to guarantee that all workers have complete visibility of each other when addprocs returns on the master.

#11665 will provide user control for the same.

@andreasnoack (Member)

I can try that, but I don't think the time spent in addprocs is the main issue here. It's the first remote calls after the workers have connected that take a lot of time. I guess each worker is waiting for the previous one to finish compilation before it starts compiling.

@amitmurthy (Contributor Author)

Unless Slurm launches multiple workers per core, I don't see why each worker would be waiting for the others to finish compiling.

Speeding up addprocs was in the context of your suspicion that Slurm is shutting down "inactive" workers, i.e. the fragility you mentioned. A faster addprocs means your computation can start sooner.


@andreasnoack (Member)

I don't think it is Slurm related. Here are similar numbers for julia.mit.edu. There is much more noise from the constant term because the machine doesn't have thousands of cores.

The @everywhere 1+1 benchmark:

nworkers    seconds
       2    0.394349
       4    0.393942
       8    0.623837
      16    0.796043
      32    1.384
      64    2.56447

@amitmurthy (Contributor Author)

Found the problem. Should have a PR soon.

