
Refactored asynchronous worker setup code. Speeds up cluster setup. #11520

Merged
1 commit merged into master from amitm/multi_setup on Jun 10, 2015

Conversation

@amitmurthy (Contributor)

This PR refactors the code touched by an addprocs call. It also addresses the relatively slow addition of tunneled workers in the current code base.
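
For reference, the tunneled-worker path mentioned above is the ssh variant of addprocs; a minimal sketch, with a placeholder host string:

# Launch one worker on a remote machine over ssh, tunnelling the worker's
# socket connection back through the ssh session.
# "user@remote-host" is a placeholder, not a host from this PR.
addprocs(["user@remote-host"]; tunnel=true)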

Local tests pass. I have yet to collect timings for the expected speedups in adding workers; I have opened this PR early to get feedback.

@JeffBezanson , just fyi, I have used changes from #11503 in this PR.

@amitmurthy (Contributor Author)

addprocs now waits until all workers have established and tested their connections with their peers. In spite of these additional checks, addprocs(8) on my local machine is marginally faster: it takes 2.5 seconds now as opposed to 2.9 seconds previously.

I added the functions required at worker startup to precompile.jl. This reduced the time for addprocs(8) with this PR from 4.0 seconds (without the precompile.jl additions) to 2.5 seconds. How does one decide which functions to add to precompile.jl? Should we add all possible functions required at startup? I am not very clear on how exactly precompile.jl speeds things up.

addprocs from my local machine via ssh tunnelling to a 32-core AWS server was previously timing out since my local machine connects to the Internet via a relatively high latency 4G cellular network. With this PR it completes in approximately 30 seconds.
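
The numbers quoted above come from timing the call directly; a minimal sketch (run in a fresh session so that compilation cost is included):

# Wall-clock cost of bringing up 8 local workers, including the
# connection setup discussed above
@time addprocs(8)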

@amitmurthy changed the title from "WIP: Refactored asynchronous worker setup code" to "Refactored asynchronous worker setup code. Speeds up cluster setup." on Jun 3, 2015
send_msg_now(w, :identify_socket, myid())

# test connectivity with an echo
assert(:ok == remotecall_fetch(rpid, ()->:ok))
Member

Wouldn't it be better to do the check and throw a more descriptive error here? Otherwise it will be caught and the printed error will be

Error: AssertionError("") connecting to peer. Exiting

Contributor Author

Yes. Will correct.
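
A sketch of the more descriptive check being suggested; this matches the form the code takes later in the thread:

# test connectivity with an echo, failing with a readable message
# instead of a bare AssertionError
if remotecall_fetch(rpid, ()->:ok) != :ok
    throw("ping test with remote peer failed")
end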

@amitmurthy (Contributor Author)

I would like to merge this over the next couple of days if there are no issues. Any feedback on the additions to precompile.jl will be helpful.

@ViralBShah added the parallelism (Parallel or distributed computation) label on Jun 4, 2015
@timholy (Member) commented Jun 4, 2015

The way precompile.jl speeds things up is by generating native code for functions that haven't been called yet. It's as if you've called all those functions with the specified types before saving sys.so. Consequently, when a new julia session starts up, those compiled versions are already there waiting to be used, rather than needing to be JIT-compiled.
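
A minimal sketch of that mechanism with a throwaway function (f here is purely illustrative; in precompile.jl the same kind of call runs while the system image is built, so the compiled specializations end up in sys.so):

# Compile f for Int arguments before the first real call; the native
# code is cached, so the call below skips the JIT step.
f(x) = x + 1
precompile(f, (Int,))
f(1)   # uses the already-compiled specialization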

In the past, when I've wanted to investigate which functions should be added to precompile.jl, I've just "snooped" on codegen.cpp, for example having it print out both the name and types of the function it's compiling (and occasionally, the time required to complete the compilation). I don't remember exactly what I did to print the types, but jl_(li->specTypes) would be the first thing I'd try. (If that isn't useful, the ast surely would be.) Others may have better suggestions for how you go about picking which functions to add.

Sounds like a nice optimization. I've noticed starting workers slows things down a lot, looking forward to see how this feels.

precompile(Base.connect, (Base.ClusterManager, Int, Base.WorkerConfig))
precompile(Base.connect_w2w, (Int, Base.WorkerConfig))
precompile(Base.connect_to_worker, (AbstractString, Int))

Member

I certainly don't see any problem with these.

@amitmurthy (Contributor Author)

@timholy thanks for the tips.

@yuyichao (Contributor) commented Jun 4, 2015

The way I did it for #11359 was adding jl_(li) in to_function when nested_compile is false.

@@ -1,47 +1,5 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

## multi.jl - multiprocessing
Contributor

is this block of comments no longer accurate or something?

Contributor Author

Historical API documentation; it has long since been added to the doc files.

Contributor Author

Also, it has minor differences from the current API.

Contributor

Okay. As long as there's a better version somewhere else and this isn't the only place it's listed, it seems fine to delete.

@amitmurthy (Contributor Author)

After adding appropriate statements in precompile.jl (based on time taken by emit_function), addprocs(8) takes around 1 second and addprocs(4) around 700 ms.

@vtjnash, I have added start_worker to precompile.jl. However, it still shows up in my timing logs, i.e. emit_function is called for it. If I run julia --worker from a bash prompt, or run(`julia --worker`) from a REPL, it is not printed, which leads me to believe that the correct version has actually been precompiled. However, open(`julia --worker`, "r") does print it. Why does open not pick up the precompiled version of start_worker?

@JeffBezanson The other change is the addition of a boolean keyword argument, strict, to addprocs. strict=true means that addprocs will wait until all the workers have set up and tested their connections with each other. The default is false, i.e. addprocs returns once the master is connected to the workers, while the all-to-all connections are established in the background among the workers.

I am seriously considering changing the above option to connection=:master_slave and connection=:all_to_all. The former will result in connections being set up only from the master process to all workers, while the latter is the current all-to-all setup. Most parallel Julia usage I have seen only requires master-slave connections; worker-to-worker connections are not used at all. It will be a nice optimization, especially when we have to deal with hundreds of workers. The default will be connection=:master_slave, which will be a breaking change for code that expects a mesh network.
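
A sketch of what that could look like at the call site; the keyword and its values are the proposal above, not an existing API at the time of this PR:

# Proposed default: only master-to-worker connections are set up
addprocs(8; connection=:master_slave)

# Current behavior: full all-to-all mesh among master and workers
addprocs(8; connection=:all_to_all)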

@JeffBezanson (Member)

There should be a path here towards supporting other topologies, like hypercube.

@amitmurthy (Contributor Author)

The other change that helped speed up addprocs is replacing begin ... end blocks with functions that can be precompiled.
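
A minimal illustration of that pattern with made-up names (setup_connection and handle_setup are not functions from this PR):

setup_connection(id) = println("connecting worker $id")

# Before: the work runs in an anonymous `@async begin ... end` block,
# so there is no named function to list in precompile.jl.
t1 = @async begin
    setup_connection(1)
end

# After: the same work in a named function, which can be precompiled,
# e.g. precompile(handle_setup, (Int,)).
handle_setup(id) = setup_connection(id)
t2 = @async handle_setup(2)

wait(t1); wait(t2)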

@JeffBezanson, the master will still connect to all workers. Other topologies can be supported for worker-worker connections.

@amitmurthy (Contributor Author)

I have removed changes related to network setup completion from this PR. Will submit another PR to support different network topologies.

Ready to merge if there are no more concerns.

type Semaphore
    cnt::Int
    n_acq::Int
    c::Condition
end
Contributor

these field names could stand to be clearer

Contributor Author

OK.

Contributor

thanks, the new version is definitely better

@amitmurthy (Contributor Author)

The Travis errors are unrelated to this PR and are also not the usual OOM errors. Are they known issues? I have also noticed that they show up in other PRs over the last 48 hours.

exception on 2: ERROR: LoadError: LoadError: test error in expression: float(A) == float(full(A))
StackOverflowError:
 in convert at dates/periods.jl:22 (repeats 79996 times)
 in convert at dates/periods.jl:22 (repeats 79996 times)
while loading /tmp/julia/share/julia/test/sparsedir/sparse.jl, in expression starting on line 878
while loading /tmp/julia/share/julia/test/sparse.jl, in expression starting on line 3
    From worker 4:       * sorting              in  31.31 seconds
    From worker 3:       * bitarray             in  92.75 seconds
    From worker 5:       * subarray             in 181.53 seconds
ERROR: LoadError: LoadError: LoadError: test error in expression: float(A) == float(full(A))
StackOverflowError:
 in anonymous at task.jl:1378
while loading /tmp/julia/share/julia/test/sparsedir/sparse.jl, in expression starting on line 878
while loading /tmp/julia/share/julia/test/sparse.jl, in expression starting on line 3
while loading /tmp/julia/share/julia/test/runtests.jl, in expression starting on line 5

    From worker 2:       * sparse  

and

exception on 2: ERROR: LoadError: TypeError: _methods: in typeassert, expected Array{Any,1}, got Array{Any,1}
 in _methods at reflection.jl:143
 in _methods at reflection.jl:134
 in methods at reflection.jl:128
 in code_lowered at reflection.jl:127
 in include at ./boot.jl:253
 in runtests at /tmp/julia/share/julia/test/testdefs.jl:197
 in anonymous at multi.jl:822
 in run_work_thunk at multi.jl:575
 in anonymous at task.jl:822
while loading /tmp/julia/share/julia/test/meta.jl, in expression starting on line 59
WARNING: Module MetaTest not defined on process 1
Worker 2 terminated.

@amitmurthy added a commit that referenced this pull request on Jun 10, 2015: "Refactored asynchronous worker setup code. Speeds up cluster setup."
@amitmurthy merged commit 06bd8a8 into master on Jun 10, 2015
@tkelman deleted the amitm/multi_setup branch on June 10, 2015 at 18:41
@carnaval (Contributor)

@andreasnoack did you try this to see if it helps with the dreaded exascale 1+1?

@andreasnoack (Member)

No, but I'll do that now.

@andreasnoack (Member)

It is still quite fragile, but I've been able to connect 1024 workers and compute 1+1 in parallel. The fragility might be due to Slurm not liking to wait. The workers still die with

Worker 618 terminated.ERROR (unhandled task failure): EOFError: read end of file
 in read at ./iobuffer.jl:86
 in read at ./stream.jl:753
 in message_handler_loop at multi.jl:803
 in process_tcp_streams at multi.jl:791
 in anonymous at task.jl:13

at some random point.

There is also still something weird going on sequentially. The timings for the first @everywhere 1+1 are

nworkers    seconds
      64    3.57264
     128    5.35693
     256   10.8651
     512   22.5759
    1024   43.686

@amitmurthy (Contributor Author)

Can you try commenting out these lines

julia/base/multi.jl, lines 938 to 940 at 84f14d1:

if remotecall_fetch(rpid, ()->:ok) != :ok
    throw("ping test with remote peer failed")
end

and see if that helps? It should speed up your addprocs further, and hence mean less waiting for Slurm. Those lines are there to guarantee that all workers have complete visibility of each other when addprocs returns on the master.

#11665 will provide user control for the same.

@andreasnoack (Member)

I can try that, but I don't think the time spent in addprocs is the main issue here. It's the first remote calls after the workers have connected that take a lot of time. I guess each worker is waiting for the previous one to finish compilation before it starts compiling.

@amitmurthy (Contributor Author)

Unless Slurm launches multiple workers per core, I don't see why each worker would be waiting for the others to finish compiling.

Speeding up addprocs was in the context of your suspicion that Slurm is shutting down "inactive" workers, i.e. the fragility you mentioned. A faster addprocs means your computation can start sooner.


@andreasnoack (Member)

I don't think it is Slurm related. Here are similar numbers for julia.mit.edu. There is much more noise from the constant term because the machine doesn't have thousands of cores.

The @everywhere 1+1 benchmark:

nworkers    seconds
       2    0.394349
       4    0.393942
       8    0.623837
      16    0.796043
      32    1.384
      64    2.56447

@amitmurthy (Contributor Author)

Found the problem. Should have a PR soon.

