Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rework of cluster manager. support additional launches via a worker #9309

Merged
merged 1 commit into from
Dec 19, 2014
Merged

rework of cluster manager. support additional launches via a worker #9309

merged 1 commit into from
Dec 19, 2014

Conversation

amitmurthy
Copy link
Contributor

This PR does the following:

  • simplifies the cluster manager interface
  • starts additional workers at a host from the first worker . Will both speed up and remove the
    need for unecessary ssh connections for each worker launch.
  • has the framework in place to support custom transport mechanisms at a later stage
  • addprocs can accept an array of hosts / tuples. Each tuple of the type (host, count).
    count can be "auto" or :auto, in which case it will launch as many workers as the number
    of cores on host . machinefile too supports synatx auto * [email protected] for a machine definition.

Consequently, it supersedes #9202 and #9046

@@ -79,52 +79,28 @@ end

abstract ClusterManager

type Worker
host::ByteString
port::UInt16
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host/port information are maintained by the respective cluster managers and "config" dict

@amitmurthy
Copy link
Contributor Author

cc @JeffBezanson

@amitmurthy
Copy link
Contributor Author

I would really like to go ahead and merge this. @JeffBezanson any comments?

@ViralBShah
Copy link
Member

Any idea why both travis and appveyor are failing?

@ViralBShah
Copy link
Member

It appears that the test failures are related to this PR.

@amitmurthy
Copy link
Contributor Author

Sorry about that. Fixed.

@amitmurthy
Copy link
Contributor Author

AppVeyor error seems unrelated.

@tkelman
Copy link
Contributor

tkelman commented Dec 12, 2014

Looks like a fluke, make.exe segfaulted.

end
end
w.bind_addr
w.config[:bind_addr]
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The w.bind_addr field still exists. Is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, an oversight. Should be removed. Will do so.

@JeffBezanson
Copy link
Sponsor Member

Why the switch from passing ClusterManagers to passing types? That forces you to have basically a single global instance of each kind of ClusterManager; functions like manage don't have access to any state that might be stored in a ClusterManager.

@amitmurthy
Copy link
Contributor Author

manage continues to have access to the ClusterManager object (and hence common state) via config[:manager].

The change came because while implementing #9046 , the worker-to-worker connections too will be implemented by the cluster manager. And in the worker, a ClusterManager object may not be appropriate.

If we support user defined transports in the future, custom ClusterManagers will implement their own connect_m2w and connect_w2w, with the default implementations (with tcp sockets) being defined as

connect_m2w{T<:ClusterManager}(::Type{T}, pid::Int, config::Dict) and
connect_w2w{T<:ClusterManager}(::Type{T}, pid::Int, config::Dict)

@JeffBezanson
Copy link
Sponsor Member

I don't get it; it seems awkward to have this hidden coupling where T has to be the type of config[:manager]. If config has all the information in it, you could call manage(config, command) etc.

It also might make sense to use types to separate which fields are mandatory, or common to all workers, from manager-specific custom fields.

@amitmurthy
Copy link
Contributor Author

No, we need the type to dispatch to the correct implementation of manage. Each ClusterManager implements its own manage.

config does have all the information about the worker - some fields required by Base and any other worker specific key-value pairs that the ClusterManager may add.

My current thinking is to keep it this way and just document these fields in more detail. The config dict is in effect the interface between code in multi.jl for worker management and ClusterManager implementations.

For example, the launch method for SSHManager adds fields :host, :port, :bind_addr, :count to each worker's config that it launches. This information is used for the subsequent connection setup and launch of additional workers. An MPI cluster manager using MPI for transport may not add :host, :port, :bind_addr, but add a :mpi_id and use that in its own connect_m2w{T<:MPIClusterManager}(::Type{T}, ... implementation.

@JeffBezanson
Copy link
Sponsor Member

Ok, we can keep the Dict.

My thinking was that you could write

manage(config::Dict, command) = manage(config[:manager], config, command)

and then we would have normal dispatch on instances. It's very odd to have the object you're actually dispatching on hidden inside a dictionary.

@amitmurthy
Copy link
Contributor Author

OK. I'll revert it to use instances instead. It is more logical.

Also, rethinking now, your idea of replacing config with a type with Nullable fields is not a bad idea at all. One of the fields in this can be userdata::Any which can be set by managers with their own worker specific information. Let me try it that out.

@amitmurthy
Copy link
Contributor Author

Have made changes and rebased as discussed. Will merge this in a few days if no objections are raised.

One of the AppVeyor runs failed. Seems unrelated to these changes.

amitmurthy added a commit that referenced this pull request Dec 19, 2014
rework of cluster manager. support additional launches via a worker
@amitmurthy amitmurthy merged commit cc0cbfe into JuliaLang:master Dec 19, 2014
@ViralBShah ViralBShah added the domain:parallelism Parallel or distributed computation label Dec 19, 2014
@ViralBShah
Copy link
Member

The improvements deserve a mention in NEWS.md.

@JeffBezanson
Copy link
Sponsor Member

Looks good! Great to have steps in place towards using different transports.

@habemus-papadum
Copy link

Hi -- I'm interested in using the updated SSHManager with the features added by this pull request. I'm currently on 0.3.6 and from what I can tell base/managers.jl is not present in this version (correct me if I am wrong).

If i update my head node to use the current master branch, but leave my ssh nodes to be on 0.3.6, does anyone know if that is likely to work (basically I am looking for an efficient way to start many instances per remote box and take advantage of:

  • starts additional workers at a host from the first worker . Will both speed up and remove the
    need for unecessary ssh connections for each worker launch.

thanks!
(Let me know if I should move this to the issues forum...)

@amitmurthy
Copy link
Contributor Author

Unfortunately, you will need 0.4 on the ssh nodes too. Backporting this to 0.3 (which is on a bugfix/maintenance schedule) is not planned - as it involves changes to the cluster manager interface itself.

@habemus-papadum
Copy link

Thanks for the info!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants