user defined transports #9434

Merged
merged 1 commit into from
Jan 22, 2015

amitmurthy
Contributor

This builds on #9309 and supports user definable transports.
It supersedes #9046 - the discussion there outlines much of the motivations for this PR.

To implement this, custom cluster managers would need to provide their own
connect(manager::FooManager, pid::Integer, config::WorkerConfig) method

connect should return a pair of AsyncStream objects: one for reading data sent from worker pid, and the other for writing data that needs to be sent to worker pid. Custom cluster managers can use an in-memory BufferStream as the plumbing to ferry data between the custom, non-AsyncStream transport and the Julia parallel infrastructure.

A BufferStream wraps a PipeBuffer and condition variables to make a waitable stream.
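To illustrate the plumbing idea, here is a minimal sketch, assuming a recent Julia where the type is reachable as `Base.BufferStream` (it is not exported). The `Channel` is a hypothetical stand-in for a message-based transport such as a 0MQ socket; it is not part of this PR.

```julia
# Hypothetical stand-in for a message-based, non-stream transport:
# each element is the payload of one received message.
incoming = Channel{Vector{UInt8}}(8)

# In-memory stream that the parallel infrastructure would read from.
from_worker = Base.BufferStream()

# Pump task: copy every message into the stream until the transport closes.
pump = @async begin
    for msg in incoming
        write(from_worker, msg)
    end
    close(from_worker)
end

# Simulate two messages arriving from the worker, then a shutdown.
put!(incoming, Vector{UInt8}("hello "))
put!(incoming, Vector{UInt8}("world"))
close(incoming)
wait(pump)

# The reader side sees one continuous byte stream.
received = String(readavailable(from_worker))
```

The reader never touches the transport itself; it only sees a stream it can wait on, which is what lets a non-stream transport sit behind the stream-based interface.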

examples/clustermanager/0mq is an example of how they are used to set up a star network with a 0MQ broker in the middle.

connect is optional, and the default implementation is based on TCP as a transport mechanism.

Another optional method is kill(manager::FooManager, pid::Int, config::WorkerConfig), which is called to remove a worker from the cluster.
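As a rough skeleton of the two optional methods, here is a sketch that assumes a recent Julia, where these hooks live in the `Distributed` standard library rather than in Base as they did at the time of this PR. `FooManager` is the placeholder name used above, `pid` is typed as `Int` to match current `Distributed` signatures, and the stream bodies are deliberately left empty; a usable manager would also need `launch` and `manage` methods.

```julia
using Distributed
import Distributed: connect, kill

# Hypothetical manager type, named after the FooManager placeholder.
struct FooManager <: ClusterManager end

# Called to establish a connection to worker `pid`.
# Returns a (read stream, write stream) pair.
function connect(manager::FooManager, pid::Int, config::WorkerConfig)
    from_worker = Base.BufferStream()  # data arriving from worker `pid`
    to_worker = Base.BufferStream()    # data destined for worker `pid`
    # A real manager would start tasks here that shuttle bytes between
    # these streams and its own transport; omitted in this sketch.
    return (from_worker, to_worker)
end

# Called when worker `pid` is removed from the cluster.
function kill(manager::FooManager, pid::Int, config::WorkerConfig)
    # Transport-specific teardown would go here; omitted in this sketch.
    return nothing
end
```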

Two example implementations are provided:

  • examples/clustermanager/simple shows the use of unix domain sockets as transport
  • examples/clustermanager/0mq shows the use of 0MQ as transport

One thought is to move the examples to the ClusterManagers.jl package instead of adding them here. I have kept them in this PR so that folks can have a look, and will move them based on feedback.

Documentation still needs to be updated.

@amitmurthy amitmurthy added the domain:parallelism Parallel or distributed computation label Dec 21, 2014
@amitmurthy
Contributor Author

Documentation has been added. In the absence of any concerns I'll merge this over the next day or two.

cc: @JeffBezanson

@tkelman
Contributor

tkelman commented Jan 19, 2015

Do the examples here get run by test/examples.jl?

Answering my own question, they don't. Could they?

@ViralBShah
Member

Wouldn't that introduce new dependencies on Base, such as zeromq? Perhaps the test can be conditional on finding zeromq.

@tkelman
Contributor

tkelman commented Jan 19, 2015

Good point. Would the simple one be quick and painless to test, or does it require some infrastructure?

@amitmurthy
Contributor Author

Good idea, I'll add the simple example (unix domain sockets) to test. And maybe the 0mq one with a dummy module at least to ensure that the code loads.

@tkelman
Contributor

tkelman commented Jan 19, 2015

should probably be @unix_only

@tkelman
Contributor

tkelman commented Jan 19, 2015

And would that test be possible in environments with limited or no internet access, like distribution buildbots? As long as it's not more demanding in resources than our current socket or parallel tests, it should be fine.

@amitmurthy
Contributor Author

As long as networking is enabled, we can test by starting just 2 additional workers - so that would be a total of 3 julia processes.

@amitmurthy
Contributor Author

@tkelman - could you take a look now? I had to add a line to the Makefile to copy these examples to the build doc dir. The simple manager is run @unix_only out of process since mixing multiple cluster managers in the same setup is not fully supported (workers started via different cluster managers do not know how to connect to each other). As for 0mq, only code loading is tested.

@tkelman
Contributor

tkelman commented Jan 21, 2015

LGTM. Not sure whether this could cause any issues on distribution buildbots, but there's an easy way to find out.

I might've done the temporary-file test using a with-do block, but we're not too rigorous about cleaning up after all the other tests if there happens to be a failure.
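For reference, the do-block form mentioned here could look roughly like the following sketch. It assumes the `mktemp(f)` method available in current Julia, and the file contents are a made-up placeholder rather than the actual test script from this PR.

```julia
# Remember the path so we can check afterwards that the file is gone.
saved_path = Ref("")

contents = mktemp() do path, io
    saved_path[] = path
    write(io, "println(1 + 1)")  # placeholder script body
    close(io)                    # make sure the data is on disk
    read(path, String)           # the test would run or inspect the file here
end

# mktemp removes the file when the block exits, even if the block throws.
file_was_removed = !isfile(saved_path[])
```

The cleanup happens in a `finally` clause inside `mktemp`, so a failing test does not leave the temporary file behind.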

# code loads using a stub module definition for ZMQ.
zmq_found=true
try
using zmq
Contributor

wait, this should probably be capitalized

Contributor Author

Yes, good catch.

@amitmurthy amitmurthy changed the title RFC: user defined transports user defined transports Jan 22, 2015
amitmurthy added a commit that referenced this pull request Jan 22, 2015
@amitmurthy amitmurthy merged commit 7fbea9d into JuliaLang:master Jan 22, 2015
Labels
domain:parallelism Parallel or distributed computation
5 participants