
RFC: detach() and attach() from/to worker processes #3428

Closed

Conversation

amitmurthy
Contributor

This provides a means for the client REPL to be detached from the set of worker processes and reattached later. Typical use cases:

  • Long-running computations in Julia spanning hours or days: the client terminal can be safely detached after starting the computation.
  • Similarly, when the workers are in a cloud environment but the client console is on the user's local machine/laptop.

Currently implemented are detach and attach. Both can only be executed on the client process (id = 1).

detach(connection_file::String) safely removes the client from the process group and writes complete connection information to connection_file.

attach(connection_file::String) uses the information recorded during detach and reconnects to the cluster.

Only one client process can be connected to the cluster at any time.
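A rough usage sketch of the proposed flow (detach/attach as proposed here, not existing API; the file path, worker count, and long_running_computation are placeholders):

```julia
# Start workers and launch a long-running job so that its result lives on
# a worker and survives the client going away. `long_running_computation`
# is a placeholder and would have to be defined on the workers.
addprocs(8)
@spawnat 2 begin
    global result = long_running_computation()
end

# Proposed detach(): complete connection details (including whether an ssh
# tunnel is needed) are written to the file; the REPL can now be closed.
detach("/tmp/cluster_connection_info")

# Hours or days later, possibly from a fresh Julia session on a laptop:
attach("/tmp/cluster_connection_info")
# ... and the stored `result` can now be fetched from worker 2.
```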

TO BE DONE

Since the above results in the client (id = 1) being detached, it leads to certain issues with the parallel computing infrastructure we currently have, for example @parallel and pmap. In typical usage the client process (id = 1) is the controller process for the entire distributed job execution, and it is currently meant to be interactive, in the sense that the terminal (REPL) is expected to be kept open.

Since the client process can now be detached and closed, we should provide an alternative mechanism for the user to push computation work to the "background", query its status, and retrieve results independently of the client process.

We can have the following new macros/functions (a rough implementation sketch follows the list):

@bg_exec(key::String, code_block) - runs the block of code, code_block, in the background, i.e., on the worker with the lowest process id other than pid 1. The result of the code block is stored in a Dict on that worker under the specified key.

bg_clear() - clears the Dict on the worker from which background jobs are controlled.

bg_take(key), bg_fetch(key) - take and fetch results from that Dict.

bg_put(key) - application code can add its own information to this Dict, for example progress information on long-running computations that can be queried periodically.
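To make this concrete, here is a minimal sketch of how such helpers might be layered on the existing remotecall primitives. It is written with the Distributed-stdlib signatures of later Julia versions; the bg_* names come from the list above, while BG_STORE, the helper functions, and the worker count are illustrative only, not this PR's implementation:

```julia
using Distributed
addprocs(4)

# Define the store and its accessors on every process; only the copy on
# the lowest-pid worker is actually used.
@everywhere begin
    const BG_STORE = Dict{String,Any}()
    bg_run!(k, f)   = (BG_STORE[k] = f(); nothing)   # run f(), remember its result
    bg_get(k)       = BG_STORE[k]
    bg_pop!(k)      = pop!(BG_STORE, k)
    bg_store!(k, v) = (BG_STORE[k] = v; nothing)
    bg_empty!()     = (empty!(BG_STORE); nothing)
end

bg_worker() = minimum(workers())   # lowest non-client pid, as proposed above

# Run `thunk` asynchronously on the background worker; the result is kept
# in BG_STORE under `key` so a (re)attached client can pick it up later.
bg_exec(key, thunk) = remote_do(bg_run!, bg_worker(), String(key), thunk)

bg_fetch(key)    = remotecall_fetch(bg_get, bg_worker(), String(key))
bg_take(key)     = remotecall_fetch(bg_pop!, bg_worker(), String(key))
bg_put(key, val) = remotecall_fetch(bg_store!, bg_worker(), String(key), val)
bg_clear()       = remotecall_fetch(bg_empty!, bg_worker())
```

Usage would then be, e.g., bg_exec("sim1", () -> run_simulation()) before detaching and bg_fetch("sim1") after re-attaching, where run_simulation is whatever the application defines.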

NOTE:

  • The above scheme puts the process with the lowest pid (other than the client) in a special role of keeping some state for the long-running computational tasks.
  • Given the co-operative multitasking nature of Julia, application code must play nice and periodically call yield() in order for the client process to be able to attach() and detach() at will (see the sketch after this list).
  • Due to the segfault mentioned in PR #3394, extensive testing of detach/attach has not been possible yet.
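For instance, a long-running kernel would need to look something like the following (a hypothetical example, not code from this PR; expensive_step is a placeholder):

```julia
# Hypothetical long-running kernel: yielding once per outer iteration lets
# the scheduler service the networking tasks that handle attach()/detach().
function long_running_computation(n)
    acc = 0.0
    for i in 1:n
        acc += expensive_step(i)   # placeholder for the real per-step work
        yield()                    # co-operative multitasking: give other tasks a turn
    end
    return acc
end
```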

Any issues, suggestions, or different schemes for a clean and consistent implementation of client detach/attach are welcome.

@amitmurthy
Contributor Author

  • While I don't particularly like the idea of detach and attach taking a filename as a parameter, it is required in the current context where we do not have a port mapper infrastructure, and the client is the only one with complete cluster connection details, including whether an ssh tunnel is required or not.
  • How about renaming attach to reattach?
  • I am also not very sure about the entire bg_* set of functions. Maybe, for now, we should let the user be aware of situations where the console may be detached, and hence explicitly execute the long-running compute control script on one of the worker processes?
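For example (pid 2 and run_control_loop are placeholders), the whole control loop could simply be spawned on a worker before detaching:

```julia
# Run the job-control logic itself on a worker, so it does not depend on
# the (detachable) client staying alive. `run_control_loop` is hypothetical
# and would need to be defined on the workers (e.g. via @everywhere).
job = @spawnat 2 run_control_loop()
```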

Your views @JeffBezanson @StefanKarpinski @ViralBShah ?

@ViralBShah
Member

I like this general idea, but it would be nice to hear what others have to say. Also, this is one of those things where having a working prototype would give a good feel for developing it further. @timholy I believe you do some fairly large computations with Julia. Would love to hear your views on the topic.

@JeffBezanson
Member

Cool functionality. I think it should be handled strictly by the REPL though, instead of by killing one of the processes.

@amitmurthy amitmurthy closed this Jul 12, 2013