Setup worker-worker connections lazily #42

Open
amitmurthy opened this issue May 20, 2017 · 7 comments

@amitmurthy
Contributor

The default all_to_all topology connects all processes to each other. While this is fine for small clusters, the total number of TCP connections grows quadratically: N(N-1)/2, or roughly (N^2)/2, for N processes.
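
To put numbers on that growth, here is a small Python sketch (not Julia's Distributed code). It assumes N counts every process including the master, and the function names are made up for illustration.

```python
def all_to_all_connections(n_procs: int) -> int:
    """Connections when every process connects to every other one: N(N-1)/2."""
    return n_procs * (n_procs - 1) // 2


def master_only_connections(n_procs: int) -> int:
    """Connections when only the master (one of the n_procs) connects to each worker."""
    return max(n_procs - 1, 0)


if __name__ == "__main__":
    # Columns: number of processes, all-to-all links, master-only links.
    for n in (10, 100, 1000):
        print(n, all_to_all_connections(n), master_only_connections(n))
```

For 100 processes that is 4950 connections with all_to_all versus 99 when only the master connects to the workers.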

Considering that a large class of parallel problems only need master-worker connections, we should change the default topology to all_to_all_lazy, where worker-worker connections are set up only on the first request from one worker to another. We should also introduce another topology, master_routed, which only connects the master to the workers and, in case of a worker-worker call, routes the request through the master.

To summarize, implement 2 new topologies:

  1. all_to_all_lazy, where worker-worker connections are set up lazily, and which is the default for addprocs, and

  2. master_routed, in which only the master connects to workers and worker-worker messages are routed via the master.
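
A toy Python model of how the two proposed topologies might behave; this is only a sketch of the idea, not the Distributed implementation. The class `LazyCluster`, its `route` method, and the convention that process 1 is the master are all assumptions made for illustration.

```python
class LazyCluster:
    """Toy model: process 1 is the master, processes 2..n_procs are workers."""

    MASTER = 1

    def __init__(self, n_procs: int, topology: str = "all_to_all_lazy"):
        if topology not in ("all_to_all_lazy", "master_routed"):
            raise ValueError(f"unknown topology: {topology}")
        self.topology = topology
        # Master-worker links are always set up eagerly.
        self.links = {frozenset((self.MASTER, w)) for w in range(2, n_procs + 1)}

    def route(self, src: int, dst: int) -> list:
        """Return the path of process ids a message from src to dst takes."""
        if src == dst:
            return [src]  # local call, no connection needed
        link = frozenset((src, dst))
        if link in self.links:
            return [src, dst]
        if self.topology == "master_routed":
            # No worker-worker links are ever created; relay through the master.
            return [src, self.MASTER, dst]
        # all_to_all_lazy: open the worker-worker link on first use and keep it.
        self.links.add(link)
        return [src, dst]
```

With four processes, `LazyCluster(4).route(2, 3)` opens one extra link the first time it is called and reuses it afterwards, while `LazyCluster(4, "master_routed").route(2, 3)` goes via process 1 and the link count stays at three.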

@ViralBShah
Member

This would solve major connection time issues on large clusters that we have repeatedly seen.

@andreasnoack
Member

Just wanted to mention that JuliaLang/julia#22588 also seemed to make adding remote workers noticeably faster.

@amitmurthy
Contributor Author

I wonder how and why JuliaLang/julia#22588 affected worker startup time. @vtjnash ?

@amitmurthy
Contributor Author

@andreasnoack / @ViralBShah care to comment on the interface for lazy connection setup in JuliaLang/julia#22814?

@andreasnoack
Member

Sorry for the noise here. I just did some more systematic timings, and my previous impression must have been based on differences in the connection.

@StefanKarpinski
Sponsor Member

Bump – are we still planning on doing this?

@bisraelsen

bump

@vtjnash vtjnash transferred this issue from JuliaLang/julia Feb 10, 2024