-
-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support multiple topologies #11665
support multiple topologies #11665
Conversation
There seem to be two different terms in play here: "interconnect" and "topology". I find "topology" to be much clearer and more obvious – can we just use that one consistently throughout? In either case, I feel like we should use just one of these two terms consistently. |
My rationale for choosing the term "interconnect" over "topology" is that I tend to associate plain "topology" with the physical topology of the cluster. While "interconnect" here is the logical topology of the interconnections. For example, all workers, each with the same physical network characteristics may opt for a master_slave only interconnect model since a) the algorithm does not require any worker-worker communication, b) startup time is faster and c) reduces unnecessary tasks and socket in the workers. But this may just be me. If the meaning of "topology" is widely used to describe both physical and logical descriptions of interconnections, I can change over to the same. |
@@ -1197,6 +1197,9 @@ export | |||
init_worker, | |||
interrupt, | |||
isready, | |||
IC_MASTER_SLAVE, | |||
IC_ALL_TO_ALL, | |||
IC_CUSTOM, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth/necessary exporting three new names to have this new option? They are only used in set_interconnect
, rught? So maybe just have set_interconnect(s::Symbol)
with :master_slave
, :all_to_all
and :custom
and then have the enums unexported. I'm also wondering if it is enough to call the function interconnect
(or topology
) and avoid the underscore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort of defeats the purpose of enums, no? We could still unexport it and require usage to be Multiprocessing.IC_MASTER_SLAVE
or Parallel.IC_MASTER_SLAVE
once #11638 is merged.
interconnect
is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't used enums in Julia yet so I haven't thought much about what is the right way to use them. It just seems a bit much to export new names when they are only used in his function and at the same time Multiprocessing.IC_MASTER_SLAVE
or Parallel.IC_MASTER_SLAVE
is a bit long.
I really think that the term "topology" is less awkward than "interconnect" here. |
IMO the term topology is usually used in the logical sense. Eg. in MPI: Whereas interconnect is used for physical network layer - like Infiniband interconnect. Eg: |
That seems to qualify as a vote to call this "topology", no? |
Yes - my vote is to use topology as the term for this work. Also, I think this is really cool - which got trumped in the discussion on terminology! |
Indeed – this is great to have – and absolutely crucial for high performance distributed computing. |
I totally agree with Viral on terminology, and that this is awesome :) |
+1 for topology, and for #11665 (comment) |
Just my own two cents .... nothing in computer science is ever defined of course Interconnect ... for me is often about the wiring and perhaps the networking topology ..... is what one might call the graph theory topology could be instantiated in wires, in communications software or in algorithms Being a math geek, i like the word "topology" too |
10129fb
to
b1cb795
Compare
OK. "topology" it is. |
371cb3e
to
7853fa5
Compare
This PR has the first steps towards a complete topology support:
The arguments for merging this now instead of waiting for 0.5 are
|
I'll rebase and merge this in 2-3 days if there are no issues. |
f522f6e
to
7207a8a
Compare
Merging this once CI is green. |
support multiple topologies
Is there a reason this feature is undocumented in 0.4.2? It is quite helpful! |
The documentation is sparse, but it is there - http:https://docs.julialang.org/en/release-0.4/manual/parallel-computing/#specifying-network-topology-experimental |
Ahh, I was looking at the REPL docs for |
Yes, we should incorporate it in the stdlib docs too. |
Also in |
edit : current iteration : #11665 (comment)
Currently all workers are connected to all other workers. However, a large number of use cases require connections between the master and workers only, and the we can optimize the total number of TCP connections for such cases, especially with hundreds of workers.
This PR provides a means of supporting multiple connection topologies in the future. Currently ALL_TO_ALL and MASTER_SLAVE are supported.
Default is MASTER_SLAVE, which is a BREAKING change for programs that expect all to all connections.
With
all_to_all
,@time addprocs(8)
takes around 1.7 seconds.With
master_slave
,@time addprocs(8)
takes around 0.8 seconds.-I
or--interconnect
specifies the interconnect type to be setup byaddprocs
all_to_all
,master_slave
andcustom
. Default ismaster_slave
custom
indicates that the cluster manager is responsible for the connection topology and it will provide the list of workers that a worker must connect with. Currently unsupported.While currently we cannot send requests between nodes that are not connected to each other, we can build on this to implement routing mechanisms that will efficiently transport messages via intermediate, mutually connected nodes. Especially required for broadcast type of requests.
edit : standardized on the term "topology" in place of "interconnect".