Our code is spending an absurd amount of time doing communication. Here's a breakdown for one iteration:
There are several stages where communication happens: a pipeline stage sends its output to the next stage (pipe_send_output), a stage receives gradients back from the next stage during the backward pass (pipe_recv_grad), gradients (reduce_grads) and tied gradients (reduce_tied_grads) are reduced across machines, and partitions are all-gathered in the ZeRO optimizer (as part of step).
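To get per-stage numbers like the ones above, you need to attribute wall-clock time to each named communication call. A minimal, hypothetical timing wrapper (stdlib only; the stage names match the ones in the breakdown, everything else is illustrative — real GPU measurements would also need a device synchronize before each timestamp, since collectives are launched asynchronously):

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulates wall-clock time per named stage of a training iteration."""

    def __init__(self):
        self.totals = defaultdict(float)

    def timed(self, name, fn, *args, **kwargs):
        # Run fn and charge its wall-clock time to the named stage.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.totals[name] += time.perf_counter() - start
        return result

    def fractions(self, iteration_seconds):
        # Fraction of one iteration spent in each stage.
        return {k: v / iteration_seconds for k, v in self.totals.items()}


# Hypothetical usage around the communication calls:
# timer = StageTimer()
# timer.timed("pipe_send_output", send_activations, outputs)
# timer.timed("pipe_recv_grad", recv_gradients, buffers)
# timer.timed("reduce_grads", allreduce_gradients, grads)
```

Summing the resulting fractions over all communication stages gives the overall communication share of an iteration (the 85% figure below).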
In total, these stages account for 85% of one iteration across 4 machines. We have no idea why we aren't saturating the network bandwidth during the communication steps, but communication is definitely the bottleneck.
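One way to check the "not saturating throughput" claim is to convert a measured all-reduce time into an effective per-link bandwidth and compare it against the hardware's line rate. A small sketch assuming a ring all-reduce (the buffer size, machine count, and measured time below are made-up numbers, not from the issue):

```python
def ring_allreduce_bytes_on_wire(size_bytes: int, world_size: int) -> float:
    # A ring all-reduce moves 2*(N-1)/N of the buffer over each link
    # (reduce-scatter phase plus all-gather phase).
    return 2 * (world_size - 1) / world_size * size_bytes


def effective_bus_bandwidth_gbps(size_bytes: int, world_size: int,
                                 seconds: float) -> float:
    # Bandwidth actually achieved per link, in GB/s.
    return ring_allreduce_bytes_on_wire(size_bytes, world_size) / seconds / 1e9


# Hypothetical measurement: 1 GiB gradient buffer, 4 machines, 2.0 s observed.
bw = effective_bus_bandwidth_gbps(1 << 30, 4, 2.0)
print(f"effective bus bandwidth: {bw:.2f} GB/s")
```

If the number that comes out is far below the NIC's rated bandwidth, the problem is likely latency, small message sizes, or serialization of the collectives rather than raw link capacity.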