Figure out what’s taking so long to do comms #126

Closed
StellaAthena opened this issue Feb 17, 2021 · 0 comments
Labels
bug (Something isn't working) · experiments (Experiments we wish to perform on the codebase) · help wanted (This issue needs assistance)


Our code is spending an absurd amount of time doing communication. Here’s a breakdown for one iteration:

%comms: 85.43695925203818 
%optimizer_step 33.48256986158767 
%forward: 3.2507325623113363 
%backward: 9.970107994957203

rank=0 time (ms) | train_batch: 87510.72 | batch_input: 78.66 | forward: 2844.74 | pipe_send_output: 5377.75 | comms: 74766.41 | pipe_recv_grad: 10826.45 | backward: 8724.91 | reduce_tied_grads: 0.80 | reduce_grads: 29651.29 | step: 29300.81 | _step_clipping: 0.20 | _step_step: 29295.69 | _step_zero_grad: 2.97 | _step_check_overflow: 0.80
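The percentages are just each timer divided by train_batch. As a sanity check, here is a minimal sketch (plain Python, with a subset of fields copied from the rank-0 line above) that recomputes them:

```python
# Recompute the percentage breakdown from the raw rank-0 timer line.
# All times are wall-clock milliseconds; only a few fields are shown here.
line = ("train_batch: 87510.72 | forward: 2844.74 | comms: 74766.41 | "
        "backward: 8724.91 | step: 29300.81")

timers = {}
for field in line.split(" | "):
    name, ms = field.split(": ")
    timers[name] = float(ms)

total = timers["train_batch"]
pct = {name: 100 * ms / total for name, ms in timers.items()}
# pct["comms"] ≈ 85.44 and pct["step"] ≈ 33.48, matching the reported
# %comms and %optimizer_step figures above
```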

So there are several stages where communication happens: one pipeline stage sending its output to the next (pipe_send_output); one pipeline stage receiving gradients from the subsequent stage during the backward pass (pipe_recv_grad); gradients (reduce_grads) and tied gradients (reduce_tied_grads) being reduced across machines; and partitions being allgathered in the ZeRO optimizer (part of step).

In total, these stages take up 85% of one iteration across 4 machines. We have no idea why we aren't saturating the interconnect during the communication steps, but communication is definitely the bottleneck.
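One way to test the "not saturating throughput" claim is a back-of-envelope comparison of the measured reduce_grads time against the ideal ring-allreduce time. Every number in this sketch (parameter count, precision, worker count, link speed) is a hypothetical assumption for illustration, not a value from the logs:

```python
def ring_allreduce_seconds(n_params, bytes_per_param, n_workers, link_gbps):
    """Lower bound for a ring allreduce: each worker sends and receives
    2 * (n - 1) / n copies of the payload over its slowest link."""
    payload_bytes = n_params * bytes_per_param
    traffic_bytes = 2 * (n_workers - 1) / n_workers * payload_bytes
    return traffic_bytes / (link_gbps * 1e9 / 8)  # Gbit/s -> bytes/s

# Hypothetical example: a 1.3B-parameter model with fp16 gradients,
# 16 data-parallel workers, 100 Gbit/s interconnect.
ideal = ring_allreduce_seconds(1.3e9, 2, 16, 100)  # ~0.39 s per allreduce
```

If the measured reduce_grads per iteration is tens of seconds while the ideal time per allreduce is well under a second, then either the links are far from saturated or far more traffic is moving than a single gradient allreduce per iteration (e.g. gradient accumulation without communication overlap).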

4 GPUs, 1 Node, pp=2, dp=2:
%comms: 48.58126341752161
%optimizer_step 1.4845065630029386
%forward: 12.781570852708713
%backward: 37.94681723603108

rank=0 time (ms) | train_batch: 186642.76 | batch_input: 521.31 | forward: 23855.87 | pipe_send_output: 26564.80 | comms: 90673.35 | pipe_recv_grad: 2339.02 | backward: 70824.96 | reduce_tied_grads: 0.58 | reduce_grads: 59596.19 | step: 2770.72 | _step_clipping: 0.17 | _step_step: 2766.47 | _step_zero_grad: 2.57 | _step_check_overflow: 0.61

8 GPUs, 1 Node, pp=2, dp=4:

%comms: 39.16614816474212
%optimizer_step 4.681605995231978
%forward: 15.035426649657623
%backward: 44.66480575797971

rank=0 time (ms) | train_batch: 79330.51 | batch_input: 335.96 | forward: 11927.67 | pipe_send_output: 16225.03 | comms: 31070.66 | pipe_recv_grad: 2078.04 | backward: 35432.79 | reduce_tied_grads: 0.76 | reduce_grads: 9496.98 | step: 3713.94 | _step_clipping: 0.14 | _step_step: 3709.45 | _step_zero_grad: 2.74 | _step_check_overflow: 0.67
16 GPUs, 2 Nodes, pp=2, dp=8:

%comms: 52.51962050838257
%optimizer_step 7.945747803546135
%forward: 11.523001023940648
%backward: 34.636627991799266

rank=0 time (ms) | train_batch: 51027.40 | batch_input: 189.07 | forward: 5879.88 | pipe_send_output: 4394.57 | comms: 26799.35 | pipe_recv_grad: 9948.94 | backward: 17674.15 | reduce_tied_grads: 0.84 | reduce_grads: 8792.55 | step: 4054.50 | _step_clipping: 0.14 | _step_step: 4050.02 | _step_zero_grad: 2.76 | _step_check_overflow: 0.67
32 GPUs, 4 Nodes, pp=2, dp=16:

%comms: 86.65624745291373
%optimizer_step 33.94892399535246
%forward: 3.0413984988055836
%backward: 9.345754953621045

rank=0 time (ms) | train_batch: 93470.05 | batch_input: 74.21 | forward: 2842.79 | pipe_send_output: 6155.78 | comms: 80997.49 | pipe_recv_grad: 10431.66 | backward: 8735.47 | reduce_tied_grads: 0.59 | reduce_grads: 33104.09 | step: 31732.03 | _step_clipping: 0.16 | _step_step: 31727.09 | _step_zero_grad: 2.96 | _step_check_overflow: 0.77
32 GPUs, 4 Nodes, pp=4, dp=8:

%comms: 90.24613435844124
%optimizer_step 9.220154240256214
%forward: 2.199843158054571
%backward: 7.178205155294351

rank=0 time (ms) | train_batch: 125101.70 | batch_input: 140.49 | forward: 2752.04 | pipe_send_output: 43070.37 | comms: 112899.34 | pipe_recv_grad: 50213.59 | backward: 8980.05 | reduce_tied_grads: 0.69 | reduce_grads: 8346.74 | step: 11534.56 | _step_clipping: 0.14 | _step_step: 11531.15 | _step_zero_grad: 1.64 | _step_check_overflow: 0.66
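Pulling the comms and train_batch numbers from the five runs above into one place makes the scaling pattern explicit (values copied from the rank-0 lines; the recomputed fractions match the reported %comms figures):

```python
# (config label) -> (comms ms, train_batch ms), copied from the logs above
runs = {
    "4 GPUs,  1 node,  pp=2, dp=2":  (90673.35, 186642.76),
    "8 GPUs,  1 node,  pp=2, dp=4":  (31070.66, 79330.51),
    "16 GPUs, 2 nodes, pp=2, dp=8":  (26799.35, 51027.40),
    "32 GPUs, 4 nodes, pp=2, dp=16": (80997.49, 93470.05),
    "32 GPUs, 4 nodes, pp=4, dp=8":  (112899.34, 125101.70),
}

pct_comms = {cfg: round(100 * comms / total, 2)
             for cfg, (comms, total) in runs.items()}
# comms stays in the ~39-53% range on 1-2 nodes, then jumps past 86%
# once 4 nodes are involved: the cross-node links dominate.
```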
StellaAthena added the bug and help wanted labels Feb 17, 2021
StellaAthena added this to To do in 1T or BUST via automation Feb 17, 2021
StellaAthena added the experiments label Feb 17, 2021
1T or BUST automation moved this from To do to Done Aug 28, 2021