Our code is spending an absurd amount of time doing communication. Here's a breakdown for one iteration:
There are several stages where communication happens: a pipeline stage sends its output to the next stage (pipe_send_output), a stage receives gradients back from the next stage during the backward pass (pipe_recv_grad), gradients (reduce_grads) and tied gradients (reduce_tied_grads) are reduced across machines, and partitions are all-gathered in the ZeRO optimizer (as part of step).
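To get per-stage numbers like the ones above, you need to attribute wall-clock time to each named communication call. A minimal, hypothetical timing wrapper (stdlib only; the stage names match the ones in the breakdown, everything else is illustrative — real GPU measurements would also need a device synchronize before each timestamp, since collectives are launched asynchronously):

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulates wall-clock time per named stage of a training iteration."""

    def __init__(self):
        self.totals = defaultdict(float)

    def timed(self, name, fn, *args, **kwargs):
        # Run fn and charge its wall-clock time to the named stage.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.totals[name] += time.perf_counter() - start
        return result

    def fractions(self, iteration_seconds):
        # Fraction of one iteration spent in each stage.
        return {k: v / iteration_seconds for k, v in self.totals.items()}


# Hypothetical usage around the communication calls:
# timer = StageTimer()
# timer.timed("pipe_send_output", send_activations, outputs)
# timer.timed("pipe_recv_grad", recv_gradients, buffers)
# timer.timed("reduce_grads", allreduce_gradients, grads)
```

Summing the resulting fractions over all communication stages gives the overall communication share of an iteration (the 85% figure below).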
In total, these stages account for 85% of one iteration across 4 machines. We have no idea why we aren't saturating the network bandwidth during the communication steps, but communication is definitely the bottleneck.
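One way to check the "not saturating throughput" claim is to convert a measured all-reduce time into an effective per-link bandwidth and compare it against the hardware's line rate. A small sketch assuming a ring all-reduce (the buffer size, machine count, and measured time below are made-up numbers, not from the issue):

```python
def ring_allreduce_bytes_on_wire(size_bytes: int, world_size: int) -> float:
    # A ring all-reduce moves 2*(N-1)/N of the buffer over each link
    # (reduce-scatter phase plus all-gather phase).
    return 2 * (world_size - 1) / world_size * size_bytes


def effective_bus_bandwidth_gbps(size_bytes: int, world_size: int,
                                 seconds: float) -> float:
    # Bandwidth actually achieved per link, in GB/s.
    return ring_allreduce_bytes_on_wire(size_bytes, world_size) / seconds / 1e9


# Hypothetical measurement: 1 GiB gradient buffer, 4 machines, 2.0 s observed.
bw = effective_bus_bandwidth_gbps(1 << 30, 4, 2.0)
print(f"effective bus bandwidth: {bw:.2f} GB/s")
```

If the number that comes out is far below the NIC's rated bandwidth, the problem is likely latency, small message sizes, or serialization of the collectives rather than raw link capacity.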