
BF16 optimizer sync parameters with module when load_module_only=True #56

Merged: 3 commits into EleutherAI:main on Nov 26, 2023

Conversation

ZHAOTING

This PR lets a BF16 optimizer synchronize its parameters with the module when load_module_only=True.

With the previous conditional statement, if self.optimizer is not None and self.fp16_enabled():, an engine with a BF16 optimizer never executes self.optimizer.refresh_fp32_params(). The optimizer therefore keeps its randomly initialized parameters, and after the first update step the model parameters are overwritten with these random values.

After changing the conditional statement to if self.optimizer is not None:, a BF16 optimizer can correctly sync its parameters with the loaded module parameters.
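For context, here is a minimal sketch of the change. The load_module_only guard and its location in the DeepSpeed engine's checkpoint-loading path are assumptions for illustration; only the two conditionals come from this PR.

```python
# Sketch of the checkpoint-loading branch guarded by load_module_only.
# The surrounding names (load_module_only, self) are assumed from DeepSpeed's
# runtime engine; only the conditional itself is what this PR changes.

# Before: the FP32 master copy is refreshed only for FP16, so a BF16
# optimizer keeps its randomly initialized master parameters.
if load_module_only:
    if self.optimizer is not None and self.fp16_enabled():
        self.optimizer.refresh_fp32_params()

# After: any wrapped optimizer (FP16 or BF16) re-syncs its master copy
# from the freshly loaded module parameters.
if load_module_only:
    if self.optimizer is not None:
        self.optimizer.refresh_fp32_params()
```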

The following are logs from fine-tuning a pre-trained model with the argument finetune=True:

  • Before the change, the loss rises dramatically after the first update:
     samples/sec: 5.547 | iteration        1/     200 | elapsed time per iteration (ms): 92300.5 | learning rate: 6.000E-05 | approx flops per GPU: 206.3TFLOPS | lm_loss: 1.016858E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    after 1 iterations memory (MB) | allocated: 61117.0322265625 | max allocated: 63131.30126953125 | reserved: 67136.0 | max reserved: 78768.0
    time (ms) | forward: 24868.52 | backward: 66994.60 | backward-backward: 66992.72 | backward-allreduce: 0.00 | optimizer: 406.53 | batch generator: 65.67
     samples/sec: 5.863 | iteration        2/     200 | elapsed time per iteration (ms): 87331.7 | learning rate: 6.000E-05 | approx flops per GPU: 218.1TFLOPS | lm_loss: 1.118344E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
    time (ms) | forward: 19865.54 | backward: 66595.99 | backward-backward: 66594.18 | backward-allreduce: 0.00 | optimizer: 789.99 | batch generator: 56.22
    ...
    
  • After the change, the loss remains stable:
    samples/sec: 5.493 | iteration        1/     200 | elapsed time per iteration (ms): 93208.1 | learning rate: 6.000E-05 | approx flops per GPU: 204.3TFLOPS | lm_loss: 1.016858E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    after 1 iterations memory (MB) | allocated: 61117.0322265625 | max allocated: 63131.30126953125 | reserved: 72086.0 | max reserved: 78688.0
    time (ms) | forward: 23067.47 | backward: 69325.02 | backward-backward: 69323.16 | backward-allreduce: 0.00 | optimizer: 786.37 | batch generator: 72.65
    samples/sec: 5.871 | iteration        2/     200 | elapsed time per iteration (ms): 87204.2 | learning rate: 6.000E-05 | approx flops per GPU: 218.4TFLOPS | lm_loss: 1.012163E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    time (ms) | forward: 19784.81 | backward: 66665.86 | backward-backward: 66664.03 | backward-allreduce: 0.00 | optimizer: 685.85 | batch generator: 59.20
    samples/sec: 5.865 | iteration        3/     200 | elapsed time per iteration (ms): 87297.5 | learning rate: 6.000E-05 | approx flops per GPU: 218.1TFLOPS | lm_loss: 1.025831E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    time (ms) | forward: 19837.46 | backward: 66535.76 | backward-backward: 66533.92 | backward-allreduce: 0.00 | optimizer: 859.87 | batch generator: 57.98
    ...
    

@AIproj

AIproj commented Nov 26, 2023

Recommend merging; it fixes bugs similar to those mentioned in this issue by implementing some of those solutions (alternatively, DeeperSpeed may need to be brought up to date with the most recent DeepSpeed main branch).

@Quentin-Anthony
Member

> Recommend merging; it fixes bugs similar to those mentioned in this issue by implementing some of those solutions (alternatively, DeeperSpeed may need to be brought up to date with the most recent DeepSpeed main branch).

I'm currently working on the latter

@Quentin-Anthony
Member

Merging this in the meantime

@Quentin-Anthony Quentin-Anthony merged commit b926043 into EleutherAI:main Nov 26, 2023
2 checks passed