
BF16 optimizer sync parameters with module when load_module_only=True #56

Merged: 3 commits into EleutherAI:main on Nov 26, 2023

Conversation

ZHAOTING

This PR lets a BF16 optimizer synchronize its parameters with the module when load_module_only=True.

With the previous conditional statement, if self.optimizer is not None and self.fp16_enabled():, an engine with a BF16 optimizer never executes self.optimizer.refresh_fp32_params(). The optimizer therefore keeps its randomly initialized parameters, and after the first update step the model parameters are overwritten with these random values.

After changing the conditional statement to if self.optimizer is not None:, a BF16 optimizer can correctly sync its parameters with the loaded module parameters.
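For context, here is a minimal sketch of the change. The load_module_only guard and its location in the DeepSpeed engine's checkpoint-loading path are assumptions for illustration; only the two conditionals come from this PR.

```python
# Sketch of the checkpoint-loading branch guarded by load_module_only.
# The surrounding names (load_module_only, self) are assumed from DeepSpeed's
# runtime engine; only the conditional itself is what this PR changes.

# Before: the FP32 master copy is refreshed only for FP16, so a BF16
# optimizer keeps its randomly initialized master parameters.
if load_module_only:
    if self.optimizer is not None and self.fp16_enabled():
        self.optimizer.refresh_fp32_params()

# After: any wrapped optimizer (FP16 or BF16) re-syncs its master copy
# from the freshly loaded module parameters.
if load_module_only:
    if self.optimizer is not None:
        self.optimizer.refresh_fp32_params()
```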

The following are logs from fine-tuning a pre-trained model with the argument finetune=True:

  • Before the change, the loss rises dramatically after the first update:
     samples/sec: 5.547 | iteration        1/     200 | elapsed time per iteration (ms): 92300.5 | learning rate: 6.000E-05 | approx flops per GPU: 206.3TFLOPS | lm_loss: 1.016858E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    after 1 iterations memory (MB) | allocated: 61117.0322265625 | max allocated: 63131.30126953125 | reserved: 67136.0 | max reserved: 78768.0
    time (ms) | forward: 24868.52 | backward: 66994.60 | backward-backward: 66992.72 | backward-allreduce: 0.00 | optimizer: 406.53 | batch generator: 65.67
     samples/sec: 5.863 | iteration        2/     200 | elapsed time per iteration (ms): 87331.7 | learning rate: 6.000E-05 | approx flops per GPU: 218.1TFLOPS | lm_loss: 1.118344E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
    time (ms) | forward: 19865.54 | backward: 66595.99 | backward-backward: 66594.18 | backward-allreduce: 0.00 | optimizer: 789.99 | batch generator: 56.22
    ...
    
  • After the change, the loss remains stable:
    samples/sec: 5.493 | iteration        1/     200 | elapsed time per iteration (ms): 93208.1 | learning rate: 6.000E-05 | approx flops per GPU: 204.3TFLOPS | lm_loss: 1.016858E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    after 1 iterations memory (MB) | allocated: 61117.0322265625 | max allocated: 63131.30126953125 | reserved: 72086.0 | max reserved: 78688.0
    time (ms) | forward: 23067.47 | backward: 69325.02 | backward-backward: 69323.16 | backward-allreduce: 0.00 | optimizer: 786.37 | batch generator: 72.65
    samples/sec: 5.871 | iteration        2/     200 | elapsed time per iteration (ms): 87204.2 | learning rate: 6.000E-05 | approx flops per GPU: 218.4TFLOPS | lm_loss: 1.012163E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    time (ms) | forward: 19784.81 | backward: 66665.86 | backward-backward: 66664.03 | backward-allreduce: 0.00 | optimizer: 685.85 | batch generator: 59.20
    samples/sec: 5.865 | iteration        3/     200 | elapsed time per iteration (ms): 87297.5 | learning rate: 6.000E-05 | approx flops per GPU: 218.1TFLOPS | lm_loss: 1.025831E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
    time (ms) | forward: 19837.46 | backward: 66535.76 | backward-backward: 66533.92 | backward-allreduce: 0.00 | optimizer: 859.87 | batch generator: 57.98
    ...
    

@AIproj

AIproj commented Nov 26, 2023

Recommend merging; it fixes bugs similar to those mentioned in this issue by implementing some of those solutions (alternatively, DeeperSpeed may need to be brought up to date with the most recent DeepSpeed main branch).

@Quentin-Anthony
Member

> Recommend merging; it fixes bugs similar to those mentioned in this issue by implementing some of those solutions (alternatively, DeeperSpeed may need to be brought up to date with the most recent DeepSpeed main branch).

I'm currently working on the latter

@Quentin-Anthony
Member

Merging this in the meantime

@Quentin-Anthony Quentin-Anthony merged commit b926043 into EleutherAI:main Nov 26, 2023
2 checks passed