Model does not train without FP16 #322

Closed
kipgparker opened this issue May 11, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@kipgparker (Contributor)

Currently the model does not train without FP16.

If you set `fp16["enabled"] = false` in the `small.yml` config and run `pretrain_gpt2.py`, it crashes with the following error:

  File "pretrain_gpt2.py", line 26, in <module>
    iteration = train(
  File "/home/aleph/repos/neox/gpt-neox/megatron/training.py", line 461, in train
    overflow_monitor.check(skipped_iter)  # check for repeated overflow
  File "/home/aleph/repos/neox/gpt-neox/megatron/utils.py", line 366, in check
    pretrain(neox_args=neox_args)
  File "/home/aleph/repos/neox/gpt-neox/megatron/training.py", line 96, in pretrain
    if self.optimizer.overflow and len(self.history) == self.n and all(self.history):
AttributeError: 'FusedAdam' object has no attribute 'overflow'```

Running on 2x 3090s.
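
For context, the failing check reads `self.optimizer.overflow` directly. That attribute only exists on DeepSpeed's FP16 optimizer wrapper, so once fp16 is disabled the bare `FusedAdam` is passed in and the lookup raises. Below is a minimal sketch of a more defensive overflow monitor; only the names `optimizer`, `history`, `n`, and `check` come from the traceback above, and the class shape, constructor, and error message are assumptions for illustration, not the repo's actual code or fix:

```python
from collections import deque


class OverflowMonitor:
    """Sketch of an overflow monitor that tolerates non-FP16 optimizers."""

    def __init__(self, optimizer, n=12):
        self.optimizer = optimizer
        self.n = n
        self.history = deque(maxlen=n)

    def check(self, skipped_iter):
        self.history.append(skipped_iter)
        # Only DeepSpeed's FP16 optimizer wrappers expose `overflow`;
        # a bare FusedAdam (fp16 disabled) does not, hence the AttributeError.
        overflow = getattr(self.optimizer, "overflow", False)
        if overflow and len(self.history) == self.n and all(self.history):
            raise Exception(
                f"Skipped {self.n} iterations in a row due to overflow"
            )
```
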
kipgparker added the bug label on May 11, 2021
@sdtblck (Contributor) commented May 11, 2021

@kipgparker that's intended behaviour for now; DeepSpeed in general does not work without fp16.
My bf16 branch is almost ready to merge, though.

sdtblck closed this as completed on Jun 16, 2021