Running with bf16 error #939
Also, when running with ZeRO stage 1, there is a NotImplementedError:
These DeepSpeed code snippets differ from the latest official DeepSpeed.
Update:
PR #787 changed how we process the bfloat16 configuration, but it doesn't look like the demo file was updated. Apologies for the oversight. Can you try deleting the
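For reference, upstream DeepSpeed documents bfloat16 as its own top-level `bf16` section of the JSON config, separate from `fp16`. A minimal fragment following the DeepSpeed config documentation (not this repo's YAML schema, whose exact post-#787 field names aren't shown in this thread) might look like:

```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 0
  }
}
```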
Yeah, my apologies @Life-0-1 -- putting it in the
@Life-0-1, does this fix solve your issue?
Yes, this enables bfloat16 training with a clearer config.
@Life-0-1, what error do you get when you use
This error occurs with
Below is the relevant DeepSpeed code; the logic behind it is that bf16 only works with ZeRO optimization stage 0.
After setting the ZeRO stage to 0, the error below occurs:
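For illustration, the stage restriction described above could look roughly like the following guard. This is a hypothetical sketch, not the actual DeepSpeed source; the function name `check_bf16_zero_compat` and its signature are invented here.

```python
# Hypothetical sketch of a compatibility check like the one described
# above: bf16 is only accepted together with ZeRO optimization stage 0,
# and any other stage raises NotImplementedError. Not actual DeepSpeed code.
ZERO_STAGE_DISABLED = 0

def check_bf16_zero_compat(bf16_enabled: bool, zero_stage: int) -> None:
    """Raise if bf16 is combined with an unsupported ZeRO stage."""
    if bf16_enabled and zero_stage != ZERO_STAGE_DISABLED:
        raise NotImplementedError(
            f"bfloat16 is only supported with ZeRO stage 0, "
            f"got stage {zero_stage}"
        )

check_bf16_zero_compat(bf16_enabled=True, zero_stage=0)  # passes
try:
    check_bf16_zero_compat(bf16_enabled=True, zero_stage=1)
except NotImplementedError as e:
    print("rejected:", e)
```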
I encountered the same problem.
Describe the bug
I tested bf16 training with various configurations, including the one given at configs/bf16_125M.yml, but all of them failed. I list the configurations and corresponding errors below.
The code is running on 8 A100 GPUs, and the latest gpt-neox repo is used.
Config 1 (same as configs/bf16_125M.yml):
error:
Config 2:
errors: