Add DeepSpeed bf16 configuration #787

Merged: 30 commits merged into main on May 16, 2023
Conversation

@dashstander (Contributor) commented Feb 14, 2023

For main DeepSpeed, setting the precision configuration parameter is insufficient to enable bf16 training. DeepSpeed uses a dedicated bf16 dict in its config to enable bf16, as documented here. This adds that configuration to NeoX, and main DeepSpeed does the rest of the work.

Signed-off-by: Dashiell Stander <[email protected]>
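For reference, the bf16 section that DeepSpeed expects looks roughly like the following (a minimal sketch, shown here as a Python dict; in practice it is part of the full DeepSpeed config that NeoX builds and hands to DeepSpeed, and keys beyond "enabled" may vary by DeepSpeed version):

# Minimal sketch of the DeepSpeed bf16 config section described above,
# written as a Python dict for illustration.
deepspeed_config = {
    "bf16": {
        "enabled": True,  # DeepSpeed only runs bf16 training when this is set
    },
}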
@dashstander dashstander requested a review from a team as a code owner February 14, 2023 22:30
@dashstander dashstander requested review from Quentin-Anthony and StellaAthena and removed request for a team February 14, 2023 22:30
Base automatically changed from deepspeed_main to main March 9, 2023 16:55
@StellaAthena (Member) commented Apr 3, 2023

Since it seems like this is a dict with a single value, instead of making the user specify it, wouldn’t it make more sense to have the code automatically set the configs that DeepSpeed expects when the precision is set to bf16?

@CLAassistant commented Apr 23, 2023

CLA assistant check
All committers have signed the CLA.

@StellaAthena (Member):
I think at its core my question is whether this is actually more intuitive than:

precision = 'bfloat16'
precision_args = {whatever you need goes here}

@dashstander (Contributor, PR author):
As things currently work, we can't just have a precision_args argument that gets turned into a bf16 or fp16 dict for DeepSpeed, because every argument that gets passed to DeepSpeed needs to exist in the megatron/neox_args/deepspeed_args.py file.

The changes I've made (and didn't push all of last night, whoops) make it so that a user only has to set the precision argument and the bare bones of everything else gets set automatically. It also raises an error if those dictionaries are set but the precision argument disagrees with them.
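A minimal sketch of the behavior described above, with hypothetical function and argument names rather than the actual NeoX code:

# Hypothetical sketch, not the actual NeoX code: when only `precision` is
# set, fill in a bare-bones DeepSpeed dict; if the user also set the
# bf16/fp16 dicts and they disagree with `precision`, raise an error.
def resolve_precision_dicts(precision, fp16=None, bf16=None):
    if precision == "bfloat16":
        if fp16 and fp16.get("enabled"):
            raise ValueError("precision is 'bfloat16' but the fp16 dict enables fp16")
        if bf16 and not bf16.get("enabled"):
            raise ValueError("precision is 'bfloat16' but the bf16 dict disables bf16")
        bf16 = bf16 or {"enabled": True}
    elif precision == "fp16":
        if bf16 and bf16.get("enabled"):
            raise ValueError("precision is 'fp16' but the bf16 dict enables bf16")
        fp16 = fp16 or {"enabled": True}
    return fp16, bf16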

I'm open to other approaches though.

@dashstander (Contributor, PR author):
I still need to test this to confirm there aren't any stupid bugs, but otherwise @StellaAthena and @Quentin-Anthony I'd appreciate it if you took a look at this when you have a moment.

If people still don't like this--and honestly I don't love it--then I think we should open another issue where we figure out a way for people to pass arguments straight into DeepSpeed without them being checked by NeoX or needing to be checked into our code. I know we've talked about it before. In that scenario we'd just handle the default cases when people set precision, and advanced users who want to mess with the fp16 args dict could do so directly.

@@ -30,6 +30,7 @@ def _download_file(*args, **kwargs):
 import sys
 import dataclasses
 from functools import partial
+from pathlib import Path
Member:
What's this change for?


Default = None

Configuration for using bfloat16 floating-point format as an alternative to FP16. BFLOAT16 requires hardware support (e.g., NVIDIA A100).
Member:
Why'd we remove these details?

@dashstander (Contributor, PR author):
They're tied to the specific bf16 config that you wanted removed. I can move them to the section on the precision config?

@Quentin-Anthony Quentin-Anthony merged commit 056e9ca into main May 16, 2023
@Quentin-Anthony Quentin-Anthony deleted the bfloat16 branch May 16, 2023 19:16