
Fix bug in bfloat16 optimizer related to checkpointing #4434

Merged: 10 commits merged into microsoft:master on Oct 7, 2023

Conversation

@okoge-kaz (Contributor) commented Oct 1, 2023

Overview

Fixed the conditional branch in deepspeed/runtime/engine.py where the errors below occurred:

-        elif self.bfloat16_enabled() and not self.zero_optimization():
+        elif self.bfloat16_enabled() and hasattr(self.optimizer, "bf16_groups"):
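
As a minimal sketch of what the corrected condition buys us (the helper name below is hypothetical and the code is illustrative, not the actual DeepSpeed engine source; only `bfloat16_enabled`, `bf16_groups`, and `fp16_groups` appear in this PR):

```python
# Illustrative sketch only, not the actual engine code: pick the parameter-group
# attribute that the wrapped optimizer really exposes instead of inferring it from
# zero_optimization(). With ZeRO stage 1 + bf16, the engine wraps the optimizer in
# BF16_Optimizer, which has bf16_groups rather than fp16_groups.
def _select_param_groups(engine):
    if engine.bfloat16_enabled() and hasattr(engine.optimizer, "bf16_groups"):
        return engine.optimizer.bf16_groups   # bf16 path (e.g. ZeRO 1 + bf16)
    return engine.optimizer.fp16_groups       # fp16 / other paths
```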

Error Detail

I encountered the following error while training with ZeRO stage 1 optimization and bf16 using Megatron-DeepSpeed:

.env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3288, in _get_zero_param_shapes
    ) == 2 else self.optimizer.fp16_groups
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'BF16_Optimizer' object has no attribute 'fp16_groups'. Did you mean: 'bf16_groups'?

When loading the checkpoint, I got another error:

.env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2873, in _load_zero_checkpoint
    self.optimizer.load_state_dict(state_dict_list=zero_sd_list,
TypeError: BF16_Optimizer.load_state_dict() got an unexpected keyword argument 'load_serial'
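
The companion change is for `BF16_Optimizer.load_state_dict` to accept the `load_serial` keyword that `_load_zero_checkpoint` passes, which matches the "add load_serial arg to bf16_optimizer" commit in this PR. A minimal sketch under that assumption; apart from `state_dict_list` and `load_serial` (which appear in the traceback above), the names and defaults below are illustrative, not the actual DeepSpeed signature:

```python
class BF16_Optimizer:
    # Sketch only: accepting the load_serial keyword (even if it goes unused here)
    # is enough to stop _load_zero_checkpoint from raising the TypeError above.
    def load_state_dict(self, state_dict_list, load_optimizer_states=True, load_serial=None):
        ...  # restore optimizer state from state_dict_list as before
```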

Related Pull Request

#3759

Related Issue

#4272

@okoge-kaz changed the title from "Fix a bug in bfloat16 optimizer when saving checkpoints." to "Fix bug in bfloat16 optimizer when saving checkpoints." on Oct 1, 2023
@okoge-kaz changed the title from "Fix bug in bfloat16 optimizer when saving checkpoints." to "Fix bug in bfloat16 optimizer related checkpointing" on Oct 1, 2023
@okoge-kaz changed the title from "Fix bug in bfloat16 optimizer related checkpointing" to "Fix bug in bfloat16 optimizer related to checkpointing" on Oct 1, 2023
@tjruwase tjruwase added this pull request to the merge queue Oct 4, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 4, 2023
@okoge-kaz (Contributor, Author) commented Oct 5, 2023

@tjruwase

Thanks for your review.
I have also confirmed that checkpoints can be saved & loaded with bf16 in Megatron-DeepSpeed!

(A100 40GB x 16, across 2 nodes)

[screenshot of the training run]

@tjruwase (Contributor) commented Oct 5, 2023

@okoge-kaz, thanks for sharing your results.

@tjruwase (Contributor) commented Oct 5, 2023

Fix #4272

@okoge-kaz (Contributor, Author) commented:

@tjruwase All tests passed!
May I ask for a merge?

@tjruwase tjruwase added this pull request to the merge queue Oct 7, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2023
@tjruwase tjruwase added this pull request to the merge queue Oct 7, 2023
Merged via the queue into microsoft:master with commit 7ed952e Oct 7, 2023
15 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Oct 9, 2023
* fix: bf16 optimizer if condition

* fix: unexpected keyword argument 'load_serial'

* fix: add load_serial arg to bf16_optimizer

* style: fix indentation

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
* fix: bf16 optimizer if condition

* fix: unexpected keyword argument 'load_serial'

* fix: add load_serial arg to bf16_optimizer

* style: fix indentation

---------

Co-authored-by: Olatunji Ruwase <[email protected]>