[BUG] Setting Finetune=True causes checkpoint loading to not work correctly #4944

Closed
exnx opened this issue Jan 12, 2024 · 4 comments · Fixed by #4141
Labels: bug, training


exnx commented Jan 12, 2024

Using GPT-NeoX (which uses DeepSpeed), we noticed that with finetune: True, DeepSpeed is somehow loading the checkpoint incorrectly. We think it has something to do with the module vs. optimizer parameters not being updated (or tied) correctly, similar to what is described in this issue. Supposedly this was fixed by earlier pull requests, but I think it broke again at some point.

We were able to pinpoint this bug to somewhere between DeepSpeed 0.10.0 (works fine) and 0.10.2 (has the bug). Specifically, we tested this by using the finetune flag like a resume function, setting the learning rate to the value where pretraining left off. We would expect the loss to stay in the same range as during pretraining. (With finetune=True, the only difference from resuming pretraining is that the optimizer states are not loaded; we set the learning rate to be the same as where it left off.)
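
For reference, the setup above corresponds roughly to the following DeepSpeed calls (a minimal sketch, not the actual GPT-NeoX code path; the model, config, checkpoint path, and learning-rate value are placeholders):

```python
import torch
import deepspeed

# Stand-in model; the real run uses the GPT-NeoX model.
model = torch.nn.Linear(1024, 1024)

# Minimal DeepSpeed config (placeholder values).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
}

# Build the optimizer ourselves so we control the learning rate directly,
# setting it to the (hypothetical) value where pretraining stopped.
base_optimizer = torch.optim.AdamW(model.parameters(), lr=1.2e-4)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=base_optimizer, config=ds_config
)

# Finetune-style load: restore only the module weights, skip the
# optimizer and LR-scheduler states from the checkpoint.
engine.load_checkpoint(
    "checkpoints/pretrained",  # hypothetical checkpoint directory
    load_module_only=True,
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```

With the learning rate matched to the end of pretraining, this should behave like a plain resume, which is the expectation described above.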

What happens instead is that the very first step (step 0) is close to where pretraining left off, but on the next step the loss jumps very high and then tapers down without fully recovering; it basically behaves as if training from scratch thereafter. This happens in newer DeepSpeed versions.

Using the older DeepSpeed version 0.10.0 avoids this behavior, and the loss starts off and stays in line with pretraining, i.e. it keeps improving. When I upgrade to anything after this version (0.10.2+), I get the bug.

I am using PyTorch 2.0 and CUDA 11.7 on Ubuntu 22.04.

Has anybody had issues with the finetune setting when using DeepSpeed? Thanks!

exnx added the bug and training labels on Jan 12, 2024

exnx commented Jan 13, 2024

Update: it's actually not dependent on the DeepSpeed version. It depends on whether optimizer states are available or not. If I force no loading via no_load_optim, this bug appears in both new and old DeepSpeed versions. So my guess is that something goes wrong with the optimizer picking up the correct (pretrained) parameters when no optimizer states are loaded.


loadams commented Jan 17, 2024

Thanks for the bug report @exnx. I'll work on reproducing it on my side and getting a fix together after that.


exnx commented Jan 17, 2024

We figured it out. It was also reported in the GPT-NeoX issues, and a fix was already in review.

This line is the problem:

`if self.optimizer is not None and self.fp16_enabled():`

The fix (among other things) is to remove the constraint on datatype so that `self.optimizer.refresh_fp32_params()` is called for any datatype, not just FP16:

https://github.com/microsoft/DeepSpeed/pull/4141/commits
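
For clarity, the shape of the change in `deepspeed/runtime/engine.py` is roughly the following (a sketch only; the merged PR may differ in the details):

```python
# Before: the FP32 master copies were only refreshed from the loaded module
# weights when FP16 was enabled, so BF16 (and other) runs kept stale values.
if self.optimizer is not None and self.fp16_enabled():
    self.optimizer.refresh_fp32_params()

# After (sketch): refresh whenever an optimizer exists, regardless of datatype.
if self.optimizer is not None:
    self.optimizer.refresh_fp32_params()
```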


loadams commented Jan 17, 2024

Interesting, I'll take a look and we will prioritize review/merging. Thanks @exnx. Let's move the discussion to that PR; I'll link them so that the PR will close this issue.

github-merge-queue bot pushed a commit that referenced this issue Jan 18, 2024
…ia `load_module_only` (#4141)

This PR makes some fixes to the case where we want to resume training
from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while
not using the old optimizer in the checkpoint or relying on its
existence at all.

In this situation, despite passing `load_module_only=True` and
`load_optimizer_states=False` to `load_checkpoint()`, the previous
behavior was that:
- `self._load_zero_checkpoint` would still be called, which attempts to
load from the (in this case, nonexistent) checkpoint files. This PR
stops this function from being called if using `load_module_only=True`
and `load_optimizer_states=False`. Alternatively, calling this function
may be alright if `"load_from_fp32_weights": true` is set in the
DeepSpeed ZeRO config (reference:
https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733)
but this parameter does not seem to be documented in the docs for ZeRO
config dicts.
- in `_load_checkpoint`, the following code block:
```
if self.optimizer is not None and self.fp16_enabled():
    self.optimizer.refresh_fp32_params()
```
results in `self.optimizer.refresh_fp32_params()` being called only if
using FP16. As a result, the FP32 optimizer state is never initialized
from the 16-bit model weights. This PR removes the fp16-specific
condition.


Previously reported in:
EleutherAI/gpt-neox#947
EleutherAI/gpt-neox#843

Should also close:
#4017

Fixes: #4944 and #4017

This caused problems for a freshly-converted LLaMA checkpoint, which did
not contain optimizer states, when trying to train with this model as
initialization. I have confirmed the following fixes prevent this
behavior.

cc @Quentin-Anthony @zhangir-azerbayev

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
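
For reference, the `load_from_fp32_weights` option mentioned in the commit message lives under the `zero_optimization` section of the DeepSpeed config. A minimal sketch (placeholder values, not a recommended configuration):

```python
# Hypothetical ZeRO config showing where "load_from_fp32_weights" sits.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # As the commit text above suggests: when True, ZeRO restores the FP32
        # master weights from the checkpoint's fp32 shards; when False, it
        # rebuilds them from the 16-bit module weights.
        "load_from_fp32_weights": True,
    },
}
```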
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this issue Feb 17, 2024
…ia `load_module_only` (microsoft#4141)

This PR makes some fixes to the case where we want to resume training
from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while
not using the old optimizer in the checkpoint or relying on its
existence at all.

in this situation, despite passing `load_module_only=True` and
`load_optimizer_states=False` to `load_checkpoint()`, the previous
behavior was that:
- `self._load_zero_checkpoint` would still be called, which attempts to
load from the (in this case, nonexistent) checkpoint files. This PR
stops this function from being called if using `load_module_only=True`
and `load_optimizer_states=False`. Alternatively, calling this function
may be alright if `"load_from_fp32_weights": true` is set in the
DeepSpeed ZeRO config (reference:
https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733)
but this parameter does not seem to be documented in the docs for ZeRO
config dicts.
- in `_load_checkpoint`, the following codeblock: 
```
if self.optimizer is not None and self.fp16_enabled():
    self.optimizer.refresh_fp32_params()
```
results in `self.optimizer.refresh_fp32_params()` being called only if
using FP16. As a result, the FP32 optimizer state is never initialized
from the 16-bit model weights. This PR removes the fp16-specific
condition.


Previously reported in:
EleutherAI/gpt-neox#947
EleutherAI/gpt-neox#843

Should also close:
microsoft#4017

Fixes: microsoft#4944 and microsoft#4017

This caused problems for a freshly-converted LLama checkpoint, which did
not contain optimizer states, when trying to train with this model as
initialization. I have confirmed the following fixes prevent this
behavior.

cc @Quentin-Anthony @zhangir-azerbayev

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>