
Fine-tuning GPT-NeoX doesn't work (for many scenarios) with the 16-bit stage-0 optimizer #568

Open
igor0 opened this issue Feb 20, 2022 · 1 comment
Labels
bug Something isn't working

igor0 commented Feb 20, 2022

TL;DR: The non-ZeRO ("stage 0") optimizer in DeepSpeed makes fragile assumptions about the optimizer state in the checkpoint, even when the finetune: true configuration parameter is set. A mitigating factor is that the 20B model is too big to use with the stage-0 optimizer, at least if you are fine-tuning the entire model.

Detailed Explanation:
The issue is with the stage-0 optimizer (i.e., with ZeRO disabled). A mitigating factor is that the full 20B model can't be fine-tuned with the stage-0 optimizer anyway, because there is no GPU where the model plus its optimizer state will fit.

You will hit these problems if you either:

  • Try fine-tuning a smaller GPT-NeoX model with the stage-0 optimizer
  • Try fine-tuning the 20B model with a small set of unfrozen parameters (like soft prompt tuning or LoRA). You may have to modify the gpt-neox code to do that, though, so arguably you are outside the "supported" scenarios for gpt-neox.

If you use the ZeRO optimizer (stage 1, 2, or 3) for fine-tuning, that's a different optimizer codebase, so the issues I'll describe probably don't apply. With that context set, let's look into the issues.

Loading Stage-0 Optimizer from Checkpoint
There are three types of state that the stage-0 optimizer (FP16_Optimizer) loads from the checkpoint:

  1. "Normal" Optimizer State: momentum vectors, etc.
  2. Loss Scaling State: dynamic_loss_scale, cur_scale, clip_grad, etc.
  3. 32-Bit Parameters: the model parameters stored in the checkpoint are 16-bit, so the optimizer keeps a 32-bit master copy of the parameters in its own state

Here is how the finetune config parameter impacts this:

  • With finetune: false, all three types of state are loaded from the checkpoint
  • With finetune: true, (1) is not loaded from the checkpoint, but (2) and (3) still are

This is surprising behavior. I'd expect that when fine-tuning, I shouldn't need any optimizer state in my checkpoint at all. My checkpoint may not contain optimizer state (e.g., the GPT-NeoX slim weights) or may contain state that is incompatible with the stage-0 optimizer.
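For reference, the load path looks roughly like this (a paraphrase, not a verbatim copy of the DeepSpeed source; exact key names may differ between versions):

# Rough paraphrase of FP16_Optimizer.load_state_dict (not verbatim; key names
# are approximate). Note that only (1) is gated on load_optimizer_states.
def load_state_dict(self, state_dict, load_optimizer_states=True):
    # (2) Loss scaling state: always restored from the checkpoint
    self.dynamic_loss_scale = state_dict["dynamic_loss_scale"]
    self.cur_scale = state_dict["cur_scale"]
    self.clip_grad = state_dict["clip_grad"]

    # (1) "Normal" optimizer state (momentum, etc.): skipped when finetune: true
    if load_optimizer_states:
        self.optimizer.load_state_dict(state_dict["optimizer_state_dict"])

    # (3) 32-bit master parameters: always restored from the checkpoint
    for current, saved in zip(self.fp32_groups_flat, state_dict["fp32_groups_flat"]):
        current.data.copy_(saved.data)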

What's the Problem?
There are two problems that I've seen.

  1. Missing loss scaling state. If the checkpoint doesn't contain the optimizer state related to loss scaling, the optimizer will crash on startup.
  2. Incorrect 32-bit parameters in optimizer state. The 32-bit parameters stored in the checkpoint only cover the parameters that were unfrozen at the time the checkpoint was saved. If different parameters are frozen at fine-tuning time than were frozen at checkpoint time, loading will either blow up or silently initialize the parameters incorrectly. Note that DeeperSpeed added a workaround for this (as part of the soft prompt tuning work), but it isn't guaranteed to work. I can share more details on this, but the writeup is getting long as-is :slight_smile:

The net result is that the optimizer state in the checkpoint needs to be "right", or else the optimizer will either crash (the first case) or silently behave incorrectly (the second case), even when finetune: true is set.
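To make problem (2) concrete, here is a minimal sketch in plain PyTorch (this is not DeepSpeed's actual flattening code, and the Linear layer is just a stand-in) of why freezing different parameters produces mismatched 32-bit groups:

import torch

# Toy stand-in for a model; the fp32 master copy is built only from
# parameters that currently require gradients.
model = torch.nn.Linear(16, 16).half()

# Pretraining run: everything is trainable, so the checkpointed flat fp32
# group covers weight + bias.
ckpt_group = torch.cat(
    [p.detach().float().flatten() for p in model.parameters() if p.requires_grad]
)

# Fine-tuning run: the weight is frozen (e.g. only a small adapter is trained),
# so the freshly built fp32 group no longer lines up with the checkpointed one.
model.weight.requires_grad_(False)
current_group = torch.cat(
    [p.detach().float().flatten() for p in model.parameters() if p.requires_grad]
)

print(ckpt_group.numel(), current_group.numel())  # 272 vs. 16 -> copy_ would fail
# If the sizes happened to match instead, copy_ would silently load values
# that belong to different parameters.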

Does Upstream DeepSpeed Have This Problem?
As far as I can tell, DeepSpeed has these same issues, even on the latest version.

What's the Solution?
I have a fix that works for me: basically, I have the FP16_Optimizer fully reinitialize itself in the fine-tuning case. But the DeepSpeed code is convoluted enough that it's conceivable that the fix breaks some other crazy scenario. Also, the fix is a deviation from the DeepSpeed codebase.
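Conceptually, the fix amounts to something like this (a sketch of the idea, not the exact patch):

# Sketch of the fix to FP16_Optimizer.load_state_dict (not the exact patch):
# when fine-tuning, ignore the checkpointed optimizer state entirely and
# rebuild everything from the freshly loaded 16-bit weights and the current config.
def load_state_dict(self, state_dict, load_optimizer_states=True):
    if not load_optimizer_states:
        # Keep the loss-scaling settings that came from the current config,
        # leave self.optimizer's state at its freshly initialized values, and
        # rebuild the 32-bit master parameters from the model's 16-bit parameters.
        self.refresh_fp32_params()
        return
    # ... otherwise, fall through to the existing restore path ...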

So, as I understand it, this is the current state of fine-tuning GPT-NeoX:

  • Fine-tune the 20B model from the full 268GB checkpoint with the same optimizer settings as during training: should work
  • Fine-tune the 20B model from the slim 39GB checkpoint with a stage 1-3 optimizer (but not stage 0): I didn't test this, but the issues I know about wouldn't apply
  • Fine-tune the 20B model with the stage-0 optimizer and partly frozen parameters: you'll hit the issues above
  • Fine-tune a smaller model with the stage-0 optimizer: you'll hit the issues above

If there is interest in supporting the latter two cases, it's probably worth working through the intended usage, writing some documentation, and putting in any fixes needed.

igor0 commented Feb 21, 2022

Let me add a bit more detail on this:

Note that DeeperSpeed added a workaround for this (as a part of soft prompt tuning work), but it isn't guaranteed to work.

DeeperSpeed wraps this for loop in FP16_Optimizer.load_state_dict in a try-except:

try:
    for current, saved in zip(
        self.fp32_groups_flat, state_dict["fp32_groups_flat"]
    ):
        current.data.copy_(saved.data)
except RuntimeError as error:
    print(error)
    print(
        "Error in loading fp32 model parameters!\nRefreshing fp32 model params from the model's fp16 params instead. This may incur some precision loss."
    )
    self.refresh_fp32_params()

The code zips together self.fp32_groups_flat and state_dict["fp32_groups_flat"]. The former list is based on what is currently unfrozen, while the latter is based on what was unfrozen when the checkpoint was saved.

A few things can happen:

  1. The two lists match perfectly. The 32-bit parameter values get restored from checkpoint["optimizer"]["fp32_groups_flat"] and all is good.
  2. The two lists mismatch horribly. current.data.copy_(saved.data) throws an exception, and the code takes the fallback path of rebuilding the 32-bit values from the 16-bit values in the checkpoint. That's also OK.
  3. The two lists mismatch, but in a way that doesn't trigger the exception. In particular, zip() over two lists of unequal length silently stops at the end of the shorter list, so some parameters will not be loaded at all. It's also possible that the tensor shapes match but correspond to different parameters.

The problem is case (3): the DeeperSpeed workaround is not guaranteed to handle a mismatch between the parameters that were unfrozen in the checkpoint and the parameters that are unfrozen at load time. The intent here seems to have been to improve on upstream DeepSpeed's behavior with a minimal change, but it's important to note that the workaround may not cover all cases.
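A tiny example of case (3), using plain tensors in place of real optimizer state:

import torch

# Two flat groups were unfrozen when the checkpoint was saved...
checkpoint_groups = [torch.zeros(10), torch.ones(10)]
# ...but only one group is unfrozen in the current fine-tuning run.
current_groups = [torch.full((10,), 5.0)]

# zip() silently stops at the shorter list, so no RuntimeError is raised and
# the except branch of the workaround never runs.
for current, saved in zip(current_groups, checkpoint_groups):
    current.data.copy_(saved.data)

print(current_groups[0][:3])  # tensor([0., 0., 0.]) -- loaded without error,
                              # but possibly from a different parameter's group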
