Fine-tuning GPT-NeoX doesn't work (for many scenarios) with the 16-bit stage-0 optimizer #568
Let me add a bit more detail on this:

DeeperSpeed wraps this for loop in `FP16_Optimizer.load_state_dict` with a try-except:

```python
try:
    for current, saved in zip(
        self.fp32_groups_flat, state_dict["fp32_groups_flat"]
    ):
        current.data.copy_(saved.data)
except RuntimeError as error:
    print(error)
    print(
        "Error in loading fp32 model parameters!\nRefreshing fp32 model params from the model's fp16 params instead. This may incur some precision loss."
    )
    self.refresh_fp32_params()
```

The code zips together `fp32_groups_flat` and `state_dict["fp32_groups_flat"]`. The former list is based on what's currently unfrozen, while the latter is based on what was unfrozen in the checkpoint. A few things can happen:

1. The same parameters are unfrozen now as when the checkpoint was saved: the copy restores the correct fp32 master weights.
2. Different parameters are unfrozen and the flattened tensors have different shapes: `copy_` raises a `RuntimeError`, the except branch fires, and the fp32 params are refreshed from the fp16 params (with some precision loss).
3. Different parameters are unfrozen, but the flattened tensors happen to have matching shapes: the copy silently succeeds and loads fp32 master weights belonging to the wrong parameters.

The problem is in case (3): the DeeperSpeed workaround is not guaranteed to handle the mismatch between the parameters unfrozen in the checkpoint and the parameters unfrozen at checkpoint load time. It looks like the intent here was to improve behavior over upstream DeepSpeed with a minimalistic change, but it's still important to note that the workaround may not work in all cases.
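To make cases (2) and (3) concrete, here is a small self-contained sketch (plain PyTorch tensors standing in for the flattened groups; not actual DeepSpeed code, and the shapes are hypothetical):

```python
import torch

# Stand-ins for self.fp32_groups_flat (current run) and
# state_dict["fp32_groups_flat"] (checkpoint).

# Case (2): shapes differ -> copy_ raises RuntimeError, so the
# DeeperSpeed try-except catches it and refreshes from fp16 params.
current = [torch.zeros(10)]
saved = [torch.ones(12)]
try:
    for c, s in zip(current, saved):
        c.data.copy_(s.data)
except RuntimeError as error:
    print("caught:", error)  # the workaround path

# Case (3): different parameters were unfrozen, but the flattened
# tensors happen to have the same shape -> copy_ succeeds silently,
# restoring master weights that belong to other parameters.
current = [torch.zeros(10)]
saved = [torch.ones(10)]  # same shape, wrong contents
for c, s in zip(current, saved):
    c.data.copy_(s.data)  # no error; silently wrong
```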
TL;DR: The non-ZeRO ("stage 0") optimizer in DeepSpeed makes fragile assumptions about the optimizer state in the checkpoint, even when the `finetune: true` configuration parameter is set. A mitigating factor is that the 20B model is too big to use with the stage-0 optimizer, at least if you are fine-tuning the entire model.

Detailed Explanation:
The issue is with the stage-0 optimizer (i.e., with ZeRO disabled). A mitigating factor is that the 20B model can't be fine-tuned with a stage-0 optimizer because there is no GPU in the world where it will fit.
You will hit these problems if you either:

- fine-tune from a checkpoint that has no optimizer state at all (e.g., the GPT-NeoX slim weights), or
- fine-tune with a different set of unfrozen parameters than the one the checkpoint was saved with.
If you use the ZeRO optimizer (stages 1, 2, or 3) for fine-tuning, that's a different optimizer codebase, so the issues I'll talk about probably don't apply. With the context set, let's look into the issues.
Loading Stage-0 Optimizer from Checkpoint
There are 3 types of state that the Stage-0 optimizer (`FP16_Optimizer`) loads from the checkpoint:

1. The inner optimizer's state (e.g., the Adam momentum and variance buffers).
2. The loss-scaler state (dynamic loss scale, current scale, overflow tracking).
3. The flattened fp32 master copies of the parameters (`fp32_groups_flat`).

Here is how the `finetune` config parameter impacts this:

- With `finetune: false`, all 3 types of state will be loaded from the checkpoint.
- With `finetune: true`, (1) will not be loaded from the checkpoint, but (2) and (3) will.

This is surprising behavior. I'd expect that when fine-tuning, I shouldn't need any optimizer state in my checkpoint. My checkpoint may not have any optimizer state (e.g., GPT-NeoX slim weights) or may have state incompatible with the stage-0 optimizer.
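For reference, the relevant logic in DeepSpeed's `FP16_Optimizer.load_state_dict` has roughly this shape (a paraphrased sketch, not the verbatim source; the `load_optimizer_states` flag is what `finetune: true` ends up disabling):

```python
# Paraphrased sketch of FP16_Optimizer.load_state_dict (not verbatim).
def load_state_dict(self, state_dict, load_optimizer_states=True):
    # (2) Loss-scaler state: restored unconditionally, even when fine-tuning.
    self.dynamic_loss_scale = state_dict["dynamic_loss_scale"]
    self.cur_scale = state_dict["cur_scale"]

    # (1) Inner optimizer state (e.g., Adam moments): gated by the flag,
    # so finetune: true skips only this part.
    if load_optimizer_states:
        self.optimizer.load_state_dict(state_dict["optimizer_state_dict"])

    # (3) fp32 master weights: also restored unconditionally. This is the
    # copy loop discussed above, with all its fragile assumptions.
    for current, saved in zip(
        self.fp32_groups_flat, state_dict["fp32_groups_flat"]
    ):
        current.data.copy_(saved.data)
```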
What's the Problem?
There are two problems that I've seen:

1. If the checkpoint has no optimizer state at all, or state whose shapes don't line up, loading crashes.
2. If the checkpoint's state corresponds to a different set of unfrozen parameters but the shapes happen to match, loading silently restores the wrong fp32 master weights.

The net result is that the optimizer state in the checkpoint needs to be "right" or else the optimizer will either crash (first case) or behave incorrectly (second case), even when `finetune: true` is set.
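The first problem is easy to reproduce in miniature. As a hypothetical illustration (the dict layout and key names here are stand-ins, not the actual DeepSpeed checkpoint format): a loader that unconditionally reads optimizer-state keys crashes as soon as the checkpoint omits them, before any workaround can run.

```python
# Hypothetical illustration of the crash mode. The dict layout and key
# names are stand-ins, not the actual DeepSpeed checkpoint format.
checkpoint = {"module": {"weight": [0.0, 1.0]}}  # weights only (slim weights)

# An unconditional read of optimizer state fails immediately:
saved_fp32 = checkpoint["fp32_groups_flat"]  # KeyError: 'fp32_groups_flat'
```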
Does Upstream DeepSpeed Have This Problem?
As far as I can tell, DeepSpeed has these same issues, even on the latest version.
What's the Solution?
I have a fix that works for me: basically, I have the FP16_Optimizer fully reinitialize itself in the fine-tuning case. But the DeepSpeed code is convoluted enough that it's conceivable that the fix breaks some other crazy scenario. Also, the fix is a deviation from the DeepSpeed codebase.
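I won't claim this sketch is the exact patch, but the shape of the fix is roughly the following, under the assumption that `refresh_fp32_params()` plus freshly initialized loss-scaler and inner-optimizer state are acceptable starting points for fine-tuning:

```python
# Sketch of the fix, not the exact patch: in the fine-tuning case,
# ignore the checkpoint's stage-0 optimizer state entirely.
def load_state_dict(self, state_dict, load_optimizer_states=True):
    if not load_optimizer_states:  # the finetune: true path
        # Rebuild the fp32 master weights from the model's current fp16
        # params instead of copying possibly mismatched checkpoint state.
        self.refresh_fp32_params()
        # Keep the freshly initialized loss-scaler and inner-optimizer
        # state rather than reading anything from the checkpoint.
        return

    # Otherwise, restore all 3 types of state from the checkpoint as before.
    ...
```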
So, at this point, this is the state of fine-tuning GPT-NeoX, from my understanding:

- Fine-tuning with ZeRO (stages 1-3) is a different optimizer codebase, so these issues probably don't apply.
- Fine-tuning with the stage-0 optimizer from a checkpoint without optimizer state (e.g., slim weights) is broken.
- Fine-tuning with the stage-0 optimizer with a different set of unfrozen parameters than the checkpoint is broken.
If there is interest in supporting the latter two cases, it's probably worth working through the intended usage, writing some documentation, and putting in any fixes needed.