resume from checkpoint doesn't continue decaying the learning rate - it stays constant #1029
Comments
This is definitely problematic, thanks for making an issue @exnx! I'm going to take a look at this.
We ended up making a local patch in our cloned repo. Specifically, inside
@exnx just confirmed it works. Thanks so much! I made a PR with your patch (and you as a contributor). Unless you'd like to make the PR yourself, we just need a review from @Quentin-Anthony.
Great! Worth asking, but were you able to confirm there was a problem in the first place? Just thought I'd check to make sure we didn't inadvertently change something to cause it to break. Feel free to make the change :)
Hello, I am using a cosine decay learning rate (LR) scheduler. When I resume from a checkpoint (e.g., if something crashed partway through training), I noticed the LR doesn't continue along the decay schedule; it stays constant. It does, however, load the correct learning rate from the last checkpoint - it just stays constant from then on.
What I do know is that the cosine schedule works fine if training from scratch and uninterrupted.
Also, when checking the LR within the AnnealingLR step function (by grabbing it from optimizer.param_groups), it does print out the correct LR, and it decays at each step. However, outside the LR scheduler, like in the training loop just before the wandb logging, the LR is constant.
What I suspect is that the optimizer's param_groups and the lr_scheduler's param_groups are actually different objects. So the lr_scheduler is updating the LR in its own param_groups, but the optimizer isn't actually using those param_groups.
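To make that hypothesis concrete, here is a minimal, hypothetical sketch (toy stand-ins, not the actual gpt-neox or DeepSpeed code) showing how a scheduler and an optimizer holding different param_groups lists would produce exactly this symptom:

```python
# Hypothetical minimal stand-ins (NOT the real gpt-neox / DeepSpeed code)
# illustrating the suspected bug: the scheduler mutates one param_groups
# list while the optimizer reads its LR from a different one.

class FakeOptimizer:
    """Stand-in for a torch optimizer: its LR lives in param_groups."""
    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]

class FakeScheduler:
    """Stand-in for AnnealingLR: decays the LR in the groups it holds."""
    def __init__(self, param_groups, lr, decay):
        self.param_groups = param_groups
        self.lr = lr
        self.decay = decay

    def step(self):
        self.lr -= self.decay
        for group in self.param_groups:
            group["lr"] = self.lr

# Correct wiring: scheduler and optimizer share the SAME list object.
opt = FakeOptimizer(lr=1.0)
sched = FakeScheduler(opt.param_groups, lr=1.0, decay=0.25)
sched.step()
print(opt.param_groups[0]["lr"])  # 0.75 -- the decay reaches the optimizer

# Broken wiring (what resuming might do): the optimizer is rebuilt, so its
# param_groups is a NEW list that the old scheduler no longer points at.
resumed_opt = FakeOptimizer(lr=0.75)      # LR restored from the checkpoint
sched.step()                              # still mutates only the old list
print(resumed_opt.param_groups[0]["lr"])  # 0.75 -- constant from now on
```

The decay schedule itself keeps advancing internally; the restored optimizer simply never sees the updates, which matches the observed behavior of a correct-but-frozen LR.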
In my debugging, I noticed that deepspeed.initialize passes back a model, lr_scheduler, and optimizer. The optimizer is wrapped by DeepSpeed, but the lr_scheduler is not (it's just a regular AnnealingLR object). The DeepSpeed documentation says the lr_scheduler should be wrapped with a DeepSpeed wrapper; I can't tell if that's the reason things break. Is it possible DeepSpeed is not correctly initializing the lr_scheduler upon resuming?
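A quick way to test whether the scheduler and the optimizer really hold distinct param_groups objects (an illustrative check with toy stand-ins, not the project's actual classes) is to compare object identity rather than equality:

```python
# Illustrative debugging check: verify the scheduler's param_groups and
# the optimizer's param_groups are the SAME list object. If they are
# copies, scheduler updates never reach the optimizer, which would
# explain a constant LR after resuming from a checkpoint.

def shares_param_groups(scheduler, optimizer):
    """True iff scheduler mutations are visible to the optimizer."""
    return scheduler.param_groups is optimizer.param_groups

class Holder:
    """Toy stand-in for either object; both just hold param_groups."""
    def __init__(self, param_groups):
        self.param_groups = param_groups

groups = [{"lr": 0.001}]
optimizer = Holder(groups)
good_scheduler = Holder(groups)                    # shares the list
bad_scheduler = Holder([dict(g) for g in groups])  # equal, but a copy

print(shares_param_groups(good_scheduler, optimizer))  # True
print(shares_param_groups(bad_scheduler, optimizer))   # False
# Note: `==` would be True in both cases; only `is` exposes the desync.
```

Running a check like this on the objects returned by deepspeed.initialize, before and after resuming, would confirm or rule out the shared-object hypothesis.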
Has anybody also run into this kind of issue? Thanks ahead of time!
Eric