
resume from checkpoint doesn't continue decaying the learning rate - it stays constant #1029

Closed

exnx opened this issue Sep 15, 2023 · 4 comments · Fixed by #1046

@exnx (Contributor) commented Sep 15, 2023

Hello, I am using a cosine decay learning rate (LR) scheduler. When I resume from a checkpoint (e.g., after a crash), I noticed the LR doesn't continue with the decay schedule; it stays constant. It does, however, load the correct learning rate from the last checkpoint - it just stays constant from then on.

What I do know is that the cosine schedule works fine when training from scratch, uninterrupted.

Also, when checking the LR within the AnnealingLR step function (by grabbing it from optimizer.param_groups), it does print out the correct LR, and it decays at each step.

However, outside the LR scheduler (e.g., in the training loop, just before the wandb logging), the LR is constant.

What I suspect is that the optimizer's param_groups and the lr_scheduler's param_groups are actually different objects, so the lr_scheduler is updating the LR in its param_groups, but the optimizer isn't actually using those param_groups.
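A quick way to test this is a check along the following lines (just a sketch, assuming model is the DeepSpeed engine used in the training loop and lr_scheduler is the AnnealingLR object), dropped in right before the wandb logging:

    # Sketch of a diagnostic, assuming `model` is the DeepSpeed engine and
    # `lr_scheduler` is the AnnealingLR instance.
    print("same optimizer object:", lr_scheduler.optimizer is model.optimizer)
    print("same param_groups object:", lr_scheduler.optimizer.param_groups is model.optimizer.param_groups)
    print("LR seen by the scheduler:", lr_scheduler.optimizer.param_groups[0]["lr"])
    print("LR seen by the engine:", model.optimizer.param_groups[0]["lr"])

If the identity checks print False, the scheduler is decaying a learning rate that the wrapped optimizer never reads.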

In my debugging, I noticed that deepspeed.initialize passes back a model, an lr_scheduler, and an optimizer. The optimizer is wrapped by deepspeed, but the lr_scheduler is not (it's just a regular AnnealingLR object). The deepspeed documentation says the lr_scheduler should be wrapped by deepspeed as well. I can't tell if that's the reason things break.
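For context, the initialization path looks roughly like this (a simplified sketch with placeholder names like raw_model, base_optimizer, annealing_lr, and ds_config, not the exact code from the repo):

    import deepspeed

    # Simplified sketch with placeholder names, not the repo's exact code.
    # deepspeed.initialize hands back a wrapped engine and optimizer, but the
    # scheduler it returns is still the plain AnnealingLR object, which can
    # keep a reference to the optimizer it was originally constructed with.
    model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=raw_model,            # the unwrapped nn.Module
        optimizer=base_optimizer,   # optimizer built before deepspeed wrapping
        lr_scheduler=annealing_lr,  # plain AnnealingLR instance
        config=ds_config,           # deepspeed config dict
    )

If the AnnealingLR object keeps pointing at base_optimizer (or at param_groups that are no longer the ones the engine reads after a checkpoint load), the decayed LR never reaches the optimizer that actually takes the steps.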

Is it possible deepspeed is not correctly initializing the lr_scheduler upon resuming?

Has anybody else run into this kind of issue? Thanks ahead of time!

Eric

@exnx exnx added the bug Something isn't working label Sep 15, 2023
@dashstander (Contributor)

This is definitely problematic; thanks for opening an issue, @exnx! I'm going to take a look at this.

@dashstander dashstander self-assigned this Sep 25, 2023
@exnx (Contributor, Author) commented Sep 26, 2023

We ended up making a local patch in our cloned repo: we forced the lr_scheduler to use the same optimizer as the model.

Specifically, in train.py, inside the setup_model_and_optimizer function, we set the following just before the return, and it worked for us:

    # need this for correct lr scheduling resume from ckpt
    lr_scheduler.optimizer = model.optimizer
    lr_scheduler.param_groups = model.optimizer.param_groups
    lr_scheduler.model = model
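
After this rebinding, the scheduler writes the decayed LR into the very same param_groups the deepspeed-wrapped optimizer reads. That's easy to sanity-check (a hypothetical check, not part of the patch):

    # Hypothetical sanity check after the patch: both sides should now
    # reference the exact same optimizer and param_groups objects.
    assert lr_scheduler.optimizer is model.optimizer
    assert lr_scheduler.param_groups is model.optimizer.param_groups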

@dashstander dashstander linked a pull request Sep 26, 2023 that will close this issue
@dashstander (Contributor)

@exnx just confirmed it works. Thanks so much! I made a PR with your patch (crediting you as a contributor). Unless you'd like to make the PR yourself, we just need a review from @Quentin-Anthony.

@exnx (Contributor, Author) commented Sep 26, 2023

Great! Worth asking, though: were you able to confirm there was a problem in the first place? I just want to make sure we didn't inadvertently change something on our end that caused it to break.

Feel free to make the change :)
