resume from checkpoint doesn't continue decaying the learning rate - it stays constant #1029
Comments
This is definitely problematic, thanks for making an issue @exnx! I'm going to take a look at this.
We ended up making a local patch in our cloned repo. Specifically, inside
@exnx just confirmed it works. Thanks so much! I made a PR with your patch (and you as a contributor). Unless you'd like to make the PR yourself, we just need a review from @Quentin-Anthony.
Great! Worth asking, but were you able to confirm there was a problem in the first place? Just thought I'd check to make sure we didn't inadvertently change something to cause it to break. Feel free to make the change :)
Hello, I am using a cosine decay learning rate (LR) scheduler. When I resume from a checkpoint (e.g., if something crashed partway through training), I noticed the LR doesn't continue along the decay schedule; it stays constant. It does, however, load the correct learning rate from the last checkpoint - it just stays constant from then on.
What I do know is that the cosine schedule works fine if training from scratch and uninterrupted.
Also, when checking the LR within the AnnealingLR step function (by grabbing it from optimizer.param_groups), it does print out the correct LR, and it decays at each step. However, outside the LR scheduler, like in the training loop just before the wandb logging, the LR is constant.
What I suspect is that the optimizer's param_groups and the lr_scheduler's param_groups are actually different objects. So the lr_scheduler is updating the LR in its own param_groups, but the optimizer isn't actually using those param_groups.
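To make that hypothesis concrete, here is a minimal, hypothetical sketch (toy stand-ins, not the actual gpt-neox or DeepSpeed code) showing how a scheduler and an optimizer holding different param_groups lists would produce exactly this symptom:

```python
# Hypothetical minimal stand-ins (NOT the real gpt-neox / DeepSpeed code)
# illustrating the suspected bug: the scheduler mutates one param_groups
# list while the optimizer reads its LR from a different one.

class FakeOptimizer:
    """Stand-in for a torch optimizer: its LR lives in param_groups."""
    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]

class FakeScheduler:
    """Stand-in for AnnealingLR: decays the LR in the groups it holds."""
    def __init__(self, param_groups, lr, decay):
        self.param_groups = param_groups
        self.lr = lr
        self.decay = decay

    def step(self):
        self.lr -= self.decay
        for group in self.param_groups:
            group["lr"] = self.lr

# Correct wiring: scheduler and optimizer share the SAME list object.
opt = FakeOptimizer(lr=1.0)
sched = FakeScheduler(opt.param_groups, lr=1.0, decay=0.25)
sched.step()
print(opt.param_groups[0]["lr"])  # 0.75 -- the decay reaches the optimizer

# Broken wiring (what resuming might do): the optimizer is rebuilt, so its
# param_groups is a NEW list that the old scheduler no longer points at.
resumed_opt = FakeOptimizer(lr=0.75)      # LR restored from the checkpoint
sched.step()                              # still mutates only the old list
print(resumed_opt.param_groups[0]["lr"])  # 0.75 -- constant from now on
```

The decay schedule itself keeps advancing internally; the restored optimizer simply never sees the updates, which matches the observed behavior of a correct-but-frozen LR.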
In my debugging, I noticed that deepspeed.initialize passes back a model, lr_scheduler, and optimizer. The optimizer is wrapped by DeepSpeed, but the lr_scheduler is not (it's just a regular AnnealingLR object). The DeepSpeed documentation says the lr_scheduler should be wrapped with a DeepSpeed wrapper; I can't tell if that's the reason things break. Is it possible DeepSpeed is not correctly initializing the lr_scheduler upon resuming?
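A quick way to test whether the scheduler and the optimizer really hold distinct param_groups objects (an illustrative check with toy stand-ins, not the project's actual classes) is to compare object identity rather than equality:

```python
# Illustrative debugging check: verify the scheduler's param_groups and
# the optimizer's param_groups are the SAME list object. If they are
# copies, scheduler updates never reach the optimizer, which would
# explain a constant LR after resuming from a checkpoint.

def shares_param_groups(scheduler, optimizer):
    """True iff scheduler mutations are visible to the optimizer."""
    return scheduler.param_groups is optimizer.param_groups

class Holder:
    """Toy stand-in for either object; both just hold param_groups."""
    def __init__(self, param_groups):
        self.param_groups = param_groups

groups = [{"lr": 0.001}]
optimizer = Holder(groups)
good_scheduler = Holder(groups)                    # shares the list
bad_scheduler = Holder([dict(g) for g in groups])  # equal, but a copy

print(shares_param_groups(good_scheduler, optimizer))  # True
print(shares_param_groups(bad_scheduler, optimizer))   # False
# Note: `==` would be True in both cases; only `is` exposes the desync.
```

Running a check like this on the objects returned by deepspeed.initialize, before and after resuming, would confirm or rule out the shared-object hypothesis.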
Has anybody also run into this kind of issue? Thanks ahead of time!
Eric