Fixed AnnealingLR Class and Cosine Decay Schedule #1008
Conversation
Important context from @kshitijkg on Discord:
> Overall, this is not a bug, it's intended behavior. The schedule you propose:
>
> […]
>
> is simply not cosine learning rate decay, because the rate has been reduced by […]. If you have evidence this schedule performs better and would like it to be introduced as an alternative to cosine decay, that's fine, but it shouldn't replace cosine learning rate decay. As for the […]
Hi Quentin! Thank you for the information. I was curious whether that's what is used generally; it might be worth doing an ablation to see what works well with typical datasets like LLaMA's and the Pile. Upon further investigation, I found that other repositories, like Megatron-LM (https://github.com/NVIDIA/Megatron-LM/blob/0609f27fe8376f17ab65c001d3d8f35cd8175950/megatron/optimizer_param_scheduler.py#L77C9-L77C9), MPT (https://github.com/mosaicml/composer/blob/cc35953ef374b9aad17938d4fdc08cfc2d09fc42/composer/optim/scheduler.py#L384), and PyTorch (https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html#torch.optim.lr_scheduler.CosineAnnealingLR), use the one I proposed above. I believe this is the same one used to train Chinchilla and Gopher as well. It's the one described in this paper: https://arxiv.org/pdf/1608.03983v5.pdf
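
For reference, here is a minimal sketch of the schedule those implementations share: linear warmup, then cosine decay from the peak rate down to a `min_lr` floor, per the SGDR formula η_t = η_min + ½(η_max − η_min)(1 + cos(πt/T)). The function name and parameter names are illustrative, not taken from this PR's diff:

```python
import math

def cosine_decay_lr(step, max_lr, min_lr, warmup_iters, total_iters):
    """Linear warmup followed by cosine decay, in the form used by
    Megatron-LM, composer, and torch's CosineAnnealingLR
    (Loshchilov & Hutter, arXiv:1608.03983).

    Hypothetical helper for illustration only.
    """
    if step < warmup_iters:
        # Linear warmup from 0 up to max_lr.
        return max_lr * step / warmup_iters
    if step >= total_iters:
        # Past the decay horizon, hold at the floor.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1].
    decay_ratio = (step - warmup_iters) / (total_iters - warmup_iters)
    # Cosine interpolation from max_lr down to min_lr.
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

The key property is that the rate anneals exactly to `min_lr` at `total_iters`, rather than to a scaled-down value.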
Ah, you're correct and I was mistaken. I'm happy with this change.
@StellaAthena and @haileyschoelkopf -- FYI |