
Fixed AnnealingLR Class and Cosine Decay Schedule #1008

Merged: 2 commits into main from fix_cosine, Aug 7, 2023

Conversation

kshitijkg (Contributor)

No description provided.

@CLAassistant commented Aug 5, 2023

CLA assistant check
All committers have signed the CLA.

@Quentin-Anthony (Member)

Important context from @kshitijkg on Discord:

So I looked into the LR schedule. Here is what I found: the current cosine LR schedule in GPT-NeoX is designed to produce the LR curve we get; this is what was used for training Pythia and all the other models. It is not a bug introduced by us, but it does look like a bug in GPT-NeoX.

- First, the cosine schedule itself does not take min_lr into account; the only place min_lr is currently used is when returning the LR (return max(lr, self.min_lr)), so the LR simply never goes below min_lr.
- Second, there is this weird statement: num_iters_ = min(self.num_iters, self.end_iter - self.warmup_iter). It produces the weird artifact at the end of training, because during the last warmup_iter iterations num_iters_ is always clamped to self.end_iter - self.warmup_iter, which is fixed.

Current formula used for the cosine decay:
cur_iter = min(cur_iter, total_iter - warmup_iter)
if cur_iter < warmup_iter: do warmup
cur_iter = cur_iter - warmup_iter
lr = 0.5 * max_lr * (1 + cos(pi * cur_iter / total_iter))
lr = max(lr, min_lr)
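For reference, here is a minimal self-contained Python sketch of the behavior described by this pseudocode (the function name and arguments are illustrative, not the actual AnnealingLR attributes):

```python
import math

def current_cosine_lr(cur_iter, max_lr, min_lr, warmup_iter, total_iter):
    # The clamp discussed above: becomes constant once cur_iter passes total_iter - warmup_iter.
    num_iters = min(cur_iter, total_iter - warmup_iter)
    if warmup_iter > 0 and cur_iter <= warmup_iter:
        return max_lr * num_iters / warmup_iter        # linear warmup
    num_iters -= warmup_iter
    lr = 0.5 * max_lr * (1 + math.cos(math.pi * num_iters / total_iter))
    return max(lr, min_lr)                             # LR is floored at min_lr
```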

What I think we should use:
if cur_iter < warmup_iter: do warmup
cur_iter = cur_iter - warmup_iter
lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * cur_iter / total_iter))
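A matching sketch of the proposed schedule, i.e. standard cosine decay from max_lr down to min_lr (again, the names are illustrative only):

```python
import math

def proposed_cosine_lr(cur_iter, max_lr, min_lr, warmup_iter, total_iter):
    if warmup_iter > 0 and cur_iter <= warmup_iter:
        return max_lr * cur_iter / warmup_iter         # linear warmup
    decay_iter = cur_iter - warmup_iter
    # Anneals smoothly toward min_lr instead of being clipped at it.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * decay_iter / total_iter))
```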

The flat part we observe is caused by both the max function and the min clamp described below:

4 plots:
A) Current 
B) Current without Max
C) Current without Max and Min
D) Proposed

To keep the legend small:
When I say max, I mean: lr = max(lr, min_lr)
When I say min, I mean: cur_iter = min(cur_iter, total_iter - warmup_iter)

[Figure: the four LR-schedule plots (A–D) described above]

@Quentin-Anthony (Member)

Overall, this is not a bug; it's intended behavior. The schedule you propose:

lr = min_lr + 0.5*(max_lr-min_lr)*(1+cos(pi*cur_iter/total_iter))

is simply not cosine learning rate decay, because the decay has been reduced by min_lr, which will lead to significantly higher LRs near the end of training, as in your figure.

If you have evidence that this schedule performs better and would like it introduced as an alternative to cosine decay, that's fine, but it shouldn't replace cosine learning rate decay.

As for num_iters_ = min(self.num_iters, self.end_iter - self.warmup_iter), I agree it's strange. I would expect it to be num_iters_ = self.num_iters - self.warmup_iter, as in https://github.com/NVIDIA/Megatron-LM/blob/0609f27fe8376f17ab65c001d3d8f35cd8175950/megatron/optimizer_param_scheduler.py#L101C27-L101C37
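To make the difference concrete, here is a hypothetical side-by-side of the two decay-step computations (the values and names are illustrative, not taken from the actual scheduler):

```python
warmup_iter, end_iter = 1_000, 100_000  # illustrative values

def decay_steps_current(num_iters):
    # Clamps to a constant once num_iters exceeds end_iter - warmup_iter,
    # which is what produces the flat artifact at the end of training.
    return min(num_iters, end_iter - warmup_iter) - warmup_iter

def decay_steps_expected(num_iters):
    # Megatron-LM-style: simply offset the iteration count by the warmup length.
    return num_iters - warmup_iter

print(decay_steps_current(99_500), decay_steps_expected(99_500))  # 98000 98500
```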

@kshitijkg (Contributor, Author) commented Aug 7, 2023

Hi Quentin! Thank you for the information. I was curious whether that is what is generally used; it might be worth doing an ablation to see what works well with typical datasets like LLaMA and the Pile.

Upon further investigation, I found that other repositories such as Megatron-LM (https://github.com/NVIDIA/Megatron-LM/blob/0609f27fe8376f17ab65c001d3d8f35cd8175950/megatron/optimizer_param_scheduler.py#L77C9-L77C9), MPT (https://github.com/mosaicml/composer/blob/cc35953ef374b9aad17938d4fdc08cfc2d09fc42/composer/optim/scheduler.py#L384), and PyTorch (https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html#torch.optim.lr_scheduler.CosineAnnealingLR) use the schedule I proposed above. I think it is the same one used to train Chinchilla and Gopher as well. It's the schedule described in this paper: https://arxiv.org/pdf/1608.03983v5.pdf
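As a quick illustration, the PyTorch scheduler linked above exposes this same min_lr-aware form via eta_min; a minimal, hypothetical usage sketch (the model and optimizer here are placeholders):

```python
import torch

model = torch.nn.Linear(8, 8)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=6e-4)   # lr acts as the max LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=6e-5                    # eta_min plays the role of min_lr
)

for _ in range(1000):
    optimizer.step()
    scheduler.step()

# After T_max steps the LR has annealed to roughly eta_min rather than being clipped at it.
print(optimizer.param_groups[0]["lr"])
```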

@Quentin-Anthony (Member)

> Upon more investigation I found that other repositories like Megatron-LM, MPT, and PyTorch use the one I proposed above. [...]

Ah you're correct and I was mistaken. I'm happy with this change.

@kshitijkg marked this pull request as ready for review August 7, 2023 16:06
@kshitijkg requested a review from a team as a code owner August 7, 2023 16:06
@Quentin-Anthony (Member)

@StellaAthena and @haileyschoelkopf -- FYI

@Quentin-Anthony merged commit 009018e into main Aug 7, 2023
2 checks passed
@Quentin-Anthony deleted the fix_cosine branch August 7, 2023 16:10
kshitijkg added a commit to CERC-AAI/multimodal that referenced this pull request Aug 7, 2023
Fixed AnnealingLR Class and Cosine Decay Schedule (EleutherAI#1008)