Clean up Neox configuration #132
Conversation
Created a draft implementation. Example usage:
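```
./deepy.py pretrain_gpt2.py -d configs ds_pretrain_gpt2.yml eleutherai_cluster.yaml
```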
The parameters in the yaml files are automatically separated into the DS runner, Megatron and DS config file parts. They are then converted into the "old" format and provided to the scripts. I wanted to do it this way so that I made as few changes as possible to the megatron codebase - making it easier in the future to merge upstream changes. Some parameters are also automatically derived, such as the megatron "fp16" param from the DS runner "fp16" param.
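(To make the separation concrete, a rough sketch of how keys in a single yml get routed. The grouping is my reading of the description above - the Megatron-style keys are borrowed from the diff below, and the fp16 block follows DeepSpeed's usual schema rather than anything specified in this PR:)

```
# routed to Megatron (converted back into the "old" command-line args)
"num-attention-heads": "16",
"seq-length": "1024",
"max-position-embeddings": "1024",

# routed to the DeepSpeed config file
"train_micro_batch_size_per_gpu": 4,
"gradient_clipping": 1.0,

# set once; the megatron fp16 flag is derived from it automatically
"fp16": {
    "enabled": true
}
```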
configs/ds_pretrain_gpt2.yml
"num-attention-heads":"16", | ||
"seq-length":"1024", | ||
"max-position-embeddings":"1024", | ||
"batch-size":"9", |
Why 9?
It’s taken from the corresponding examples script: examples/ds_pretrain_gpt2.sh
It was mentioned in megatron keys but not defined.
The example above exactly replicates the parameters used in examples/ds_pretrain_gpt2.sh - try it for yourself. It is not intended to show all possible configurations. I can create such a config later.
@ShivanshuPurohit I am going to undo your commit as pipe-parallel-size isn't used in the original example.
I see. No problem. I thought all the keys were supposed to be initialized.
Okay, deduplicated the remaining params. ZeRO parameters should be set deepspeed style, like so:
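(A minimal sketch of what a deepspeed-style ZeRO block looks like; the stage and flags below are illustrative placeholders taken from DeepSpeed's standard zero_optimization schema, not values from this PR:)

```
"zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "overlap_comm": false,
    "contiguous_gradients": false
}
```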
same with optimizer params, like so:
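(Likewise a sketch of the deepspeed-style optimizer block; the lr and eps values are placeholders, while the betas match the config shown later in this diff and "adam" is one of the options listed just below:)

```
"optimizer": {
    "type": "adam",
    "params": {
        "lr": 0.00015,
        "betas": [0.9, 0.95],
        "eps": 1.0e-8
    }
}
```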
(options are "adam", "onebitadam", "cpu_adam", "cpu_torch_adam") gradient clipping should be set with "gradient_clipping" instead of clip-grads and i think that's about it. Josh and I figured out the batch size related problems - so when doing model parallel Should be ready to merge now imo - maybe would be good to get solid documentation first though, to avoid confusion |
@sdtblck looks a lot better! Do you think there are other params that would be worth bundling together, deepspeed-style? I mostly have the checkpointing args in mind here, I think.
@StellaAthena i think whether we do checkpointing args deepspeed style or not is inconsequential, really. But i can set it up that way if you think it'd be more user friendly.
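(If the checkpointing args were bundled deepspeed-style, they would presumably look something like DeepSpeed's own activation_checkpointing block - shown here only to illustrate the shape, not as something this PR implements:)

```
"activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false
}
```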
"train_batch_size": 224, | ||
"train_micro_batch_size_per_gpu": 4, | ||
"steps_per_print": 10, | ||
"optimizer": { |
this doesn't actually do anything because the optimizer is initialized within the megatron code. I believe there is an optimizer arg in megatron/arguments.py
- the only time we need the 'optimizer' in the deepspeed config is when we're using onebitadam, since in that case the optimizer has to be initialized within the deepspeed code, because reasons.
We should find a cleaner way to do this
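(For the onebitadam case the deepspeed config would carry the optimizer block itself. A rough sketch - the OneBitAdam-specific fields come from DeepSpeed's documented config schema and all values are illustrative:)

```
"optimizer": {
    "type": "OneBitAdam",
    "params": {
        "lr": 0.00015,
        "betas": [0.9, 0.95],
        "eps": 1.0e-8,
        "freeze_step": 23000,
        "cuda_aware": false
    }
}
```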
"betas": [0.9, 0.95] | ||
} | ||
}, | ||
"gradient_clipping": 1.0, |
duplicate of clip-grad
Clean up neox configuration so config files can be used instead of a mishmash of files, command line args and environment variables.
Aim:
Nice to haves:
Todo:
(micro_batch_per_gpu*GAS*n_gpus)
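(For reference on the batch-size relation the fragment above refers to: DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * n_gpus. In the sketch below only the 224 and 4 come from the diff above; the GAS and GPU counts are one illustrative combination that satisfies the relation:)

```
# train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * n_gpus
# e.g. 224 = 4 * 7 * 8
"train_batch_size": 224,
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 7    # assuming 8 GPUs; any GAS * n_gpus = 56 is consistent
```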