Clean up Neox configuration #132

Merged: 94 commits merged into main on Feb 27, 2021
Conversation

@joshlk (Member) commented Feb 21, 2021

Clean up the NeoX configuration so that config files can be used instead of a mishmash of files, command-line args and environment variables.

Aim:

  • All parameters can be set using passed JSON files
  • No parameters are repeated
  • Modify megatron's codebase as little as possible to make it easier to merge upstream megatron changes in the future.

Nice to haves:

Todo:

  • Convert all examples and configs to new configuration
  • Create config documentation with all possible parameters
  • Separate configs into: model and system
  • Cast numbers to numbers in JSON (suggested by @StellaAthena)
  • Calculate batch size from other params (micro_batch_per_gpu*GAS*n_gpus); see the sketch after this list
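
As a rough sketch of that last batch-size item (the names here are illustrative, not the final config keys):

    # Hypothetical helper: derive the global batch size from the per-GPU micro
    # batch, gradient accumulation steps (GAS) and the number of GPUs.
    def derive_train_batch_size(micro_batch_per_gpu: int,
                                gradient_accumulation_steps: int,
                                n_gpus: int) -> int:
        return micro_batch_per_gpu * gradient_accumulation_steps * n_gpus

    # Illustrative numbers only: 4 * 8 * 16 == 512
    assert derive_train_batch_size(4, 8, 16) == 512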

@StellaAthena linked an issue on Feb 22, 2021 that may be closed by this pull request
@joshlk (Member, Author) commented Feb 22, 2021

Created a draft implementation. Example usage:

./deepy.py pretrain_gpt2.py -d configs ds_pretrain_gpt2.yml eleutherai_cluster.yml
  • configs is the directory the config files are in
  • ds_pretrain_gpt2.yml and eleutherai_cluster.yml are two JSON-ish files (OK, they are YAML, but it's basically JSON with comments). deepy takes all the config files and merges them (so we can separate out the parameters however we want). Currently eleutherai_cluster.yml has the system settings and ds_pretrain_gpt2.yml is to do with the model.

The parameters in the YAML files are automatically separated into the DS runner, Megatron and DS config file parts. They are then converted into the "old" format and provided to the scripts. I wanted to do it this way so that I made as few changes as possible to the Megatron codebase, making it easier to merge upstream changes in the future.

Some parameters are also automatically derived, such as the megatron "fp16" param from the DS runner "fp16" param.
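
A minimal sketch of that merge-then-derive flow, assuming PyYAML and hypothetical function names (this is not the actual deepy.py code):

    import yaml
    from pathlib import Path

    def load_and_merge_configs(config_dir, filenames):
        """Merge several YAML ("JSON with comments") files into one dict.
        Later files override earlier ones on key collisions."""
        merged = {}
        for name in filenames:
            with open(Path(config_dir) / name) as f:
                merged.update(yaml.safe_load(f) or {})
        return merged

    def derive_megatron_args(conf):
        """Example of an automatically derived parameter: Megatron's fp16 flag
        follows the DeepSpeed-style {"fp16": {"enabled": true}} section."""
        derived = {}
        if conf.get("fp16", {}).get("enabled", False):
            derived["fp16"] = True
        return derived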

"num-attention-heads":"16",
"seq-length":"1024",
"max-position-embeddings":"1024",
"batch-size":"9",
A reviewer (Contributor) commented on the config snippet above:

Why 9?

@joshlk (Member, Author) replied Feb 22, 2021:

It’s taken from the corresponding examples script: examples/ds_pretrain_gpt2.sh

@joshlk (Member, Author) commented Feb 23, 2021

./deepy.py pretrain_gpt2.py -d configs ds_pretrain_gpt2.yml eleutherai_cluster.yml

The example above exactly replicates the parameters used in examples/ds_pretrain_gpt2.sh - try it for yourself. It is not intended to show all possible configurations. I can create such a config later. @ShivanshuPurohit I am going to undo your commit as pipe-parallel-size isn't used in the original example.

@ShivanshuPurohit (Contributor) commented Feb 23, 2021 via email

A review thread on configs/ds_pretrain_gpt2.yml was marked outdated and resolved.
@sdtblck (Contributor) commented Feb 26, 2021

Okay, deduplicated the remaining params.

ZeRO parameters should be set DeepSpeed-style, like so:

   "zero_optimization": {
    "stage": 0,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": False,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": False,
    "cpu_offload": False
},

Same with the optimizer params, like so:

     "type": "Adam",
     "params": {
       "lr": 0.00015,
       "max_grad_norm": 1.0,
       "betas": [0.9, 0.95]
     }
   }

(options are "adam", "onebitadam", "cpu_adam", "cpu_torch_adam")

Gradient clipping should be set with "gradient_clipping" instead of clip-grads.

And I think that's about it.

Josh and I figured out the batch-size-related problems: when doing model parallelism, the world_size according to DeepSpeed is actually dp_parallel_group_size.
So if you have parallelism (either pipeline or model) of 2 and a world size of 16, the DeepSpeed engine sets the world size to 8.
Before, we weren't setting all three params (train_batch_size, gradient accumulation steps, train_micro_batch_size_per_gpu) in conjunction, so we never ran into the error.
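
A hedged sketch of that relationship (illustrative only, not DeepSpeed's actual internals):

    # DeepSpeed's effective world size for batch-size accounting is the
    # data-parallel group size: total GPUs / (model parallel * pipeline parallel).
    def check_batch_sizes(train_batch_size, micro_batch_per_gpu,
                          gradient_accumulation_steps, world_size,
                          model_parallel_size=1, pipe_parallel_size=1):
        dp_world_size = world_size // (model_parallel_size * pipe_parallel_size)
        expected = micro_batch_per_gpu * gradient_accumulation_steps * dp_world_size
        if train_batch_size != expected:
            raise ValueError(f"train_batch_size {train_batch_size} != "
                             f"{micro_batch_per_gpu} * {gradient_accumulation_steps}"
                             f" * {dp_world_size}")

    # e.g. model parallelism of 2 on 16 GPUs leaves a data-parallel size of 8,
    # so 4 * 2 * 8 == 64 must match train_batch_size.
    check_batch_sizes(train_batch_size=64, micro_batch_per_gpu=4,
                      gradient_accumulation_steps=2, world_size=16,
                      model_parallel_size=2)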

Should be ready to merge now, IMO. Maybe it would be good to get solid documentation first though, to avoid confusion.

@StellaAthena (Member) commented:

@sdtblck looks a lot better! Do you think there are other params that would be worth bundling together, deepspeed-style? I mostly have the checkpointing args in mind here, I think.

@sdtblck (Contributor) commented Feb 27, 2021

@StellaAthena I think whether we do the checkpointing args DeepSpeed-style or not is inconsequential, really. But I can set it up that way if you think it'd be more user-friendly.
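
For illustration, one possible DeepSpeed-style bundling of the checkpointing args (a hypothetical shape written as a Python dict, not what was merged; the keys mirror DeepSpeed's activation_checkpointing config section):

    # Hypothetical bundled form; values are placeholders.
    checkpointing_config = {
        "activation_checkpointing": {
            "partition_activations": False,
            "contiguous_memory_optimization": False,
            "cpu_checkpointing": False,
            "number_checkpoints": None,
            "synchronize_checkpoint_boundary": False,
            "profile": False,
        }
    }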

"train_batch_size": 224,
"train_micro_batch_size_per_gpu": 4,
"steps_per_print": 10,
"optimizer": {
A reviewer (Contributor) commented on the config snippet above:

This doesn't actually do anything, because the optimizer is initialized within the Megatron code. I believe there is an optimizer arg in megatron/arguments.py. The only time we need the 'optimizer' section in the DeepSpeed config is when we're using onebitadam, because in that case the optimizer has to be initialized within the DeepSpeed code.

We should find a cleaner way to do this.
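
A rough sketch of one possible cleaner split (hypothetical helper, not the code in this PR): build the optimizer on the Megatron side unless it must live inside DeepSpeed, which is currently only the onebitadam case.

    import torch

    def build_optimizer(config, model_parameters):
        """Hypothetical: construct the optimizer in Megatron code, except for
        OneBitAdam, which has to be constructed inside DeepSpeed itself."""
        opt = config["optimizer"]
        if opt["type"].lower() == "onebitadam":
            # Leave the "optimizer" section in the DeepSpeed config and let
            # deepspeed.initialize build the optimizer.
            return None
        params = opt.get("params", {})
        return torch.optim.Adam(model_parameters,
                                lr=params.get("lr", 1e-4),
                                betas=tuple(params.get("betas", (0.9, 0.999))))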

"betas": [0.9, 0.95]
}
},
"gradient_clipping": 1.0,
A reviewer (Contributor) commented on the config snippet above:

duplicate of clip-grad

@sdtblck marked this pull request as ready for review on February 27, 2021 at 20:35
@sdtblck requested a review from a team as a code owner on February 27, 2021 at 20:35
@sdtblck merged commit 0df36b2 into main on Feb 27, 2021
@sdtblck deleted the config_monster branch on February 27, 2021 at 20:35
Merging this pull request may close the linked issue: Fix our configs