Ok, this PR is a compendium of lots of stuff.
The main change is that the model branch where pp=0 no longer exists; pipe parallel now defaults to 1, so we only have to maintain a single model. I verified that both models achieved exactly the same loss here, and in fact pp=1 (GPT2ModelPipe) is slightly faster, for some reason, so there's no reason that I can see to keep the other model branch around.
Fixes the wandb group name logging (we ended up adding multiple UUIDs to the end of the group name).
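For illustration only, here is one way to keep a group name from picking up a second suffix. This is a minimal sketch, not the code in this PR: `make_group_name` and its `existing` parameter are hypothetical names, and the 8-character suffix length is an arbitrary choice.

```python
import uuid


def make_group_name(base, existing=None):
    """Return a wandb-style group name with exactly one unique suffix.

    Hypothetical helper: if a group name was already generated (`existing`),
    reuse it unchanged instead of appending another id.
    """
    if existing:
        return existing
    # Append a single short random id to the base name.
    return f"{base}_{uuid.uuid4().hex[:8]}"
```

Calling it a second time with the first result passed as `existing` returns the same string, so repeated setup code can't stack up ids.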
Removes a lot of dead code from:
- megatron/mpu/random.py -> handled activation checkpointing. With the pipeline model, this is all handled by deepspeed.
- megatron/fp16 stuff -> handled loss scaling / fp16 conversion. Again, all handled by deepspeed, so it can be safely removed.
- megatron/mpu/grads.py -> handled gradient clipping, also now handled by deepspeed.
- megatron/memory.py -> wasn't used anywhere.
Makes the apex dependency optional (it's still better to have it, as apex's FusedAdam is slightly faster than deepspeed's version).
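A rough sketch of the usual optional-dependency pattern, in case it helps when reviewing: try the preferred import first and fall back if it's missing. This is an assumption about the general approach, not a copy of what this PR does. `first_available` and `ADAM_CANDIDATES` are hypothetical names; the two import paths are the standard locations of FusedAdam in apex and deepspeed.

```python
import importlib

# (module, attribute) pairs in order of preference.
ADAM_CANDIDATES = [
    ("apex.optimizers", "FusedAdam"),     # optional; slightly faster per this PR
    ("deepspeed.ops.adam", "FusedAdam"),  # fallback when apex is not installed
]


def first_available(candidates):
    """Return the first attribute that can be imported, or None if none can."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Package not installed; try the next candidate.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    return None
```

`first_available(ADAM_CANDIDATES)` would then give apex's optimizer class when apex is installed, deepspeed's otherwise, and `None` if neither can be imported.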
Some mild reorganization of the layout of megatron/model to make it easier to work with / navigate
Renamed megatron/arguments/megatron_args.py to megatron/arguments/neox_args.py (in the future I think we should rename the whole package to neox - I think we're now sufficiently different lol, but couldn't be arsed to go through the hassle rn)
Updated requirements (separated optional ones from mandatory ones) and updated the Dockerfile