Latest DeepSpeed Support #663
Conversation
@Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?
Small stuff like logging format, some more detailed timers, and the forward hooks functionality in DeeperSpeed. I've already pushed the major features into upstream DeepSpeed. My thoughts are that most gpt-neox users don't need/rely on these features and can switch to the latest DeepSpeed.
The only thing I disagree with here is the detailed timers, which I and I think many others find quite useful. Would there be an easy way to make them part of GPT-NeoX as opposed to DeeperSpeed?
No, there's no way to bring those out of DeeperSpeed. Should we update the DeeperSpeed main branch to just be the DeepSpeed main branch, but with timers (throwing everything else away)? We'd have to update it periodically, but merges would be pretty simple that way. I think bringing these timers into upstream DeepSpeed would be a hard sell.
Who would do the selling, though?
Us to the DeepSpeed team. I'm saying it would be difficult to convince them that these timers are needed when they already have the FLOPs profiler and communication logger. |
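For context on the profiling tools mentioned above: upstream DeepSpeed's FLOPs profiler and coarse timers are enabled through the `ds_config` JSON. A minimal sketch (the field values here are illustrative, not recommendations):

```json
{
  "wall_clock_breakdown": true,
  "flops_profiler": {
    "enabled": true,
    "profile_step": 10,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true
  }
}
```

`wall_clock_breakdown` prints per-step forward/backward/optimizer timings, while `flops_profiler` reports per-module FLOPs and latency at the chosen step. These are coarser than DeeperSpeed's detailed timers, which is the gap being discussed.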
Signed-off-by: Dashiell Stander <[email protected]>
* WIP: Add support for Maximal Update Parametrization and Hyperparameter Transfer (mup)
* Update to use MuAdam and MuSGD, fix minor errors
* Fix more errors with arguments
* Fix error caused by not calling to_sequential on delta model
* Update NeoXArgs docs automatically
* Address PR feedback
* Fix minor error
* Update NeoXArgs docs automatically
* Revert small.yml config
* Update NeoXArgs docs automatically
* Reinitialize weights using mup's replacements after set_base_shapes is called
* Update NeoXArgs docs automatically
* Implement rescale parameters on the output layer, adjust learning rate based on width
* Update NeoXArgs docs automatically
* Remove debug prints
* Update NeoXArgs docs automatically
* Add preliminary support for coord check (WIP: not yet functional in this commit)
* Update NeoXArgs docs automatically
* Add untracked file from last commit
* Update NeoXArgs docs automatically
* Update for coord check plots
* Update NeoXArgs docs automatically
* Add all but one (and a half) of the new hyperparameters from the zero-shot hp transfer paper
* Update NeoXArgs docs automatically
* Add last mup HP
* Add mup readme file
* Update NeoXArgs docs automatically
* Revert changes to configs/small.yml
* Update NeoXArgs docs automatically
* Update README-MUP.md
* Update NeoXArgs docs automatically
* Clean up code for PR
* Update NeoXArgs docs automatically
* Make mup import optional
* Update NeoXArgs docs automatically
* Revert "Update NeoXArgs docs automatically" (reverts commit a7b97fd)
* Update NeoXArgs docs automatically
* Revert "Update NeoXArgs docs automatically" (reverts commit 8161a56)
* Update NeoXArgs docs automatically
* Add neox arg for mup delta model width scale
* Update NeoXArgs docs automatically

Co-authored-by: Nick Sarkauskas <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: Quentin-Anthony <[email protected]>
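As background for the μP commits above, the core idea of hyperparameter transfer is that a learning rate tuned on a small base model carries over to wider models if per-layer rates are rescaled by width. The sketch below is a simplified illustration of the Adam-style μP rule (hidden-matrix learning rates scale as base_width / width; embeddings and biases keep the base rate), not GPT-NeoX's or the mup library's actual implementation, and the function name and widths are hypothetical:

```python
def mup_adam_lr(base_lr, base_width, width, param_kind):
    """Simplified muP learning-rate rule for Adam-style optimizers.

    Hidden (width x width) weight matrices get lr scaled by
    base_width / width; input embeddings, biases, and layernorm
    gains keep the base lr. The mup library achieves a similar
    effect via per-parameter optimizer groups in MuAdam.
    """
    if param_kind == "hidden":
        return base_lr * base_width / width
    return base_lr


# Transferring a base lr tuned at width 256 to a width-1024 model:
print(mup_adam_lr(6e-4, 256, 1024, "hidden"))     # -> 0.00015
print(mup_adam_lr(6e-4, 256, 1024, "embedding"))  # -> 0.0006
```

Under this rule, the same base learning rate can be reused as the model is widened, which is what "adjust learning rate based on width" in the commit list refers to.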
…nt for using the SlurmRunner
Signed-off-by: Dashiell Stander <[email protected]>
@StellaAthena @ShivanshuPurohit
Note: we will not merge this unless we decide to get rid of DeeperSpeed.
This branch does away with DeeperSpeed entirely and is instead based on upstream DeepSpeed. Only minor gpt-neox changes are needed to make this work, but we lose some DeeperSpeed features. Feel free to use this branch unless your gpt-neox code explicitly relies on DeeperSpeed features.
Tested with: