Latest DeepSpeed Support #663

Merged
StellaAthena merged 33 commits into main from deepspeed_main on Mar 9, 2023

Conversation

Quentin-Anthony (Member) commented on Sep 2, 2022

@StellaAthena @ShivanshuPurohit

Note: we will not merge this unless we decide to get rid of DeeperSpeed

This branch completely does away with DeeperSpeed and is instead based on upstream DeepSpeed. It doesn't take many gpt-neox changes to do this, but we do lose some DeeperSpeed features. Feel free to use this branch unless your gpt-neox code explicitly relies on DeeperSpeed features.
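
For context on why the gpt-neox changes are small: both DeeperSpeed and upstream DeepSpeed install as the `deepspeed` package and expose the same `deepspeed.initialize` entry point, so the training-loop side barely changes. A minimal sketch of that shared interface, with a placeholder model and config rather than gpt-neox's actual call site:

```python
# Minimal sketch (not the actual gpt-neox call site): the initialize() entry
# point looks the same whether the installed `deepspeed` package is upstream
# DeepSpeed or the DeeperSpeed fork, which is why the swap touches so little
# gpt-neox code. The model and config below are illustrative placeholders.
import torch
import deepspeed

model = torch.nn.Linear(512, 512)  # stand-in for the NeoX model

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},
}

# `config` accepts a dict in recent DeepSpeed releases (older ones took a path).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```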

Tested with:

Quentin-Anthony marked this pull request as draft on September 2, 2022
StellaAthena (Member) commented:

@Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?

Quentin-Anthony (Member, Author) commented:

> @Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?

Small stuff like logging format, some more detailed timers, and the forward hooks functionality in DeeperSpeed. I've already pushed the major features into upstream DeepSpeed.

My thoughts are that most gpt-neox users don't need/rely on these features and can switch to the latest DeepSpeed.

StellaAthena (Member) commented:

> > @Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?
>
> Small stuff like logging format, some more detailed timers, and the forward hooks functionality in DeeperSpeed. I've already pushed the major features into upstream DeepSpeed.
>
> My thoughts are that most gpt-neox users don't need/rely on these features and can switch to the latest DeepSpeed.

The only thing I disagree with here is the detailed timers, which I (and I think many others) find quite useful. Would there be an easy way to make them part of GPT-NeoX as opposed to DeeperSpeed?

Quentin-Anthony (Member, Author) commented:

> > > @Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?
> >
> > Small stuff like logging format, some more detailed timers, and the forward hooks functionality in DeeperSpeed. I've already pushed the major features into upstream DeepSpeed.
> >
> > My thoughts are that most gpt-neox users don't need/rely on these features and can switch to the latest DeepSpeed.
>
> The only thing I disagree with here is the detailed timers, which I (and I think many others) find quite useful. Would there be an easy way to make them part of GPT-NeoX as opposed to DeeperSpeed?

No, there's no way to bring those out of DeeperSpeed. Should we update the DeeperSpeed main branch to just be the DeepSpeed main branch plus the timers (throwing everything else away)? We'd have to update it periodically, but merges would be pretty simple that way. I think bringing these timers into upstream DeepSpeed would be a hard sell.

jamesthesnake commented:

Who would do the selling though?

Quentin-Anthony (Member, Author) commented:

> Who would do the selling though?

Us to the DeepSpeed team. I'm saying it would be difficult to convince them that these timers are needed when they already have the FLOPs profiler and communication logger.
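
For reference, both of those upstream tools are driven by the DeepSpeed config rather than code changes. A rough sketch of the relevant config sections (key names as documented for recent DeepSpeed releases; the values are illustrative, not gpt-neox defaults, and are worth checking against whichever DeepSpeed version is installed):

```python
# Sketch of the DeepSpeed config sections that turn on the FLOPs profiler and
# the communication logger mentioned above. Key names follow the DeepSpeed
# docs; the values here are illustrative placeholders.
ds_config_extras = {
    "flops_profiler": {
        "enabled": True,
        "profile_step": 1,    # which training step to profile
        "module_depth": -1,   # -1 = report every module depth
        "top_modules": 1,     # how many top modules to show per depth
        "detailed": True,
    },
    "comms_logger": {
        "enabled": True,
        "verbose": False,     # summary only, no per-op log lines
        "prof_all": True,     # profile all communication ops
        "debug": False,
    },
}
```

In gpt-neox these sections would end up in the DeepSpeed runtime config that gets built from the YAML configs.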

Quentin-Anthony mentioned this pull request on Sep 25, 2022
Quentin-Anthony and others added 5 commits December 8, 2022 00:39
* WIP: Add support for Maximal Update Parametrization and Hyperparameter Transfer (mup)

* Update to use MuAdam and MuSGD, fix minor errors

* Fix more errors with arguments

* Fix error caused by not calling to_sequential on delta model

* Update NeoXArgs docs automatically

* Address PR feedback

* Fix minor error

* Update NeoXArgs docs automatically

* Revert small.yml config

* Update NeoXArgs docs automatically

* Reinitialize weights using mup's replacements after set_base_shapes is called

* Update NeoXArgs docs automatically

* Implement rescale parameters on the output layer, adjust learning rate based on width

* Update NeoXArgs docs automatically

* Remove debug prints

* Update NeoXArgs docs automatically

* Add preliminary support for coord check (WIP: not yet functional in this commit)

* Update NeoXArgs docs automatically

* Add untracked file from last commit

* Update NeoXArgs docs automatically

* Update for coord check plots

* Update NeoXArgs docs automatically

* Add all but one (and a half) of the new hyperparameters from the zero-shot hp transfer paper

* Update NeoXArgs docs automatically

* Add last mup HP

* Add mup readme file

* Update NeoXArgs docs automatically

* Revert changes to configs/small.yml

* Update NeoXArgs docs automatically

* Update README-MUP.md

* Update NeoXArgs docs automatically

* Clean up code for PR

* Update NeoXArgs docs automatically

* Make mup import optional

* Update NeoXArgs docs automatically

* Revert "Update NeoXArgs docs automatically"

This reverts commit a7b97fd.

* Update NeoXArgs docs automatically

* Revert "Update NeoXArgs docs automatically"

This reverts commit 8161a56.

* Update NeoXArgs docs automatically

* Add neox arg for mup delta model width scale

* Update NeoXArgs docs automatically

Co-authored-by: Nick Sarkauskas <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: Quentin-Anthony <[email protected]>
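
For anyone skimming the squashed commit message above: the mup changes follow the upstream `mup` package's usual pattern of a base model, a delta model, `set_base_shapes`, and a mup-aware optimizer. A minimal, self-contained sketch of that pattern, not the actual gpt-neox integration:

```python
# Minimal mup sketch using the upstream `mup` package
# (https://github.com/microsoft/mup); a toy MLP, not gpt-neox itself.
import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes

class MLP(nn.Module):
    """Toy model whose hidden width varies; input/output dims stay fixed."""
    def __init__(self, width):
        super().__init__()
        self.fc_in = nn.Linear(256, width)
        self.fc_out = MuReadout(width, 256)  # mup-aware output layer

    def forward(self, x):
        return self.fc_out(self.fc_in(x).relu())

base = MLP(width=64)      # narrow "base" (proxy) model
delta = MLP(width=128)    # delta model tells mup how shapes scale with width
model = MLP(width=2048)   # the model actually being trained

# Record the base/delta shapes so mup can rescale initializations and apply
# per-parameter learning-rate corrections on the target model.
set_base_shapes(model, base, delta=delta)

# MuAdam applies the width-dependent lr scaling; a lr tuned on the small base
# model then transfers to the wide model (the point of muTransfer).
optimizer = MuAdam(model.parameters(), lr=1e-3)
```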
StellaAthena added this to the Release V2 milestone on Dec 20, 2022
StellaAthena merged commit 2b84f9a into main on Mar 9, 2023
StellaAthena deleted the deepspeed_main branch on March 9, 2023

Successfully merging this pull request may close these issues: Add Mixture of Experts Support, ZeRO-Infinity.

7 participants