Updates configs to allow for the third failure mode #64

Merged: StellaAthena merged 16 commits into stella from StellaAthena-patch-1 on January 14, 2021

Conversation

StellaAthena (Member)

No description provided.

StellaAthena and others added 16 commits January 14, 2021 11:20
Added activation checkpointing in the json itself
This illustrates the third combination: pipeline parallelism, activation checkpoints, and ZeRO Stage 1. Again, this works. Upgrading ZeRO to Stage 2 causes it to fail.
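For reference, a minimal sketch of the config shape this combination exercises. The block and key names (`zero_optimization`, `activation_checkpointing`, `partition_activations`) are standard DeepSpeed config keys, but the values below are illustrative assumptions rather than the contents of this repo's actual JSON:

```json
{
  "train_batch_size": 256,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": false
  }
}
```

Per the description above, flipping `"stage"` from 1 to 2 while pipeline parallelism and activation checkpointing stay enabled is what triggers the failure.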
StellaAthena requested a review from a team as a code owner on January 14, 2021 at 22:36
StellaAthena requested reviews from lucidrains and AranKomat and removed the request for a team on January 14, 2021 at 22:36
StellaAthena merged commit 71c7a77 into stella on January 14, 2021
StellaAthena deleted the StellaAthena-patch-1 branch on January 14, 2021 at 22:37
StellaAthena added a commit that referenced this pull request Jan 17, 2021
* Added reduce_bucket_size argument to optimizer

* Started to revamp train.py

* revised training function, version 1

* Update gpt_neox.py

* Update train.sh

* Update base_deepspeed.json

* Create train_gpt3small_pipeline.sh

* Update train_stella.py

* Update train_stella.py

* adding environment.yml file

* Set zero to stage 1

* set ZeRO to stage 2

* turned off activation checkpointing

* turn contiguous gradients off

* turn activation checkpointing on

* removed reduce bucket size config

* turn on contiguous gradients

* removed extraneous comma

* Turn off masking

* Turned on attention masking

* Change name in prep for merging

* Rename base_deepspeed.json to deepspeed_zero2.json

* Create deepspeed_zero1.json

* Updated with new config names

* Update with new config names

* Update with new config names

* Update with new config names

* Rename train_stella.py to train_pipeline.py

* Update README.md

* Changed training loop to be consistent with train_enwik8.py

* Reorder code to match train_enwik8.py

* Modified transformerblock to pass mask argument

Minor tweak to `forward` to align better with the demo code

* Add print statement for testing

* Revert changes to avoid pulling dev work into main

Forgot about the open PR and pushed dev code to this branch.

* Updates configs to allow for the third failure mode (#64)

* Pipeline + Checkpoint - ZeRO 2

This illustrates the third combination: pipeline parallelism, activation checkpoints, and ZeRO Stage 1. Again, this works. Upgrading ZeRO to Stage 2 causes it to fail.

Co-authored-by: Shivanshu Purohit <[email protected]>

* Update train_enwik8.sh

* Update deepspeed_zero2.json

Co-authored-by: sdtblck <[email protected]>
Co-authored-by: Shivanshu Purohit <[email protected]>
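
The ZeRO-related commits above ("Set zero to stage 1", "set ZeRO to stage 2", and the reduce bucket size and contiguous gradients toggles) all manipulate DeepSpeed's `zero_optimization` block. As a hedged sketch of where those keys live, using DeepSpeed's documented key names with illustrative values (500000000 is DeepSpeed's documented default for `reduce_bucket_size`), not necessarily what deepspeed_zero2.json ended up containing:

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "reduce_bucket_size": 500000000
  }
}
```

Note that the commit log both adds and later removes the reduce bucket size setting, so it appears here only to identify the key being toggled.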