Updates configs to allow for the third failure mode #64

Merged: StellaAthena merged 16 commits into stella from StellaAthena-patch-1 on January 14, 2021

Conversation

StellaAthena (Member)

No description provided.

StellaAthena and others added 16 commits January 14, 2021 11:20
Added activation checkpointing in the json itself
This illustrates the third combination: pipeline parallelism, activation checkpoints, and ZeRO Stage 1. Again, this works. Upgrading ZeRO to Stage 2 causes it to fail.
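For reference, a minimal sketch of the config shape this combination exercises. The block and key names (`zero_optimization`, `activation_checkpointing`, `partition_activations`) are standard DeepSpeed config keys, but the values below are illustrative assumptions rather than the contents of this repo's actual JSON:

```json
{
  "train_batch_size": 256,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": false
  }
}
```

Per the description above, flipping `"stage"` from 1 to 2 while pipeline parallelism and activation checkpointing stay enabled is what triggers the failure.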
StellaAthena requested a review from a team as a code owner on January 14, 2021 at 22:36
StellaAthena requested reviews from lucidrains and AranKomat and removed the request for a team on January 14, 2021 at 22:36
StellaAthena merged commit 71c7a77 into stella on January 14, 2021
StellaAthena deleted the StellaAthena-patch-1 branch on January 14, 2021 at 22:37
StellaAthena added a commit that referenced this pull request Jan 17, 2021
* Added reduce_bucket_size argument to optimizer

* Started to revamp train.py

* revised training function, version 1

* Update gpt_neox.py

* Update train.sh

* Update base_deepspeed.json

* Create train_gpt3small_pipeline.sh

* Update train_stella.py

* Update train_stella.py

* adding environment.yml file

* Set zero to stage 1

* set ZeRO to stage 2

* turned off activation checkpointing

* turn contiguous gradients off

* turn activation checkpointing on

* removed reduce bucket size config

* turn on contiguous gradients

* removed extraneous comma

* Turn off masking

* Turned on attention masking

* Change name in prep for merging

* Rename base_deepspeed.json to deepspeed_zero2.json

* Create deepspeed_zero1.json

* Updated with new config names

* Update with new config names

* Update with new config names

* Update with new config names

* Rename train_stella.py to train_pipeline.py

* Update README.md

* Changed training loop to be consistent with train_enwik8.py

* Reorder code to match train_enwik8.py

* Modified transformerblock to pass mask argument

Minor tweak to `forward` to align better with the demo code

* Add print statement for testing

* Revert changes to avoid pulling dev work into main

Forgot about the open PR and pushed dev code to this branch.

* Updates configs to allow for the third failure mode (#64)

* Pipeline + Checkpoint - ZeRO 2

This illustrates the third combination: pipeline parallelism, activation checkpoints, and ZeRO Stage 1. Again, this works. Upgrading ZeRO to Stage 2 causes it to fail.

Co-authored-by: Shivanshu Purohit <[email protected]>

* Update train_enwik8.sh

* Update deepspeed_zero2.json

Co-authored-by: sdtblck <[email protected]>
Co-authored-by: Shivanshu Purohit <[email protected]>
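
The ZeRO-related commits above ("Set zero to stage 1", "set ZeRO to stage 2", and the reduce bucket size and contiguous gradients toggles) all manipulate DeepSpeed's `zero_optimization` block. As a hedged sketch of where those keys live, using DeepSpeed's documented key names with illustrative values (500000000 is DeepSpeed's documented default for `reduce_bucket_size`), not necessarily what deepspeed_zero2.json ended up containing:

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "reduce_bucket_size": 500000000
  }
}
```

Note that the commit log both adds and later removes the reduce bucket size setting, so it appears here only to identify the key being toggled.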