Add checkpoint saving / loading (#90)
* fix torch.utils.checkpoint error
* fix breaks in train_pipeline.py
* push fixes to train_pipeline.py
* push changes to zero1 config
* omnibus changes to *pipeline.py scripts
* add checkpoint saving / loading
* changed line 64 checkpoint_dirs = natural_sort(checkpoint_dir) rather than natural_sort(checkpoint_dirs)
* Update utils.py
* fix checkpoint saving / loading logic
* fix checkpoint saving logic
* Update gpt3_small.json
* Change params for OnebitAdam

  Per my issue in deepspeed yesterday, I was told by a dev (microsoft/DeepSpeed#690 (comment)) that the error I was facing was due to the incorrect keyword.
* Fixing batch size
* Made consistent with ZeRO 1
* Made consistent with ZeRO 2
* Update deepspeed_zero2.json
* Undo previous commit
* Reverted back to normal adam from 1-bit-adam (#96)
* Create checkpoints_config.json
* Update train_gpt3small_pipeline.sh

Co-authored-by: Shivanshu Purohit
Co-authored-by: Stella Biderman
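One item in the message above concerns calling `natural_sort` on the checkpoint directories. The repository's `utils.py` is not shown on this page, so the following is only a minimal sketch of what a natural sort usually does, assuming checkpoint directory names embed a step number. The `step_*` names are hypothetical and chosen for illustration.

```python
import re


def natural_sort(names):
    """Sort strings so that embedded integers compare numerically."""

    def key(name):
        # Split into alternating non-digit and digit runs,
        # e.g. "step_10" -> ["step_", 10, ""].
        return [int(part) if part.isdigit() else part.lower()
                for part in re.split(r"(\d+)", name)]

    return sorted(names, key=key)


# Hypothetical checkpoint directory names, for illustration only.
checkpoint_dirs = ["step_10", "step_2", "step_1"]
latest = natural_sort(checkpoint_dirs)[-1]
print(latest)  # step_10
```

A plain lexicographic `sorted` would place `step_10` before `step_2`, so picking the last element would resume from the wrong checkpoint. Sorting on numeric keys avoids that.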
1 parent 39972e6, commit 4aee002
Showing 13 changed files with 183 additions and 45 deletions.
@@ -281,3 +281,6 @@ TSWLatexianTemp*
 
 # Makeindex log files
 *.lpz
+
+# saved model files
+*.pt
@@ -15,5 +15,5 @@
 "n_layers": 6,
 "n_heads": 8,
 "dim_head": 64,
-"train_batch_size": 8
+"checkpoint_dir": "./enwik8_model"
 }
@@ -0,0 +1,45 @@
+{
+  "train_batch_size": 1280,
+  "gradient_accumulation_steps": 80,
+  "gradient_clipping": 1.0,
+  "wall_clock_breakdown": true,
+  "zero_allow_untested_optimizer": true,
+  "tensorboard": {
+    "enabled": true,
+    "output_path": "./logs",
+    "job_name": "gptneox"
+  },
+  "optimizer": {
+    "type": "OneBitAdam",
+    "params": {
+      "lr": 2e-4,
+      "freeze_step":2,
+      "cuda_aware":true
+    }
+  },
+  "scheduler": {
+    "type": "WarmupLR",
+    "params": {
+      "warmup_min_lr": 0,
+      "warmup_max_lr": 0.00015,
+      "warmup_num_steps": 5000
+    }
+  },
+  "fp16": {
+    "enabled": true
+  },
+  "zero_optimization": {
+    "stage": 1,
+    "contiguous_gradients" : true,
+    "cpu_offload": false
+  },
+  "activation_checkpointing": {
+    "partition_activations": true,
+    "cpu_checkpointing": false,
+    "contiguous_memory_optimization": false,
+    "number_checkpoints": 1,
+    "synchronize_checkpoint_boundary": false,
+    "profile": false
+  }
+
+}
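The config above sets `train_batch_size` to 1280 and `gradient_accumulation_steps` to 80. DeepSpeed expects `train_batch_size` to equal the per-GPU micro-batch size times the accumulation steps times the number of GPUs, so these two values leave 16 samples per optimizer micro-step to be split across devices. The sketch below works through that arithmetic. The commit does not state the GPU count, so the world sizes are hypothetical, and the helper name `micro_batch_per_gpu` is introduced here for illustration and is not part of DeepSpeed.

```python
def micro_batch_per_gpu(train_batch_size, gradient_accumulation_steps, world_size):
    """Return the per-GPU micro-batch size implied by the global batch settings.

    Assumes: train_batch_size ==
        micro_batch * gradient_accumulation_steps * world_size
    """
    denominator = gradient_accumulation_steps * world_size
    if train_batch_size % denominator != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"gradient_accumulation_steps * world_size = {denominator}"
        )
    return train_batch_size // denominator


# 1280 and 80 come from the config above; the world sizes are hypothetical.
for world_size in (1, 2, 4, 8, 16):
    print(world_size, micro_batch_per_gpu(1280, 80, world_size))
```

With these settings, any GPU count that does not divide 16 raises an error in this sketch, which mirrors the kind of batch-size mismatch the "Fixing batch size" item in the commit message may have addressed.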
@@ -23,5 +23,6 @@
 "n_layers": 12,
 "n_heads": 12,
 "dim_head": 64,
+"checkpoint_dir": "./gpt3small",
 "train_batch_size": 256
 }
@@ -1 +1,2 @@
-pkill -f "python -u train*"
+pkill -f "python -u train*"
+pkill -9 python
@@ -1,2 +1,2 @@
 mkdir logs
-NCCL_SHM_DISABLE=1 NCCL_DEBUG=info MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train_enwik8_pipeline.py --deepspeed --deepspeed_config configs/deepspeed_zero2.json
+NCCL_SHM_DISABLE=1 NCCL_DEBUG=info MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train_enwik8_pipeline.py --deepspeed --deepspeed_config configs/deepspeed_zero1.json --model base_model
@@ -1 +1 @@
-MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train_pipeline.py --deepspeed --deepspeed_config configs/deepspeed_zero1.json
+NCCL_SHM_DISABLE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=2000 deepspeed train_pipeline.py --deepspeed --deepspeed_config configs/checkpoints_config.json