Implement distributed training using Kubernetes #77

Merged: 17 commits, Jan 23, 2021
Added logging config
StellaAthena committed Jan 23, 2021
commit 480dc36d56434c8ab82f55dfba313f72f9ef0c5e
6 changes: 5 additions & 1 deletion configs/deepspeed_zero2.json
@@ -1,5 +1,5 @@
 {
-    "train_batch_size": 256,
+    "train_batch_size": 1028,
     "gradient_accumulation_steps": 1,
     "gradient_clipping": 1.0,
     "tensorboard": {
@@ -31,6 +31,10 @@
     "contiguous_gradients" : false,
     "cpu_offload": false
   },
+  "logging": {
+    "steps_per_print": 100,
+    "wall_clock_breakdown": true
+  },
   "activation_checkpointing": {
     "comment": "to turn on activation checkpointing, set this to a positive integer. Do not touch other params.",
     "partition_activations": false,
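The new `logging` block sets how often DeepSpeed prints training metrics (`steps_per_print`) and turns on its per-step wall-clock timing breakdown (`wall_clock_breakdown`); the same commit also raises `train_batch_size` from 256 to 1028. A minimal sketch of loading and sanity-checking the touched keys in Python (the inline JSON reproduces only the fields visible in this diff, not the full `configs/deepspeed_zero2.json`):

```python
import json

# Fragment of configs/deepspeed_zero2.json after this commit (illustrative;
# only the keys touched by the diff are reproduced here).
config_json = """
{
  "train_batch_size": 1028,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "logging": {
    "steps_per_print": 100,
    "wall_clock_breakdown": true
  }
}
"""

config = json.loads(config_json)

# With gradient_accumulation_steps = 1, each optimizer step consumes one
# global batch of train_batch_size samples.
global_batch = config["train_batch_size"]
assert global_batch % config["gradient_accumulation_steps"] == 0

# Cadence of metric output from the new logging block.
steps_per_print = config["logging"]["steps_per_print"]
print(f"DeepSpeed prints metrics every {steps_per_print} steps")
```

This only validates that the fragment parses as the diff intends; in an actual run the file path would be passed to `deepspeed.initialize` as the engine config.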