A tutorial to find slow/fault nodes. (intelligent-machine-learning#922)
* A tutorial to find slow/fault nodes.

* polish

* Format markdown.
workingloong committed Dec 31, 2023
1 parent 18f09bf commit fcab049
Showing 9 changed files with 619 additions and 13 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -212,10 +212,10 @@ Please refer to the [DEVELOPMENT](docs/developer_guide.md)

## Quick Start

-[Train a TensorFlow Estimator on Kubernetes](docs/tutorial/tf_elasticjob_on_k8s.md)
+[Train a PyTorch Model on Kubernetes.](docs/tutorial/torch_on_cloud.md)

-[Train a PyTorch Model on Kubernetes](docs/tutorial/torch_elasticjob_on_k8s.md)
+[Train a GPT Model on Kubernetes.](docs/tutorial/torch_ddp_nanogpt.md)

-[Train a GPT Model.](docs/tutorial/torch_nanogpt.md)
+[Use DLRover to find slow/fault nodes.](docs/tutorial/check_node_health.md)

-[Train a llama2 model.](examples/pytorch/llama2/README.md)
+[Train a TensorFlow Estimator on Kubernetes.](docs/tutorial/tf_ps_on_cloud.md)
2 changes: 1 addition & 1 deletion docs/blogs/stabilize_llm_training_cn.md
Original file line number Diff line number Diff line change
@@ -128,7 +128,7 @@ worlds; the nodes in each world run an allgather task and report whether it succeeded,
which shows that node 6 is the faulty node. DLRover then relaunches a Pod to replace node 6.

<div align="center">
-  <img src="../figures/ft_llm_training/node_healthy_check.jpg" alt="Editor" width="600">
+  <img src="../figures/ft_llm_training/node_health_check.jpg" alt="Editor" width="600">
</div>
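The mechanism described in this hunk — grouping nodes into small worlds, running an allgather task in each, and using the per-world results to pin down the faulty node — can be sketched as a toy diagnosis routine. The all-pairs grouping and the `allgather_ok` callback below are illustrative assumptions, not DLRover's actual strategy:

```python
from itertools import combinations


def find_faulty_nodes(nodes, allgather_ok):
    """Diagnose faulty nodes from per-world allgather results.

    `allgather_ok(world)` stands in for actually running an allgather
    inside the world and reporting success or failure; here it is any
    callable returning a bool. Every member of a world whose check
    succeeds is cleared of suspicion; whatever remains is faulty.
    """
    suspects = set(nodes)
    for world in combinations(nodes, 2):  # two-node worlds, all pairings
        if allgather_ok(world):
            suspects -= set(world)  # both members proved healthy
    return suspects
```

For example, if node 6 breaks every world it belongs to, the routine returns `{6}`, matching the diagnosis in the text above.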

### DLRover Error Log Collection
6 changes: 4 additions & 2 deletions docs/design/async-checkpoint.md
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@ will block the training to save checkpoint with a little time.

## A daemon Subprocess of the Training Process Asynchronously Saves Checkpoint to the Storage

We can start a daemon subprocess in the training process to save checkpoint to the storage.

- Start a thread to save states from GPU to CPU memory.
- Make the memory buffer to place Torch tensors of states.
@@ -149,7 +149,9 @@ finish the writing.

## Checkpoint APIs Design

The engine synchronously saves the checkpointing state dict into the CPU memory
buffer and notifies the checkpoint saver to save the checkpoint from CPU memory
buffer to the storage.

```Python
# (API example collapsed in the diff view)
```
Binary file removed docs/figures/ft_llm_training/node_healthy_check.jpg
