A tutorial to find slow/fault nodes. (intelligent-machine-learning#922)
* A tutorial to find slow/fault nodes.

* polish

* Format markdown.
workingloong committed Dec 31, 2023
1 parent 18f09bf commit fcab049
Showing 9 changed files with 619 additions and 13 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -212,10 +212,10 @@ Please refer to the [DEVELOPMENT](docs/developer_guide.md)

## Quick Start

-[Train a TensorFlow Estimator on Kubernetes](docs/tutorial/tf_elasticjob_on_k8s.md)
+[Train a PyTorch Model on Kubernetes.](docs/tutorial/torch_on_cloud.md)

-[Train a PyTorch Model on Kubernetes](docs/tutorial/torch_elasticjob_on_k8s.md)
+[Train a GPT Model on Kubernetes.](docs/tutorial/torch_ddp_nanogpt.md)

-[Train a GPT Model.](docs/tutorial/torch_nanogpt.md)
+[Use DLRover to find slow/fault nodes.](docs/tutorial/check_node_health.md)

-[Train a llama2 model.](examples/pytorch/llama2/README.md)
+[Train a TensorFlow Estimator on Kubernetes.](docs/tutorial/tf_ps_on_cloud.md)
2 changes: 1 addition & 1 deletion docs/blogs/stabilize_llm_training_cn.md
Original file line number Diff line number Diff line change
@@ -128,7 +128,7 @@ worlds; the nodes in each world run an allgather task and report whether it succeeded,
which shows that node 6 is the faulty node. DLRover then relaunches a Pod to replace node 6.

<div align="center">
-  <img src="../figures/ft_llm_training/node_healthy_check.jpg" alt="Editor" width="600">
+  <img src="../figures/ft_llm_training/node_health_check.jpg" alt="Editor" width="600">
</div>
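The mechanism described in this hunk — grouping nodes into small worlds, running an allgather task in each, and using the per-world results to pin down the faulty node — can be sketched as a toy diagnosis routine. The all-pairs grouping and the `allgather_ok` callback below are illustrative assumptions, not DLRover's actual strategy:

```python
from itertools import combinations


def find_faulty_nodes(nodes, allgather_ok):
    """Diagnose faulty nodes from per-world allgather results.

    `allgather_ok(world)` stands in for actually running an allgather
    inside the world and reporting success or failure; here it is any
    callable returning a bool. Every member of a world whose check
    succeeds is cleared of suspicion; whatever remains is faulty.
    """
    suspects = set(nodes)
    for world in combinations(nodes, 2):  # two-node worlds, all pairings
        if allgather_ok(world):
            suspects -= set(world)  # both members proved healthy
    return suspects
```

For example, if node 6 breaks every world it belongs to, the routine returns `{6}`, matching the diagnosis in the text above.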

### DLRover Error Log Collection
6 changes: 4 additions & 2 deletions docs/design/async-checkpoint.md
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@ will block the training to save checkpoint with a little time.

## A daemon Subprocess of the Training Process Asynchronously Saves Checkpoint to the Storage

We can start a daemon subprocess in the training process to save checkpoint to the storage.

- Start a thread to save states from GPU to CPU memory.
- Make the memory buffer to place Torch tensors of states.
@@ -149,7 +149,9 @@ finish the writing.

## Checkpoint APIs Design

The engine synchronously saves the checkpointing state dict into the CPU memory
buffer and notifies the checkpoint saver to save the checkpoint from CPU memory
buffer to the storage.

```Python
# (API example collapsed in the diff view)
```
Binary file removed docs/figures/ft_llm_training/node_healthy_check.jpg
