Fix by comments.
workingloong committed Mar 6, 2024
1 parent 4187971 commit 2a47a29
Showing 5 changed files with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/blogs/stabilize_llm_training_cn.md
@@ -88,7 +88,7 @@ ElasticAgent can then configure, for the child processes it launches, the local rank, global rank, and w
process for handling faulty resources is as follows:

<div align="center">
-<img src="../figures/ft_llm_training/dlrover_failure_recovery.png" alt="Editor" width="600">
+<img src="../figures/ft_llm_training/dlrover_failure_recovery.jpg" alt="Editor" width="600">

<text> Figure 2: DLRover training failure self-healing flow </text>
</div>
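@@ -112,7 +112,7 @@ Before restarting training, DLRover launches a child process on each GPU to run a lightweight
For details, see the [check script](../../dlrover/trainer/torch/run_network_check.py). The check flow that DLRover runs before starting a training job is as follows.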

<div align="center">
-<img src="../figures/ft_llm_training/dlrover_node_check.png" alt="Editor" width="600">
+<img src="../figures/ft_llm_training/dlrover_node_check.jpg" alt="Editor" width="600">

<text>Figure 4: DLRover node check flow </text>
</div>
Binary file removed docs/figures/ft_llm_training/dlrover_node_check.png