Polish README
workingloong committed Jan 2, 2024
1 parent 7094d6a commit 33e4f40
18 changes: 10 additions & 8 deletions README.md
@@ -49,18 +49,20 @@ training job. The actions to restore training in DLRover are:
For details, see the [experiments](docs/tech_report/fault_tolerance_exps.md)
on fault tolerance and elasticity.

### Flash Checkpoint to Reduce the Time Overhead to Save/Load Checkpoint
#### Fault Tolerance and Flash Checkpoint to Reduce Downtime of PyTorch Training

DLRover Flash Checkpoint can save/load checkpoint in seconds and the training
can frequently do checkpoint with a little time. The actions of flash checkpoint are:
In addition to fault tolerance, DLRover provides flash checkpoint to
save and load checkpoints within seconds. With flash checkpoint, the training can
save checkpoints frequently and roll back fewer steps when it resumes
from the latest checkpoint after a failure. The actions of flash checkpoint (sketched in the example after this list) are:

1. Asynchronously persist the checkpoint to the storage.
2. Persist the checkpoint to the storage once the training process fails.
3. Load the checkpoint from the host memory after the training process restarts.
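
DLRover's own flash checkpoint API is not shown on this page; the following is a minimal, hypothetical PyTorch sketch of the asynchronous-persist idea in action 1: the checkpoint is first snapshotted to host (CPU) memory on the training thread, and a background thread then writes it to storage so the training loop is not blocked by slow I/O. The names (`async_save_checkpoint`, `_to_cpu`) are illustrative assumptions, not DLRover code.

```python
# Minimal sketch (not DLRover's API): snapshot the checkpoint to host memory,
# then persist it to storage on a background thread so training is not blocked.
import copy
import threading

import torch


def _to_cpu(state):
    """Recursively copy tensors in a (possibly nested) state dict to CPU memory."""
    if isinstance(state, torch.Tensor):
        return state.detach().cpu().clone()
    if isinstance(state, dict):
        return {k: _to_cpu(v) for k, v in state.items()}
    if isinstance(state, (list, tuple)):
        return type(state)(_to_cpu(v) for v in state)
    return copy.deepcopy(state)


def async_save_checkpoint(model, optimizer, step, path):
    # 1. Snapshot to host memory on the training thread (fast, no disk I/O).
    snapshot = {
        "step": step,
        "model": _to_cpu(model.state_dict()),
        "optimizer": _to_cpu(optimizer.state_dict()),
    }
    # 2. Persist the snapshot to storage asynchronously (slow I/O off the critical path).
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    # Keep the snapshot in memory so it can be flushed on failure and reloaded quickly.
    return snapshot
```

A real flash-checkpoint system would additionally flush the in-memory snapshot when the training process fails (action 2) and reload it from host memory after a restart (action 3); the sketch only covers the asynchronous persist in action 1.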

After applying the fault tolerance and flash checkpoint of DLRover, the overall goodput
(the time spent computing useful new steps over the elapsed time of the training job)
for the largest-scale training job using thousands of GPUs increased from 69% to 95%.
After applying DLRover's fault tolerance and flash checkpoint, **the overall goodput
for the largest-scale training job using thousands of GPUs increased from 69% to 95%**.
Goodput is the time spent computing useful new steps divided by the elapsed time of the training job.
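
Restating that definition as a formula (the symbols below are just illustrative names for the two time quantities above):

```math
\text{goodput} = \frac{t_{\text{useful}}}{t_{\text{elapsed}}}
```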
The downtime details are shown below:

<div align="center">
@@ -76,8 +78,8 @@ DLRover can recover failed parameter servers and workers to resume training.
3. DLRover can automatically scale up the parameter servers to fit the model size.

At AntGroup, DLRover manages hundreds of DL training jobs every day on a customized Kubernetes cluster.
Except for the failed job resulting from code errors, the rate of completed jobs raise 89%
with tf-operator in KubeFlow to 95%. Other unrecoverable failure reasons of a job are data error,
Excluding jobs that failed because of code errors, *the rate of completed jobs increased from 89%
with tf-operator in KubeFlow to 95% with DLRover*. Other unrecoverable failure reasons of a job are data error,
NaN loss of the model, network breakdown, and so on.

<div align="center">
