Commit

Fix by comments.
workingloong committed Jan 4, 2024
1 parent adf5150 commit eb75d6d
Showing 2 changed files with 8 additions and 8 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -74,10 +74,10 @@ from the latest checkpoint when a failure happens. The actions of flash checkpoi
<text> The Performance of DLRover Flash Checkpoint to Save/Load GPT2-1.5B.</text>
</div>

-The figure illustrates that the I/O time overhead to read checkpoint files
-when resuming by restarting training processes. With DLRover Flash Checkpoint,
-recovery directly from shared memory takes essentially
-on the order of seconds wich is much faster than SSD and NAS.
+The figure illustrates that the I/O time to read checkpoint files
+when resuming training processes. With DLRover Flash Checkpoint,
+recovery could be completed in the order of seconds by loading checkpoints directly from shared memory,
+which is much faster compared to loading checkpoints from SSD and NAS.

#### Fault Tolerance Improves the Stability of TensorFlow PS Training

8 changes: 4 additions & 4 deletions docs/blogs/flash_checkpoint.md
@@ -307,10 +307,10 @@ Compared to NAS remote file systems, FCP reduces the blocking time by nearly a h
<text>Figure 4: The Paused Training Time to Save Checkpoint.</text>
</div>

-The figure illustrates that the I/O time overhead to read checkpoint files
-when resuming by restarting training processes. With DLRover Flash Checkpoint,
-recovery directly from shared memory takes essentially
-on the order of seconds wich is much faster than SSD and NAS.
+The figure illustrates that the I/O time to read checkpoint files
+when resuming training processes. With DLRover Flash Checkpoint,
+recovery could be completed in the order of seconds by loading checkpoints directly from shared memory,
+which is much faster compared to loading checkpoints from SSD and NAS.

<div align="center">
<img src="../figures/ft_llm_training/checkpoint_load_time.jpg" alt="Editor" width="600">