Update async-checkpoint.md (#1155)
fix the mistake of "manager"
cainiaogoroad committed Jun 3, 2024
1 parent c7013fa commit b74c2a4
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/design/async-checkpoint.md
@@ -95,13 +95,13 @@ implement the checkpointing process.
<img src="../figures/async-ckpt-classes.jpg" alt="Async Checkpoint Classes" width="1000">
</div>

-- **AgentCkptManger**
+- **AgentCkptManager**
- One instance runs in each agent process.
memory and the storage.
- Get the Shared lock of shared memory and save the checkpoint state into the storage.
- One of Agent check if all agents finish the writing and commit the checkpoint.

-- **TrainCkptManger**
+- **TrainCkptManager**
- One instance runs in each training process.
- Is responsible for coping the checkpointing state from GPU to shared memory.
- Notifies the AgentCkptManger to save the checkpoint state into the storage.
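
For readers of this hunk, the division of labor between the two classes can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `SharedBuffer`, `save`, `persist_next`, `commit_if_complete`, and the file naming scheme are all hypothetical, and an in-process `threading.Lock` and `queue.Queue` stand in for the cross-process shared-memory lock and the notification channel.

```python
import os
import pickle
import queue
import threading


class SharedBuffer:
    """Hypothetical stand-in for the shared memory segment and its lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.step = None
        self.payload = None


class TrainCkptManager:
    """Training side: copies state into the buffer, then notifies the agent."""

    def __init__(self, buffer, events):
        self._buffer = buffer
        self._events = events

    def save(self, step, state):
        with self._buffer.lock:
            self._buffer.step = step
            self._buffer.payload = pickle.dumps(state)
        self._events.put(step)  # assumption: a queue models the notification


class AgentCkptManager:
    """Agent side: waits for a notification and persists the buffer."""

    def __init__(self, buffer, events, storage_dir, rank):
        self._buffer = buffer
        self._events = events
        self._storage_dir = storage_dir
        self._rank = rank

    def persist_next(self, timeout=5.0):
        step = self._events.get(timeout=timeout)
        with self._buffer.lock:
            if self._buffer.step != step:
                raise RuntimeError("buffer was overwritten before persisting")
            data = self._buffer.payload
        path = os.path.join(
            self._storage_dir, f"step-{step}-rank-{self._rank}.ckpt"
        )
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # rename so readers never see a partial file
        return path


def commit_if_complete(storage_dir, step, num_agents):
    """One agent checks that every rank has written, then marks the commit."""
    expected = [
        os.path.join(storage_dir, f"step-{step}-rank-{rank}.ckpt")
        for rank in range(num_agents)
    ]
    if not all(os.path.exists(p) for p in expected):
        return False
    marker = os.path.join(storage_dir, f"step-{step}.committed")
    with open(marker, "w") as f:
        f.write(str(step))
    return True
```

In this sketch the training side returns as soon as the copy into the buffer finishes, while the slower write to storage happens on the agent side; that separation is the point of the asynchronous design described in the document.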