
Commit

Design to backup checkpoint shards between nodes. (intelligent-machine-learning#1137)

* Design to backup checkpoint shards between nodes.

* Format markdown.

* Polish the design to checkpoint in memory.

* Format markdown files.

* Fix by comments.
workingloong committed May 23, 2024
1 parent 77977a7 commit a9359aa
Showing 3 changed files with 24 additions and 30 deletions.
54 changes: 24 additions & 30 deletions docs/design/checkpoint-in-memory.md
@@ -38,40 +38,34 @@ With different distributed training strategies,
the layout of the model and optimizer shards is different. We need to implement different
backup strategies for different distributed training strategies.

### DDP

In a DDP job, each rank has a complete model and optimizer replica. With Flash Checkpoint,
the local rank 0 of each node copies the checkpoint into the CPU memory, so each node
has a complete checkpoint in the CPU memory. If a node breaks down and restarts, the job can
select an alive node to broadcast its checkpoint to the new node.
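
A minimal sketch of this idea is shown below. It assumes a gloo process group `agent_group` built by the ElasticAgents of the nodes; the helper names are illustrative and this is not the Flash Checkpoint implementation.

```python
import torch.distributed as dist


def copy_checkpoint_to_cpu(model, optimizer, step, local_rank):
    """Local rank 0 keeps a full CPU copy of the checkpoint (every DDP rank
    holds a full replica, so one copy per node is enough)."""
    if local_rank != 0:
        return None
    return {
        "step": step,
        "model": {k: v.detach().cpu() for k, v in model.state_dict().items()},
        # In practice the optimizer tensors are also moved to CPU memory.
        "optimizer": optimizer.state_dict(),
    }


def broadcast_checkpoint(cpu_ckpt, src_rank, agent_group):
    """An alive node (src_rank) broadcasts its CPU checkpoint to the restarted
    node over the gloo group built by the ElasticAgents."""
    obj = [cpu_ckpt]
    dist.broadcast_object_list(obj, src=src_rank, group=agent_group)
    return obj[0]
```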

### FSDP

Each rank has a unique shard of the model and optimizer states when using FSDP. With Flash Checkpoint,
the ElasticAgent of each node keeps its checkpoint shard in the CPU memory. The job splits the nodes into
groups of two nodes, and each node needs to back up its checkpoint shard to the other node in its group.
We can build a process group among the ElasticAgents using the gloo backend to transfer checkpoint shards.
Then, the ElasticAgent uses `torch.distributed.all_gather_object`
to back up checkpoint shards. If a node in a group breaks down and restarts, the job needs to select the other
alive node to broadcast the backup checkpoint to the new node.
If all the nodes in a group fail, the training can only resume from the checkpoint in the storage.
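
A minimal sketch of this pairwise backup is shown below, assuming the ElasticAgents of the two nodes in a group have already built a gloo process group `backup_group` (an illustrative name); error handling and the Flash Checkpoint classes are omitted.

```python
import torch.distributed as dist


def backup_shard_to_peer(my_shard, backup_group):
    """Each agent contributes its checkpoint shard; afterwards both nodes of
    the group hold both shards in CPU memory."""
    world = dist.get_world_size(group=backup_group)  # 2 agents per backup group
    gathered = [None] * world
    dist.all_gather_object(gathered, my_shard, group=backup_group)
    return gathered  # e.g. [shard_of_node_0, shard_of_node_1]
```
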
### Backup Checkpoint Shards of Data Parallelism

The ranks in different DP units have the same model replica. If a node breaks down and restarts,
the restarted node can gather the model checkpoint from the memory of a node in another DP unit.

<div align="center">
<img src="../figures/ft_llm_training/dp-ckpt-backup.jpg" alt="DP Checkpoint Backup" width="600">
</div>
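
A sketch of this restore path is shown below. It assumes a process group `dp_group` (illustrative) that links the corresponding ranks across DP units and a boolean `has_ckpt` telling whether a rank still holds the checkpoint in memory; the group construction is omitted.

```python
import torch.distributed as dist


def restore_from_other_dp_unit(local_ckpt, has_ckpt, dp_group):
    """A restarted rank receives the model checkpoint from a peer rank in
    another DP unit that still holds it in memory."""
    # Find a group member that still has a live checkpoint in memory.
    flags = [None] * dist.get_world_size(group=dp_group)
    dist.all_gather_object(flags, has_ckpt, group=dp_group)
    # Convert the group rank of the sender to its global rank
    # (get_global_rank is available in recent PyTorch versions).
    src = dist.get_global_rank(dp_group, flags.index(True))
    obj = [local_ckpt if has_ckpt else None]
    dist.broadcast_object_list(obj, src=src, group=dp_group)
    return obj[0]
```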

### Megatron-LM
### Backup Checkpoint Shards of ZeRO Optimizer or FSDP

Megatron-LM uses 3D parallelism to train an LLM. The model and optimizer shards of the nodes with
the same PP rank in different DP ranks are the same. Similar to DDP, the new node can restore
the checkpoint from an alive node with the same PP rank. If the training uses the distributed optimizer,
which partitions the optimizer states across all ranks, we need to back up the checkpoint similarly to FSDP.
Each rank has a unique shard of the model and optimizer states when using DeepSpeed ZeRO or FSDP.
After saving the checkpoint into the shared memory, the job can split the nodes into backup groups
of two nodes. The ranks of the nodes in a backup group can exchange the checkpoint shards in the shared memory with
`torch.distributed.all_gather`. After all-gathering the checkpoint shards,
each rank has two checkpoint shards in the shared memory. If a node in a group breaks down and restarts,
the ranks of that node can gather its checkpoint shards from the memory of the backup node.
If all the nodes in a group fail, the training can only resume from the checkpoint in the storage.

<div align="center">
<img src="../figures/ft_llm_training/zero-ckpt-backup.jpg" alt="ZeRO Checkpoint Backup" width="600">
</div>
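
Because the two shards in a backup group usually differ in size, a tensor-based `torch.distributed.all_gather` needs the shards to be serialized and padded to a common length. The sketch below shows one possible way to do this; `backup_group`, `my_shard`, and the pickle-based serialization are illustrative assumptions rather than the actual Flash Checkpoint code.

```python
import pickle

import torch
import torch.distributed as dist


def exchange_shards(my_shard, backup_group):
    """All-gather the serialized checkpoint shards of a 2-node backup group so
    that every rank ends up with its own shard and the peer's shard."""
    world = dist.get_world_size(group=backup_group)
    data = torch.frombuffer(bytearray(pickle.dumps(my_shard)), dtype=torch.uint8)

    # First exchange the payload sizes so the buffers can be padded equally.
    size = torch.tensor([data.numel()], dtype=torch.int64)
    sizes = [torch.zeros_like(size) for _ in range(world)]
    dist.all_gather(sizes, size, group=backup_group)

    max_size = int(max(s.item() for s in sizes))
    padded = torch.zeros(max_size, dtype=torch.uint8)
    padded[: data.numel()] = data
    buffers = [torch.zeros(max_size, dtype=torch.uint8) for _ in range(world)]
    dist.all_gather(buffers, padded, group=backup_group)

    # Strip the padding and deserialize both shards.
    return [
        pickle.loads(bytes(buf[: int(n.item())].tolist()))
        for buf, n in zip(buffers, sizes)
    ]
```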

### Group Nodes to Backup Checkpoint

The job can pair the nodes into groups of two according to their sequence numbers. For example,
the groups are [{0, 1}, {2, 3}, {4, 5}, {6, 7}] if there are 8 nodes. After the training restarts,
each node reports "yes" to the master if the checkpoint is in its memory; otherwise it reports "no"
because the node has restarted. The following 3 cases may happen:

- No node restarts. The job master notifies all nodes to restore the checkpoint from their CPU memory.
- One node in a group has restarted. The job master notifies the other alive node to broadcast the
checkpoint in its CPU memory to the restarted node. Then, all nodes restore the checkpoint from their CPU memory.
- All the nodes in a group have restarted, which means some checkpoint shards are lost in memory.
The job master will notify all nodes to restore the checkpoint from the storage.

When restoring from memory, each rank will read the checkpoint from its shared memory, and the ranks of restarted nodes
will gather the checkpoint from the backup nodes. Then, all ranks will use an AllReduce operation
to check whether the step in the checkpoint of all ranks is the same. All ranks will read the
checkpoint from the storage if the steps are not consistent. A sketch of the pairing and this decision logic follows below.
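
The sketch below illustrates the pairing, the master's decision per backup group, and the step-consistency check; `pair_nodes`, `restore_decision`, `reports`, and `steps_are_consistent` are illustrative names, not actual APIs of the job master.

```python
import torch
import torch.distributed as dist


def pair_nodes(num_nodes):
    """Pair nodes by sequence number, e.g. 8 nodes -> [{0, 1}, {2, 3}, {4, 5}, {6, 7}]."""
    return [{i, i + 1} for i in range(0, num_nodes, 2)]


def restore_decision(group, reports):
    """Decide how one backup group restores; reports[node] is True if that
    node still holds the checkpoint in its memory."""
    alive = [node for node in group if reports[node]]
    if len(alive) == len(group):
        return "restore_from_memory"         # case 1: no node restarted
    if alive:
        return "broadcast_from_backup_node"  # case 2: one node restarted
    return "restore_from_storage"            # case 3: both shards are lost


def steps_are_consistent(local_step):
    """All ranks verify that they loaded the same step; otherwise the job
    falls back to the checkpoint in the storage."""
    low = torch.tensor([local_step])
    high = torch.tensor([local_step])
    dist.all_reduce(low, op=dist.ReduceOp.MIN)
    dist.all_reduce(high, op=dist.ReduceOp.MAX)
    return int(low.item()) == int(high.item())
```

For example, `pair_nodes(8)` returns `[{0, 1}, {2, 3}, {4, 5}, {6, 7}]`, matching the grouping in the example above.
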
Binary file added docs/figures/ft_llm_training/dp-ckpt-backup.jpg
Binary file added docs/figures/ft_llm_training/zero-ckpt-backup.jpg
