diff --git a/README.md b/README.md
index 23e948813..4ceafdea9 100644
--- a/README.md
+++ b/README.md
@@ -212,10 +212,10 @@ Please refer to the [DEVELOPMENT](docs/developer_guide.md)
 
 ## Quick Start
 
-[Train a TensorFlow Estimator on Kubernetes](docs/tutorial/tf_elasticjob_on_k8s.md)
+[Train a PyTorch Model on Kubernetes.](docs/tutorial/torch_on_cloud.md)
 
-[Train a PyTorch Model on Kubernetes](docs/tutorial/torch_elasticjob_on_k8s.md)
+[Train a GPT Model on Kubernetes.](docs/tutorial/torch_ddp_nanogpt.md)
 
-[Train a GPT Model.](docs/tutorial/torch_nanogpt.md)
+[Use DLRover to find slow/fault nodes.](docs/tutorial/check_node_health.md)
 
-[Train a llama2 model.](examples/pytorch/llama2/README.md)
+[Train a TensorFlow Estimator on Kubernetes.](docs/tutorial/tf_ps_on_cloud.md)
diff --git a/docs/blogs/stabilize_llm_training_cn.md b/docs/blogs/stabilize_llm_training_cn.md
index 6054c31d0..f931b30b6 100644
--- a/docs/blogs/stabilize_llm_training_cn.md
+++ b/docs/blogs/stabilize_llm_training_cn.md
@@ -128,7 +128,7 @@ world,每个 world 内的节点上执行 allgather 任务并将成功与否上
 说明节点6 就是故障节点。然后,DLRover 会重新拉起一个 Pod,替换节点6。
-<img src="../figures/ft_llm_training/node_healthy_check.jpg" alt="Editor">
+<img src="../figures/ft_llm_training/node_health_check.jpg" alt="Editor">
 ### DLRover 错误日志收集
diff --git a/docs/design/async-checkpoint.md b/docs/design/async-checkpoint.md
index 553dd6e07..05d7636a9 100644
--- a/docs/design/async-checkpoint.md
+++ b/docs/design/async-checkpoint.md
@@ -44,7 +44,7 @@ will block the training to save checkpoint with a little time.
 
 ## A daemon Subprocess of the Training Process Asynchronously Saves Checkpoint to the Storage
 
-We can start a daemon subprocess in the training process to save checkpoint to the storage. 
+We can start a daemon subprocess in the training process to save checkpoint to the storage.
 
 - Start a thread to save states from GPU to CPU memory.
 - Make the memory buffer to place Torch tensors of states.
@@ -149,7 +149,9 @@ finish the writing.
 
 ## Checkpoint APIs Design
 
-The engine synchronously saves the checkpointing state dict into the CPU memory buffer and notifies the checkpoint saver to save the checkpoint from CPU memory buffer to the storage.
+The engine synchronously saves the checkpointing state dict into the CPU memory
+buffer and notifies the checkpoint saver to save the checkpoint from CPU memory
+buffer to the storage.
 
 ```Python
diff --git a/docs/figures/ft_llm_training/node_health_check.jpg b/docs/figures/ft_llm_training/node_health_check.jpg
new file mode 100644
index 000000000..f267afe4b
Binary files /dev/null and b/docs/figures/ft_llm_training/node_health_check.jpg differ
diff --git a/docs/figures/ft_llm_training/node_healthy_check.jpg b/docs/figures/ft_llm_training/node_healthy_check.jpg
deleted file mode 100644
index 519bc0cf2..000000000
Binary files a/docs/figures/ft_llm_training/node_healthy_check.jpg and /dev/null differ
diff --git a/docs/tech_report/fault_tolerance_exps.md b/docs/tech_report/fault_tolerance_exps.md
new file mode 100644
index 000000000..f32d8d1ec
--- /dev/null
+++ b/docs/tech_report/fault_tolerance_exps.md
@@ -0,0 +1,492 @@
+# Fault-tolerance and Elasticity Experiments of DLRover ElasticJob
+
+This tutorial shows experiments that test the fault tolerance and elasticity
+of DLRover ElasticJob. In the experiments, we use the chaos engineering toolkit
+[chaosblade](https://github.com/chaosblade-io/chaosblade) to simulate fault scenarios.
+
+## Preliminary
+
+- Create a k8s cluster and configure cluster credentials on your local computer.
+- Deploy DLRover ElasticJob on the k8s cluster with the [tutorial](torch_on_cloud.md).
+- Build the image with chaosblade like the [example](../../examples/pytorch/mnist/mnist_chaos.dockerfile).
+
+## Experiments of PyTorch Distributed Job
+
+We conduct experiments to simulate the following scenarios; a small helper to
+watch the job Pods during the experiments is sketched after the list.
+
+- The Pod is preempted.
+- The Pod is a straggler.
+- The Pod is placed on a fault node.
+- The Pod network breaks down during training.
+- The training process in the Pod crashes.
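+
+In each experiment, the workflow is the same: submit the job, inject a fault with
+chaosblade or `kubectl`, and then watch whether DLRover replaces the broken Pod and
+the training recovers. Besides running `kubectl get pods` repeatedly, a tiny helper
+can poll the Pods for you. The snippet below is only a convenience sketch using the
+official `kubernetes` Python client; the namespace and Pod-name prefix are
+assumptions that match the examples in this document.
+
+```python
+import time
+
+from kubernetes import client, config
+
+
+def watch_job_pods(namespace="dlrover", prefix="chaos-test-edljob", interval=10):
+    """Poll the Pods of a job and print their phases, so we can see a worker
+    Pod fail and a replacement worker appear during an experiment."""
+    config.load_kube_config()
+    v1 = client.CoreV1Api()
+    while True:
+        pods = v1.list_namespaced_pod(namespace).items
+        phases = {
+            p.metadata.name: p.status.phase
+            for p in pods
+            if p.metadata.name.startswith(prefix)
+        }
+        print(phases)
+        time.sleep(interval)
+
+
+if __name__ == "__main__":
+    watch_job_pods()
+```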
+
+### Pod is Preempted
+
+In the experiment, we submit a job with the [example](../../examples/pytorch/mnist/choas_test_job.yaml)
+and the command in the worker spec is
+
+```yaml
+  command:
+    - /bin/bash
+    - -c
+    - "dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
+      --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
+      examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
+      --training_data /data/mnist_png/training/ \
+      --validation_data /data/mnist_png/testing/"
+```
+
+The Pods of the job are:
+
+```text
+chaos-test-edljob-worker-0              1/1     Running   0     85s
+chaos-test-edljob-worker-1              1/1     Running   0     85s
+chaos-test-edljob-worker-2              1/1     Running   0     85s
+chaos-test-edljob-worker-3              1/1     Running   0     85s
+elasticjob-chaos-test-dlrover-master    1/1     Running   0     89s
+```
+
+We kill worker-0 with the following command to simulate that the Pod is preempted:
+
+```bash
+kubectl -n dlrover delete pod chaos-test-edljob-worker-0
+```
+
+After killing worker-0, the job Pods are
+
+```text
+chaos-test-edljob-worker-1              1/1     Running   0     2m3s
+chaos-test-edljob-worker-2              1/1     Running   0     2m3s
+chaos-test-edljob-worker-3              1/1     Running   0     2m3s
+chaos-test-edljob-worker-4              1/1     Running   0     30s
+elasticjob-chaos-test-dlrover-master    1/1     Running   0     2m7s
+```
+
+Then, we can check the log of a worker to see whether the training is restored.
+
+```bash
+kubectl -n dlrover logs chaos-test-edljob-worker-1
+```
+
+```text
+loss = 2.298487901687622, step = 0
+INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
+INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
+loss = 2.195965051651001, step = 20
+loss = 1.2307546138763428, step = 40
+loss = 0.6579511761665344, step = 60
+loss = 1.0608341693878174, step = 80
+loss = 0.7761049270629883, step = 100
+```
+
+### Straggler Pod
+
+In the experiment, we set the replicas of worker to 4 in a job and
+use chaosblade to load the CPU of `worker-1` to 90% with the command
+
+```bash
+chaosblade-1.7.2/blade create cpu load --cpu-percent 90
+```
+
+If you use the image `registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:torch201-mnist`,
+you can use chaosblade to create a chaos experiment by
+
+```bash
+sh examples/pytorch/mnist/start_chaos.sh cpu-overload
+```
+
+and set the command in the elasticjob yaml like the [example](../../examples/pytorch/mnist/choas_test_job.yaml).
+
+```yaml
+  command:
+    - /bin/bash
+    - -c
+    - "(bash examples/pytorch/mnist/start_chaos.sh cpu-overload &) && \
+      dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
+      --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
+      examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
+      --training_data /data/mnist_png/training/ \
+      --validation_data /data/mnist_png/testing/"
+```
+
+After submitting an ElasticJob to the k8s cluster by
+`kubectl -n dlrover apply -f examples/pytorch/mnist/choas_test_job.yaml`,
+we can see that `worker-1` exits with errors like
+
+```text
+elasticjob-torch-mnist-debug-dlrover-master   0/1     Completed   0     3h17m
+torch-mnist-debug-edljob-worker-0             0/1     Completed   0     3h17m
+torch-mnist-debug-edljob-worker-1             0/1     Error       0     3h17m
+torch-mnist-debug-edljob-worker-2             0/1     Completed   0     3h17m
+torch-mnist-debug-edljob-worker-3             0/1     Completed   0     3h17m
+torch-mnist-debug-edljob-worker-4             0/1     Completed   0     3h10m
+```
+
+From the log of worker-1 by `kubectl -n dlrover logs torch-mnist-debug-edljob-worker-1`,
+worker-1 fails because it is a straggler.
+If we don't want worker-1 to fail because it is a straggler, we can remove the
+`--exclude-straggler` flag from the command, i.e. use `dlrover-run --network-check`.
+
+```text
+[2023-09-26 03:52:20,235] [INFO] [training.py:707:run] Fault nodes are: [] and stragglers are: [1].
+Traceback (most recent call last):
+  File "/usr/local/bin/dlrover-run", line 8, in <module>
+    sys.exit(main())
+  ...
+  File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 733, in run
+    raise RuntimeError("The node is a straggler and exits.")
+RuntimeError: The node is a straggler and exits.
+
+```
+
+We can see the elapsed time of each node on the check task in the master log.
+
+```bash
+kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed
+```
+
+```text
+Round 0: The node elapsed time are {2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}
+Round 1: The node elapsed time are {2: 20.307, 3: 20.265, 0: 23.174, 1: 135.961}
+Round 2: The node elapsed time aree {2: 21.491, 0: 22.685, 3: 20.889, 1: 23.097}
+```
+
+From the log, the elapsed time of worker-1 is much larger than that of the other
+nodes in the first 2 rounds. After worker-1 fails, the ElasticJob relaunches a new
+Pod, worker-4, to replace the failed Pod. After that, the elapsed times of all nodes
+show no significant difference. Note that the index is the `WORKER_RANK` of the node,
+and the `WORKER_RANK` of worker-4 is the same as that of worker-1.
+
+### Fault Node
+
+In the experiment, we set the replicas of worker to 4 in a job and
+use chaosblade to kill the process that runs `run_network_check.py`
+to simulate a fault node.
+
+We set the command in the elasticjob yaml like the [example](../../examples/pytorch/mnist/choas_test_job.yaml).
+
+```yaml
+command:
+  - /bin/bash
+  - -c
+  - "(bash examples/pytorch/mnist/start_chaos.sh kill-process &) && \
+    dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
+    --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
+    examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
+    --training_data /data/mnist_png/training/ \
+    --validation_data /data/mnist_png/testing/"
+```
+
+The Pods of the job are:
+
+```text
+chaos-test-edljob-worker-0              1/1     Running   0     12m
+chaos-test-edljob-worker-1              0/1     Error     0     12m
+chaos-test-edljob-worker-2              1/1     Running   0     12m
+chaos-test-edljob-worker-3              1/1     Running   0     12m
+chaos-test-edljob-worker-4              1/1     Running   0     3m59s
+elasticjob-chaos-test-dlrover-master    1/1     Running   0     12m
+```
+
+The worker-1 fails with the message
+
+```text
+Traceback (most recent call last):
+  ....
+  File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 732, in run
+    raise RuntimeError("The node network is breakdown.")
+RuntimeError: The node network is breakdown.
+```
+
+From the master log by `kubectl -n dlrover logs elasticjob-chaos-test-dlrover-master | grep "The node status"`,
+worker-1 fails in the first 2 rounds of the check. After worker-4 starts to replace
+worker-1, all nodes are normal.
+
+```text
+Round 1: The node status are {1: False, 2: True, 3: True, 0: False}.
+Round 2: The node status are {1: False, 2: True, 3: True, 0: True}.
+Round 3: The node status are {3: True, 0: True, 1: True, 2: True}.
+```
+
+### Network Breakdown
+
+In the experiment, we set the replicas of worker to 4 in a job and
+use chaosblade to set the network loss rate to 100% to simulate
+that the network of the node breaks down.
+
+We watch the log of worker-1 to check whether the training starts.
+The training has started if `loss=..., step=...` appears in the log.
+After the training starts, we set a 100% network loss rate on worker-1
+to simulate that the network of worker-1 breaks down.
+
+```bash
+kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
+./chaosblade-1.7.2/blade create network loss --percent 100 --interface eth0
+```
+
+Then, worker-1 fails and a new worker-4 starts to replace it.
+
+```text
+chaos-test-edljob-worker-0              1/1     Running   0     4m39s
+chaos-test-edljob-worker-1              0/1     Error     0     4m39s
+chaos-test-edljob-worker-2              1/1     Running   0     4m39s
+chaos-test-edljob-worker-3              1/1     Running   0     4m39s
+chaos-test-edljob-worker-4              1/1     Running   0     17s
+elasticjob-chaos-test-dlrover-master    1/1     Running   0     4m43s
+```
+
+The log of worker-0 shows that the training is also restored after worker-4 starts.
+
+```text
+loss = 0.24101698398590088, step = 0
+INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
+INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
+loss = 0.4646361768245697, step = 20
+```
+
+### Training Process Corruption
+
+In the experiment, we set the replicas of worker to 4 in a job and
+kill a training process to simulate a GPU error.
+
+We watch the log of worker-1 to check whether the training starts.
+The training has started if `loss=..., step=...` appears in the log.
+After the training starts, we kill a process in worker-1.
+
+```bash
+kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
+ps -aux | grep cnn_train.py
+```
+
+Then, we can kill a training process by `kill -9 ${PID}`. All workers are still
+running, and we can see from the log that the training restarts.
+
+```text
+chaos-test-edljob-worker-0              1/1     Running   0     3m4s
+chaos-test-edljob-worker-1              1/1     Running   0     3m4s
+chaos-test-edljob-worker-2              1/1     Running   0     3m4s
+chaos-test-edljob-worker-3              1/1     Running   0     3m4s
+elasticjob-chaos-test-dlrover-master    1/1     Running   0     3m9s
+```
+
+### Scale Up Nodes
+
+In the experiment, we use the [example](../../examples/pytorch/mnist/elastic_job.yaml)
+to submit an elastic training job. In the job, we set `min_node=3` and
+`max_node=$WORKER_NUM`, where the ElasticJob sets the number of worker replicas
+into the environment variable `WORKER_NUM`.
+
+At first, there are 3 running workers and 1 pending worker due to insufficient resources.
+
+```text
+elasticjob-torch-mnist-dlrover-master   1/1     Running   0     57s
+torch-mnist-edljob-worker-0             1/1     Running   0     47s
+torch-mnist-edljob-worker-1             1/1     Running   0     47s
+torch-mnist-edljob-worker-2             1/1     Running   0     47s
+torch-mnist-edljob-worker-3             0/1     Pending   0     47s
+```
+
+After about 2 minutes, we can see from the log that the training starts on the
+3 running workers.
+
+```text
+[2023-09-27 02:23:21,097] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
+  restart_count=0
+  master_addr=192.168.0.71
+  master_port=36725
+  group_rank=0
+  group_world_size=3
+  local_ranks=[0, 1]
+  role_ranks=[0, 1]
+  global_ranks=[0, 1]
+  role_world_sizes=[6, 6]
+  global_world_sizes=[6, 6]
+
+rank 1 is initialized local_rank = 1
+loss = 2.3198373317718506, step = 0
+loss = 2.2946105003356934, step = 0
+loss = 1.7543025016784668, step = 20
+```
+
+Then, we kill another job to release resources, and worker-3 starts.
+
+```text
+elasticjob-torch-mnist-dlrover-master   1/1     Running   0     5m39s
+torch-mnist-edljob-worker-0             1/1     Running   0     5m34s
+torch-mnist-edljob-worker-1             1/1     Running   0     5m34s
+torch-mnist-edljob-worker-2             1/1     Running   0     5m34s
+torch-mnist-edljob-worker-3             1/1     Running   0     5m34s
+```
+
+From the log of worker-0, we can see that the training starts with `group_world_size=4`.
+
+```text
+[2023-09-27 02:25:43,362] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
+  restart_count=1
+  master_addr=192.168.0.71
+  master_port=58241
+  group_rank=0
+  group_world_size=4
+  local_ranks=[0, 1]
+  role_ranks=[0, 1]
+  global_ranks=[0, 1]
+  role_world_sizes=[8, 8]
+  global_world_sizes=[8, 8]
+
+rank 1 is initialized local_rank = 1rank 0 is initialized local_rank = 0
+
+loss = 2.2984073162078857, step = 0
+loss = 2.1407980918884277, step = 20
+loss = 1.1324385404586792, step = 40
+loss = 0.4783979058265686, step = 60
+loss = 0.5714012384414673, step = 80
+loss = 0.6941334009170532, step = 100
+```
+
+### Scale Down Nodes
+
+In the experiment, we use the [example](../../examples/pytorch/mnist/elastic_job.yaml)
+to submit an elastic training job. In the job, we set `min_node=3` and
+`max_node=$WORKER_NUM`, where the ElasticJob sets the number of worker replicas
+into the environment variable `WORKER_NUM`.
+
+At first, there are 4 running workers.
+
+```text
+elasticjob-torch-mnist-dlrover-master   1/1     Running   0     2m43s
+torch-mnist-edljob-worker-0             1/1     Running   0     2m38s
+torch-mnist-edljob-worker-1             1/1     Running   0     2m38s
+torch-mnist-edljob-worker-2             1/1     Running   0     2m38s
+torch-mnist-edljob-worker-3             0/1     Running   0     2m38s
+```
+
+Then, we use chaosblade to make worker-1 fail.
+
+```bash
+kubectl -n dlrover exec -it torch-mnist-edljob-worker-1 bash
+./chaosblade-1.7.2/blade create process kill --process dlrover-run --signal 1
+```
+
+After the command, worker-1 exits with an error.
+
+```text
+elasticjob-torch-mnist-dlrover-master   1/1     Running   0     4m43s
+torch-mnist-edljob-worker-0             1/1     Running   0     4m38s
+torch-mnist-edljob-worker-1             0/1     Error     0     4m38s
+torch-mnist-edljob-worker-2             1/1     Running   0     4m38s
+torch-mnist-edljob-worker-3             1/1     Running   0     4m38s
+```
+
+From the log of worker-0, we can see that the training restores the model and the
+data sampler from the checkpoint and starts with `group_world_size=3`.
+
+```text
+[2023-09-27 03:18:00,815] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
+  restart_count=1
+  master_addr=192.168.0.66
+  master_port=39705
+  group_rank=0
+  group_world_size=3
+  local_ranks=[0, 1]
+  role_ranks=[0, 1]
+  global_ranks=[0, 1]
+  role_world_sizes=[6, 6]
+  global_world_sizes=[6, 6]
+
+[2023-09-27 03:18:05,957] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
+[2023-09-27 03:18:05,958] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
+loss = 0.2617453336715698, step = 0
+loss = 0.2548859417438507, step = 20
+```
+
+## Experiments of TensorFlow PS Distributed Job
+
+We conduct experiments with a TensorFlow distributed job using parameter servers (PS)
+to test the fault tolerance of workers and PS.
+
+### Fault-tolerance of Worker
+
+We can submit a TensorFlow PS job using the [example](../../examples/tensorflow/criteo_deeprec/manual_job.yaml).
+The job will launch 1 chief, 1 worker and 1 PS.
+
+```bash
+kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml
+```
+
+The Pods of the job are:
+
+```text
+deepctr-manual-scale-edljob-chief-0              1/1     Running   0     88s
+deepctr-manual-scale-edljob-ps-0                 1/1     Running   0     88s
+deepctr-manual-scale-edljob-worker-0             1/1     Running   0     88s
+elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running   0     99s
+```
+
+We use `kubectl` to kill a worker.
+
+```bash
+kubectl -n dlrover delete pod deepctr-manual-scale-edljob-worker-0
+```
+
+After worker-0 is killed, the job relaunches worker-1 to replace the failed node.
+
+```text
+NAME                                             READY   STATUS    RESTARTS   AGE
+deepctr-manual-scale-edljob-chief-0              1/1     Running   0          2m57s
+deepctr-manual-scale-edljob-ps-0                 1/1     Running   0          2m57s
+deepctr-manual-scale-edljob-worker-1             1/1     Running   0          60s
+elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running   0          3m8s
+```
+
+After the job runs for about 4 minutes, chief-0 fails with OOM due to insufficient
+memory configuration. The job relaunches chief-1 with more memory to recover it.
+
+```text
+deepctr-manual-scale-edljob-chief-0      0/1     OOMKilled   0     4m53s
+deepctr-manual-scale-edljob-chief-1      1/1     Running     0     64s
+deepctr-manual-scale-edljob-ps-0         1/1     Running     0     4m53s
+deepctr-manual-scale-edljob-worker-1     1/1     Running     0     2m56s
+```
+
+We can view the memory of chief-0 and chief-1 by
+
+```bash
+kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-0 -o yaml | grep memory
+
+>>>
+    memory: 4Gi
+    memory: 4Gi
+```
+
+```bash
+kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-1 -o yaml | grep memory
+
+>>>
+    memory: 8Gi
+    memory: 8Gi
+```
+
+We can view the log of chief-1 to check whether the training is restored.
+
+```shell
+[2023-03-20 11:51:10,774] [INFO][session_manager.py:511:_try_run_local_init_op] Running local_init_op.
+[2023-03-20 11:51:11,302] [INFO][session_manager.py:513:_try_run_local_init_op] Done running local_init_op.
+[2023-03-20 11:51:14,279] [INFO][global_step_hook.py:39:before_run] global_step: 126
+```
+
+### Fault-tolerance of PS
+
+We kill ps-0 by `kubectl -n dlrover delete pod deepctr-manual-scale-edljob-ps-0`.
+The job relaunches ps-1 to replace the killed ps-0.
+
+```text
+deepctr-manual-scale-edljob-chief-0              0/1     OOMKilled   0     10m
+deepctr-manual-scale-edljob-chief-1              1/1     Running     0     7m1s
+deepctr-manual-scale-edljob-ps-1                 1/1     Running     0     109s
+deepctr-manual-scale-edljob-worker-1             0/1     OOMKilled   0     8m53s
+deepctr-manual-scale-edljob-worker-2             1/1     Running     0     4m13s
+elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running     0     11m
+```
+
+From the log of the chief, the training job restores the model from the latest
+checkpoint and continues training.
+
+```text
+[2023-09-26 19:24:00,861] [INFO][saver.py:1531:restore] Restoring parameters from /nas/deepctr/model.ckpt-126
+[2023-09-26 19:24:03,473] [INFO][session_manager.py:511:_try_run_local_init_op] Running local_init_op.
+[2023-09-26 19:24:03,580] [INFO] [resource.py:164:report_resource] Report Resource CPU : 0.98, Memory 7146, GPU []
+[2023-09-26 19:24:03,670] [INFO][session_manager.py:513:_try_run_local_init_op] Done running local_init_op.
+[2023-09-26 19:24:07,665] [INFO][basic_session_run_hooks.py:627:_save] Saving checkpoints for 126 into /nas/deepctr/model.ckpt.
+```
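+
+The automatic recovery above relies on standard TensorFlow 1.x checkpoint behavior:
+because the chief creates its session with a `checkpoint_dir`, a relaunched chief
+(or a session rebuilt after a PS restart) loads the latest checkpoint before
+continuing. The snippet below is only a minimal sketch of that mechanism, not the
+actual `criteo_deeprec` training code; the model, optimizer and checkpoint path are
+made-up placeholders.
+
+```python
+import tensorflow as tf
+
+tf.compat.v1.disable_eager_execution()
+
+
+def train_as_chief(server_target, checkpoint_dir="/tmp/deepctr_ckpt"):
+    """Toy chief whose whole training state lives in checkpoint_dir."""
+    global_step = tf.compat.v1.train.get_or_create_global_step()
+    # Toy "model": a single trainable variable, enough to show checkpointing.
+    w = tf.compat.v1.get_variable(
+        "w", shape=[10], initializer=tf.compat.v1.zeros_initializer())
+    loss = tf.reduce_sum(tf.square(w - 1.0))
+    train_op = tf.compat.v1.train.AdagradOptimizer(0.1).minimize(
+        loss, global_step=global_step)
+
+    # MonitoredTrainingSession restores from the latest checkpoint found in
+    # checkpoint_dir, so a relaunched chief resumes from the last saved
+    # global_step instead of training from scratch.
+    with tf.compat.v1.train.MonitoredTrainingSession(
+        master=server_target,
+        is_chief=True,
+        checkpoint_dir=checkpoint_dir,
+        save_checkpoint_secs=60,
+    ) as sess:
+        while not sess.should_stop():
+            sess.run(train_op)
+```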
diff --git a/docs/tutorial/check_node_health.md b/docs/tutorial/check_node_health.md
new file mode 100644
index 000000000..ebc0e4ccd
--- /dev/null
+++ b/docs/tutorial/check_node_health.md
@@ -0,0 +1,115 @@
+# Introduction to Checking Node Health Using DLRover
+
+This document introduces how to use DLRover to find slow or faulty nodes
+in a distributed job on a k8s cluster.
+
+## System Design
+
+In a distributed job, DLRover launches a job master process to group the nodes.
+Before starting the training subprocess, the job master initially divides all nodes
+into multiple worlds in pairs. The two nodes in each world execute an allgather task
+and a matmul task. After the tasks are completed, each node reports its execution time
+to the job master. If a timeout happens in the allgather task, the execution time
+is recorded as 3600s. If no execution time is 3600s, the job master prints the
+execution times in the log, and users can find the slow node from them.
+
+If the execution times of some nodes are 3600s, there may be a faulty node in the
+distributed job. However, we cannot yet diagnose which node is faulty, because the
+execution times of both nodes in a group are 3600s if either of them is faulty.
+Therefore, the job master sorts the nodes by execution time and groups the node with
+the longest execution time together with the node with the shortest execution time.
+All nodes then execute the check task again. According to the execution times of the
+two rounds, the job master diagnoses as faulty the node whose execution time is 3600s
+in both rounds.
+
+For example, if there are 6 nodes in a job, the 1st round of division might be
+[{1,2}, {3,4}, {5,6}]. If allgather fails in the group {5,6}, nodes 5 and 6 are the
+potential faulty nodes. Therefore, the 2nd round of division would be
+[{1,2}, {3,5}, {4,6}]. If {4,6} fails, it indicates that node 6 is the faulty node.
+Node 6 will then exit with an exception.
+
+<div align="center">
+<img src="../figures/ft_llm_training/node_health_check.jpg" alt="Editor">
+</div>
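+
+The two-round grouping and diagnosis described above can be summarized with a short
+sketch. The code below only illustrates the idea in this section and is not DLRover's
+actual implementation; the function names and the `TIMEOUT` constant are assumptions.
+
+```python
+TIMEOUT = 3600.0  # elapsed time reported when the allgather task times out
+
+
+def pair_round1(node_ids):
+    """Round 1: pair nodes in order, e.g. [1, 2, 3, 4, 5, 6] -> [(1, 2), (3, 4), (5, 6)]."""
+    return [(node_ids[i], node_ids[i + 1]) for i in range(0, len(node_ids), 2)]
+
+
+def pair_round2(elapsed):
+    """Round 2: sort nodes by elapsed time and pair the slowest with the fastest,
+    so the two suspects of round 1 never stay in the same group."""
+    ranked = sorted(elapsed, key=elapsed.get)  # fastest ... slowest
+    return [(ranked[i], ranked[-(i + 1)]) for i in range(len(ranked) // 2)]
+
+
+def diagnose(elapsed_round1, elapsed_round2):
+    """A node is diagnosed as faulty only if it timed out in both rounds."""
+    return [
+        node
+        for node in elapsed_round1
+        if elapsed_round1[node] >= TIMEOUT and elapsed_round2[node] >= TIMEOUT
+    ]
+
+
+# Example matching the text: node 6 is faulty, so group {5, 6} times out in
+# round 1 and group {4, 6} times out in round 2.
+round1 = {1: 21.3, 2: 20.9, 3: 22.0, 4: 21.1, 5: 3600.0, 6: 3600.0}
+round2 = {1: 21.5, 2: 21.0, 3: 20.8, 4: 3600.0, 5: 22.4, 6: 3600.0}
+print(diagnose(round1, round2))  # -> [6]
+```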
+
+## Check Node Health in a Distributed Job with dlrover-run
+
+We can use `dlrover-run` in place of `torchrun` to check node health in
+a distributed PyTorch job. `dlrover-run` is developed based on `torchrun`
+and supports all arguments of `torchrun`.
+
+### Check Node Health in a DLRover ElasticJob
+
+A DLRover ElasticJob can relaunch a new Pod to replace a faulty Pod and can
+scale Pods up/down if the number of available nodes changes during the training.
+If you can deploy a k8s CRD on the cluster, we strongly suggest you deploy the
+ElasticJob CRD on the cluster by the following steps.
+
+```bash
+git clone git@github.com:intelligent-machine-learning/dlrover.git
+cd dlrover/go/operator/
+make deploy IMG=easydl/elasticjob-controller:master
+# Grant permission for the DLRover master to Access CRDs.
+kubectl -n dlrover apply -f config/manifests/bases/default-role.yaml
+```
+
+After deploying the ElasticJob CRD, we can submit a distributed job with
+`dlrover-run --network-check` in the job yaml like
+
+```yaml
+command:
+  - /bin/bash
+  - -c
+  - "dlrover-run --network-check --nnodes=3:$NODE_NUM --nproc_per_node=2 --max_restarts=3 \
+    examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
+    --training_data /data/mnist_png/training/ \
+    --validation_data /data/mnist_png/testing/"
+```
+
+Then, we submit the job yaml to the cluster like
+
+```bash
+kubectl -n dlrover apply -f examples/pytorch/mnist/elastic_job.yaml
+```
+
+After the job starts, we can check the execution time of each node on the check task
+in the log of the job master. The name of the job master Pod looks like
+"elasticjob-{JOB-NAME}-dlrover-master". We can find the execution time of each node
+by the command
+
+```bash
+kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed
+```
+
+```text
+Round 0: The node elapsed time are {2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}
+Round 1: The node elapsed time are {2: 20.307, 3: 20.265, 0: 23.174, 1: 135.961}
+Round 2: The node elapsed time aree {2: 21.491, 0: 22.685, 3: 20.889, 1: 23.097}
+```
+
+In the [experiment](../tech_report/fault_tolerance_exps.md) to find the straggler node,
+the execution time of node 1 is much larger than that of the others, so node 1 is
+probably a slow node.
+
+### Check Node Health in a Kubeflow/PyTorch Job
+
+If you cannot deploy an ElasticJob CRD on the k8s cluster, you can also use
+`dlrover-run` in a [Kubeflow/PyTorch](https://www.kubeflow.org/docs/components/training/pytorch/)
+job to check the node health. The Kubeflow/PyTorch job is commonly used on k8s
+clusters to manage PyTorch distributed jobs. We only need to set `NODE_RANK` and
+`DLROVER_MASTER_ADDR` before `dlrover-run`, and `dlrover-run` will launch a job
+master process in the master Pod of the PyTorch job. For example, the `PyTorchJob`
+already sets `RANK`, `MASTER_ADDR` and `MASTER_PORT` as environment variables,
+so we can run `dlrover-run` like
+
+```bash
+NODE_RANK=$RANK DLROVER_MASTER_ADDR=$MASTER_ADDR:$MASTER_PORT \
+dlrover-run --standalone --network-check --nnodes=$NODE_NUM --nproc_per_node=$NUM_TRAINERS train_script.py
+```
+
+Then, we can search the execution times in the log of the master Pod of the PyTorchJob.
+
+```bash
+kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed
+```
+
+We can find the slow or faulty node by the execution time.
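+
+If you prefer not to read the elapsed times by eye, the log lines above can be parsed
+and screened automatically. The following is a small, hypothetical helper (not part of
+DLRover) that flags the nodes whose elapsed time is far above the fastest node of the
+same round; the 3x threshold is an arbitrary assumption.
+
+```python
+import ast
+import re
+import sys
+
+# Matches lines like:
+# Round 0: The node elapsed time are {2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}
+LINE_RE = re.compile(r"Round (\d+): The node elapsed time \w+ (\{.*\})")
+
+
+def find_slow_nodes(log_text, ratio=3.0):
+    """Return {round_id: [node_rank, ...]} for nodes whose elapsed time is more
+    than `ratio` times the fastest node of the same round. This is only a rough
+    heuristic for reading the logs, not DLRover's own straggler criterion."""
+    slow = {}
+    for match in LINE_RE.finditer(log_text):
+        round_id = int(match.group(1))
+        elapsed = ast.literal_eval(match.group(2))
+        fastest = min(elapsed.values())
+        suspects = [rank for rank, t in elapsed.items() if t > ratio * fastest]
+        if suspects:
+            slow[round_id] = suspects
+    return slow
+
+
+if __name__ == "__main__":
+    # Example: kubectl -n dlrover logs <job-master-pod> | grep elapsed | python find_slow_nodes.py
+    print(find_slow_nodes(sys.stdin.read()))
+```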
diff --git a/examples/pytorch/nanogpt/README.md b/examples/pytorch/nanogpt/README.md
index 947b8d0fc..a94377b39 100644
--- a/examples/pytorch/nanogpt/README.md
+++ b/examples/pytorch/nanogpt/README.md
@@ -62,8 +62,6 @@ use the command in the containers of worker.
 ```bash
 dlrover-run --nnodes=$WORKER_NUM \
     --nproc_per_node=${GPU_NUM_PER_NODE} --max_restarts=1 \
-    ./examples/pytorch/nanogpt/train.py \
-    --data_dir '/data/nanogpt/' \
-    --gradient_accumulation_steps $((WORKER_NUM * 1)) \
-    --use_fsdp
+    ./examples/pytorch/nanogpt/fsdp_train.py \
+    --data_dir '/data/nanogpt/' --epochs 50 --checkpoint_step 50 \
 ```
diff --git a/examples/pytorch/nanogpt/ds_train.py b/examples/pytorch/nanogpt/ds_train.py
index 3b37bbfc8..1f0c3703c 100644
--- a/examples/pytorch/nanogpt/ds_train.py
+++ b/examples/pytorch/nanogpt/ds_train.py
@@ -17,8 +17,7 @@
 dlrover-run --nnodes=1 --max_restarts=2 --nproc_per_node=2 \
     ds_train.py --n_layer 36 --n_head 20 --n_embd 1280 \
-    --data_dir './' \
-    --device cuda --ds_config ./ds_config.json \
+    --data_dir './' --ds_config ./ds_config.json \
     --epochs 50 --checkpoint_step 50
 """