# Introduction to Checking Node Health Using DLRover

This document introduces how to use DLRover to check for slow or faulty nodes in a distributed job on a k8s cluster.

## System Design

In a distributed job, DLRover launches a job master process to group the nodes. Before starting the training subprocesses, the job master divides all nodes into pairs, and the two nodes in each pair execute an allgather task and a matmul task. After the tasks are completed, each node reports its execution time to the job master. If a timeout happens in the allgather task, the execution time is recorded as 3600s.

If no node's execution time is 3600s, the job master prints the execution times in its log, and users can find the slow node from them. If the execution times of some nodes are 3600s, there may be a faulty node in the distributed job. However, the job master cannot yet diagnose which node is faulty, because both nodes in a pair report 3600s for the allgather task if either of them is faulty. In this case, the job master sorts the nodes by execution time and pairs the node with the longest execution time with the node with the shortest execution time. All nodes then execute the check task again. From the execution times of the two rounds, the job master diagnoses as faulty any node whose execution time is 3600s in both rounds.

For example, if there are 6 nodes in a job, the 1st round of pairing might be [{1,2}, {3,4}, {5,6}]. If allgather fails in the pair {5,6}, nodes 5 and 6 are the potential faulty nodes. Therefore, the 2nd round of pairing would be [{1,2}, {3,5}, {4,6}]. If {4,6} fails, it indicates that node 6 is the faulty node, and node 6 will exit with an exception.
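The following Python sketch illustrates the two-round pairing and diagnosis logic described above. It is not DLRover's actual implementation; the function names, the `TIMEOUT` constant, and the sample timings are illustrative assumptions.

```python
# A minimal sketch of the two-round pairing and diagnosis logic described
# above. Illustrative only, not DLRover's real implementation.
TIMEOUT = 3600.0  # reported when the allgather task times out


def pair_by_rank(nodes):
    """Round 1: pair nodes in order, e.g. [1,2,3,4,5,6] -> [(1,2), (3,4), (5,6)]."""
    return [tuple(nodes[i:i + 2]) for i in range(0, len(nodes), 2)]


def pair_longest_with_shortest(elapsed):
    """Round 2: sort nodes by elapsed time and pair the slowest with the fastest."""
    ranked = sorted(elapsed, key=elapsed.get)
    pairs = []
    while len(ranked) > 1:
        pairs.append((ranked.pop(0), ranked.pop(-1)))
    return pairs


def diagnose_faulty_nodes(round0_elapsed, round1_elapsed):
    """A node is diagnosed as faulty if it times out in both rounds."""
    return [
        node
        for node in round0_elapsed
        if round0_elapsed[node] >= TIMEOUT and round1_elapsed.get(node, 0) >= TIMEOUT
    ]


if __name__ == "__main__":
    # Round 1: the pair {5, 6} times out, so both are potential faulty nodes.
    round0 = {1: 20.3, 2: 20.2, 3: 21.0, 4: 20.8, 5: 3600.0, 6: 3600.0}
    # Round 2 re-pairs the slowest with the fastest; only node 6's pair times out again.
    round1 = {1: 20.5, 2: 20.4, 3: 21.2, 5: 21.5, 4: 3600.0, 6: 3600.0}
    print(pair_longest_with_shortest(round0))  # [(2, 6), (1, 5), (4, 3)]
    print(diagnose_faulty_nodes(round0, round1))  # [6]
```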
## Check Node Health in a Distributed Job with dlrover-run

We can use `dlrover-run` in place of `torchrun` to check node health in a distributed PyTorch job. `dlrover-run` is developed on top of `torchrun` and supports all of `torchrun`'s arguments.

### Check Node Health in a DLRover ElasticJob

A DLRover ElasticJob can relaunch a new Pod to replace a faulty Pod and can scale Pods up or down if the number of available nodes changes during training. If you are able to deploy a k8s CRD on the cluster, we strongly suggest deploying the ElasticJob CRD with the following steps.

```bash
git clone git@github.com:intelligent-machine-learning/dlrover.git
cd dlrover/go/operator/
make deploy IMG=easydl/elasticjob-controller:master

# Grant permission for the DLRover master to access CRDs.
kubectl -n dlrover apply -f config/manifests/bases/default-role.yaml
```

After deploying the ElasticJob CRD, we can submit a distributed job with `dlrover-run --network-check` in the job YAML, like

```yaml
command:
  - /bin/bash
  - -c
  - "dlrover-run --network-check --nnodes=3:$NODE_NUM --nproc_per_node=2 --max_restarts=3 \
    examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
    --training_data /data/mnist_png/training/ \
    --validation_data /data/mnist_png/testing/"
```

Then, we submit the job YAML to the cluster:

```bash
kubectl -n dlrover apply -f examples/pytorch/mnist/elastic_job.yaml
```

After the job starts, we can check how long each node took to run the check task in the log of the job master. The name of the job master Pod looks like "elasticjob-{JOB-NAME}-dlrover-master". We can find the execution time of each node with the command

```bash
kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed
```

```text
Round 0: The node elapsed time are {2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}
Round 1: The node elapsed time are {2: 20.307, 3: 20.265, 0: 23.174, 1: 135.961}
Round 2: The node elapsed time are {2: 21.491, 0: 22.685, 3: 20.889, 1: 23.097}
```

In the [experiment](../tech_report/fault_tolerance_exps.md) to find the straggler node, the execution time of node 1 is much larger than the others, so node 1 is probably a slow node.

### Check Node Health in a Kubeflow/PyTorch Job

If you cannot deploy the ElasticJob CRD on the k8s cluster, you can also use `dlrover-run` in a [Kubeflow/PyTorch](https://www.kubeflow.org/docs/components/training/pytorch/) job to check node health. The Kubeflow/PyTorch job is commonly used on k8s clusters to manage PyTorch distributed jobs. We only need to set `NODE_RANK` and `DLROVER_MASTER_ADDR` before `dlrover-run`, and `dlrover-run` will launch a job master process in the master Pod of the PyTorch job. For example, `PyTorchJob` sets `RANK`, `MASTER_ADDR`, and `MASTER_PORT` as environment variables, so we can run `dlrover-run` like

```bash
NODE_RANK=$RANK DLROVER_MASTER_ADDR=$MASTER_ADDR:$MASTER_PORT \
dlrover-run --standalone --network-check --nnodes=$NODE_NUM --nproc_per_node=$NUM_TRAINERS train_script.py
```

Then, we can search for the execution times in the log of the master Pod of the PyTorchJob.

```bash
kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed
```

We can find the slow or faulty node from the execution times.
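As an illustration only, the following Python sketch shows one way to parse the `elapsed` log lines shown above and flag a straggler. The script name `check_elapsed.py`, the helper functions `parse_rounds` and `find_stragglers`, and the twice-the-median threshold are all assumptions, not part of DLRover.

```python
# A hypothetical helper (not part of DLRover) that parses the "elapsed" log
# lines shown above and flags nodes that are far slower than the median.
import ast
import re
import statistics
import sys

LINE_RE = re.compile(r"Round (\d+): The node elapsed time are (\{.*\})")


def parse_rounds(log_text):
    """Return {round_id: {node_rank: elapsed_seconds}} parsed from the master log."""
    rounds = {}
    for match in LINE_RE.finditer(log_text):
        rounds[int(match.group(1))] = ast.literal_eval(match.group(2))
    return rounds


def find_stragglers(elapsed, ratio=2.0):
    """Flag nodes whose elapsed time exceeds `ratio` times the median (arbitrary threshold)."""
    median = statistics.median(elapsed.values())
    return [node for node, t in elapsed.items() if t > ratio * median]


if __name__ == "__main__":
    # Pipe in the grep output, e.g.:
    #   kubectl -n dlrover logs <job-master-pod> | grep elapsed | python check_elapsed.py
    for round_id, elapsed in sorted(parse_rounds(sys.stdin.read()).items()):
        print(f"Round {round_id}: stragglers = {find_stragglers(elapsed)}")
```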