AllReduce Training Using DLRover on Public Cloud

This document explains how to run a DLRover elastic job using torchrun on a public cloud, specifically Alibaba Cloud Container Service for Kubernetes (ACK).

Prerequisites

  • Create a Kubernetes cluster on ACK.
  • Configure cluster credentials on your local computer.
  • Create NAS storage and mount it to the cluster.
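
Before moving on, it can help to confirm that the local credentials actually reach the cluster. The following is a minimal sketch, not part of DLRover: the helper name `check_cluster_access` is hypothetical, and listing PersistentVolumes assumes the NAS storage is exposed to the cluster as a PersistentVolume, which may not match your setup.

```shell
# Hypothetical helper (not part of DLRover): sanity-check cluster access.
check_cluster_access() {
  local ctx
  # Fails if no kubeconfig context is configured on this machine.
  ctx="$(kubectl config current-context)" || return 1
  echo "context: ${ctx}"
  # Fails if the credentials cannot reach the API server.
  kubectl get nodes || return 1
  # Assumption: the NAS storage shows up as a PersistentVolume.
  kubectl get pv
}
```

Calling `check_cluster_access` prints the active context, the node list, and any PersistentVolumes; a non-zero exit status suggests the credentials or the storage mount need another look.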

Deploy the ElasticJob CRD on ACK

  1. Deploy the controller on the cluster.
make deploy IMG=easydl/elasticjob-controller:master
  2. Grant permission for the DLRover master to access CRDs.
kubectl -n dlrover apply -f dlrover/go/operator/config/rbac/default_role.yaml 
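
To confirm that the two steps above took effect, a check along these lines may be useful. This is a sketch under stated assumptions: the helper name `check_elasticjob_install` is hypothetical, the `dlrover` namespace is taken from the commands in this document, and matching the CRD by the substring `elasticjob` is a guess about how the CRD is named.

```shell
# Hypothetical helper (not part of DLRover): verify the CRD and controller.
check_elasticjob_install() {
  local ns="${1:-dlrover}"
  # Assumption: the ElasticJob CRD name contains "elasticjob".
  if ! kubectl get crd -o name | grep -qi "elasticjob"; then
    echo "ElasticJob CRD not found" >&2
    return 1
  fi
  # List Pods in the namespace so the controller Pod status can be inspected.
  kubectl -n "$ns" get pods
}
```

Run `check_elasticjob_install` with no arguments to use the `dlrover` namespace, or pass another namespace as the first argument.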

Submit a Job

  • Submit a job to train a CNN model on the MNIST dataset.
kubectl -n dlrover apply -f dlrover/examples/torch_mnist_master_backend_job.yaml
  • Check the job status.
kubectl -n dlrover get elasticjob torch-mnist 
NAME          PHASE     AGE
torch-mnist   Running   19h
  • Check the Pod status.
kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          26s
torch-mnist-edljob-worker-0             1/1     Running   0          29s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
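
Instead of re-running the Pod listing by hand, one option is to block until every Pod of the job reports Ready. The sketch below wraps `kubectl wait`; the helper name `wait_for_job_pods` and the 300s default timeout are illustrative choices, while the `elasticjob-name` label and `dlrover` namespace come from the commands above.

```shell
# Hypothetical helper (not part of DLRover): block until all Pods of the
# job report the Ready condition, or the timeout expires.
wait_for_job_pods() {
  local job="${1:-torch-mnist}"
  local ns="${2:-dlrover}"
  local timeout="${3:-300s}"   # illustrative default, adjust as needed
  kubectl -n "$ns" wait pod \
    -l "elasticjob-name=${job}" \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

For example, `wait_for_job_pods torch-mnist dlrover 120s` returns non-zero if the Pods are not Ready within two minutes.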

Test Fault-tolerance

  • Delete a worker.
kubectl -n dlrover delete pod torch-mnist-edljob-worker-1

Then, we can see that there is only one worker left.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m12s
torch-mnist-edljob-worker-0             1/1     Running   0          1m15s

After a while, DLRover restores the deleted worker.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m52s
torch-mnist-edljob-worker-0             1/1     Running   0          1m55s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
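
The fault-tolerance check above can also be scripted. The following is a rough sketch rather than an official DLRover tool: both helper names are hypothetical, the retry count and polling interval are arbitrary defaults, and recovery is judged only by the number of Pods in the Running phase, which says nothing about whether training has actually resumed.

```shell
# Hypothetical helper: count Pods of the job that are in the Running phase.
count_running_pods() {
  local job="${1:-torch-mnist}"
  local ns="${2:-dlrover}"
  kubectl -n "$ns" get pods -l "elasticjob-name=${job}" \
    --field-selector=status.phase=Running --no-headers 2>/dev/null \
    | grep -c .
}

# Hypothetical helper: delete one worker Pod, then poll until the number of
# Running Pods is back to what it was before the deletion.
test_worker_recovery() {
  local worker="$1"
  local job="${2:-torch-mnist}"
  local ns="${3:-dlrover}"
  local retries="${4:-30}"    # arbitrary default
  local interval="${5:-10}"   # seconds between polls, arbitrary default
  local expected
  local i=0
  expected="$(count_running_pods "$job" "$ns")"
  kubectl -n "$ns" delete pod "$worker" || return 1
  while [ "$i" -lt "$retries" ]; do
    sleep "$interval"
    if [ "$(count_running_pods "$job" "$ns")" -ge "$expected" ]; then
      echo "recovered: ${expected} pods running"
      return 0
    fi
    i=$((i + 1))
  done
  echo "not recovered after $((retries * interval))s" >&2
  return 1
}
```

A call such as `test_worker_recovery torch-mnist-edljob-worker-1` deletes the worker and exits with status 0 once the Pod count is restored. Counting Pods rather than matching a name keeps the check valid even if the replacement worker comes back under a different name.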