# AllReduce Training Using DLRover on Public Cloud

This document explains how to run a DLRover elastic job with torchrun
on a public cloud, using Alibaba Cloud Container Service for Kubernetes (ACK) as the example.

## Preliminary

- Install Go 1.18.
- Create a Kubernetes cluster on [ACK](https://help.aliyun.com/document_detail/309552.htm?spm=a2c4g.11186623.0.0.168f6b7aegH7nI#task-2112671).
- Configure cluster credentials on your local computer.
- Create a [NAS](https://help.aliyun.com/document_detail/477380.html?spm=a2c4g.11186623.0.0.10635c83Xn7Tkh)
file system and mount it to the cluster.

If you do not have a Kubernetes cluster on a public cloud, you can instead start
a local Kubernetes cluster with [Minikube](https://minikube.sigs.k8s.io/docs/start/).
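
Before continuing, it can help to confirm that the command-line tools used in this guide are installed. The following is a minimal sketch, not part of DLRover: `check_tools` is a hypothetical helper name, and the tool list in the usage comment is an assumption based on the commands that appear in this document.

```bash
# Report whether each named command-line tool is on PATH.
# Prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (tool list assumed from the commands in this guide):
#   check_tools go git make kubectl
```

Running `go version` and `kubectl cluster-info` afterwards shows the installed Go version and whether the cluster credentials work.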

## Deploy the ElasticJob CRD on the Kubernetes Cluster

1. Clone the repo to your host.

```bash
git clone git@github.com:intelligent-machine-learning/dlrover.git
```

2. Deploy the controller on the cluster.

```bash
cd dlrover/dlrover/go/operator/
make deploy IMG=easydl/elasticjob-controller:master  # requires Go 1.18
```

3. Grant permission for the DLRover master to access CRDs.

```bash
kubectl -n dlrover apply -f config/manifests/bases/default-role.yaml
```
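
To check that the deployment steps above took effect, one option is to look for the ElasticJob CRD among the registered CRDs. This is a sketch under assumptions: `crd_installed` is a hypothetical helper, and it only assumes that the CRD name contains the string `elasticjob`, without relying on the exact name.

```bash
# Succeed if a registered CRD has "elasticjob" in its name (case-insensitive).
# Assumption: the ElasticJob CRD name contains that string.
crd_installed() {
  kubectl get crd -o name | grep -qi "elasticjob"
}

# Usage:
#   crd_installed && echo "ElasticJob CRD is registered"
#   kubectl -n dlrover get pods   # the controller Pod should be listed here
```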

## Submit a Job

- Submit a job to train a CNN model on the MNIST dataset. The manifest path is relative to the repository root.

```bash
kubectl -n dlrover apply -f examples/pytorch/mnist/elastic_job.yaml
```

- Check the job status.

```bash
kubectl -n dlrover get elasticjob torch-mnist
```

```text
NAME          PHASE     AGE
torch-mnist   Running   19h
```
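
In a script, the status check above can be repeated until the job reaches a given phase. The following is a minimal sketch: `wait_for_phase` is a hypothetical helper, and it assumes that the PHASE column is backed by the `.status.phase` field of the ElasticJob resource, which this document does not confirm.

```bash
# Poll an ElasticJob until its phase equals the wanted value.
# Arguments: job name, wanted phase, max attempts (default 30),
# delay in seconds between attempts (default 10).
# Assumption: the phase is stored at .status.phase.
wait_for_phase() {
  job=$1
  want=$2
  tries=${3:-30}
  delay=${4:-10}
  i=0
  phase=""
  while [ "$i" -lt "$tries" ]; do
    phase=$(kubectl -n dlrover get elasticjob "$job" \
      -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    if [ "$phase" = "$want" ]; then
      echo "$job is $phase"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$job did not reach $want (last phase: ${phase:-unknown})" >&2
  return 1
}

# Usage:
#   wait_for_phase torch-mnist Running
```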

- Check the Pod status.

```bash
kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
```

```text
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          26s
torch-mnist-edljob-worker-0             1/1     Running   0          29s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
```

View the training log of a worker with:

```bash
kubectl -n dlrover logs torch-mnist-edljob-worker-0
```

```text
loss = 0.016916541382670403, step = 400
Save checkpoint.
loss = 0.05502168834209442, step = 420
loss = 0.13794168829917908, step = 440
loss = 0.023234723135828972, step = 460
Test model after epoch 18
Test the model ...

Test set: Average loss: 0.0499, Accuracy: 9828/10000 (98%)
```
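
To follow how the loss changes, the `loss = ..., step = ...` lines can be pulled out of the log. This is a sketch that assumes the log format matches the sample output above exactly; `loss_by_step` is a hypothetical helper name.

```bash
# Convert "loss = <value>, step = <n>" lines read from stdin
# into "<n> <value>" pairs, one per line.
# Lines in any other format (checkpoint and test messages) are skipped.
loss_by_step() {
  sed -n 's/^loss = \([^,]*\), step = \([0-9][0-9]*\)$/\2 \1/p'
}

# Usage:
#   kubectl -n dlrover logs torch-mnist-edljob-worker-0 | loss_by_step
```

Adding `-f` to `kubectl logs` streams new log lines as the worker writes them.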

## Test Fault-tolerance

- Delete a worker.

```bash
kubectl -n dlrover delete pod torch-mnist-edljob-worker-1
```

Listing the Pods again shows that only one worker remains.

```text
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m12s
torch-mnist-edljob-worker-0             1/1     Running   0          1m15s
```

After a while, DLRover restores the deleted worker.

```text
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m52s
torch-mnist-edljob-worker-0             1/1     Running   0          1m55s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
```
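
The recovery can also be checked from a script by counting the Pods of the job that are in the `Running` state. This is a minimal sketch: `running_pods` is a hypothetical helper, and it assumes the default `kubectl get pods` column layout shown above, where STATUS is the third column.

```bash
# Print the number of Pods of the given job whose STATUS is Running.
# Assumption: STATUS is the third column of the default kubectl output.
running_pods() {
  kubectl -n dlrover get pods -l "elasticjob-name=$1" --no-headers |
    awk '$3 == "Running" { n++ } END { print n + 0 }'
}

# Usage:
#   running_pods torch-mnist
```

With the Pods listed above, the count includes the master as well as the workers. To watch Pod changes interactively instead, add `-w` to the `kubectl get pods` command.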