Skip to content

Commit

Permalink
Add tutorial to enable GPU support on minikube cluster. (intelligent-…
Browse files Browse the repository at this point in the history
…machine-learning#383)

* Minikube gpu enable support added.

* add nvida-plugin-config.

* Add time-slicing to get more gpu resource.

* fix job yaml

* Add helm command to update nvidia-k8s
-device-plugin.

* Fix formatting issues.

* remove nvidia.com/gpu from torch_mnist_master_job
to make gpu optional.
  • Loading branch information
Antlera committed Apr 9, 2023
1 parent 9fe1fad commit 6043ada
Show file tree
Hide file tree
Showing 3 changed files with 110 additions and 7 deletions.
10 changes: 5 additions & 5 deletions dlrover/examples/torch_mnist_master_backend_job.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -21,11 +21,11 @@ spec:
- /bin/bash
- -c
- "python -m dlrover.python.elastic_agent.torch.prepare \
&& torchrun --nnodes=1:$WORKER_NUM --nproc_per_node=1
--max_restarts=3 --rdzv_backend=dlrover-master \
model_zoo/pytorch/mnist_cnn.py \
--training_data /data/mnist_png/training/elastic_ds.txt \
--validation_data /data/mnist_png/testing/elastic_ds.txt"
&& torchrun --nnodes=1:$WORKER_NUM --nproc_per_node=1
--max_restarts=3 --rdzv_backend=dlrover-master \
model_zoo/pytorch/mnist_cnn.py \
--training_data /data/mnist_png/training/elastic_ds.txt \
--validation_data /data/mnist_png/testing/elastic_ds.txt"
resources:
limits:
cpu: "1"
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 2
101 changes: 99 additions & 2 deletions docs/developer_guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,15 +36,112 @@ go mod vendor
```

## Running the Operator Locally

Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.

### 1. Preliminary

#### Minikube Install

Install [minikube](https://kubernetes.io/docs/tasks/tools/) on your loptop.
And you can start minikube by the command

#### Minikube with GPU Support

To enable GPU support, follow the doc as follows:

- install [containerd](https://github.com/containerd/containerd/blob/main/docs/getting-started.md)
and [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)
- Enable [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)

- Test your GPU by the official [gpu-pod](https://github.com/NVIDIA/k8s-device-plugin#running-gpu-jobs)

It is highly recommended to have more than one GPU resource in your workspace.

However, there is still a workaround to divide your singie GPU resource into multiple ones.

For this, enable [shared-access-to-gpus with CUDA Time-Slicing](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) to get more GPU resources.

Check the doc and modify your ``nvidia-k8s-device-plugin`` or simply update the plugin by ``helm`` with the command

```bash
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.13.0 \
--namespace nvidia-device-plugin \
--create-namespace \
--set-file config.map.config=./dlrover/go/operator/config/gpu
```

Then test your GPU resources by

```bash
$ kubectl get nodes -ojson | jq .items[].status.capacity
>
{
"cpu": "8",
"ephemeral-storage": "229336240Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "32596536Ki",
"nvidia.com/gpu": "2", # create one more gpu resource on your laptop
"pods": "110"
}
```

Deploy this deployment to test your GPU resources.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-gpu
spec:
replicas: 2 # replace this to your amount of GPU resources
selector:
matchLabels:
app: test-gpu
template:
metadata:
labels:
app: test-gpu
spec:
containers:
- name: cuda-container
image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
```

```bash
NAME READY STATUS RESTARTS AGE
dlrover-controller-manager-6c464d59f8-np7tg 2/2 Running 0 55m
test-gpu-59c9677b99-qtxbv 0/1 Completed 2 (24s ago) 27s
test-gpu-59c9677b99-sxd6n 0/1 Completed 2 (24s ago) 27s

$ kubectl logs test-gpu-59c9677b99-qtxbv
>
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

#### Start Your Minikube

After preparing your minikube cluster you can start minikube with the command:

```bash
minikube start --vm-driver=docker --cpus 6 --memory 6144

# If you wish to run minikube with GPUs, recommended commands are as follows.(root privilege requried)

minikube start --driver=none --container-runtime='containerd' --apiserver-ips 127.0.0.1 --apiserver-name localhost --cpus 6 --memory 6144
```

### Configure KUBECONFIG and KUBEFLOW_NAMESPACE
Expand All @@ -59,7 +156,6 @@ export KUBEFLOW_NAMESPACE=$(your_namespace)

- KUBEFLOW_NAMESPACE is used when deployed on Kubernetes, we use this variable to create other resources (e.g. the resource lock) internal in the same namespace. It is optional, use `default` namespace if not set.


### 2. Run ElasticJob Controller

We can run the ElasticJob in the terminal or deploy the controller with
Expand Down Expand Up @@ -114,6 +210,7 @@ elasticjob-sample-edljob-worker-1 1/1 Running 0 2m42s
```

### 6. Create a release

Change pip version and docker image tag when creating a new release.

## Go version
Expand Down

0 comments on commit 6043ada

Please sign in to comment.