Add tutorial to enable GPU support on minikube cluster. (intelligent-…

…machine-learning#383) * Minikube gpu enable support added. * add nvida-plugin-config. * Add time-slicing to get more gpu resource. * fix job yaml * Add helm command to update nvidia-k8s -device-plugin. * Fix formatting issues. * remove nvidia.com/gpu from torch_mnist_master_job to make gpu optional.
BalaBalaYi · Apr 9, 2023 · 6043ada · 6043ada
1 parent 9fe1fad
commit 6043ada
Show file tree

Hide file tree

Showing 3 changed files with 110 additions and 7 deletions.
diff --git a/dlrover/examples/torch_mnist_master_backend_job.yaml b/dlrover/examples/torch_mnist_master_backend_job.yaml
@@ -21,11 +21,11 @@ spec:
  - /bin/bash
  - -c
  - "python -m dlrover.python.elastic_agent.torch.prepare \
- && torchrun --nnodes=1:$WORKER_NUM --nproc_per_node=1
- --max_restarts=3 --rdzv_backend=dlrover-master \
- model_zoo/pytorch/mnist_cnn.py \
- --training_data /data/mnist_png/training/elastic_ds.txt \
- --validation_data /data/mnist_png/testing/elastic_ds.txt"
+  && torchrun --nnodes=1:$WORKER_NUM --nproc_per_node=1
+  --max_restarts=3 --rdzv_backend=dlrover-master \
+  model_zoo/pytorch/mnist_cnn.py \
+  --training_data /data/mnist_png/training/elastic_ds.txt \
+  --validation_data /data/mnist_png/testing/elastic_ds.txt"
  resources:
  limits:
  cpu: "1"

diff --git a/dlrover/go/operator/config/gpu/nvidia-device-plugin-gpu-shared.yaml b/dlrover/go/operator/config/gpu/nvidia-device-plugin-gpu-shared.yaml
@@ -0,0 +1,6 @@
+version: v1
+sharing:
+ timeSlicing:
+ resources:
+ - name: nvidia.com/gpu
+ replicas: 2
diff --git a/docs/developer_guide.md b/docs/developer_guide.md
@@ -36,15 +36,112 @@ go mod vendor
 ```
 
 ## Running the Operator Locally
+
 Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.
 
 ### 1. Preliminary
 
+#### Minikube Install
+
 Install [minikube](https://kubernetes.io/docs/tasks/tools/) on your loptop.
-And you can start minikube by the command
+
+#### Minikube with GPU Support
+
+To enable GPU support, follow the doc as follows:
+
+- install [containerd](https://github.com/containerd/containerd/blob/main/docs/getting-started.md)
+and [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)
+- Enable [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
+
+- Test your GPU by the official [gpu-pod](https://github.com/NVIDIA/k8s-device-plugin#running-gpu-jobs)
+
+It is highly recommended to have more than one GPU resource in your workspace.
+
+However, there is still a workaround to divide your singie GPU resource into multiple ones.
+
+For this, enable [shared-access-to-gpus with CUDA Time-Slicing](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) to get more GPU resources.
+
+Check the doc and modify your ``nvidia-k8s-device-plugin`` or simply update the plugin by ``helm`` with the command
+
+```bash
+$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
+ --version=0.13.0 \
+ --namespace nvidia-device-plugin \
+ --create-namespace \
+ --set-file config.map.config=./dlrover/go/operator/config/gpu
+```
+
+Then test your GPU resources by
+
+```bash
+$ kubectl get nodes -ojson | jq .items[].status.capacity
+>
+{
+ "cpu": "8",
+ "ephemeral-storage": "229336240Ki",
+ "hugepages-1Gi": "0",
+ "hugepages-2Mi": "0",
+ "memory": "32596536Ki",
+ "nvidia.com/gpu": "2", # create one more gpu resource on your laptop
+ "pods": "110"
+}
+```
+
+Deploy this deployment to test your GPU resources.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: test-gpu
+spec:
+ replicas: 2 # replace this to your amount of GPU resources
+ selector:
+ matchLabels:
+ app: test-gpu
+ template:
+ metadata:
+ labels:
+ app: test-gpu
+ spec:
+ containers:
+ - name: cuda-container
+ image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
+ resources:
+ limits:
+ nvidia.com/gpu: 1 # requesting 1 GPU
+ tolerations:
+ - key: nvidia.com/gpu
+ operator: Exists
+ effect: NoSchedule
+```
+
+```bash
+NAME READY STATUS RESTARTS AGE
+dlrover-controller-manager-6c464d59f8-np7tg 2/2 Running 0 55m
+test-gpu-59c9677b99-qtxbv 0/1 Completed 2 (24s ago) 27s
+test-gpu-59c9677b99-sxd6n 0/1 Completed 2 (24s ago) 27s
+
+$ kubectl logs test-gpu-59c9677b99-qtxbv
+>
+[Vector addition of 50000 elements]
+Copy input data from the host memory to the CUDA device
+CUDA kernel launch with 196 blocks of 256 threads
+Copy output data from the CUDA device to the host memory
+Test PASSED
+Done
+```
+
+#### Start Your Minikube
+
+After preparing your minikube cluster you can start minikube with the command:
 
 ```bash
 minikube start --vm-driver=docker --cpus 6 --memory 6144
+
+# If you wish to run minikube with GPUs, recommended commands are as follows.(root privilege requried)
+
+minikube start --driver=none --container-runtime='containerd' --apiserver-ips 127.0.0.1 --apiserver-name localhost --cpus 6 --memory 6144
 ```
 
 ### Configure KUBECONFIG and KUBEFLOW_NAMESPACE
@@ -59,7 +156,6 @@ export KUBEFLOW_NAMESPACE=$(your_namespace)
 
 - KUBEFLOW_NAMESPACE is used when deployed on Kubernetes, we use this variable to create other resources (e.g. the resource lock) internal in the same namespace. It is optional, use `default` namespace if not set.
 
-
 ### 2. Run ElasticJob Controller
 
 We can run the ElasticJob in the terminal or deploy the controller with
@@ -114,6 +210,7 @@ elasticjob-sample-edljob-worker-1 1/1 Running 0 2m42s
 ```
 
 ### 6. Create a release
+
 Change pip version and docker image tag when creating a new release.
 
 ## Go version