PS Training Using DLRover on Public Cloud

This document explains how to run a DLRover elastic job for PS training on a public cloud, namely Alibaba Cloud Container Service for Kubernetes (ACK).

Preliminary

  • Create a Kubernetes cluster on ACK.
  • Configure cluster credentials on your local computer.
  • Create a NAS storage and mount it to the cluster.
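Before deploying, it is worth verifying that your local kubeconfig points at the ACK cluster. A minimal check (the context name below is a placeholder for your own):

```shell
# Hypothetical context name; substitute the context of your ACK cluster.
kubectl config use-context my-ack-cluster
# Confirm the cluster's nodes are reachable with your credentials.
kubectl get nodes
```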

Deploy the ElasticJob CRD on ACK

  1. Deploy the controller on the cluster.
make deploy IMG=easydl/elasticjob-controller:v0.1.1
  2. Grant the DLRover master permission to access CRDs.
kubectl -n dlrover apply -f dlrover/go/operator/config/rbac/default_role.yaml
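A quick sanity check after the deployment, assuming the controller runs in the dlrover namespace and the CRDs use the elastic.iml.github.io API group shown in the manifests below:

```shell
# Confirm the ElasticJob and ScalePlan CRDs are installed.
kubectl get crd elasticjobs.elastic.iml.github.io scaleplans.elastic.iml.github.io
# Confirm the controller Pod is Running in the dlrover namespace.
kubectl -n dlrover get pods
```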

Submit an Auto-Scaling Job

  • Submit a job to train a DeepFM model without specifying any resource configuration.
kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/autoscale_job.yaml
  • Check the job status
kubectl -n dlrover get elasticjob deepctr-auto-scale
NAME                 PHASE     AGE
deepctr-auto-scale   Running   4s
  • Check the Pod status
kubectl -n dlrover get pods -l elasticjob.dlrover/name=deepctr-auto-scale
NAME                                           READY   STATUS    RESTARTS   AGE
deepctr-auto-scale-edljob-chief-0              1/1     Running   0          78s
deepctr-auto-scale-edljob-evaluator-0          1/1     Running   0          78s
deepctr-auto-scale-edljob-ps-0                 1/1     Running   0          78s
elasticjob-deepctr-auto-scale-dlrover-master   1/1     Running   0          82s

Now the training speed is about 30 steps/s. After about 3 minutes, DLRover scales up 3 workers and the speed rises to about 100 steps/s.

NAME                                          READY   STATUS    RESTARTS   AGE
dlrover-auto-scale-edljob-chief-0             1/1     Running   0          6m17s
dlrover-auto-scale-edljob-ps-0                1/1     Running   0          6m17s
dlrover-auto-scale-edljob-worker-0            1/1     Running   0          3m19s
dlrover-auto-scale-edljob-worker-1            1/1     Running   0          3m19s
dlrover-auto-scale-edljob-worker-2            1/1     Running   0          3m19s
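To observe the scale-up as it happens, you can watch the job's Pods and tail the DLRover master log (the master Pod name follows the elasticjob-&lt;job-name&gt;-dlrover-master pattern shown above):

```shell
# Watch worker Pods appear as DLRover scales the job up (-w streams updates).
kubectl -n dlrover get pods -l elasticjob.dlrover/name=deepctr-auto-scale -w
# Inspect the master's log for its scaling decisions.
kubectl -n dlrover logs elasticjob-deepctr-auto-scale-dlrover-master
```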

Submit a Manual Scaling Job

Submit a job with an initial resource configuration

  • Submit a job with the DeepFM model.
kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml
  • Check the job status
kubectl -n dlrover get elasticjob deepctr-manual-scaling
NAME                    PHASE     AGE
deepctr-manual-scaling  Running   2m20s
  • Check the Pod status
kubectl -n dlrover get pods -l elasticjob.dlrover/name=deepctr-manual-scaling
NAME                                               READY   STATUS    RESTARTS   AGE
deepctr-manual-scale-edljob-chief-0                1/1     Running   0          12s
deepctr-manual-scale-edljob-worker-0               1/1     Running   0          12s
deepctr-manual-scale-edljob-ps-0                   1/1     Running   0          12s
elasticjob-deepctr-manual-scaling-dlrover-master   1/1     Running   0          19s

Manually Scale the Nodes of a Job

We can submit a ScalePlan CRD to scale the nodes of a job up or down. In a ScalePlan, we set the label elasticjob.dlrover/name in metadata.labels to specify which job to scale, and set metadata.labels["scale-type"] to "manual". For example, the following ScalePlan scales the workers of the job deepctr-manual-scale.

apiVersion: elastic.iml.github.io/v1alpha1
kind: ScalePlan
metadata:
  name: deepctr-manual-scale-plan-0
  labels:
    elasticjob.dlrover/name: deepctr-manual-scale
    scale-type: manual
spec:
  ownerJob: deepctr-manual-scale
  replicaResourceSpecs:
    worker:
      replicas: 2
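To apply the plan, save the manifest locally (the filename here is arbitrary) and create it in the job's namespace:

```shell
# scale_plan.yaml is an arbitrary local filename for the ScalePlan above.
kubectl -n dlrover apply -f scale_plan.yaml
# The plan can be inspected or deleted like any other resource.
kubectl -n dlrover get scaleplan deepctr-manual-scale-plan-0
```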

After scaling, there are two worker nodes:

NAME                                             READY   STATUS    RESTARTS   AGE
deepctr-manual-scale-edljob-chief-0              1/1     Running   0          14m
deepctr-manual-scale-edljob-ps-0                 1/1     Running   0          14m
deepctr-manual-scale-edljob-worker-0             1/1     Running   0          14s
deepctr-manual-scale-edljob-worker-1             1/1     Running   0          3s
elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running   0          14m

We can scale up the PS nodes with a spec in the ScalePlan like:

apiVersion: elastic.iml.github.io/v1alpha1
kind: ScalePlan
metadata:
  namespace: dlrover
  name: deepctr-manual-scale-plan-1
  labels:
    elasticjob.dlrover/name: deepctr-auto-scale
    scale-type: manual
spec:
  ownerJob: deepctr-auto-scale
  replicaResourceSpecs:
    ps:
      replicas: 2

After scaling, there are two PS nodes:

NAME                                           READY   STATUS    RESTARTS   AGE
deepctr-auto-scale-edljob-chief-0              1/1     Running   0          7m36s
deepctr-auto-scale-edljob-ps-0                 1/1     Running   0          7m36s
deepctr-auto-scale-edljob-ps-1                 1/1     Running   0          2m50s
elasticjob-deepctr-auto-scale-dlrover-master   1/1     Running   0          7m43s

We can scale down the PS nodes with a spec in the ScalePlan like:

apiVersion: elastic.iml.github.io/v1alpha1
kind: ScalePlan
metadata:
  namespace: dlrover
  name: deepctr-manual-scale-plan-2
  labels:
    elasticjob.dlrover/name: deepctr-auto-scale
    scale-type: manual
spec:
  ownerJob: deepctr-auto-scale
  replicaResourceSpecs:
    ps:
      replicas: 1

After scaling, there is one PS node left:

NAME                                           READY   STATUS    RESTARTS   AGE
deepctr-auto-scale-edljob-chief-0              1/1     Running   0          9m30s
deepctr-auto-scale-edljob-ps-0                 1/1     Running   0          9m30s
elasticjob-deepctr-auto-scale-dlrover-master   1/1     Running   0          9m47s
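The scale-up and scale-down plans above differ only in the PS replica count, so the manifest can be generated from the shell. A minimal sketch; the REPLICAS variable and the output filename are illustrative choices, not part of DLRover itself:

```shell
# Generate a PS ScalePlan from a shell variable so one template serves
# both scale-up and scale-down.
REPLICAS=2
cat > ps_scale_plan.yaml <<EOF
apiVersion: elastic.iml.github.io/v1alpha1
kind: ScalePlan
metadata:
  namespace: dlrover
  name: deepctr-scale-plan-ps-${REPLICAS}
  labels:
    elasticjob.dlrover/name: deepctr-auto-scale
    scale-type: manual
spec:
  ownerJob: deepctr-auto-scale
  replicaResourceSpecs:
    ps:
      replicas: ${REPLICAS}
EOF
# Then apply it with: kubectl -n dlrover apply -f ps_scale_plan.yaml
```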

We can migrate a PS node to one with more resources with a spec in the ScalePlan like:

apiVersion: elastic.iml.github.io/v1alpha1
kind: ScalePlan
metadata:
  namespace: dlrover
  name: deepctr-manual-scale-plan-3
  labels:
    elasticjob.dlrover/name: deepctr-auto-scale
    scale-type: manual
spec:
  ownerJob: deepctr-auto-scale
  migratePods:
    - name: deepctr-auto-scale-edljob-ps-0
      resource:
        cpu: "2"
        memory: 4Gi
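Applying this plan and watching the job's Pods shows the handover (again assuming the manifest is saved locally under an arbitrary name):

```shell
# migrate_plan.yaml is an arbitrary local filename for the ScalePlan above.
kubectl -n dlrover apply -f migrate_plan.yaml
# The new PS Pod appears first; the old one exits once workers reconnect.
kubectl -n dlrover get pods -l elasticjob.dlrover/name=deepctr-auto-scale -w
```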

During migration, a new PS node is started. When the new PS is ready, the master informs the workers to connect to the new PS.

NAME                                           READY   STATUS    RESTARTS   AGE
deepctr-auto-scale-edljob-chief-0              1/1     Running   0          22m
deepctr-auto-scale-edljob-ps-0                 1/1     Running   0          22m

After the migration, the new PS has joined and the old PS has exited:

NAME                                           READY   STATUS    RESTARTS   AGE
deepctr-auto-scale-edljob-chief-0              1/1     Running   0          22m
deepctr-auto-scale-edljob-ps-2                 1/1     Running   0          20s