DLRover automatically trains deep learning models on a distributed cluster. It lets model developers focus on model architecture without handling engineering concerns such as hardware acceleration and distributed execution. Today, it provides automated operation and maintenance for deep learning training jobs on K8s/Ray. Its key features are:
- Fault-Tolerance: the training process can continue even if some nodes fail.
- Auto-Scaling: the training job can automatically scale the number of nodes up or down.
- Automatic Resource Optimization: DLRover can automatically optimize job resources to improve training performance.
Users can define a model with tf.estimator.Estimator and deploy either an offline job on K8s with batch data or an online job on Ray with streaming data to train the model. For details on developing models, see the estimator example.
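A minimal sketch of such an Estimator-based model is shown below. It is illustrative only: the feature columns, hidden layers, and input pipeline are assumptions and do not reproduce the actual iris_dnn_elastic module.

```python
import tensorflow as tf

# Illustrative feature columns for the Iris dataset (4 numeric features).
feature_columns = [
    tf.feature_column.numeric_column(key, dtype=tf.float32)
    for key in ["sepal_length", "sepal_width", "petal_length", "petal_width"]
]

# A standard premade Estimator; DLRover runs it under ParameterServerStrategy
# without requiring any resource configuration from the user.
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[64, 32],
    n_classes=3,
    model_dir="/tmp/iris_dnn",
)

def input_fn():
    # Placeholder input pipeline: replace with the real batch or streaming source.
    features = {
        "sepal_length": [5.1, 6.2], "sepal_width": [3.5, 2.9],
        "petal_length": [1.4, 4.3], "petal_width": [0.2, 1.3],
    }
    labels = [0, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10),
)
```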
Users do not need to set any resource configuration to submit a distributed training job. The following example is an ElasticJob on K8s:
```yaml
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: dlrover-dnn-iris
spec:
  distributionStrategy: ParameterServerStrategy
  replicaSpecs:
    ps:
      template:
        spec:
          containers:
            - name: main
              image: easydl/tf-estimator:iris_dnn_v0
              command:
                - "python -m model_zoo.tf_estimator.iris_dnn_elastic"
    worker:
      template:
        spec:
          containers:
            - name: main
              image: easydl/tf-estimator:iris_dnn_v0
              command:
                - "python -m model_zoo.tf_estimator.iris_dnn_elastic"
```
DLRover can recover failed parameter servers and workers to resume training, so the failure of some nodes neither interrupts the training nor hurts convergence accuracy. The most common failure is a node OOM caused by the user's insufficient memory configuration; DLRover can automatically launch a Pod with more memory to recover the OOM node. In AntGroup, DLRover manages hundreds of DL training jobs every day on AntGroup's customized Kubernetes cluster. Excluding jobs that fail due to code errors, the job completion rate rises from 89% with tf-operator in KubeFlow to 95% with DLRover. The remaining unrecoverable failures are caused by data errors, NaN losses of the model, network breakdowns, and so on.
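The recovery logic can be pictured roughly as follows. This is an illustrative sketch, not DLRover's actual implementation; the exit-reason string and the memory-scaling factor are assumptions.

```python
# Illustrative sketch of memory-escalating node recovery (not DLRover's real code).
OOM_EXIT_REASON = "OOMKilled"   # assumed Pod termination reason reported by K8s
MEMORY_SCALE_FACTOR = 2.0       # assumed escalation factor

def recover_failed_node(pod_exit_reason, current_memory_gb, relaunch_pod):
    """Relaunch a failed PS/worker Pod, enlarging memory if it was OOM-killed."""
    if pod_exit_reason == OOM_EXIT_REASON:
        # OOM usually means the user's memory request was too small:
        # restart the node with a larger memory request.
        new_memory_gb = current_memory_gb * MEMORY_SCALE_FACTOR
    else:
        # Other failures (e.g. machine breakdown) are retried with the same resources.
        new_memory_gb = current_memory_gb
    return relaunch_pod(memory_gb=new_memory_gb)
```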
DLRover can automatically scale the number of nodes (parameter servers or workers) up or down at runtime. By monitoring node workload and throughput, DLRover diagnoses the bottleneck of the resource configuration. Common bottlenecks include node stragglers, an unbalanced workload across PS, insufficient CPU cores per node, and an insufficient number of nodes. DLRover improves training performance through dynamic resource adjustment.
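The diagnosis can be sketched as a simple rule set over monitored metrics. The thresholds and metric names below are assumptions for illustration, not DLRover's actual heuristics.

```python
# Illustrative bottleneck diagnosis from monitored metrics (assumed thresholds).
def diagnose_bottleneck(node_cpu_util, ps_cpu_util, step_times):
    """Return a suggested scaling action based on runtime metrics."""
    avg_step = sum(step_times) / len(step_times)
    if max(step_times) > 2 * avg_step:
        return "migrate straggler node"          # one node is much slower than the rest
    if max(ps_cpu_util) > 0.9 and min(ps_cpu_util) < 0.3:
        return "rebalance variables across PS"   # unbalanced PS workload
    if min(node_cpu_util) > 0.9:
        return "add CPU cores or more nodes"     # all nodes are compute-bound
    return "keep current resources"

# Example: the worker with a 5.0s step time is a straggler, so migration is suggested.
print(diagnose_bottleneck([0.6, 0.7, 0.65], [0.5, 0.55], [1.0, 1.1, 5.0]))
```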
We use the Kaggle CRITEO dataset to train Wide&Deep and xDeepFM for 10 epochs on a K8s cluster. DLRover mitigates stragglers to improve training throughput and shorten the job completion time (JCT).
Different models require different resources for training. Users tend to over-provision their jobs to avoid any risk of insufficient resources, which usually results in a huge waste of resources. DLRover Auto-Scaling allocates resources according to the actual demand of model training, reducing this waste.
- Elastic data-parallel multi-GPU training.
- Elastic hybrid-parallel multi-GPU training.
- Auto-Parallelism of deep learning training.