Merge pull request intelligent-machine-learning#196 from intelligent-machine-learning/fix-docs

Polish the README
workingloong committed Jan 29, 2023
2 parents 1d1b61a + e34c17e commit 5153740
Showing 4 changed files with 24 additions and 22 deletions.
24 changes: 16 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,8 @@
DLRover, as its name suggests, makes training deep learning models easy. It lets model developers focus on the model algorithm itself, without worrying about engineering concerns such as hardware acceleration and distributed execution. It configures a training job's nodes automatically, both statically before the job starts on Kubernetes and dynamically while it runs. Key features:

- Fault-Tolerance.
- Static and Dynamic resource configuration.
- Automatic distributed model training.
- Auto-Scaling.
- Automatic Resource Optimization.

DLRover consists of three components:

Expand All @@ -22,14 +22,22 @@ DLRover can recover failed parameter servers and workers and resume the training
The failure of some nodes neither interrupts the training nor hurts the convergence
accuracy.

### Static and Dynamic Resource Configuration
### Auto-Scaling

DLRover can automatically configure the resources to start a training job
and monitor the performance of a training job and dynamically adjust
the resources to improve the training performance.
DLRover can automatically scale the number of
nodes (parameter servers or workers) up and down at runtime of a training job
using workload-aware algorithms. In a DLRover training job, nodes can come and
go at any time without interrupting the training process or wasting
work (e.g., re-initialization and repeated iterations when the number of nodes changes).
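As a rough illustration of the workload-aware scaling decision described above, consider the sketch below. The function and parameter names are invented for this example and are not DLRover's actual API:

```python
# Hypothetical sketch of a workload-aware scaling decision (not DLRover's real API).
def decide_worker_count(current: int, throughput: float, target: float,
                        max_workers: int = 32) -> int:
    """Return a new worker count: scale up when throughput lags the
    target, scale down when it comfortably exceeds it."""
    if throughput < 0.9 * target and current < max_workers:
        return current + 1  # under-provisioned: add a worker
    if throughput > 1.1 * target and current > 1:
        return current - 1  # over-provisioned: remove a worker
    return current          # within tolerance: keep the current size
```

A real controller would run such a decision in a loop against metrics collected from the job, and the departing or arriving workers would rejoin without restarting training.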

### Automatic distributed model training

(To be added)
### Automatic Resource Optimization

DLRover can automatically configure the resources to start a training job.
After the job starts, DLRover monitors the performance (e.g., throughput and workload)
of a training job and dynamically adjusts the resources to
improve the training performance.
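One way to picture the initial resource configuration step is a heuristic that sizes parameter servers from the model's memory footprint. This is a hypothetical sketch under invented assumptions (the names, the 8 GB-per-PS shard size, and the 50% headroom are illustrative only, not DLRover's actual policy):

```python
# Hypothetical sketch of initial resource configuration (not DLRover's real policy).
def initial_resources(model_size_gb: float, num_workers: int) -> dict:
    """Shard the model across parameter servers so each shard fits in
    memory with headroom; one heuristic among many."""
    ps_count = max(1, round(model_size_gb / 8))     # assume ~8 GB of parameters per PS
    ps_memory_gb = model_size_gb / ps_count * 1.5   # assume 50% memory headroom per PS
    return {
        "ps": {"count": ps_count, "memory_gb": ps_memory_gb},
        "worker": {"count": num_workers, "cpu": 4, "memory_gb": 16},
    }
```

After the job is running, the monitored throughput would feed back into adjustments of these initial estimates.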

## Quick Start

[TensorFlow Estimator on Aliyun ACK](docs/tutorial/dlrover_cloud.md)
12 changes: 0 additions & 12 deletions docker/release.dockerfile

This file was deleted.

8 changes: 7 additions & 1 deletion docs/dev.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,13 @@ make deploy IMG=easydl/elasticjob-controller:test
kubectl apply -f dlrover/go/operator/config/rbac/default_role.yaml
```

### 4. Submit an ElasticJob.
### 4. Build the Image of DLRover Master

```bash
docker build -t easydl/dlrover-master:test -f docker/Dockerfile .
```

### 5. Submit an ElasticJob

```bash
eval $(minikube docker-env)
Expand Down
2 changes: 1 addition & 1 deletion docs/tutorial/dlrover_cloud.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ namely, Alibaba Cloud Container Service for Kubernetes(ACK).
1. Deploy the controller on the cluster.

```bash
make deploy IMG=easydl/elasticjob-controller:test
make deploy IMG=easydl/elasticjob-controller:v0.1.1
```

2. Grant permission for the DLRover master to Access CRDs.
Expand Down
