Name		Name	Last commit message	Last commit date
parent directory ..
data		data
README.md		README.md
train.yaml		train.yaml

README.md

Training deepspeech.pytorch on Kubernetes using TorchElastic

Below are instructions to train a model using a GKE cluster and take advantage of pre-emptible VMs using TorchElastic.

gcloud container clusters create torchelastic \
    --machine-type=n1-standard-2 \
    --disk-size=15Gi \
    --zone=us-west1-b \
    --cluster-version=1.15 \
    --num-nodes=3 --min-nodes=0 --max-nodes=3 \
    --enable-autoscaling \
    --scopes=storage-full

# Add GPU pool
gcloud container node-pools create gpu-pool --cluster torchelastic \
    --accelerator type=nvidia-tesla-v100,count=1\
    --machine-type=n1-standard-4 \
    --disk-size=25Gi \
    --zone=us-west1-b \
    --preemptible \
    --num-nodes=1 --min-nodes=0 --max-nodes=1 \
    --enable-autoscaling \
    --scopes=storage-full

We use pre-emptive nodes to reduce costs. The code handles interruptions by saving state to GCS periodically.

Set up ElasticJob

git clone https://github.com/pytorch/elastic.git
cd elastic/kubernetes

kubectl apply -k config/default

Setup Volume

First we create a drive to store our data. The drive is fairly small and can be managed under the volumes tab on GCP. Modify the config as needs be.

cd deepspeech.pytorch/kubernetes/
gcloud compute disks create --size 10Gi audio-data --zone us-west1-b
kubectl apply -f data/storage.yaml
kubectl apply -f data/persistent_volume.yaml

Download Data

We run a job to download and extract the data onto our drive. Modify the config to match whatever data you'd like to download.

In our example we download and extract the AN4 data using the deepspeech.pytorch docker image.

kubectl apply -f data/transfer_data.yaml
kubectl logs transfer-data --namespace=elastic-job # Monitor logs to determine when the process is complete
kubectl delete -f data/transfer_data.yaml # Delete to free up resources

Install CRD/CUDA for Training

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl apply -f https://raw.githubusercontent.com/pytorch/elastic/master/kubernetes/config/samples/etcd.yaml
kubectl get svc -n elastic-job

Training

GCS Model Store

To store the checkpoint models, we use Google Cloud Storage. Create a bucket and make sure to modify checkpointing.gcs_bucket=deepspeech-1234 to train.yaml to point to the bucket that the cluster has access to.

kubectl apply -f train.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kubernetes

kubernetes

README.md

Training deepspeech.pytorch on Kubernetes using TorchElastic

Set up ElasticJob

Setup Volume

Download Data

Install CRD/CUDA for Training

Training

GCS Model Store

Files

kubernetes

Directory actions

More options

Directory actions

More options

Latest commit

History

kubernetes

Folders and files

parent directory

README.md

Training deepspeech.pytorch on Kubernetes using TorchElastic

Set up ElasticJob

Setup Volume

Download Data

Install CRD/CUDA for Training

Training

GCS Model Store