You can leverage Uber's Horovod for distributed deep learning training with FfDL. Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. It makes inter-GPU communication efficient via ring-allreduce and requires only a few lines of modification to user code, enabling faster, easier distributed training. Horovod coordinates its workers over MPI, a low-level interface for high-performance parallel computing.
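As a sketch only: FfDL drives Horovod through MPI, and a typical standalone Horovod launch is an mpirun invocation like the one composed below. The worker count (4) and the script name (train.py) are illustrative assumptions, not FfDL defaults.

```shell
# Hypothetical standalone Horovod launch: 4 MPI worker processes on one host,
# each running a copy of a Horovod-instrumented training script.
np=4
launch_cmd="mpirun -np $np -H localhost:$np python train.py"
echo "$launch_cmd"
```

On FfDL you never run mpirun yourself; the platform launches the MPI processes for you based on the manifest.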
- Deploy FfDL on your Kubernetes cluster.
- In the main FfDL repository, run the following commands to obtain the object storage endpoint from your cluster.
node_ip=$PUBLIC_IP
s3_port=$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}')
s3_url=http://$node_ip:$s3_port
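With mocked values, the endpoint composition above looks like this (203.0.113.10 and 32000 are placeholders; on a real cluster they come from $PUBLIC_IP and the kubectl nodePort lookup):

```shell
# Placeholder values standing in for $PUBLIC_IP and the nodePort lookup.
node_ip=203.0.113.10
s3_port=32000
s3_url=http://$node_ip:$s3_port
echo "$s3_url"   # → http://203.0.113.10:32000
```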
- Next, set up the default object storage access ID and KEY. Then create buckets for all the necessary training data and models.
export AWS_ACCESS_KEY_ID=test; export AWS_SECRET_ACCESS_KEY=test; export AWS_DEFAULT_REGION=us-east-1;
s3cmd="aws --endpoint-url=$s3_url s3"
$s3cmd mb s3://tf_training_data
$s3cmd mb s3://tf_trained_model
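Note that $s3cmd is a plain string expanded unquoted, so the shell word-splits it back into the aws command plus its endpoint flag. A minimal sketch of that expansion, using a mocked endpoint:

```shell
# Mocked endpoint; the real one comes from the nodePort lookup above.
s3_url=http://203.0.113.10:32000
s3cmd="aws --endpoint-url=$s3_url s3"
# Unquoted expansion of $s3cmd word-splits into: aws, --endpoint-url=..., s3.
full_cmd="$s3cmd mb s3://tf_training_data"
echo "$full_cmd"
```

This works here because none of the pieces contain spaces; for anything more complex, a shell function is safer than a command stored in a string.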
- Now, create a temporary directory, download the MNIST images and labels needed to train the TensorFlow model, and upload them to your tf_training_data bucket.
mkdir tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
$s3cmd cp tmp/$file s3://tf_training_data/$file
done
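It can be worth verifying that each archive downloaded intact before uploading, since gzip -t exits non-zero on a truncated file. A self-contained sketch (the sample file below stands in for a downloaded MNIST archive):

```shell
mkdir -p tmp
printf 'hello' | gzip > tmp/sample.gz   # stand-in for a downloaded MNIST file
status=ok
for file in tmp/*.gz; do
  gzip -t "$file" || status=corrupt    # gzip -t checks integrity, writes nothing
done
echo "$status"   # → ok
```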
- Now you should have all the necessary training data in your object storage. Next, set up your REST API endpoint and default credentials for Deep Learning as a Service. Once you've done that, you can start running jobs with the FfDL CLI (executable binary).
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
- Replace the default object storage path with your s3_url. You can skip this step if you already modified the object storage path with your s3_url.
if [ "$(uname)" = "Darwin" ]; then
sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/horovod/manifest_tfmnist.yml
else
sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/horovod/manifest_tfmnist.yml
fi
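The uname branch above exists because BSD sed on macOS requires an explicit (possibly empty) backup suffix after -i, while GNU sed does not. One way to avoid repeating the branch is a small helper; sed_inplace is a hypothetical name used only for this sketch:

```shell
sed_inplace() {
  # BSD sed (macOS) needs `-i ''`; GNU sed takes -i with no suffix argument.
  if [ "$(uname)" = "Darwin" ]; then
    sed -i '' "$1" "$2"
  else
    sed -i "$1" "$2"
  fi
}

# Demo on a throwaway file (not the real manifest):
echo "endpoint: s3.default.svc.cluster.local" > manifest_demo.yml
sed_inplace "s/s3.default.svc.cluster.local/203.0.113.10:32000/" manifest_demo.yml
cat manifest_demo.yml   # → endpoint: 203.0.113.10:32000
```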
- Obtain the correct CLI binary for your machine, then run the training job with the default Horovod model.
CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/horovod/manifest_tfmnist.yml etc/examples/horovod
Congratulations, you have submitted your first Horovod TensorFlow job on FfDL. You can check its status from the FfDL UI, or simply run $CLI_CMD list
- On a Kubeadm-DIND cluster, some users have experienced issues with inter-node pod communication. We therefore suggest using a real Kubernetes cluster environment, or limiting the cluster to a single worker node when testing on Kubeadm-DIND (e.g. run
export NUM_NODES=1
before provisioning your cluster).