Implement distributed training using Kubernetes #77

Merged on Jan 23, 2021 (17 commits)
Make deploy script
leogao2 committed Jan 23, 2021
commit 032fde128c6f1a5e1def100b3c974ea3ffabdb3b
11 changes: 11 additions & 0 deletions kubernetes/deploy_k8s.sh
@@ -0,0 +1,11 @@
kubectl delete deploy/eleuther-neox
kubectl apply -f deploy_k8s.yml
echo Waiting for deploy to complete...
kubectl wait --for=condition=available --timeout=600s deployment/eleuther-neox || exit

kubectl get pods -o wide | grep eleuther-neox | awk '{print $6 " slots=8"}' > hosts
export MASTER_ID="$(kubectl get pods | grep eleuther-neox | awk '{print $1}' | head -n 1)"
echo "$MASTER_ID"
kubectl cp "$PWD/hosts" "$MASTER_ID:/app"
#echo 'git remote set-url origin https://github.com/EleutherAI/gpt-neox/ && git pull' | kubectl exec --stdin --tty $MASTER_ID -- /bin/bash
kubectl exec --stdin --tty "$MASTER_ID" -- /bin/bash
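
Note on the hosts-file step: the script takes the pod IP from the sixth whitespace-separated column of `kubectl get pods -o wide`, which depends on the table layout staying fixed. A minimal sketch of a less position-dependent variant follows. It is not part of this commit; the function names `ips_to_hostfile` and `list_pod_ips` are hypothetical, and the label selector `app=eleuther-neox` is an assumption about how `deploy_k8s.yml` labels its pods.

```shell
#!/usr/bin/env bash
# Sketch only: write "<ip> slots=<n>" lines, the format deploy_k8s.sh puts in `hosts`.

# Read one IP per line on stdin; print "<ip> slots=<n>" for every non-empty line.
# $1: slots per host (defaults to 8, the value used in deploy_k8s.sh).
ips_to_hostfile() {
  awk -v slots="${1:-8}" 'NF { print $1 " slots=" slots }'
}

# Print one pod IP per line using jsonpath instead of column positions.
# $1: label selector (ASSUMPTION: pods are labelled app=eleuther-neox).
list_pod_ips() {
  kubectl get pods -l "${1:-app=eleuther-neox}" \
    -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}'
}
```

With both functions sourced, `list_pod_ips | ips_to_hostfile 8 > hosts` would stand in for the `grep | awk` pipeline. Splitting the formatting from the `kubectl` call keeps the formatting testable without a cluster.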
1 change: 0 additions & 1 deletion kubernetes/make_hosts_file.sh

This file was deleted.