Implement distributed training using Kubernetes #77

Merged 17 commits on Jan 23, 2021
Generate keys for worker machines
leogao2 committed Jan 23, 2021
commit af90891a288761221a30912c6700e9e7702676a0
12 changes: 12 additions & 0 deletions deploy_k8s.sh
@@ -1,11 +1,23 @@
kubectl delete deploy/eleuther-neox
kubectl apply -f kubernetes/deploy_k8s.yml
echo Waiting for deploy to complete...
rm -f id_rsa id_rsa.pub
ssh-keygen -t rsa -f id_rsa -N ""
kubectl wait --for=condition=available --timeout=600s deployment/eleuther-neox || exit

kubectl get pods -o wide | grep eleuther-neox | awk '{print $6 " slots=8"}' > hosts
export MASTER_ID=$(kubectl get pods | grep eleuther-neox | awk '{print $1}' | head -n 1)
echo $MASTER_ID
kubectl cp $PWD/hosts $MASTER_ID:/app
kubectl cp $PWD/id_rsa $MASTER_ID:/root/.ssh

mv id_rsa.pub authorized_keys

for id in $(kubectl get pods | grep eleuther-neox | awk '{print $1}')
do
echo copying keys to $id
kubectl cp $PWD/authorized_keys $id:/root/.ssh/
done
rm authorized_keys hosts
#echo 'git remote set-url origin https://github.com/EleutherAI/gpt-neox/ && git pull' | kubectl exec --stdin --tty $MASTER_ID -- /bin/bash
kubectl exec --stdin --tty $MASTER_ID -- /bin/bash
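
The hosts-file step in the script relies on the pod IP sitting in column 6 of `kubectl get pods -o wide`. A minimal sketch of that transformation, so it can be checked without a cluster, might look like the following. The helper name `pods_to_hostfile`, the `sample_pods` function, and every pod name and IP in the sample are hypothetical and only for illustration; the column position is an assumption carried over from the script and can shift if a field such as RESTARTS contains spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): read `kubectl get pods -o wide`
# style output on stdin and print one "<pod IP> slots=<n>" line per pod
# whose name starts with the given prefix.
# Assumption: the pod IP is whitespace-separated column 6, as in deploy_k8s.sh.
pods_to_hostfile() {
    prefix="$1"
    slots="${2:-8}"    # defaults to 8 slots per pod, matching the script
    awk -v prefix="$prefix" -v slots="$slots" \
        'index($1, prefix) == 1 { print $6 " slots=" slots }'
}

# Made-up sample output so the sketch runs without a cluster.
sample_pods() {
cat <<'EOF'
NAME                            READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
eleuther-neox-5d9c7b6f4-abcde   1/1     Running   0          2m    10.0.0.11   node-a   <none>           <none>
eleuther-neox-5d9c7b6f4-fghij   1/1     Running   0          2m    10.0.0.12   node-b   <none>           <none>
other-app-7c6d5e4f3-klmno       1/1     Running   0          9d    10.0.0.99   node-a   <none>           <none>
EOF
}

sample_pods | pods_to_hostfile eleuther-neox
```

Against a real cluster the same helper would be fed from `kubectl get pods -o wide | pods_to_hostfile eleuther-neox > hosts`. Matching on the name prefix inside awk avoids picking up lines where `eleuther-neox` appears in some other column.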