Implement distributed training using Kubernetes #77

Merged
merged 17 commits into from
Jan 23, 2021

@leogao2 leogao2 commented Jan 23, 2021

Distributed training should work without any problems now.

To use:

  1. Push the image to Docker (todo: automate this step using GitHub pipelines).
  2. Run ./deploy_k8s.sh. This should create all of the necessary containers and drop you into a shell. Optional: change the number of nodes in kubernetes/deploy_k8s.yml to however many you want.
  3. Run your deepspeed command in the shell (e.g. deepspeed --hostfile=hosts train_enwik8.py --deepspeed --deepspeed_config configs/deepspeed_zero2.json). Make sure to include --hostfile=hosts.
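
For reference, the file passed via --hostfile uses DeepSpeed's hostfile format: one `<hostname> slots=<gpu count>` line per node. The sketch below shows how such a file could be generated. The pod names and slot counts are hypothetical, and this is not necessarily how deploy_k8s.sh produces the `hosts` file, since that script is not shown in this thread.

```python
from pathlib import Path
from typing import Mapping


def render_hostfile(hosts: Mapping[str, int]) -> str:
    """Return hostfile text with one '<hostname> slots=<n>' line per node."""
    lines = []
    for name, slots in hosts.items():
        if slots < 1:
            raise ValueError(f"{name}: slots must be at least 1, got {slots}")
        lines.append(f"{name} slots={slots}")
    return "\n".join(lines) + "\n"


def write_hostfile(path: str, hosts: Mapping[str, int]) -> None:
    """Write the rendered hostfile to the given path."""
    Path(path).write_text(render_hostfile(hosts))


if __name__ == "__main__":
    # Hypothetical pod hostnames and GPU counts, for illustration only.
    example_hosts = {"worker-0": 8, "worker-1": 8}
    print(render_hostfile(example_hosts), end="")
```

Calling `write_hostfile("hosts", example_hosts)` would write a file that can then be passed as --hostfile=hosts.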


leogao2 commented Jan 23, 2021

Todo: set up an eleuther Docker Hub account and make a pipeline for building containers.

image: leogao2/deepspeed_eleuther
ports:
  - name: sshd
    containerPort: 2222
@amannm amannm Jan 23, 2021

Shouldn't this be 22, as suggested by the EXPOSE at https://github.com/coreweave/cuda-ssh-server/blob/master/Dockerfile#L37?

Member

That's what I had thought as well, but Leo and I tested it and it seems to work with 2222. Not sure why / what difference it makes though.
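
One way to settle which port sshd is actually listening on is to attempt a TCP connection to each candidate port from another pod. The sketch below does that with the standard library; the host name is hypothetical. Note that `containerPort` in a Kubernetes pod spec is primarily informational, so the value declared there does not by itself determine which port the server inside the container listens on.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    host = "worker-0"  # hypothetical pod hostname, replace with a real one
    for port in (22, 2222):
        state = "open" if port_is_open(host, port) else "closed or unreachable"
        print(f"{host}:{port} is {state}")
```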

@leogao2 leogao2 changed the title from "Add k8s stuff" to "Implement distributed training using Kubernetes" Jan 23, 2021
@leogao2 leogao2 marked this pull request as ready for review January 23, 2021 07:57
@leogao2 leogao2 requested a review from a team as a code owner January 23, 2021 07:57
@StellaAthena StellaAthena enabled auto-merge (squash) January 23, 2021 20:59
@StellaAthena StellaAthena linked an issue Jan 23, 2021 that may be closed by this pull request
3 participants