Implement distributed training using Kubernetes #77

leogao2 · 2021-01-23T00:51:40Z

Distributed training should work without any problems now.

To use:

1. Push the Docker image (todo: automate this step using GitHub pipelines).
2. Run ./deploy_k8s.sh. This should create all of the necessary containers and drop you into a shell. Optional: change the number of nodes in kubernetes/deploy_k8s.yml to however many you want.
3. Run your deepspeed command in the shell (e.g. deepspeed --hostfile=hosts train_enwik8.py --deepspeed --deepspeed_config configs/deepspeed_zero2.json). Make sure to include --hostfile=hosts.
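For readers unfamiliar with the file that --hostfile=hosts points at: DeepSpeed expects one line per machine in the form `<hostname> slots=<gpu count>`. The sketch below only illustrates that format. It is not the host file generation script from this PR (that script is not shown here), and the function name `format_hostfile`, the example addresses, and the slot count are all hypothetical.

```python
from typing import Iterable


def format_hostfile(hosts: Iterable[str], slots_per_host: int) -> str:
    """Return DeepSpeed hostfile text: one '<host> slots=<n>' line per host.

    Hypothetical helper for illustration; the PR's own generation script
    may work differently.
    """
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be at least 1")
    # Each worker gets its own line; 'slots' is the number of GPUs to use there.
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


if __name__ == "__main__":
    # Assumed pod addresses and GPU count, not taken from the PR.
    print(format_hostfile(["10.0.0.11", "10.0.0.12"], slots_per_host=8), end="")
```

Writing the returned text to a file named hosts in the working directory would match the --hostfile=hosts argument used in step 3.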

leogao2 · 2021-01-23T06:09:10Z

Todo: set up an Eleuther Docker Hub account and make a pipeline for building containers.

Dockerfile

amannm · 2021-01-23T07:39:26Z

kubernetes/deploy_k8s.yml

+        image: leogao2/deepspeed_eleuther
+        ports:
+        - name: sshd
+          containerPort: 2222


Shouldn't this be 22, as suggested by the EXPOSE at https://github.com/coreweave/cuda-ssh-server/blob/master/Dockerfile#L37 ?

That's what I had thought as well, but Leo and I tested it and it seems to work with 2222. Not sure why / what difference it makes though.
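One possible explanation, offered as a hedge rather than a diagnosis of this deployment: per the Kubernetes container spec, `containerPort` is primarily informational, and leaving a port out of the list does not stop it from being reachable inside the cluster. A quick way to see which port sshd actually answers on is to probe both from another pod. The sketch below assumes a hypothetical helper name `port_is_open` and a placeholder worker address.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; closing it right away
        # is enough to know that something is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    worker = "10.0.0.11"  # placeholder pod IP, not taken from the PR
    for candidate in (22, 2222):
        print(candidate, port_is_open(worker, candidate))
```

A successful probe only shows that some process accepts TCP connections on that port; it does not confirm that the process is sshd.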

leogao2 added 8 commits January 22, 2021 17:50

Add initial Dockerfile and k8s deployment

1bdb814

Fix imports

05f4dcf

Update dockerfile

6e9b3a8

Merge branch 'main' of github.com:EleutherAI/gpt-neox into k8s

a0b4748

Add /dev/shm patch

206b8ca

Update to be completely passwordless

2a578d2

Add host file generation script

d48f48a

Make deploy script

032fde1

amannm reviewed Jan 23, 2021


amannm reviewed Jan 23, 2021


leogao2 changed the title ~~Add k8s stuff~~ Implement distributed training using Kubernetes Jan 23, 2021

leogao2 marked this pull request as ready for review January 23, 2021 07:57

leogao2 requested a review from a team as a code owner January 23, 2021 07:57

leogao2 requested review from StellaAthena and ConnorJL January 23, 2021 07:57

leogao2 and others added 9 commits January 23, 2021 01:03

Update deploy script and fix echo -e problem

81f1ef4

Remove command line argument

ee1739e

Generate keys for worker machines

af90891

Harden security slightly

dcd11ae

Update docker for custom keygen

e682f7f

Fix ssh

8a26b7e

Fix deploy script to use right id

0e36734

Added logging config

480dc36

Update README.md

221d73a

StellaAthena enabled auto-merge (squash) January 23, 2021 20:59

StellaAthena approved these changes Jan 23, 2021


StellaAthena merged commit 4d22350 into main Jan 23, 2021

StellaAthena deleted the k8s branch January 23, 2021 20:59

StellaAthena linked an issue Jan 23, 2021 that may be closed by this pull request

Pipeline parallelism and gradient checkpointing (edit: and ZeRO 2!) don’t work together #62

Closed

StellaAthena linked an issue Jan 23, 2021 that may be closed by this pull request

Expand to all 8 CoreWeave Machines #68

Closed
