
Better deployment #107

Merged
merged 42 commits into main from synced-deployment
Feb 1, 2021
Conversation

joshlk
Member

@joshlk joshlk commented Jan 30, 2021

  • Each user has a different deployment name
  • Users can choose how many nodes to deploy
  • SSH keys and the post-deploy script are mounted into the containers using Kubernetes secrets
  • requirements.txt versions have been pinned, to prevent version mismatch issues in the future
  • DeeperSpeed is specified directly in requirements.txt
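For the secrets bullet, the deployment spec could mount the keys and script roughly like this. This is a hypothetical sketch, not the PR's actual manifest: the secret names (`ssh-keys`, `post-start-script`) and the `/root/.ssh` mount path are assumptions; only the `/secrets/post_start_script.sh` path is taken from the error message discussed later in this thread.

```yaml
# Hypothetical sketch: mount SSH keys and the post-deploy script from
# Kubernetes secrets (secret names and .ssh path are assumptions).
spec:
  containers:
    - name: gpt-neox
      volumeMounts:
        - name: ssh-keys
          mountPath: /root/.ssh      # assumed location for mounted keys
          readOnly: true
        - name: post-start
          mountPath: /secrets        # matches /secrets/post_start_script.sh
  volumes:
    - name: ssh-keys
      secret:
        secretName: ssh-keys
        defaultMode: 0400            # sshd refuses world-readable keys
    - name: post-start
      secret:
        secretName: post-start-script
```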

Script usage:

$ deploy_k8.sh [branch=main] [n_nodes=4] [name_suffix=$USER] [image]
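The bracketed defaults in the usage line suggest positional arguments with fallbacks. A minimal sketch of how that argument handling could work in bash, assuming the real deploy_k8.sh may differ (the function name `parse_deploy_args` and the `neox-` prefix convention for the deployment name are taken from the naming seen later in this thread):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of deploy_k8.sh's argument defaulting.
parse_deploy_args() {
  branch="${1:-main}"        # branch to deploy
  n_nodes="${2:-4}"          # number of nodes to request
  name_suffix="${3:-$USER}"  # deployment is named neox-<name_suffix>
  image="${4:-}"             # optional image override, empty if omitted
  deployment="neox-${name_suffix}"
}
```

With no arguments this reproduces the documented defaults; each positional argument overrides one default in order.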

@joshlk joshlk requested a review from a team as a code owner January 30, 2021 14:28
@StellaAthena
Member

I'm getting error: error reading kubernetes/id_rsa.pub: no such file or directory and then Error from server (NotFound): deployments.apps "neox-stellabiderman" not found. I think that I need to write a deployment script and name it neox-stellabiderman? More helpful error messages would be a huge plus.

I also tried telling it to make 50 nodes and it just spun for half an hour with no feedback.

@joshlk
Member Author

joshlk commented Jan 31, 2021

No need for a separate deployment script. Let me look into the issues.

@joshlk
Member Author

joshlk commented Jan 31, 2021

error: error reading kubernetes/id_rsa.pub: no such file or directory

Simple mistake. That should work now.

Error from server (NotFound): deployments.apps "neox-stellabiderman" not found

This is printed when you don't already have a deployment running. I have now suppressed the message.

I also tried telling it to make 50 nodes and it just spun for half an hour with no feedback.

Here are some useful commands to determine what's going on (your deployment will be called neox-stellabiderman):

  • kubectl get deployment neox-stellabiderman: status of the deployment, including how many pods are ready.
  • kubectl get pods -o wide: lists all pods and their status. All pods whose names start with neox-stellabiderman belong to your deployment.
  • kubectl describe pods neox-stellabiderman-6fbffb6dc7-tc29q: further details about a particular pod.
  • kubectl logs neox-stellabiderman-6fbffb6dc7-tc29q: print the logs of a particular pod (add -f to follow them).
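The second command can be combined with a small filter to surface only the stuck pods of one deployment. A hypothetical helper (the function name `unready_pods` is mine, not from the PR); it is pure text processing over `kubectl get pods` output, so it also works on captured output:

```shell
# Hypothetical helper: given `kubectl get pods` output on stdin, print
# the name and status of each pod of one deployment that is not Running.
unready_pods() {
  local deployment="$1"
  # Skip the header row; match pods whose name starts with the
  # deployment name; keep only those whose STATUS column isn't Running.
  awk -v d="$deployment" \
    'NR > 1 && index($1, d) == 1 && $3 != "Running" { print $1, $3 }'
}
```

Usage would be e.g. `kubectl get pods -o wide | unready_pods neox-stellabiderman`.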

I also just tried to run a deployment with 50 nodes; only 4 pods started. If you then run kubectl describe on one of the pods that didn't start, it shows this error:

0/1441 nodes are available: 120 node(s) were unschedulable, 1321 Insufficient nvidia.com/gpu, 380 Insufficient memory, 929 Insufficient cpu.

Is there a limit to the number of resources we can use?

@StellaAthena
Member

Is there a limit to the number of resources we can use?

Right now, we can have 8 nodes total. So when you asked it to make 50 and it made 4, we probably had 4 other nodes already made. That's also why I chose to test 50; I wanted to see what would happen if I over-requested resources.

@StellaAthena
Member

StellaAthena commented Jan 31, 2021

I just tried to deploy a two-pod deployment (six pods previously existed) and got an interesting message when I ran kubectl get pods -o wide. The status of one of my pods was listed as PostStartHookError: command '/bin/bash /secrets/post_start_script.sh' exited with 128: fatal: destination path '.' already exists and is not an empty directory. After some time that status message went away and the pod gained an IP address and a node ID number, but its status is listed as CrashLoopBackOff. The other node (which never displayed an error) still reports nothing for its IP and ID number.

I also opened up Lens and saw something interesting. Despite the deployments being listed separately (see left), when I click on either neox-josh or neox-stellabiderman I see that they both report 5 pods! Is there something in the code that is connecting the two deployments behind the scenes?

[Screenshot: Screen Shot 2021-01-31 at 11 31 27 AM]

@joshlk joshlk mentioned this pull request Jan 31, 2021
@@ -20,21 +20,27 @@ spec:
       command: ["/usr/sbin/sshd"]
       args: ["-D"]
       tty: true
-      image: leogao2/gpt-neox
+      image: leogao2/gpt-neox:synced-deployment
Member Author


This should be changed to leogao2/gpt-neox:main once merged into the main branch.

@joshlk
Member Author

joshlk commented Jan 31, 2021

I have added two util scripts to be used when logged into the main node of a deployment:

  • scripts/kill_all.sh: kills the deepspeed run on all machines.
  • scripts/sync.sh: uploads a specified file or files to all machines. Useful when you want to make changes on the fly, e.g. scripts/sync.sh gpt_neox/* will upload all files in the gpt_neox directory to all machines.
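The fan-out pattern a script like scripts/sync.sh could use might look like the sketch below. This is hypothetical: the function name `sync_to_hosts`, the hosts file, and the `~/gpt-neox/` destination are assumptions; the sketch prints the scp commands rather than running them so it stays dry-run safe.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a sync-to-all-machines fan-out.
sync_to_hosts() {
  local hosts_file="$1"; shift   # remaining args are the files to sync
  local host
  while read -r host; do
    [ -z "$host" ] && continue   # skip blank lines in the hosts file
    # A real script would execute something like:
    #   scp -r "$@" "$host:~/gpt-neox/"
    # Printed here instead, so the sketch can be inspected safely.
    echo "scp -r $* $host:~/gpt-neox/"
  done < "$hosts_file"
}
```

Called as `sync_to_hosts hosts.txt gpt_neox/*`, it would emit one scp command per worker listed in the hosts file.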

@StellaAthena StellaAthena merged commit 2f5c34a into main Feb 1, 2021
@StellaAthena StellaAthena deleted the synced-deployment branch February 1, 2021 16:26