
Better deployment #107

Merged
merged 42 commits into main from synced-deployment
Feb 1, 2021
Conversation

joshlk
Member

@joshlk joshlk commented Jan 30, 2021

  • Each user has a different deployment name
  • Users can choose how many nodes to deploy
  • SSH keys and the post-deploy script are mounted into the containers using Kubernetes secrets
  • requirements.txt versions have been pinned, to prevent version mismatch issues in the future
  • DeeperSpeed is specified directly in requirements.txt
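For the secrets bullet, the deployment spec could mount the keys and script roughly like this. This is a hypothetical sketch, not the PR's actual manifest: the secret names (`ssh-keys`, `post-start-script`) and the `/root/.ssh` mount path are assumptions; only the `/secrets/post_start_script.sh` path is taken from the error message discussed later in this thread.

```yaml
# Hypothetical sketch: mount SSH keys and the post-deploy script from
# Kubernetes secrets (secret names and .ssh path are assumptions).
spec:
  containers:
    - name: gpt-neox
      volumeMounts:
        - name: ssh-keys
          mountPath: /root/.ssh      # assumed location for mounted keys
          readOnly: true
        - name: post-start
          mountPath: /secrets        # matches /secrets/post_start_script.sh
  volumes:
    - name: ssh-keys
      secret:
        secretName: ssh-keys
        defaultMode: 0400            # sshd refuses world-readable keys
    - name: post-start
      secret:
        secretName: post-start-script
```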

Script usage:

$ deploy_k8.sh [branch=main] [n_nodes=4] [name_suffix=$USER] [image]
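The bracketed defaults in the usage line suggest positional arguments with fallbacks. A minimal sketch of how that argument handling could work in bash, assuming the real deploy_k8.sh may differ (the function name `parse_deploy_args` and the `neox-` prefix convention for the deployment name are taken from the naming seen later in this thread):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of deploy_k8.sh's argument defaulting.
parse_deploy_args() {
  branch="${1:-main}"        # branch to deploy
  n_nodes="${2:-4}"          # number of nodes to request
  name_suffix="${3:-$USER}"  # deployment is named neox-<name_suffix>
  image="${4:-}"             # optional image override, empty if omitted
  deployment="neox-${name_suffix}"
}
```

With no arguments this reproduces the documented defaults; each positional argument overrides one default in order.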

@joshlk joshlk requested a review from a team as a code owner January 30, 2021 14:28
@StellaAthena
Member

I'm getting error: error reading kubernetes/id_rsa.pub: no such file or directory and then Error from server (NotFound): deployments.apps "neox-stellabiderman" not found. I think that I need to write a deployment script and name it neox-stellabiderman? More helpful error messages would be a huge plus.

I also tried telling it to make 50 nodes and it just spun for half an hour with no feedback.

@joshlk
Member Author

joshlk commented Jan 31, 2021

No need for a separate deployment script. Let me look into the issues.

@joshlk
Member Author

joshlk commented Jan 31, 2021

error: error reading kubernetes/id_rsa.pub: no such file or directory

Simple mistake. That should work now.

Error from server (NotFound): deployments.apps "neox-stellabiderman" not found

This is printed when you don't already have a deployment running. I have now suppressed the message.

I also tried telling it to make 50 nodes and it just spun for half an hour with no feedback.

Here are some useful commands to determine what's going on (your deployment will be called neox-stellabiderman):

  • kubectl get deployment neox-stellabiderman: status of the deployment, including how many pods are ready.
  • kubectl get pods -o wide: lists all pods and their status. All pods whose names start with neox-stellabiderman belong to your deployment.
  • kubectl describe pods neox-stellabiderman-6fbffb6dc7-tc29q: further details about a particular pod.
  • kubectl logs neox-stellabiderman-6fbffb6dc7-tc29q: print the logs of a particular pod (add -f to follow them).
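The second command can be combined with a small filter to surface only the stuck pods of one deployment. A hypothetical helper (the function name `unready_pods` is mine, not from the PR); it is pure text processing over `kubectl get pods` output, so it also works on captured output:

```shell
# Hypothetical helper: given `kubectl get pods` output on stdin, print
# the name and status of each pod of one deployment that is not Running.
unready_pods() {
  local deployment="$1"
  # Skip the header row; match pods whose name starts with the
  # deployment name; keep only those whose STATUS column isn't Running.
  awk -v d="$deployment" \
    'NR > 1 && index($1, d) == 1 && $3 != "Running" { print $1, $3 }'
}
```

Usage would be e.g. `kubectl get pods -o wide | unready_pods neox-stellabiderman`.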

I also just tried to run a deployment with 50 nodes; only 4 pods started. If you then run kubectl describe on one of the pods that didn't start, it shows this error:

0/1441 nodes are available: 120 node(s) were unschedulable, 1321 Insufficient nvidia.com/gpu, 380 Insufficient memory, 929 Insufficient cpu.

Is there a limit to the number of resources we can use?

@StellaAthena
Member

Is there a limit to the number of resources we can use?

Right now, we can have 8 nodes total. So when you asked it to make 50 and it made 4, we probably had 4 other nodes already made. That's also why I chose to test 50; I wanted to see what would happen if I over-requested resources.

@StellaAthena
Member

StellaAthena commented Jan 31, 2021

I just tried to deploy a two-pod deployment (six pods previously existed) and got an interesting message when I ran kubectl get pods -o wide. The status of one of my pods was listed as PostStartHookError: command '/bin/bash /secrets/post_start_script.sh' exited with 128: fatal: destination path '.' already exists and is not an empty directory. After some time that status message went away and the pod gained an IP address and a node ID number, but its status is listed as CrashLoopBackOff. The other node (which never displayed an error) still reports nothing for its IP and ID number.

I also opened up Lens and saw something interesting. Despite the deployments being listed separately (see left), when I click on either neox-josh or neox-stellabiderman I see that they both report 5 pods! Is there something in the code that is connecting the two deployments behind the scenes?

[Screenshot: Screen Shot 2021-01-31 at 11 31 27 AM]

@joshlk joshlk mentioned this pull request Jan 31, 2021
@@ -20,21 +20,27 @@ spec:
       command: ["/usr/sbin/sshd"]
       args: ["-D"]
       tty: true
-      image: leogao2/gpt-neox
+      image: leogao2/gpt-neox:synced-deployment
Member Author


This should be changed to leogao2/gpt-neox:main once merged into the main branch.

@joshlk
Member Author

joshlk commented Jan 31, 2021

I have added two util scripts to be used when logged into the main node of a deployment:

  • scripts/kill_all.sh: kills the deepspeed run on all machines.
  • scripts/sync.sh: uploads a specified file or files to all machines. Useful when you want to make changes on the fly, e.g. scripts/sync.sh gpt_neox/* will upload all files in the gpt_neox directory to all machines.
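The fan-out pattern a script like scripts/sync.sh could use might look like the sketch below. This is hypothetical: the function name `sync_to_hosts`, the hosts file, and the `~/gpt-neox/` destination are assumptions; the sketch prints the scp commands rather than running them so it stays dry-run safe.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a sync-to-all-machines fan-out.
sync_to_hosts() {
  local hosts_file="$1"; shift   # remaining args are the files to sync
  local host
  while read -r host; do
    [ -z "$host" ] && continue   # skip blank lines in the hosts file
    # A real script would execute something like:
    #   scp -r "$@" "$host:~/gpt-neox/"
    # Printed here instead, so the sketch can be inspected safely.
    echo "scp -r $* $host:~/gpt-neox/"
  done < "$hosts_file"
}
```

Called as `sync_to_hosts hosts.txt gpt_neox/*`, it would emit one scp command per worker listed in the hosts file.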

@StellaAthena StellaAthena merged commit 2f5c34a into main Feb 1, 2021
@StellaAthena StellaAthena deleted the synced-deployment branch February 1, 2021 16:26