Create experiment runners #7
Comments
Do you want something like a dashboard for executing and managing your experiments, or should it be CLI-based?
We would be happy with either, but a dashboard would probably be ideal. FYI, the current interface is to write scripts and then call […]. We already have a TensorBoard... is this something that can easily be modified to also manage experiments?
@steven-mi I assume Kubeflow plays nicely with Kubernetes? Kubernetes is what we are already using on the CoreWeave servers, so that sounds ideal.
Do folks want to use something like Weights & Biases? It's a bit harder to make public, but I find it really useful for keeping track of experiments, config files, and sweeps.
Kubeflow will give us the functional insights we need.
@dreamindata I've never used Argo before but it looks neat. I think one of the most important uses for a runner is to make sure that work is actually being spread evenly across all machines/processes. What do you think is best for keeping track of that?
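To make "spread evenly" concrete, here is a minimal sketch of one way a runner could distribute work: expand a config grid into runs, then give each run to whichever worker currently has the least load. This is an illustration only, not how Argo, Kubeflow, or the existing setup schedules jobs. The names `expand_grid`, `assign_least_loaded`, the `gpu-N` worker labels, and the uniform cost of 1.0 per run are all assumptions.

```python
import heapq
from itertools import product


def expand_grid(grid):
    """Expand a dict of option lists into one config dict per combination."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def assign_least_loaded(jobs, workers, cost=lambda job: 1.0):
    """Greedily hand each job to the worker with the smallest total cost so far.

    `cost` is a hypothetical estimate of how expensive a job is; the default
    treats every job as equally expensive.
    """
    # Heap entries are (current_load, tie_breaker, worker_name).
    heap = [(0.0, i, name) for i, name in enumerate(workers)]
    heapq.heapify(heap)
    assignment = {name: [] for name in workers}
    # Placing the most expensive jobs first tends to give a better balance.
    for job in sorted(jobs, key=cost, reverse=True):
        load, i, name = heapq.heappop(heap)
        assignment[name].append(job)
        heapq.heappush(heap, (load + cost(job), i, name))
    return assignment


# Hypothetical sweep: 2 learning rates x 3 seeds = 6 runs over 3 workers.
grid = {"lr": [1e-3, 1e-4], "seed": [0, 1, 2]}
runs = expand_grid(grid)
plan = assign_least_loaded(runs, ["gpu-0", "gpu-1", "gpu-2"])
for worker, jobs in plan.items():
    print(worker, len(jobs))
```

Comparing the per-worker job counts (or summed costs) in `plan` against what the cluster actually reports is one simple way to check whether the load is really balanced.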
We have this partly set up in Kubernetes + Docker currently. I use a Kube GUI called Lens to keep an eye on load balancing.
If we change some of the Docker defaults, we can give each GPU a little more breathing room and the work will stay well distributed across multiple GPUs.
This is correct. Example timings from the most recent large-scale run are:
Superseded by subsequent changes.
We will want to run experiments with a variety of configs and options. To enable this, we need two things: