Create experiment runners #7

Closed
2 tasks
StellaAthena opened this issue Dec 23, 2020 · 11 comments
Labels: feature request, good first issue

@StellaAthena
Member

We will want to run experiments with a variety of configs and options. To enable this, we need two things:

  • config files that we can use to specify the settings for a particular run
  • an experiment runner for managing and automatically executing several runs
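
As a rough illustration of how these two pieces could fit together, here is a minimal sketch, not the project's actual runner. It assumes one JSON file per run in a config directory and a training script that accepts `--key value` arguments; the names `load_configs`, `build_command`, and `run_all`, and the file layout, are all hypothetical.

```python
import json
import subprocess
import sys
from pathlib import Path


def load_configs(config_dir):
    """Return (run_name, settings) pairs for every *.json file, sorted by file name."""
    paths = sorted(Path(config_dir).glob("*.json"))
    return [(p.stem, json.loads(p.read_text())) for p in paths]


def build_command(train_script, settings):
    """Turn a settings dict into a command line, e.g. {"lr": 0.1} -> [..., "--lr", "0.1"]."""
    cmd = [sys.executable, str(train_script)]
    for key, value in settings.items():
        cmd += [f"--{key}", str(value)]
    return cmd


def run_all(config_dir, train_script, dry_run=False):
    """Launch one process per config, one after another; return {run_name: exit_code}."""
    results = {}
    for name, settings in load_configs(config_dir):
        cmd = build_command(train_script, settings)
        if dry_run:
            # Only show what would be executed (hypothetical convenience flag).
            print(name, " ".join(cmd))
            results[name] = None
            continue
        results[name] = subprocess.run(cmd).returncode
    return results
```

Calling `run_all("configs", "train.py", dry_run=True)` would print the command for each config without launching anything. A real runner would also need scheduling across machines, which this sketch leaves out.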
StellaAthena added the feature request label Dec 23, 2020
StellaAthena added this to To do in 1T or BUST via automation Dec 27, 2020
@steven-mi
Contributor

Do you want to have something like a dashboard for executing and managing your experiments, or should it be CLI-based?

@StellaAthena
Member Author

> Do you want to have something like a dashboard for executing and managing your experiments, or should it be CLI-based?

We would be happy with either, but a dashboard would probably be ideal. FYI, the current interface is to write scripts and then call sh scripts/script_name.sh.

We already have a TensorBoard... is this something that can easily be modified to also manage experiments?

StellaAthena added the good first issue label Jan 4, 2021
@steven-mi
Contributor

steven-mi commented Jan 4, 2021

> We would be happy with either, but a dashboard would probably be ideal. FYI, the current interface is to write scripts and then call sh scripts/script_name.sh.

Yup, I saw that. It didn't look optimal.

> We already have a TensorBoard... is this something that can easily be modified to also manage experiments?

TensorBoard is more of a visualization tool. I'm not sure it can manage or start experiments.


If you want a way to schedule your Python scripts, you could use an orchestration system like:

  • Airflow
  • Kubeflow
  • Mara
  • Luigi

All of them come with a web UI in which you can monitor and manage your scripts. These tools won't give you an overview of your experiments the way TensorBoard does. Each is simply a dashboard for starting, stopping, and monitoring the state of your training.

For example, with Airflow it would look like this:
[image]

If you just want a dashboard where you can look up your experiment results, then maybe MLflow, although for this case TensorBoard would also be fine.

@StellaAthena
Member Author

@steven-mi I assume Kubeflow plays nicely with Kubernetes? Kubernetes is what we are already using on the CoreWeave servers so that sounds ideal.

@glebshevchukk

Do folks want to use something like Weights & Biases? It's a bit harder to make public, but I find it really useful for keeping track of experiments, config files, and sweeps.

@dreamindata

dreamindata commented Jan 23, 2021

Kubeflow will give us the functional insights we need.
We can also use Argo to look at the layout of individual GPUs and the CPU instances they are connected to.
If we need to go further and have more extensive logging we can use ELK or ELP, etc. once we set up a separate compute node and reserve the logging storage.
To keep track of experiments, Kedro has a bit of a learning curve, but the function visualization interface makes it worth it. Kedro-Viz will also come in useful later when users need to create pipelines to segregate inappropriate content.
Weights and Biases (wandb.ai) reports look professional and the monitoring is easy to implement, but the output is kludgy (whatever you guys think is best is fine by me).

@glebshevchukk

@dreamindata I've never used Argo before but it looks neat. I think one of the most important uses for a runner is to make sure that work is actually being spread evenly across all machines/processes. What do you think is best for keeping track of that?

@StellaAthena
Member Author

We somewhat have this set up in Kubernetes + Docker currently. I use a Kube GUI called Lens to keep an eye on load balancing.

@dreamindata

> @dreamindata I've never used Argo before but it looks neat. I think one of the most important uses for a runner is to make sure that work is actually being spread evenly across all machines/processes. What do you think is best for keeping track of that?

If we change some of the Docker defaults we can give each GPU a little more breathing room and the work will stay well distributed across multiple GPUs.
I would recommend staying at the Kubernetes management level, because tools that work at the service level have a delay and most of them cannot look into the self-healing stages of a Kubernetes node.
Octant might have some tools but Lens should do everything we need.
The only exception might be Cortex since it will allow us to get insights during a node re-launch?
In our case the backward microsteps will (should?) be creating the most computing burden so I will have to look at the code first.
Kubernetes has several management tools and Canonical allows for installation as charms if you want to check it out.

@StellaAthena
Member Author

StellaAthena commented Jan 23, 2021

> In our case the backward microsteps will (should?) be creating the most computing burden so I will have to look at the code first.

This is correct. Example timings from the most recent large scale run are:

10.141.250.206: rank=0 time (ms) | optimizer_gradients: 15.57 | optimizer_step: 125.15 | optimizer_allgather: 153.85
10.141.250.206: rank=0 time (ms) | forward_microstep: 9.73 | backward_microstep: 349.99 | backward_inner_microstep: 30.18 | backward_allreduce_microstep: 319.74 | step_microstep: 298.96
10.141.250.206: rank=0 time (ms) | forward: 9.71 | backward: 349.96 | backward_inner: 30.15 | backward_allreduce: 319.71 | step: 298.93
10.141.250.206: rank=0 time (ms) | optimizer_gradients: 23.32 | optimizer_step: 123.67 | optimizer_allgather: 267.97
10.141.250.206: rank=0 time (ms) | forward_microstep: 9.96 | backward_microstep: 448.03 | backward_inner_microstep: 28.11 | backward_allreduce_microstep: 419.85 | step_microstep: 419.11
10.141.250.206: rank=0 time (ms) | forward: 9.94 | backward: 448.01 | backward_inner: 28.08 | backward_allreduce: 419.82 | step: 419.08
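
To make the point in these timings concrete, the following sketch parses one of the log lines above and computes how much of the backward pass is spent in the allreduce. It only assumes the `name: value` layout visible in the pasted output; the helper names `parse_timings` and `allreduce_share` are made up for illustration.

```python
import re

# Matches "timer_name: 12.34" pairs in the part of the line after "time (ms) |".
PAIR = re.compile(r"(\w+): ([\d.]+)")


def parse_timings(line):
    """Return {timer_name: milliseconds} for one timing log line."""
    _, _, fields = line.partition("time (ms) |")
    return {name: float(ms) for name, ms in PAIR.findall(fields)}


def allreduce_share(timings):
    """Fraction of the backward pass spent in the gradient allreduce."""
    return timings["backward_allreduce"] / timings["backward"]


line = (
    "10.141.250.206: rank=0 time (ms) | forward: 9.71 | backward: 349.96 | "
    "backward_inner: 30.15 | backward_allreduce: 319.71 | step: 298.93"
)
timings = parse_timings(line)
print(f"allreduce share of backward: {allreduce_share(timings):.1%}")
```

For the line shown, the allreduce accounts for roughly 91% of the backward time, so the backward cost here is dominated by communication rather than by the gradient computation itself (`backward_inner`).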

leogao2 pushed a commit that referenced this issue Feb 13, 2021
Add Sinusoidal Positional Embedding
@StellaAthena
Member Author

Superseded by subsequent changes.

1T or BUST automation moved this from To do to Done Feb 17, 2021
5 participants