Create experiment runners #7

StellaAthena · 2020-12-23T19:44:04Z

We will want to run experiments with a variety of configs and options. To enable this, we need two things:

config files that we can use to specify the settings for a particular run
an experiment runner for managing and automatically executing several runs
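
A minimal sketch of how these two pieces could fit together, assuming one flat JSON file per run and a training entry point that takes its settings as `--key value` flags. The names `load_configs`, `build_command`, `run_all`, and `train.py` are hypothetical and do not refer to anything in this repository.

```python
import json
import subprocess
import sys
from pathlib import Path


def load_configs(config_dir):
    """Return (name, settings) pairs for every *.json file in config_dir, sorted by file name."""
    configs = []
    for path in sorted(Path(config_dir).glob("*.json")):
        with open(path) as f:
            configs.append((path.stem, json.load(f)))
    return configs


def build_command(settings, entry_point="train.py"):
    """Turn a flat settings dict into a command line, e.g. {"lr": 0.1} -> ... --lr 0.1."""
    cmd = [sys.executable, entry_point]  # entry_point is a hypothetical training script
    for key, value in sorted(settings.items()):
        cmd += [f"--{key}", str(value)]
    return cmd


def run_all(config_dir, entry_point="train.py", dry_run=False):
    """Run one process per config, one after another; return {name: exit code or None}."""
    results = {}
    for name, settings in load_configs(config_dir):
        cmd = build_command(settings, entry_point)
        if dry_run:
            # Only show what would be executed.
            print(name, " ".join(cmd))
            results[name] = None
            continue
        results[name] = subprocess.run(cmd).returncode
    return results
```

Calling `run_all("configs", dry_run=True)` prints the command for each config without launching anything, which is a cheap way to check a batch of runs before handing them to a scheduler.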

steven-mi · 2021-01-04T00:38:27Z

Do you want something like a dashboard for executing and managing your experiments, or should it be CLI-based?

StellaAthena · 2021-01-04T04:34:06Z

> Do you want to have something like a dashboard for executing and managing your experiments or should it be CLI based?

We would be happy with either, but a dashboard would probably be ideal. FYI, the current interface is to write scripts and then call sh scripts/script_name.sh.

We already have a TensorBoard instance. Is this something that can easily be modified to also manage experiments?

steven-mi · 2021-01-04T12:01:02Z

> We would be happy with either, but a dashboard would probably be ideal. FYI, the current interface is to write scripts and then call sh scripts/script_name.sh.

Yup, I saw that. It didn't look optimal.

> We already have a tensorboard... is this something that can easily be modified to also manage experiments?

TensorBoard is more of a visualization tool. I'm not really sure it is able to manage or start experiments.

If you want a way to schedule your Python scripts, you could use an orchestration system like:

Airflow
Kubeflow
Mara
Luigi
All of them come with a web UI in which you can monitor and manage your scripts. These tools won't give you an overview of your experiments the way TensorBoard does. Each is simply a dashboard for starting, stopping, and monitoring the state of your training.

e.g. when using Airflow it would look like this:

If you just want a dashboard where you can look up your experiment results, then maybe MLflow, although for this case TensorBoard would also be fine.

StellaAthena · 2021-01-04T14:25:21Z

@steven-mi I assume Kubeflow plays nicely with Kubernetes? Kubernetes is what we are already using on the CoreWeave servers so that sounds ideal.

glebshevchukk · 2021-01-12T20:03:14Z

Do folks want to use something like Weights & Biases? It's a bit harder to make public, but I find it really useful for keeping track of experiments, config files, and sweeps.

dreamindata · 2021-01-23T22:05:13Z

Kubeflow will give us the functional insights we need.
We can also use Argo to look at the layout of individual GPUs and the CPU instances they are connected to.
If we need to go further and have more extensive logging, we can use ELK or ELP, etc., once we set up a separate compute node and reserve the logging storage.
To keep track of experiments, Kedro has a bit of a learning curve, but the function visualization interface makes it worth it. Kedro-Viz will also come in useful later when users need to create pipelines to segregate inappropriate content.
Weights and Biases (wandb.ai) reports look professional and the monitoring is easy to implement, but the output is kludgy. Whatever you guys think is best is fine by me.

glebshevchukk · 2021-01-23T22:16:57Z

@dreamindata I've never used Argo before but it looks neat. I think one of the most important uses for a runner is to make sure that work is actually being spread evenly across all machines/processes. What do you think is best for keeping track of that?

StellaAthena · 2021-01-23T22:33:16Z

We have this partly set up in Kubernetes + Docker currently. I use a Kubernetes GUI called Lens to keep an eye on load balancing.

dreamindata · 2021-01-23T22:45:37Z

> @dreamindata I've never used Argo before but it looks neat. I think one of the most important uses for a runner is to make sure that work is actually being spread evenly across all machines/processes. What do you think is best for keeping track of that?

If we change some of the Docker defaults, we can give each GPU a little more breathing room and the work will stay well distributed across multiple GPUs.
I would recommend staying at the Kubernetes management level, because tools that work at the service level have a delay and most of them cannot look into the self-healing stages of a Kubernetes node.
Octant might have some tools but Lens should do everything we need.
The only exception might be Cortex since it will allow us to get insights during a node re-launch?
In our case the backward microsteps will (should?) be creating the most computing burden so I will have to look at the code first.
Kubernetes has several management tools and Canonical allows for installation as charms if you want to check it out.

StellaAthena · 2021-01-23T22:49:21Z

> In our case the backward microsteps will (should?) be creating the most computing burden so I will have to look at the code first.

This is correct. Example timings from the most recent large-scale run are:

10.141.250.206: rank=0 time (ms) | optimizer_gradients: 15.57 | optimizer_step: 125.15 | optimizer_allgather: 153.85
10.141.250.206: rank=0 time (ms) | forward_microstep: 9.73 | backward_microstep: 349.99 | backward_inner_microstep: 30.18 | backward_allreduce_microstep: 319.74 | step_microstep: 298.96
10.141.250.206: rank=0 time (ms) | forward: 9.71 | backward: 349.96 | backward_inner: 30.15 | backward_allreduce: 319.71 | step: 298.93
10.141.250.206: rank=0 time (ms) | optimizer_gradients: 23.32 | optimizer_step: 123.67 | optimizer_allgather: 267.97
10.141.250.206: rank=0 time (ms) | forward_microstep: 9.96 | backward_microstep: 448.03 | backward_inner_microstep: 28.11 | backward_allreduce_microstep: 419.85 | step_microstep: 419.11
10.141.250.206: rank=0 time (ms) | forward: 9.94 | backward: 448.01 | backward_inner: 28.08 | backward_allreduce: 419.82 | step: 419.08
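
To put a number on that, here is a small sketch that parses a line in the format above and reports how much of the backward microstep is spent in the allreduce. The function names are hypothetical, and the only assumption is the `name: value | name: value` layout that follows `time (ms)` in these logs.

```python
import re

# Matches "timer_name: 12.34" pairs.
TIMER_RE = re.compile(r"(\w+): ([0-9.]+)")


def parse_timers(line):
    """Extract {timer_name: milliseconds} from one 'time (ms) | a: 1.0 | b: 2.0' log line."""
    # Drop the host/rank prefix so only the timer section is scanned.
    _, _, timers = line.partition("time (ms)")
    return {name: float(ms) for name, ms in TIMER_RE.findall(timers)}


def allreduce_share(line):
    """Fraction of backward_microstep time spent in backward_allreduce_microstep."""
    t = parse_timers(line)
    return t["backward_allreduce_microstep"] / t["backward_microstep"]


line = ("10.141.250.206: rank=0 time (ms) | forward_microstep: 9.73 | "
        "backward_microstep: 349.99 | backward_inner_microstep: 30.18 | "
        "backward_allreduce_microstep: 319.74 | step_microstep: 298.96")
print(f"allreduce share of backward: {allreduce_share(line):.1%}")
```

For the first microstep line this comes out at roughly 91%, so the backward pass is dominated by the allreduce rather than by the gradient computation itself.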

StellaAthena · 2021-02-17T17:26:10Z

Superseded by subsequent changes

StellaAthena added the feature request label Dec 23, 2020

StellaAthena added the good first issue label Jan 4, 2021

StellaAthena assigned joshlk Jan 28, 2021

leogao2 pushed a commit that referenced this issue Feb 13, 2021

Merge pull request #7 from EleutherAI/sinusoid_pos_emb

261b0ec

Add Sinusoidal Positional Embedding

leogao2 pushed a commit that referenced this issue Feb 13, 2021

Merge pull request #7 from EleutherAI/sinusoid_pos_emb

94a6305

Add Sinusoidal Positional Embedding

StellaAthena closed this as completed Feb 17, 2021
