
Monitoring using wandb #108

Merged: 25 commits merged into main from feature/monitoring_wandb on Feb 1, 2021

Conversation

@joshlk (Member) commented Jan 30, 2021

This PR adds monitoring using Weights & Biases (wandb). Each worker submits its own report, and the reports are then aggregated under a common group key. Reports are submitted to an EleutherAI org account, which wandb has kindly provided for free at my request. Please DM me if you want to be added to it.

Instead of using the normal deepspeed entry point, you need to use deepy.py, which is in the root dir. It handles generating a common group key and communicating the wandb API key to all workers; otherwise it's a drop-in replacement.
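A minimal sketch of what such a wrapper might do, assuming the group key travels through wandb's standard WANDB_RUN_GROUP environment variable (the PR's exact mechanism may differ, and multi-node launches would also need the variables exported to each node):

```python
import os
import subprocess
import sys

import wandb.util

# Sketch of a deepy.py-style wrapper (assumed implementation, not the PR's
# exact code): generate one group key for the whole job, then pass it to
# every worker through the deepspeed launcher's environment.
env = dict(os.environ)
env["WANDB_RUN_GROUP"] = "neox_" + wandb.util.generate_id()
subprocess.run(["deepspeed"] + sys.argv[1:], env=env, check=True)
```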

For example, here is how it works:

  1. Log into wandb with `wandb login`. If you aren't logged in, it won't report to wandb.ai, but training will continue as usual.
  2. Run deepspeed using the deepy.py entry point:
./deepy.py train_pipeline.py --deepspeed --deepspeed_config configs/checkpoints_config.json --model configs/gpt3_small.json	

deepy.py == deepspeed

Example report: https://wandb.ai/eleutherai/neox_train_enwik8/groups/eaWBBEYorwPheZ7s8pWXPb?workspace=user-joshlk
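Grouping works on the worker side because every worker starts its own wandb run with the same group name. A minimal sketch (illustrative, not the PR's exact code; the project and run names are assumptions):

```python
import os
import wandb

# Each worker opens its own run; wandb aggregates runs that share a group
# into a single view, like the linked report above.
wandb.init(
    project="neox_train_enwik8",
    group=os.environ["WANDB_RUN_GROUP"],          # same key on every worker
    name=f"{os.uname().nodename}-{os.getpid()}",  # one run per worker
)
wandb.log({"train/loss": 0.0})  # each worker logs its own metrics
```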

@joshlk (Member, Author) commented Jan 30, 2021

Not ready to merge quite yet

@joshlk (Member, Author) commented Jan 31, 2021

Example: [screenshot omitted]

Branch ready for review and to be merged.

@joshlk joshlk marked this pull request as ready for review January 31, 2021 18:17
@joshlk joshlk requested a review from a team as a code owner January 31, 2021 18:17
@StellaAthena (Member) commented:

Any idea what's going on here?

[Screenshot of the error, taken 2021-01-31 at 3:59 PM]

@joshlk (Member, Author) commented Jan 31, 2021

Hmmm. Try `git pull`.

@StellaAthena (Member) commented:

> Hmmm. Try `git pull`.

It says I am up to date.

@joshlk (Member, Author) commented Jan 31, 2021

Odd, as deepy.py has been committed to the branch.

@StellaAthena (Member) commented:

The first time, I built the pod on the main branch and then checked out this branch. I tried again, building with this branch from the start, and now it's working. A little awkward, but not the end of the world.

I am running the tokenization now, and once the model actually starts training I'll merge both this PR and #107.

@joshlk (Member, Author) commented Feb 1, 2021

Here are the detailed steps required to test. It's a bit convoluted because #107 hasn't been merged yet:

  1. By default, a run reports its monitoring to the EleutherAI wandb org account. You can configure this by changing the entry in the model config JSON (see the config sketch below). For the default to work you need to be added to the EleutherAI wandb org account; I have just sent you an invite.
  2. On your local machine, check out the synced-deployment branch and start a deployment using the feature/monitoring_wandb branch's code and image (the image has wandb pip-installed):
kubernetes/deploy_k8s.sh feature/monitoring_wandb 4 stellabiderman joshlk/gpt-neox:wandb

Here I'm referencing the Docker image joshlk/gpt-neox:wandb, which I built manually. Once #107 is merged, we can change the deployment script so it uses the Docker image from the same branch.
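For step 1, the config change might look like the sketch below; `wandb_team` is a hypothetical key name, so check the actual schema in this branch before relying on it.

```python
import json

# Sketch only: "wandb_team" is a hypothetical key for the org/entity a run
# reports to; the real key in the model config schema may differ.
with open("configs/gpt3_small.json") as f:
    config = json.load(f)
config["wandb_team"] = "my-username"  # default reports to the eleutherai org
with open("configs/gpt3_small.json", "w") as f:
    json.dump(config, f, indent=2)
```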

  3. After the deployment script has finished, you will be dropped into a shell on the main node. Log in to wandb there: `wandb login`. Once Better deployment #107 is merged, we can change the deployment script to automatically copy your login details from your local machine to the main node, so this step won't be necessary.
  4. The best test to run first is the dummy pipeline, as you will get quick feedback. So try running:
./deepy.py train_enwik8.py --deepspeed --deepspeed_config configs/checkpoints_config.json

You should then see something like this in the logs/screen output:

10.141.250.251: wandb: Tracking run with wandb version 0.10.15
10.141.250.251: wandb: Syncing run neox-josh-6d4cbb7ccb-x2f72-5
10.141.250.251: wandb: ⭐️ View project at https://wandb.ai/eleutherai/neox_train_enwik8
10.141.250.251: wandb: 🚀 View run at https://wandb.ai/eleutherai/neox_train_enwik8/runs/1o1lyp45
10.141.250.251: wandb: Run data is saved locally in /app/wandb/run-20210201_093554-1o1lyp45
10.141.250.251: wandb: Run `wandb offline` to turn off syncing.

You can then open https://wandb.ai/eleutherai, select the latest run, and you should see it. Let me know if that works 😃

@StellaAthena (Member) commented:

I ran ./deepy.py train_enwik8.py --deepspeed --deepspeed_config configs/checkpoints_config.json successfully. It logs to WandB and things generally look good. Two questions:

  1. I am using two nodes, but I see four "groups" on WandB, three of which eventually stop advancing. Are these past runs or something?
  2. I got the following warning when running the code:
/build_dir/src/deepspeed/deepspeed/runtime/pipe/module.py:533: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if rank_repr is not '':

I then tried replacing train_enwik8.py with train.py and it didn't work, but train_pipeline.py did work. I'm going to call that a win and merge both PRs, then work on improving usability.
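As an aside, the warning above is easy to silence upstream: `is not` compares object identity rather than value, so whether `rank_repr is not ''` holds can depend on string interning. A sketch of the intended value comparison (not the committed DeepSpeed code):

```python
# `is not ''` asks "is this the same object as this literal?", which is an
# interpreter implementation detail; `!= ''` asks about the contents.
rank_repr = ""  # stand-in for the variable in deepspeed's module.py
if rank_repr != "":  # value comparison, no SyntaxWarning
    print(rank_repr)
```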

StellaAthena previously approved these changes Feb 1, 2021
@joshlk (Member, Author) commented Feb 1, 2021

The deployment script now automatically uploads your wandb login details to the main node, so you only have to log in once on your local machine with `wandb login` and it should all work automatically.
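A sketch of what that upload could look like, assuming the key is read from ~/.netrc (where `wandb login` stores it) and copied to the main pod; the pod name is a placeholder:

```python
import os
import subprocess

# Assumed mechanism, not necessarily the script's exact approach: `wandb login`
# writes the API key to ~/.netrc, so copying that file to the main node lets it
# authenticate without a second interactive login.
netrc = os.path.expanduser("~/.netrc")
subprocess.run(["kubectl", "cp", netrc, "neox-main-0:/root/.netrc"], check=True)
```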

Branch ready to be merged.

@StellaAthena StellaAthena merged commit 7996864 into main Feb 1, 2021
@StellaAthena StellaAthena deleted the feature/monitoring_wandb branch February 1, 2021 17:50
StellaAthena added a commit that referenced this pull request Feb 3, 2021
* Monitoring using wandb (#108)

* Wandb for distributed training

* Pin requirements

* Use wandb team account

* Add batch size

* Add batch size

* Add batch size

* Add wandb to all pipelines

* Add wandb to all pipelines

* JSON error

* Include config in base

* Add wandb config to gpt_small

* Add tmux

* Log every pass

* Automatically upload wandb API key from local machine

* Automatically upload wandb API key from local machine

* Substitute forward slash with underscore

* Substitute forward slash with underscore

* Substitute forward slash with underscore

* Use docker actions for tags

* Use docker actions for tags

* Use docker actions for tags

* Substitute `/` with `-`. Same as docker metadata

* Update docker_build.yaml (#109)

Remove layer caching

* add evaluation (#110)

* add evaluation

* Update train_pipeline.py

* Only enable init wandb if API key can be found

* wandb: correct project name

* Removes typo

Co-authored-by: Josh Levy-Kramer <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>

Co-authored-by: Josh Levy-Kramer <[email protected]>
Co-authored-by: sdtblck <[email protected]>
Co-authored-by: Josh Levy-Kramer <[email protected]>