
Monitoring using wandb #108

Merged: 25 commits merged into main from feature/monitoring_wandb on Feb 1, 2021

Conversation

@joshlk (Member) commented Jan 30, 2021

This PR adds monitoring using Weights & Biases (wandb). Each worker submits its own report, and the reports are then aggregated under a common group key. Reports are submitted to an EleutherAI org account, which wandb has kindly provided for free at my request. Please DM me if you want to be added to it.

Instead of using the normal deepspeed entry point, you need to use deepy.py, which is in the root dir. It handles generating a common group key and communicating the wandb API key to all workers; otherwise it's a drop-in replacement.
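A minimal sketch of what such a wrapper might do, assuming the group key travels through wandb's standard WANDB_RUN_GROUP environment variable (the PR's exact mechanism may differ, and multi-node launches would also need the variables exported to each node):

```python
import os
import subprocess
import sys

import wandb.util

# Sketch of a deepy.py-style wrapper (assumed implementation, not the PR's
# exact code): generate one group key for the whole job, then pass it to
# every worker through the deepspeed launcher's environment.
env = dict(os.environ)
env["WANDB_RUN_GROUP"] = "neox_" + wandb.util.generate_id()
subprocess.run(["deepspeed"] + sys.argv[1:], env=env, check=True)
```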

For example, here is how it works:

  1. Log into wandb with `wandb login`. If you aren't logged in, it won't report to wandb.ai, but training will continue as usual.
  2. Run deepspeed using the deepy.py entry point:
./deepy.py train_pipeline.py --deepspeed --deepspeed_config configs/checkpoints_config.json --model configs/gpt3_small.json	

deepy.py == deepspeed

Example report: https://wandb.ai/eleutherai/neox_train_enwik8/groups/eaWBBEYorwPheZ7s8pWXPb?workspace=user-joshlk
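Grouping works on the worker side because every worker starts its own wandb run with the same group name. A minimal sketch (illustrative, not the PR's exact code; the project and run names are assumptions):

```python
import os
import wandb

# Each worker opens its own run; wandb aggregates runs that share a group
# into a single view, like the linked report above.
wandb.init(
    project="neox_train_enwik8",
    group=os.environ["WANDB_RUN_GROUP"],          # same key on every worker
    name=f"{os.uname().nodename}-{os.getpid()}",  # one run per worker
)
wandb.log({"train/loss": 0.0})  # each worker logs its own metrics
```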

@joshlk (Member, Author) commented Jan 30, 2021

Not ready to merge quite yet

@joshlk (Member, Author) commented Jan 31, 2021

Example: [screenshot omitted]

Branch ready for review and to be merged.

@joshlk joshlk marked this pull request as ready for review January 31, 2021 18:17
@joshlk joshlk requested a review from a team as a code owner January 31, 2021 18:17
@StellaAthena (Member) commented:

Any idea what's going on here?

[Screenshot of the error, taken 2021-01-31 at 3:59 PM]

@joshlk (Member, Author) commented Jan 31, 2021

Hmmm. Try `git pull`.

@StellaAthena (Member) commented:

> Hmmm. Try `git pull`.

It says I am up to date.

@joshlk (Member, Author) commented Jan 31, 2021

Odd, as deepy.py has been committed to the branch.

@StellaAthena (Member) commented:

The first time, I built the pod on the main branch and then checked out this branch. I tried again, building with this branch from the start, and now it's working. A little awkward, but not the end of the world.

I am running the tokenization now, and once the model actually starts training I'll merge both this PR and #107.

@joshlk (Member, Author) commented Feb 1, 2021

Here are the detailed steps required to test. It's a bit convoluted because #107 hasn't been merged yet:

  1. By default, a run reports its monitoring to the EleutherAI wandb org account. You can configure this by changing the entry in the model config JSON (see the config sketch below). For the default to work you need to be added to the EleutherAI wandb org account; I have just sent you an invite.
  2. On your local machine, check out the synced-deployment branch and start a deployment using the feature/monitoring_wandb branch's code and image (the image has wandb pip-installed):
kubernetes/deploy_k8s.sh feature/monitoring_wandb 4 stellabiderman joshlk/gpt-neox:wandb

Here I'm referencing the Docker image joshlk/gpt-neox:wandb, which I built manually. Once #107 is merged, we can change the deployment script so it uses the Docker image from the same branch.
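For step 1, the config change might look like the sketch below; `wandb_team` is a hypothetical key name, so check the actual schema in this branch before relying on it.

```python
import json

# Sketch only: "wandb_team" is a hypothetical key for the org/entity a run
# reports to; the real key in the model config schema may differ.
with open("configs/gpt3_small.json") as f:
    config = json.load(f)
config["wandb_team"] = "my-username"  # default reports to the eleutherai org
with open("configs/gpt3_small.json", "w") as f:
    json.dump(config, f, indent=2)
```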

  3. After the deployment script has finished, you will be dropped into a shell on the main node. Log in to wandb there: `wandb login`. Once Better deployment #107 is merged, we can change the deployment script to automatically copy your login details from your local machine to the main node, so this step won't be necessary.
  4. The best test to run first is the dummy pipeline, as you will get quick feedback. So try running:
./deepy.py train_enwik8.py --deepspeed --deepspeed_config configs/checkpoints_config.json

You should then see something like this in the logs/screen output:

10.141.250.251: wandb: Tracking run with wandb version 0.10.15
10.141.250.251: wandb: Syncing run neox-josh-6d4cbb7ccb-x2f72-5
10.141.250.251: wandb: ⭐️ View project at https://wandb.ai/eleutherai/neox_train_enwik8
10.141.250.251: wandb: 🚀 View run at https://wandb.ai/eleutherai/neox_train_enwik8/runs/1o1lyp45
10.141.250.251: wandb: Run data is saved locally in /app/wandb/run-20210201_093554-1o1lyp45
10.141.250.251: wandb: Run `wandb offline` to turn off syncing.

You can then open https://wandb.ai/eleutherai, select the latest run, and you should see it. Let me know if that works 😃

@StellaAthena (Member) commented:

I ran ./deepy.py train_enwik8.py --deepspeed --deepspeed_config configs/checkpoints_config.json successfully. It logs to WandB and things generally look good. Two questions:

  1. I am using two nodes, but I see four "groups" on WandB, three of which eventually stop advancing. Are these past runs or something?
  2. I got the following warning when running the code:
/build_dir/src/deepspeed/deepspeed/runtime/pipe/module.py:533: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if rank_repr is not '':

I then tried replacing train_enwik8.py with train.py and it didn't work, but train_pipeline.py did work. I'm going to call that a win and merge both PRs, then work on improving usability.
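As an aside, the warning above is easy to silence upstream: `is not` compares object identity rather than value, so whether `rank_repr is not ''` holds can depend on string interning. A sketch of the intended value comparison (not the committed DeepSpeed code):

```python
# `is not ''` asks "is this the same object as this literal?", which is an
# interpreter implementation detail; `!= ''` asks about the contents.
rank_repr = ""  # stand-in for the variable in deepspeed's module.py
if rank_repr != "":  # value comparison, no SyntaxWarning
    print(rank_repr)
```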

StellaAthena previously approved these changes Feb 1, 2021
@joshlk (Member, Author) commented Feb 1, 2021

The deployment script now automatically uploads your wandb login details to the main node, so you only have to log in once on your local machine with `wandb login` and it should all work automatically.
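A sketch of what that upload could look like, assuming the key is read from ~/.netrc (where `wandb login` stores it) and copied to the main pod; the pod name is a placeholder:

```python
import os
import subprocess

# Assumed mechanism, not necessarily the script's exact approach: `wandb login`
# writes the API key to ~/.netrc, so copying that file to the main node lets it
# authenticate without a second interactive login.
netrc = os.path.expanduser("~/.netrc")
subprocess.run(["kubectl", "cp", netrc, "neox-main-0:/root/.netrc"], check=True)
```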

Branch ready to be merged.

@StellaAthena StellaAthena merged commit 7996864 into main Feb 1, 2021
@StellaAthena StellaAthena deleted the feature/monitoring_wandb branch February 1, 2021 17:50
StellaAthena added a commit that referenced this pull request Feb 3, 2021
* Monitoring using wandb (#108)

* Wandb for distributed training

* Pin requirements

* Use wandb team account

* Add batch size

* Add batch size

* Add batch size

* Add wandb to all pipelines

* Add wandb to all pipelines

* JSON error

* Include config in base

* Add wandb config to gpt_small

* Add tmux

* Log every pass

* Automatically upload wandb API key from local machine

* Automatically upload wandb API key from local machine

* Substitute forward slash with underscore

* Substitute forward slash with underscore

* Substitute forward slash with underscore

* Use docker actions for tags

* Use docker actions for tags

* Use docker actions for tags

* Substitute `/` with `-`. Same as docker metadata

* Update docker_build.yaml (#109)

Remove layer caching

* add evaluation (#110)

* add evaluation

* Update train_pipeline.py

* Only enable init wandb if API key can be found

* wandb: correct project name

* Removes typo

Co-authored-by: Josh Levy-Kramer <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>

Co-authored-by: Josh Levy-Kramer <[email protected]>
Co-authored-by: sdtblck <[email protected]>
Co-authored-by: Josh Levy-Kramer <[email protected]>