Monitoring using wandb #108
# Conflicts: # configs/base_model.json # configs/gpt3_small.json # train_enwik8_pipeline.py # train_pipeline.py
Not ready to merge quite yet
Example:
Branch ready for review and to be merged
Hmmm. Try ‘git pull’
It says I am up to date
Odd as deepy.py has been committed into the branch
The first time I build the pod on the I am running the tokenization now and once the model starts actually training I’ll merge both this PR and #107
Here are the detailed steps required to test. It's a bit convoluted due to #107 not being merged:
Here I'm referencing the docker image
You then should see something like this in the logs/screen output:
You can then open https://wandb.ai/eleutherai, select the latest run, and you should see it. Let me know if that works 😃
I ran
I then tried replacing
# Conflicts: # Dockerfile # requirements.txt
The deployment script now automatically uploads your wandb login details to the main node, so you only have to log in once on your local machine using `wandb login`.
Branch ready to be merged.
* Monitoring using wandb (#108)
* Wandb for distributed training
* Pin requirements
* Use wandb team account
* Add batch size
* Add batch size
* Add batch size
* Add wandb to all pipelines
* Add wandb to all pipelines
* JSON error
* Include config in base
* Add wandb config to gpt_small
* Add tmux
* Log every pass
* Automatically upload wandb API key from local machine
* Automatically upload wandb API key from local machine
* Substitute forward slash with underscore
* Substitute forward slash with underscore
* Substitute forward slash with underscore
* Use docker actions for tags
* Use docker actions for tags
* Use docker actions for tags
* Substitute `/` with `-`. Same as docker metadata
* Update docker_build.yaml (#109) Remove layer caching
* add evaluation (#110)
* add evaluation
* Update train_pipeline.py
* Only enable init wandb if API key can be found
* wandb: correct project name
* Removes typo

Co-authored-by: Josh Levy-Kramer <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: Josh Levy-Kramer <[email protected]>
Co-authored-by: sdtblck <[email protected]>
Co-authored-by: Josh Levy-Kramer <[email protected]>
Monitoring using Weights & Biases (wandb). Each worker submits its own report. These are then aggregated using a common group key. The reports are submitted to an eleutherai org account. wandb have kindly provided us with a free org account at my request. Please DM me if you want to be added to it.
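To illustrate the grouping idea, here is a minimal sketch of how each worker could start its own run under one shared group key. The helper names (`make_group_key`, `worker_init_kwargs`, `init_worker`), the `worker-<rank>` run names, and the use of `secrets.token_urlsafe` are assumptions made for illustration, not necessarily what this PR does; `project`, `entity`, `group` and `name` are standard `wandb.init` arguments.

```python
import secrets


def make_group_key() -> str:
    """Return a random key that every worker of one training job shares.

    Hypothetical helper: the PR may generate its key differently.
    """
    return secrets.token_urlsafe(16)


def worker_init_kwargs(group: str, rank: int, project: str,
                       entity: str = "eleutherai") -> dict:
    """Build the keyword arguments one worker would pass to wandb.init."""
    return {
        "project": project,        # e.g. "neox_train_enwik8"
        "entity": entity,          # the org account the reports go to
        "group": group,            # same value on every worker
        "name": f"worker-{rank}",  # assumed per-worker run name
    }


def init_worker(group: str, rank: int, project: str):
    """Start this worker's run; wandb is imported lazily so the helpers
    above can be used without wandb installed."""
    import wandb

    return wandb.init(**worker_init_kwargs(group, rank, project))
```

Because every worker passes the same `group` value, the wandb UI can show the per-worker runs together on one group page.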
Instead of using the normal `deepspeed` entry point you need to use `deepy.py`, which is in the root dir. This handles generating a common group key and communicating the wandb API key to all workers. Otherwise it's a drop-in replacement.

For example, here is how it works:
* Log in with `wandb login`. If you aren't logged in, it won't report to wandb.ai but the model will continue as usual.
* Launch with the `deepy.py` entry point: `deepy.py` == `deepspeed`
Example report: https://wandb.ai/eleutherai/neox_train_enwik8/groups/eaWBBEYorwPheZ7s8pWXPb?workspace=user-joshlk
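As a rough sketch of the "find the API key and hand it to the workers" step, the snippet below looks for a key first in the `WANDB_API_KEY` environment variable and then in a netrc file under the `api.wandb.ai` machine entry, which is where `wandb login` normally stores it. The function names and the idea of passing the values on through environment variables (`WANDB_API_KEY`, `WANDB_RUN_GROUP`) are assumptions for illustration; `deepy.py` may communicate them differently.

```python
import netrc
import os
from typing import Mapping, Optional


def find_wandb_api_key(netrc_path: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the wandb API key, or None if the user is not logged in."""
    env = os.environ if env is None else env
    key = env.get("WANDB_API_KEY")
    if key:
        return key
    try:
        # netrc_path=None makes the stdlib read ~/.netrc
        auth = netrc.netrc(netrc_path).authenticators("api.wandb.ai")
    except (OSError, netrc.NetrcParseError):
        return None
    # authenticators() returns (login, account, password) or None
    return auth[2] if auth else None


def worker_env(api_key: Optional[str], group: str) -> dict:
    """Environment variables to export on each worker (hypothetical helper)."""
    env = {"WANDB_RUN_GROUP": group}
    if api_key:
        # Without a key, wandb reporting is skipped and training carries on.
        env["WANDB_API_KEY"] = api_key
    return env
```

Returning `None` rather than raising mirrors the behaviour described above: a missing login disables reporting but does not stop the run.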