
Cull pp = 0 model branch #269

Merged 26 commits on Apr 30, 2021
Changes from 1 commit
Commits (26)
cc2c900
fix wandb group stuff
sdtblck Apr 28, 2021
f92f8c9
fix checkpointing if deepspeed_activation_checkpointing = true
sdtblck Apr 28, 2021
62e4528
get rid of all codepaths where pp = 0, rearrange layout
sdtblck Apr 28, 2021
32b7bd1
refactor checkpointing
sdtblck Apr 28, 2021
56987c2
rename megatron_args to neox_args + remove unused argument
sdtblck Apr 28, 2021
8b6d515
remove unused FP16 code (deepspeed handles this)
sdtblck Apr 28, 2021
b58c48d
remove unused gradient clipping code (deepspeed handles this)
sdtblck Apr 28, 2021
d622349
remove apex dependency in training.py
sdtblck Apr 28, 2021
4e2d64a
removed unused megatron/memory.py
sdtblck Apr 28, 2021
a7b7b18
update requirements + dockerfile
sdtblck Apr 28, 2021
5e9dc55
Merge branch 'main' into cull-model-branch
sdtblck Apr 28, 2021
0b8fee9
get pipe to normal conversion working properly
sdtblck Apr 28, 2021
c80212e
Merge remote-tracking branch 'origin/cull-model-branch' into cull-mod…
sdtblck Apr 28, 2021
871e679
fix eval_helper
sdtblck Apr 28, 2021
77fe200
fix Dockerfile
sdtblck Apr 28, 2021
243c60a
get rid of megatron/data/dataset_utils.py
sdtblck Apr 28, 2021
f19e14a
update random.py
sdtblck Apr 28, 2021
e5212b1
remove some duplicate code
sdtblck Apr 28, 2021
de042f3
revert config changes
sdtblck Apr 28, 2021
3cf01de
revert changes to checkpointing.py
sdtblck Apr 28, 2021
6f5079f
test model update after gpt2 model remove
Apr 29, 2021
1dae917
adding more test configs
Apr 29, 2021
3c59574
Merge branch 'testcases_continued' into cull-model-branch
kipgparker Apr 29, 2021
df76402
remove MegatronModule + all custom saving logic (shit's cursed)
sdtblck Apr 29, 2021
09b5d06
delete deepspeed lmao
sdtblck Apr 30, 2021
ac00dbd
revert changes to small config
sdtblck Apr 30, 2021
update random.py
sdtblck committed Apr 28, 2021
commit f19e14aa4c3e1bdcaa9bc564194942f6bf5cf03c
4 changes: 2 additions & 2 deletions configs/local_setup.yml
@@ -3,8 +3,8 @@
   "data-path": "data/enron/enron_text_document",
   "vocab-file": "data/gpt2-vocab.json",
   "merge-file": "data/gpt2-merges.txt",
-  "save": "checkpoints",
-  "load": "checkpoints",
+  # "save": "checkpoints222232",
+  # "load": "checkpoints222232",
   "tensorboard-dir": "tensorboard",
   "log-dir": "logs",
 }
9 changes: 5 additions & 4 deletions configs/small.yml
@@ -4,7 +4,9 @@
 # across the node boundaries )
 "pipe-parallel-size": 1,
 "model-parallel-size": 1,
-
+"deepspeed_activation_checkpointing": true,
+# "log_param_norm": true,
+"wandb_group": "test_pipe_convert",
 # model settings
 "num-layers": 12,
 "hidden-size": 768,
@@ -13,13 +15,12 @@
 "max-position-embeddings": 2048,
 "norm": "layernorm",
 "pos-emb": "rotary",
-"no-weight-tying": true,
+"no-weight-tying": false,

 # these should provide some speedup but takes a while to build, set to true if desired
 "scaled-upper-triang-masked-softmax-fusion": false,
 "bias-gelu-fusion": false,
-

 # optimizer settings
 "optimizer": {
   "type": "Adam",
@@ -72,7 +73,7 @@
 "distributed-backend": "nccl",
 "lr-decay-style": "cosine",
 "warmup": 0.01,
-"save-interval": 10000,
+"save-interval": 500,
 "eval-interval": 1000,
 "eval-iters": 10,
10 changes: 8 additions & 2 deletions megatron/mpu/random.py
@@ -2,6 +2,7 @@
 # TODO: should be able to get rid of this file entirely

 import deepspeed
+import deepspeed.runtime.activation_checkpointing.checkpointing as checkpointing

 # Default name for the model parallel rng tracker.
 _MODEL_PARALLEL_RNG_TRACKER_NAME = deepspeed.checkpointing._MODEL_PARALLEL_RNG_TRACKER_NAME
@@ -12,5 +13,10 @@
 # RNG tracker object.
 _CUDA_RNG_STATE_TRACKER = deepspeed.checkpointing._CUDA_RNG_STATE_TRACKER

-from deepspeed.runtime.activation_checkpointing.checkpointing import _set_cuda_rng_state, checkpoint, \
-    model_parallel_cuda_manual_seed, get_cuda_rng_tracker
+
+# Deepspeed checkpointing functions
+# TODO: replace calls to these in our codebase with calls to the deepspeed ones
+_set_cuda_rng_state = checkpointing._set_cuda_rng_state
+checkpoint = checkpointing.checkpoint
+model_parallel_cuda_manual_seed = checkpointing.model_parallel_cuda_manual_seed
+get_cuda_rng_tracker = checkpointing.get_cuda_rng_tracker
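The hunk above replaces a `from … import` re-export with explicit attribute aliasing against the imported module object. A minimal, self-contained sketch of that compatibility-shim pattern, using the stdlib `hashlib` as a hypothetical stand-in for `deepspeed.runtime.activation_checkpointing.checkpointing` so the example runs without deepspeed installed:

```python
# Compatibility-shim sketch: bind names from an upstream module as
# module-level aliases, so existing call sites that import from this
# module keep working while the local implementations are deleted.
import hashlib as checkpointing  # hypothetical stand-in for deepspeed's checkpointing module

# The alias is the very same object as the upstream attribute, so
# behaviour is identical to importing from upstream directly.
sha256 = checkpointing.sha256  # plays the role of e.g. get_cuda_rng_tracker

digest = sha256(b"neox").hexdigest()
print(len(digest))  # a SHA-256 hex digest is 64 characters
```

Because only names are rebound (no wrapping), `sha256 is checkpointing.sha256` holds, which is why the PR can delete megatron's own checkpointing code without changing any call sites.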