
Cull pp = 0 model branch #269

Merged Apr 30, 2021 (26 commits)

Commits
cc2c900
fix wandb group stuff
sdtblck Apr 28, 2021
f92f8c9
fix checkpointing if deepspeed_activation_checkpointing = true
sdtblck Apr 28, 2021
62e4528
get rid of all codepaths where pp = 0, rearrange layout
sdtblck Apr 28, 2021
32b7bd1
refactor checkpointing
sdtblck Apr 28, 2021
56987c2
rename megatron_args to neox_args + remove unused argument
sdtblck Apr 28, 2021
8b6d515
remove unused FP16 code (deepspeed handles this)
sdtblck Apr 28, 2021
b58c48d
remove unused gradient clipping code (deepspeed handles this)
sdtblck Apr 28, 2021
d622349
remove apex dependency in training.py
sdtblck Apr 28, 2021
4e2d64a
removed unused megatron/memory.py
sdtblck Apr 28, 2021
a7b7b18
update requirements + dockerfile
sdtblck Apr 28, 2021
5e9dc55
Merge branch 'main' into cull-model-branch
sdtblck Apr 28, 2021
0b8fee9
get pipe to normal conversion working properly
sdtblck Apr 28, 2021
c80212e
Merge remote-tracking branch 'origin/cull-model-branch' into cull-mod…
sdtblck Apr 28, 2021
871e679
fix eval_helper
sdtblck Apr 28, 2021
77fe200
fix Dockerfile
sdtblck Apr 28, 2021
243c60a
get rid of megatron/data/dataset_utils.py
sdtblck Apr 28, 2021
f19e14a
update random.py
sdtblck Apr 28, 2021
e5212b1
remove some duplicate code
sdtblck Apr 28, 2021
de042f3
revert config changes
sdtblck Apr 28, 2021
3cf01de
revert changes to checkpointing.py
sdtblck Apr 28, 2021
6f5079f
test model update after gpt2 model remove
Apr 29, 2021
1dae917
adding more test configs
Apr 29, 2021
3c59574
Merge branch 'testcases_continued' into cull-model-branch
kipgparker Apr 29, 2021
df76402
remove MegatronModule + all custom saving logic (shit's cursed)
sdtblck Apr 29, 2021
09b5d06
delete deepspeed lmao
sdtblck Apr 30, 2021
ac00dbd
revert changes to small config
sdtblck Apr 30, 2021
revert changes to checkpointing.py
sdtblck committed Apr 28, 2021
commit 3cf01de15ac405fdd0a1d19ccd812448fd47e30d
9 changes: 7 additions & 2 deletions megatron/checkpointing.py

```diff
@@ -105,8 +105,7 @@ def delete_old_checkpoints(save_dir, n_to_keep):
 def save_ds_checkpoint(iteration, model, args):
     """Save a model checkpoint."""
 
-    sd = {}
-    sd['iteration'] = iteration
+    sd = {'iteration': iteration}
     # rng states.
     if not args.no_save_rng:
         sd['random_rng_state'] = random.getstate()
@@ -115,6 +114,12 @@ def save_ds_checkpoint(iteration, model, args):
         sd['cuda_rng_state'] = torch.cuda.get_rng_state()
         sd['rng_tracker_states'] = mpu.get_cuda_rng_tracker().get_states()
 
+    if not args.is_pipe_parallel:
+        # megatron model uses state_dict_for_save_checkpoint instead of the standard state_dict
+        # state_dict is used by deepspeed for module saving, so it needs to point to the right function
+        model.module.state_dict = model.module.state_dict_for_save_checkpoint
+    # Pipeline parallelism manages its own state dict
+
     model.save_checkpoint(args.save, client_state=sd)
```
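The added hunk works by rebinding `state_dict` on the module *instance* so that DeepSpeed's save path, which only ever calls `state_dict()`, transparently picks up Megatron's custom serialization method. A minimal pure-Python sketch of that instance-level override follows; `TinyModule`, `deepspeed_style_save`, and all values are hypothetical stand-ins for illustration, not the real Megatron or DeepSpeed APIs:

```python
class TinyModule:
    """Hypothetical stand-in for model.module (not the real Megatron class)."""

    def state_dict(self):
        # Standard serialization path that an external saver would normally call.
        return {"weight": [0.0, 0.0]}

    def state_dict_for_save_checkpoint(self):
        # Custom serialization format, analogous to Megatron's method of the same name.
        return {"checkpoint_version": 3.0, "weight": [0.0, 0.0]}


def deepspeed_style_save(module, client_state):
    """Stand-in for a saver (like DeepSpeed) that only knows about state_dict()."""
    return {"module": module.state_dict(), "client_state": client_state}


m = TinyModule()
is_pipe_parallel = False
if not is_pipe_parallel:
    # Rebind state_dict on the instance so the external saver uses the custom
    # method, mirroring:
    #   model.module.state_dict = model.module.state_dict_for_save_checkpoint
    m.state_dict = m.state_dict_for_save_checkpoint

ckpt = deepspeed_style_save(m, client_state={"iteration": 100})
```

Because the bound method is assigned to the instance, attribute lookup finds it before the class-level `state_dict`, so the saver serializes the custom format without any change to the saver itself; the pipeline-parallel branch skips the override because the pipeline engine manages its own state dict.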