Cull pp = 0 model branch #269

sdtblck · 2021-04-28T14:11:15Z

Ok, this pr is a compendium of lots of stuff.

The main change is that the model branch where pp=0 no longer exists; pipe parallel now defaults to 1. This means we only have to maintain a single model. I verified that both models achieved exactly the same loss here, and pp=1 (GPT2ModelPipe) is in fact slightly faster, for some reason, so I can't see any reason to keep the other model branch around.
Fix the wandb group name logging (we ended up adding multiple uuids to the end of the group name)
Removes a lot of dead code from:
- megatron/mpu/random.py -> handled activation checkpointing. With the pipeline model, this is all handled by deepspeed.
- megatron/fp16 stuff -> handled loss scaling / fp16 conversion. Again, all handled by deepspeed, so it can be safely removed.
- megatron/mpu/grads.py -> handled gradient clipping, also now handled by deepspeed.
- megatron/memory.py -> wasn't used anywhere.
Makes the apex dependency optional (better to have it as apex fusedadam is slightly faster than deepspeed's version).
Some mild reorganization of the layout of megatron/model to make it easier to work with / navigate
Renamed megatron/arguments/megatron_args.py to megatron/arguments/neox_args.py (in the future I think we should rename the whole package to neox - I think we're now sufficiently different lol, but couldn't be arsed to go through the hassle rn)
Updated requirements (separated optional ones from mandatory ones) and updated dockerfile
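
For reference, a minimal sketch of the optional-dependency pattern described above: try the preferred module first and fall back to the next one if it isn't installed. The helper name `first_available` is hypothetical and not taken from this PR. The module paths in the usage note (`apex.optimizers` and `deepspeed.ops.adam`, both of which expose a `FusedAdam` class) are my assumption about where the two optimizers live, not necessarily how training.py does it.

```python
import importlib


def first_available(candidates, attr):
    """Return (module_name, attribute) from the first importable candidate.

    `candidates` is an ordered sequence of module names, most preferred first.
    Returns (None, None) if no candidate can be imported or none has `attr`.
    Hypothetical helper for illustration only.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Optional dependency is not installed; try the next candidate.
            continue
        value = getattr(module, attr, None)
        if value is not None:
            return name, value
    return None, None
```

Assuming those module paths, usage would look something like `first_available(("apex.optimizers", "deepspeed.ops.adam"), "FusedAdam")`, which prefers apex when it is installed and otherwise falls back to deepspeed's version.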

sdtblck · 2021-04-28T14:24:14Z

pls don't merge yet - making some more changes

sdtblck · 2021-04-28T21:05:46Z

Okay, should be ready to merge now.

We can now convert GPT2ModelPipe to a regular nn.Sequential model by calling GPT2ModelPipe.to_sequential(). If pipe parallel is set to 0, we train using this model. This should also enable us to still use ZeRO 2 / 3 etc. if desired.
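
Roughly, the idea looks like the toy sketch below. It is framework-free on purpose: `ToyPipeModel`, `Sequential` and `build_model` are hypothetical stand-ins, not the real GPT2ModelPipe (which builds on deepspeed and converts to a torch nn.Sequential). The point is only that the sequential model reuses the same ordered layers, so both paths compute the same thing.

```python
class Sequential:
    """Plain container that applies its layers in order (stand-in for nn.Sequential)."""

    def __init__(self, *layers):
        self.layers = list(layers)

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class ToyPipeModel:
    """Stand-in for a pipeline model: an ordered list of layers plus a stage count."""

    def __init__(self, layers, num_stages=1):
        self.layers = list(layers)
        self.num_stages = num_stages

    def __call__(self, x):
        # A real pipeline engine would split the layers into stages and
        # schedule them across devices; this toy version just runs them in order.
        for layer in self.layers:
            x = layer(x)
        return x

    def to_sequential(self):
        # Reuse the same layer objects so no weights are copied or re-initialised.
        return Sequential(*self.layers)


def build_model(layers, pipe_parallel_size):
    """Return the sequential model when pipe parallel is 0, else the pipeline model."""
    pipe = ToyPipeModel(layers, num_stages=max(pipe_parallel_size, 1))
    if pipe_parallel_size == 0:
        return pipe.to_sequential()
    return pipe
```

With two toy layers such as `lambda x: x + 1` and `lambda x: x * 2`, `build_model(layers, 0)` and `build_model(layers, 1)` return different container types but give the same result for the same input.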

sdtblck · 2021-04-30T08:51:36Z

Imo everything here is ready to merge - here is training with pipe parallel on / off as well as loading from checkpoint - everything stable and running as expected.

sdtblck added 10 commits April 28, 2021 14:19

fix wandb group stuff

cc2c900

fix checkpointing if deepspeed_activation_checkpointing = true

f92f8c9

get rid of all codepaths where pp = 0, rearrange layout

62e4528

refactor checkpointing

32b7bd1

rename megatron_args to neox_args + remove unused argument

56987c2

remove unused FP16 code (deepspeed handles this)

8b6d515

remove unused gradient clipping code (deepspeed handles this)

b58c48d

remove apex dependency in training.py

d622349

removed unused megatron/memory.py

4e2d64a

update requirements + dockerfile

a7b7b18

sdtblck requested a review from a team as a code owner April 28, 2021 14:11

sdtblck requested review from StellaAthena and leogao2 April 28, 2021 14:11

Merge branch 'main' into cull-model-branch

5e9dc55

This was linked to issues Apr 28, 2021

Get rid of codepath where pp = 0 #243

Closed

Timer logging innacurate if pp=0 #238

Closed

sdtblck added 4 commits April 28, 2021 21:37

get pipe to normal conversion working properly

0b8fee9

Merge remote-tracking branch 'origin/cull-model-branch' into cull-model-branch

c80212e

fix eval_helper

871e679

fix Dockerfile

77fe200

sdtblck added 5 commits April 29, 2021 01:28

get rid of megatron/data/dataset_utils.py

243c60a

update random.py

f19e14a

remove some duplicate code

e5212b1

revert config changes

de042f3

revert changes to checkpointing.py

3cf01de

ShivanshuPurohit previously approved these changes Apr 29, 2021

sweinbach added 2 commits April 29, 2021 11:31

test model update after gpt2 model remove

6f5079f

adding more test configs

1dae917

Merge branch 'testcases_continued' into cull-model-branch

3c59574

kipgparker dismissed ShivanshuPurohit’s stale review via 3c59574 April 29, 2021 12:05

sdtblck and others added 3 commits April 29, 2021 15:18

remove MegatronModule + all custom saving logic (shit's cursed)

df76402

delete deepspeed lmao

09b5d06

revert changes to small config

ac00dbd

sdtblck requested a review from sweinbach April 30, 2021 09:38

sweinbach approved these changes Apr 30, 2021

sdtblck merged commit dc44965 into main Apr 30, 2021

sdtblck deleted the cull-model-branch branch April 30, 2021 09:43

sdtblck mentioned this pull request Apr 30, 2021

Code Cleanup #208

Closed

3 tasks

sdtblck mentioned this pull request Mar 1, 2022

Is PP = 1 faster than Sequential? #574

Closed

StellaAthena mentioned this pull request Mar 1, 2022

Parallel all reduce communication and backprop #573

Closed
