Pythia Checkpoint Loading #4

Closed
kshitijkg opened this issue May 30, 2023 · 1 comment
kshitijkg commented May 30, 2023

We need to load Pythia checkpoints for MAGMA training.
Main issue: mismatch between the weights in the checkpoint and the weights in the MAGMA model.
Sources of mismatch:

  1. Naming change due to the Attention module being re-set to the AdapterWrapper (https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/model/adapter.py#L141), resulting in weight names changing, for example from 2.attention.query_key_value.weight to 2.attention.attn_block.query_key_value.weight (see the sketch below).

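For reference, a minimal sketch of how wrapping the attention module shifts the state_dict key names. The Attention and Block classes here are toy stand-ins, not the real GPT-NeoX modules; only the attn_block attribute name mirrors the actual AdapterWrapper.

```python
import torch.nn as nn

# Illustrative stand-ins for the real GPT-NeoX attention module and the MAGMA
# AdapterWrapper; everything except the "attn_block" attribute name is assumed.
class Attention(nn.Module):
    def __init__(self, hidden_size=8):
        super().__init__()
        self.query_key_value = nn.Linear(hidden_size, 3 * hidden_size)

class AdapterWrapper(nn.Module):
    def __init__(self, attn_block):
        super().__init__()
        # Storing the wrapped module under the attribute "attn_block" is what
        # inserts the extra segment into every state_dict key.
        self.attn_block = attn_block
        self.adapter = nn.Linear(8, 8)

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = Attention()

block = Block()
print(list(block.state_dict()))
# ['attention.query_key_value.weight', 'attention.query_key_value.bias']

block.attention = AdapterWrapper(block.attention)
print(list(block.state_dict()))
# ['attention.attn_block.query_key_value.weight',
#  'attention.attn_block.query_key_value.bias',
#  'attention.adapter.weight', 'attention.adapter.bias']
```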
Proposed solutions:
Without changing names in the Pythia checkpoint:

  1. Add adapters after loading the checkpoint, so the restructuring happens after the weights have already been loaded. Disadvantages: adapter weights will have to be loaded separately, and the code will be duplicated and not clean.
  2. Get the class from the module (https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/model/adapter.py#L129), then inherit from it and override the init and forward functions to include adapters. The structure remains the same, but this does not work since we don't easily have the initialization arguments to recreate the attention module; we only have the initialized object (see the sketch after this list).

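A rough sketch of what option 2 would look like and where it breaks down, reusing the toy Attention class from the sketch above rather than the real module:

```python
import torch.nn as nn

# Toy version of option 2: recover the class of the already-initialized
# attention module and subclass it to add adapters.
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.query_key_value = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x):
        return self.query_key_value(x)

existing_attention = Attention(hidden_size=8)   # built elsewhere; init args not kept around
AttnCls = type(existing_attention)              # the class itself is easy to recover

class AdapterAttention(AttnCls):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)       # needs the original constructor arguments
        self.adapter = nn.Linear(24, 24)

    def forward(self, x):
        hidden = super().forward(x)
        return hidden + self.adapter(hidden)

# Key names stay flat ('query_key_value.weight', 'adapter.weight'), so the
# checkpoint would load unchanged -- but constructing AdapterAttention(...)
# requires the original init arguments, which the initialized object does not
# carry; that is why this option was rejected.
adapter_attention = AdapterAttention(hidden_size=8)
print(list(adapter_attention.state_dict()))
```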
Changing the names in the Pythia checkpoint:

  1. Rename the weights from attention to attention.attn_block and mlp to mlp.attn_block, store the checkpoint again, and use the new checkpoint (a renaming sketch follows this list).
  2. Override with a custom load fn that does the renaming on the fly (Pythia checkpoint loading #3). This solution will not work in the future when we are using pipeline parallelism: custom_load_fn is not supported with pipeline parallelism.

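A minimal sketch of option 1, assuming a single consolidated .pt file whose weights sit under a "module" entry; the file names and that layout are assumptions, and the real Pythia/GPT-NeoX checkpoints are sharded, so the actual convert script in PR #10 will differ.

```python
import torch

def rename_keys(state_dict):
    """Insert the attn_block segment so keys match the AdapterWrapper layout."""
    renamed = {}
    for key, value in state_dict.items():
        key = key.replace(".attention.", ".attention.attn_block.")
        key = key.replace(".mlp.", ".mlp.attn_block.")
        renamed[key] = value
    return renamed

ckpt = torch.load("pythia_checkpoint.pt", map_location="cpu")  # hypothetical path
ckpt["module"] = rename_keys(ckpt["module"])                   # "module" entry is an assumption
torch.save(ckpt, "pythia_checkpoint_renamed.pt")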
Mismatch Source 2:
2. Additional weights in MAGMA, due to the image prefix and adapters.
Proposed solution: can be resolved by setting strict=False when loading the checkpoint. Not the best solution, and it can be risky, but the plan is to quickly verify that all the weights that don't match are due only to the image prefix and adapters, so that people can train; after the first mismatch has been fixed, set strict=False (see the sketch below). We can find a better solution once everyone is able to use the code to port their changes and do test runs.
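A sketch of the strict=False check, assuming `model` is the MAGMA model as a plain nn.Module and reusing the renamed checkpoint from the sketch above; the real loading path goes through DeeperSpeed rather than load_state_dict directly, and the "image_prefix" / ".adapter" name patterns are assumptions.

```python
import torch

state_dict = torch.load("pythia_checkpoint_renamed.pt", map_location="cpu")["module"]
result = model.load_state_dict(state_dict, strict=False)

# Every missing key should belong to the image prefix or the adapters; anything
# else means a naming mismatch slipped through and still needs the rename fix.
leftover = [k for k in result.missing_keys
            if not (k.startswith("image_prefix") or ".adapter" in k)]
assert not result.unexpected_keys, result.unexpected_keys
assert not leftover, leftover
```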

@kshitijkg kshitijkg added the bug Something isn't working label May 30, 2023
@kshitijkg kshitijkg self-assigned this May 30, 2023
@kshitijkg kshitijkg pinned this issue May 30, 2023
@kshitijkg kshitijkg unpinned this issue May 30, 2023
@floatingbigcat floatingbigcat added the enhancement New feature or request label May 31, 2023
kshitijkg commented May 31, 2023

Current solution: the renaming approach (option 1 under "Changing the names"). Rename the weights from attention to attention.attn_block and mlp to mlp.attn_block, store the checkpoint again, and use the new checkpoint.
PR: #10

We just need to run the convert checkpoint script and use the converted checkpoint for loading.

Additionally, we set strict=False so that the image prefix and adapters are ignored. I have checked manually whether any other weights exist that don't have the right name, and everything looks correct.

Lastly, this requires another change in the DeeperSpeed code; use the following branch: https://github.com/EleutherAI/DeeperSpeed/tree/robin_summit

@kshitijkg kshitijkg added this to the Robin V0 milestone Jun 16, 2023