fix(modeling): deepspeed checkpoint loading #482

Merged: 15 commits merged into main from fix-checkpoint-loading on Aug 8, 2023

Conversation

@maxreciprocate (Collaborator) commented on May 22, 2023

This PR adds an option to load and resume training from previously saved DeepSpeed checkpoints.

Example of successive training:

accelerate launch --config_file configs/accelerate/zero3.yaml examples/ppo_sentiments.py
accelerate launch --config_file configs/accelerate/zero3.yaml examples/ppo_sentiments.py '{"train": {"resume_from_checkpoint": "ckpts/best_checkpoint"}}'
  • Verify ZeRO3 loading
  • Update docstrings
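
For reference, the same resume override can be expressed through the Python API instead of the CLI JSON override above. The following is a hedged sketch, not code from this PR: default_ppo_config and trlx.train are assumed from trlx's examples, and the reward function and prompts are placeholders.

    # Hedged sketch: resume training from a saved checkpoint via the config object.
    import trlx
    from trlx.data.default_configs import default_ppo_config  # assumed helper from trlx examples

    config = default_ppo_config()
    # Option added by this PR: point training at a previously saved checkpoint.
    config.train.resume_from_checkpoint = "ckpts/best_checkpoint"

    trlx.train(
        reward_fn=lambda samples, **kwargs: [float(len(s)) for s in samples],  # placeholder reward
        prompts=["I enjoyed this movie because"] * 64,                         # placeholder prompts
        config=config,
    )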

@jon-tow (Collaborator) left a comment

Thanks for working on this, Max!

Checkpointing works properly for models that do not freeze any layers (at least under ZeRO-2), but not otherwise. For example, with num_layers_unfrozen=2 I get the following error:

RuntimeError: Error(s) in loading state_dict for AutoModelForCausalLMWithHydraValueHead:
        Missing key(s) in state_dict: "frozen_head.decoder_blocks.0.ln_1.weight", 
"frozen_head.decoder_blocks.0.ln_1.bias", "frozen_head.decoder_blocks.0.attn.bias", 
"frozen_head.decoder_blocks.0.attn.masked_bias", "frozen_head.decoder_blocks.0.attn.c_attn.weight", 
"frozen_head.decoder_blocks.0.attn.c_attn.bias", "frozen_head.decoder_blocks.0.attn.c_proj.weight", 
"frozen_head.decoder_blocks.0.attn.c_proj.bias", "frozen_head.decoder_blocks.0.ln_2.weight", 
"frozen_head.decoder_blocks.0.ln_2.bias", "frozen_head.decoder_blocks.0.mlp.c_fc.weight", 
"frozen_head.decoder_blocks.0.mlp.c_fc.bias", "frozen_head.decoder_blocks.0.mlp.c_proj.weight", 
"frozen_head.decoder_blocks.0.mlp.c_proj.bias", "frozen_head.decoder_blocks.1.ln_1.weight", 
"frozen_head.decoder_blocks.1.ln_1.bias", "frozen_head.decoder_blocks.1.attn.bias", 
"frozen_head.decoder_blocks.1.attn.masked_bias", "frozen_head.decoder_blocks.1.attn.c_attn.weight", 
"frozen_head.decoder_blocks.1.attn.c_attn.bias", "frozen_head.decoder_blocks.1.attn.c_proj.weight", 
"frozen_head.decoder_blocks.1.attn.c_proj.bias", "frozen_head.decoder_blocks.1.ln_2.weight", 
"frozen_head.decoder_blocks.1.ln_2.bias", "frozen_head.decoder_blocks.1.mlp.c_fc.weight", 
"frozen_head.decoder_blocks.1.mlp.c_fc.bias", "frozen_head.decoder_blocks.1.mlp.c_proj.weight", 
"frozen_head.decoder_blocks.1.mlp.c_proj.bias", "frozen_head.final_norm.weight", "frozen_head.final_norm.bias", 
"frozen_head.lm_head.weight". 

We'll probably need to filter frozen_head from state_dict when checkpointing. What do you think?
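
For concreteness, one possible shape of that filtering, as a hedged sketch only (strip_frozen_head is a hypothetical helper, not part of trlx, and not necessarily the change this PR ends up making): drop the frozen_head.* entries before saving, and load non-strictly so the rebuilt frozen head is not expected in the checkpoint.

    import torch

    def strip_frozen_head(state_dict):
        """Drop the frozen hydra-head parameters before checkpointing."""
        return {k: v for k, v in state_dict.items() if not k.startswith("frozen_head.")}

    def save_trainable(model, path):
        # Persist only the trainable weights; the frozen head is excluded.
        torch.save(strip_frozen_head(model.state_dict()), path)

    def load_trainable(model, path):
        # The frozen head is rebuilt from the base model at init time, so it is
        # not expected in the checkpoint; strict=False tolerates the missing keys.
        model.load_state_dict(torch.load(path), strict=False)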

Two resolved review threads on trlx/models/modeling_ilql.py (outdated).

Diff excerpt from another review thread:

    for k, v in v_head_state_dict.items():
        base_model_state_dict[f"v_head.{k}"] = v
    return base_model_state_dict

    base_model_state_dict = self.base_model.state_dict(*args, **dict(prefix="base_model.", **kwargs))

Collaborator comment:

Same as for ILQL - need to add support for Seq2Seq.
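
For context, a prefixed state_dict override of that shape looks roughly like the following. This is a hedged reconstruction from the excerpt above, not necessarily the exact code merged, and the Seq2Seq variant asked for here is not shown.

    # Hedged reconstruction based on the diff excerpt above.
    def state_dict(self, *args, **kwargs):
        # Prefix base-model weights so they can be told apart from the
        # value-head weights when the combined dict is loaded back.
        base_model_state_dict = self.base_model.state_dict(
            *args, **dict(prefix="base_model.", **kwargs)
        )
        v_head_state_dict = self.v_head.state_dict(*args, **kwargs)
        for k, v in v_head_state_dict.items():
            base_model_state_dict[f"v_head.{k}"] = v
        return base_model_state_dict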

A resolved review thread on trlx/trlx.py.
@jon-tow (Collaborator) left a comment

Looks good to me!

@Dahoas (Collaborator) commented on Jul 10, 2023

@maxreciprocate Do you want to resolve the conflicts and then we will merge?

@Dahoas (Collaborator) commented on Aug 4, 2023

@maxreciprocate Bump on this

@Dahoas (Collaborator) commented on Aug 8, 2023

Looks good, merging

@Dahoas merged commit 2e667e6 into main on Aug 8, 2023 (2 checks passed).
@maxreciprocate deleted the fix-checkpoint-loading branch on Aug 8, 2023 at 18:27.