
[Usage] Missing "trainer_state.json" when resuming training from saved checkpoints #1164

Open · jiadingfang opened this issue Feb 22, 2024 · 14 comments

@jiadingfang

Describe the issue

Issue: I'm trying to reproduce the training results. Since my cluster has a 4-hour usage limit per job, I need to save checkpoints and resume training from them. However, resuming training fails with the error below.

Command:

./scripts/v1_5/pretrain.sh

where I changed save_steps to 500.

Log:

[Errno 2] No such file or directory: './checkpoints/llava-v1.5-7b-pretrain/checkpoint-1500/trainer_state.json'

Screenshots:
I have "checkpoint-1500" folder, but it does not contain such "trainer_state.json" file.
image
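
For context, the resume path is roughly the following (paraphrased from memory, so the exact lines in llava/train/train.py may differ): whenever the output directory already contains checkpoint-* folders, training restarts with resume_from_checkpoint=True, and the Hugging Face Trainer then expects trainer_state.json, the optimizer state, and the DeepSpeed files inside the newest checkpoint folder.

    import pathlib

    # Paraphrased resume logic; `trainer` and `training_args` are assumed to be the
    # objects built earlier in train().
    if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
        # The Trainer reads checkpoint-*/trainer_state.json here; if it is missing,
        # resuming fails with the error above.
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()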

@sahilqure

@jiadingfang Were you able to solve this issue?

@sahilqure

@haotian-liu Can you please check this issue? I think many people have missed it because few people run the pre-training stage. Is this a problem with the transformers or deepspeed version?

@baochi0212

Add self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME)) to the _save_checkpoint method in llava/train/llava_trainer.py.
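
For placement, a sketch of where that line would sit, assuming the pretraining branch of LLaVATrainer._save_checkpoint looks roughly as in the repo (surrounding lines paraphrased from memory; TRAINER_STATE_NAME is just the string "trainer_state.json", the same constant transformers uses):

    # llava/train/llava_trainer.py, inside the tune_mm_mlp_adapter branch of _save_checkpoint
    TRAINER_STATE_NAME = "trainer_state.json"

    if self.args.local_rank == 0 or self.args.local_rank == -1:
        self.model.config.save_pretrained(output_dir)
        torch.save(weight_to_save, os.path.join(output_dir, 'mm_projector.bin'))
        # proposed addition: dump the trainer state so resume can find it
        self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))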

@JiangLinsheng

I also encountered the same problem. Is there any solution?

@git-siddhesh

Same issue here: I'm using the Hugging Face Trainer, saving the model, and hitting the same error when loading the model from the saved directory.

@ashmalvayani

ashmalvayani commented May 21, 2024

self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))

raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")

Adding only that line will not work: resuming still fails with the ValueError above, because it also needs the .pth files and the optimizer files, which are not saved either.

@ashmalvayani

Even if you add the trainer_state.json file, it will not resume, because it will still ask for the optimizer files and .pth files, which won't be saved. I think the best way is to comment out the body of their _save_checkpoint override and simply keep the "super(LlaVaTrainer, self) ..." line, then let the code run. I have tested this: it does not save the mm_projector.bin file at each checkpoint, but it does save the entire weights at each checkpoint.

You can manually extract the mm_projector weights later. If you don't want to do that, don't worry: at the end of training it automatically saves trainer_state.json, mm_projector.bin, and config.json after the last step completes.
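
For reference, a sketch of that workaround (assuming the trainer class in llava/train/llava_trainer.py is LLaVATrainer; the commented-out branch is paraphrased): with the adapter-only branch disabled, the stock Hugging Face checkpointing runs and writes trainer_state.json, the optimizer state, and the DeepSpeed global_step* directory at every save.

    # llava/train/llava_trainer.py -- workaround sketch: keep only the parent call
    def _save_checkpoint(self, model, trial, metrics=None):
        # if getattr(self.args, 'tune_mm_mlp_adapter', False):
        #     ...  # adapter-only saving skipped; extract mm_projector from the full weights later
        super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)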

@lucasjinreal

Just add this single line to the _save_checkpoint function:

        # save all for mm adaptor resume
        self.save_model(output_dir, _internal_call=True)

Then you can either resume, or use the mm_projector.bin; keep everything else untouched.
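
For placement, the line is meant to go inside the pretraining branch of _save_checkpoint, right next to the existing mm_projector.bin save (the surrounding line is paraphrased from memory):

        torch.save(weight_to_save, os.path.join(output_dir, 'mm_projector.bin'))
        # save all for mm adaptor resume: also write the model weights for this checkpoint
        self.save_model(output_dir, _internal_call=True)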

@junha1125

junha1125 commented Jun 20, 2024

Try 1, following issuecomment-2179701269:

  1. Add the single line self.save_model(output_dir, _internal_call=True) under the line torch.save(weight_to_save, os.path.join(output_dir, f'mm_projector.bin')).
  2. Delete def _save(self, ...) in llava_trainer.py.
  3. It does save the entire weights, but it does not save the DeepSpeed state looked up by deepspeed_checkpoint_dirs = sorted(glob.glob(f"{checkpoint_path}/global_step*")), so I got the error (see the sketch after this list).

Try 2, following issuecomment-2124420477:

  1. Delete def _save_checkpoint(self, ...) in llava_trainer.py.
  2. It also saves the entire weights, plus the global_step1 directory.
  3. When re-running train.py, it resumes the model weights and the DeepSpeed files without errors. 🥳

For me, I recommend Try 2.
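
The error in Try 1 matches the check that transformers' DeepSpeed integration performs when resuming, roughly paraphrased here as a standalone helper (not the exact library code):

    import glob
    import os

    def has_deepspeed_engine_state(checkpoint_path: str) -> bool:
        # Resuming with DeepSpeed requires the engine's own global_step* directories;
        # if none exist, transformers raises "Can't find a valid checkpoint at ...".
        deepspeed_checkpoint_dirs = sorted(glob.glob(os.path.join(checkpoint_path, "global_step*")))
        return len(deepspeed_checkpoint_dirs) > 0

Deleting the LLaVA override (Try 2) lets the stock Trainer write those directories, which is why the resume then succeeds.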

@lxysl

lxysl commented Jun 20, 2024

The entire model weights are saved in this way by safe_save_model_for_hf_trainer():

    ...
    if trainer.deepspeed:
        torch.cuda.synchronize()
        trainer.save_model(output_dir)
        return

    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {
            key: value.cpu()
            for key, value in state_dict.items()
        }
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

and before this function is called, the trainer state has already been saved:

trainer.save_state()

So I wonder: should torch.cuda.synchronize() and trainer.save_state() also be called when saving on save_steps with the Hugging Face Trainer and DeepSpeed?
And what's the difference between trainer.save_model() and trainer._save()?
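
Not an authoritative answer, but for comparison, here is a minimal sketch (not LLaVA code; save_full_snapshot is a hypothetical helper) of saving both the weights and the trainer state mid-training using only public Trainer calls, mirroring what the quoted helper plus trainer.save_state() do at the end of training:

    import torch

    def save_full_snapshot(trainer, output_dir):
        # Hypothetical helper, not part of LLaVA or transformers.
        if getattr(trainer, "deepspeed", None) is not None:
            torch.cuda.synchronize()      # flush pending GPU work before gathering weights
        trainer.save_model(output_dir)    # public API; handles the distributed/DeepSpeed gathering
        trainer.save_state()              # writes trainer_state.json to trainer.args.output_dir

As I understand it, save_model() is the public entry point that handles the distributed and DeepSpeed cases and then calls _save() on the main process, while _save() just writes the state dict, config, and tokenizer files.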

@lxysl

lxysl commented Jun 20, 2024

And I also wonder why the model weights are not checkpointed every save_steps. Isn't the default save_steps in the Hugging Face Trainer 500? Please @ me if there's any progress.

@lucasjinreal

Oh... Ignore all the noisy comments above; a single line solves all the issues.

@lxysl

lxysl commented Jun 21, 2024

Oh... Ignore all the noisy comments above; a single line solves all the issues.

I will try it later! :)

@StephenQSstarThomas

Just add this single line to the _save_checkpoint function:

        # save all for mm adaptor resume
        self.save_model(output_dir, _internal_call=True)

Then you can either resume, or use the mm_projector.bin; keep everything else untouched.

Between which lines should this line be added?
