
[Usage] Missing "trainer_state.json" when resuming training from saved checkpoints #1164

Open · jiadingfang opened this issue Feb 22, 2024 · 14 comments

@jiadingfang

Describe the issue

Issue: I'm trying to reproduce the training results. Since my cluster has a 4-hour usage limit per job, I need to save checkpoints and resume training from them. However, resuming training fails with the error below.

Command:

./scripts/v1_5/pretrain.sh

where I changed save_steps to 500.

Log:

[Errno 2] No such file or directory: './checkpoints/llava-v1.5-7b-pretrain/checkpoint-1500/trainer_state.json'

Screenshots:
I have "checkpoint-1500" folder, but it does not contain such "trainer_state.json" file.
image
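
For context, the resume path is roughly the following (paraphrased from memory, so the exact lines in llava/train/train.py may differ): whenever the output directory already contains checkpoint-* folders, training restarts with resume_from_checkpoint=True, and the Hugging Face Trainer then expects trainer_state.json, the optimizer state, and the DeepSpeed files inside the newest checkpoint folder.

    import pathlib

    # Paraphrased resume logic; `trainer` and `training_args` are assumed to be the
    # objects built earlier in train().
    if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
        # The Trainer reads checkpoint-*/trainer_state.json here; if it is missing,
        # resuming fails with the error above.
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()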

@sahilqure

@jiadingfang Were you able to solve this issue?

@sahilqure

@haotian-liu Can you please check this issue? I think many people have missed it because few people run the pre-training stage. Is this a problem with the transformers or deepspeed version?

@baochi0212

Add self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME)) to the _save_checkpoint method in llava/train/llava_trainer.py.
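
For placement, a sketch of where that line would sit, assuming the pretraining branch of LLaVATrainer._save_checkpoint looks roughly as in the repo (surrounding lines paraphrased from memory; TRAINER_STATE_NAME is just the string "trainer_state.json", the same constant transformers uses):

    # llava/train/llava_trainer.py, inside the tune_mm_mlp_adapter branch of _save_checkpoint
    TRAINER_STATE_NAME = "trainer_state.json"

    if self.args.local_rank == 0 or self.args.local_rank == -1:
        self.model.config.save_pretrained(output_dir)
        torch.save(weight_to_save, os.path.join(output_dir, 'mm_projector.bin'))
        # proposed addition: dump the trainer state so resume can find it
        self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))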

@JiangLinsheng

I also encountered the same problem. Is there any solution?

@git-siddhesh

Same issue here: I'm using the Hugging Face Trainer, saving the model, and hitting the same error when loading the model from the saved directory.

@ashmalvayani

ashmalvayani commented May 21, 2024

self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))

raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")

Adding only that line will not work: resuming still fails with the ValueError above, because it also needs the .pth files and the optimizer files, which are not saved either.

@ashmalvayani

Even if you add the trainer_state.json file, it will not resume, because it will still ask for the optimizer files and .pth files, which won't be saved. I think the best way is to comment out the body of their _save_checkpoint override and simply keep the "super(LlaVaTrainer, self) ..." line, then let the code run. I have tested this: it does not save the mm_projector.bin file at each checkpoint, but it does save the entire weights at each checkpoint.

You can manually extract the mm_projector weights later. If you don't want to do that, don't worry: at the end of training it automatically saves trainer_state.json, mm_projector.bin, and config.json after the last step completes.
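
For reference, a sketch of that workaround (assuming the trainer class in llava/train/llava_trainer.py is LLaVATrainer; the commented-out branch is paraphrased): with the adapter-only branch disabled, the stock Hugging Face checkpointing runs and writes trainer_state.json, the optimizer state, and the DeepSpeed global_step* directory at every save.

    # llava/train/llava_trainer.py -- workaround sketch: keep only the parent call
    def _save_checkpoint(self, model, trial, metrics=None):
        # if getattr(self.args, 'tune_mm_mlp_adapter', False):
        #     ...  # adapter-only saving skipped; extract mm_projector from the full weights later
        super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)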

@lucasjinreal

Just add this single line to the _save_checkpoint function:

        # save all for mm adaptor resume
        self.save_model(output_dir, _internal_call=True)

Then you can either resume, or use the mm_projector.bin; keep everything else untouched.
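
For placement, the line is meant to go inside the pretraining branch of _save_checkpoint, right next to the existing mm_projector.bin save (the surrounding line is paraphrased from memory):

        torch.save(weight_to_save, os.path.join(output_dir, 'mm_projector.bin'))
        # save all for mm adaptor resume: also write the model weights for this checkpoint
        self.save_model(output_dir, _internal_call=True)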

@junha1125

junha1125 commented Jun 20, 2024

Try 1, following issuecomment-2179701269:

  1. Add the single line self.save_model(output_dir, _internal_call=True) under the line torch.save(weight_to_save, os.path.join(output_dir, f'mm_projector.bin')).
  2. Delete def _save(self, ...) in llava_trainer.py.
  3. It does save the entire weights, but it does not save the DeepSpeed state looked up by deepspeed_checkpoint_dirs = sorted(glob.glob(f"{checkpoint_path}/global_step*")), so I got the error (see the sketch after this list).

Try 2, following issuecomment-2124420477:

  1. Delete def _save_checkpoint(self, ...) in llava_trainer.py.
  2. It also saves the entire weights, plus the global_step1 directory.
  3. When re-running train.py, it resumes the model weights and the DeepSpeed files without errors. 🥳

For me, I recommend Try 2.
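
The error in Try 1 matches the check that transformers' DeepSpeed integration performs when resuming, roughly paraphrased here as a standalone helper (not the exact library code):

    import glob
    import os

    def has_deepspeed_engine_state(checkpoint_path: str) -> bool:
        # Resuming with DeepSpeed requires the engine's own global_step* directories;
        # if none exist, transformers raises "Can't find a valid checkpoint at ...".
        deepspeed_checkpoint_dirs = sorted(glob.glob(os.path.join(checkpoint_path, "global_step*")))
        return len(deepspeed_checkpoint_dirs) > 0

Deleting the LLaVA override (Try 2) lets the stock Trainer write those directories, which is why the resume then succeeds.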

@lxysl

lxysl commented Jun 20, 2024

The entire model weights are saved in this way by safe_save_model_for_hf_trainer():

    ...
    if trainer.deepspeed:
        torch.cuda.synchronize()
        trainer.save_model(output_dir)
        return

    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {
            key: value.cpu()
            for key, value in state_dict.items()
        }
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

and before this function is called, the trainer state has already been saved:

trainer.save_state()

So I wonder: should torch.cuda.synchronize() and trainer.save_state() also be called when saving on save_steps with the Hugging Face Trainer and DeepSpeed?
And what's the difference between trainer.save_model() and trainer._save()?
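
Not an authoritative answer, but for comparison, here is a minimal sketch (not LLaVA code; save_full_snapshot is a hypothetical helper) of saving both the weights and the trainer state mid-training using only public Trainer calls, mirroring what the quoted helper plus trainer.save_state() do at the end of training:

    import torch

    def save_full_snapshot(trainer, output_dir):
        # Hypothetical helper, not part of LLaVA or transformers.
        if getattr(trainer, "deepspeed", None) is not None:
            torch.cuda.synchronize()      # flush pending GPU work before gathering weights
        trainer.save_model(output_dir)    # public API; handles the distributed/DeepSpeed gathering
        trainer.save_state()              # writes trainer_state.json to trainer.args.output_dir

As I understand it, save_model() is the public entry point that handles the distributed and DeepSpeed cases and then calls _save() on the main process, while _save() just writes the state dict, config, and tokenizer files.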

@lxysl

lxysl commented Jun 20, 2024

And I also wonder why the model weights are not checkpointed every save_steps. Isn't the default save_steps in the Hugging Face Trainer 500? Please @ me if there's any progress.

@lucasjinreal

Oh... Ignore all the noisy comments above; a single line solves all the issues.

@lxysl

lxysl commented Jun 21, 2024

Oh... Ignore all the noisy comments above; a single line solves all the issues.

I will try it later! :)

@StephenQSstarThomas

Just add this single line to the _save_checkpoint function:

        # save all for mm adaptor resume
        self.save_model(output_dir, _internal_call=True)

Then you can either resume, or use the mm_projector.bin; keep everything else untouched.

Between which lines should this line be added?
