[Usage] Missing "trainer_state.json" when resuming training from saved checkpoints #1164
Comments
@jiadingfang Were you able to solve this issue?
@haotian-liu Can you please check this issue? I think many people have missed it because nobody is doing the pre-training stage. Is this a problem with the transformers or deepspeed version?
add
I also encountered the same problem. Is there any solution?
Same issue here: using the Hugging Face Trainer, saving the model, and hitting the same error when loading the model from the saved directory.
This will not work on its own, because resuming also needs the .pth files and the optimizer files, which are not saved either.
Even if you add a trainer_state.json file, training will not resume, because it will also ask for the optimizer files and .pth files, which still won't be saved. I think the best way is to comment out their function and simply keep the "super(LLaVATrainer, self) ..." line and let the code run. I have tested this: it does not save the mm_projector.bin file at each stage, but it does save the entire weights at each checkpoint, so you can manually extract the mm_projector weights later (a rough sketch of that extraction is below). If you don't want to do this, don't worry: at the end of training it automatically saves trainer_state.json, mm_projector.bin, and config.json after the last step.
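For the manual extraction mentioned above, a minimal sketch, assuming the checkpoint was consolidated into a single pytorch_model.bin and that the projector parameters contain "mm_projector" in their names (the paths are hypothetical):

```python
import torch

# Hypothetical checkpoint path; adjust to your own run.
ckpt_path = "checkpoints/llava-pretrain/checkpoint-1500/pytorch_model.bin"
state_dict = torch.load(ckpt_path, map_location="cpu")

# Keep only the multimodal projector tensors and write them out in the
# same single-file format the training script would have produced.
projector = {k: v for k, v in state_dict.items() if "mm_projector" in k}
torch.save(projector, "checkpoints/llava-pretrain/checkpoint-1500/mm_projector.bin")
print(f"extracted {len(projector)} projector tensors")
```

If the checkpoint is sharded or stored as safetensors, the loading step would need to iterate over the shards instead of a single torch.load call.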
Just add a single line to the _save_checkpoint function (the line commented "save all for mm adaptor resume" in the original snippet); see the sketch below.
Then you can either resume from the checkpoint, or just take the mm_projector.bin, keeping everything else untouched.
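A rough sketch of the idea, not the verbatim upstream code (the exact body and signature of LLaVATrainer._save_checkpoint vary across LLaVA and transformers versions): keep the adapter-only saving, but also call the parent Trainer's _save_checkpoint so that the optimizer/scheduler state and trainer_state.json are written and --resume_from_checkpoint has something to resume from.

```python
from transformers import Trainer

class LLaVATrainer(Trainer):
    def _save_checkpoint(self, model, trial, metrics=None):
        if getattr(self.args, "tune_mm_mlp_adapter", False):
            # ... existing code that writes mm_projector.bin into the checkpoint folder ...
            # The added line: also write the full checkpoint (optimizer, scheduler,
            # rng state, trainer_state.json) so resuming works.
            super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)  # save all for mm adaptor resume
        else:
            super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)
```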
Try 1: follow issuecomment-2179701269.
Try 2: follow issuecomment-2124420477.
For me, Try 2 is the one I recommend.
The entire model weights are saved in this way in ...

```python
if trainer.deepspeed:
    torch.cuda.synchronize()
    trainer.save_model(output_dir)
    return

state_dict = trainer.model.state_dict()
if trainer.args.should_save:
    cpu_state_dict = {
        key: value.cpu()
        for key, value in state_dict.items()
    }
    del state_dict
    trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa
```

and before this function is called, the trainer state has already been saved via trainer.save_state(). So I wonder whether ...
And I also wonder why the model weights are not checkpointed at every save_steps interval. Isn't the default save_steps in the Hugging Face Trainer equal to 500? Please @ me if there's any progress.
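For reference, the checkpoint interval is controlled by save_strategy / save_steps in TrainingArguments (500 is indeed the default save_steps); a minimal illustration with hypothetical values:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints/llava-pretrain",  # hypothetical output directory
    save_strategy="steps",   # write a checkpoint every `save_steps` optimizer steps
    save_steps=500,          # the Trainer default
    save_total_limit=2,      # optional: keep only the most recent checkpoints
)
```

Whether the full model weights actually land in each checkpoint-N folder depends on how _save_checkpoint is overridden, which is exactly what the comments above are working around.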
Oh... ignore all the above noisy comments; a single line solves all the issues.
I will try it later! :)
Between which lines should this line be added?
Describe the issue
Issue: I'm trying to reproduce the training results. As my cluster has a 4-hour usage limit per session, I need to save checkpoints and resume training from them. However, resuming training fails with the error below.
Command:
where I changed save_steps to 500.
Log:
Screenshots:
I have a "checkpoint-1500" folder, but it does not contain a "trainer_state.json" file.
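To see exactly what the resume logic is missing, one can compare the checkpoint folder against the files a Hugging Face Trainer checkpoint normally contains; a small sketch (file names are approximate and vary with the transformers version, DeepSpeed usage, and number of ranks; the path is hypothetical):

```python
import os

ckpt_dir = "./checkpoints/llava-pretrain/checkpoint-1500"  # hypothetical path

# Files a resumable Trainer checkpoint typically holds.
expected = [
    "trainer_state.json",  # training progress; required by --resume_from_checkpoint
    "optimizer.pt",        # optimizer state (DeepSpeed instead writes global_step*/ folders)
    "scheduler.pt",        # learning-rate scheduler state
    "rng_state.pth",       # RNG state (rng_state_{rank}.pth when running multi-GPU)
]

present = set(os.listdir(ckpt_dir))
for name in expected:
    print(("found  " if name in present else "MISSING"), name)
```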