Can't find a valid checkpoint at {checkpoint_path} #1132
Comments
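Multi-node, multi-GPU. We'll test this; please wait a moment.
We are using phi3-vision-128k-instruct; this may help.
The training script is as follows:
I then tried setting resume_only_model True, and hit the following error:
Could this be a phi3-vl adaptation issue?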
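After the cluster restarted, I wanted to load the model and resume training from a saved step.
As for the training setup: first, save_only_model=False had not been set previously; second, training ran multi-node, multi-GPU with deepspeed-zero3.
Using resume_from_checkpoint keeps failing with "Can't find a valid checkpoint at {checkpoint_path}".
Is it impossible to resume training from an earlier step with this setup? And why can't the checkpoint be found?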
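One possible cause, offered as an assumption rather than a confirmed diagnosis: when resuming under DeepSpeed, the loader may look for `global_step*` subdirectories inside the checkpoint folder and raise this error if none exist. A checkpoint written with save_only_model enabled would hold only model weights, not the DeepSpeed engine state, so no such subdirectory would be present. The sketch below checks a checkpoint folder for those subdirectories using only the standard library. The helper names and the example path are hypothetical, and the `global_step*` pattern is an assumption about the loader's behavior.

```python
import glob
import os


def find_deepspeed_state_dirs(checkpoint_path: str) -> list[str]:
    """Return the sorted `global_step*` subdirectories under checkpoint_path.

    Assumption: the resume logic searches for directories with this prefix.
    """
    # Escape the base path so special characters in it are not treated as wildcards.
    pattern = os.path.join(glob.escape(checkpoint_path), "global_step*")
    # Keep directories only; a plain file with a matching name does not count.
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def can_resume_full_state(checkpoint_path: str) -> bool:
    """True if at least one candidate DeepSpeed state directory exists."""
    return len(find_deepspeed_state_dirs(checkpoint_path)) > 0


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    path = "output/checkpoint-1000"
    if can_resume_full_state(path):
        print("Found DeepSpeed state:", find_deepspeed_state_dirs(path))
    else:
        print("No global_step* directory found; only model weights may have been saved.")
```

If the check finds nothing, the checkpoint probably cannot restore optimizer and scheduler state. Loading the weights alone and restarting the schedule may then be the only option.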