
Can't find a valid checkpoint at {checkpoint_path} #1132

Open
MindLostGuy opened this issue Jun 13, 2024 · 4 comments
@MindLostGuy

After a cluster restart, I want to load the model and resume training from the saved step.

Regarding the training setup: on the one hand, save_only_model=False was not set previously; on the other hand, we are training multi-node multi-GPU with deepspeed-zero3.

When using resume_from_checkpoint, it keeps reporting "Can't find a valid checkpoint at {checkpoint_path}".

Is it impossible to resume training from the previous step under this setup? And what is the reason the checkpoint cannot be found?
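For context, this error is raised when the directory passed to resume_from_checkpoint contains none of the files the trainer recognizes as a resumable checkpoint. A minimal stdlib-only sketch of such a validity check follows; the exact file names are an assumption for illustration (the real check in transformers covers more variants, e.g. sharded index files and safetensors shards):

```python
import os

# Hypothetical file names a Trainer-style loader might require; the real
# list transformers/swift checks is broader (sharded indexes, adapters, ...).
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")
STATE_FILE = "trainer_state.json"

def is_valid_checkpoint(checkpoint_path: str) -> bool:
    """Return True if the directory looks like a resumable checkpoint."""
    if not os.path.isdir(checkpoint_path):
        return False
    has_weights = any(
        os.path.isfile(os.path.join(checkpoint_path, f)) for f in WEIGHT_FILES
    )
    has_state = os.path.isfile(os.path.join(checkpoint_path, STATE_FILE))
    return has_weights and has_state
```

Listing the files actually present in checkpoint-4000/ and comparing against what the loader expects is a quick way to see whether the ZeRO-3 save path (which writes optimizer/parameter shards under a deepspeed subdirectory rather than a single weight file) is what trips this check.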

@Jintao-Huang
Collaborator

Multi-node multi-GPU: we'll run a test, one moment please.

@Jintao-Huang Jintao-Huang self-assigned this Jun 13, 2024
@MindLostGuy
Author

We are using phi3-vision-128k-instruct, in case that helps.

@MindLostGuy
Author

The training script is as follows:

```shell
NNODES=8 \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model_type phi3-vision-128k-instruct \
    --resume_from_checkpoint /log_path/phi3-vision-128k-instruct/v5-20240611-122404/checkpoint-4000/ \
    --sft_type full \
    --dtype bf16 \
    --dataset /train_data_path/ \
    --val_dataset /val_data_path/#100 \
    --dataset_test_ratio 0.0 \
    --max_length 4096 \
    --seed 42 \
    --check_dataset_strategy warning \
    --use_flash_attn true \
    --num_train_epochs 1 \
    --dataloader_num_workers 4 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --logging_steps 5 \
    --save_steps 1000 \
    --eval_steps 1000 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --gradient_checkpointing True \
    --deepspeed default-zero3
```

@MindLostGuy
Author

I then tried setting resume_only_model True, and hit the following error:

size mismatch for model.layers.31.input_layernorm.weight: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.31.post_attention_layernorm.weight: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.norm.weight: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([0]).

Could this be an adaptation issue with phi3-vl?
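For what it's worth, the torch.Size([0]) shapes in that traceback are characteristic of ZeRO stage 3: each parameter is partitioned across ranks, and the module itself holds only a zero-size placeholder until the full tensor is gathered, so naively calling load_state_dict with a full checkpoint on the partitioned model fails with exactly this mismatch. A toy, framework-free illustration of the partition/gather idea (pure Python, no deepspeed; function names here are made up for the sketch):

```python
# Toy illustration of ZeRO-3-style parameter partitioning (no deepspeed).
# Each "rank" stores only its shard; the live module holds an empty
# placeholder, which is why shape checks against a full checkpoint fail.

def partition(param, world_size):
    """Split a flat parameter across ranks (last shard may be shorter)."""
    shard_size = -(-len(param) // world_size)  # ceiling division
    return [param[i * shard_size:(i + 1) * shard_size] for i in range(world_size)]

def gather(shards, numel):
    """Reassemble the full parameter from its shards (all-gather analogue)."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]

full_param = list(range(10))       # a "layernorm weight" of 10 elements
placeholder = []                   # what the partitioned module holds: size 0
shards = partition(full_param, 4)  # each of 4 ranks keeps its own slice

print(len(placeholder))                               # 0, like torch.Size([0])
print(gather(shards, len(full_param)) == full_param)  # True after gathering
```

This is why resuming under ZeRO-3 normally goes through deepspeed's own checkpoint loading (which restores the shards) rather than a plain state-dict copy into the partitioned model.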
