
Can't find a valid checkpoint at {checkpoint_path} #1132

Open
MindLostGuy opened this issue Jun 13, 2024 · 4 comments
@MindLostGuy

After a cluster restart, I want to load the model and resume training from the saved step.

Regarding the training setup: on the one hand, save_only_model=False was not set previously; on the other hand, we are training multi-node multi-GPU with deepspeed-zero3.

When using resume_from_checkpoint, it keeps reporting "Can't find a valid checkpoint at {checkpoint_path}".

Is it impossible to resume training from the previous step under this setup? And what is the reason the checkpoint cannot be found?
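For context, this error is raised when the directory passed to resume_from_checkpoint contains none of the files the trainer recognizes as a resumable checkpoint. A minimal stdlib-only sketch of such a validity check follows; the exact file names are an assumption for illustration (the real check in transformers covers more variants, e.g. sharded index files and safetensors shards):

```python
import os

# Hypothetical file names a Trainer-style loader might require; the real
# list transformers/swift checks is broader (sharded indexes, adapters, ...).
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")
STATE_FILE = "trainer_state.json"

def is_valid_checkpoint(checkpoint_path: str) -> bool:
    """Return True if the directory looks like a resumable checkpoint."""
    if not os.path.isdir(checkpoint_path):
        return False
    has_weights = any(
        os.path.isfile(os.path.join(checkpoint_path, f)) for f in WEIGHT_FILES
    )
    has_state = os.path.isfile(os.path.join(checkpoint_path, STATE_FILE))
    return has_weights and has_state
```

Listing the files actually present in checkpoint-4000/ and comparing against what the loader expects is a quick way to see whether the ZeRO-3 save path (which writes optimizer/parameter shards under a deepspeed subdirectory rather than a single weight file) is what trips this check.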

@Jintao-Huang
Collaborator

Multi-node multi-GPU: we'll run a test, one moment please.

@Jintao-Huang Jintao-Huang self-assigned this Jun 13, 2024
@MindLostGuy
Author

We are using phi3-vision-128k-instruct, in case that helps.

@MindLostGuy
Author

The training script is as follows:

```shell
NNODES=8 \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model_type phi3-vision-128k-instruct \
    --resume_from_checkpoint /log_path/phi3-vision-128k-instruct/v5-20240611-122404/checkpoint-4000/ \
    --sft_type full \
    --dtype bf16 \
    --dataset /train_data_path/ \
    --val_dataset /val_data_path/#100 \
    --dataset_test_ratio 0.0 \
    --max_length 4096 \
    --seed 42 \
    --check_dataset_strategy warning \
    --use_flash_attn true \
    --num_train_epochs 1 \
    --dataloader_num_workers 4 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --logging_steps 5 \
    --save_steps 1000 \
    --eval_steps 1000 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --gradient_checkpointing True \
    --deepspeed default-zero3
```

@MindLostGuy
Author

I then tried setting resume_only_model True, and hit the following error:

size mismatch for model.layers.31.input_layernorm.weight: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.layers.31.post_attention_layernorm.weight: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for model.norm.weight: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([0]).

Could this be an adaptation issue with phi3-vl?
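For what it's worth, the torch.Size([0]) shapes in that traceback are characteristic of ZeRO stage 3: each parameter is partitioned across ranks, and the module itself holds only a zero-size placeholder until the full tensor is gathered, so naively calling load_state_dict with a full checkpoint on the partitioned model fails with exactly this mismatch. A toy, framework-free illustration of the partition/gather idea (pure Python, no deepspeed; function names here are made up for the sketch):

```python
# Toy illustration of ZeRO-3-style parameter partitioning (no deepspeed).
# Each "rank" stores only its shard; the live module holds an empty
# placeholder, which is why shape checks against a full checkpoint fail.

def partition(param, world_size):
    """Split a flat parameter across ranks (last shard may be shorter)."""
    shard_size = -(-len(param) // world_size)  # ceiling division
    return [param[i * shard_size:(i + 1) * shard_size] for i in range(world_size)]

def gather(shards, numel):
    """Reassemble the full parameter from its shards (all-gather analogue)."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]

full_param = list(range(10))       # a "layernorm weight" of 10 elements
placeholder = []                   # what the partitioned module holds: size 0
shards = partition(full_param, 4)  # each of 4 ranks keeps its own slice

print(len(placeholder))                               # 0, like torch.Size([0])
print(gather(shards, len(full_param)) == full_param)  # True after gathering
```

This is why resuming under ZeRO-3 normally goes through deepspeed's own checkpoint loading (which restores the shards) rather than a plain state-dict copy into the partitioned model.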
