
ValueError: Trying to set a tensor of shape torch.Size([128257, 4096]) in "weight" (which has shape torch.Size([128256, 4096])), this look incorrect. #31

Open
basteran opened this issue May 28, 2024 · 1 comment


Hello everyone, thank you for the great work!

I am trying to further fine-tune the LLaVA architecture using your implementation with LLaMA 3 Instruct 8B. I can already fine-tune the Vicuna model using the original LLaVA code, and now I am looking for an implementation based on LLaMA 3.

I found your repo and followed the instructions in your README.md file for each step. I am able to train the model using the following bash file, and it looks like it is saved correctly. NOTE: I downloaded the model from your Hugging Face repo.

TRAINING CODE

#!/bin/bash

################## MODELS #################
PROMPT_VERSION="llama3"
MODEL_DIR_PATH="/user/hf_models/"
MODEL_VERSION="LLaVA-Meta-Llama-3-8B-Instruct-FT"
MODEL_ABS_PATH=$MODEL_DIR_PATH/$MODEL_VERSION
################### END ###################

################## CUDA ####################
export CUDA_VISIBLE_DEVICES=0
echo "CUDA IS" ${CUDA_VISIBLE_DEVICES}
################## CUDA ####################

################# TRAINING #################
deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path $MODEL_ABS_PATH \
    --version $PROMPT_VERSION \
    --data_path ./data/train.json \
    --image_folder ./data/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --bf16 True \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "linear" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing False \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none

I then tried to merge the resulting adapters with the original model LLaVA-Meta-Llama-3-8B-Instruct-FT, using this script from LLaVA.
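My understanding of what that merge step boils down to, as a rough sketch rather than the exact LLaVA code (the paths are the ones from my scripts; the output path is just a placeholder):

# Rough paraphrase of LLaVA's merge_lora_weights.py / builder.py, not the
# exact code: the builder loads the base model with the LoRA checkpoint's
# config, attaches the adapters, folds them into the base weights, and saves.
from peft import PeftModel
from transformers import AutoConfig
from llava.model import LlavaLlamaForCausalLM  # importing llava registers "llava" with AutoConfig

base_path = "/user/hf_models/LLaVA-Meta-Llama-3-8B-Instruct-FT"
lora_path = "./checkpoints/llava-LLaVA-Meta-Llama-3-8B-Instruct-FT-lora"

lora_cfg = AutoConfig.from_pretrained(lora_path)

# The ValueError below fires during this load, while the base weights are
# being copied into a model instantiated from the LoRA checkpoint's config.
model = LlavaLlamaForCausalLM.from_pretrained(base_path, config=lora_cfg, low_cpu_mem_usage=True)
model = PeftModel.from_pretrained(model, lora_path)  # attach the LoRA adapters
model = model.merge_and_unload()                     # fold the deltas into the base weights
model.save_pretrained("./checkpoints/merged")        # placeholder output path

Running the merge, I got the following error.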

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading LLaVA from base model...
/user/anaconda3/envs/mm_iglu_it/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Loading checkpoint shards:   0%|                                                                                                           | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/user/mm-iglu-it/./scripts/merge_lora_weights.py", line 22, in <module>
    merge_lora(args)
  File "/user/mm-iglu-it/./scripts/merge_lora_weights.py", line 8, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, device_map='cpu')
  File "/user/mm-iglu-it/llava/model/builder.py", line 64, in load_pretrained_model
    model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
  File "/user/anaconda3/envs/mm_iglu_it/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3682, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/user/anaconda3/envs/mm_iglu_it/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4109, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/user/anaconda3/envs/mm_iglu_it/lib/python3.10/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/user/anaconda3/envs/mm_iglu_it/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([128257, 4096]) in "weight" (which has shape torch.Size([128256, 4096])), this look incorrect.
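
The off-by-one (128257 vs. 128256, the standard Llama 3 vocabulary size) looks like exactly one extra token row, which matches the "Special tokens have been added in the vocabulary" warning at the top of the log. To check which side carries the extra row, a quick diagnostic along these lines should work; this is my own sketch, and it assumes the base checkpoint is stored as safetensors shards:

# Compare the vocab size declared in the LoRA checkpoint's config with the
# number of embedding rows actually stored in the base model's weight shards.
import glob

from safetensors import safe_open
from transformers import AutoConfig

from llava.model import LlavaLlamaForCausalLM  # noqa: F401, registers "llava" with AutoConfig

lora_path = "./checkpoints/llava-LLaVA-Meta-Llama-3-8B-Instruct-FT-lora"
base_path = "/user/hf_models/LLaVA-Meta-Llama-3-8B-Instruct-FT"

print("vocab_size in LoRA config:", AutoConfig.from_pretrained(lora_path).vocab_size)

# Print the shape of the embedding matrix from whichever shard holds it.
for shard in sorted(glob.glob(f"{base_path}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if name.endswith("embed_tokens.weight"):
                print(name, f.get_slice(name).get_shape())

If the two numbers differ by one, resizing the embeddings so both sides agree (model.resize_token_embeddings(...)) before merging would presumably fix the load.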

Finally, I even tried using the adapters directly (without merging) with the following script, but I get exactly the same error. The file llava/eval/test_llava.py is very similar to the inference script from the original LLaVA repo; I only made minor changes for my convenience (such as --prompt-version, --input-file-path, etc.). A sketch of the load path it goes through is shown after the script.

TESTING CODE

#!/bin/bash

##################################### MODEL #####################################
PROMPT_VERSION="llama3"
MODEL_NAME="llava-LLaVA-Meta-Llama-3-8B-Instruct-FT-lora"
MODEL_BASE="LLaVA-Meta-Llama-3-8B-Instruct-FT"
################################## CHOOSE CUDA ##################################
export CUDA_VISIBLE_DEVICES=0
echo "CUDA is" ${CUDA_VISIBLE_DEVICES}
###################################### END ######################################


#################################### TESTING ####################################
deepspeed ./llava/eval/test_llava.py \
    --model-path ./checkpoints/$MODEL_NAME \
    --model-base /user/hf_models/$MODEL_BASE \
    --model-name $MODEL_NAME \
    --prompt-version $PROMPT_VERSION \
    --input-file-path ./data/test.json \
    --image-path ./data/images 
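
For completeness, this unmerged-adapter path goes through the same load_pretrained_model call that appears in the traceback above, which is presumably why the error is identical. A minimal sketch of that load, with the arguments taken from my scripts:

# The same builder call that appears in the traceback. Because the model
# name contains "lora" and model_base is set, the builder rebuilds the base
# model from the LoRA checkpoint's config before applying the adapters,
# which is the code path that raises the ValueError.
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="./checkpoints/llava-LLaVA-Meta-Llama-3-8B-Instruct-FT-lora",
    model_base="/user/hf_models/LLaVA-Meta-Llama-3-8B-Instruct-FT",
    model_name="llava-LLaVA-Meta-Llama-3-8B-Instruct-FT-lora",
    device_map="cpu",
)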

Do you have any idea what I am doing wrong? I can't find anything online.


crux82 commented May 29, 2024

I also have the same problem. I think it is connected to #25.

Can someone help us?

Thank you!!!
