[Usage] tokenization mismatch when finetuning v1.5-7b #661

Open
Liu0329 opened this issue Oct 25, 2023 · 24 comments

Liu0329 commented Oct 25, 2023

Describe the issue

Issue:
I have found some threads reporting the tokenization mismatch problem, but I am still confused. I downloaded the v1.5-7b weights from https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main and fine-tuned on the datasets from the paper. I adapted the command line to make it run on V100 GPUs.
tokenizers.__version__ == '0.14.1'

Command:

WANDB_MODE=disabled deepspeed llava/train/train.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path /path/to/llm_weights/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower /path/to/llm_weights/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/llm_weights/llava-v1.5-7b/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --fp16 True \
    --output_dir ./checkpoints/llava-v1.5-7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

Screenshots:
[screenshot]


yuyq96 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?


yuyq96 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head (with an automatically added bos token)

I tried to fix this WARNING by:

cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1
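
For anyone who wants to see this directly, here is a minimal reproduction sketch. The path is a placeholder for a local copy of the v1.5-7b weights, the prompt is illustrative rather than the exact one built in train.py, and the token ids mentioned in the comments are simply the values reported above:

from transformers import AutoTokenizer

# Slow (SentencePiece) tokenizer, as used by the training script.
tok = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)

conversation = "USER: hi ASSISTANT: hello</s>USER: bye ASSISTANT: ok</s>"

# preprocess_v1 splits on "</s>" and tokenizes each round on its own:
# each per-round call prepends a bos and splits "USER" at the string head
# (reported as [1, 3148, 1001]), while inside the full conversation it is a
# single token (reported as [11889]); the "</s>" consumed by split() is
# never counted. That is what the -2 + 1 correction above compensates for.
for rou in conversation.split("</s>"):
    if rou:
        print(len(tok(rou).input_ids), tok(rou).input_ids[:4])

# The whole conversation tokenized once, which is what the labels cover:
print(len(tok(conversation).input_ids))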


Liu0329 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added?
[screenshot]


yuyq96 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added?

You can check my last modification; the mismatch is due to the different tokenization of 'USER' and the missing </s>.


Liu0329 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

The truth is that "USER" is tokenized as [11889] in the middle of the prompt, but as [1, 3148, 1001] at the head (with an automatically added bos token).

I tried to fix this WARNING by:

cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1

Awesome! I tested your change, and it works. So the problem is caused by both USER and </s>. To clarify, the change should be made in the preprocess_v1 method in llava/train/train.py.
@haotian-liu Please double-check.


Liu0329 commented Oct 25, 2023

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head".
So it seems the problem is caused by missing space before and after </s> ?
企业微信截图_16982203487382


yuyq96 commented Oct 25, 2023

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this leads to different tokenization results with the LLaMA tokenizer.
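
A hedged way to check this locally (the path is a placeholder, and the exact ids depend on the tokenizers version; the point is only that the two forms differ):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)

# Compare how the answer is tokenized when "</s>" follows it directly
# versus after a space; per the discussion above, the two do not match.
print(tok("ASSISTANT: No</s>").input_ids)
print(tok("ASSISTANT: No </s>").input_ids)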


Liu0329 commented Oct 25, 2023

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this will lead to different tokenization results with LLaMA tokenizer.

For the above case, can the tokenizer correctly separate "No" (or other words) from the </s> that follows it? If not, training would be harmed, so the better solution might be to modify the prompt.
I also tried inserting a space before and after </s>, but the mismatch showed up again with the original code.

haotian-liu (Owner) commented:

Hi, we have temporarily pinned the tokenizers version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.

You may run pip install "tokenizers>=0.12.1,<0.14", and try again. Thanks.

haotian-liu (Owner) commented:

@yuyq96 Thanks for the fix, I'll look into this issue. Might this fix cause issues with earlier tokenizer versions? I feel that there were some behavioral changes in the tokenizer.


Liu0329 commented Oct 26, 2023

Hi, we have temporarily pinned the tokenizers version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.

You may run pip install "tokenizers>=0.12.1,<0.14", and try again. Thanks.

Thanks, downgrading tokenizers to 0.12.1 and transformers to 4.31.0 solved the problem. I also tried inserting spaces before and after </s>, and the warning showed up again; I don't know why the extra spaces don't work.


zzzzzzrc commented Nov 1, 2023

@haotian-liu In my experiment, setting use_fast=True for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1.
But I don't know why there is a mismatch when use_fast=False.
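
For reference, a sketch of where that flag goes. The call below mirrors the AutoTokenizer.from_pretrained call in llava/train/train.py, but double-check the exact arguments against your local copy; the path is a placeholder:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "/path/to/llm_weights/llava-v1.5-7b",
    model_max_length=2048,
    padding_side="right",
    use_fast=True,  # train.py sets use_fast=False by default
)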


GuoQiushan commented Nov 7, 2023

@haotian-liu In my experiment, setting use_fast=True for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1. But I don't know why there is a mismatch when use_fast=False.

@zzzzzzrc I tried setting use_fast=True and it works. But I'm not sure whether it will affect the final performance or not. Do you have any suggestions?

xiechengmude commented:

Is this fixed?

GuoQiushan commented:

Is this fixed?

Setting use_fast=True works in my case.


ryusaeba commented Dec 1, 2023

@haotian-liu
This is a similar issue to the one FastChat encountered. The root cause is that Hugging Face introduced some bugs when dealing with added tokens. Please refer to the fix here.


liuhaogeng commented Dec 3, 2023

In round_len = len(tokenizer(rou).input_ids), the tokenizer adds a bos (Vicuna's bos) for each round, so I wonder whether the round_len calculation is right? Thanks.


xxxwuwq commented Feb 2, 2024

I encountered the "tokenization mismatch" issue during fine-tuning as well. Upon investigation, I found that it was primarily caused by the presence of empty strings in the "value" field of QA turns ({"from": "human", "value": ""}) in the dataset. As a result, the prompt ended up containing the string "xxx USER:ASSISTANT: xxxx", which led to the "tokenization mismatch" issue during tokenization. I'm not sure if this experience is useful, but I thought I'd share it.
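
A quick sanity check for this, as a sketch. It assumes each sample in the mix665k JSON has a "conversations" list of {"from", "value"} turns, as in the snippet quoted above, and the path matches the --data_path used in this issue:

import json

with open("./playground/data/llava_v1_5_mix665k.json") as f:
    data = json.load(f)

# Flag samples with an empty turn, which produce prompts like
# "... USER:ASSISTANT: ..." and trigger the mismatch warning.
bad = [
    sample.get("id")
    for sample in data
    if any(turn.get("value", "").strip() == "" for turn in sample.get("conversations", []))
]
print(f"{len(bad)} samples contain an empty turn")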

lucasjinreal commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch.
Setting use_fast=True does not work.

I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

20191864218 commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.

I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

Hi, I have the same issue. Have you solved it?

charismaticchiu commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.
I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

Hi, I have the same issue. Have you solved it?

Same when using LoRA to fine-tune v1.6-34b.

lucasjinreal commented:

I have fixed the issue. You just need to make sure the inputs and targets are properly masked.
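
For what it's worth, a generic sketch of what "properly masked" means here. IGNORE_INDEX follows the convention used in LLaVA's preprocessing; the helper and its response_spans argument are illustrative, not the project's actual code:

import torch

IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_targets(input_ids: torch.Tensor, response_spans):
    # response_spans: list of (start, end) token index pairs covering the
    # assistant replies; every other position is ignored by the loss.
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in response_spans:
        labels[start:end] = input_ids[start:end]
    return labels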

BlueBlueFF commented:

I have fixed the issue. You just need to make sure the inputs and targets are properly masked.

Can you share your tokenizer settings?

gujiaqivadin commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.
I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

Hi, I have the same issue. Have you solved it?

Same when using LoRA to fine-tune v1.6-34b.

Same when fine-tuning 1.5b.
