[Usage] tokenization mismatch when finetuning v1.5-7b #661

Open
Liu0329 opened this issue Oct 25, 2023 · 24 comments

Liu0329 commented Oct 25, 2023

Describe the issue

Issue:
I have found some threads reporting the tokenization mismatch problem, but I am still confused. I downloaded the v1.5-7b weights from https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main and fine-tuned on the datasets from the paper. I adapted the command line to make it run on V100 GPUs.
tokenizers.__version__ == '0.14.1'

Command:

WANDB_MODE=disabled deepspeed llava/train/train.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path /path/to/llm_weights/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower /path/to/llm_weights/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/llm_weights/llava-v1.5-7b/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --fp16 True \
    --output_dir ./checkpoints/llava-v1.5-7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

Screenshots:
[screenshot]


yuyq96 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?


yuyq96 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head (with an automatically added bos token)

I tried to fix this WARNING by:

cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1
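
For anyone who wants to see this directly, here is a minimal reproduction sketch. The path is a placeholder for a local copy of the v1.5-7b weights, the prompt is illustrative rather than the exact one built in train.py, and the token ids mentioned in the comments are simply the values reported above:

from transformers import AutoTokenizer

# Slow (SentencePiece) tokenizer, as used by the training script.
tok = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)

conversation = "USER: hi ASSISTANT: hello</s>USER: bye ASSISTANT: ok</s>"

# preprocess_v1 splits on "</s>" and tokenizes each round on its own:
# each per-round call prepends a bos and splits "USER" at the string head
# (reported as [1, 3148, 1001]), while inside the full conversation it is a
# single token (reported as [11889]); the "</s>" consumed by split() is
# never counted. That is what the -2 + 1 correction above compensates for.
for rou in conversation.split("</s>"):
    if rou:
        print(len(tok(rou).input_ids), tok(rou).input_ids[:4])

# The whole conversation tokenized once, which is what the labels cover:
print(len(tok(conversation).input_ids))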


Liu0329 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added?
[screenshot]


yuyq96 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added?

You can check my last modification; the mismatch is due to the different tokenization of 'USER' and the missing </s>.


Liu0329 commented Oct 25, 2023

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Might this be because the eos token is not automatically added?

The truth is that "USER" is tokenized as [11889] in the middle of the prompt, but as [1, 3148, 1001] at the head (with an automatically added bos token).

I tried to fix this WARNING by:

cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1

Awesome! I tested your change, and it works. So the problem is caused by both USER and </s>. To clarify, the change should be made in the preprocess_v1 method in llava/train/train.py.
@haotian-liu Please double-check.


Liu0329 commented Oct 25, 2023

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head".
So it seems the problem is caused by missing space before and after </s> ?
企业微信截图_16982203487382


yuyq96 commented Oct 25, 2023

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this leads to different tokenization results with the LLaMA tokenizer.
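
A hedged way to check this locally (the path is a placeholder, and the exact ids depend on the tokenizers version; the point is only that the two forms differ):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)

# Compare how the answer is tokenized when "</s>" follows it directly
# versus after a space; per the discussion above, the two do not match.
print(tok("ASSISTANT: No</s>").input_ids)
print(tok("ASSISTANT: No </s>").input_ids)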


Liu0329 commented Oct 25, 2023

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this will lead to different tokenization results with LLaMA tokenizer.

For the above case, can the tokenizer correctly separate "No" (or other words) from the </s> that follows it? If not, training would be harmed, so the better solution might be to modify the prompt.
I also tried inserting a space before and after </s>, but the mismatch showed up again with the original code.

haotian-liu (Owner) commented:

Hi, we have temporarily pinned the tokenizers version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.

You may run pip install "tokenizers>=0.12.1,<0.14", and try again. Thanks.

haotian-liu (Owner) commented:

@yuyq96 Thanks for the fix, I'll look into this issue. Might this fix cause issues with earlier tokenizer versions? I feel that there were some behavioral changes in the tokenizer.


Liu0329 commented Oct 26, 2023

Hi, we have temporarily pinned the tokenizers version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.

You may run pip install "tokenizers>=0.12.1,<0.14", and try again. Thanks.

Thanks, downgrading tokenizers to 0.12.1 and transformers to 4.31.0 solved the problem. I also tried inserting spaces before and after </s>, and the warning showed up again; I don't know why the extra spaces don't work.


zzzzzzrc commented Nov 1, 2023

@haotian-liu In my experiment, setting use_fast=True for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1.
But I don't know why there is a mismatch when use_fast=False.
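
For reference, a sketch of where that flag goes. The call below mirrors the AutoTokenizer.from_pretrained call in llava/train/train.py, but double-check the exact arguments against your local copy; the path is a placeholder:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "/path/to/llm_weights/llava-v1.5-7b",
    model_max_length=2048,
    padding_side="right",
    use_fast=True,  # train.py sets use_fast=False by default
)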


GuoQiushan commented Nov 7, 2023

@haotian-liu In my experiment, setting use_fast=True for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1. But I don't know why there is a mismatch when use_fast=False.

@zzzzzzrc I tried setting use_fast=True and it works. But I'm not sure whether it will affect the final performance or not. Do you have any suggestions?

xiechengmude commented:

Is this fixed?

GuoQiushan commented:

Is this fixed?

Setting use_fast=True works in my case.


ryusaeba commented Dec 1, 2023

@haotian-liu
This is a similar issue to the one FastChat encountered. The root cause is that Hugging Face introduced some bugs when dealing with added tokens. Please refer to the fix here.


liuhaogeng commented Dec 3, 2023

In round_len = len(tokenizer(rou).input_ids), the tokenizer adds a bos (Vicuna's bos) for each round, so I wonder whether the round_len calculation is right? Thanks.


xxxwuwq commented Feb 2, 2024

I encountered the "tokenization mismatch" issue during fine-tuning as well. Upon investigation, I found that it was primarily caused by the presence of empty strings in the "value" field of QA turns ({"from": "human", "value": ""}) in the dataset. As a result, the prompt ended up containing the string "xxx USER:ASSISTANT: xxxx", which led to the "tokenization mismatch" issue during tokenization. I'm not sure if this experience is useful, but I thought I'd share it.
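
A quick sanity check for this, as a sketch. It assumes each sample in the mix665k JSON has a "conversations" list of {"from", "value"} turns, as in the snippet quoted above, and the path matches the --data_path used in this issue:

import json

with open("./playground/data/llava_v1_5_mix665k.json") as f:
    data = json.load(f)

# Flag samples with an empty turn, which produce prompts like
# "... USER:ASSISTANT: ..." and trigger the mismatch warning.
bad = [
    sample.get("id")
    for sample in data
    if any(turn.get("value", "").strip() == "" for turn in sample.get("conversations", []))
]
print(f"{len(bad)} samples contain an empty turn")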

lucasjinreal commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch.
Setting use_fast=True does not work.

I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

20191864218 commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.

I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

Hi, I have the same issue. Have you solved it?

charismaticchiu commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.
I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

Hi, I have the same issue. Have you solved it?

Same when using LoRA to fine-tune v1.6-34b.

lucasjinreal commented:

I have fixed the issue. You just need to make sure the inputs and targets are properly masked.
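
For what it's worth, a generic sketch of what "properly masked" means here. IGNORE_INDEX follows the convention used in LLaVA's preprocessing; the helper and its response_spans argument are illustrative, not the project's actual code:

import torch

IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_targets(input_ids: torch.Tensor, response_spans):
    # response_spans: list of (start, end) token index pairs covering the
    # assistant replies; every other position is ignored by the loss.
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in response_spans:
        labels[start:end] = input_ids[start:end]
    return labels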

BlueBlueFF commented:

I have fixed the issue. You just need to make sure the inputs and targets are properly masked.

Can you share your tokenizer settings?

gujiaqivadin commented:

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.
I am just wondering, will it affect the training? How can this be fixed for other tokenizers, not just LLaMA's?

Hi, I have the same issue. Have you solved it?

Same when using LoRA to fine-tune v1.6-34b.

Same when fine-tuning 1.5b.
