Problem reproducing the stage 1 and stage 2 models on L40s #27

Closed
cydiachen opened this issue Feb 5, 2024 · 14 comments

Comments

@cydiachen

Thank you for your excellent work.
I followed your work and downloaded the released dataset from your link.
Since you have kindly provided an end-to-end script and a processed dataset file, I thought we could quickly reproduce your work. After two days of training, we obtained our LLaVA-Phi2 model, and it can run inference with your code.

However, it cannot reproduce the accuracy reported in your paper. Would you mind sharing any training logs or detailed information with us, so that we can debug the training process and find out what happened?

@LinB203
Member

LinB203 commented Feb 5, 2024

Do you mean the checkpoint of stage 2? We do not report stage-2 results in the paper. Do you want to reproduce the result in Table 3 or in Table 7?

@cydiachen
Author

> Do you mean the checkpoint of stage 2? We do not report stage-2 results in the paper. Do you want to reproduce the result in Table 3 or in Table 7?

Exactly. I carefully read your paper and found the relevant experimental results in Table 10 of the supplementary materials.
For 'phi-2' without MoE, VQA^T and VQA-v2 score 68.7 and 77.1, but our results are 31 and 49, a large performance gap.

@cydiachen
Author

Additionally, I think it is necessary to clarify the dataset we used for reproduction.

  1. Stage 1: --data_path ${JSON_FOLDER}/llava_image_.json
  2. Stage 2: --data_path ${JSON_FOLDER}/la_tune_256k.json ${JSON_FOLDER}/lrv_tune_331k.json ${JSON_FOLDER}/lvis_tune_220k_.json ${JSON_FOLDER}/svit_tune_157k.json ${JSON_FOLDER}/nlp_tune.json

@LinB203
Member

LinB203 commented Feb 5, 2024

I think you misunderstood our paper. The LLaVA-Phi in Table 7 is not obtained by training with the stage-2 data. Please refer to variant (c) of Table 5 and the Effect of Training Strategy subsection to figure out the setup.

We had not evaluated a model trained on the stage-2 data, but to make sure your results are consistent, we did so just now. The result we got on TextVQA was 31.7, aligned with yours.

By the way, if you want better results, you can take the LLaVA-1.5 data (which is the stage-3 data in MoE-LLaVA) and train a non-MoE version. That would actually be a LLaVA-Phi and have no connection to MoE-LLaVA.

@cydiachen
Author

cydiachen commented Feb 5, 2024

> I think you misunderstood our paper. The LLaVA-Phi in Table 7 is not obtained by training with the stage-2 data. Please refer to variant (c) of Table 5 and the Effect of Training Strategy subsection to figure out the setup.
>
> We had not evaluated a model trained on the stage-2 data, but to make sure your results are consistent, we did so just now. The result we got on TextVQA was 31.7, aligned with yours.
>
> By the way, if you want better results, you can take the LLaVA-1.5 data (which is the stage-3 data in MoE-LLaVA) and train a non-MoE version. That would actually be a LLaVA-Phi and have no connection to MoE-LLaVA.

Thanks a lot. This project is solid and open to the community. I will keep in touch with you to further explore the potential of the method.

@cydiachen
Author

@LinB203
Hello, Lin. I am now working on integrating the MiniCPM LLM with your work. Since MiniCPM is very similar to Phi-2, I followed the Phi-2 pipeline and implemented the whole pipeline. The model loads its parameters correctly, but stage-1 pretraining starts with a large loss (~5). Is this normal for this LLM backbone?

@LinB203
Member

LinB203 commented Feb 8, 2024

We have actually finished training MoE-LLaVA-MiniCPM. We provide the train_state.json for all three stages for reference. Please feel free to open a new issue if you have one.
stage3.json
stage1.json
stage2.json

@cydiachen
Author

> We have actually finished training MoE-LLaVA-MiniCPM. We provide the train_state.json for all three stages for reference. Please feel free to open a new issue if you have one. stage3.json stage1.json stage2.json

Thank you. My initial loss is the same as yours. I will open a new issue if I run into more questions.

@LinB203
Member

LinB203 commented Feb 8, 2024

> > We have actually finished training MoE-LLaVA-MiniCPM. We provide the train_state.json for all three stages for reference. Please feel free to open a new issue if you have one. stage3.json stage1.json stage2.json
>
> Thank you. My initial loss is the same as yours. I will open a new issue if I run into more questions.

Btw, we are training at 384×384 resolution, so the final loss may be a little different.

As the JSON shows, the loss rises dramatically in the last few steps, making the last saved checkpoint unusable. So I suggest saving more checkpoints during training; e.g., if you train for 5198 steps in total, the checkpoint at step 5000 may be much better than the last one.

This seems to be a problem caused by MiniCPM; I haven't encountered it with other models.
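
For illustration, here is a minimal sketch of saving checkpoints more frequently via Hugging Face TrainingArguments, which the training scripts configure through the equivalent command-line flags (--save_steps, --save_total_limit); the path and values below are hypothetical, not the repo's defaults.

from transformers import TrainingArguments

# Hypothetical values for illustration; tune save_steps to your run length.
training_args = TrainingArguments(
    output_dir="./checkpoints/moe-llava-minicpm-stage2",  # assumed output path
    save_strategy="steps",   # save on a step interval, not only per epoch
    save_steps=500,          # keep a checkpoint every 500 steps
    save_total_limit=3,      # retain only the 3 most recent checkpoints
)

With this, a loss spike in the final steps still leaves a usable checkpoint a few hundred steps back.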

@cydiachen
Author

> Btw, we are training at 384×384 resolution, so the final loss may be a little different.
>
> As the JSON shows, the loss rises dramatically in the last few steps, making the last saved checkpoint unusable. So I suggest saving more checkpoints during training; e.g., if you train for 5198 steps in total, the checkpoint at step 5000 may be much better than the last one.
>
> This seems to be a problem caused by MiniCPM; I haven't encountered it with other models.

I am currently working at 336×336 resolution and did not observe the loss increase at the end.
Unluckily, I met another problem. After stage 2, my loss is exactly the same as yours.
But when I tried to evaluate the result on TextVQA, the model output endless, repeated tokens. My tokenizer and conversation template follow llama-2, which I assumed would align with MiniCPM.

Canon ODADADADADADADA                                                                                                                                                  
  0%|                                                                                                                               | 1/5000 [00:06<8:34:01,  6.17s/it]
OCRupupupupupupupupupupD D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D F ACRupupup" The small small small small small small s
mall small small small small small small small small a C RUP                                                                                                           
  0%|                                                                                                                               | 2/5000 [00:10<7:10:31,  5.17s/it]
ThisESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTEST                                                          
  0%|                                                                                                                               | 3/5000 [00:11<4:43:20,  3.40s/it]
No Single Single Single Single OCR                                                                                                                                     
  0%|                                                                                                                               | 4/5000 [00:16<5:18:11,  3.82s/it]
The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The
 The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The Th
e The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The T
he The The                                                                                                                                                             
  0%|▏                                                                                                                              | 5/5000 [00:20<5:37:25,  4.05s/it]
Number Number Number                                                      2,,,,,,,,,,,,,,,2, O,,,,2, O,, a player from the baseball baseball baseball player from the "
2,, O,, a                2, a player from the "2,                                                                                                                      
  0%|▏                                                                                                                              | 6/5000 [00:25<5:48:18,  4.18s/it]
The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The
 The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The Th
e The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The T
he The The                                                                                                                                                             
  0%|▏                                                                                                                              | 7/5000 [00:29<5:55:32,  4.27s/it]
RoleEPEPEP OCR                                                                                                                                                         
  0%|▏                                                                                                                              | 8/5000 [00:34<5:59:31,  4.32s/it]
AITITOCR                                                                                                                                                               
  0%|▏                                                                                                                              | 9/5000 [00:38<6:09:30,  4.44s/it]
The Phot Phot Phot Phot Phot Phot Phot Phot Phot Phot Phot Phot L L L L L L L L ACR                                                                                    
  0%|▎                                                                                                                             | 10/5000 [00:43<6:09:02,  4.44s/it]
OffCR                                                                                                                                                                  
  0%|▎                                                                                                                             | 11/5000 [00:47<6:10:14,  4.45s/it]
OCR Honey Honey Honey Honey
  0%|▎                                                                                                                             | 12/5000 [00:52<6:11:34,  4.47s/it]
The OCRCR
  0%|▎                                                                                                                             | 13/5000 [00:56<6:12:15,  4.48s/it]
Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk
  0%|▎                                                                                                                             | 14/5000 [01:01<6:12:51,  4.49s/it]
  0%|▎                                                                                                                             | 14/5000 [01:01<6:07:58,  4.43s/it]

The implementation of my MiniCPM template is as follows.

conv_minicpm = Conversation(
    system="You are a helpful language and vision assistant. "
           "You are able to understand the visual content that the user provides, "
           "and assist the user with a variety of tasks using natural language.",
    roles=("USER", "ASSISTANT"),
    version="minicpm",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.LLAMA_2,
    sep="<s>",
    sep2="</s>",
)

@cydiachen cydiachen reopened this Feb 8, 2024
@LinB203
Member

LinB203 commented Feb 9, 2024

Here is my conv template.

conv_minicpm = Conversation(
    system="A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.",
    roles=("USER", "ASSISTANT"),
    version="minicpm",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.TWO,
    sep=" ",
    sep2="</s>",
)
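
The key difference is SeparatorStyle.TWO with sep=" ": the LLAMA_2 style wraps each turn in [INST] ... [/INST] markers, while TWO simply joins "USER: ..." / "ASSISTANT: ..." turns with alternating separators. Here is a rough sketch of the TWO-style prompt assembly (build_prompt_two is a hypothetical helper mirroring the LLaVA-style get_prompt logic, for illustration only):

def build_prompt_two(system, messages, sep=" ", sep2="</s>"):
    # Alternating separators: `sep` after USER turns, `sep2` after ASSISTANT turns.
    seps = [sep, sep2]
    ret = system + seps[0]
    for i, (role, message) in enumerate(messages):
        if message:
            ret += role + ": " + message + seps[i % 2]
        else:
            ret += role + ":"  # open slot for the assistant to generate into
    return ret

# build_prompt_two(system, [("USER", "What is in the image?"), ("ASSISTANT", None)])
# -> "... questions. USER: What is in the image? ASSISTANT:"

If training and evaluation disagree on this format (e.g. the model is prompted LLAMA_2-style but was trained TWO-style), the stop sequence never appears where the model expects it, which would be consistent with the endless repetition shown above.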

@bug-fixed

bug-fixed commented Feb 19, 2024

@LinB203 Hi Lin, thanks for your great work and thoughtful interactions.
I have a question about finetune_moe.sh:

https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/c7a5a42efe8dbd092d1c8e51e6265996f5a138b8/scripts/v1/phi2/finetune_moe.sh#L16C26-L16C62

The final MoE-LLaVA is finetuned from a stage-2 finetuned checkpoint. I finetuned a stage-2 checkpoint as shared in https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/phi2/finetune.sh. The result is:

yes/no: 78.6
number: 21.48
other: 41.93
overall: 54.72

I'm not sure if this result is reasonable. Would you please share some evaluation metrics of this checkpoint on the VQAv2 dataset? It would be much appreciated if you could share these checkpoints. Thanks.

@cydiachen
Author

> @LinB203 Hi Lin, thanks for your great work and thoughtful interactions. I have a question about finetune_moe.sh:
>
> https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/c7a5a42efe8dbd092d1c8e51e6265996f5a138b8/scripts/v1/phi2/finetune_moe.sh#L16C26-L16C62
>
> The final MoE-LLaVA is finetuned from a stage-2 finetuned checkpoint. I finetuned a stage-2 checkpoint as shared in https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/phi2/finetune.sh. The result is:
>
> yes/no: 78.6
> number: 21.48
> other: 41.93
> overall: 54.72
>
> I'm not sure if this result is reasonable. Would you please share some evaluation metrics of this checkpoint on the VQAv2 dataset? It would be much appreciated if you could share these checkpoints. Thanks.

You can find the accuracy scores in the results; you can check them there.
In addition, you can evaluate your model offline on the TextVQA dataset.
In my reproduction (with more gradient accumulation), the VQA-v2 and TextVQA scores are slightly below those reported, but the difference is within 1%.

@bug-fixed

> > @LinB203 Hi Lin, thanks for your great work and thoughtful interactions. I have a question about finetune_moe.sh:
> >
> > https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/c7a5a42efe8dbd092d1c8e51e6265996f5a138b8/scripts/v1/phi2/finetune_moe.sh#L16C26-L16C62
> >
> > The final MoE-LLaVA is finetuned from a stage-2 finetuned checkpoint. I finetuned a stage-2 checkpoint as shared in https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/phi2/finetune.sh. The result is:
> >
> > yes/no: 78.6
> > number: 21.48
> > other: 41.93
> > overall: 54.72
> >
> > I'm not sure if this result is reasonable. Would you please share some evaluation metrics of this checkpoint on the VQAv2 dataset? It would be much appreciated if you could share these checkpoints. Thanks.
>
> You can find the accuracy scores in the results; you can check them there. In addition, you can evaluate your model offline on the TextVQA dataset. In my reproduction (with more gradient accumulation), the VQA-v2 and TextVQA scores are slightly below those reported, but the difference is within 1%.

Hi @cydiachen, many thanks for your kind reply and the shared information. Greatly appreciated!
I ran the evaluation on the TextVQA dataset and got a score of 33% (before MoE) and 47% (after MoE). Is this result reasonable? I checked your previous comments in this thread, and our results seem similar on this dataset.
But in Table 10 of the paper, the scores are 67.8% (without MoE) and 68.7% (with MoE), a much higher margin than our results.
