[Question] Cannot reproduce MME results on LLaVA-1.5-7B #630

Open · yix-chen opened this issue Oct 20, 2023 · 8 comments

@yix-chen

Question

Hi, I cannot reproduce the MME results after fine-tuning with finetune.sh on the 665k instruction-tuning dataset and evaluating with the MME evaluation scripts. We followed all the settings except flash-attention on A100s and got 1466.6. Given that the paper reports 1510 on MME, is this normal fluctuation, or do some hyperparameters need to be tweaked?

@haotian-liu (Owner)

Hi, can you share the numbers you get when evaluating the official checkpoints on your local machine (to make sure the eval itself is consistent)? Also, what about the numbers on other datasets? Are they consistently lower (can you share them as well)? Thanks.

@yix-chen (Author)

Hi Haotian,

The MME evaluation on the official v1.5-7B checkpoint is fine: 1508.9. On other datasets, my results for the official checkpoint are also consistent with the reported numbers. So I wonder if something went wrong in fine-tuning, e.g., that flash-attention was not used?

@haotian-liu (Owner)

Hi @yix-chen

I have not tested running without flash-attention, but in theory it is an exact-attention optimization, so training with or without it should not significantly affect the results.
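For intuition, here is a quick sanity check (standard PyTorch 2.x, not code from this repo; newer versions expose torch.nn.attention.sdpa_kernel instead) showing that the flash kernel matches naive attention up to floating-point rounding:

```python
import torch
import torch.nn.functional as F

# Requires a CUDA device; the flash kernel only covers fp16/bf16 on GPU.
torch.manual_seed(0)
q, k, v = (torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Naive reference: softmax(QK^T / sqrt(d)) @ V.
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
ref = scores.softmax(dim=-1) @ v

# Same computation through the fused flash backend.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)

print((ref - out).abs().max())  # small fp16 rounding gap, not bit-identical
```

The two outputs agree up to rounding, but they are not bit-identical, and such tiny differences can still compound over a long fine-tuning run.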

It seems the eval is fine, but it is still hard to determine the cause from MME performance alone. Can you share the numbers for more datasets you have tested, so that we can see both the trend and the exact absolute differences? Thanks.

@Carol-lyh

We cannot reproduce the results on MME either; our result is 1457.7.

@TempleX98

We also failed to reproduce the official performance. Our model got a score of 1473.

@haotian-liu (Owner) commented Nov 27, 2023

This may be due to some unexpected randomness in distributed training (#864), though we haven't figured out where the randomness comes from -- the data mixture order is verified to be the same across different runs, and there should not be any randomly initialized weights when starting from a pretrained projector.

This observed randomness leads to fluctuation on some benchmarks -- MME is the most prominent (I can get +/- 20 around the reported 1510 for the 7B model, and similar for the 13B model), while other datasets are mostly stable.

Any observations/advice regarding the randomness are welcome.
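One concrete starting point for anyone investigating: pin every RNG and force deterministic kernels, then diff the losses of two runs. A minimal sketch with standard PyTorch knobs (not code from this repo):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Pin all RNGs and request deterministic kernels (may slow training)."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Needed for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # warn_only avoids hard errors on ops with no deterministic variant.
    torch.use_deterministic_algorithms(True, warn_only=True)
```

If two seeded runs still diverge after this, the remaining suspects are ops without deterministic implementations and the order of cross-GPU gradient reductions.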

@eehover commented Dec 6, 2023

Try setting a DeepSpeed ZeRO-1 config; the loss will then be the same on every run.
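For reference, a minimal ZeRO stage-1 config in the style of the repo's zero2.json/zero3.json (field values are assumptions, and the "auto" entries rely on the HF Trainer integration; pass the file to finetune.sh via the --deepspeed flag):

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": { "stage": 1 },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}
```

A plausible explanation is that stages 2/3 partition gradients and change the cross-GPU reduction pattern, while stage 1 only shards optimizer states and keeps a plain gradient all-reduce, which makes the loss curve repeatable.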

@zjysteven

I can confirm the same thing here. I'm using lmms-eval for evaluation: the released llava-1.5-7b checkpoint gets 1512 on MME, while my retrained/reproduced version gets only 1478.
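For anyone comparing, an MME run with lmms-eval looks roughly like this (flags follow the lmms-eval README at the time; exact options may vary between versions):

```bash
accelerate launch -m lmms_eval \
  --model llava \
  --model_args pretrained=liuhaotian/llava-v1.5-7b \
  --tasks mme \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/
```

Point pretrained= at a local path to score a retrained checkpoint the same way.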
