[Question] Cannot reproduce MME results on LLaVA-1.5-7B #630

Open · yix-chen opened this issue Oct 20, 2023 · 8 comments

@yix-chen

Question

Hi, I cannot reproduce the MME results after fine-tuning with finetune.sh on the 665k instruction-tuning dataset and evaluating with the MME evaluation scripts. We followed all the settings except flash-attention on A100s and got 1466.6. Given that the paper reports 1510 on MME, is this normal fluctuation, or do some hyperparameters need to be tweaked?

@haotian-liu (Owner)

Hi, can you share the numbers you get when evaluating the official checkpoints on your local machine (to make sure the eval itself is consistent)? Also, what about the numbers on other datasets? Are they consistently lower (can you share them as well)? Thanks.

@yix-chen (Author)

Hi Haotian,

The MME evaluation on the official v1.5-7B checkpoint is fine: 1508.9. On other datasets, my results for the official checkpoint are also consistent with the reported numbers. So I wonder if something went wrong in fine-tuning, e.g., that flash-attention was not used?

@haotian-liu (Owner)

Hi @yix-chen

I have not tested running without flash-attention, but in theory it is an exact-attention optimization, so training with or without it should not significantly affect the results.
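For intuition, here is a quick sanity check (standard PyTorch 2.x, not code from this repo; newer versions expose torch.nn.attention.sdpa_kernel instead) showing that the flash kernel matches naive attention up to floating-point rounding:

```python
import torch
import torch.nn.functional as F

# Requires a CUDA device; the flash kernel only covers fp16/bf16 on GPU.
torch.manual_seed(0)
q, k, v = (torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Naive reference: softmax(QK^T / sqrt(d)) @ V.
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
ref = scores.softmax(dim=-1) @ v

# Same computation through the fused flash backend.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)

print((ref - out).abs().max())  # small fp16 rounding gap, not bit-identical
```

The two outputs agree up to rounding, but they are not bit-identical, and such tiny differences can still compound over a long fine-tuning run.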

It seems the eval is fine, but it is still hard to determine the cause from MME performance alone. Can you share the numbers for more datasets you have tested, so that we can see both the trend and the exact absolute differences? Thanks.

@Carol-lyh

We cannot reproduce the results on MME either; our result is 1457.7.

@TempleX98

We also failed to reproduce the official performance. Our model got a score of 1473.

@haotian-liu (Owner) commented Nov 27, 2023

This may be due to some unexpected randomness in distributed training (#864), though we haven't figured out where the randomness comes from -- the data mixture order is verified to be the same across different runs, and there should not be any randomly initialized weights when starting from a pretrained projector.

This observed randomness leads to fluctuation on some benchmarks -- MME is the most prominent (I can get +/- 20 around the reported 1510 for the 7B model, and similar for the 13B model), while other datasets are mostly stable.

Any observations/advice regarding the randomness are welcome.
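One concrete starting point for anyone investigating: pin every RNG and force deterministic kernels, then diff the losses of two runs. A minimal sketch with standard PyTorch knobs (not code from this repo):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Pin all RNGs and request deterministic kernels (may slow training)."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Needed for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # warn_only avoids hard errors on ops with no deterministic variant.
    torch.use_deterministic_algorithms(True, warn_only=True)
```

If two seeded runs still diverge after this, the remaining suspects are ops without deterministic implementations and the order of cross-GPU gradient reductions.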

@eehover commented Dec 6, 2023

Try setting a DeepSpeed ZeRO-1 config; the loss will then be the same on every run.
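For reference, a minimal ZeRO stage-1 config in the style of the repo's zero2.json/zero3.json (field values are assumptions, and the "auto" entries rely on the HF Trainer integration; pass the file to finetune.sh via the --deepspeed flag):

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": { "stage": 1 },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}
```

A plausible explanation is that stages 2/3 partition gradients and change the cross-GPU reduction pattern, while stage 1 only shards optimizer states and keeps a plain gradient all-reduce, which makes the loss curve repeatable.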

@zjysteven

I can confirm the same thing here. I'm using lmms-eval for evaluation: the released llava-1.5-7b checkpoint gets 1512 on MME, while my retrained/reproduced version gets only 1478.
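For anyone comparing, an MME run with lmms-eval looks roughly like this (flags follow the lmms-eval README at the time; exact options may vary between versions):

```bash
accelerate launch -m lmms_eval \
  --model llava \
  --model_args pretrained=liuhaotian/llava-v1.5-7b \
  --tasks mme \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/
```

Point pretrained= at a local path to score a retrained checkpoint the same way.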
