[Question] Cannot reproduce MME results on LLaVA-1.5-7B #630
Comments
Hi, can you share the numbers of the official checkpoints you evaluated on your local machine (to make sure that the eval is consistent)? Also, what about the numbers on other datasets? Are they consistently lower (can you also share them)? Thanks.
Hi Haotian, the MME evaluation on the official v1.5-7B checkpoint is fine, which gives 1508.9. On other datasets, the results are also consistent with the reported ones. So I wonder if something went wrong in finetuning, e.g., flash-attention was not used?
Hi @yix-chen, I have not tested running without flash-attention, but theoretically it is an exact-attention optimization, so using it or not should not significantly affect the results. It seems that the eval is fine, but it is still hard to determine the cause from MME performance alone. Can you share the numbers for more datasets you have tested, so that we can see both the trend and the exact absolute differences? Thanks.
We cannot reproduce the results on MME either; our result is 1457.7.
We also failed to reproduce the official performance. Our model got a score of 1473.
This may be due to some unexpected randomness when using distributed training (#864), although we haven't figured out where the randomness is -- the data mixture order is verified to be the same across different runs, and there should not be any randomly initialized weights if we start with a pretrained projector. This observed randomness has led to fluctuations in some benchmark scores -- MME is the most prominent (I can get +/- 20 from the reported 1510 for the 7B model, and similarly for the 13B model), while other datasets are mostly stable. Any observations or advice regarding the randomness are welcome.
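For anyone trying to track down the nondeterminism, a minimal seeding sketch that pins the usual sources of randomness. The helper name `seed_everything` is my own, and the `torch`/`numpy` sections are guarded so the sketch runs even without them installed:

```python
import os
import random

def seed_everything(seed: int) -> None:
    """Pin the common sources of randomness in a training run."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade speed for determinism in cuDNN kernel selection:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

seed_everything(42)
a = random.random()
seed_everything(42)
b = random.random()
print(a == b)  # → True
```

Note that even with all of this, some CUDA kernels and the reduction order in distributed all-reduce can remain nondeterministic, which may be where the residual run-to-run fluctuation comes from.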
Try setting the DeepSpeed ZeRO-1 config; the loss will be the same every time.
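For reference, a minimal sketch of what such a config might look like, written out from Python. The `zero_optimization.stage` field follows the DeepSpeed config schema; the batch-size and precision values below are placeholders you would match to your own `finetune.sh` settings:

```python
import json

# Minimal DeepSpeed config selecting ZeRO stage 1 (optimizer-state
# partitioning only); all other fields are left at their defaults.
ds_config = {
    "zero_optimization": {"stage": 1},
    "train_micro_batch_size_per_gpu": 16,  # placeholder: match your launch script
    "bf16": {"enabled": True},             # placeholder: assumes A100-class GPUs
}

with open("zero1.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting `zero1.json` would then be passed to the training launcher via its `--deepspeed` argument.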
Can confirm the same thing here. I'm using lmms-eval for evaluation. The released llava-1.5-7b checkpoint got 1512 on MME, while my retrained/reproduced version got only 1478. |
Question
Hi, I cannot reproduce the MME results following finetune.sh on the 665k instruction-tuning dataset and the evaluation scripts for MME. We followed all the settings except flash-attention on A100 and got 1466.6. Given that the paper reports 1510 on MME, is that a normal fluctuation, or do some hyperparameters need to be tweaked?
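As a quick sanity check against the +/- 20 run-to-run band a maintainer reports above for the 7B model, the gap here is more than double that band, so it looks larger than normal fluctuation:

```python
reported, observed, band = 1510.0, 1466.6, 20.0  # band: +/-20 per the maintainer's comment
delta = round(reported - observed, 1)
print(delta)         # → 43.4
print(delta <= band) # → False: outside the observed fluctuation range
```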