Evaluation MC Questions #1875
Comments
Yes, the result should be the same (at least in the single-token-completions case). Have you compared the logits you get to the observed per-sample values? (The ones we log won't have the softmax applied to them, though; we just select the largest one.)
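To make the "largest logit, no softmax" point concrete, here is a minimal sketch (not code from this repo; it assumes a Hugging Face transformers causal LM and single-token answer choices, and all names are illustrative). Because softmax is monotonic, taking the argmax over the raw logits of the candidate answer tokens gives the same prediction as taking the argmax over their softmax probabilities:

```python
# Minimal sketch (illustrative, not this repo's code): for single-token answer
# choices, compare the raw next-token logits of the candidate tokens.
# Softmax is monotonic, so argmax over raw logits == argmax over probabilities.
import torch

def predict_single_token(model, tokenizer, prompt, choice_strings=(" A", " B", " C", " D")):
    # Assumes each choice string tokenizes to exactly one token.
    choice_ids = [tokenizer(c, add_special_tokens=False).input_ids[0]
                  for c in choice_strings]
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        last_logits = model(input_ids).logits[0, -1]  # raw, un-normalized logits
    choice_logits = last_logits[choice_ids]
    # Applying softmax would not change which choice is selected.
    assert choice_logits.argmax().item() == torch.softmax(choice_logits, dim=0).argmax().item()
    return int(choice_logits.argmax().item())
```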
Something to flag is that occasionally model predictions can flip due to things as minor as running with a different batch size, if the logits on two answer choices were very close or the same floating-point value. But there should not be large divergences.
I have a question about evaluating LLMs on multiple-choice (MC) questions using the log-likelihood of tokens. Based on existing implementations such as the original MMLU evaluation, the code snippet would look like this:
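(The snippet itself did not come through in this thread; below is a representative sketch of that common approach, assuming a Hugging Face transformers causal LM. All names are illustrative, not the poster's original code.)

```python
# Representative sketch of the common "log-likelihood of tokens" approach
# (illustrative; assumes a Hugging Face transformers causal LM).
import torch
import torch.nn.functional as F

def score_choice(model, tokenizer, prompt, choice):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab]
    # Log-probabilities for predicting token t+1 from position t.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lls = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Sum only over the continuation (answer) tokens.
    # Note: tokenizing `prompt` and `prompt + choice` separately can split
    # tokens differently at the boundary, which is one common source of
    # score differences between implementations.
    n_prompt = prompt_ids.shape[1]
    return token_lls[0, n_prompt - 1:].sum().item()

def predict(model, tokenizer, prompt, choices):
    scores = [score_choice(model, tokenizer, prompt, c) for c in choices]
    return max(range(len(scores)), key=lambda i: scores[i])
```

Whether the resulting numbers then match lm-evaluation-harness also depends on using the same prompt format and the same scoring variant (e.g., raw vs. length-normalized log-likelihood).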
However, the result from doing this is usually very different from lm-evaluation-harness. For example, running evaluations on Llama yields two different numbers. I looked into the code in this repo but cannot pin down what the difference is. I understand that this repo has more wrappers around the model, but under the hood, is the idea the same as in the code snippet above? I am a bit confused, since I get vastly different results from lm-evaluation-harness when coding it this way.