Update scorer for TriviaQA task #944
base: master
Conversation
@vvchernov Can you report on how this changes the evaluation scores of models?
I'm still concerned about letting false positives through with answer extraction like this, since it just searches for any occurrence of the correct answer as a substring of the response. As the simplest failure case: Q: "...." Additionally, for TriviaQA I have observed multiple questions that contain (one of) the accepted gold answer aliases verbatim in the question.
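To make the concern concrete, here is a minimal sketch, assuming an invented question and a hypothetical `substring_match` helper (none of this is code from the PR), of how permissive substring extraction can score responses as correct when no real answer was given:

```python
def substring_match(continuation: str, candidates: list[str]) -> bool:
    """Permissive extraction under discussion: correct if any gold alias occurs in the response."""
    text = continuation.lower()
    return any(alias.lower() in text for alias in candidates)

# Invented failure case 1: the response denies the candidate but still matches.
print(substring_match("It is definitely not Paris, it is Lyon.", ["Paris"]))  # True -> false positive

# Invented failure case 2: the gold answer ("True Grit") appears verbatim in the
# question, so a model that merely restates the question is scored as correct.
echo = "You asked which 1969 John Wayne film is called True Grit."
print(substring_match(echo, ["True Grit"]))  # True, yet no answer was actually extracted
```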
Hello @StellaAthena and @haileyschoelkopf! Sorry for the delay, I needed time to perform some tests. I've added the results of testing 32 samples of TriviaQA on Llama2-7b to the PR description, with 4 examples of false negatives from these samples at the end. You can see there is a very big gap between accuracy with and without the fix. In general I agree with Hailey that our update is not ideal, but the current TriviaQA evaluator (I think the other ones are similar) is so strict that the majority of cases are false negatives (for fewshot=0). I agree that the suggested evaluator may slightly overestimate the model, but I think a case like (A: "not X" matching candidate "X") should be very rare.

Another important thing to discuss: fewshot > 0. It looks like the evaluators were created for this case, since they assume a predefined answer format that can be easily parsed. But we should separate two things: the correct answer format and the correct answer. I can understand the enterprise value of a fixed answer format, but for us the number of correct answers from the model is the more important thing. Possibly a more flexible testing tool is needed that separates the correct format from the correct answer. Additionally, even when we use fewshots (~10), models still answer in an arbitrary manner (sometimes correctly).

One specific of working with fewshots is the following. When we use llama2 (possibly other models behave similarly), we need to prepare the fewshots in a special way (adding [INST] and [/INST] tags between requests and answers); only then does it understand them correctly. Without this preparation, llama2 tries to answer all of the questions inside the input prompt. For now we use a hardcoded patch for this, because fewshots are prepared inside the task class independently of the model type.

P.S. I promise to update the description for the gsm8k evaluator PR in the same way soon.
Thanks for sharing this! You're correct that our current setup works better in the few-shot case. When we use the task description from the Llama 1 paper, we've been able to replicate Llama-1 performance from that paper with our current TriviaQA implementation, so in this sense our current implementation matches the standard for this benchmark. One way to resolve this difference would be either to make your updated-extraction TriviaQA a separate task variant that one can opt into, or to report two metrics for triviaqa: accuracy/exact_match, and your metric ("extract_acc"? some descriptive name indicating more permissive processing), so that people can report both.
Could you provide a pasted text snippet describing what you are doing here, and what the failure mode it fixes is? I am a bit confused by what is going on here. I would also like to note that we intend lm-eval to be a place for running evals across models in a standardized way, which includes the prompt, and so we typically don't want to encourage prompt engineering for each model individually--though we may add the ability to wrap standardized prompts in a model-expected chat template format.
Hello @haileyschoelkopf!
Thank you for the alternative way! I need time to think about this and discuss it with my colleagues. Code snippet for lm_eval/base.py:
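A minimal sketch of the kind of change meant here, assuming a hypothetical standalone helper rather than the actual patch to lm_eval/base.py:

```python
# Sketch of the idea: wrap each few-shot question/answer pair in Llama-2 chat
# tags so the model treats earlier questions as completed turns instead of
# text it should answer. The helper name below is illustrative.
def format_llama2_fewshot(pairs, final_question):
    """pairs: list of (question, answer) few-shot examples."""
    prompt = ""
    for question, answer in pairs:
        prompt += f"[INST] {question} [/INST] {answer} "
    # Leave the real question open after its closing [/INST] tag.
    prompt += f"[INST] {final_question} [/INST]"
    return prompt

print(format_llama2_fewshot([("2+2?", "4")], "3+3?"))
# [INST] 2+2? [/INST] 4 [INST] 3+3? [/INST]
```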
On the one hand, I agree about tool standardization. On the other hand, for example, we get 3% accuracy for gsm8k on llama2 with the current lm-evaluation-harness and ~20% with the fix above. Fewshot is already prompt engineering to help a model answer in a predefined way, but llama2 has some specifics in fewshot construction.
@vvchernov would you be willing to port this to
DRAFT: This is a study of how to implement this as a new accuracy evaluation approach.
The pattern for matching against the correct answer for the TriviaQA task was updated. It allows correct answers from the LLM to be caught better, especially for the fewshot=0 case. The continuation is usually a long text, while a candidate is one or several words, so searching for the continuation in the list of candidates practically always fails. After the update, the scorer instead searches for at least one candidate inside the continuation (see the sketch below).
You can imagine a situation where a candidate is in the continuation but the answer is not correct. However, the questions are quite complex and candidates are not mentioned in the question, so such a case should be very rare.
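A minimal sketch of the two matching rules being compared, written as standalone illustrative functions rather than the exact code in this PR (the candidate alias is taken from the doc_id 3 example in the test results below):

```python
def old_match(continuation: str, candidates: list[str]) -> bool:
    # Behavior described above: look for the whole continuation among the
    # candidates; a long free-form continuation almost never equals a one- or
    # two-word alias, so this usually fails at fewshot=0.
    return continuation.strip().lower() in {c.strip().lower() for c in candidates}

def new_match(continuation: str, candidates: list[str]) -> bool:
    # Updated behavior: accept the answer if at least one candidate alias
    # occurs as a substring of the continuation.
    text = continuation.lower()
    return any(c.strip().lower() in text for c in candidates)

continuation = ("The first permanent European settlement in what is now Canada was "
                "established at Fort Ville-Marie in 1611. This settlement eventually "
                "grew into the city of Montreal.")
candidates = ["Montreal"]
print(old_match(continuation, candidates))  # False
print(new_match(continuation, candidates))  # True
```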
Test results:
32 samples from the TriviaQA task and meta-llama/Llama-2-7b-chat-hf were used. To get access to the model, the fix from the PR was used.
These are the results obtained without the TriviaQA evaluator fix.
{
  "results": {
    "triviaqa": {
      "em": 0.0,
      "em_stderr": 0.0
    }
  },
  "versions": {
    "triviaqa": 2
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=meta-llama/Llama-2-7b-chat-hf,use_auth_token={PRIVATE_TOKEN}",
    "num_fewshot": 0,
    "batch_size": "4",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": true,
    "limit": 32.0,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
These are the results obtained with the TriviaQA evaluator fix.
{
  "results": {
    "triviaqa": {
      "em": 0.5625,
      "em_stderr": 0.08909830562090465
    }
  },
  "versions": {
    "triviaqa": 2
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=meta-llama/Llama-2-7b-chat-hf,use_auth_token={PRIVATE_TOKEN}",
    "num_fewshot": 0,
    "batch_size": "4",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": true,
    "limit": 32.0,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
Some correct answers that are not recognized by the evaluator without this fix:
"doc_id": 3,
"prompt_0": "Question: 8Â Fort Ville-Marie was the original name of which Canadian city?\nAnswer:until",
"logit_0": "9Â The first permanent European settlement in what is now Canada was established at Fort Ville-Marie in 1611. This settlement eventually grew into the city of Montreal.sports betting",
"truth": " Montreal",
"doc_id": 4,
"prompt_0": "Question: "What year did Jean-Francois Champollion publish the first correct translation of Egyptian hieroglyphs from the Rosetta Stone, the Roman Catholic Church take Galileo Galilei's ""Dialogue"" off their list of banned books, and Britain repeal the death penalty for over 100 crimes?"\nAnswer:until",
"logit_0": ""The first correct translation of Egyptian hieroglyphs from the Rosetta Stone was published by Jean-François Champollion in 1822. The Roman Catholic Church did not take Galileo Galilei's ""Dialogue"" off their list of banned books",
"truth": " 1822",
"doc_id": 5,
"prompt_0": "Question: What is Marge Simpson's maiden name ?\nAnswer:until",
"logit_0": "Marge Simpson's maiden name is Bouvier.sports betting",
"truth": " BOUVIER",
"doc_id": 8,
"prompt_0": "Question: Which famous fishing town of east Scotland lies roughly half-way between Montrose and Dundee?\nAnswer:until",
"logit_0": "The famous fishing town of Arbroath lies roughly half-way between Montrose and Dundee.sports betting",
"truth": " ARBROATH",