Update scorer for TriviaQA task #944

Draft
vvchernov wants to merge 1 commit into master

Conversation

@vvchernov commented Oct 24, 2023

DRAFT: This explores a new approach to accuracy evaluation.

The pattern used to match the correct answer in the TriviaQA task was updated. It catches correct answers from the LLM more reliably, especially for the fewshot=0 case. The continuation is usually a long text, while a candidate answer is one or a few words, so searching for the continuation in the list of candidates practically always fails. After the update, the scorer instead searches for at least one candidate within the continuation (see the sketch below).
One can imagine a situation where a candidate appears in the continuation even though the answer is not correct. However, the questions are quite complex and the candidates are not mentioned in the question, so such cases should be very rare.
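A minimal sketch of the change, assuming continuation is the model's generated text and candidates are the gold answer aliases (the function and variable names are illustrative, not the actual harness code):

def matches_any_candidate(continuation, candidates):
    # Old behaviour (roughly): check whether the whole continuation appears in the
    # candidate list, which almost always fails when the model answers in a sentence.
    # Updated behaviour: check whether at least one candidate appears in the continuation.
    text = continuation.lower()
    return any(cand.lower() in text for cand in candidates if cand.strip())

# Example taken from the samples below:
print(matches_any_candidate(
    "Marge Simpson's maiden name is Bouvier.sports betting", ["BOUVIER"]
))  # True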

Test results:
32 samples from the TriviaQA task and meta-llama/Llama-2-7b-chat-hf were used. To get access to the model, the fix from a separate PR was used.
Below are the results obtained without the TriviaQA evaluator fix:
{
  "results": {
    "triviaqa": {
      "em": 0.0,
      "em_stderr": 0.0
    }
  },
  "versions": {
    "triviaqa": 2
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=meta-llama/Llama-2-7b-chat-hf,use_auth_token={PRIVATE_TOKEN}",
    "num_fewshot": 0,
    "batch_size": "4",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": true,
    "limit": 32.0,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

Below are the results obtained with the TriviaQA evaluator fix:
{
  "results": {
    "triviaqa": {
      "em": 0.5625,
      "em_stderr": 0.08909830562090465
    }
  },
  "versions": {
    "triviaqa": 2
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=meta-llama/Llama-2-7b-chat-hf,use_auth_token={PRIVATE_TOKEN}",
    "num_fewshot": 0,
    "batch_size": "4",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": true,
    "limit": 32.0,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

Some correct answers that the evaluator does not credit without this fix (false negatives):
"doc_id": 3,
"prompt_0": "Question: 8  Fort Ville-Marie was the original name of which Canadian city?\nAnswer:until",
"logit_0": "9  The first permanent European settlement in what is now Canada was established at Fort Ville-Marie in 1611. This settlement eventually grew into the city of Montreal.sports betting",
"truth": " Montreal",
"doc_id": 4,
"prompt_0": "Question: "What year did Jean-Francois Champollion publish the first correct translation of Egyptian hieroglyphs from the Rosetta Stone, the Roman Catholic Church take Galileo Galilei's ""Dialogue"" off their list of banned books, and Britain repeal the death penalty for over 100 crimes?"\nAnswer:until",
"logit_0": ""The first correct translation of Egyptian hieroglyphs from the Rosetta Stone was published by Jean-François Champollion in 1822. The Roman Catholic Church did not take Galileo Galilei's ""Dialogue"" off their list of banned books",
"truth": " 1822",
"doc_id": 5,
"prompt_0": "Question: What is Marge Simpson's maiden name ?\nAnswer:until",
"logit_0": "Marge Simpson's maiden name is Bouvier.sports betting",
"truth": " BOUVIER",
"doc_id": 8,
"prompt_0": "Question: Which famous fishing town of east Scotland lies roughly half-way between Montrose and Dundee?\nAnswer:until",
"logit_0": "The famous fishing town of Arbroath lies roughly half-way between Montrose and Dundee.sports betting",
"truth": " ARBROATH",

@StellaAthena
Member

@vvchernov Can you report on how this changes the evaluation scores of models?

@haileyschoelkopf
Contributor

haileyschoelkopf commented Nov 4, 2023

One can imagine a situation where a candidate appears in the continuation even though the answer is not correct. However, the questions are quite complex and the candidates are not mentioned in the question, so such cases should be very rare.

I'm still concerned about allowing false positives through with answer extraction like this. If we just search for any occurrence of the correct answer as a substring of the response, the simplest failure case is:

Q: "...."
A: "not Correct_Answer"
would be scored as correct if the answer were Correct_Answer. I think it's preferable to slightly underestimate model performance compared to introducing the potential for false positives / overestimation of model performance in this situation.
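A minimal illustration of this failure mode with plain substring matching (Correct_Answer here is just a placeholder):

gold = "Correct_Answer"
response = "not Correct_Answer"  # the model explicitly negates the answer
print(gold in response)  # True: substring search would credit this as correct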

Additionally, for TriviaQA I have observed multiple questions that contain (one of) the accepted gold answer aliases verbatim in the question.

@vvchernov
Author

Hello @StellaAthena and @haileyschoelkopf! Sorry for the delay, I needed time to run some tests. I've added to the PR description the results of testing 32 samples of TriviaQA on Llama-2-7b, and at the end 4 examples of false negatives from those samples. You can see there is a very big gap between accuracy with and without the fix.

In general I agree with Hailey that our update is not ideal. But the current TriviaQA evaluator (I think the others are similar) is so strict and unforgiving that the majority of cases are false negatives (for fewshot=0). I agree that the suggested evaluator still slightly underestimates the model. I think the case (A: "not Correct_Answer") is rare, but your point about questions that contain a gold answer is a good one, and I will be more careful about such cases.

Another important thing to discuss is fewshot > 0. It looks like the evaluators were designed for this case, since they assume a predefined answer format that can be easily parsed. But we should separate two things: the correct answer format and the correct answer itself. I can understand the enterprise value of a fixed answer format, but for us the number of correct answers from the model is the more important thing. Possibly a more flexible testing tool is needed that checks the format and the answer separately. Additionally, even when we use fewshots (~10), models still answer in an arbitrary manner (and sometimes correctly).

One specific aspect of working with fewshots is the following. When we use Llama-2 (other models possibly behave similarly), we need to prepare the fewshot examples in a special way (adding [INST] and [/INST] tags between requests and answers); only then does it interpret them correctly. Without this preparation, Llama-2 tries to answer all the questions inside the input prompt. For now we use a hardcoded patch for this, because the fewshot examples are prepared inside the task class independently of the model type.

P.S. I promise to update the description of the gsm8k evaluator PR in the same way soon.

@haileyschoelkopf
Contributor

Thanks for sharing this!

You're correct that our current setup works better in the few-shot case. When we use the task description from the Llama 1 paper, we've been able to replicate Llama-1's reported performance with our current TriviaQA implementation. So in this sense, our current implementation matches the standard for this benchmark.

One way to resolve this difference would be to either make your updated-extraction TriviaQA a separate task variant that one can opt into, or to report two metrics for triviaqa: accuracy/exact_match, and your metric ("extract_acc"? some descriptive name indicating more permissive processing) for the task such that people can report both metrics.
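A rough sketch of the dual-metric option; the name extract_acc and the standalone function are illustrative, not existing harness API:

def score_both(gold_aliases, continuation):
    # Strict metric: the processed continuation must equal one of the gold aliases.
    # Permissive metric: at least one gold alias appears inside the continuation.
    pred = continuation.strip().lower()
    aliases = [a.strip().lower() for a in gold_aliases]
    return {
        "em": float(pred in aliases),
        "extract_acc": float(any(a in pred for a in aliases)),
    }

# The strict metric misses this correct answer; the permissive one credits it.
print(score_both(["Arbroath"], "The famous fishing town of Arbroath lies roughly half-way between Montrose and Dundee."))
# {'em': 0.0, 'extract_acc': 1.0}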

When we use Llama-2 (other models possibly behave similarly), we need to prepare the fewshot examples in a special way (adding [INST] and [/INST] tags between requests and answers); only then does it interpret them correctly.

Could you provide a pasted text snippet describing what you are doing here and what failure mode it fixes? I am a bit confused by what is going on. I would also like to note that we intend lm-eval to be a place for running evals across models in a standardized way, which includes the prompt, so we typically don't want to encourage prompt engineering for each model individually, though we may add the ability to wrap standardized prompts in a model-expected chat template format.

@vvchernov
Author

Hello @haileyschoelkopf!

One way to resolve this difference would be to either make your updated-extraction TriviaQA a separate task variant that one can opt into, or to report two metrics for triviaqa: accuracy/exact_match, and your metric ("extract_acc"? some descriptive name indicating more permissive processing) for the task such that people can report both metrics.

Thank you for the alternative suggestion! I need time to think about this and discuss it with my colleagues.

Code snippet for lm_eval/base.py

# Wrap each few-shot example in Llama-2 chat tags ([INST] ... [/INST], <s> ... </s>)
labeled_examples = (
    "\n\n"
    + "".join(
        [
            self.doc_to_text(doc) + "\n[/INST]\n" + self.doc_to_target(doc) + "\n</s>\n\n<s>\n[INST]\n"
            for doc in fewshotex
        ]
    )
)
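For illustration only, a self-contained sketch of the prompt this builds for a single few-shot example; doc_to_text and doc_to_target are hypothetical stand-ins for the task methods:

def doc_to_text(doc):
    # Stand-in for the task's doc_to_text
    return f"Question: {doc['question']}\nAnswer:"

def doc_to_target(doc):
    # Stand-in for the task's doc_to_target
    return " " + doc["answer"]

fewshotex = [{"question": "Fort Ville-Marie was the original name of which Canadian city?", "answer": "Montreal"}]

labeled_examples = (
    "\n\n"
    + "".join(
        doc_to_text(doc) + "\n[/INST]\n" + doc_to_target(doc) + "\n</s>\n\n<s>\n[INST]\n"
        for doc in fewshotex
    )
)
print(repr(labeled_examples))
# '\n\nQuestion: Fort Ville-Marie was the original name of which Canadian city?\nAnswer:\n[/INST]\n Montreal\n</s>\n\n<s>\n[INST]\n'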

On one hand, I agree about tool standardization. On the other hand, for example, we get 3% accuracy on gsm8k with Llama-2 on the current lm-evaluation-harness and ~20% with the fix above. Fewshot is already prompt engineering that helps a model answer in a predefined way; Llama-2 just has some specifics in how fewshots must be constructed.
I'm not so familiar with the big-refactor branch, but possibly it has a tool for fewshot processing, such as input preprocessing.

@vvchernov vvchernov marked this pull request as draft November 20, 2023 18:19
@StellaAthena StellaAthena removed this from the v0.3.0 milestone Nov 20, 2023
@lintangsutawika
Contributor

@vvchernov would you be willing to port this to main instead? There it would be easier to have this as an alternative version to the default triviaqa task we have.
