More Flexible Answer Extraction Code #1159
Labels
feature request
A feature that isn't implemented yet.
Comments
Is this PR related -- #943 ?

Yup! This and the triviaqa ones are good examples of what we'll want to handle. Ideally we can use multiple filter pipelines for this purpose.
In LM Evaluation Harness, we work to match the "original" / "default" methods used to evaluate datasets. This includes using whatever answer extraction / post-processing is done by the original code implementations if provided, even if such extraction is flawed and may miss correct-in-substance-but-not-form answers.
Where appropriate and requested, we may consider adding more flexible answer extraction code. This has been requested by many users. I think a good middle ground might be to support both a `standard` and a `loose` filter pipeline or metric for various datasets. For GSM8k, for example, the loose variant would score based only on the last number output by the model, as opposed to scoring only `#### {number}` as correct while rejecting outputs such as `so, the answer is {number}`. There's definitely a balance to be struck, though, between being too permissive and being sufficiently flexible such that benchmarks aren't just a test of performing the right formatting steps and incantations for a model.
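To make the strict/loose distinction concrete, here is a minimal sketch of the two extraction styles described above. This is illustrative only, not the harness's actual filter implementation; the function names `strict_extract` and `loose_extract` and the regexes are assumptions for this example:

```python
import re

def strict_extract(completion: str):
    # Strict GSM8k-style extraction: accept only the canonical
    # "#### <number>" answer format; anything else scores as wrong.
    m = re.search(r"#### (-?[0-9.,]+)", completion)
    return m.group(1).replace(",", "") if m else None

def loose_extract(completion: str):
    # Loose extraction: take the last number appearing anywhere in the
    # completion, regardless of the surrounding phrasing.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    return nums[-1].replace(",", "") if nums else None
```

Under this sketch, a completion ending in `so, the answer is 42` would be scored as a miss by the strict filter but extracted as `42` by the loose one, while `#### 42` passes both.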
This issue is to track our addition of such flexibility / improvements, and solicit requests or feedback on this.