
More Flexible Answer Extraction Code #1159

Open
haileyschoelkopf opened this issue Dec 18, 2023 · 2 comments
Labels
feature request A feature that isn't implemented yet.

Comments

@haileyschoelkopf (Contributor) commented Dec 18, 2023
In LM Evaluation Harness, we work to match the "original" / "default" methods used to evaluate datasets. This includes using whatever answer extraction / post-processing the original code implementations perform, if provided, even when such extraction is flawed and may miss answers that are correct in substance but not in form.

Where appropriate and requested, we may consider adding more flexible answer extraction code. This has been requested by many users. I think a good middle ground might be to support both a standard and a loose filter pipeline or metric for various datasets. For GSM8k, for example: scoring based on the last number the model outputs, as opposed to accepting only `#### {number}` as correct and rejecting answers like `so, the answer is {number}`.
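As an illustration (a sketch only, not the harness's actual filter code), a strict GSM8k-style extractor versus a looser last-number extractor might look like this:

```python
import re

# Strict GSM8k-style extraction: accept only the "#### {number}" format.
STRICT = re.compile(r"#### (-?[\d,]+(?:\.\d+)?)")
# Loose extraction: take the last number appearing anywhere in the output.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_strict(text):
    m = STRICT.search(text)
    return m.group(1).replace(",", "") if m else None

def extract_loose(text):
    nums = NUMBER.findall(text)
    return nums[-1].replace(",", "") if nums else None

print(extract_strict("#### 42"))               # 42
print(extract_strict("so, the answer is 42"))  # None
print(extract_loose("so, the answer is 42"))   # 42
```

Under the strict filter, only the first response format scores; under the loose filter, both do, which is exactly the gap this issue is about.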

There's definitely a balance to be struck, though, between being too permissive and being flexible enough that benchmarks aren't just a test of performing the right formatting steps and incantations for a model.

This issue is to track our addition of such flexibility / improvements, and solicit requests or feedback on this.

@haileyschoelkopf haileyschoelkopf added the feature request A feature that isn't implemented yet. label Dec 18, 2023
@haileyschoelkopf haileyschoelkopf self-assigned this Dec 23, 2023
@anjor (Contributor) commented Dec 31, 2023

Is this PR related -- #943 ?

@haileyschoelkopf (Contributor, Author)

Yup! This and the triviaqa ones are good examples of what we’ll want to handle.

Ideally we can use multiple filter pipelines for this purpose.
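A minimal sketch of what running multiple filter pipelines side by side could look like, reporting exact-match accuracy under each pipeline's name (the pipeline names and helper functions here are hypothetical, not the harness's actual API):

```python
import re

# Hypothetical extractors standing in for two filter pipelines.
def strict_gsm8k(text):
    # Accept only the "#### {number}" answer format.
    m = re.search(r"#### (-?[\d,]+(?:\.\d+)?)", text)
    return m.group(1).replace(",", "") if m else None

def last_number(text):
    # Fall back to the last number anywhere in the output.
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return nums[-1].replace(",", "") if nums else None

def score(responses, golds, pipelines):
    # One exact-match accuracy per named pipeline over the same responses.
    return {
        name: sum(extract(r) == g for r, g in zip(responses, golds)) / len(golds)
        for name, extract in pipelines.items()
    }

responses = ["#### 18", "so, the answer is 18"]
golds = ["18", "18"]
print(score(responses, golds,
            {"strict-match": strict_gsm8k, "flexible-extract": last_number}))
# {'strict-match': 0.5, 'flexible-extract': 1.0}
```

Reporting both numbers lets users see how much of a model's apparent failure is formatting rather than reasoning.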
