More Flexible Answer Extraction Code #1159
Labels
feature request
A feature that isn't implemented yet.
Comments
Is this PR related -- #943 ?

Yup! This and the triviaqa ones are good examples of what we'll want to handle. Ideally we can use multiple filter pipelines for this purpose.
In LM Evaluation Harness, we work to match the "original" / "default" methods used to evaluate datasets. This includes using whatever answer extraction / post-processing is done by the original code implementations if provided, even if such extraction is flawed and may miss correct-in-substance-but-not-form answers.
Where appropriate and requested, we may consider adding more flexible answer extraction code. This has been requested by many users. I think a good middle ground might be to support both a `standard` and a `loose` filter pipeline or metric for various datasets. For GSM8k, for example, the loose variant would score based only on the last number output by the model, as opposed to scoring only `#### {number}` as correct while rejecting outputs such as `so, the answer is {number}`. There's definitely a balance to be struck, though, between being too permissive and being sufficiently flexible such that benchmarks aren't just a test of performing the right formatting steps and incantations for a model.
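To make the strict/loose distinction concrete, here is a minimal sketch of the two extraction styles described above. This is illustrative only, not the harness's actual filter implementation; the function names `strict_extract` and `loose_extract` and the regexes are assumptions for this example:

```python
import re

def strict_extract(completion: str):
    # Strict GSM8k-style extraction: accept only the canonical
    # "#### <number>" answer format; anything else scores as wrong.
    m = re.search(r"#### (-?[0-9.,]+)", completion)
    return m.group(1).replace(",", "") if m else None

def loose_extract(completion: str):
    # Loose extraction: take the last number appearing anywhere in the
    # completion, regardless of the surrounding phrasing.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    return nums[-1].replace(",", "") if nums else None
```

Under this sketch, a completion ending in `so, the answer is 42` would be scored as a miss by the strict filter but extracted as `42` by the loose one, while `#### 42` passes both.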
This issue is to track our addition of such flexibility / improvements, and solicit requests or feedback on this.