Physics GRE task added #1655

Open · wants to merge 3 commits into main
Conversation

@ShayekhBinIslam (Author)

This PR adds the Physics GRE dataset released in the Public Inflection Benchmark by @InflectionAI. Please refer here for the details.

It resolves issue #1554.


CLAassistant commented Apr 1, 2024

CLA assistant check
All committers have signed the CLA.

@haileyschoelkopf (Contributor) left a comment


Thanks very much for the PR! This is a great contribution.

  • Have you tested any models, such as Mistral, on this task? If so, do you have any sample outputs I could review?
  • Due to computational cost, it might make sense to add ..._maj1 variants of this task, or to make maj8 an optional variant that is not run by default.
  • Other than that: are all of the following filters used by Inflection in their evaluation of this task?
    # Remove the data points that have images in the input. 100 -> 76
    dataset = dataset.filter(lambda x: x["has_image"] is False)
    # Remove the data points without a ground-truth label. 76 -> 75
    dataset = dataset.filter(lambda x: x["target_scores"] is not None)
    # All questions must have one and only one correct answer.
    assert (
        len(dataset.filter(lambda x: sum(x["target_scores"].values()) != 1)) == 0
    ), "Zero or more than one correct answer."

E.g., for the more-than-one-correct-answer filter, is such a case actually impossible on the real test (making it an error in the benchmark files), or could we support questions with more than one correct answer too? One possible relaxation is sketched below.
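
If multi-answer questions did turn out to be legitimate, a minimal sketch of how the filtering and scoring could be relaxed (hypothetical; is_correct is an illustrative helper, and the PR as written keeps the strict single-answer assertion):

    # Hypothetical relaxation: keep any question with at least one gold answer,
    # rather than asserting exactly one.
    dataset = dataset.filter(lambda x: sum(x["target_scores"].values()) >= 1)

    def is_correct(prediction, target_scores):
        # target_scores maps option letters to 0/1, e.g. {"A": 0, "B": 1, "C": 1}.
        # A single-letter prediction is correct if it names any gold option;
        # this reduces to the usual exact match when only one answer is gold.
        return target_scores.get(prediction, 0) == 1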

@ShayekhBinIslam (Author)

Most welcome. Thanks for your prompt feedback.

  1. Yes, we have tested Mistral for this task. The sample result can be found here and sample output here.
  2. Indeed, the ..._maj1 variant is already there: it is called score-first, similar to the gsm8k-cot-self-consistency task (see the config sketch after this list).
  3. a. Inflection has not released the data preprocessing pipeline yet.
    b. It seems there should be one and only one correct answer in Physics GRE tests. Reference
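
For context, a hedged sketch of how the score-first / maj@N pair is typically wired up in a harness task YAML, following the gsm8k-cot-self-consistency pattern (take_first, majority_vote, and regex are existing harness filter functions; the field values below are illustrative, not the PR's actual config):

    # illustrative task-YAML fragment, modeled on gsm8k-cot-self-consistency
    repeats: 8
    filter_list:
      - name: "score-first"      # effectively maj@1: score only the first sample
        filter:
          - function: "regex"
            regex_pattern: "answer is ([A-E])"   # hypothetical pattern
          - function: "take_first"
      - name: "maj@8"            # majority vote over all 8 samples
        filter:
          - function: "regex"
            regex_pattern: "answer is ([A-E])"   # hypothetical pattern
          - function: "majority_vote"
          - function: "take_first"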

@haileyschoelkopf (Contributor) left a comment


Thank you very much!

However, I think the link you shared means that a model attempting to select more than one answer would be penalized, not that multiple answers can't both be counted as correct.

I was thinking, for computational cost reasons, that a separate task variant which does greedy generation and only reports maj@1 would be beneficial. How long (and on what GPU) did it take to run Mistral on these tasks?

@@ -0,0 +1,52 @@
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.

this file should be renamed to default_yaml so that we don't try to register it as its own task!
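
For context, the harness convention is that a shared template file without a .yaml extension is not picked up by the task registry, and variant files pull it in via include. A hedged sketch with hypothetical file names:

    physics_gre/
        _default_yaml            # shared config; no .yaml extension, so not registered as a task
        physics_gre.yaml         # sets the task name and includes the template
        physics_gre_maj8.yaml    # variant overriding repeats / filter_list

    # physics_gre.yaml (illustrative)
    include: _default_yaml
    task: physics_gre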

@lintangsutawika linked an issue Apr 9, 2024 that may be closed by this pull request
@ShayekhBinIslam (Author)

Thanks a lot for getting back.

  1. I am not sure how we can detect a model attempting to select more than one answer using the regex.
  2. The greedy generation is a great suggestion.
  3. It takes about 15 V100 hours to complete all the tasks for Mixtral-8x7B-Instruct-v0.1 4-bit. For Mistral-7B-Instruct-v0.2, it is 6.5 hours.

@StellaAthena (Member)

@haileyschoelkopf what is

I think the link you shared means that a model attempting to select more than one answer would be penalized, not that multiple answers can't both be counted.

based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding it: correct − 0.25 × incorrect is an adjustment common to standardized testing that makes random guessing score 0 in expectation.
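
(For concreteness: with five answer choices, a random guess scores (1/5)(+1) + (4/5)(−0.25) = 1/5 − 1/5 = 0 in expectation.)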

@ShayekhBinIslam (Author)

To make things concrete: let the correct answer be C. Now the model may predict:

  1. A (wrong)
  2. A, B, C (wrong)
  3. C, A, B (wrong)
  4. C (correct)

The evaluator judges cases 1 and 4 as expected.

But even though 3 is wrong, the regex parsing/filtering will extract C as the answer, so the model will be judged correct even though it is not. If this is an issue, it should be noted for any MCQ task; a stricter parser is sketched below.
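
A minimal sketch of a stricter extraction step that would catch case 3 (hypothetical; the task's actual filter keeps only the first regex match):

    import re

    def extract_answer(generation):
        # Collect every distinct option letter the model commits to.
        letters = set(re.findall(r"\b([A-E])\b", generation))
        # Exactly one letter -> that answer; zero or several -> scored as wrong,
        # mirroring the GRE rule that marking more than one answer is incorrect.
        return letters.pop() if len(letters) == 1 else "[invalid]"

    assert extract_answer("The answer is C.") == "C"
    assert extract_answer("C, A, B") == "[invalid]"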

@haileyschoelkopf (Contributor)

based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding. correct - 0.25 [incorrect] is an adjustment common to standardized testing that makes random guessing score 0 in expectation.

No, I am referring to the fact that this text from the link:

Your score will be determined by the number of questions you answer correctly. Questions you answer incorrectly or for which you mark no answer or more than one answer are counted as incorrect. Nothing is subtracted from a score if you answer a question incorrectly. Therefore, to maximize your score it is better for you to guess at an answer than not to respond at all.

Doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since this link doesn't expressly confirm or deny this.

@StellaAthena (Member)

Your score will be determined by the number of questions you answer correctly. Questions you answer incorrectly or for which you mark no answer or more than one answer are counted as incorrect. Nothing is subtracted from a score if you answer a question incorrectly. Therefore, to maximize your score it is better for you to guess at an answer than not to respond at all.

Doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since this link doesn't expressly confirm or deny this.

I don't think so. I think this is just saying that missing, malformed, and incorrect answers are all treated the same way. I don't read this as implying that some questions have multiple correct answers and that in such cases you should only answer with one of them.

Development

Successfully merging this pull request may close these issues.

New Task Request: InflectionAI's Physics GRE