Physics GRE task added #1655

Open · wants to merge 3 commits into main
Conversation

@ShayekhBinIslam (Author)

This PR adds the Physics GRE dataset released in the Public Inflection Benchmark by @InflectionAI. Please refer here for the details.

It resolves issue #1554.


CLAassistant commented Apr 1, 2024

CLA assistant check
All committers have signed the CLA.

@haileyschoelkopf (Contributor) left a comment


Thanks very much for the PR! This is a great contribution.

  • Have you tested any models, such as Mistral, on this task? If so, do you have any sample outputs I could review?
  • Due to computational cost, it might make sense to add ..._maj1 variants of this task, or to make maj8 an optional variant that is not run by default.
  • Other than that: are all of the following filters used by Inflection in their evaluation of this task?
    # Remove the data points that have images in the input. 100 -> 76
    dataset = dataset.filter(lambda x: x["has_image"] is False)
    # Remove the data points without a ground-truth label. 76 -> 75
    dataset = dataset.filter(lambda x: x["target_scores"] is not None)
    # All questions must have one and only one correct answer.
    assert (
        len(dataset.filter(lambda x: sum(x["target_scores"].values()) != 1)) == 0
    ), "Zero or more than one correct answer."

E.g., for the more-than-one-correct-answer filter, is such a case actually impossible on the real test (making it an error in the benchmark files), or could we support questions with more than one correct answer too? One possible relaxation is sketched below.
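
If multi-answer questions did turn out to be legitimate, a minimal sketch of how the filtering and scoring could be relaxed (hypothetical; is_correct is an illustrative helper, and the PR as written keeps the strict single-answer assertion):

    # Hypothetical relaxation: keep any question with at least one gold answer,
    # rather than asserting exactly one.
    dataset = dataset.filter(lambda x: sum(x["target_scores"].values()) >= 1)

    def is_correct(prediction, target_scores):
        # target_scores maps option letters to 0/1, e.g. {"A": 0, "B": 1, "C": 1}.
        # A single-letter prediction is correct if it names any gold option;
        # this reduces to the usual exact match when only one answer is gold.
        return target_scores.get(prediction, 0) == 1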

@ShayekhBinIslam (Author)

Most welcome. Thanks for your prompt feedback.

  1. Yes, we have tested Mistral for this task. The sample result can be found here and sample output here.
  2. Indeed, the ..._maj1 variant is already there: it is called score-first, similar to the gsm8k-cot-self-consistency task (see the config sketch after this list).
  3. a. Inflection has not released the data preprocessing pipeline yet.
    b. It seems there should be one and only one correct answer in Physics GRE tests. Reference
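
For context, a hedged sketch of how the score-first / maj@N pair is typically wired up in a harness task YAML, following the gsm8k-cot-self-consistency pattern (take_first, majority_vote, and regex are existing harness filter functions; the field values below are illustrative, not the PR's actual config):

    # illustrative task-YAML fragment, modeled on gsm8k-cot-self-consistency
    repeats: 8
    filter_list:
      - name: "score-first"      # effectively maj@1: score only the first sample
        filter:
          - function: "regex"
            regex_pattern: "answer is ([A-E])"   # hypothetical pattern
          - function: "take_first"
      - name: "maj@8"            # majority vote over all 8 samples
        filter:
          - function: "regex"
            regex_pattern: "answer is ([A-E])"   # hypothetical pattern
          - function: "majority_vote"
          - function: "take_first"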

@haileyschoelkopf (Contributor) left a comment


Thank you very much!

However, I think the link you shared means that a model attempting to select more than one answer would be penalized, not that multiple answers can't both be counted as correct.

I was thinking, for computational cost reasons, that a separate task variant which does greedy generation and only reports maj@1 would be beneficial. How long (and on what GPU) did it take to run Mistral on these tasks?

@@ -0,0 +1,52 @@
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.

this file should be renamed to default_yaml so that we don't try to register it as its own task!
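
For context, the harness convention is that a shared template file without a .yaml extension is not picked up by the task registry, and variant files pull it in via include. A hedged sketch with hypothetical file names:

    physics_gre/
        _default_yaml            # shared config; no .yaml extension, so not registered as a task
        physics_gre.yaml         # sets the task name and includes the template
        physics_gre_maj8.yaml    # variant overriding repeats / filter_list

    # physics_gre.yaml (illustrative)
    include: _default_yaml
    task: physics_gre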

@lintangsutawika linked an issue Apr 9, 2024 that may be closed by this pull request
@ShayekhBinIslam (Author)

Thanks a lot for getting back.

  1. I am not sure how we can detect a model attempting to select more than one answer using the regex.
  2. The greedy generation is a great suggestion.
  3. It takes about 15 V100 hours to complete all the tasks for Mixtral-8x7B-Instruct-v0.1 4-bit. For Mistral-7B-Instruct-v0.2, it is 6.5 hours.

@StellaAthena (Member)

@haileyschoelkopf what is

I think the link you shared means that a model attempting to select more than one answer would be penalized, not that multiple answers can't both be counted.

based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding it: correct − 0.25 × incorrect is an adjustment common to standardized testing that makes random guessing score 0 in expectation.
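
(For concreteness: with five answer choices, a random guess scores (1/5)(+1) + (4/5)(−0.25) = 1/5 − 1/5 = 0 in expectation.)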

@ShayekhBinIslam (Author)

To make things concrete: let the correct answer be C. Now the model may predict:

  1. A (wrong)
  2. A, B, C (wrong)
  3. C, A, B (wrong)
  4. C (correct)

The evaluator judges cases 1 and 4 as expected.

But even though 3 is wrong, the regex parsing/filtering will extract C as the answer, so the model will be judged correct even though it is not. If this is an issue, it should be noted for any MCQ task; a stricter parser is sketched below.
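
A minimal sketch of a stricter extraction step that would catch case 3 (hypothetical; the task's actual filter keeps only the first regex match):

    import re

    def extract_answer(generation):
        # Collect every distinct option letter the model commits to.
        letters = set(re.findall(r"\b([A-E])\b", generation))
        # Exactly one letter -> that answer; zero or several -> scored as wrong,
        # mirroring the GRE rule that marking more than one answer is incorrect.
        return letters.pop() if len(letters) == 1 else "[invalid]"

    assert extract_answer("The answer is C.") == "C"
    assert extract_answer("C, A, B") == "[invalid]"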

@haileyschoelkopf (Contributor)

based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding. correct - 0.25 [incorrect] is an adjustment common to standardized testing that makes random guessing score 0 in expectation.

No, I am referring to the fact that this text from the link:

Your score will be determined by the number of questions you answer correctly. Questions you answer incorrectly or for which you mark no answer or more than one answer are counted as incorrect. Nothing is subtracted from a score if you answer a question incorrectly. Therefore, to maximize your score it is better for you to guess at an answer than not to respond at all.

Doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since this link doesn't expressly confirm or deny this.

@StellaAthena (Member)

Your score will be determined by the number of questions you answer correctly. Questions you answer incorrectly or for which you mark no answer or more than one answer are counted as incorrect. Nothing is subtracted from a score if you answer a question incorrectly. Therefore, to maximize your score it is better for you to guess at an answer than not to respond at all.

Doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since this link doesn't expressly confirm or deny this.

I don't think so. I think this is just saying that missing, malformed, and incorrect answers are all treated the same way. I don't read this as implying that some questions have multiple correct answers and that in such cases you should only answer with one of them.

Development

Successfully merging this pull request may close these issues.

New Task Request: InflectionAI's Physics GRE