forked from openai/evals
[evals] moved modelgraded specs to registry (openai#392)
- each Eval now keeps track of "registry"
Showing 14 changed files with 181 additions and 158 deletions.
@@ -1,24 +1,25 @@
-prompt: |-
-  You are comparing two responses to the following two instructions.
+battle:
+  prompt: |-
+    You are comparing two responses to the following two instructions.

-  [Instruction 1]
-  {input1}
-  [Response 1]
-  {completion1}
+    [Instruction 1]
+    {input1}
+    [Response 1]
+    {completion1}

-  [Instruction 2]
-  {input2}
-  [Response 2]
-  {completion2}
+    [Instruction 2]
+    {input2}
+    [Response 2]
+    {completion2}

-  Is the first response better than the second? You must provide one answer based on your subjective view.
-choice_strings:
-  - "Yes"
-  - "No"
-choice_scores:
-  "Yes": 1.0
-  "No": 0.0
-input_outputs:
-  input1: completion1
-  input2: completion2
+    Is the first response better than the second? You must provide one answer based on your subjective view.
+  choice_strings:
+    - "Yes"
+    - "No"
+  choice_scores:
+    "Yes": 1.0
+    "No": 0.0
+  input_outputs:
+    input1: completion1
+    input2: completion2
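The diff above nests the spec under a top-level name (`battle`), which is what allows several specs to be collected into one name-keyed mapping. A minimal sketch of that idea follows, assuming the YAML has already been parsed into plain dicts. `build_registry` and `grade` are hypothetical helpers written for illustration, not functions from the evals library, and the prompt text is abbreviated.

```python
# Illustrative only: plain dicts stand in for parsed YAML documents, and
# build_registry/grade are hypothetical names, not part of the evals API.
BATTLE_DOC = {
    "battle": {
        "prompt": (
            "[Instruction 1]\n{input1}\n[Response 1]\n{completion1}\n\n"
            "[Instruction 2]\n{input2}\n[Response 2]\n{completion2}\n\n"
            "Is the first response better than the second?"
        ),
        "choice_strings": ["Yes", "No"],
        "choice_scores": {"Yes": 1.0, "No": 0.0},
        "input_outputs": {"input1": "completion1", "input2": "completion2"},
    }
}


def build_registry(docs):
    """Merge name-keyed spec documents into one registry, rejecting duplicates."""
    registry = {}
    for doc in docs:
        for name, spec in doc.items():
            if name in registry:
                raise ValueError(f"duplicate spec name: {name}")
            registry[name] = spec
    return registry


def grade(spec, choice):
    """Map a grader's choice string to its score; unknown choices raise."""
    if choice not in spec["choice_strings"]:
        raise ValueError(f"unexpected choice: {choice!r}")
    return spec["choice_scores"][choice]


registry = build_registry([BATTLE_DOC])
spec = registry["battle"]
prompt = spec["prompt"].format(
    input1="Say hi", completion1="Hi!",
    input2="Say bye", completion2="Bye!",
)
score = grade(spec, "Yes")
```

With the old flat layout, the spec's name could only come from its file name. With the nested layout, the name travels with the spec itself.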
@@ -1,12 +1,13 @@
-prompt: |-
-  Which of the following {n} texts is best response to the following instruction?
+best:
+  prompt: |-
+    Which of the following {n} texts is best response to the following instruction?

-  Instruction: {input}
+    Instruction: {input}

-  Responses:
-  {completion}
-completion_sample_templates:
-  completion: "{i}. {output}\n"
-choice_strings: from_n
-input_outputs:
-  input: completion
+    Responses:
+    {completion}
+  completion_sample_templates:
+    completion: "{i}. {output}\n"
+  choice_strings: from_n
+  input_outputs:
+    input: completion
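The `best` spec combines a per-sample template (`"{i}. {output}\n"`) with `choice_strings: from_n`. The sketch below shows one plausible way those two fields could fit together. It assumes samples are numbered from 1 and that `from_n` means the labels "1" through "n"; both are assumptions, and `render_samples` and `choices_from_n` are hypothetical helpers rather than evals functions.

```python
# Illustrative only: render_samples and choices_from_n are hypothetical helpers.
PROMPT = (
    "Which of the following {n} texts is best response to the following instruction?\n\n"
    "Instruction: {input}\n\n"
    "Responses:\n{completion}"
)
SAMPLE_TEMPLATE = "{i}. {output}\n"


def render_samples(template, outputs):
    """Number each sampled output (starting at 1, an assumption) and join them."""
    return "".join(
        template.format(i=i, output=out) for i, out in enumerate(outputs, start=1)
    )


def choices_from_n(n):
    """One plausible reading of `choice_strings: from_n`: the labels "1".."n"."""
    return [str(i) for i in range(1, n + 1)]


outputs = ["Paris", "Lyon", "Nice"]
completion = render_samples(SAMPLE_TEMPLATE, outputs)
prompt = PROMPT.format(
    n=len(outputs), input="Name the capital of France.", completion=completion
)
choices = choices_from_n(len(outputs))
```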
@@ -1,26 +1,27 @@
-prompt: |-
-  You are assessing a submitted answer on a given task based on a criterion. Here is the data:
-  [BEGIN DATA]
-  ***
-  [Task]: {input}
-  ***
-  [Submission]: {completion}
-  ***
-  [Criterion]: {criteria}
-  ***
-  [END DATA]
-  Does the submission meet the criterion? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line.
+closedqa:
+  prompt: |-
+    You are assessing a submitted answer on a given task based on a criterion. Here is the data:
+    [BEGIN DATA]
+    ***
+    [Task]: {input}
+    ***
+    [Submission]: {completion}
+    ***
+    [Criterion]: {criteria}
+    ***
+    [END DATA]
+    Does the submission meet the criterion? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line.

-  Reasoning:
-eval_type: cot_classify
-choice_scores:
-  "Y": 1.0
-  "N": 0.0
-choice_strings: 'YN'
-args:
-  criteria:
-    relevance: "relevance: Is the submission referring to a real quote from the text?"
-    conciseness: "conciseness: Is the answer concise and to the point?"
-    correct: "correctness: Is the answer correct?"
-input_outputs:
-  input: "completion"
+    Reasoning:
+  eval_type: cot_classify
+  choice_scores:
+    "Y": 1.0
+    "N": 0.0
+  choice_strings: 'YN'
+  args:
+    criteria:
+      relevance: "relevance: Is the submission referring to a real quote from the text?"
+      conciseness: "conciseness: Is the answer concise and to the point?"
+      correct: "correctness: Is the answer correct?"
+  input_outputs:
+    input: "completion"
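The `closedqa` prompt asks the grader to reason first, print "Y" or "N" on its own line, and then repeat the letter on the final line. Below is a minimal sketch of how such a chain-of-thought reply could be reduced to a score: it scans lines from the end for one that is exactly a choice string. `parse_final_choice` is a hypothetical helper, and the parsing that evals applies for `eval_type: cot_classify` may differ.

```python
# Illustrative only: parse_final_choice is a hypothetical helper, and this
# is one possible parsing strategy, not necessarily the one evals uses.
CHOICE_STRINGS = list("YN")  # the spec writes this compactly as 'YN'
CHOICE_SCORES = {"Y": 1.0, "N": 0.0}


def parse_final_choice(completion, choice_strings):
    """Return the last line that is exactly one of the choice strings, else None."""
    for line in reversed(completion.strip().splitlines()):
        token = line.strip()
        if token in choice_strings:
            return token
    return None


sample = "The answer quotes the passage directly and is brief.\nY\n\nY"
choice = parse_final_choice(sample, CHOICE_STRINGS)
score = CHOICE_SCORES.get(choice)  # None when no valid choice was found
```

Scanning from the end favors the repeated final letter, so a stray "Y" or "N" inside the reasoning text does not decide the score unless it sits alone on a line.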
@@ -1,14 +1,15 @@
-prompt: |-
-  Are the following {n} texts diverse?
+diversity:
+  prompt: |-
+    Are the following {n} texts diverse?

-  {completion}
-completion_sample_templates:
-  completion: "{i}. {output}\n"
-choice_scores:
-  "Yes": 1.0
-  "No": 0.0
-choice_strings:
-  - "Yes"
-  - "No"
-input_outputs:
-  input: completion
+    {completion}
+  completion_sample_templates:
+    completion: "{i}. {output}\n"
+  choice_scores:
+    "Yes": 1.0
+    "No": 0.0
+  choice_strings:
+    - "Yes"
+    - "No"
+  input_outputs:
+    input: completion
@@ -1,22 +1,23 @@
-prompt: |-
-  You are comparing a submitted answer to an expert answer on a given question. Here is the data:
-  [BEGIN DATA]
-  ************
-  [Question]: {input}
-  ************
-  [Expert]: {ideal}
-  ************
-  [Submission]: {completion}
-  ************
-  [END DATA]
+fact:
+  prompt: |-
+    You are comparing a submitted answer to an expert answer on a given question. Here is the data:
+    [BEGIN DATA]
+    ************
+    [Question]: {input}
+    ************
+    [Expert]: {ideal}
+    ************
+    [Submission]: {completion}
+    ************
+    [END DATA]

-  Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
-  The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
-  (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
-  (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
-  (C) The submitted answer contains all the same details as the expert answer.
-  (D) There is a disagreement between the submitted answer and the expert answer.
-  (E) The answers differ, but these differences don't matter from the perspective of factuality.
-choice_strings: ABCDE
-input_outputs:
-  input: completion
+    Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
+    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
+    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
+    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
+    (C) The submitted answer contains all the same details as the expert answer.
+    (D) There is a disagreement between the submitted answer and the expert answer.
+    (E) The answers differ, but these differences don't matter from the perspective of factuality.
+  choice_strings: ABCDE
+  input_outputs:
+    input: completion
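Unlike the Yes/No specs, `fact` lists `choice_strings: ABCDE` with no `choice_scores`, so its results read more naturally as a distribution over the five cases than as a single number. The sketch below tallies grader replies on that assumption. `extract_choice` and `tally` are hypothetical helpers, and accepting either a bare letter or a parenthesized one such as "(B)" is an illustrative choice, not documented evals behavior.

```python
import re

# Illustrative only: extract_choice and tally are hypothetical helpers.
CHOICES = list("ABCDE")


def extract_choice(reply):
    """Accept a bare letter ("C") or a parenthesized one ("(C) ..."); else None."""
    stripped = reply.strip()
    if stripped in CHOICES:
        return stripped
    match = re.search(r"\(([A-E])\)", stripped)
    return match.group(1) if match else None


def tally(replies):
    """Count how often each choice appears, plus replies with no valid choice."""
    counts = {c: 0 for c in CHOICES}
    invalid = 0
    for reply in replies:
        choice = extract_choice(reply)
        if choice is None:
            invalid += 1
        else:
            counts[choice] += 1
    return counts, invalid


counts, invalid = tally(["A", "(D) There is a disagreement", "A", "unsure"])
```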