Eval details 📑
Eval name
Pattern identification
Eval description
Given eight examples of inputs and outputs, the model must figure out what the task is. Example:

The pattern here is to return `foo` if the target letter is in the list, and `bar` otherwise. It is the same pattern for all examples. The correct answer for this particular example is `bar`, because `u` is not in the list.
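For concreteness, here is a minimal Python sketch of the hidden rule the model must infer from the in-context examples. The function name and the example letters are illustrative, not part of the eval's actual data:

```python
def apply_pattern(target_letter: str, letters: list[str]) -> str:
    """Return "foo" if the target letter appears in the list, else "bar"."""
    return "foo" if target_letter in letters else "bar"

# Mirrors the example above: "u" is absent from the list, so the answer is "bar".
assert apply_pattern("u", ["a", "g", "t", "o"]) == "bar"
assert apply_pattern("g", ["a", "g", "t", "o"]) == "foo"
```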
What makes this a useful eval?

The biggest failure case of language models so far is reasoning. Reasoning means that the model is not relying on surface-level correlations but performing true symbolic manipulation. Most existing tasks are phrased in natural language, so they do not test whether language models can actually reason. The task above is easy for humans, yet language models still fail at it. This eval tests the model's ability to do true in-context pattern identification.
GPT-4 beats gpt-3.5-turbo in my runs, but it is still not good enough.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should include good signal around what the right behavior is: either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.

If there is anything else that makes your eval worth including, please document it below.
Unique eval value
See the section "What makes this a useful eval?"
Eval structure 🏗️
Your eval should:
- Keep its data in `evals/registry/data/{name}`
- Be registered at `evals/registry/evals/{name}.yaml`
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
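For reference, here is a sketch of what the registry YAML for this eval might look like, assuming the existing `Match` basic eval class. The eval name, version, and file path are illustrative:

```yaml
pattern_identification:
  id: pattern_identification.dev.v0
  description: Infer a hidden foo/bar rule from in-context examples.
  metrics: [accuracy]

pattern_identification.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: pattern_identification/samples.jsonl
```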
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions, and thus will not be able to grant GPT-4 access to everyone who opens a PR. We know this is disappointing, but we hope to set the right expectation before you open this PR.
Submit eval
- I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push.

Failure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval
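A minimal sketch of what one sample in the standard Evals JSONL chat format could look like; the prompt wording, letters, and answer below are illustrative, not the eval's actual data:

```jsonl
{"input": [{"role": "system", "content": "Figure out the pattern from the examples and answer with only foo or bar."}, {"role": "user", "content": "target: u, list: [a, g, t, o]"}], "ideal": "bar"}
```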