Update PULL_REQUEST_TEMPLATE.md and add eval categories #68

Merged · 3 commits · Mar 14, 2023
4 changes: 2 additions & 2 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -6,7 +6,7 @@ __PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell whether the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind that when we run the eval, if GPT-4 scores higher than 90%, we will likely reject it, since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PRs with ~5-10 examples that we can then run the evals on and share the results with you, so you know how your eval does with GPT-4 before writing all 100 examples.

## Eval details 📑
### Eval name
@@ -29,7 +29,7 @@ Your eval should be:
- [ ] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [ ] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo cannot.
- [ ] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [ ] Include at least 100 high quality examples
- [ ] Include at least 100 high-quality examples (it is okay to contribute only 5-10 meaningful examples first and have us test them with GPT-4 before adding all 100)

If there is anything else that makes your eval worth including, please document it below.

12 changes: 12 additions & 0 deletions docs/build-eval.md
@@ -4,6 +4,18 @@ This document walks through the end-to-end process for building an eval, which i

The steps in this process are building your dataset, registering a new eval with your dataset, and running your eval. Crucially, we assume that you are using an [existing eval template](eval-templates.md) out of the box (if that's not the case, see [this example of building a custom eval](custom-eval.md)). If you are interested in contributing your eval publicly, we also include some criteria at the bottom for what we think makes an interesting eval.

We are looking for evals in the following categories:

- Over-refusals
- Safety
- System message steerability
- In-the-wild hallucinations
- Math / logical / physical reasoning
- Real-world use case (please describe in your PR how this capability would be used in a product)
- Other foundational capability

If you have an eval that falls outside these categories but is still a diverse example, please contribute it!

## Formatting your data

Once you have an eval in mind that you wish to implement, you will need to convert your samples into the right JSON lines (JSONL) format. A JSONL file is a newline-delimited file with one JSON object per line.
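
To make the format concrete, here is a minimal Python sketch, assuming the chat-style `input`/`ideal` sample shape used by the basic eval templates; the exact fields depend on the template you register, so check them against the template docs. It writes two hypothetical samples to `samples.jsonl`, one JSON object per line.

```python
import json

# Two hypothetical samples in the "input"/"ideal" shape used by the basic
# eval templates; confirm the exact schema against the template you choose.
samples = [
    {
        "input": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
    {
        "input": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Name the capital of France."},
        ],
        "ideal": "Paris",
    },
]

# JSONL is just one JSON object per line.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```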