forked from openai/evals
Commit 38eb92c (0 parents) — 69 changed files with 4,565 additions and 0 deletions.
`.gitattributes` (new file, +1 line):

```
evals/registry/data/**/*.jsonl filter=lfs diff=lfs merge=lfs -text
```
`.github/PULL_REQUEST_TEMPLATE.md` (new file, +68 lines):

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines; __failure to follow the guidelines below will result in the PR being closed automatically__. Note that meeting the criteria does not guarantee that the PR will be merged or that GPT-4 access will be granted. 🚨

## Eval details 📑

### Eval name

[Insert Eval name here]

### Eval description

[Insert a short description of what your eval does here]

### What makes this a useful eval?

[Insert why this eval is worth including and any additional context]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [ ] Thematically consistent: we'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval from cases where the model fails to reason about the physical world.
- [ ] A collection of failures where a human can do the task but GPT-4 or GPT-3.5-Turbo cannot.
- [ ] A good signal of what the right behavior is: this means either a correct answer (for `Basic` evals or the `Fact` Model-graded eval) or an exhaustive rubric for evaluating answers (for the `Criteria` Model-graded eval).
- [ ] At least 100 high-quality examples.

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should satisfy the following:
- [ ] Check that your data is in `evals/registry/data/{name}`
- [ ] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [ ] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
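For reference, a registry entry for a hypothetical eval named `arithmetic` might look like the sketch below. The `Match` class path and field names are modeled on the existing basic-eval entries in the registry; treat them as assumptions rather than a definitive schema:

```yaml
# evals/registry/evals/arithmetic.yaml (hypothetical example)
arithmetic:
  id: arithmetic.dev.v0
  description: Evaluates basic arithmetic word problems.
  metrics: [accuracy]

arithmetic.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: arithmetic/samples.jsonl
```

The top-level key names the eval, and the versioned entry points at an eval class plus the JSONL samples file under `evals/registry/data/`.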
## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [ ] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [ ] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions, and thus cannot grant GPT-4 access to everyone who opens a PR. We know this is disappointing, but we hope to set the right expectations before you open this PR.

- [ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee that the PR will be merged or that GPT-4 access will be granted.

### Submit eval

- [ ] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.
`.github/ISSUE_TEMPLATE/bug_report.yml` (new file, +56 lines):

```yaml
name: Bug report
description: Create a report to help us improve
labels: ["bug"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this bug report! If you have questions about using the OpenAI Evals library, please open a [Discussion thread](https://github.com/openai/evals/discussions).
  - type: textarea
    id: what-happened
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is, and any additional context.
      placeholder: Tell us what you see!
    validations:
      required: true
  - type: textarea
    id: repro-steps
    attributes:
      label: To Reproduce
      description: Steps to reproduce the behavior.
      placeholder: |
        1. Fetch a '...'
        2. Update the '....'
        3. See error
    validations:
      required: true
  - type: textarea
    id: code-snippets
    attributes:
      label: Code snippets
      description: If applicable, add code snippets to help explain your problem.
      render: Python
    validations:
      required: false
  - type: input
    id: os
    attributes:
      label: OS
      placeholder: macOS
    validations:
      required: true
  - type: input
    id: language-version
    attributes:
      label: Python version
      placeholder: Python v3.8.0
    validations:
      required: true
  - type: input
    id: lib-version
    attributes:
      label: Library version
      placeholder: openai-evals v0.1.1
    validations:
      required: true
```
`.github/ISSUE_TEMPLATE/config.yml` (new file, +7 lines):

```yaml
blank_issues_enabled: false
contact_links:
  - name: OpenAI support
    url: https://help.openai.com/
    about: |
      Please only file issues here that you believe represent actual bugs or feature requests for the OpenAI Evals library.
      If you're having general trouble with the OpenAI API, ChatGPT, etc., please visit our help center to get support.
```
`.github/ISSUE_TEMPLATE/feature_request.yml` (new file, +20 lines):

```yaml
name: Feature request
description: Suggest an idea for this library
labels: ["feature-request"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this feature request! Please note that we are not able to accommodate all feature requests given our limited bandwidth, but we appreciate you taking the time to share with us how to improve the OpenAI Evals library.
  - type: textarea
    id: feature
    attributes:
      label: Describe the feature or improvement you're requesting
      description: A clear and concise description of what you want to happen.
    validations:
      required: true
  - type: textarea
    id: context
    attributes:
      label: Additional context
      description: Add any other context about the feature request here.
```
`.gitignore` (new file, +2 lines):

```
__pycache__/
evals.egg-info/
```
`.pre-commit-config.yaml` (new file, +29 lines):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 22.8.0
    hooks:
      - id: black
        args: [--line-length=100, --exclude=""]

  # This is not technically always safe, but usually is.
  # Use the comments `# isort: off` and `# isort: on` to disable/re-enable isort.
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: [--line-length=100, --profile=black]

  # This is slightly dangerous because Python imports have side effects,
  # and this tool removes unused imports, which may be providing
  # necessary side effects for the code to run.
  - repo: https://github.com/PyCQA/autoflake
    rev: v1.6.1
    hooks:
      - id: autoflake
        args:
          - "--in-place"
          - "--expand-star-imports"
          - "--remove-duplicate-keys"
          - "--remove-unused-variables"
          - "--remove-all-unused-imports"
        exclude: "evals/__init__.py"
```
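The `# isort: off` / `# isort: on` markers mentioned in the comments above work like this; a minimal illustrative sketch (the vendored-packages directory is hypothetical) where a statement between two imports has a side effect, so the order must not be shuffled:

```python
# isort leaves everything between the markers untouched, which matters
# when an earlier statement has a side effect a later import depends on.
# isort: off
import sys

sys.path.insert(0, "./vendored")  # hypothetical vendored-packages directory
import json  # resolved only after the path tweak above has run
# isort: on

print(json.dumps({"ok": True}))
```

Without the markers, isort would group the two imports together at the top, moving `import json` above the `sys.path` modification.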
`LICENSE` (new file, +21 lines):

```
MIT License

Copyright (c) 2023 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
`MANIFEST.in` (new file, +3 lines):

```
recursive-include evals *.py
recursive-include evals *.yaml
recursive-include evals *.sql
```
`Makefile` (new file, +2 lines):

```make
mypy:
	mypy --config-file=mypy.ini --no-site-packages .
```
`README.md` (new file, +85 lines):
# Evals

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.

You can use Evals to create and run evaluations that:
- use datasets to generate prompts,
- measure the quality of completions provided by an OpenAI model, and
- compare performance across different datasets and models.

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps **in order**:
1. Read through this doc and follow the [setup instructions below](README.md#Setup).
2. Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
3. Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
4. Walk through the process for building an eval: [build-eval.md](docs/build-eval.md).
5. See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).

If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.

____________________

🚨 For a limited time, we will be granting GPT-4 access to those who contribute high-quality evals. Please follow the instructions mentioned above, and note that spam or low-quality submissions will be ignored. ❗️

Access will be granted to the email address associated with an accepted eval. Due to high volume, we are unable to grant access to any email other than the one used for the pull request.

____________________
## Setup

To run evals, you will need to set up and specify your OpenAI API key. If you need to generate an API key, you can do so at [https://platform.openai.com/account/api-keys](https://platform.openai.com/account/api-keys). After you obtain an API key, specify it using the `OPENAI_API_KEY` environment variable. **Please be aware of the [costs](https://openai.com/pricing) associated with using the API when running evals.**
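As a quick sanity check, a script can verify that the environment variable is present before doing any API work. This is a minimal sketch, not part of the evals codebase; the helper name is our own:

```python
import os


def require_api_key() -> str:
    """Fail fast with a helpful message if OPENAI_API_KEY is unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; create a key at "
            "https://platform.openai.com/account/api-keys and export it "
            "in your shell before running evals."
        )
    return key
```

Checking up front produces a clearer error than letting an API call fail deep inside an eval run.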
### Downloading evals

Our evals registry is stored using [Git-LFS](https://git-lfs.com/). Once you have downloaded and installed LFS, you can fetch the evals with:
```sh
git lfs fetch --all
git lfs pull
```

If you just want to fetch data for a select eval, you can achieve this via:
```sh
git lfs fetch --include=evals/registry/data/${your eval}
git lfs pull
```
### Making evals

If you are going to be creating evals, we suggest cloning this repo directly from GitHub and installing the requirements using the following command:

```sh
pip install -e .
```

Using `-e`, changes you make to your eval will be reflected immediately without having to reinstall.
### Running evals

If you don't want to contribute new evals but simply want to run them locally, you can install the evals package via pip:

```sh
pip install evals
```

We provide the option of logging your eval results to a Snowflake database, if you have one or wish to set one up. For this option, you will additionally have to specify the `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USERNAME`, and `SNOWFLAKE_PASSWORD` environment variables.
## FAQ

Do you have any examples of how to build an eval from start to finish?

- Yes! These are in the `examples` folder. We recommend that you also read through [build-eval.md](docs/build-eval.md) to gain a deeper understanding of what is happening in these examples.

Do you have any examples of evals implemented in multiple different ways?

- Yes! In particular, see `evals/registry/evals/coqa.yaml`. We have implemented small subsets of the [CoQA](https://stanfordnlp.github.io/coqa/) dataset for various eval templates to help illustrate the differences.

I changed my data, but this isn't reflected when running my eval. What's going on?

- Your data may have been cached to `/tmp/filecache`. Try removing this cache and rerunning your eval.

There's a lot of code, and I just want to spin up a quick eval. Help? OR,

I am a world-class prompt engineer. I choose not to code. How can I contribute my wisdom?

- If you follow an existing [eval template](docs/eval-templates.md) to build a basic or model-graded eval, you don't need to write any evaluation code at all! Just provide your data in JSON format and specify your eval parameters in YAML. [build-eval.md](docs/build-eval.md) walks you through these steps, and you can supplement the instructions with the Jupyter notebooks in the `examples` folder to get started quickly. Keep in mind, though, that a good eval will inevitably require careful thought and rigorous experimentation!
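The JSON data format referenced in the last answer is JSON Lines: one JSON object per sample, one sample per line. A minimal sketch of producing and reading such a file, assuming the chat-style `input`/`ideal` fields used by the basic eval templates (the eval itself is hypothetical):

```python
import json
import tempfile

# Two samples for a hypothetical trivia eval, each a chat prompt plus
# the ideal answer the model's completion is matched against.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single name."},
            {"role": "user", "content": "Who wrote Hamlet?"},
        ],
        "ideal": "William Shakespeare",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single name."},
            {"role": "user", "content": "Who painted the Mona Lisa?"},
        ],
        "ideal": "Leonardo da Vinci",
    },
]

# Writing JSONL is just one json.dumps per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False
) as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
    path = f.name

# Reading it back is a loop over lines.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
assert loaded == samples
```

In a real submission, the file would live under `evals/registry/data/{name}/` and be referenced from the eval's YAML entry.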
## Disclaimer

By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.
`SECURITY.md` (new file, +4 lines):

# Security Policy

For a more in-depth look at our security policy, please check out our [Coordinated Vulnerability Disclosure Policy](https://openai.com/security/disclosure/#:~:text=Disclosure%20Policy,-Security%20is%20essential&text=OpenAI%27s%20coordinated%20vulnerability%20disclosure%20policy,expect%20from%20us%20in%20return.).

Our PGP key can be located [at this address](https://cdn.openai.com/security.txt).