
Initial Commit
andrew-openai committed Mar 14, 2023
0 parents commit 38eb92c
Showing 69 changed files with 4,565 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
evals/registry/data/**/*.jsonl filter=lfs diff=lfs merge=lfs -text
68 changes: 68 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,68 @@
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines; __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee that the PR will be merged or that GPT-4 access will be granted. 🚨

## Eval details 📑
### Eval name
[Insert Eval name here]

### Eval description

[Insert a short description of what your eval does here]

### What makes this a useful eval?

[Insert why this eval is worth including and any additional context]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should:

- [ ] Be thematically consistent: we'd like to see a number of prompts all demonstrating some particular failure mode. For example, we could create an eval for cases where the model fails to reason about the physical world.
- [ ] Contain failures where a human can do the task, but GPT-4 or GPT-3.5-Turbo cannot.
- [ ] Include good signal about what the right behavior is. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [ ] Include at least 100 high-quality examples (a hypothetical sample data line is sketched below)
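
For illustration, a single sample for a `Basic` eval with a known correct answer might look like the JSONL line below. The `input`/`ideal` field names are an assumption based on the existing `Basic` eval classes and are shown only as a sketch; check the eval class you use for the exact schema it expects.

```json
{"input": [{"role": "system", "content": "Answer with just the final number."}, {"role": "user", "content": "What is 37 * 24?"}], "ideal": "888"}
```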

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should:
- [ ] Check that your data is in `evals/registry/data/{name}`
- [ ] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` (a hypothetical registry entry is sketched below)
- [ ] Ensure you have the right to use the data you submit via this eval
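
For reference, a registry entry is a small YAML file. A minimal, hypothetical sketch for an eval named `my-eval`, assuming the existing `Match` (Basic) eval class, might look like the following; the eval name, id suffix, and samples path are placeholders, and the exact class path and arguments should be checked against an existing entry such as `evals/registry/evals/coqa.yaml`.

```yaml
# hypothetical registry entry; adapt names, class, and paths to your eval
my-eval:
  id: my-eval.dev.v0
my-eval.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: my-eval/samples.jsonl
```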

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [ ] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [ ] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions, and therefore cannot grant GPT-4 access to everyone who opens a PR. We know this is disappointing, but we hope to set the right expectations before you open this PR.

- [ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee that the PR will be merged or that GPT-4 access will be granted.

### Submit eval

- [ ] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.
56 changes: 56 additions & 0 deletions .github/bug_report.yml
@@ -0,0 +1,56 @@
name: Bug report
description: Create a report to help us improve
labels: ["bug"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this bug report! If you have questions about using the OpenAI Evals library, please open a [Discussion thread](https://github.com/openai/evals/discussions).
  - type: textarea
    id: what-happened
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is, and any additional context.
      placeholder: Tell us what you see!
    validations:
      required: true
  - type: textarea
    id: repro-steps
    attributes:
      label: To Reproduce
      description: Steps to reproduce the behavior.
      placeholder: |
        1. Fetch a '...'
        2. Update the '....'
        3. See error
    validations:
      required: true
  - type: textarea
    id: code-snippets
    attributes:
      label: Code snippets
      description: If applicable, add code snippets to help explain your problem.
      render: Python
    validations:
      required: false
  - type: input
    id: os
    attributes:
      label: OS
      placeholder: macOS
    validations:
      required: true
  - type: input
    id: language-version
    attributes:
      label: Python version
      placeholder: Python v3.8.0
    validations:
      required: true
  - type: input
    id: lib-version
    attributes:
      label: Library version
      placeholder: openai-evals v0.1.1
    validations:
      required: true
7 changes: 7 additions & 0 deletions .github/config.yml
@@ -0,0 +1,7 @@
blank_issues_enabled: false
contact_links:
  - name: OpenAI support
    url: https://help.openai.com/
    about: |
      Please only file issues here that you believe represent actual bugs or feature requests for the OpenAI Evals library.
      If you're having general trouble with the OpenAI API, ChatGPT, etc., please visit our help center to get support.
20 changes: 20 additions & 0 deletions .github/feature_request.yml
@@ -0,0 +1,20 @@
name: Feature request
description: Suggest an idea for this library
labels: ["feature-request"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this feature request! Please note, we are not able to accommodate all feature requests given limited bandwidth, but we appreciate you taking the time to share with us how to improve the OpenAI Evals library.
  - type: textarea
    id: feature
    attributes:
      label: Describe the feature or improvement you're requesting
      description: A clear and concise description of what you want to happen.
    validations:
      required: true
  - type: textarea
    id: context
    attributes:
      label: Additional context
      description: Add any other context about the feature request here.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
__pycache__/
evals.egg-info/
29 changes: 29 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,29 @@
repos:
  - repo: https://github.com/psf/black
    rev: 22.8.0
    hooks:
      - id: black
        args: [--line-length=100, --exclude=""]

  # this is not technically always safe but usually is
  # use comments `# isort: off` and `# isort: on` to disable/re-enable isort
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: [--line-length=100, --profile=black]

  # this is slightly dangerous because python imports have side effects
  # and this tool removes unused imports, which may be providing
  # necessary side effects for the code to run
  - repo: https://github.com/PyCQA/autoflake
    rev: v1.6.1
    hooks:
      - id: autoflake
        args:
          - "--in-place"
          - "--expand-star-imports"
          - "--remove-duplicate-keys"
          - "--remove-unused-variables"
          - "--remove-all-unused-imports"
        exclude: "evals/__init__.py"
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
recursive-include evals *.py
recursive-include evals *.yaml
recursive-include evals *.sql
2 changes: 2 additions & 0 deletions Makefile
@@ -0,0 +1,2 @@
mypy:
	mypy --config-file=mypy.ini --no-site-packages .
85 changes: 85 additions & 0 deletions README.md
@@ -0,0 +1,85 @@
# Evals

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.

You can use Evals to create and run evaluations that:
- use datasets to generate prompts,
- measure the quality of completions provided by an OpenAI model, and
- compare performance across different datasets and models.

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps **in order**:
1. Read through this doc and follow the [setup instructions below](README.md#Setup).
2. Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
3. Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
4. Walk through the process for building an eval: [build-eval.md](docs/build-eval.md).
5. See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).

If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.

____________________
🚨 For a limited time, we will be granting GPT-4 access to those who contribute high-quality evals. Please follow the instructions mentioned above and note that spam or low-quality submissions will be ignored❗️

Access will be granted to the email address associated with an accepted Eval. Due to high volume, we are unable to grant access to any email other than the one used for the pull request.
____________________

## Setup

To run evals, you will need to set up and specify your OpenAI API key. If you need to generate an API key, you can do so at [https://platform.openai.com/account/api-keys](https://platform.openai.com/account/api-keys). After you obtain an API key, specify it using the `OPENAI_API_KEY` environment variable. **Please be aware of the [costs](https://openai.com/pricing) associated with using the API when running evals.**
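
For example, in a Unix-like shell this could be (the key shown is a placeholder):

```sh
# make the API key available to the evals framework for this shell session
export OPENAI_API_KEY="sk-..."
```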

### Downloading evals

Our Evals registry is stored using [Git-LFS](https://git-lfs.com/). Once you have downloaded and installed LFS, you can fetch the evals with:
```sh
git lfs fetch --all
git lfs pull
```

You may just want to fetch data for a select eval. You can achieve this via:
```sh
git lfs fetch --include=evals/registry/data/${your eval}
git lfs pull
```

### Making evals

If you are going to be creating evals, we suggest cloning this repo directly from GitHub and installing the requirements using the following command:

```sh
pip install -e .
```

With `-e`, changes you make to your eval will be reflected immediately without having to reinstall.

### Running evals

If you don't want to contribute new evals, but simply want to run them locally, you can install the evals package via pip:

```sh
pip install evals
```
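
[run-evals.md](docs/run-evals.md) covers the details, but as a rough sketch, running one of the existing registry evals against a model looks something like the command below (the eval and model names here are illustrative):

```sh
# run the "test-match" eval from the registry against gpt-3.5-turbo
oaieval gpt-3.5-turbo test-match
```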

We provide the option for you to log your eval results to a Snowflake database, if you have one or wish to set one up. For this option, you will also need to specify the `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USERNAME`, and `SNOWFLAKE_PASSWORD` environment variables.
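
A minimal sketch of that configuration in a Unix-like shell (all values are placeholders):

```sh
export SNOWFLAKE_ACCOUNT="your-account"
export SNOWFLAKE_DATABASE="your-database"
export SNOWFLAKE_USERNAME="your-username"
export SNOWFLAKE_PASSWORD="your-password"
```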

## FAQ

Do you have any examples of how to build an eval from start to finish?

- Yes! These are in the `examples` folder. We recommend that you also read through [build-eval.md](docs/build-eval.md) in order to gain a deeper understanding of what is happening in these examples.

Do you have any examples of evals implemented in multiple different ways?

- Yes! In particular, see `evals/registry/evals/coqa.yaml`. We have implemented small subsets of the [CoQA](https://stanfordnlp.github.io/coqa/) dataset for various eval templates to help illustrate the differences.

I changed my data but this isn't reflected when running my eval, what's going on?

- Your data may have been cached to `/tmp/filecache`. Try removing this cache and rerunning your eval.
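
For example, on a Unix-like system (assuming the default cache location mentioned above):

```sh
# clear the evals file cache, then rerun the eval
rm -rf /tmp/filecache
```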

There's a lot of code, and I just want to spin up a quick eval. Help? OR,

I am a world-class prompt engineer. I choose not to code. How can I contribute my wisdom?

- If you follow an existing [eval template](docs/eval-templates.md) to build a basic or model-graded eval, you don't need to write any evaluation code at all! Just provide your data in JSON format and specify your eval parameters in YAML. [build-eval.md](docs/build-eval.md) walks you through these steps, and you can supplement these instructions with the Jupyter notebooks in the `examples` folder to help you get started quickly. Keep in mind, though, that a good eval will inevitably require careful thought and rigorous experimentation!

## Disclaimer

By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.
4 changes: 4 additions & 0 deletions SECURITY.md
@@ -0,0 +1,4 @@
# Security Policy
For a more in-depth look at our security policy, please check out our [Coordinated Vulnerability Disclosure Policy](https://openai.com/security/disclosure/#:~:text=Disclosure%20Policy,-Security%20is%20essential&text=OpenAI%27s%20coordinated%20vulnerability%20disclosure%20policy,expect%20from%20us%20in%20return.).

Our PGP key can be located [at this address](https://cdn.openai.com/security.txt).