Commit dcdd88f: Merge branch 'openai:main' into main

gauravjaincr7 committed Apr 15, 2023
2 parents 12fa536 + a6fe832

Showing 211 changed files with 3,059 additions and 939 deletions.
23 changes: 21 additions & 2 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -2,6 +2,12 @@

🚨 Please make sure your PR follows these guidelines; __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, this does not guarantee that the PR will be merged or that GPT-4 access will be granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that users do not currently have GPT-4 access, so you will not be able to tell whether your eval fails. Please run your eval with GPT-3.5-Turbo, but keep in mind that when we run the eval, if GPT-4 scores higher than 90% on it, we will likely reject the PR, since GPT-4 is already capable of completing the task.

We plan to soon roll out a way for eval submitters to see their eval's performance on GPT-4. Stay tuned! Until then, GPT-4 results will not be visible to you. **Starting April 10, the minimum eval count is 15 samples; we hope this makes it easier to create and contribute evals.**

## Eval details 📑
### Eval name
[Insert Eval name here]
@@ -23,7 +29,7 @@ Your eval should be:
- [ ] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [ ] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo cannot.
- [ ] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- - [ ] Include at least 100 high quality examples
+ - [ ] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please document it below.

@@ -35,7 +41,7 @@ If there is anything else that makes your eval worth including, please document

Your eval should:
- [ ] Check that your data is in `evals/registry/data/{name}`
- - [ ] Check that your yaml is registered at `evals/registry/evals/{name}.jsonl`
+ - [ ] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [ ] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
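
For orientation, a typical submission touches just these two paths (a sketch; `<name>` stands in for your eval's name):

```
evals/registry/data/<name>/samples.jsonl   # your eval samples, tracked with Git LFS
evals/registry/evals/<name>.yaml           # the registry entry that points at the data
```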
@@ -66,3 +72,16 @@ We know that you might be excited to contribute to OpenAI's mission, help improv
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we ask eval submitters to paste as many eval samples as possible (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval
```jsonl
INSERT_EVAL_HERE
```
</details>
12 changes: 12 additions & 0 deletions .github/workflows/parse_yaml.py
@@ -0,0 +1,12 @@
import sys
import yaml

def get_first_key(file_path):
    # Return the first top-level key of the YAML file (the eval's registry name).
    with open(file_path, 'r') as yaml_file:
        content = yaml.safe_load(yaml_file)
    first_key = next(iter(content))
    return first_key

if __name__ == "__main__":
    yaml_file_path = sys.argv[1]
    print(get_first_key(yaml_file_path))
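
For example, pointing the script at a registry file prints the eval's name as used by `oaieval` (the path and output here are illustrative):

```
python .github/workflows/parse_yaml.py evals/registry/evals/test-match.yaml
# prints the first top-level key, e.g. test-match
```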
54 changes: 54 additions & 0 deletions .github/workflows/test_eval.yaml
@@ -0,0 +1,54 @@
name: Run new evals

on:
  pull_request:
    branches:
      - main

jobs:
  check_files:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
          lfs: true

      - name: Install Git LFS
        run: |
          sudo apt-get install git-lfs
          git lfs install

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pyyaml
          pip install -e .

      - name: Get list of new YAML files in evals/registry/evals
        id: get_files
        run: |
          # Use environment files to store the output
          git diff --name-only --diff-filter=A ${{ github.event.pull_request.base.sha }} ${{ github.sha }} | grep '^evals/registry/evals/.*\.yaml$' | xargs > new_files
          echo "new_files=$(cat new_files)" >> $GITHUB_ENV

      - name: Run oaieval command for each new YAML file
        run: |
          files="${{ env.new_files }}"
          if [ -n "$files" ]; then
            for file in $files; do
              echo "Processing $file"
              first_key=$(python .github/workflows/parse_yaml.py $file)
              echo "Eval Name: $first_key"
              oaieval dummy $first_key --max_samples 10
            done
          else
            echo "No new YAML files found in evals/registry/evals"
          fi
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
__pycache__/
evals.egg-info/
.vscode/
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,3 +1,4 @@
recursive-include evals *.py
recursive-include evals *.yaml
recursive-include evals *.sql
recursive-include evals/registry/data *.jsonl
35 changes: 21 additions & 14 deletions README.md
@@ -1,18 +1,23 @@
# Evals

- Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
+ Evals is a framework for evaluating LLMs (large language models) or systems built using LLMs as components. It also includes an open-source registry of challenging evals.

- You can use Evals to create and run evaluations that:
- - use datasets to generate prompts,
- - measure the quality of completions provided by an OpenAI model, and
- - compare performance across different datasets and models.
+ We now support evaluating the behavior of any system including prompt chains or tool-using agents, via the [Completion Function Protocol](docs/completion-fns.md).

- With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps **in order**:
- 1. Read through this doc and follow the [setup instructions below](README.md#Setup).
- 2. Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
- 3. Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
- 4. Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
- 5. See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).
+ With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. An "eval" is a task used to evaluate the quality of a system's behavior. To get started, we recommend that you follow these steps:

To get set up with evals, follow the [setup instructions below](README.md#Setup).

#### Running evals
- Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
- Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).

#### Writing evals
- Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
- See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).

#### Writing CompletionFns
- Write your own completion functions: [completion-fns.md](docs/completion-fns.md)

If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.

@@ -24,7 +29,9 @@ ____________________

## Setup

- To run evals, you will need to set up and specify your OpenAI API key to run evals. If you need to generate an API key, you can do so at [https://platform.openai.com/account/api-keys](https://platform.openai.com/account/api-keys). After you obtain an API key, specify it using the `OPENAI_API_KEY` environment variable. **Please be aware of the [costs](https://openai.com/pricing) associated with using the API when running evals.**
+ To run evals, you will need to set up and specify your OpenAI API key. You can generate one at <https://platform.openai.com/account/api-keys>. After you obtain an API key, specify it using the `OPENAI_API_KEY` environment variable. **Please be aware of the [costs](https://openai.com/pricing) associated with using the API when running evals.**

**Minimum Required Version: Python 3.9**

### Downloading evals

@@ -70,9 +77,9 @@ Do you have any examples of evals implemented in multiple different ways?

- Yes! In particular, see `evals/registry/evals/coqa.yaml`. We have implemented small subsets of the [CoQA](https://stanfordnlp.github.io/coqa/) dataset for various eval templates to help illustrate the differences.

- I changed my data but this isn't reflected when running my eval, what's going on?
+ When I run an eval, it sometimes hangs at the very end (after the final report). What's going on?

- - Your data may have been cached to `/tmp/filecache`. Try removing this cache and rerunning your eval.
+ - This is a known issue, but you should be able to interrupt it safely and the eval should finish immediately after.

There's a lot of code, and I just want to spin up a quick eval. Help? OR,

27 changes: 22 additions & 5 deletions docs/build-eval.md
@@ -4,10 +4,27 @@ This document walks through the end-to-end process for building an eval, which i

The steps in this process are building your dataset, registering a new eval with your dataset, and running your eval. Crucially, we assume that you are using an [existing eval template](eval-templates.md) out of the box (if that's not the case, see [this example of building a custom eval](custom-eval.md)). If you are interested in contributing your eval publicly, we also include some criteria at the bottom for what we think makes an interesting eval.

We are looking for evals in the following categories:

- Over-refusals
- Safety
- System message steerability
- In-the-wild hallucinations
- Math / logical / physical reasoning
- Real-world use case (please describe in your PR how this capability would be used in a product)
- Other foundational capability

If you have an eval that falls outside these categories but is still a diverse example, please contribute it!

## Formatting your data

Once you have an eval in mind that you wish to implement, you will need to convert your samples into the right JSON lines (JSONL) format. A JSONL file is just a JSON file with a unique JSON object per line.

You can use the `openai` CLI (available with [OpenAI-Python](https://github.com/openai/openai-python)) to transform data from some common file types into JSONL:
```
openai tools fine_tunes.prepare_data -f data[.csv, .json, .txt, .xlsx or .tsv]
```

We include some examples of JSONL eval files in [registry/data/README.md](../evals/registry/data/README.md).

Each JSON object will represent one data point in your eval. The keys you need in the JSON object depend on the eval template. All templates expect an `"input"` key, which is the prompt, ideally specified in [chat format](https://platform.openai.com/docs/guides/chat/introduction) (though strings are also supported). We recommend chat format even if you are evaluating non-chat models. If you are evaluating both chat and non-chat models, we handle the conversion between chat-formatted prompts and raw string prompts (see the conversion logic [here](../evals/prompt/base.py)).
@@ -16,7 +33,7 @@ For the basic evals `Match`, `Includes`, and `FuzzyMatch`, the other required ke

We have implemented small subsets of the [CoQA](https://stanfordnlp.github.io/coqa/) dataset for various eval templates to illustrate how the data should be formatted. See [`coqa/match.jsonl`](../evals/registry/data/coqa/match.jsonl) for an example of data that is suitable for the `Match` basic eval template and [`coqa/samples.jsonl`](../evals/registry/data/coqa/samples.jsonl) for data that is suitable for `fact` and `closedqa` model-graded evals. Note that even though these two model-graded evals expect different keys, we can include the superset of keys in our data in order to support both evals.
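
For instance, a single `Match`-style sample could look like the following (one JSON object per line; the content here is illustrative):

```
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"}
```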

- If the dataset file is on your local machine, put the YAML file in `evals/registry/evals/data/<eval_name>/samples.jsonl`. If it is in Cloud Object Storage, we support path-style URLs for the major clouds (for your personal use only, we will not accept PRs with cloud URLs).
+ If the dataset file is on your local machine, put the `jsonl` file in `evals/registry/data/<eval_name>/samples.jsonl`. If it is in Cloud Object Storage, we support path-style URLs for the major clouds (for your personal use only, we will not accept PRs with cloud URLs).

## Registering the eval

@@ -32,7 +49,7 @@ Register the eval by adding a file to `evals/registry/evals/<eval_name>.yaml` us
samples_jsonl: <eval_name>/samples.jsonl
```
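
The hunk above shows only the tail of the registry entry; a complete entry generally has the following shape (a sketch modeled on existing registry files; the class and field values are illustrative):

```
<eval_name>:
  id: <eval_name>.dev.v0
  metrics: [accuracy]

<eval_name>.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: <eval_name>/samples.jsonl
```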

Upon running the eval, the data will be searched for in `evals/registry/data`, e.g. if `test_match/samples.jsonl` is the provided filepath the data is expected to be in `evals/registry/data/test_match/samples.jsonl`.

The naming convention for evals is in the form `<eval_name>.<split>.<version>`.
- `<eval_name>` is the eval name, used to group evals whose scores are comparable.
@@ -43,17 +60,17 @@ In general, running the same eval name against the same model should always give

## Running the eval

- You can now run your eval on your data from the CLI with your choice of model:
+ You can now run your eval on your data from the CLI with your choice of model or completion function:
```
oaieval gpt-3.5-turbo <eval_name>
```
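Since `oaieval` accepts a completion function in place of a model name, the same eval can also be run against, say, one of the reference CompletionFn implementations (the spec name below is illustrative):
```
oaieval langchain/llm/flan-t5-xl <eval_name>
```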
- Congratulations, you have built your eval! Keep iterating on it until you are confident in the results. Remember, if you change the data file, remove `/tmp/filecache` so that the eval is run with your updated data.
+ Congratulations, you have built your eval! Keep iterating on it until you are confident in the results.

## For model-graded evals: a step-by-step workflow

We expect that the existing model-graded evals such as `fact`, `closedqa`, and `battle` will fit many use cases. However, other use cases may benefit from more customization, e.g., a different evaluation prompt. For these, there will be a bit more work involved, but generally still no coding required!

- 1. If you can't use an existing model-graded eval, create a new YAML in `evals/registry/modelgraded` to specify the [parameters](eval-templates.md#parameters-for-model-graded-evals) of your eval. See [`humor.yaml`](../evals/registry/modelgraded/humor.yaml) for an example.
+ 1. If you can't use an existing model-graded eval, create a new YAML, or add a new entry to an existing YAML, in `evals/registry/modelgraded` to specify the [parameters](eval-templates.md#parameters-for-model-graded-evals) of your eval. See [`humor.yaml`](../evals/registry/modelgraded/humor.yaml) for an example.
- Note that, even if you are creating a new YAML, you may find it easiest to copy an existing YAML as a starting point. For example, model-graded evals which check a model completion against a rubric can copy `closedqa.yaml` and just edit the `args`.
2. Next, you will create your dataset and register your eval, as described above. See [`joke_fruits_labeled.jsonl`](../evals/registry/data/test_metaeval/joke_fruits_labeled.jsonl) and [`joke-fruits`](../evals/registry/evals/test-modelgraded.yaml), for example.
- Note that it is recommended to specify `eval_type` at this step, when you register your eval, rather than step 1.
41 changes: 41 additions & 0 deletions docs/completion-fn-protocol.md
@@ -0,0 +1,41 @@
### The Completion Function Protocol

Here are the interfaces needed to implement the completion function protocol. Any implementation of this interface can be used inside `oaieval`.

Reference implementations:
- [OpenAICompletionFn](../evals/completion_fns/openai.py)
- [LangChainLLMCompletionFn](../evals/completion_fns/langchain_llm.py)

#### CompletionFn
Completion functions should implement the `CompletionFn` interface:
```python
from typing import Protocol, Union


class CompletionFn(Protocol):
    def __call__(
        self,
        prompt: Union[str, list[dict[str, str]]],
        **kwargs,
    ) -> "CompletionResult":
        ...
```

We take a `prompt` representing a single sample from an eval. These prompts can be represented as either a text string or a list of messages in [OpenAI Chat format](https://platform.openai.com/docs/guides/chat/introduction). To work with the existing evals, Completion Function implementations would need to handle both types of inputs, but we provide helper functionality to convert Chat formatted messages into a text string if that is the preferred input for your program:
```python
from evals.prompt.base import CompletionPrompt

# chat_prompt: list[dict[str, str]] -> text_prompt: str
text_prompt = CompletionPrompt(chat_prompt).to_formatted_prompt()
```
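
Concretely, the conversion can be exercised like this (the messages are illustrative):

```python
from evals.prompt.base import CompletionPrompt

chat_prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name a primary color."},
]
# Render the chat messages as one text prompt for completion-style programs.
text_prompt = CompletionPrompt(chat_prompt).to_formatted_prompt()
print(text_prompt)
```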

#### CompletionResult
The completion function should return an object implementing the `CompletionResult` interface:
```python
from abc import ABC, abstractmethod


class CompletionResult(ABC):
    @abstractmethod
    def get_completions(self) -> list[str]:
        pass
```
The `get_completions` method returns a list of string completions. Each element should be considered a unique completion (in most cases this will be a list of length 1).
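
Putting the two interfaces together, a minimal completion function might look like the sketch below. It assumes `CompletionResult` is importable from `evals.api`, as in the reference implementations; the echo behavior is purely illustrative:

```python
from typing import Union

from evals.api import CompletionResult
from evals.prompt.base import CompletionPrompt


class EchoCompletionResult(CompletionResult):
    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        # One completion per call; most completion functions return a list of length 1.
        return [self.response]


class EchoCompletionFn:
    """Satisfies the CompletionFn protocol structurally: it just echoes the prompt."""

    def __call__(self, prompt: Union[str, list[dict[str, str]]], **kwargs) -> EchoCompletionResult:
        # Accept both plain-string and chat-formatted prompts.
        if not isinstance(prompt, str):
            prompt = CompletionPrompt(prompt).to_formatted_prompt()
        return EchoCompletionResult(prompt)
```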

#### Using your CompletionFn
This is all that's needed to implement a Completion function that works with our existing Evals, allowing you to more easily evaluate your end-to-end logic on tasks.

See [completion-fns.md](completion-fns.md) to see how to register and use your completion function with `oaieval`.
