[evals] Refactor evals package to expose completion_fn. (openai#515)

PAIR=jasonwei - Move Evals functionality to use CompletionFns from ModelSpecs.

---------

Co-authored-by: Jason Wei <[email protected]>
Co-authored-by: Andrew Kondrich <[email protected]>
Co-authored-by: Andrew Kondrich <[email protected]>
Co-authored-by: Alvin Wang <[email protected]>
Co-authored-by: joe-at-openai <[email protected]>
LeonSun128 · Apr 11, 2023 · 64fb72a
1 parent f7ebbe8
Showing 29 changed files with 730 additions and 560 deletions.
diff --git a/.github/workflows/test_eval.yaml b/.github/workflows/test_eval.yaml
@@ -15,7 +15,7 @@ jobs:
  with:
  fetch-depth: 0
  lfs: true
- 
+
  - name: Install Git LFS
  run: |
  sudo apt-get install git-lfs
@@ -47,8 +47,7 @@ jobs:
  echo "Processing $file"
  first_key=$(python .github/workflows/parse_yaml.py $file)
  echo "Eval Name: $first_key"
- oaieval dummy-chat $first_key --max_samples 10
- oaieval dummy-completion $first_key --max_samples 10
+ oaieval dummy $first_key --max_samples 10
  done
  else
  echo "No new YAML files found in evals/registry/evals"

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,4 +1,4 @@
 recursive-include evals *.py
 recursive-include evals *.yaml
 recursive-include evals *.sql
-recursive-include evals *.jsonl
+recursive-include evals/registry/data *.jsonl
diff --git a/README.md b/README.md
@@ -1,18 +1,23 @@
 # Evals
 
-Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
-
-You can use Evals to create and run evaluations that:
-- use datasets to generate prompts,
-- measure the quality of completions provided by an OpenAI model, and
-- compare performance across different datasets and models.
-
-With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps **in order**:
-1. Read through this doc and follow the [setup instructions below](README.md#Setup).
-2. Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
-3. Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
-4. Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
-5. See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).
+Evals is a framework for evaluating LLMs (large language models) or systems built using LLMs as components. It also includes an open-source registry of challenging evals.
+
+We now support evaluating the behavior of any system including prompt chains or tool-using agents, via the [Completion Function Protocol](docs/completion-fns.md).
+
+With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. An "eval" is a task used to evaluate the quality of a system's behavior. To get started, we recommend that you follow these steps:
+
+To get set up with evals, follow the [setup instructions below](README.md#Setup).
+
+#### Running evals
+- Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
+- Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
+
+#### Writing evals
+- Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
+- See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).
+
+#### Writing CompletionFns
+- Write your own completion functions: [completion-fns.md](docs/completion-fns.md)
 
 If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.
 

diff --git a/docs/build-eval.md b/docs/build-eval.md
@@ -60,7 +60,7 @@ In general, running the same eval name against the same model should always give
 
 ## Running the eval
 
-You can now run your eval on your data from the CLI with your choice of model:
+You can now run your eval on your data from the CLI with your choice of model or completion function:
 ```
 oaieval gpt-3.5-turbo <eval_name>
 ```

diff --git a/docs/completion-fn-protocol.md b/docs/completion-fn-protocol.md
@@ -0,0 +1,41 @@
+### The Completion Function Protocol
+
+Here are the interfaces needed to implement the completion function protocol. Any implementation of this interface can be used inside `oaieval`.
+
+Reference implementations:
+- [OpenAICompletionFn](../evals/completion_fns/openai.py)
+- [LangChainLLMCompletionFn](../evals/completion_fns/langchain_llm.py)
+
+#### CompletionFn
+Completion functions should implement the `CompletionFn` interface:
+```python
+class CompletionFn(Protocol):
+    def __call__(
+        self,
+        prompt: Union[str, list[dict[str, str]]],
+        **kwargs,
+    ) -> CompletionResult:
+```
+
+We take a `prompt` representing a single sample from an eval. These prompts can be represented as either a text string or a list of messages in [OpenAI Chat format](https://platform.openai.com/docs/guides/chat/introduction). To work with the existing evals, Completion Function implementations would need to handle both types of inputs, but we provide helper functionality to convert Chat formatted messages into a text string if that is the preferred input for your program:
+```python
+from evals.prompt.base import CompletionPrompt
+
+# chat_prompt: list[dict[str, str]] -> text_prompt: str
+text_prompt = CompletionPrompt(chat_prompt).to_formatted_prompt()
+```
+
+#### CompletionResult
+The completion function should return an object implementing the `CompletionResult` interface:
+```python
+class CompletionResult(ABC):
+    @abstractmethod
+    def get_completions(self) -> list[str]:
+        pass
+```
+The `get_completions` method returns a list of string completions. Each element should be considered a unique completion (in most cases this will be a list of length 1).
+
+#### Using your CompletionFn
+This is all that's needed to implement a Completion function that works with our existing Evals, allowing you to more easily evaluate your end-to-end logic on tasks.
+
+See [completion-fns.md](completion-fns.md) to see how to register and use your completion function with `oaieval`.
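
As a rough illustration of the protocol described in `docs/completion-fn-protocol.md`, here is a minimal sketch of a completion function that accepts both prompt types. It is not part of this commit: `EchoCompletionFn` and `EchoCompletionResult` are hypothetical names, the two interfaces are redefined locally so the snippet runs without the `evals` package, and the chat-to-text flattening is a simple stand-in for `CompletionPrompt(...).to_formatted_prompt()`, whose exact output format is not shown in the diff.

```python
from abc import ABC, abstractmethod
from typing import Protocol, Union


class CompletionResult(ABC):
    """Local stand-in mirroring the interface shown in the doc."""

    @abstractmethod
    def get_completions(self) -> list[str]:
        pass


class CompletionFn(Protocol):
    """Local stand-in mirroring the protocol shown in the doc."""

    def __call__(
        self,
        prompt: Union[str, list[dict[str, str]]],
        **kwargs,
    ) -> CompletionResult:
        ...


class EchoCompletionResult(CompletionResult):
    """Hypothetical result type wrapping a single completion string."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        # One completion per call, so the list has length 1.
        return [self.text]


class EchoCompletionFn:
    """Hypothetical completion function that echoes its prompt back."""

    def __call__(
        self,
        prompt: Union[str, list[dict[str, str]]],
        **kwargs,
    ) -> EchoCompletionResult:
        if isinstance(prompt, str):
            text = prompt
        else:
            # Assumed flattening of chat messages; the real helper may differ.
            text = "\n".join(f"{m['role']}: {m['content']}" for m in prompt)
        return EchoCompletionResult(text)
```

A real implementation would replace the echo with a model call, a prompt chain, or a tool-using agent, but the input and output handling would keep this shape.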
diff --git a/docs/completion-fns.md b/docs/completion-fns.md
@@ -0,0 +1,49 @@
+# Completion Functions
+
+## What are completion functions
+In [run-evals.md](run-evals.md), we learned how to make calls to `oaieval` to run an eval against a completion function. Completion Functions are generalizations of model completions, where a "completion" is some text output that would be our answer to the prompt. For example, if "Who played the girl elf in the hobbit?" is our prompt, the correct completion is "Evangeline Lilly". While we can just test a model directly to see if it generates "Evangeline Lilly", we can imagine doing numerous other operations under the hood to improve our ability to answer this question, like giving the model access to a browser to look up the answer before responding. Making it easy to implement these kinds of under-the-hood operations before responding is the motivation behind building Completion Functions.
+
+## How to implement completion functions
+A completion function needs to implement some interfaces that make it usable within Evals. At its core, it is just standardizing inputs to be a text string or [Chat conversation](https://platform.openai.com/docs/guides/chat), and the output to be a list of text strings. Implementing this interface will allow you to run your Completion Function against any eval in Evals.
+
+The exact interfaces needed are described in detail in [completion-fn-protocol.md](completion-fn-protocol.md).
+
+We include some example implementations inside `evals/completion_fns`. For example, the [`LangChainLLMCompletionFn`](../evals/completion_fns/langchain_llm.py) implements a way to generate completions from [LangChain LLMs](https://python.langchain.com/en/latest/modules/models/llms/getting_started.html). We can then use these completion functions with `oaieval`:
+```
+oaieval langchain/llm/flan-t5-xl test-match
+```
+
+## Registering Completion Functions
+Once you have written a completion function, we need to make the class visible to the `oaieval` CLI. Similar to how we register our evals, we also register Completion Functions inside `evals/registry/completion_fns` as `yaml` files. Here is the registration for our langchain LLM completion function:
+```yaml
+langchain/llm/flan-t5-xl:
+  class: evals.completion_fns.langchain_llm:LangChainLLMCompletionFn
+  args:
+    llm: HuggingFaceHub
+    llm_kwargs:
+      repo_id: google/flan-t5-xl
+```
+Here is how it breaks down:
+- `langchain/llm/flan-t5-xl`: This is the top level key that will be used to access this completion function with `oaieval`.
+- `class`: This is the path to your implementation of the completion function protocol. This class needs to be importable within your python environment.
+- `args`: These are arguments that are passed to your completion function when it is instantiated.
+
+
+### Developing Completion Functions outside of Evals
+It is possible to register CompletionFunctions without directly modifying the registry or code inside `Evals` by using the `--registry_path` argument. As an example, let's say I want to use `MyCompletionFn` located inside `~/my_project/`:
+```
+my_project
+├── my_completion_fn.py
+└── completion_fns
+    └── my_completion_fn.yaml
+```
+
+If `my_project` is importable within the python environment (accessible via PYTHONPATH), we can structure `my_completion_fn.yaml` as:
+```
+my_completion_fn:
+  class: my_project.my_completion_fn:MyCompletionFn
+```
+Then, we can make calls to `oaieval` using:
+```
+oaieval my_completion_fn test-match --registry_path ~/my_project
+```
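
To make the registration format in `docs/completion-fns.md` more concrete, the sketch below shows one way a `class: module:ClassName` entry with optional `args` could be turned into an instance. This is an assumption for illustration only, not the loader used by the `evals` registry; `make_completion_fn` is a hypothetical helper, and the spec is passed in as an already-parsed dictionary rather than read from a YAML file.

```python
import importlib
from typing import Any


def make_completion_fn(spec: dict[str, Any]) -> Any:
    """Hypothetical helper: build an object from a registry-style entry.

    `spec` mirrors one parsed YAML entry, e.g.
    {"class": "my_project.my_completion_fn:MyCompletionFn", "args": {...}}.
    """
    # Split "package.module:ClassName" into its two parts.
    module_name, class_name = spec["class"].split(":")
    # The module must be importable, for example via PYTHONPATH.
    cls = getattr(importlib.import_module(module_name), class_name)
    # A missing or empty `args` key means no constructor arguments.
    return cls(**(spec.get("args") or {}))
```

For example, `make_completion_fn({"class": "my_project.my_completion_fn:MyCompletionFn"})` would import `my_project.my_completion_fn` and call `MyCompletionFn()` with no arguments, which is why the docs require the class to be importable within the Python environment.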
diff --git a/docs/run-evals.md b/docs/run-evals.md
@@ -4,12 +4,15 @@ We provide two command line interfaces (CLIs): `oaieval` for running a single ev
 
 ## Running an eval
 
-When using the `oaieval` command, you will need to provide both the model you wish to evaluate as well as the eval to run. E.g.,
+When using the `oaieval` command, you will need to provide the completion function you wish to evaluate as well as the eval to run. E.g.,
 ```sh
 oaieval gpt-3.5-turbo test-match
 ```
+The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`.
 
-In this example, `gpt-3.5-turbo` is the model to evaluate, and `test-match` is the eval to run. The valid model names are those which you have access to via the API. The valid eval names are specified in the YAML files under `evals/registry/evals`, and their corresponding implementations can be found in `evals/elsuite`.
+In this example, `gpt-3.5-turbo` is an OpenAI model that we dynamically instantiate as a completion function using `OpenAIChatCompletionFn(model="gpt-3.5-turbo")`. Any implementation of the `CompletionFn` protocol can be run against `oaieval`. By default, we support calling `oaieval` with any model available on the OpenAI API or with CompletionFunctions available in [`evals/registry/completion_fns`](../evals/registry/completion_fns/). We are always interested in adding more completion functions and we encourage you to implement your own to reflect specific use cases.
+
+More details on `CompletionFn` can be found here: [`completion-fns.md`](completion-fns.md)
 
 These CLIs can accept various flags to modify their default behavior. For example:
 - If you wish to log to a Snowflake database (which you have already set up as described in the [README](../README.md)), add `--no-local-run`.

diff --git a/evals/__init__.py b/evals/__init__.py
@@ -1,4 +1,8 @@
-from .api import check_sampled_text, completion_query, sample_freeform
-from .base import ModelSpec, ModelSpecs
+from .api import CompletionFn, CompletionResult, DummyCompletionFn, record_and_check_match
+from .completion_fns.openai import (
+    OpenAIChatCompletionFn,
+    OpenAICompletionFn,
+    OpenAICompletionResult,
+)
 from .data import get_csv, get_json, get_jsonl, get_jsonls, get_lines, iter_jsonls
 from .eval import Eval