[evals] Refactor evals package to expose completion_fn. (openai#515)
PAIR=jasonwei
- Move Evals functionality to use CompletionFns from ModelSpecs.
---------

Co-authored-by: Jason Wei <[email protected]>
Co-authored-by: Andrew Kondrich <[email protected]>
Co-authored-by: Andrew Kondrich <[email protected]>
Co-authored-by: Alvin Wang <[email protected]>
Co-authored-by: joe-at-openai <[email protected]>
6 people committed Apr 11, 2023
1 parent f7ebbe8 commit 64fb72a
Showing 29 changed files with 730 additions and 560 deletions.
5 changes: 2 additions & 3 deletions .github/workflows/test_eval.yaml
@@ -15,7 +15,7 @@ jobs:
        with:
          fetch-depth: 0
          lfs: true

      - name: Install Git LFS
        run: |
          sudo apt-get install git-lfs
@@ -47,8 +47,7 @@ jobs:
              echo "Processing $file"
              first_key=$(python .github/workflows/parse_yaml.py $file)
              echo "Eval Name: $first_key"
              oaieval dummy-chat $first_key --max_samples 10
              oaieval dummy-completion $first_key --max_samples 10
              oaieval dummy $first_key --max_samples 10
            done
          else
            echo "No new YAML files found in evals/registry/evals"
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -1,4 +1,4 @@
recursive-include evals *.py
recursive-include evals *.yaml
recursive-include evals *.sql
recursive-include evals *.jsonl
recursive-include evals/registry/data *.jsonl
31 changes: 18 additions & 13 deletions README.md
@@ -1,18 +1,23 @@
# Evals

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.

You can use Evals to create and run evaluations that:
- use datasets to generate prompts,
- measure the quality of completions provided by an OpenAI model, and
- compare performance across different datasets and models.

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps **in order**:
1. Read through this doc and follow the [setup instructions below](README.md#Setup).
2. Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
3. Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
4. Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
5. See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).
Evals is a framework for evaluating LLMs (large language models) or systems built using LLMs as components. It also includes an open-source registry of challenging evals.

We now support evaluating the behavior of any system including prompt chains or tool-using agents, via the [Completion Function Protocol](docs/completion-fns.md).

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. An "eval" is a task used to evaluate the quality of a system's behavior. To get started, we recommend that you follow these steps:

To get set up with evals, follow the [setup instructions below](README.md#Setup).

#### Running evals
- Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
- Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).

#### Writing evals
- Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
- See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).

#### Writing CompletionFns
- Write your own completion functions: [completion-fns.md](docs/completion-fns.md)

If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.

2 changes: 1 addition & 1 deletion docs/build-eval.md
@@ -60,7 +60,7 @@ In general, running the same eval name against the same model should always give

## Running the eval

You can now run your eval on your data from the CLI with your choice of model:
You can now run your eval on your data from the CLI with your choice of model or completion function:
```
oaieval gpt-3.5-turbo <eval_name>
```
41 changes: 41 additions & 0 deletions docs/completion-fn-protocol.md
@@ -0,0 +1,41 @@
### The Completion Function Protocol

Here are the interfaces needed to implement the completion function protocol. Any implementation of these interfaces can be used inside `oaieval`.

Reference implementations:
- [OpenAICompletionFn](../evals/completion_fns/openai.py)
- [LangChainLLMCompletionFn](../evals/completion_fns/langchain_llm.py)

#### CompletionFn
Completion functions should implement the `CompletionFn` interface:
```python
class CompletionFn(Protocol):
    def __call__(
        self,
        prompt: Union[str, list[dict[str, str]]],
        **kwargs,
    ) -> CompletionResult:
```

We take a `prompt` representing a single sample from an eval. These prompts can be represented as either a text string or a list of messages in [OpenAI Chat format](https://platform.openai.com/docs/guides/chat/introduction). To work with the existing evals, Completion Function implementations need to handle both input types, but we provide helper functionality to convert Chat-formatted messages into a text string if that is the preferred input for your program:
```python
from evals.prompt.base import CompletionPrompt

# chat_prompt: list[dict[str, str]] -> text_prompt: str
text_prompt = CompletionPrompt(chat_prompt).to_formatted_prompt()
```
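
For instance, a Chat-formatted prompt can be flattened before being handed to a text-in/text-out program. A short sketch (the messages are hypothetical; `CompletionPrompt` is the helper shown above):

```python
from evals.prompt.base import CompletionPrompt

# Hypothetical chat-formatted sample in OpenAI Chat format.
chat_prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who played the girl elf in the hobbit?"},
]

# Renders the message list into a single text prompt string.
text_prompt = CompletionPrompt(chat_prompt).to_formatted_prompt()
print(text_prompt)
```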

#### CompletionResult
The completion function should return an object implementing the `CompletionResult` interface:
```python
class CompletionResult(ABC):
    @abstractmethod
    def get_completions(self) -> list[str]:
        pass
```
The `get_completions` method returns a list of string completions. Each element should be considered a unique completion (in most cases this will be a list of length 1).

#### Using your CompletionFn
This is all that's needed to implement a Completion function that works with our existing Evals, allowing you to more easily evaluate your end-to-end logic on tasks.
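
As a minimal end-to-end sketch of both interfaces together (the `Echo*` names are hypothetical, and a real implementation would call a model or larger system instead of echoing the prompt):

```python
from evals.prompt.base import CompletionPrompt


class EchoCompletionResult:
    """Hypothetical CompletionResult wrapping a single completion string."""

    def __init__(self, response: str):
        self.response = response

    def get_completions(self) -> list[str]:
        # One unique completion, hence a list of length 1.
        return [self.response]


class EchoCompletionFn:
    """Hypothetical CompletionFn that echoes the prompt back as its completion."""

    def __call__(self, prompt, **kwargs) -> EchoCompletionResult:
        if not isinstance(prompt, str):
            # Handle Chat-formatted prompts by flattening them to text.
            prompt = CompletionPrompt(prompt).to_formatted_prompt()
        return EchoCompletionResult(prompt)
```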

See [completion-fns.md](completion-fns.md) to see how to register and use your completion function with `oaieval`.
49 changes: 49 additions & 0 deletions docs/completion-fns.md
@@ -0,0 +1,49 @@
# Completion Functions

## What are completion functions
In [run-evals.md](run-evals.md), we learned how to make calls to `oaieval` to run an eval against a completion function. Completion Functions are generalizations of model completions, where a "completion" is some text output that serves as our answer to the prompt. For example, if "Who played the girl elf in the hobbit?" is our prompt, the correct completion is "Evangeline Lilly". While we can simply test a model directly to see if it generates "Evangeline Lilly", we can imagine doing numerous other operations under the hood to improve our ability to answer this question, like giving the model access to a browser to look up the answer before responding. Making it easy to implement these kinds of under-the-hood operations before responding is the motivation behind building Completion Functions.

## How to implement completion functions
A completion function needs to implement some interfaces that make it usable within Evals. At its core, the protocol just standardizes inputs to be a text string or a [Chat conversation](https://platform.openai.com/docs/guides/chat), and outputs to be a list of text strings. Implementing this interface will allow you to run your Completion Function against any eval in Evals.

The exact interfaces needed are described in detail in [completion-fn-protocol.md](completion-fn-protocol.md).

We include some example implementations inside `evals/completion_fns`. For example, the [`LangChainLLMCompletionFn`](../evals/completion_fns/langchain_llm.py) implements a way to generate completions from [LangChain LLMs](https://python.langchain.com/en/latest/modules/models/llms/getting_started.html). We can then use these completion functions with `oaieval`:
```
oaieval langchain/llm/flan-t5-xl test-match
```

## Registering Completion Functions
Once you have written a completion function, you need to make the class visible to the `oaieval` CLI. Similar to how we register our evals, we register Completion Functions inside `evals/registry/completion_fns` as `yaml` files. Here is the registration for our LangChain LLM completion function:
```yaml
langchain/llm/flan-t5-xl:
  class: evals.completion_fns.langchain_llm:LangChainLLMCompletionFn
  args:
    llm: HuggingFaceHub
    llm_kwargs:
      repo_id: google/flan-t5-xl
```
Here is how it breaks down:
- `langchain/llm/flan-t5-xl`: The top-level key that will be used to access this completion function with `oaieval`.
- `class`: The path to your implementation of the completion function protocol. This class needs to be importable within your Python environment.
- `args`: Arguments that are passed to your completion function when it is instantiated.


### Developing Completion Functions outside of Evals
It is possible to register CompletionFunctions without directly modifying the registry or code inside `Evals` by using the `--registry_path` argument. As an example, let's say I want to use `MyCompletionFn` located inside `~/my_project/`:
```
my_project
├── my_completion_fn.py
└── completion_fns
    └── my_completion_fn.yaml
```

If `my_project` is importable within the Python environment (accessible via `PYTHONPATH`), we can structure `my_completion_fn.yaml` as:
```yaml
my_completion_fn:
  class: my_project.my_completion_fn:MyCompletionFn
```
Then, we can make calls to `oaieval` using:
```
oaieval my_completion_fn test-match --registry_path ~/my_project
```
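
For completeness, a hypothetical `my_completion_fn.py` might look like the sketch below; the stand-in body is where your own model, chain, or agent would go:

```python
# my_completion_fn.py -- hypothetical contents for the example above.


class MyCompletionResult:
    def __init__(self, response: str):
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response]


class MyCompletionFn:
    def __call__(self, prompt, **kwargs) -> MyCompletionResult:
        # Stand-in: replace with a call to your own model or system.
        return MyCompletionResult("This is a stand-in completion.")
```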
7 changes: 5 additions & 2 deletions docs/run-evals.md
@@ -4,12 +4,15 @@ We provide two command line interfaces (CLIs): `oaieval` for running a single eval

## Running an eval

When using the `oaieval` command, you will need to provide both the model you wish to evaluate as well as the eval to run. E.g.,
When using the `oaieval` command, you will need to provide the completion function you wish to evaluate as well as the eval to run. E.g.,
```sh
oaieval gpt-3.5-turbo test-match
```
The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`.

In this example, `gpt-3.5-turbo` is the model to evaluate, and `test-match` is the eval to run. The valid model names are those which you have access to via the API. The valid eval names are specified in the YAML files under `evals/registry/evals`, and their corresponding implementations can be found in `evals/elsuite`.
In this example, `gpt-3.5-turbo` is an OpenAI model that we dynamically instantiate as a completion function using `OpenAIChatCompletionFn(model=gpt-3.5-turbo)`. Any implementation of the `CompletionFn` protocol can be run against `oaieval`. By default, we support calling `oaieval` with any model available on the OpenAI API or with CompletionFunctions available in [`evals/registry/completion_fns`](../evals/registry/completion_fns/). We are always interested in adding more completion functions, and we encourage you to implement your own to reflect specific use cases.

More details on `CompletionFn` can be found here: [`completion-fns.md`](completion-fns.md)
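
As a sketch of what `oaieval` does under the hood (the constructor call mirrors the instantiation shown above; the exact keyword arguments are an assumption):

```python
from evals import OpenAIChatCompletionFn

# Hypothetical direct use; oaieval performs this instantiation for you.
completion_fn = OpenAIChatCompletionFn(model="gpt-3.5-turbo")
result = completion_fn(
    prompt=[{"role": "user", "content": "Who played the girl elf in the hobbit?"}]
)
print(result.get_completions())  # a list with one completion string
```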

These CLIs can accept various flags to modify their default behavior. For example:
- If you wish to log to a Snowflake database (which you have already set up as described in the [README](../README.md)), add `--no-local-run`.
8 changes: 6 additions & 2 deletions evals/__init__.py
@@ -1,4 +1,8 @@
from .api import check_sampled_text, completion_query, sample_freeform
from .base import ModelSpec, ModelSpecs
from .api import CompletionFn, CompletionResult, DummyCompletionFn, record_and_check_match
from .completion_fns.openai import (
    OpenAIChatCompletionFn,
    OpenAICompletionFn,
    OpenAICompletionResult,
)
from .data import get_csv, get_json, get_jsonl, get_jsonls, get_lines, iter_jsonls
from .eval import Eval
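
With these re-exports (assuming the import list above is complete), downstream code can pull the completion-function primitives straight from the package root. A small sketch, where the zero-argument `DummyCompletionFn()` construction is an assumption:

```python
from evals import CompletionFn, CompletionResult, DummyCompletionFn

# DummyCompletionFn backs the `dummy` target used in the CI workflow above.
fn: CompletionFn = DummyCompletionFn()
result: CompletionResult = fn("Hello!")
print(result.get_completions())
```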
