[evals] Refactor evals package to expose completion_fn. #515

Merged on Apr 11, 2023 (23 commits; changes shown from 17 commits)

Commits
- d87a056: [evals] Refactor evals package to expose `completion_fn`. (hwchung27, Mar 29, 2023)
- d9c1395: Add `record_raw_samples` (hwchung27, Apr 2, 2023)
- a1c6207: Andrew/evals refactor (#579) (andrew-openai, Apr 5, 2023)
- deb29d3: update manifest and pyproject to support fetching data on pip install… (andrew-openai, Apr 5, 2023)
- 9b1c350: we need to still use the interop for string/list[dicts] for modelgrad… (andrew-openai, Apr 5, 2023)
- c470d52: refactor simple evals to not use result.prompt (#593) (andrew-openai, Apr 5, 2023)
- b691cfa: Clean up duplicate recordings (hwchung27, Apr 6, 2023)
- 7266049: Replace ModelSpecs with CompletionFn (#594) (jwang47, Apr 6, 2023)
- b2a45cf: Add --registry_path CLI arg (#601) (jwang47, Apr 6, 2023)
- 924d2d4: Andrew/langchain llms (#602) (andrew-openai, Apr 7, 2023)
- 4401cce: rm sample freeform, some docs (#603) (andrew-openai, Apr 7, 2023)
- 013d636: Update completion-fn-protocol.md (andrew-openai, Apr 7, 2023)
- 08062bc: some documentation cleanup (joe-at-openai, Apr 10, 2023)
- 3367006: some documentation cleanup (joe-at-openai, Apr 10, 2023)
- 5e71a76: some documentation cleanup (joe-at-openai, Apr 10, 2023)
- e621b6f: inner monologue example (#610) (andrew-openai, Apr 10, 2023)
- 49d17ed: Update README.md (andrew-openai, Apr 10, 2023)
- 1bfba77: Update run-evals.md (andrew-openai, Apr 10, 2023)
- b018aff: cleanup (andrew-openai, Apr 10, 2023)
- 5222f2c: Merge branch 'main' into evals_refactor_merge_main (andrew-openai, Apr 10, 2023)
- 9db703d: get oaieval to run (andrew-openai, Apr 10, 2023)
- 02bc2cb: address comments (andrew-openai, Apr 11, 2023)
- 50114a5: bump version (andrew-openai, Apr 11, 2023)

Files changed
5 changes: 2 additions & 3 deletions .github/workflows/test_eval.yaml
@@ -15,7 +15,7 @@ jobs:
with:
fetch-depth: 0
lfs: true

- name: Install Git LFS
run: |
sudo apt-get install git-lfs
@@ -47,8 +47,7 @@ jobs:
echo "Processing $file"
first_key=$(python .github/workflows/parse_yaml.py $file)
echo "Eval Name: $first_key"
oaieval dummy-chat $first_key --max_samples 10
oaieval dummy-completion $first_key --max_samples 10
oaieval dummy $first_key --max_samples 10
done
else
echo "No new YAML files found in evals/registry/evals"
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,3 +1,4 @@
recursive-include evals *.py
recursive-include evals *.yaml
recursive-include evals *.sql
recursive-include evals/registry/data *.jsonl
31 changes: 18 additions & 13 deletions README.md
@@ -1,18 +1,23 @@
# Evals

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.

You can use Evals to create and run evaluations that:
- use datasets to generate prompts,
- measure the quality of completions provided by an OpenAI model, and
- compare performance across different datasets and models.

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps **in order**:
1. Read through this doc and follow the [setup instructions below](README.md#Setup).
2. Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
3. Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
4. Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
5. See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).
Evals is a framework for evaluating LLMs (large language models) or systems built using LLMs as components. It also includes an open-source registry of challenging evals.

We now support evaluating the behavior of any system including prompt chains or tool-using agents, via the [Completion Function Protocol](docs/completion-fns.md).

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. An "eval" is a task used to evaluate the quality of a system's behavior. To get started, we recommend that you follow these steps:

To get set up with evals, follow the [setup instructions below](README.md#Setup).

#### Running evals
- Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
- Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).

#### Writing evals
- Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
- See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).

#### Writing CompletionFns
- Write your own completion functions: [completion-fns.md](docs/completion-fns.md)

If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.

2 changes: 1 addition & 1 deletion docs/build-eval.md
@@ -55,7 +55,7 @@ In general, running the same eval name against the same model should always give

## Running the eval

You can now run your eval on your data from the CLI with your choice of model:
You can now run your eval on your data from the CLI with your choice of model or completion function:
```
oaieval gpt-3.5-turbo <eval_name>
```
41 changes: 41 additions & 0 deletions docs/completion-fn-protocol.md
@@ -0,0 +1,41 @@
### The Completion Function Protocol

Here are the interfaces needed to implement the completion function protocol. Any implementation of this interface can be used inside `oaieval`.

Reference implementations:
- [OpenAICompletionFn](../evals/completion_fns/openai.py)
- [LangChainLLMCompletionFn](../evals/completion_fns/langchain_llm.py)

#### CompletionFn
Completion functions should implement the `CompletionFn` interface:
```python
class CompletionFn(Protocol):
    def __call__(
        self,
        prompt: Union[str, list[dict[str, str]]],
        **kwargs,
    ) -> CompletionResult:
```

We take a `prompt` representing a single sample from an eval. These prompts can be represented as either a text string or a list of messages in [OpenAI Chat format](https://platform.openai.com/docs/guides/chat/introduction). To work with the existing evals, Completion Function implementations would need to handle both types of inputs, but we provide helper functionality to convert Chat formatted messages into a text string if that is the preferred input for your program:
```python
from evals.prompt.base import CompletionPrompt

# chat_prompt: list[dict[str, str]] -> text_prompt: str
text_prompt = CompletionPrompt(chat_prompt).to_formatted_prompt()
```

#### CompletionResult
The completion function should return an object implementing the `CompletionResult` interface:
```python
class CompletionResult(ABC):
    @abstractmethod
    def get_completions(self) -> list[str]:
        pass
```
The `get_completions` method returns a list of string completions. Each element should be considered a unique completion (in most cases this will be a list of length 1).
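
For illustration only (not part of the files in this PR), here is a minimal sketch that ties the two interfaces together. `EchoCompletionFn` and `EchoCompletionResult` are hypothetical names; the sketch assumes only the `CompletionFn`/`CompletionResult` interfaces above and the `CompletionPrompt` helper.
```python
from typing import Union

from evals.api import CompletionResult
from evals.prompt.base import CompletionPrompt


class EchoCompletionResult(CompletionResult):
    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        # One completion per call, which is the common case.
        return [self.response]


class EchoCompletionFn:
    # Satisfies the CompletionFn protocol by matching its __call__ signature.
    def __call__(
        self, prompt: Union[str, list[dict[str, str]]], **kwargs
    ) -> EchoCompletionResult:
        if not isinstance(prompt, str):
            # Convert chat-formatted messages into a single text prompt.
            prompt = CompletionPrompt(prompt).to_formatted_prompt()
        # A real implementation would call a model or tool chain here;
        # this sketch just echoes the prompt back.
        return EchoCompletionResult(prompt)
```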

#### Using your CompletionFn
This is all that's needed to implement a Completion function that works with our existing Evals, allowing you to more easily evaluate your end-to-end logic on tasks.

See [completion-fns.md](completion-fns.md) to see how to register and use your completion function with `oaieval`.
49 changes: 49 additions & 0 deletions docs/completion-fns.md
@@ -0,0 +1,49 @@
# Completion Functions

## What are completion functions
In [run-evals.md](run-evals.md), we learned how to make calls to `oaieval` to run an eval against a completion function. Completion Functions are generalizations of model completions, where a "completion" is some text output that serves as the answer to a prompt. For example, if "Who played the girl elf in The Hobbit?" is our prompt, the correct completion is "Evangeline Lilly". While we could simply test a model directly to see whether it generates "Evangeline Lilly", we can imagine performing numerous other operations under the hood to improve our ability to answer this question, such as giving the model access to a browser to look up the answer before responding. Making it easy to implement these kinds of under-the-hood operations before responding is the motivation behind building Completion Functions.

## How to implement completion functions
A completion function needs to implement some interfaces that make it usable within Evals. At its core, the protocol standardizes the input to be either a text string or a [Chat conversation](https://platform.openai.com/docs/guides/chat), and the output to be a list of text strings. Implementing this interface allows you to run your Completion Function against any eval in Evals.

The exact interfaces needed are described in detail in [completion-fn-protocol.md](completion-fn-protocol.md).

We include some example implementations inside `evals/completion_fns`. For example, the [`LangChainLLMCompletionFn`](../evals/completion_fns/langchain_llm.py) implements a way to generate completions from [LangChain LLMs](https://python.langchain.com/en/latest/modules/models/llms/getting_started.html). We can then use these completion functions with `oaieval`:
```
oaieval langchain/llm/flan-t5-xl test-match
```

## Registering Completion Functions
Once you have written a completion function, you need to make the class visible to the `oaieval` CLI. Similar to how we register our evals, we register Completion Functions inside `evals/registry/completion_fns` as `yaml` files. Here is the registration for our LangChain LLM completion function:
```yaml
langchain/llm/flan-t5-xl:
  class: evals.completion_fns.langchain_llm:LangChainLLMCompletionFn
  args:
    llm: HuggingFaceHub
    llm_kwargs:
      repo_id: google/flan-t5-xl
```
Here is how it breaks down:
- `langchain/llm/flan-t5-xl`: The top-level key that will be used to access this completion function with `oaieval`.
- `class`: The path to your implementation of the completion function protocol. This class needs to be importable within your Python environment.
- `args`: Arguments that are passed to your completion function when it is instantiated.
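
For example, with the registration above, `oaieval` would instantiate the completion function roughly as `LangChainLLMCompletionFn(llm="HuggingFaceHub", llm_kwargs={"repo_id": "google/flan-t5-xl"})` (an illustrative paraphrase of how `args` map to constructor arguments, not a literal excerpt from the code).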


### Developing Completion Functions outside of Evals
It is possible to register CompletionFunctions without directly modifying the registry or code inside `Evals` by using the `--registry_path` argument. As an example, let's say I want to use `MyCompletionFn` located inside `~/my_project/`:
```
my_project
├── my_completion_fn.py
└── completion_fns
    └── my_completion_fn.yaml
```

If `my_project` is importable within the Python environment (i.e., accessible via `PYTHONPATH`), we can structure `my_completion_fn.yaml` as:
```
my_completion_fn:
  class: my_project.my_completion_fn:MyCompletionFn
```
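
For completeness, here is a hypothetical sketch of what `my_completion_fn.py` could contain; the class names and return value are made up, and only the `CompletionResult` interface from this PR is assumed:
```python
# my_completion_fn.py (hypothetical sketch)
from evals.api import CompletionResult


class MyCompletionResult(CompletionResult):
    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response]


class MyCompletionFn:
    def __call__(self, prompt, **kwargs) -> MyCompletionResult:
        # Your custom logic (prompt chain, tool use, model call, etc.) goes here.
        return MyCompletionResult("Hello from MyCompletionFn!")
```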
Then, we can make calls to `oaieval` using:
```
oaieval my_completion_fn test-match --registry_path ~/my_project
```
7 changes: 5 additions & 2 deletions docs/run-evals.md
@@ -4,12 +4,15 @@ We provide two command line interfaces (CLIs): `oaieval` for running a single eval

## Running an eval

When using the `oaieval` command, you will need to provide both the model you wish to evaluate as well as the eval to run. E.g.,
When using the `oaieval` command, you will need to provide the completion function you wish to evaluate as well as the eval to run. E.g.,
```sh
oaieval gpt-3.5-turbo test-match
```
The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`.

In this example, `gpt-3.5-turbo` is the model to evaluate, and `test-match` is the eval to run. The valid model names are those which you have access to via the API. The valid eval names are specified in the YAML files under `evals/registry/evals`, and their corresponding implementations can be found in `evals/elsuite`.
In this example, `gpt-3.5-turbo` is an OpenAI model that we dynamically instantiate as a completion function using `OpenAIChatCompletionFn(model="gpt-3.5-turbo")`. Any implementation of the `CompletionFn` protocol can be run against `oaieval`. By default, we support calling `oaieval` with any model available on the OpenAI API or with CompletionFunctions available in [`evals/registry/completion_fns`](../evals/registry/completion_fns/).
More details on `CompletionFn` can be found here: [`completion-fns.md`](completion-fns.md)
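
As a rough, hypothetical sketch of what this instantiation looks like programmatically (assuming only the constructor shown above and the call signature from [completion-fn-protocol.md](completion-fn-protocol.md); not an exact transcript of `oaieval` internals):
```python
from evals.completion_fns.openai import OpenAIChatCompletionFn

# Roughly what `oaieval gpt-3.5-turbo test-match` sets up on the completion side.
# Requires OPENAI_API_KEY to be set when the function is actually called.
completion_fn = OpenAIChatCompletionFn(model="gpt-3.5-turbo")
result = completion_fn(prompt=[{"role": "user", "content": "Say hello."}])
print(result.get_completions())  # a list of strings, typically of length 1
```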

These CLIs can accept various flags to modify their default behavior. For example:
- If you wish to log to a Snowflake database (which you have already set up as described in the [README](../README.md)), add `--no-local-run`.
8 changes: 6 additions & 2 deletions evals/__init__.py
@@ -1,4 +1,8 @@
from .api import check_sampled_text, completion_query, sample_freeform
from .base import ModelSpec, ModelSpecs
from .api import CompletionFn, CompletionResult, DummyCompletionFn, record_and_check_match
from .completion_fns.openai import (
    OpenAIChatCompletionFn,
    OpenAICompletionFn,
    OpenAICompletionResult,
)
from .data import get_csv, get_json, get_jsonl, get_jsonls, get_lines, iter_jsonls
from .eval import Eval