
[evals] Refactor evals package to expose completion_fn. #515

Merged: 23 commits, Apr 11, 2023
Changes shown below are from 1 commit.

Commits (23)
d87a056
[evals] Refactor evals package to expose `completion_fn`.
hwchung27 Mar 29, 2023
d9c1395
Add `record_raw_samples`
hwchung27 Apr 2, 2023
a1c6207
Andrew/evals refactor (#579)
andrew-openai Apr 5, 2023
deb29d3
update manifest and pyproject to support fetching data on pip install…
andrew-openai Apr 5, 2023
9b1c350
we need to still use the interop for string/list[dicts] for modelgrad…
andrew-openai Apr 5, 2023
c470d52
refactor simple evals to not use result.prompt (#593)
andrew-openai Apr 5, 2023
b691cfa
Clean up duplicate recordings
hwchung27 Apr 6, 2023
7266049
Replace ModelSpecs with CompletionFn (#594)
jwang47 Apr 6, 2023
b2a45cf
Add --registry_path CLI arg (#601)
jwang47 Apr 6, 2023
924d2d4
Andrew/langchain llms (#602)
andrew-openai Apr 7, 2023
4401cce
rm sample freeform, some docs (#603)
andrew-openai Apr 7, 2023
013d636
Update completion-fn-protocol.md
andrew-openai Apr 7, 2023
08062bc
some documentation cleanup
joe-at-openai Apr 10, 2023
3367006
some documentation cleanup
joe-at-openai Apr 10, 2023
5e71a76
some documentation cleanup
joe-at-openai Apr 10, 2023
e621b6f
inner monologue example (#610)
andrew-openai Apr 10, 2023
49d17ed
Update README.md
andrew-openai Apr 10, 2023
1bfba77
Update run-evals.md
andrew-openai Apr 10, 2023
b018aff
cleanup
andrew-openai Apr 10, 2023
5222f2c
Merge branch 'main' into evals_refactor_merge_main
andrew-openai Apr 10, 2023
9db703d
get oaieval to run
andrew-openai Apr 10, 2023
02bc2cb
address comments
andrew-openai Apr 11, 2023
50114a5
bump version
andrew-openai Apr 11, 2023
[evals] Refactor evals package to expose completion_fn.
PAIR=jasonwei

Co-authored-by: Jason Wei <[email protected]>
hwchung27 and jasonwei20 committed Mar 29, 2023
commit d87a056e88db85873fa0ec7f50958b798a825795
2 changes: 1 addition & 1 deletion evals/__init__.py
@@ -1,4 +1,4 @@
from .api import check_sampled_text, completion_query, sample_freeform
from .api import check_sampled_text, completion_query, sample_freeform, postprocess_sample_freeform, record_and_check_match
from .base import ModelSpec, ModelSpecs
from .data import get_csv, get_json, get_jsonl, get_jsonls, get_lines, iter_jsonls
from .eval import Eval
80 changes: 70 additions & 10 deletions evals/api.py
@@ -97,6 +97,7 @@ def completion_query(
return result, openai_create_prompt, metadata


# TODO(hwc): remove this
Contributor comment: I believe we concluded we'll keep it but refactor it to use new fns?

def check_sampled_text(
model_spec: ModelSpec,
prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
@@ -123,13 +124,6 @@ def check_sampled_text(
=======
The option that was picked, i.e., matched the completion, or None.
"""
if isinstance(expected, tuple):
expected = list(expected)
elif not isinstance(expected, list):
expected = [expected]
if options is None:
options = expected

result, actual_prompt, metadata = completion_query(
Contributor comment: Do we need to rewrite completion_query with new fn?

prompt=prompt,
temperature=0.0,
@@ -139,6 +133,31 @@ def check_sampled_text(

sampled = choice["text"].strip() if model_spec.strip_completion else choice["text"]

return record_and_check_match(
prompt=actual_prompt,
sampled=sampled,
expected=expected,
metadata=metadata,
separator=separator,
options=options,
)


def record_and_check_match(
prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt],
sampled: str,
expected: Union[str, list[str], tuple[str]],
metadata: dict,
separator: Callable[[str], bool] = None,
options: Optional[list[str]] = None,
):
if isinstance(expected, tuple):
expected = list(expected)
elif not isinstance(expected, list):
expected = [expected]
if options is None:
options = expected

picked = None
for option in options:
if not sampled.startswith(option):
@@ -153,7 +172,7 @@ def check_sampled_text(
break

result = {
"prompt": actual_prompt,
"prompt": prompt,
"sampled": sampled,
"options": options,
"picked": picked,
@@ -175,7 +194,7 @@ def sample_freeform(
top_p: float = 0.9,
max_tokens: int = 512,
stop: Optional[str] = None,
n_samples: int = None,
n_samples: Optional[int] = None,
return_logprobs: bool = False,
**kwargs,
) -> Union[str, list[str], dict]:
@@ -215,10 +234,51 @@
headers={},
**kwargs,
)
return postprocess_sample_freeform(
response,
actual_prompt,
metadata,
model_spec,
n_samples=n_samples,
return_logprobs=return_logprobs,
**kwargs)


def postprocess_sample_freeform(
response: dict,
prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
metadata: dict,
model_spec: ModelSpec,
*,
n_samples: Optional[int] = None,
return_logprobs: bool = False,
**kwargs,
) -> Union[str, list[str], dict]:
"""
Records the sampled response, prompt and metadata, and returns the sampled text.
Typically called after `sample_freeform`.

ARGS
====
`response`: The result of the API call.
`prompt`: See `completion_query`.
`n_samples`: The number of samples to generate (1 if None).
`return_logprobs`: If True, returns the tokens and corresponding logprobs
in addition to the sampled text.
`kwargs`: See `completion_query`.

RETURNS
=======
If `return_logprobs` is True, returns a dict with the sampled text, tokens,
and corresponding logprobs. If `n_samples` is None, the outer list is
removed from all values.
Otherwise, returns the sampled text, or a list of sampled texts if
`n_samples` is not None.
"""
sampled = [choice["text"] for choice in response["choices"]]
if n_samples is None:
sampled = sampled[0]
record_sampling(prompt=actual_prompt, sampled=sampled, metadata=metadata)
record_sampling(prompt=prompt, sampled=sampled, metadata=metadata)

if return_logprobs:
assert not model_spec.is_chat, "logprobs only works for non-chat models"
16 changes: 11 additions & 5 deletions evals/elsuite/basic/fuzzy_match.py
@@ -11,20 +11,26 @@ def __init__(
samples_jsonl: str,
*args,
max_tokens: int = 500,
completion_fn: utils.CompletionFn = evals.completion_query,
**kwargs,
):
super().__init__(model_specs, *args, **kwargs)
self.max_tokens = max_tokens
self.samples_jsonl = samples_jsonl
self._completion_fn = completion_fn

def eval_sample(self, test_sample, rng):
del rng
prompt, correct_answers = test_sample["input"], test_sample["ideal"]
generated_answer = evals.sample_freeform(
self.model_spec,
prompt,
temperature=0.0,
response, actual_prompt, metadata = self._completion_fn(
prompt=prompt,
temperature=0.0, # Q: why are these hardcoded?
max_tokens=16,
model_spec=self.model_spec,
)
generated_answer: str = evals.postprocess_sample_freeform(
response, actual_prompt, metadata, self.model_spec)

matches = [
utils.fuzzy_match(generated_answer, correct_answer)
for correct_answer in correct_answers
@@ -40,7 +46,7 @@ def eval_sample(self, test_sample, rng):
)

def run(self, recorder: RecorderBase):
samples = evals.get_jsonl(self.samples_jsonl)
samples = self.get_samples()
self.eval_all_samples(recorder, samples)

return {
17 changes: 12 additions & 5 deletions evals/elsuite/basic/includes.py
@@ -1,7 +1,7 @@
from typing import Any

import evals
import evals.elsuite.utils
from evals.elsuite import utils
import evals.metrics
import numpy as np

@@ -13,24 +13,31 @@ def __init__(
samples_jsonl: str,
*args,
max_tokens: int = 500,
completion_fn: utils.CompletionFn = evals.completion_query,
**kwargs,
):
super().__init__(model_specs, *args, **kwargs)
self.max_tokens = max_tokens
self.samples_jsonl = samples_jsonl
self._completion_fn = completion_fn

def eval_sample(self, sample: Any, *_):
sampled = evals.sample_freeform(
self.model_spec, sample["input"], max_tokens=self.max_tokens
response, actual_prompt, metadata = self._completion_fn(
prompt=sample["input"],
max_tokens=self.max_tokens,
model_spec=self.model_spec,
)
sampled: str = evals.postprocess_sample_freeform(
response, actual_prompt, metadata, self.model_spec)

includes_answer = any(
[evals.elsuite.utils.get_answer(sampled, ref) for ref in sample["ideal"]]
[utils.get_answer(sampled, ref) for ref in sample["ideal"]]
)
evals.record.record_metrics(accuracy=float(includes_answer))
return includes_answer

def run(self, recorder):
samples = evals.get_jsonl(self.samples_jsonl)
samples = self.get_samples()
self.eval_all_samples(recorder, samples)
events = recorder.get_scores("accuracy")
return {
20 changes: 18 additions & 2 deletions evals/elsuite/basic/match.py
@@ -2,6 +2,7 @@

import evals
import evals.metrics
from evals.elsuite import utils
from evals.prompt.base import is_chat_prompt


@@ -14,6 +15,7 @@ def __init__(
max_tokens: int = 500,
num_few_shot: int = 0,
few_shot_jsonl: str = None,
completion_fn: utils.CompletionFn = evals.completion_query,
**kwargs,
):
super().__init__(model_specs, *args, **kwargs)
@@ -24,6 +26,7 @@ def __init__(
assert few_shot_jsonl is not None, "few shot requires few shot sample dataset"
self.few_shot_jsonl = few_shot_jsonl
self.few_shot = evals.get_jsonl(self.few_shot_jsonl)
self._completion_fn = completion_fn

def eval_sample(self, sample: Any, *_):
prompt = sample["input"]
@@ -34,10 +37,23 @@ def eval_sample(self, sample: Any, *_):
prompt += s["sample"]
prompt += sample["input"][-1:]

return evals.check_sampled_text(self.model_spec, prompt, expected=sample["ideal"])
# TODO(hwc): is there a case where we want to use `result` other than "choices"?
result, actual_prompt, metadata = self._completion_fn(
prompt=prompt,
temperature=0.0,
model_spec=self.model_spec,
)
choice = result["choices"][0]
sampled = choice["text"].strip() if self.model_spec.strip_completion else choice["text"]
return evals.record_and_check_match(
prompt=actual_prompt,
sampled=sampled,
expected=sample["ideal"],
metadata=metadata
)

def run(self, recorder):
samples = evals.get_jsonl(self.samples_jsonl)
samples = self.get_samples()
self.eval_all_samples(recorder, samples)
events = recorder.get_events("match")
return {
2 changes: 1 addition & 1 deletion evals/elsuite/modelgraded/classify.py
@@ -319,7 +319,7 @@ def eval_sample(self, test_sample: dict, rng: Random) -> None:
return choice

def run(self, recorder):
samples = evals.get_jsonl(self.samples_jsonl)
samples = self.get_samples()

self.eval_all_samples(recorder, samples)
all_sample_metrics = recorder.get_metrics()
13 changes: 11 additions & 2 deletions evals/elsuite/translate.py
@@ -4,6 +4,7 @@

import evals
import evals.metrics
from evals.elsuite import utils
from evals.prompt.base import is_chat_prompt


@@ -16,6 +17,7 @@ def __init__(
max_tokens: int = 500,
num_few_shot: int = 0,
few_shot_jsonl: str = None,
completion_fn: utils.CompletionFn = evals.completion_query,
**kwargs,
):
super().__init__(model_specs, *args, **kwargs)
@@ -29,6 +31,7 @@ def __init__(
self.few_shot = evals.get_jsonl(self.few_shot_jsonl)

self.bleu = BLEU(effective_order=True)
self._completion_fn = completion_fn

def eval_sample(self, sample: Any, *_):
prompt = sample["input"]
@@ -45,7 +48,13 @@ def eval_sample(self, sample: Any, *_):
elif not isinstance(expected, list):
expected = [expected]

sampled = evals.sample_freeform(self.model_spec, prompt, max_tokens=self.max_tokens)
response, actual_prompt, metadata = self._completion_fn(
prompt=prompt,
max_tokens=self.max_tokens,
model_spec=self.model_spec,
)
sampled: str = evals.postprocess_sample_freeform(
response, actual_prompt, metadata, self.model_spec)

score = None
if expected is not None:
@@ -61,7 +70,7 @@ def run(self, recorder):
return match

def run(self, recorder):
samples = evals.get_jsonl(self.samples_jsonl)
samples = self.get_samples()
self.eval_all_samples(recorder, samples)
events = recorder.get_events("match")

37 changes: 37 additions & 0 deletions evals/elsuite/utils.py
@@ -2,10 +2,19 @@
import re
import string
from collections import Counter, defaultdict
from typing import Union
from typing_extensions import Protocol

from evals.api import sample_freeform
from evals.prompt.base import chat_prompt_to_text_prompt, is_chat_prompt

from evals.base import ModelSpec
from evals.prompt.base import (
OpenAICreateChatPrompt,
OpenAICreatePrompt,
Prompt,
)


def get_answer(text, answer_prompt):
idx = text.rfind(answer_prompt)
@@ -135,3 +144,31 @@ def __call__(self, **kwargs):
**self.completion_kwargs,
)
return completion, prompt


class CompletionFn(Protocol):
Contributor comment:
In general I like CompletionFn but there is some organization to be done, for example:

  1. When to use CompletionFn versus completion_query
  2. Should some of the utility functionality like sample_freeform use CompletionFn
  3. Should we separate implementations of CompletionFn from recording/evaluating utilities
  4. Should completion_query be a subclass of CompletionFn (and be renamed to something like openai_completion_query)

Let's discuss?
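
For concreteness, one way to realize item 4 is to wrap the existing completion_query helper in a class that satisfies the CompletionFn protocol. A minimal sketch follows; it is an illustration only, and the class name OpenAICompletionQueryFn is hypothetical, not code from this PR:

# Hypothetical sketch only -- not part of this commit.
from typing import Union

from evals.api import completion_query
from evals.base import ModelSpec
from evals.prompt.base import OpenAICreateChatPrompt, OpenAICreatePrompt, Prompt


class OpenAICompletionQueryFn:
    """Adapts the existing completion_query helper to the CompletionFn protocol."""

    def __call__(
        self,
        model_spec: ModelSpec,
        prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
        **kwargs,
    ) -> tuple[dict, Union[OpenAICreatePrompt, OpenAICreateChatPrompt], dict]:
        # completion_query already returns (response, actual_prompt, metadata),
        # which matches the protocol's return shape.
        return completion_query(model_spec=model_spec, prompt=prompt, **kwargs)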

Contributor comment:
Ideally I'd like to have evals only use CompletionFn as opposed to picking between CompletionFn and completion_query (more accurately openai_completion_query as @andrew-openai pointed out).

Also happy to discuss if needed.

Contributor comment:
One more thing to consider: We need CompletionFn subclasses to probably support both chat and non-chat inputs, which means implementing some generic casting behavior to go from chat to non-chat. I think luckily we have a lot of this already, which is implemented in PromptFn and chat_prompt_to_text_prompt, but just need to add it to CompletionFn
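
A rough illustration of that casting shim (not code from this PR; the helper name is hypothetical), reusing the existing is_chat_prompt and chat_prompt_to_text_prompt utilities from evals.prompt.base:

# Hypothetical sketch -- reuses existing prompt helpers.
from evals.prompt.base import chat_prompt_to_text_prompt, is_chat_prompt


def coerce_prompt_for_completion_model(prompt):
    """Flatten a chat-style prompt into a plain text prompt when the underlying
    CompletionFn targets a non-chat completion model; pass text prompts through."""
    if is_chat_prompt(prompt):
        return chat_prompt_to_text_prompt(prompt)
    return prompt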

Contributor comment (@andrew-openai, Apr 3, 2023):

The refactor needs to extend through the codebase, something like:

.
├── completion_fns
│   ├── __init__.py
│   ├── completion_fn.py (contains CompletionFn protocol)
│   ├── openai_completion_fn.py (contains OpenAICompletionFn implementation)
│   └── ... (other implementations)
├── api.py (updated to use CompletionFn instances)
├── utils.py (updated to use the refactored api.py and CompletionFn implementations)
└── ... (other existing modules)
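
Under that layout, api.py (and eventually the evals themselves) would import implementations from the new package instead of calling the OpenAI query path directly. A short sketch, assuming the tree above sits inside the evals package (the module paths and the no-argument constructor are assumptions):

# Hypothetical sketch of the refactored api.py imports under the proposed layout.
from evals.completion_fns.completion_fn import CompletionFn
from evals.completion_fns.openai_completion_fn import OpenAICompletionFn

# Helpers in api.py would then accept any CompletionFn instance rather than
# hard-coding the OpenAI query path, e.g. with a default such as:
default_completion_fn: CompletionFn = OpenAICompletionFn()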

Contributor comment:
Here is my current proposal for making api.py a bit neater:

  1. Rewrite sample_freeform to support arbitrary CompletionFns.
  2. Rewrite check_sampled_text to use sample_freeform and record_and_check_match. Remove record_sampling call from record_and_check_match. check_sampled_text to be renamed to check_match_sampled_text and moved to elsuite.utils.
  3. completion_query as a subclass of CompletionFn
  4. CompletionFn to be moved to api.py.
  5. evals.Eval has required argument completion_fn


def __call__(
self,
model_spec: ModelSpec,
prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
**kwargs
) -> tuple[dict, Union[OpenAICreatePrompt, OpenAICreateChatPrompt], dict]:
"""
ARGS
====
`model_spec`: `ModelSpec` containing model details to use in the query.
This should be the dict returned by `registry.get_model()`.
If `model_spec` is not provided, we use the default model that was
initialized at the beginning of the run.
`prompt`: Either a `Prompt` object or a raw prompt that will get wrapped in
the appropriate `Prompt` class.
`kwargs`: Other arguments passed to the API.

RETURNS
=======
The result of the API call.
The prompt that was fed into the API call as a str.
A dict containing metadata about the query.
"""
pass
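
Tying the thread together, here is a minimal sketch of item 1 from the proposal above: a sample_freeform variant that accepts an arbitrary CompletionFn and reuses the postprocess_sample_freeform helper introduced in this commit. The function name and the exact keyword arguments forwarded to the completion function are assumptions, not part of this PR:

# Hypothetical sketch -- not part of this commit.
from typing import Optional, Union

import evals
from evals.base import ModelSpec
from evals.elsuite.utils import CompletionFn
from evals.prompt.base import OpenAICreateChatPrompt, OpenAICreatePrompt, Prompt


def sample_freeform_with_fn(
    completion_fn: CompletionFn,
    model_spec: ModelSpec,
    prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
    *,
    temperature: float = 1.0,
    max_tokens: int = 512,
    n_samples: Optional[int] = None,
    return_logprobs: bool = False,
    **kwargs,
) -> Union[str, list[str], dict]:
    # Every CompletionFn returns (response, actual_prompt, metadata).
    response, actual_prompt, metadata = completion_fn(
        model_spec=model_spec,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_tokens,
        **kwargs,
    )
    # Reuse the recording/unpacking logic added in this commit.
    return evals.postprocess_sample_freeform(
        response,
        actual_prompt,
        metadata,
        model_spec,
        n_samples=n_samples,
        return_logprobs=return_logprobs,
    )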