
How deterministic should evals be? #629

Open
garyrob opened this issue Apr 10, 2023 · 0 comments

garyrob commented Apr 10, 2023

How deterministic should evals be? For example, one area where GPT-4 performs poorly is evaluating whether a given joke is funny.

There are Twitter feeds that consist almost entirely of jokes. Some are quite popular, and the best jokes get many retweets. So the jokes that get the most retweets in such a feed could be assumed to be funnier than the jokes that get the fewest. I have been manually examining such feeds, and I find that this criterion very consistently matches my personal judgment when comparing pairs of jokes where one is among the most-retweeted and the other is among the least-retweeted within a particular feed.

So I am thinking that evals could be useful that ask GPT to judge which joke in a pair is funnier, where the acceptable answer is the one with more retweets. To keep the judgments as objective as possible, each pair would always contain one of the most-retweeted jokes and one of the least-retweeted jokes from the same Twitter account, which would be a popular one, such that retweet counts can be in the thousands.
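To make the proposal concrete, here is a minimal sketch of how such pairwise samples might be constructed. Everything here is hypothetical: the `build_pairs` function, the joke/retweet data structure, and the prompt wording are all assumptions for illustration, not part of the evals framework.

```python
# Hypothetical sketch: build pairwise joke-comparison samples from one
# Twitter account's jokes, labeled by retweet count. All names and data
# shapes here are illustrative assumptions.
import random

def build_pairs(jokes, top_k=3, bottom_k=3):
    """Pair each of the top-k most-retweeted jokes with each of the
    bottom-k least-retweeted jokes from the same account, producing
    samples in a chat-style input/ideal format."""
    ranked = sorted(jokes, key=lambda j: j["retweets"], reverse=True)
    top, bottom = ranked[:top_k], ranked[-bottom_k:]
    samples = []
    for funny in top:
        for unfunny in bottom:
            a, b = funny, unfunny
            # Randomize presentation order so the model cannot exploit
            # a fixed positional pattern.
            if random.random() < 0.5:
                a, b = b, a
            samples.append({
                "input": [
                    {"role": "system",
                     "content": "You will see two jokes. Answer with the "
                                "single letter of the funnier one: A or B."},
                    {"role": "user",
                     "content": f"A: {a['text']}\nB: {b['text']}"},
                ],
                # The "correct" answer is the more-retweeted joke.
                "ideal": "A" if a is funny else "B",
            })
    return samples
```

Restricting pairs to the extremes of a single popular account is the design choice that keeps the label as objective as possible, since within-account retweet counts share an audience and a baseline.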

But this type of eval nevertheless has a clear subjective component, since humor itself has no objective ground truth.

Are such evals acceptable?
