How deterministic should evals be? #629

garyrob · 2023-04-10T23:36:58Z

How deterministic should evals be? For example, one area of poor performance of GPT-4 is in evaluating whether a given joke is funny.

There are Twitter feeds that consist almost entirely of jokes. Some are quite popular, and the best jokes get many retweets. So, the jokes that get the most retweets in such a feed could be assumed to be funnier than jokes that get the least. I have been manually examing such feeds, and I do find that that criterion is very consistently the same as my personal judgment of pairs of jokes, where one is amount the most-retweeted and the other is among the least-retweeted within a particular feed.

So, I am thinking that evals could be useful which ask GPT to judge which joke of a joke pair is funnier, where the acceptable answer is the one with the most retweets. To make sure the judgements are as objective as they can be, the pairs would always contain one of the most retweeted jokes and one of the least from the same twitter account, which would be a popular one such the retweets can be in the thousands.

But this type of eval nevertheless has a clear subjective component.

Are such evals acceptable?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How deterministic should evals be? #629

How deterministic should evals be? #629

garyrob commented Apr 10, 2023

How deterministic should evals be? #629

How deterministic should evals be? #629

Comments

garyrob commented Apr 10, 2023