How deterministic should evals be? For example, one area where GPT-4 performs poorly is judging whether a given joke is funny.
There are Twitter feeds that consist almost entirely of jokes. Some are quite popular, and the best jokes get many retweets. So, the jokes that get the most retweets in such a feed could be assumed to be funnier than the jokes that get the fewest. I have been manually examining such feeds, and I find that this criterion agrees very consistently with my personal judgment of joke pairs in which one joke is among the most-retweeted and the other is among the least-retweeted within a particular feed.
So, I am thinking that it could be useful to have evals that ask GPT to judge which joke in a pair is funnier, where the acceptable answer is the one with more retweets. To make the judgments as objective as they can be, each pair would always contain one of the most-retweeted jokes and one of the least-retweeted jokes from the same Twitter account, which would be a popular one, such that the retweets can be in the thousands.
But this type of eval nevertheless has a clear subjective component.
Are such evals acceptable?
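To make the proposal concrete, here is a minimal sketch of how such pairs could be built from one feed. It is only an illustration under assumptions: the `Joke` class, the `build_pairs` function, the prompt wording, and the `input`/`ideal` field names are all hypothetical placeholders, not a confirmed eval schema, and the retweet counts would have to be collected separately.

```python
import random
from dataclasses import dataclass


@dataclass
class Joke:
    text: str
    retweets: int  # retweet count within a single feed (assumed already collected)


def build_pairs(jokes, k=2, seed=0):
    """Pair each of the k most-retweeted jokes with one of the k least-retweeted."""
    if len(jokes) < 2 * k:
        raise ValueError("need at least 2*k jokes from the feed")
    ranked = sorted(jokes, key=lambda j: j.retweets, reverse=True)
    top, bottom = ranked[:k], ranked[-k:]
    rng = random.Random(seed)
    samples = []
    for popular, unpopular in zip(top, bottom):
        # Randomize which slot holds the popular joke so position gives no hint.
        if rng.random() < 0.5:
            a, b, ideal = popular, unpopular, "A"
        else:
            a, b, ideal = unpopular, popular, "B"
        prompt = (
            "Which joke is funnier? Answer A or B.\n"
            f"A: {a.text}\n"
            f"B: {b.text}"
        )
        # "ideal" is the label of the joke with more retweets.
        samples.append({"input": prompt, "ideal": ideal})
    return samples
```

Scoring would then be an exact match between the model's answer and `ideal`, so the grading itself is deterministic even though the ground truth rests on retweet counts as a proxy for funniness.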