Why Does Function Calling Lower Model Accuracy on Berkeley Leaderboard? #606
-
First off, big thanks to everyone involved in creating the Berkeley Function-Calling Leaderboard. It’s a fantastic resource for understanding how well large models handle function calling tasks.

I’ve noticed something curious on the leaderboard: for the same model, such as GPT-4-1106-Preview, supplying the function definitions through the system prompt seems to yield better results than enabling the native function calling feature. For instance, GPT-4-1106-Preview (Prompt) has an overall accuracy of 85.65, while GPT-4-1106-Preview (FC) comes in at 79.65. Why does enabling the function calling feature result in a drop in performance? Am I missing something here, or is this actually the case? If anyone has come across any discussions or papers on this, I’d appreciate a pointer.
-
Great question! One way to understand this is to look at the breakdown by selecting the "Expand/Collapse Table" feature. Many models have very limited native function-calling capability; for example, some do not support multiple or parallel function calls at all. With prompting, on the other hand, they can handle these cases because the instructions spell out the expected behavior. So it's quite likely that the FC models perform well on the simple tasks, but lack support for the more advanced categories where prompting models still score non-zero. Net, for some models, prompting is better in expectation!
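
To make the distinction concrete, here is a rough sketch of how the two modes differ at the API level, assuming the OpenAI Python SDK v1.x. The `get_weather` schema and the exact system-prompt wording are hypothetical illustrations, not BFCL's actual test functions or harness prompt:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical function schema for illustration only.
weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

question = "What's the weather in Berlin and in Paris?"

# FC mode: the schema goes through the native `tools` parameter, and the
# model's built-in tool-calling machinery decides how many calls to emit.
# If it only ever emits one call, the parallel-call categories score zero
# regardless of how capable the model is otherwise.
fc = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": question}],
    tools=[{"type": "function", "function": weather_schema}],
)
print(fc.choices[0].message.tool_calls)

# Prompt mode: the same schema is serialized into the system prompt, and the
# model answers in plain text that the harness then parses. The instructions
# can explicitly ask for one call per city, so multiple/parallel calling can
# work even when the native FC path does not support it.
prompt = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {
            "role": "system",
            "content": "You have access to this function and should respond "
                       f"only with calls of the form name(arg=value): {weather_schema}",
        },
        {"role": "user", "content": question},
    ],
)
print(prompt.choices[0].message.content)
```

In FC mode, the advanced categories depend on what the native tool-calling endpoint is able to emit; in prompt mode, they depend only on what the model can be instructed to write, which is why the gap shows up unevenly across the expanded table.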