Why Does Function Calling Lower Model Accuracy on Berkeley Leaderboard? #606
-
First off, big thanks to everyone involved in creating the Berkeley Function-Calling Leaderboard. It’s a fantastic resource for understanding how well large models handle function calling tasks.

I’ve noticed something curious on the leaderboard: for the same model, such as GPT-4-1106-Preview, supplying the function definitions through the system prompt seems to yield better results than enabling the native function calling feature. For instance, GPT-4-1106-Preview (Prompt) has an overall accuracy of 85.65, while GPT-4-1106-Preview (FC) comes in at 79.65. Why does enabling the function calling feature result in a drop in performance? Am I missing something here, or is this actually the case? If anyone has come across any discussions or papers on this, I’d appreciate a pointer.
-
Great question! One way to understand this is to look at the breakdown by selecting the "Expand/Collapse Table" feature. Many models have very limited native function-calling capability; for example, some do not support multiple or parallel function calls at all. With prompting, on the other hand, they can handle these cases because the instructions spell out the expected behavior. So it's quite likely that the FC models perform well on the simple tasks, but lack support for the more advanced categories where prompting models still score non-zero. Net, for some models, prompting is better in expectation!
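
To make the distinction concrete, here is a rough sketch of how the two modes differ at the API level, assuming the OpenAI Python SDK v1.x. The `get_weather` schema and the exact system-prompt wording are hypothetical illustrations, not BFCL's actual test functions or harness prompt:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical function schema for illustration only.
weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

question = "What's the weather in Berlin and in Paris?"

# FC mode: the schema goes through the native `tools` parameter, and the
# model's built-in tool-calling machinery decides how many calls to emit.
# If it only ever emits one call, the parallel-call categories score zero
# regardless of how capable the model is otherwise.
fc = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": question}],
    tools=[{"type": "function", "function": weather_schema}],
)
print(fc.choices[0].message.tool_calls)

# Prompt mode: the same schema is serialized into the system prompt, and the
# model answers in plain text that the harness then parses. The instructions
# can explicitly ask for one call per city, so multiple/parallel calling can
# work even when the native FC path does not support it.
prompt = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {
            "role": "system",
            "content": "You have access to this function and should respond "
                       f"only with calls of the form name(arg=value): {weather_schema}",
        },
        {"role": "user", "content": question},
    ],
)
print(prompt.choices[0].message.content)
```

In FC mode, the advanced categories depend on what the native tool-calling endpoint is able to emit; in prompt mode, they depend only on what the model can be instructed to write, which is why the gap shows up unevenly across the expanded table.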