[Discussion] Add Major Code Benchmarks #1157
Comments
Can we first support MBPP and HumanEval? These two have already been widely used and reported.
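For context, both MBPP and HumanEval results are conventionally reported with the unbiased pass@k estimator introduced alongside HumanEval. A minimal sketch of that estimator (function name and exact interface are illustrative, not taken from either harness):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn per problem, c = samples that passed."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with 10 samples of which 5 pass, `pass_at_k(10, 5, 1)` gives 0.5, i.e. the chance a single draw passes.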
Let's look at supporting calls into the BigCode Eval Harness instead of implementing these from scratch ourselves. This could save time, but my primary motivation is to avoid undermining their work and to promote the ecosystem more broadly.
Spoke with @clefourrier regarding code tasks, and have some updates here: the BigCode harness is still actively used, but it will predominantly stay at its current version, as the maintainers have moved on to other responsibilities and projects. They've said we are welcome to incorporate BigCode tasks/features into the Eval Harness and are happy to answer questions. So code tasks are now on the table. Our preference would be for all model-generated code execution to happen offline, in a second step, so that potentially unsafe code is only ever run on machines set up for it. The ideal path would be to first implement a general utility
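The two-step design described above (generate online, execute offline) could be sketched roughly as follows. All names here are hypothetical, not the actual harness API; executing candidates in a subprocess with a timeout is just one possible isolation mechanism, and a real setup would still want a proper sandbox:

```python
import json
import subprocess
import sys

def save_generations(records, path):
    """Step 1 (online, safe): dump model completions to JSONL
    without ever executing them."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def run_offline(path, timeout=5):
    """Step 2 (offline, on a sandboxed machine): execute each
    completion against its unit tests in a fresh subprocess."""
    results = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            program = rec["completion"] + "\n" + rec["test"]
            proc = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=timeout,
            )
            results.append({
                "task_id": rec["task_id"],
                "passed": proc.returncode == 0,
            })
    return results
```

The key property is that step 1 never calls `exec`/`subprocess` on model output, so the generation machinery can run anywhere; only step 2 needs a trusted environment.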
cc @loubnabnl, who was the core developer of the BigCode Harness.
Following #1152, we will be open to supporting code benchmarks, which we had previously declined to support because we did not want to endorse un-sandboxed execution of model-generated code.
This issue is meant to collect discussion and solicit feedback on code generation benchmarks:
- What benchmarks are most crucial to support?
- What papers are the best references for prompting, reference benchmark numbers, etc.?
- Known drawbacks of these benchmarks (e.g. HumanEval's underspecified unit tests)
- Progress on implementing these benchmarks