
[Discussion] Add Major Code Benchmarks #1157

Open · 6 tasks
haileyschoelkopf opened this issue Dec 18, 2023 · 4 comments · May be fixed by #1992
Labels
opinions wanted For discussing open questions.

Comments

@haileyschoelkopf
Contributor

Following #1152, we are now willing to support code benchmarks, which we previously did not support because we did not want to endorse un-sandboxed execution of model-generated code.

This issue is meant to collect discussion and solicit feedback on code generation benchmarks:

  • What benchmarks are most crucial to support?

  • What papers are the best references for prompting, reference benchmark numbers, etc.? (Reported numbers are typically pass@k; see the sketch after this list.)

  • Known drawbacks of these benchmarks (e.g. HumanEval's underspecified unit tests)

  • Progress on implementing these benchmarks:

    • MBPP
    • HumanEval (+)
    • HumanEvalPack
    • Program-of-Thought
    • PAL (Program-Aided LMs)
    • ???
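
For context on the "reference benchmark numbers" point, HumanEval and MBPP results are conventionally reported as pass@k, estimated with the unbiased estimator from the HumanEval paper. A minimal sketch of that estimator (standalone, not tied to any harness API):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), where n samples were drawn per
    problem and c of them passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 200 samples for one problem, 37 of which pass -> pass@1 ≈ 0.185
print(pass_at_k(200, 37, 1))
```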
@haonan-li
Contributor

Can we first support MBPP and HumanEval? These two have already been widely used and reported.

@StellaAthena
Member

Let's look at calling into the BigCode Evaluation Harness instead of implementing these from scratch ourselves. This could save time, but my primary motivation is to avoid undermining their work and to promote the ecosystem more broadly.

@haileyschoelkopf
Contributor Author

Spoke with @clefourrier regarding code tasks, and have some updates here:

The BigCode harness is still actively used, but it will largely remain at its current version, as the maintainers have taken on other responsibilities and projects. They've said we are welcome to incorporate BigCode tasks/features into the Eval Harness and are happy to answer questions.

So code tasks are now on the table. Our preference is for all execution of model-generated code to happen offline, in a second step, so that potentially unsafe code is only ever run on machines set aside for it. The ideal path would be to first implement a general `lm_eval score` utility to allow re-scoring of logged samples (#1627), then have user-defined metrics that require executing model-generated code run at that stage, on a separate machine if desired.
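
As a rough illustration of that second, offline step (a sketch only; the file layout and field names are assumptions, not the harness's actual logging format), one could read the logged samples, execute each generated completion against its tests in a subprocess with a timeout, and aggregate pass/fail results:

```python
import json
import os
import subprocess
import tempfile

def run_candidate(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Run one model-generated completion plus its unit tests in a fresh
    Python subprocess. A subprocess is NOT a sandbox; per the discussion
    above, this should only ever run on a disposable/isolated machine."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Hypothetical logged-sample format: one JSON object per line, with
# "completion" and "tests" fields.
results = []
with open("samples.jsonl") as fh:
    for line in fh:
        sample = json.loads(line)
        results.append(run_candidate(sample["completion"], sample["tests"]))

print(f"pass rate: {sum(results)}/{len(results)}")
```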

@clefourrier
Contributor

cc @loubnabnl, who was the core developer of the BigCode Harness

@hjlee1371 linked a pull request on Jun 19, 2024 that will close this issue
4 participants