
[Discussion] Add Major Code Benchmarks #1157

Open · 6 tasks
haileyschoelkopf opened this issue Dec 18, 2023 · 4 comments · May be fixed by #1992
Labels
opinions wanted For discussing open questions.

Comments

@haileyschoelkopf
Contributor

Following #1152, we are now willing to support code benchmarks, which we previously did not support because we did not want to endorse un-sandboxed execution of model-generated code.

This issue is meant to collect discussion and solicit feedback on code generation benchmarks:

  • What benchmarks are most crucial to support?

  • What papers are the best references for prompting, reference benchmark numbers, etc.? (Reported numbers are typically pass@k; see the sketch after this list.)

  • Known drawbacks of these benchmarks (e.g. HumanEval's underspecified unit tests)

  • Progress on implementing these benchmarks:

    • MBPP
    • HumanEval (+)
    • HumanEvalPack
    • Program-of-Thought
    • PAL (Program-Aided LMs)
    • ???
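
For context on the "reference benchmark numbers" point, HumanEval and MBPP results are conventionally reported as pass@k, estimated with the unbiased estimator from the HumanEval paper. A minimal sketch of that estimator (standalone, not tied to any harness API):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), where n samples were drawn per
    problem and c of them passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 200 samples for one problem, 37 of which pass -> pass@1 ≈ 0.185
print(pass_at_k(200, 37, 1))
```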
@haonan-li
Contributor

Can we first support MBPP and HumanEval? These two have already been widely used and reported.

@StellaAthena
Member

Let's look at calling into the BigCode Evaluation Harness instead of implementing these from scratch ourselves. This could save time, but my primary motivation is to avoid undermining their work and to promote the ecosystem more broadly.

@haileyschoelkopf
Contributor Author

Spoke with @clefourrier regarding code tasks, and have some updates here:

The BigCode harness is still actively used, but it will largely remain at its current version, as the maintainers have taken on other responsibilities and projects. They've said we are welcome to incorporate BigCode tasks/features into the Eval Harness and are happy to answer questions.

So code tasks are now on the table. Our preference is for all execution of model-generated code to happen offline, in a second step, so that potentially unsafe code is only ever run on machines set aside for it. The ideal path would be to first implement a general `lm_eval score` utility to allow re-scoring of logged samples (#1627), then have user-defined metrics that require executing model-generated code run at that stage, on a separate machine if desired.
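
As a rough illustration of that second, offline step (a sketch only; the file layout and field names are assumptions, not the harness's actual logging format), one could read the logged samples, execute each generated completion against its tests in a subprocess with a timeout, and aggregate pass/fail results:

```python
import json
import os
import subprocess
import tempfile

def run_candidate(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Run one model-generated completion plus its unit tests in a fresh
    Python subprocess. A subprocess is NOT a sandbox; per the discussion
    above, this should only ever run on a disposable/isolated machine."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Hypothetical logged-sample format: one JSON object per line, with
# "completion" and "tests" fields.
results = []
with open("samples.jsonl") as fh:
    for line in fh:
        sample = json.loads(line)
        results.append(run_candidate(sample["completion"], sample["tests"]))

print(f"pass rate: {sum(results)}/{len(results)}")
```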

@clefourrier
Contributor

cc @loubnabnl, who was the core developer of the BigCode Harness

@hjlee1371 linked a pull request on Jun 19, 2024 that will close this issue
4 participants