mela #1970

Open · wants to merge 1 commit into main
Conversation

Geralt-Targaryen

Add the ACL 2024 benchmark MELA (Multilingual Evaluation of Linguistic Acceptability).
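For anyone who wants to try the new task, here is a minimal sketch using the harness's Python entry point. The task name `mela` is assumed from this PR (it may be registered as a group with per-language subtasks), and the checkpoint below is only a placeholder.

```python
# Minimal sketch, assuming this PR registers the task under the name "mela".
# Uses lm-evaluation-harness's Python API; adjust model_args to the checkpoint you want.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # HuggingFace backend
    model_args="pretrained=bigscience/bloomz-7b1",  # placeholder; any HF checkpoint works
    tasks=["mela"],                                 # assumed task/group name from this PR
    num_fewshot=0,                                  # zero-shot; use 2 for two-shot
)
print(results["results"])
```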

@CLAassistant

CLAassistant commented Jun 16, 2024

CLA assistant check
All committers have signed the CLA.

@StellaAthena
Member

@Geralt-Targaryen Thanks for the contribution! Can you see about reproducing some of the scores reported in Table 3 to validate that the implementation is working correctly?

@Geralt-Targaryen
Author

> @Geralt-Targaryen Thanks for the contribution! Can you see about reproducing some of the scores reported in Table 3 to validate that the implementation is working correctly?

Yes, here are results for a few models from our original implementation and from the eval harness implementation:

| Model | Shots | Original (reported in the paper) | lm-eval-harness |
|---|---|---|---|
| BLOOMZ 7B | 0 | 5.85 | 5.99 ± 0.85 |
| BLOOMZ 7B | 2 | 4.31 | 4.11 ± 0.87 |
| mT0 13B | 0 | 6.62 | 7.72 ± 0.88 |
| mT0 13B | 2 | 7.70 | 5.82 ± 0.75 |
| mTk 13B | 0 | 2.24 | 3.16 ± 1.01 |
| mTk 13B | 2 | 12.05 | 12.26 ± 0.98 |

As we explained in the paper, linguistic acceptability is a task with large performance variation. Fluctuations resulting from the choice of in-context examples, floating-point precision, and prompt formatting are expected. One small difference between the two implementations is that our original version used two newlines after the task description, whereas the eval harness appears to collapse multiple newlines after the task description into one.
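To make the formatting point concrete, here is a tiny illustrative snippet (not harness code) showing the two prompt layouts being compared; the description and example strings are made-up placeholders, not the actual MELA prompt.

```python
# Illustrative only; the strings below are placeholders, not the real MELA prompt.
description = "Decide whether the sentence is linguistically acceptable."
sentence = "The cat sat on the mat."

# Original MELA implementation: a blank line between the task description and the example.
prompt_original = f"{description}\n\n{sentence}"

# What the eval harness appears to produce: the extra newline is collapsed into one.
prompt_harness = f"{description}\n{sentence}"

print(repr(prompt_original))
print(repr(prompt_harness))
```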
