
Refactored v0.4 version shows differences from the existing harness in Japanese #1392

Open
leocnj opened this issue Feb 4, 2024 · 2 comments

leocnj (Contributor) commented Feb 4, 2024

lm-eval v0.4 now contains several multilingual tasks, e.g., mgsm, xwinograd, and so on.

For Japanese (Ja), a widely used LLM eval tool is a fork of lm-eval (version < 0.4) by Stability-AI: https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable. It defines a suite of Ja tasks. This fork is referred to as harness_Ja in the rest of this issue.

I compared lm-eval v0.4 with harness_Ja on two tasks, mgsm and xwinograd.

lm-eval v0.4 result:

hf (pretrained=cyberagent/open-calm-1b,dtype=float16), gen_kwargs: (None), limit: 200.0, num_fewshot: None, batch_size: auto (64)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| xwinograd_jp | 1 | none | 0 | acc | 0.59 | ± 0.0349 |
| mgsm_direct_ja | 1 | remove_whitespace | 0 | exact_match | 0.00 | ± 0.0000 |

harness_Ja result:
hf-causal (pretrained=cyberagent/open-calm-1b), limit: [200, 200], num_fewshot: [0, 0], batch_size: 32

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| xwinograd_ja | 1 | acc | 0.660 | ± 0.0336 |
| mgsm | 1 | acc | 0.005 | ± 0.0050 |

Some differences can be observed:

  1. For the mgsm task, exact_match is now used, which differs from acc in harness_Ja.
  2. For the xwinograd task, the results differ: 0.59 (lm-eval v0.4) vs. 0.66 (harness_Ja).

Regarding difference 2, it looks like the loglikelihood computation is arranged differently in the two versions. In harness_Ja, the loglikelihood is computed over the entire sentence, as its construct_requests shows below, whereas lm-eval v0.4 scores only a partial sentence.

```python
def construct_requests(self, doc, ctx):
    assert not ctx

    # harness_Ja: score each fully filled-in sentence against an empty context.
    return [
        rf.loglikelihood("", doc["sentence1"]),
        rf.loglikelihood("", doc["sentence2"]),
    ]
```
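
For contrast, here is a minimal sketch of the v0.4-style arrangement, written in the same rf idiom to make the comparison direct. The function name, the doc field names, and the split-at-the-blank logic are illustrative assumptions, not the actual v0.4 code:

```python
def construct_partial_requests(doc):
    # Illustrative sketch: split the template at the blank, fill the
    # prefix with each candidate, and score only the shared suffix,
    # i.e. P(suffix | prefix + option).
    prefix, suffix = doc["sentence"].split("_")
    return [
        rf.loglikelihood(prefix + doc["option1"], suffix),
        rf.loglikelihood(prefix + doc["option2"], suffix),
    ]
```

Under this arrangement, only the tokens of the shared suffix contribute to the score, so the candidates compete purely on how well they predict the continuation.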
leocnj (Contributor, Author) commented Feb 4, 2024

Here is one sample to show the input differences when computing loglikelihood:

The city councilmen refused the demonstrators a permit because _ feared violence.

options: [the city councilmen, the demonstrators]

For the first option, the scored inputs are:

- harness_Ja: The city councilmen refused the demonstrators a permit because the city councilmen feared violence.
- lm-eval v0.4: The city councilmen refused the demonstrators a permit because the city councilmen

StellaAthena (Member) commented

If the Stability AI fork is widely used by the Japanese LLM community, we should prioritize the formatting used there. For mgsm, I'm accordingly inclined to say we should use acc.

However, in the case of the xwinograd implementation, I'm less certain about how to proceed. The Stability AI fork implements it in a fashion that's contradictory to widely accepted practice in the English LLM community. The way we do it is how pretty much every LLM paper going back to (and before) GPT-3 does it for autoregressive language models. Specifically, we compute P(feared violence | The city councilmen refused the demonstrators a permit because the city councilmen). That is, we look at which value of the masked term makes the true continuation most probable. I think we should probably support both options with careful documentation regarding the differences, but I'm not sure what we should use as the default.
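
A minimal sketch of what supporting both conventions behind a flag could look like, assuming a generic loglikelihood(context, continuation) scorer; score_option and full_sentence are hypothetical names, not existing lm-eval API:

```python
def score_option(loglikelihood, prefix, option, suffix, full_sentence=False):
    # full_sentence=True  -> harness_Ja convention: empty context,
    #                        the whole filled-in sentence is scored.
    # full_sentence=False -> lm-eval v0.4 convention: the filled-in
    #                        prefix is the context; only the suffix
    #                        (the true continuation) is scored.
    if full_sentence:
        return loglikelihood("", prefix + option + suffix)
    return loglikelihood(prefix + option, suffix)
```

The predicted answer would then be the option with the highest score; under the v0.4 default this is the argmax over options of P(suffix | prefix + option), matching the GPT-3-style convention described above.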

@leocnj are there other tasks where there are differences? Have you had an opportunity to verify the others are the same?
