
Refactored v0.4 version shows differences from the existing harness in Japanese #1392

Open
leocnj opened this issue Feb 4, 2024 · 2 comments

leocnj (Contributor) commented Feb 4, 2024

lm-eval v0.4 now contains several multilingual tasks, e.g., mgsm, xwinograd, and so on.

For Japanese (Ja), a widely used LLM eval tool is a fork of lm-eval (version < 0.4) by Stability-AI: https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable. It defines a suite of Ja tasks. This fork is referred to as harness_Ja in the rest of this issue.

I compared lm-eval v0.4 with harness_Ja on two tasks, mgsm and xwinograd.

lm-eval v0.4 result:

hf (pretrained=cyberagent/open-calm-1b,dtype=float16), gen_kwargs: (None), limit: 200.0, num_fewshot: None, batch_size: auto (64)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| xwinograd_jp | 1 | none | 0 | acc | 0.59 | ± 0.0349 |
| mgsm_direct_ja | 1 | remove_whitespace | 0 | exact_match | 0.00 | ± 0.0000 |

harness_Ja result:
hf-causal (pretrained=cyberagent/open-calm-1b), limit: [200, 200], num_fewshot: [0, 0], batch_size: 32

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| xwinograd_ja | 1 | acc | 0.660 | ± 0.0336 |
| mgsm | 1 | acc | 0.005 | ± 0.0050 |

Some differences can be observed:

  1. For the mgsm task, exact_match is now used, which differs from acc in harness_Ja.
  2. For the xwinograd task, the results differ: 0.59 (lm-eval v0.4) vs. 0.66 (harness_Ja).

Regarding difference 2, it looks like the loglikelihood computation is arranged differently in the two versions. In harness_Ja, the loglikelihood is computed over the entire sentence, as its construct_requests shows below, whereas lm-eval v0.4 scores only a partial sentence.

```python
def construct_requests(self, doc, ctx):
    assert not ctx

    # harness_Ja: score each fully filled-in sentence against an empty context.
    return [
        rf.loglikelihood("", doc["sentence1"]),
        rf.loglikelihood("", doc["sentence2"]),
    ]
```
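
For contrast, here is a minimal sketch of the v0.4-style arrangement, written in the same rf idiom to make the comparison direct. The function name, the doc field names, and the split-at-the-blank logic are illustrative assumptions, not the actual v0.4 code:

```python
def construct_partial_requests(doc):
    # Illustrative sketch: split the template at the blank, fill the
    # prefix with each candidate, and score only the shared suffix,
    # i.e. P(suffix | prefix + option).
    prefix, suffix = doc["sentence"].split("_")
    return [
        rf.loglikelihood(prefix + doc["option1"], suffix),
        rf.loglikelihood(prefix + doc["option2"], suffix),
    ]
```

Under this arrangement, only the tokens of the shared suffix contribute to the score, so the candidates compete purely on how well they predict the continuation.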
leocnj (Contributor, Author) commented Feb 4, 2024

Here is one sample to show the input differences when computing loglikelihood:

The city councilmen refused the demonstrators a permit because _ feared violence.

options: [the city councilmen, the demonstrators]

For the first option, the scored inputs are:

- harness_Ja: The city councilmen refused the demonstrators a permit because the city councilmen feared violence.
- lm-eval v0.4: The city councilmen refused the demonstrators a permit because the city councilmen

StellaAthena (Member) commented

If the Stability AI fork is widely used by the Japanese LLM community, we should prioritize the formatting used there. For mgsm, I'm accordingly inclined to say we should use acc.

However, in the case of the xwinograd implementation, I'm less certain about how to proceed. The Stability AI fork implements it in a fashion that's contradictory to widely accepted practice in the English LLM community. The way we do it is how pretty much every LLM paper going back to (and before) GPT-3 does it for autoregressive language models. Specifically, we compute P(feared violence | The city councilmen refused the demonstrators a permit because the city councilmen). That is, we look at which value of the masked term makes the true continuation most probable. I think we should probably support both options with careful documentation regarding the differences, but I'm not sure what we should use as the default.
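
A minimal sketch of what supporting both conventions behind a flag could look like, assuming a generic loglikelihood(context, continuation) scorer; score_option and full_sentence are hypothetical names, not existing lm-eval API:

```python
def score_option(loglikelihood, prefix, option, suffix, full_sentence=False):
    # full_sentence=True  -> harness_Ja convention: empty context,
    #                        the whole filled-in sentence is scored.
    # full_sentence=False -> lm-eval v0.4 convention: the filled-in
    #                        prefix is the context; only the suffix
    #                        (the true continuation) is scored.
    if full_sentence:
        return loglikelihood("", prefix + option + suffix)
    return loglikelihood(prefix + option, suffix)
```

The predicted answer would then be the option with the highest score; under the v0.4 default this is the argmax over options of P(suffix | prefix + option), matching the GPT-3-style convention described above.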

@leocnj are there other tasks where there are differences? Have you had an opportunity to verify the others are the same?
