
Validate MNLI #450

Closed
StellaAthena opened this issue May 1, 2023 · 5 comments
Labels: good first issue, validation

Comments

@StellaAthena
Member

No description provided.

@StellaAthena added the help wanted, good first issue, and validation labels and removed the help wanted label on May 1, 2023
@PatrykNeubauer

Hi, I'll be working on this one!

@StellaAthena changed the title from MNLI to Validate MNLI on May 8, 2023
@StellaAthena
Member Author

@PatrykNeubauer any progress?

@PatrykNeubauer

Hey, sorry but nothing concrete yet, as I've been mostly getting up to speed on the library and evaluation of LLMs in general.

What I've found:

  • While MNLI itself isn't evaluated in the GPT-3 and MT-NLG papers, the prompt format here is consistent with the one those papers use for ANLI.
  • MNLI is not present in the LLaMA paper.
  • Neither the MNLI nor the GLUE paper suggests a prompt format.
  • T5 is trained with MNLI as one of its tasks, using the "mnli hypothesis: {{hypothesis}} premise: {{premise}}" format, with results in Table 16 (a rendered example follows this list).
  • Meta has evaluated FLAN, LaMDA, OPT and OPT-IML with PromptSource input formats, where one of the formats is similar to the one used here.
  • Something I've mentioned on the Discord (as you know best) is that papers like T0 or BLOOM test several different PromptSource prompts for this task.
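
For reference, here's a minimal sketch of how the T5-style template above would render on one example; the premise/hypothesis pair is just an illustrative placeholder, not taken from the harness:

```python
# Rendering of the T5-style MNLI template from the bullet above.
# The example premise/hypothesis are illustrative placeholders.
t5_template = "mnli hypothesis: {hypothesis} premise: {premise}"

example = {
    "premise": "The new rights are nice enough.",
    "hypothesis": "Everyone really likes the newest benefits.",
}

print(t5_template.format(**example))
# mnli hypothesis: Everyone really likes the newest benefits. premise: The new rights are nice enough.
```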

Two possible sources of inconsistencies I've noticed:

  • lm-eval-harness uses a single newline between the hypothesis and the premise, while PromptSource uses two
  • not all sentences in MNLI end with punctuation, which makes the rendered prompt ungrammatical, e.g. "The loophole is now gone True, False, or Neither?" (both points are illustrated in the sketch after this list)
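
To make both points concrete, a quick sketch of the two renderings; the templates below are my approximation of the harness and PromptSource formats, not copied verbatim from either codebase:

```python
# Approximate comparison of the two prompt renderings discussed above
# (templates are paraphrased, not taken verbatim from either codebase).
premise = "Fees are charged at the entrance."   # illustrative placeholder
hypothesis = "The loophole is now gone"         # note: no trailing punctuation

# lm-eval-harness style: a single newline between the parts
harness_prompt = f"{premise}\nQuestion: {hypothesis} True, False or Neither?\nAnswer:"

# PromptSource style: two newlines between the same parts
promptsource_prompt = f"{premise}\n\nQuestion: {hypothesis} True, False or Neither?\n\nAnswer:"

print(harness_prompt)
# Fees are charged at the entrance.
# Question: The loophole is now gone True, False or Neither?  <- missing "." makes this read oddly
# Answer:
```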

@PatrykNeubauer commented May 11, 2023

Ran the evaluation on a few models, trying out both the current version of the prompt and a slightly modified one with an extra \n to see how a minor change like that affects the results.
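
(For context, here's a sketch of how one of these runs can be reproduced through the harness's Python entry point; the simple_evaluate arguments are written from memory and may differ between lm-evaluation-harness versions.)

```python
# Sketch of reproducing one of the runs below via the harness's Python API.
# The simple_evaluate signature is an assumption and may differ across versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=gpt2",
    tasks=["mnli"],
    num_fewshot=0,
)

print(results["results"]["mnli"])  # e.g. {'acc': ..., 'acc_stderr': ...}
```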

To summarize:

  • MNLI is rarely used in the "big" LLM papers; however, the results are around what these models normally get on NLI tasks (e.g. ANLI).
  • Of all the papers I've checked (skipping ones affiliated with EleutherAI), only OPT-IML and FLAN evaluate on MNLI, both using the FLAN benchmark. But similarly to PromptSource, FLAN evaluates on a few different prompts at once, making a direct comparison meaningless.
    • What's in their repo is, however, not consistent with their paper, where they mention using "Does {{premise}} mean that {{hypothesis}}?" and later change it to "Premise: {{premise}}\nHypothesis: {{hypothesis}}\nDoes the premise entail the hypothesis?" (pp. 5 and 30).
    • Also on p. 30 they mention that models not fine-tuned on instructions completely fail on their format of the task, so they also use the GPT-3 format for LaMDA-PT in their paper.
  • As older benchmarks (2018), neither GLUE nor MNLI suggests a prompt format; they just use softmax classifiers.

So:

  • I'd recommend leaving this task as it is, since that seems to be the most established format, used by OpenAI in GPT-3 (p. 51), Microsoft in MT-NLG (p. 15) and Google in FLAN (p. 30).
    • The only inconsistency seems to be how "Answer:" is added to the prompt (it's even missing in the GPT-3 paper); however, looking at other tasks, the current "{}\nQuestion: {} True, False or Neither?\nAnswer:" seems to be the best option (rendered in the sketch after this list).
  • Side note: this is slightly different from the most similar MNLI prompt in PromptSource.
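
To make the recommendation concrete, here's a sketch of the prompt rendered on an example, plus the likelihood comparison I'd expect over the three answer strings; the example text, choice order and scoring loop are my reading, not copied from the harness code:

```python
# Recommended prompt rendered on an illustrative example, plus a sketch of
# likelihood-based scoring: compare the model's log-likelihood of each
# candidate continuation after "Answer:". Example text, choice order and the
# scoring loop are illustrative assumptions, not lifted from the harness.
premise = "The new rights are nice enough."
hypothesis = "Everyone really likes the newest benefits."

prompt = f"{premise}\nQuestion: {hypothesis} True, False or Neither?\nAnswer:"
choices = [" True", " Neither", " False"]  # entailment / neutral / contradiction

def pick_label(loglikelihood_fn):
    """loglikelihood_fn(context, continuation) -> float, supplied by a model wrapper."""
    scores = [loglikelihood_fn(prompt, choice) for choice in choices]
    return choices[scores.index(max(scores))]

# Dummy scorer that prefers the shortest continuation, purely to show the call:
print(pick_label(lambda ctx, cont: -len(cont)))  # -> " True"
```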

With single \n:

GPT

hf-causal (pretrained=gpt2), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3372 | ± 0.0048 |

gpt3 (engine=davinci), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3943 | ± 0.0049 |

gpt3 (engine=text-davinci-003), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6456 | ± 0.0048 |

OPT

hf-causal (pretrained=facebook/opt-125m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3447 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-350m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3447 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-1.3b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3583 | ± 0.0048 |

T5

hf-seq2seq (pretrained=t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.5673 | ± 0.005  |

hf-seq2seq (pretrained=google/flan-t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6674 | ± 0.0048 |

With extra \n:

GPT

hf-causal (pretrained=gpt2), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3376 | ± 0.0048 |

gpt3 (engine=text-davinci-003), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6422 | ± 0.0048 |

OPT

hf-causal (pretrained=facebook/opt-125m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3536 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-350m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3452 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-1.3b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3583 | ± 0.0048 |

T5

hf-seq2seq (pretrained=t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.5673 | ± 0.005  |

hf-seq2seq (pretrained=google/flan-t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6674 | ± 0.0048 |

(perhaps worth noting that opt-125m did better than opt-350m with this format)

@StellaAthena
Member Author

This is an excellent report! I feel comfortable adopting this as our Officially Recommended Format now :)
