
Validate MNLI #450

Closed
StellaAthena opened this issue May 1, 2023 · 5 comments
Labels: good first issue, validation

Comments

@StellaAthena
Member

No description provided.

@StellaAthena added the help wanted, good first issue, and validation labels and removed the help wanted label on May 1, 2023
@PatrykNeubauer

Hi, I'll be working on this one!

@StellaAthena changed the title from MNLI to Validate MNLI on May 8, 2023
@StellaAthena
Member Author

@PatrykNeubauer any progress?

@PatrykNeubauer

Hey, sorry but nothing concrete yet, as I've been mostly getting up to speed on the library and evaluation of LLMs in general.

What I've found:

  • While MNLI itself isn't evaluated in the GPT-3 and MT-NLG papers, the prompt format here is consistent with the one those papers use for ANLI.
  • MNLI is not present in the LLaMA paper.
  • Neither the MNLI nor the GLUE paper suggests a prompt format.
  • T5 is trained with MNLI as one of its tasks, using the "mnli hypothesis: {{hypothesis}} premise: {{premise}}" format, with results in Table 16 (a rendered example follows this list).
  • Meta has evaluated FLAN, LaMDA, OPT and OPT-IML with PromptSource input formats, where one of the formats is similar to the one used here.
  • Something I've mentioned on the Discord (as you know best) is that papers like T0 or BLOOM test several different PromptSource prompts for this task.
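
For reference, here's a minimal sketch of how the T5-style template above would render on one example; the premise/hypothesis pair is just an illustrative placeholder, not taken from the harness:

```python
# Rendering of the T5-style MNLI template from the bullet above.
# The example premise/hypothesis are illustrative placeholders.
t5_template = "mnli hypothesis: {hypothesis} premise: {premise}"

example = {
    "premise": "The new rights are nice enough.",
    "hypothesis": "Everyone really likes the newest benefits.",
}

print(t5_template.format(**example))
# mnli hypothesis: Everyone really likes the newest benefits. premise: The new rights are nice enough.
```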

Two possible sources of inconsistencies I've noticed:

  • lm-eval-harness uses a single newline between the hypothesis and the premise, while PromptSource uses two
  • not all sentences in MNLI end with punctuation, which makes the rendered prompt ungrammatical, e.g. "The loophole is now gone True, False, or Neither?" (both points are illustrated in the sketch after this list)
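
To make both points concrete, a quick sketch of the two renderings; the templates below are my approximation of the harness and PromptSource formats, not copied verbatim from either codebase:

```python
# Approximate comparison of the two prompt renderings discussed above
# (templates are paraphrased, not taken verbatim from either codebase).
premise = "Fees are charged at the entrance."   # illustrative placeholder
hypothesis = "The loophole is now gone"         # note: no trailing punctuation

# lm-eval-harness style: a single newline between the parts
harness_prompt = f"{premise}\nQuestion: {hypothesis} True, False or Neither?\nAnswer:"

# PromptSource style: two newlines between the same parts
promptsource_prompt = f"{premise}\n\nQuestion: {hypothesis} True, False or Neither?\n\nAnswer:"

print(harness_prompt)
# Fees are charged at the entrance.
# Question: The loophole is now gone True, False or Neither?  <- missing "." makes this read oddly
# Answer:
```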

@PatrykNeubauer commented May 11, 2023

Ran the evaluation on a few models, trying out both the current version of the prompt and a slightly modified one with an extra \n to see how a minor change like that affects the results.
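
(For context, here's a sketch of how one of these runs can be reproduced through the harness's Python entry point; the simple_evaluate arguments are written from memory and may differ between lm-evaluation-harness versions.)

```python
# Sketch of reproducing one of the runs below via the harness's Python API.
# The simple_evaluate signature is an assumption and may differ across versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=gpt2",
    tasks=["mnli"],
    num_fewshot=0,
)

print(results["results"]["mnli"])  # e.g. {'acc': ..., 'acc_stderr': ...}
```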

To summarize:

  • MNLI is rarely used in the "big" LLM papers; however, the results are around what these models normally get on NLI tasks (e.g. ANLI).
  • Of all the papers I've checked (skipping ones affiliated with EleutherAI), only OPT-IML and FLAN evaluate on MNLI, both using the FLAN benchmark. But similarly to PromptSource, FLAN evaluates on a few different prompts at once, making a direct comparison meaningless.
    • What's in their repo is, however, not consistent with their paper, where they mention using "Does {{premise}} mean that {{hypothesis}}?" and later change it to "Premise: {{premise}}\nHypothesis: {{hypothesis}}\nDoes the premise entail the hypothesis?" (pp. 5 and 30).
    • Also on p. 30 they mention that models not fine-tuned on instructions completely fail on their format of the task, so they also use the GPT-3 format for LaMDA-PT in their paper.
  • As older benchmarks (2018), neither GLUE nor MNLI suggests a prompt format; they just use softmax classifiers.

So:

  • I'd recommend leaving this task as it is, since that seems to be the most established format, used by OpenAI in GPT-3 (p. 51), Microsoft in MT-NLG (p. 15) and Google in FLAN (p. 30).
    • The only inconsistency seems to be how "Answer:" is added to the prompt (it's even missing in the GPT-3 paper); however, looking at other tasks, the current "{}\nQuestion: {} True, False or Neither?\nAnswer:" seems to be the best option (rendered in the sketch after this list).
  • Side note: this is slightly different from the most similar MNLI prompt in PromptSource.
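
To make the recommendation concrete, here's a sketch of the prompt rendered on an example, plus the likelihood comparison I'd expect over the three answer strings; the example text, choice order and scoring loop are my reading, not copied from the harness code:

```python
# Recommended prompt rendered on an illustrative example, plus a sketch of
# likelihood-based scoring: compare the model's log-likelihood of each
# candidate continuation after "Answer:". Example text, choice order and the
# scoring loop are illustrative assumptions, not lifted from the harness.
premise = "The new rights are nice enough."
hypothesis = "Everyone really likes the newest benefits."

prompt = f"{premise}\nQuestion: {hypothesis} True, False or Neither?\nAnswer:"
choices = [" True", " Neither", " False"]  # entailment / neutral / contradiction

def pick_label(loglikelihood_fn):
    """loglikelihood_fn(context, continuation) -> float, supplied by a model wrapper."""
    scores = [loglikelihood_fn(prompt, choice) for choice in choices]
    return choices[scores.index(max(scores))]

# Dummy scorer that prefers the shortest continuation, purely to show the call:
print(pick_label(lambda ctx, cont: -len(cont)))  # -> " True"
```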

With single \n:

GPT

hf-causal (pretrained=gpt2), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3372 | ± 0.0048 |

gpt3 (engine=davinci), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3943 | ± 0.0049 |

gpt3 (engine=text-davinci-003), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6456 | ± 0.0048 |

OPT

hf-causal (pretrained=facebook/opt-125m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3447 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-350m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3447 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-1.3b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3583 | ± 0.0048 |

T5

hf-seq2seq (pretrained=t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.5673 | ± 0.005  |

hf-seq2seq (pretrained=google/flan-t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6674 | ± 0.0048 |

With extra \n:

GPT

hf-causal (pretrained=gpt2), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3376 | ± 0.0048 |

gpt3 (engine=text-davinci-003), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6422 | ± 0.0048 |

OPT

hf-causal (pretrained=facebook/opt-125m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3536 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-350m), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3452 | ± 0.0048 |

hf-causal (pretrained=facebook/opt-1.3b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.3583 | ± 0.0048 |

T5

hf-seq2seq (pretrained=t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.5673 | ± 0.005  |

hf-seq2seq (pretrained=google/flan-t5-base), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value  | Stderr   |
|------|---------|--------|--------|----------|
| mnli | 0       | acc    | 0.6674 | ± 0.0048 |

(perhaps worth noting that opt-125m did better than opt-350m with this format)

@StellaAthena
Member Author

This is an excellent report! I feel comfortable adopting this as our Officially Recommended Format now :)
