
Implement the Natural Language Inference (NLI) evaluation #23

Closed
1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 1 comment · Fixed by #56
Labels: feature request (A feature that isn't implemented yet.), good first issue (Good for newcomers)

Comments

@StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

> Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two or three class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral).
>
> SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.
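For concreteness, here is a minimal sketch of the two label spaces described above, assuming the HuggingFace `datasets` copies of RTE and ANLI (the field names and label conventions below come from those datasets, not from this issue):

```python
# Illustrative only: inspect the binary (RTE) and three-way (ANLI) NLI label spaces.
from datasets import load_dataset

# SuperGLUE RTE: label 0 = entailment, 1 = not_entailment
rte = load_dataset("super_glue", "rte")
ex = rte["validation"][0]
print(ex["premise"], "|", ex["hypothesis"], "|", ex["label"])

# ANLI ships three rounds (e.g. the round-3 dev split below).
# label 0 = entailment, 1 = neutral, 2 = contradiction
anli = load_dataset("anli")
ex = anli["dev_r3"][0]
print(ex["premise"], "|", ex["hypothesis"], "|", ex["label"])
```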

- [x] Data processing code implemented
- [ ] Evaluation implemented

This should be modeled after the BoolQ task in lm_eval/tasks/superglue.py.
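For whoever picks this up, a rough sketch of the evaluation half, following the BoolQ pattern of ranking a small set of answer strings by log-likelihood. The `loglikelihood` stub and the GPT-3-style True/False prompt are assumptions for illustration, not the harness's actual interface:

```python
# Hypothetical sketch: score RTE the way a BoolQ-style task scores yes/no
# answers, by ranking fixed answer strings by log-likelihood.
from datasets import load_dataset

def loglikelihood(context: str, continuation: str) -> float:
    """Placeholder for the harness's model interface: log P(continuation | context)."""
    raise NotImplementedError

def doc_to_text(doc):
    # GPT-3-style True/False framing (an assumption, not specified in this issue).
    return f"{doc['premise']}\nquestion: {doc['hypothesis']} True or False?\nanswer:"

def evaluate_rte():
    docs = load_dataset("super_glue", "rte")["validation"]
    choices = [" True", " False"]  # index 0 = entailment, 1 = not_entailment
    correct = 0
    for doc in docs:
        context = doc_to_text(doc)
        scores = [loglikelihood(context, c) for c in choices]
        pred = max(range(len(choices)), key=scores.__getitem__)
        correct += int(pred == doc["label"])
    return correct / len(docs)  # accuracy
```

ANLI would take the same shape with a third " Neither" choice for the neutral class.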

@StellaAthena added the "feature request" label Sep 16, 2020
@StellaAthena added this to "To do" in Implementing Evaluations via automation Sep 16, 2020
@StellaAthena added the "Eval Set" label and removed the "feature request" label Oct 23, 2020
@StellaAthena pinned this issue Oct 23, 2020
@anishthite moved this from "To do" to "In progress" in Implementing Evaluations Oct 24, 2020
@anishthite self-assigned this Oct 24, 2020
@anishthite linked a pull request Oct 24, 2020 that will close this issue
Implementing Evaluations automation moved this from "In progress" to "Data integrated, Eval not done" Oct 24, 2020
@StellaAthena unpinned this issue Oct 26, 2020
@StellaAthena reopened this Jan 5, 2021
@StellaAthena added the "feature request" and "good first issue" labels Jan 5, 2021
@leogao2 changed the title from "Implement the Natural Language Inference (NIL) evaluation" to "Implement the Natural Language Inference (NLI) evaluation" Jan 28, 2021
@leogao2 moved this from "In Progress" to "To do" in Implementing Evaluations Jan 28, 2021
@leogao2 (Contributor) commented Feb 12, 2021

I'm pretty sure NLI is a category rather than a single evaluation. Closing for now.

@leogao2 closed this as completed Feb 12, 2021
Implementing Evaluations automation moved this from "To do, Evaluations to Implement" to "Done, evaluations" Feb 12, 2021
@leogao2 moved this from "Done, evaluations" to "Done, other" in Implementing Evaluations Feb 12, 2021