
Implement the Natural Language Inference (NLI) evaluation #23

Closed
1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 1 comment · Fixed by #56
Labels: feature request (A feature that isn't implemented yet.), good first issue (Good for newcomers)

Comments

@StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

> Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two or three class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral).
>
> SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.
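For concreteness, here is a minimal sketch of the two label spaces described above, assuming the HuggingFace `datasets` copies of RTE and ANLI (the field names and label conventions below come from those datasets, not from this issue):

```python
# Illustrative only: inspect the binary (RTE) and three-way (ANLI) NLI label spaces.
from datasets import load_dataset

# SuperGLUE RTE: label 0 = entailment, 1 = not_entailment
rte = load_dataset("super_glue", "rte")
ex = rte["validation"][0]
print(ex["premise"], "|", ex["hypothesis"], "|", ex["label"])

# ANLI ships three rounds (e.g. the round-3 dev split below).
# label 0 = entailment, 1 = neutral, 2 = contradiction
anli = load_dataset("anli")
ex = anli["dev_r3"][0]
print(ex["premise"], "|", ex["hypothesis"], "|", ex["label"])
```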

- [x] Data processing code implemented
- [ ] Evaluation implemented

This should be modeled after the BoolQ task in lm_eval/tasks/superglue.py.
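For whoever picks this up, a rough sketch of the evaluation half, following the BoolQ pattern of ranking a small set of answer strings by log-likelihood. The `loglikelihood` stub and the GPT-3-style True/False prompt are assumptions for illustration, not the harness's actual interface:

```python
# Hypothetical sketch: score RTE the way a BoolQ-style task scores yes/no
# answers, by ranking fixed answer strings by log-likelihood.
from datasets import load_dataset

def loglikelihood(context: str, continuation: str) -> float:
    """Placeholder for the harness's model interface: log P(continuation | context)."""
    raise NotImplementedError

def doc_to_text(doc):
    # GPT-3-style True/False framing (an assumption, not specified in this issue).
    return f"{doc['premise']}\nquestion: {doc['hypothesis']} True or False?\nanswer:"

def evaluate_rte():
    docs = load_dataset("super_glue", "rte")["validation"]
    choices = [" True", " False"]  # index 0 = entailment, 1 = not_entailment
    correct = 0
    for doc in docs:
        context = doc_to_text(doc)
        scores = [loglikelihood(context, c) for c in choices]
        pred = max(range(len(choices)), key=scores.__getitem__)
        correct += int(pred == doc["label"])
    return correct / len(docs)  # accuracy
```

ANLI would take the same shape with a third " Neither" choice for the neutral class.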

@StellaAthena added the "feature request" label Sep 16, 2020
@StellaAthena added this to "To do" in Implementing Evaluations via automation Sep 16, 2020
@StellaAthena added the "Eval Set" label and removed the "feature request" label Oct 23, 2020
@StellaAthena pinned this issue Oct 23, 2020
@anishthite moved this from "To do" to "In progress" in Implementing Evaluations Oct 24, 2020
@anishthite self-assigned this Oct 24, 2020
@anishthite linked a pull request Oct 24, 2020 that will close this issue
Implementing Evaluations automation moved this from "In progress" to "Data integrated, Eval not done" Oct 24, 2020
@StellaAthena unpinned this issue Oct 26, 2020
@StellaAthena reopened this Jan 5, 2021
@StellaAthena added the "feature request" and "good first issue" labels Jan 5, 2021
@leogao2 changed the title from "Implement the Natural Language Inference (NIL) evaluation" to "Implement the Natural Language Inference (NLI) evaluation" Jan 28, 2021
@leogao2 moved this from "In Progress" to "To do" in Implementing Evaluations Jan 28, 2021
@leogao2 (Contributor) commented Feb 12, 2021

I'm pretty sure NLI is a category rather than a single evaluation. Closing for now.

@leogao2 closed this as completed Feb 12, 2021
Implementing Evaluations automation moved this from "To do, Evaluations to Implement" to "Done, evaluations" Feb 12, 2021
@leogao2 moved this from "Done, evaluations" to "Done, other" in Implementing Evaluations Feb 12, 2021