
Implement the Adversarial Natural Language Inference (ANLI) evaluation #24

Closed · 1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 1 comment
Labels: feature request (a feature that isn't implemented yet)
StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

> Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two or three class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral).
>
> SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼ 33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.
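For concreteness, each item in this task is a premise/hypothesis pair with one of three labels. A made-up example (not an actual ANLI record):

```python
# Illustrative only; this record is invented, not taken from the dataset.
example = {
    "premise": "The cat sat on the mat all afternoon.",
    "hypothesis": "The cat was on the mat.",
    "label": 0,  # 0 = entailment, 1 = neutral, 2 = contradiction
}
```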

- [x] Data processing code implemented
- [ ] Evaluation implemented

This should be modeled after the BoolQ task in lm_eval/tasks/superglue.py.
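A rough sketch of how that could look, assuming the `Task` interface in `lm_eval/base.py` and the HuggingFace `anli` dataset (the class name, prompt format, and download mechanism below follow the BoolQ pattern and are illustrative, not the final implementation):

```python
# Sketch only: names and prompt wording are assumptions modeled on BoolQ.
from datasets import load_dataset
from lm_eval.base import Task

class ANLIRound1(Task):
    def download(self):
        self.data = load_dataset("anli")  # assumes the HF hub dataset

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def training_docs(self):
        return self.data["train_r1"]

    def validation_docs(self):
        return self.data["dev_r1"]

    def test_docs(self):
        return self.data["test_r1"]

    def doc_to_text(self, doc):
        # Premise/hypothesis framing, analogous to BoolQ's passage/question prompt.
        return (f"{doc['premise']}\nQuestion: {doc['hypothesis']} "
                "True, False, or Neither?\nAnswer:")

    def doc_to_target(self, doc):
        # ANLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
        return " " + ["True", "Neither", "False"][doc["label"]]
```

Rounds 2 and 3 would be near-identical classes pointing at the `_r2`/`_r3` splits.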

@StellaAthena StellaAthena added the feature request label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 1, 2020

Note: HuggingFace includes this in its datasets package.

https://huggingface.co/datasets/anli
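Loading it might look like this (split names per the dataset card: one train/dev/test triple per adversarial round):

```python
from datasets import load_dataset

anli = load_dataset("anli")
# Splits: train_r1/dev_r1/test_r1, train_r2/dev_r2/test_r2, train_r3/dev_r3/test_r3
print(anli["dev_r1"][0])
# Each record has 'uid', 'premise', 'hypothesis', 'label' (0/1/2), and 'reason'
```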

@leogao2 leogao2 moved this from To do to In progress in Implementing Evaluations Oct 5, 2020
@leogao2 leogao2 moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 6, 2020
@StellaAthena StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
@StellaAthena StellaAthena reopened this Jan 5, 2021
@StellaAthena StellaAthena added the feature request and good first issue labels Jan 5, 2021
@StellaAthena StellaAthena removed the good first issue label Jan 21, 2021
@leogao2 leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
@leogao2 leogao2 changed the title Implement the Adversarial Natural Language Inference (ANIL) evaluation Implement the Adversarial Natural Language Inference (ANLI) evaluation Jan 28, 2021
@leogao2 leogao2 moved this from To do to In Progress in Implementing Evaluations Jan 30, 2021
@leogao2 leogao2 moved this from In Progress to Done in Implementing Evaluations Jan 30, 2021
@leogao2 leogao2 closed this as completed Jan 30, 2021
StellaAthena pushed a commit to asas-lab/lm-evaluation-harness that referenced this issue May 29, 2022: Add the Schema Guided Dialogue (DSTC8) - Response generation (…stc8)
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024: xnli changes for open models