
Implement the WSC273 Winograd Schemas Challenge evaluation #12

Closed · 2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 3 comments · Fixed by #4
Labels: feature request, good first issue

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.

On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data but this appears to have only a small effect on results (see Section 4).

On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned RoBERTA model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and human performance on the task as reported by [SBBC19] is 94.0%.
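
For reference, a minimal sketch of the "partial evaluation" scoring referenced above (Trinh and Le's method, as used in [RWC+19]): substitute each candidate antecedent for the pronoun and compare the log-likelihood of the remainder of the sentence under each substitution. The HuggingFace GPT-2 model and the schema layout (an underscore placeholder plus a list of candidate options) are assumptions for illustration, not necessarily what the harness will use:

```python
# Sketch of partial-evaluation scoring for Winograd schemas.
# Assumptions: HuggingFace transformers GPT-2, and a hypothetical schema dict
# with a "text" field containing "_" where the pronoun goes and an "options" list.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Log-probability of `continuation` given `prefix`; only continuation tokens are scored."""
    prefix_ids = tokenizer.encode(prefix)
    cont_ids = tokenizer.encode(continuation)
    input_ids = torch.tensor([prefix_ids + cont_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted from the logits at position i - 1.
    for i, tok in enumerate(cont_ids, start=len(prefix_ids)):
        total += log_probs[0, i - 1, tok].item()
    return total

def predict(schema):
    # Hypothetical layout:
    # {"text": "The trophy doesn't fit in the suitcase because _ is too big.",
    #  "options": ["the trophy", "the suitcase"]}
    before, after = schema["text"].split("_")
    scores = [continuation_logprob(before + option, after) for option in schema["options"]]
    return max(range(len(scores)), key=lambda i: scores[i])
```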

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
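
As a rough illustration of the shape such a task might take; the method names and the document fields used here ("text", "pronoun", "pronoun_loc", "options", "label") are assumptions for illustration and should be checked against the actual interface in lm_eval/base.py and the BoolQ example:

```python
# Hypothetical skeleton of a WSC273 task; not the harness's actual interface or data schema.
import json

class WSC273Task:
    DATA_PATH = "data/wsc273/wsc273.json"  # assumed local cache location

    def has_training_docs(self):
        return False  # WSC273 is evaluation-only: 273 schemas, no train/dev splits

    def has_validation_docs(self):
        return False

    def has_test_docs(self):
        return True

    def test_docs(self):
        with open(self.DATA_PATH) as f:
            return json.load(f)

    def doc_to_text(self, doc):
        # Context up to the pronoun, with a candidate substituted in for it.
        return doc["text"][: doc["pronoun_loc"]] + doc["options"][doc["label"]]

    def doc_to_target(self, doc):
        # Remainder of the sentence after the pronoun; under partial evaluation,
        # only these tokens are scored.
        return doc["text"][doc["pronoun_loc"] + len(doc["pronoun"]):]
```

Comparing the two candidate substitutions (as in the partial-evaluation sketch above) would then be wired through the harness's request-construction and metric hooks rather than through doc_to_target alone.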

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 5, 2020

I'm having trouble finding an official source for this, but it appears that this may be a BERT version of the WSC273 dataset:

https://github.com/vid-koci/bert-commonsense/blob/master/data/wsc273.txt

cfoster0 (Contributor) commented

The original was hosted in a Google Cloud Storage bucket (gs://commonsense-reasoning/reproduce/commonsense_test/wsc273.json), but unfortunately that is no longer available.

It likely still exists at the link below; the examples from Trinh and Le's paper are included.
https://git.cse.msu.edu/bakerb15/nlp-final-project/raw/master/Winogard/reproduce/commonsense_test/wsc273.json
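
If that mirror is still live (unverified), a minimal sketch for fetching and caching it could look like the following; the local cache path is an arbitrary choice and nothing is assumed about the JSON schema:

```python
# Sketch: download and cache the mirrored wsc273.json. The mirror URL is taken
# from the link above and may not remain available; the cache path is hypothetical.
import json
import os
import requests

URL = ("https://git.cse.msu.edu/bakerb15/nlp-final-project/raw/master/"
       "Winogard/reproduce/commonsense_test/wsc273.json")
CACHE = "data/wsc273/wsc273.json"

def fetch_wsc273(url=URL, cache=CACHE):
    if not os.path.exists(cache):
        os.makedirs(os.path.dirname(cache), exist_ok=True)
        resp = requests.get(url)
        resp.raise_for_status()
        with open(cache, "w") as f:
            f.write(resp.text)
    with open(cache) as f:
        return json.load(f)
```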

StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
StellaAthena linked a pull request Oct 23, 2020 that will close this issue
Implementing Evaluations automation moved this from To do to Data integrated, Eval not done Oct 23, 2020
cfoster0 (Contributor) commented

Reopening. In case anyone else ends up confused (as we were), there are 4 different Winograd schema datasets that will be in this harness:

  1. WSC273, which is a set of 273 Winograd schemas. That's what this issue is about.
  2. Winogrande (XL), which is a set of 44k Winograd schemas obtained through an adversarial mining scheme. Add Winogrande dataset #45
  3. WNLI from the GLUE benchmark, which recasts the whole thing as an NLI task. LM Eval Refactor; GPT-3; GLUE tasks #3
  4. WSC from the SuperGLUE benchmark, which pulls from the original set plus some examples from an organization called Commonsense Reasoning. SuperGLUE part 1 #4

StellaAthena reopened this Jan 5, 2021
StellaAthena added the feature request and good first issue labels Jan 5, 2021
leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
leogao2 changed the title from "Implement the Winograd Schemas Challenge evaluation" to "Implement the WSC273 Winograd Schemas Challenge evaluation" Jan 28, 2021
leogao2 moved this from To do, Evaluations to Implement to Done in Implementing Evaluations Feb 3, 2021
leogao2 closed this as completed Feb 3, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
Added null prompt support for T5 & Added BLIMP task template
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024