
Implement the WSC273 Winograd Schemas Challenge evaluation #12

Closed · 2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 3 comments · Fixed by #4
Labels: feature request, good first issue

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.

On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data but this appears to have only a small effect on results (see Section 4).

On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned RoBERTA model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and human performance on the task as reported by [SBBC19] is 94.0%.
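
For reference, a minimal sketch of the "partial evaluation" scoring referenced above (Trinh and Le's method, as used in [RWC+19]): substitute each candidate antecedent for the pronoun and compare the log-likelihood of the remainder of the sentence under each substitution. The HuggingFace GPT-2 model and the schema layout (an underscore placeholder plus a list of candidate options) are assumptions for illustration, not necessarily what the harness will use:

```python
# Sketch of partial-evaluation scoring for Winograd schemas.
# Assumptions: HuggingFace transformers GPT-2, and a hypothetical schema dict
# with a "text" field containing "_" where the pronoun goes and an "options" list.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Log-probability of `continuation` given `prefix`; only continuation tokens are scored."""
    prefix_ids = tokenizer.encode(prefix)
    cont_ids = tokenizer.encode(continuation)
    input_ids = torch.tensor([prefix_ids + cont_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted from the logits at position i - 1.
    for i, tok in enumerate(cont_ids, start=len(prefix_ids)):
        total += log_probs[0, i - 1, tok].item()
    return total

def predict(schema):
    # Hypothetical layout:
    # {"text": "The trophy doesn't fit in the suitcase because _ is too big.",
    #  "options": ["the trophy", "the suitcase"]}
    before, after = schema["text"].split("_")
    scores = [continuation_logprob(before + option, after) for option in schema["options"]]
    return max(range(len(scores)), key=lambda i: scores[i])
```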

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
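
As a rough illustration of the shape such a task might take; the method names and the document fields used here ("text", "pronoun", "pronoun_loc", "options", "label") are assumptions for illustration and should be checked against the actual interface in lm_eval/base.py and the BoolQ example:

```python
# Hypothetical skeleton of a WSC273 task; not the harness's actual interface or data schema.
import json

class WSC273Task:
    DATA_PATH = "data/wsc273/wsc273.json"  # assumed local cache location

    def has_training_docs(self):
        return False  # WSC273 is evaluation-only: 273 schemas, no train/dev splits

    def has_validation_docs(self):
        return False

    def has_test_docs(self):
        return True

    def test_docs(self):
        with open(self.DATA_PATH) as f:
            return json.load(f)

    def doc_to_text(self, doc):
        # Context up to the pronoun, with a candidate substituted in for it.
        return doc["text"][: doc["pronoun_loc"]] + doc["options"][doc["label"]]

    def doc_to_target(self, doc):
        # Remainder of the sentence after the pronoun; under partial evaluation,
        # only these tokens are scored.
        return doc["text"][doc["pronoun_loc"] + len(doc["pronoun"]):]
```

Comparing the two candidate substitutions (as in the partial-evaluation sketch above) would then be wired through the harness's request-construction and metric hooks rather than through doc_to_target alone.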

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 5, 2020

I'm having trouble finding an official source for this, but it appears that this may be a BERT version of the WSC273 dataset:

https://github.com/vid-koci/bert-commonsense/blob/master/data/wsc273.txt

cfoster0 (Contributor) commented

The original was hosted in a Google Cloud Storage bucket (gs://commonsense-reasoning/reproduce/commonsense_test/wsc273.json), but unfortunately that is no longer available.

It likely still exists at the link below; the examples from Trinh and Le's paper are included.
https://git.cse.msu.edu/bakerb15/nlp-final-project/raw/master/Winogard/reproduce/commonsense_test/wsc273.json
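
If that mirror is still live (unverified), a minimal sketch for fetching and caching it could look like the following; the local cache path is an arbitrary choice and nothing is assumed about the JSON schema:

```python
# Sketch: download and cache the mirrored wsc273.json. The mirror URL is taken
# from the link above and may not remain available; the cache path is hypothetical.
import json
import os
import requests

URL = ("https://git.cse.msu.edu/bakerb15/nlp-final-project/raw/master/"
       "Winogard/reproduce/commonsense_test/wsc273.json")
CACHE = "data/wsc273/wsc273.json"

def fetch_wsc273(url=URL, cache=CACHE):
    if not os.path.exists(cache):
        os.makedirs(os.path.dirname(cache), exist_ok=True)
        resp = requests.get(url)
        resp.raise_for_status()
        with open(cache, "w") as f:
            f.write(resp.text)
    with open(cache) as f:
        return json.load(f)
```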

StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
StellaAthena linked a pull request Oct 23, 2020 that will close this issue
Implementing Evaluations automation moved this from To do to Data integrated, Eval not done Oct 23, 2020
cfoster0 (Contributor) commented

Reopening. In case anyone else ends up confused (as we were), there are 4 different Winograd schema datasets that will be in this harness:

  1. WSC273, which is a set of 273 Winograd schemas. That's what this issue is about.
  2. Winogrande (XL), which is a set of 44k Winograd schemas obtained through an adversarial mining scheme. Add Winogrande dataset #45
  3. WNLI from the GLUE benchmark, which recasts the whole thing as an NLI task. LM Eval Refactor; GPT-3; GLUE tasks #3
  4. WSC from the SuperGLUE benchmark, which pulls from the original set plus some examples from an organization called Commonsense Reasoning. SuperGLUE part 1 #4

StellaAthena reopened this Jan 5, 2021
StellaAthena added the feature request and good first issue labels Jan 5, 2021
leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
leogao2 changed the title from "Implement the Winograd Schemas Challenge evaluation" to "Implement the WSC273 Winograd Schemas Challenge evaluation" Jan 28, 2021
leogao2 moved this from To do, Evaluations to Implement to Done in Implementing Evaluations Feb 3, 2021
leogao2 closed this as completed Feb 3, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
Added null prompt support for T5 & Added BLIMP task template
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024