Implement the adversarially-mined Winogrande evaluation #13

StellaAthena · 2020-09-16T16:40:38Z

From the GPT-3 paper:

The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.
On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data but this appears to have only a small effect on results (see Section 4).
On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned RoBERTA model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and human performance on the task as reported by [SBBC19] is 94.0%.

Data processing code implemented
Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py

cfoster0 · 2020-10-01T04:28:29Z

Note: HuggingFace includes this in its datasets package.

https://huggingface.co/datasets/winogrande

cfoster0 · 2020-10-22T04:30:45Z

The evaluation format is different for this task, since it uses the "partial evaluation" method: the network rates the probability of the context with the missing word/phrase filled in.
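
As a rough illustration of the idea, here is a minimal sketch of one way partial evaluation could be set up. It makes several assumptions: that each record has `sentence`, `option1`, and `option2` fields with `_` marking the blank, that scoring compares the likelihood of the shared text after the blank given each filled-in prefix, and that a `loglikelihood(context, continuation)` callable is available. The helper names, the toy scorer, and the example record are all hypothetical and not taken from the harness or the dataset.

```python
from typing import Callable, Dict, Tuple

# Hypothetical signature: (context, continuation) -> log P(continuation | context).
LogLikelihoodFn = Callable[[str, str], float]


def split_on_blank(sentence: str, option: str) -> Tuple[str, str]:
    """Fill the "_" blank with `option`; return (filled prefix, shared suffix)."""
    idx = sentence.index("_")
    prefix = sentence[:idx] + option
    suffix = sentence[idx + 1:]
    return prefix, suffix


def partial_eval(doc: Dict[str, str], loglikelihood: LogLikelihoodFn) -> str:
    """Return "1" or "2": the option whose filled prefix makes the suffix most likely."""
    scores = []
    for key in ("option1", "option2"):
        prefix, suffix = split_on_blank(doc["sentence"], doc[key])
        scores.append(loglikelihood(prefix, suffix))
    return "1" if scores[0] >= scores[1] else "2"


def toy_loglikelihood(context: str, continuation: str) -> float:
    """Toy stand-in for a language model (assumption, for demonstration only)."""
    # Pretend the model finds the suffix more plausible after "suitcase".
    return -1.0 if context.endswith("suitcase") else -5.0


# Illustrative record in the assumed format (not copied from the dataset).
doc = {
    "sentence": "The trophy didn't fit in the suitcase because the _ was too small.",
    "option1": "trophy",
    "option2": "suitcase",
}

prediction = partial_eval(doc, toy_loglikelihood)
```

With a real model, `toy_loglikelihood` would be replaced by a call that sums the token log-probabilities of the continuation given the context. Scoring only the shared suffix keeps the comparison independent of how likely each option word is on its own.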

StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020

cfoster0 mentioned this issue Oct 22, 2020

Add Winogrande dataset #45

Merged

StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020

StellaAthena linked a pull request Oct 23, 2020 that will close this issue

Add Winogrande dataset #45

Merged

StellaAthena closed this as completed in #45 Oct 23, 2020

StellaAthena reopened this Jan 5, 2021

StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021

jon-tow mentioned this issue Feb 3, 2021

Implement Winogrande evaluation #123

Merged

jon-tow closed this as completed Feb 3, 2021

leogao2 assigned jon-tow Feb 8, 2021

StellaAthena pushed a commit that referenced this issue Apr 29, 2022

Merge pull request #13 from Shashi456/mlsum

c166e26

Fix tasks for GEM/mlsum

haileyschoelkopf added a commit that referenced this issue Apr 19, 2023

Merge pull request #13 from EleutherAI/config-task

407c272

Config task

qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023

Merge pull request EleutherAI#13 from Shashi456/mlsum

d1d655b

Fix tasks for GEM/mlsum

LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023

Merge pull request EleutherAI#13 from Shashi456/mlsum

18fc9f8

Fix tasks for GEM/mlsum

lintangsutawika pushed a commit that referenced this issue Jul 8, 2024

Merge pull request #13 from JessicaOjo/africamgsm

7d51b9a

remove limit in script
