Implement the SuperGLUE evaluation #22

Open
1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 9 comments · Fixed by #1, #3, #4, #85 or #91
Labels: feature request (A feature that isn't implemented yet.)

Comments

@StellaAthena
Member

StellaAthena commented Sep 16, 2020

From the GPT-3 paper

In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated. We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5).

On WSC, performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting. WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark): GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.

Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of K, we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score.
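For concreteness, the few-shot sampling procedure described in the excerpt reduces to something like the sketch below: draw K = 32 examples at random from the training set, resampling per problem for most tasks but fixing one set for WSC and MultiRC. The helper names and example fields here are illustrative only, not taken from the paper's code or the harness.

```python
import random

def format_example(example):
    # Hypothetical prompt formatter; real prompts are task-specific.
    return f"{example['input']}\nAnswer: {example['label']}"

def fewshot_contexts(train_set, eval_set, k=32, resample_per_problem=True, seed=0):
    """Yield (context, problem) pairs following the sampling scheme quoted above."""
    rng = random.Random(seed)
    fixed_shots = rng.sample(train_set, k)  # single fixed draw, reused for WSC/MultiRC
    for problem in eval_set:
        shots = rng.sample(train_set, k) if resample_per_problem else fixed_shots
        context = "\n\n".join(format_example(s) for s in shots)
        yield context, problem
```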

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled on the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
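As a rough starting point, a task in that style might look like the sketch below for RTE, which scores each example by comparing the log-likelihoods the model assigns to the candidate answers. This is only an illustration: the base class, the method names, and the lm.loglikelihood call are assumptions, not the actual interface in lm_eval/base.py.

```python
class RTETask:
    """Illustrative sketch of a SuperGLUE task; not the real harness interface."""

    def __init__(self, data):
        self.data = data  # e.g. {"train": [...], "validation": [...]}

    def doc_to_text(self, doc):
        return f"{doc['premise']}\nQuestion: {doc['hypothesis']} True or False?\nAnswer:"

    def doc_to_target(self, doc):
        # SuperGLUE RTE labels: 0 = entailment ("True"), 1 = not entailment ("False").
        return " True" if doc["label"] == 0 else " False"

    def evaluate(self, lm):
        # Assumes lm.loglikelihood(context, continuation) returns a float score.
        docs = self.data["validation"]
        correct = 0
        for doc in docs:
            ctx = self.doc_to_text(doc)
            scores = [lm.loglikelihood(ctx, cand) for cand in (" True", " False")]
            pred = 0 if scores[0] > scores[1] else 1
            correct += int(pred == doc["label"])
        return {"acc": correct / len(docs)}
```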

@StellaAthena StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
This was linked to pull requests Sep 16, 2020
@StellaAthena StellaAthena moved this from To do to Review in progress in Implementing Evaluations Sep 16, 2020
@StellaAthena StellaAthena moved this from Review in progress to In progress in Implementing Evaluations Sep 16, 2020
@zphang zphang mentioned this issue Oct 5, 2020
@StellaAthena StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020
Implementing Evaluations automation moved this from In progress to Data integrated, Eval not done Oct 23, 2020
@StellaAthena StellaAthena moved this from Data integrated, Eval not done to Done in Implementing Evaluations Oct 26, 2020
@StellaAthena StellaAthena moved this from Done to Data integrated, Eval not done in Implementing Evaluations Oct 26, 2020
@StellaAthena
Member Author

It looks like almost all of the eval code is written; we are just missing RTE. Once that's done, we can move this to Done.

@StellaAthena StellaAthena reopened this Oct 26, 2020
Implementing Evaluations automation moved this from Data integrated, Eval not done to In progress Oct 26, 2020
@anishthite
Member

@StellaAthena
Member Author

@anishthite Right, but that is missing the evaluation code. The rest of the SuperGLUE tasks have their evaluation code written.

@StellaAthena StellaAthena moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 26, 2020
@anishthite
Member

Sorry, I missed the evaluation part

@StellaAthena StellaAthena reopened this Jan 5, 2021
@StellaAthena StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021
@thefazzer
Contributor

Happy to take the SuperGLUE implementation

@thefazzer thefazzer moved this from Data integrated, Eval not done to In progress in Implementing Evaluations Jan 11, 2021
@StellaAthena StellaAthena linked a pull request Jan 21, 2021 that will close this issue
@StellaAthena StellaAthena moved this from In progress to Done in Implementing Evaluations Jan 21, 2021
@StellaAthena StellaAthena moved this from Done to Data integrated, Eval not done in Implementing Evaluations Jan 21, 2021
@StellaAthena StellaAthena removed the good first issue Good for newcomers label Jan 21, 2021
@zphang
Contributor

zphang commented Jan 26, 2021

Updated with #91

  • BoolQ
  • Commitment Bank
  • COPA
  • MultiRC
  • Words in Context
  • SG Winograd Schema Challenge (requires free-form generation)
  • RTE (GLUE)
  • ReCoRD

@StellaAthena StellaAthena linked a pull request Jan 26, 2021 that will close this issue
@StellaAthena
Member Author

Closing for now as free-form generation is a future problem

Implementing Evaluations automation moved this from In Progress to Done Jan 26, 2021
@StellaAthena StellaAthena moved this from Done to Deferred Pending Generation in Implementing Evaluations Jan 26, 2021
@leogao2 leogao2 moved this from Deferred Pending Generation to Done in Implementing Evaluations Jan 28, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
Add `axg` and `axb` to SuperGLUE
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
@pminervini
Contributor

> @StellaAthena
> Closing for now as free-form generation is a future problem

I think the harness has free-form generation now, right?

@StellaAthena
Member Author

> I think the harness has free-form generation now, right?

Yes, we could implement free-form generation now.
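If free-form generation were wired in for WSC, the evaluation would presumably reduce to a generate-then-compare loop along these lines; the lm.generate signature and the document fields are assumptions for illustration, not the harness's actual API:

```python
def freeform_accuracy(lm, docs, max_tokens=32):
    # Hypothetical generate-then-compare loop; lm.generate(...) and the
    # "prompt"/"answer" fields are assumptions, not the harness API.
    correct = 0
    for doc in docs:
        pred = lm.generate(doc["prompt"], max_tokens=max_tokens, stop=["\n"])
        correct += int(pred.strip().lower() == doc["answer"].strip().lower())
    return correct / len(docs)
```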

@StellaAthena StellaAthena reopened this Mar 11, 2024
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024