Implement the SuperGLUE evaluation #22

Open
1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 9 comments · Fixed by #1, #3, #4, #85 or #91
Labels: feature request (A feature that isn't implemented yet.)

Comments

@StellaAthena
Member

StellaAthena commented Sep 16, 2020

From the GPT-3 paper

In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated. We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5).

On WSC, performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting. WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark): GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.

Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of K, we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score.
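For concreteness, the few-shot sampling procedure described in the excerpt reduces to something like the sketch below: draw K = 32 examples at random from the training set, resampling per problem for most tasks but fixing one set for WSC and MultiRC. The helper names and example fields here are illustrative only, not taken from the paper's code or the harness.

```python
import random

def format_example(example):
    # Hypothetical prompt formatter; real prompts are task-specific.
    return f"{example['input']}\nAnswer: {example['label']}"

def fewshot_contexts(train_set, eval_set, k=32, resample_per_problem=True, seed=0):
    """Yield (context, problem) pairs following the sampling scheme quoted above."""
    rng = random.Random(seed)
    fixed_shots = rng.sample(train_set, k)  # single fixed draw, reused for WSC/MultiRC
    for problem in eval_set:
        shots = rng.sample(train_set, k) if resample_per_problem else fixed_shots
        context = "\n\n".join(format_example(s) for s in shots)
        yield context, problem
```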

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled on the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
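As a rough starting point, a task in that style might look like the sketch below for RTE, which scores each example by comparing the log-likelihoods the model assigns to the candidate answers. This is only an illustration: the base class, the method names, and the lm.loglikelihood call are assumptions, not the actual interface in lm_eval/base.py.

```python
class RTETask:
    """Illustrative sketch of a SuperGLUE task; not the real harness interface."""

    def __init__(self, data):
        self.data = data  # e.g. {"train": [...], "validation": [...]}

    def doc_to_text(self, doc):
        return f"{doc['premise']}\nQuestion: {doc['hypothesis']} True or False?\nAnswer:"

    def doc_to_target(self, doc):
        # SuperGLUE RTE labels: 0 = entailment ("True"), 1 = not entailment ("False").
        return " True" if doc["label"] == 0 else " False"

    def evaluate(self, lm):
        # Assumes lm.loglikelihood(context, continuation) returns a float score.
        docs = self.data["validation"]
        correct = 0
        for doc in docs:
            ctx = self.doc_to_text(doc)
            scores = [lm.loglikelihood(ctx, cand) for cand in (" True", " False")]
            pred = 0 if scores[0] > scores[1] else 1
            correct += int(pred == doc["label"])
        return {"acc": correct / len(docs)}
```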

@StellaAthena StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
This was linked to pull requests Sep 16, 2020
@StellaAthena StellaAthena moved this from To do to Review in progress in Implementing Evaluations Sep 16, 2020
@StellaAthena StellaAthena moved this from Review in progress to In progress in Implementing Evaluations Sep 16, 2020
@zphang zphang mentioned this issue Oct 5, 2020
@StellaAthena StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020
Implementing Evaluations automation moved this from In progress to Data integrated, Eval not done Oct 23, 2020
@StellaAthena StellaAthena moved this from Data integrated, Eval not done to Done in Implementing Evaluations Oct 26, 2020
@StellaAthena StellaAthena moved this from Done to Data integrated, Eval not done in Implementing Evaluations Oct 26, 2020
@StellaAthena
Member Author

It looks like almost all of the eval code is written; we are just missing RTE. Once that's done, we can move this to Done.

@StellaAthena StellaAthena reopened this Oct 26, 2020
Implementing Evaluations automation moved this from Data integrated, Eval not done to In progress Oct 26, 2020
@anishthite
Member

@StellaAthena
Member Author

@anishthite Right, but that is missing the evaluation code. The rest of the SuperGLUE tasks have their evaluation code written.

@StellaAthena StellaAthena moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 26, 2020
@anishthite
Member

Sorry, I missed the evaluation part

@StellaAthena StellaAthena reopened this Jan 5, 2021
@StellaAthena StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021
@thefazzer
Contributor

Happy to take the SuperGLUE implementation

@thefazzer thefazzer moved this from Data integrated, Eval not done to In progress in Implementing Evaluations Jan 11, 2021
@StellaAthena StellaAthena linked a pull request Jan 21, 2021 that will close this issue
@StellaAthena StellaAthena moved this from In progress to Done in Implementing Evaluations Jan 21, 2021
@StellaAthena StellaAthena moved this from Done to Data integrated, Eval not done in Implementing Evaluations Jan 21, 2021
@StellaAthena StellaAthena removed the good first issue Good for newcomers label Jan 21, 2021
@zphang
Contributor

zphang commented Jan 26, 2021

Updated with #91

  • BoolQ
  • Commitment Bank
  • COPA
  • MultiRC
  • Words in Context
  • SG Winograd Schema Challenge (requires free-form generation)
  • RTE (GLUE)
  • ReCoRD

@StellaAthena StellaAthena linked a pull request Jan 26, 2021 that will close this issue
@StellaAthena
Member Author

Closing for now as free-form generation is a future problem

Implementing Evaluations automation moved this from In Progress to Done Jan 26, 2021
@StellaAthena StellaAthena moved this from Done to Deferred Pending Generation in Implementing Evaluations Jan 26, 2021
@leogao2 leogao2 moved this from Deferred Pending Generation to Done in Implementing Evaluations Jan 28, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
Add `axg` and `axb` to SuperGLUE
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
@pminervini
Contributor

> @StellaAthena
> Closing for now as free-form generation is a future problem

I think the harness has free-form generation now, right?

@StellaAthena
Member Author

> I think the harness has free-form generation now, right?

Yes, we could implement free-form generation now.
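If free-form generation were wired in for WSC, the evaluation would presumably reduce to a generate-then-compare loop along these lines; the lm.generate signature and the document fields are assumptions for illustration, not the harness's actual API:

```python
def freeform_accuracy(lm, docs, max_tokens=32):
    # Hypothetical generate-then-compare loop; lm.generate(...) and the
    # "prompt"/"answer" fields are assumptions, not the harness API.
    correct = 0
    for doc in docs:
        pred = lm.generate(doc["prompt"], max_tokens=max_tokens, stop=["\n"])
        correct += int(pred.strip().lower() == doc["answer"].strip().lower())
    return correct / len(docs)
```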

@StellaAthena StellaAthena reopened this Mar 11, 2024
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024