
Implement the CoQA evaluation #17

Closed
1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 1 comment · Fixed by #1 or #53
Assignees
Labels
feature request A feature that isn't implemented yet. good first issue Good for newcomers

Comments

@StellaAthena
Member

StellaAthena commented Sep 16, 2020

From the GPT-3 paper

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

- [x] Data processing code implemented
- [ ] Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
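
For anyone picking this up, here is a rough, self-contained sketch of the prompt-construction side of such a task. The class and method names (`CoQATask`, `doc_to_text`, `doc_to_target`) only illustrate the kind of interface described above; they are not the actual base class in lm_eval/base.py. The CoQA doc layout is assumed to follow the official JSON release (a `story` string plus parallel `questions`/`answers` lists).

```python
# Illustrative sketch only -- not the harness's actual Task API at this commit.
class CoQATask:
    """Builds few-shot style prompts from CoQA documents.

    A "doc" is assumed to be a dict with a "story" string and parallel
    "questions"/"answers" lists, following the official CoQA JSON layout.
    """

    def doc_to_text(self, doc, turn_id):
        # Concatenate the passage with all previous question/answer turns,
        # then pose the current question and leave the answer for the model.
        prompt = doc["story"] + "\n\n"
        for q, a in zip(doc["questions"][:turn_id], doc["answers"][:turn_id]):
            prompt += f"Q: {q}\n\nA: {a}\n\n"
        prompt += f"Q: {doc['questions'][turn_id]}\n\nA:"
        return prompt

    def doc_to_target(self, doc, turn_id):
        # Gold answer for the current conversational turn.
        return " " + doc["answers"][turn_id]


if __name__ == "__main__":
    doc = {
        "story": "CoQA is a conversational question answering dataset.",
        "questions": ["What is CoQA?", "What kind of dataset is it?"],
        "answers": ["a dataset", "conversational question answering"],
    }
    print(CoQATask().doc_to_text(doc, turn_id=1))
```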

@StellaAthena StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
@anishthite anishthite moved this from To do to In progress in Implementing Evaluations Sep 17, 2020
@anishthite
Member

Creating docs and doc->text is done for CoQA; the only thing left to complete is the evaluate function in https://github.com/EleutherAI/lm_evaluation_harness/blob/master/lm_eval/tasks/coqa.py
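
For reference, the metric the missing evaluate function would need is the standard CoQA/SQuAD-style macro-averaged token F1. The snippet below is a simplified, self-contained sketch of that scoring logic, not the code in coqa.py or the official CoQA script.

```python
# Simplified sketch of SQuAD/CoQA-style token-level F1 scoring.
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Articles and punctuation are ignored, so this scores 1.0.
    print(f1_score("the conversational QA dataset", "a conversational QA dataset"))
```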

@leogao2 leogao2 moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Sep 30, 2020
@StellaAthena StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020
@StellaAthena StellaAthena linked a pull request Oct 23, 2020 that will close this issue
@anishthite anishthite linked a pull request Oct 24, 2020 that will close this issue
@StellaAthena StellaAthena moved this from Data integrated, Eval not done to Done in Implementing Evaluations Oct 26, 2020
@leogao2 leogao2 moved this from Done to Data integrated, Eval not done in Implementing Evaluations Dec 1, 2020
@StellaAthena StellaAthena reopened this Jan 5, 2021
@StellaAthena StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021
@leogao2 leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
@thefazzer thefazzer self-assigned this Jan 31, 2021
@thefazzer thefazzer moved this from To do to In Progress in Implementing Evaluations Jan 31, 2021
@leogao2 leogao2 moved this from In Progress to Done, evaluations in Implementing Evaluations Feb 8, 2021
@leogao2 leogao2 moved this from Done, evaluations to Deferred Pending Generation in Implementing Evaluations Feb 8, 2021
@leogao2 leogao2 moved this from Deferred to In Progress in Implementing Evaluations Feb 11, 2021
@leogao2 leogao2 closed this as completed Feb 14, 2021
Implementing Evaluations automation moved this from In Progress to Done, evaluations Feb 14, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024