
Implement the DROP evaluation #19

Closed · 1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 6 comments · Fixed by #53
Labels: feature request · good first issue

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18] a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school english examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

  • [x] Data processing code implemented
  • [ ] Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
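
For anyone picking this up, here is a minimal sketch of what the task class could look like, assuming an interface along the lines of the BoolQ example; the method names and document fields below are illustrative and should be checked against lm_eval/base.py:

```python
# Hypothetical sketch of a DROP task, modeled loosely on the BoolQ example.
# Method names and doc fields are assumptions, not the harness's confirmed API.
from lm_eval.base import Task


class DROP(Task):
    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def doc_to_text(self, doc):
        # Passage followed by the question; the model generates the answer freely.
        return f"{doc['passage']}\nQuestion: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # DROP gold answers may be a number, one or more spans, or a date;
        # here we assume a single canonical answer string has been precomputed.
        return " " + doc["answer"]
```

The evaluation itself would then score the generated string against all gold answers with DROP's exact-match and token-level F1 metrics.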

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 1, 2020

Note: HuggingFace includes this in its datasets package.

https://huggingface.co/datasets/drop
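
A quick way to pull it in (split and field names as shown on the dataset card):

```python
from datasets import load_dataset

# Load the DROP dataset from the HuggingFace Hub.
drop = load_dataset("drop")

# Inspect one training example; "passage" and "question" are field names
# taken from the dataset card.
example = drop["train"][0]
print(example["passage"])
print(example["question"])
```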

anishthite self-assigned this Oct 2, 2020
anishthite moved this from To do to In progress in Implementing Evaluations Oct 2, 2020
anishthite moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 4, 2020
anishthite (Member) commented Oct 4, 2020

Added in ec4d361. We ended up not using HuggingFace's implementation, since it only includes answer spans as labels and leaves out a large share of the gold answers.
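
For illustration, extracting every gold answer type from the original DROP JSON looks roughly like this; the number / spans / date layout follows the official data format, and the helper name is ours:

```python
# Hypothetical helper: collect all gold answers (numbers, spans, dates)
# from an answer dict in the original DROP JSON format.
def answers_from_drop(answer_dict):
    answers = []
    if answer_dict.get("number"):
        answers.append(str(answer_dict["number"]))
    if answer_dict.get("spans"):
        answers.extend(answer_dict["spans"])
    date = answer_dict.get("date", {})
    date_str = " ".join(
        part for part in (date.get("day"), date.get("month"), date.get("year")) if part
    )
    if date_str:
        answers.append(date_str)
    return answers
```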

anishthite (Member) commented

Also, should we prepend 'Passage: ' to each passage? It seems OpenAI did not do this; I'm not sure whether that was intentional.

StellaAthena (Member, Author) commented

My vote is “no” because I see no reason to do so.
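
So each in-context example would just be rendered as (placeholders only, no 'Passage: ' prefix):

```
<passage text>
Question: <question text>
Answer: <answer text>
```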

StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
StellaAthena (Member, Author) commented

@anishthite I don't see a line for this dataset in lm_eval/tasks/__init__.py. Am I missing something, or is this not quite finished?
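
(For reference, wiring it up would presumably just mean adding an entry to the task registry in lm_eval/tasks/__init__.py; the module name, class name, and registry variable below are assumptions to be checked against the file:)

```python
# Assumed shape of the registry in lm_eval/tasks/__init__.py; the exact
# variable name and import style should be verified against the repo.
from . import drop

TASK_REGISTRY = {
    # ... existing tasks ...
    "drop": drop.DROP,
}
```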

anishthite linked a pull request Oct 24, 2020 that will close this issue
StellaAthena reopened this Jan 5, 2021
StellaAthena added the feature request and good first issue labels Jan 5, 2021
leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
jon-tow self-assigned this Feb 7, 2021
jon-tow moved this from To do, Evaluations to Implement to Deferred Pending Generation in Implementing Evaluations Feb 8, 2021
leogao2 moved this from Deferred to To do, Evaluations to Implement in Implementing Evaluations Feb 11, 2021
jon-tow moved this from To do, Evaluations to Implement to In Progress in Implementing Evaluations Feb 12, 2021
StellaAthena (Member, Author) commented

@jon-tow How is the work on this coming?

leogao2 closed this as completed Mar 7, 2021
Implementing Evaluations automation moved this from In Progress to Done, evaluations Mar 7, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022: Add E2E NLG Cleaned, update required Transformers version