
Implement the DROP evaluation #19

Closed · 1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 6 comments · Fixed by #53
Labels: feature request · good first issue

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18] a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school english examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

  • [x] Data processing code implemented
  • [ ] Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
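
For anyone picking this up, here is a minimal sketch of what the task class could look like, assuming an interface along the lines of the BoolQ example; the method names and document fields below are illustrative and should be checked against lm_eval/base.py:

```python
# Hypothetical sketch of a DROP task, modeled loosely on the BoolQ example.
# Method names and doc fields are assumptions, not the harness's confirmed API.
from lm_eval.base import Task


class DROP(Task):
    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def doc_to_text(self, doc):
        # Passage followed by the question; the model generates the answer freely.
        return f"{doc['passage']}\nQuestion: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # DROP gold answers may be a number, one or more spans, or a date;
        # here we assume a single canonical answer string has been precomputed.
        return " " + doc["answer"]
```

The evaluation itself would then score the generated string against all gold answers with DROP's exact-match and token-level F1 metrics.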

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 1, 2020

Note: HuggingFace includes this in its datasets package.

https://huggingface.co/datasets/drop
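
A quick way to pull it in (split and field names as shown on the dataset card):

```python
from datasets import load_dataset

# Load the DROP dataset from the HuggingFace Hub.
drop = load_dataset("drop")

# Inspect one training example; "passage" and "question" are field names
# taken from the dataset card.
example = drop["train"][0]
print(example["passage"])
print(example["question"])
```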

anishthite self-assigned this Oct 2, 2020
anishthite moved this from To do to In progress in Implementing Evaluations Oct 2, 2020
anishthite moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 4, 2020
anishthite (Member) commented Oct 4, 2020

Added in ec4d361. We ended up not using HuggingFace's implementation, since it only includes answer spans as labels and leaves out a large share of the gold answers.
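
For illustration, extracting every gold answer type from the original DROP JSON looks roughly like this; the number / spans / date layout follows the official data format, and the helper name is ours:

```python
# Hypothetical helper: collect all gold answers (numbers, spans, dates)
# from an answer dict in the original DROP JSON format.
def answers_from_drop(answer_dict):
    answers = []
    if answer_dict.get("number"):
        answers.append(str(answer_dict["number"]))
    if answer_dict.get("spans"):
        answers.extend(answer_dict["spans"])
    date = answer_dict.get("date", {})
    date_str = " ".join(
        part for part in (date.get("day"), date.get("month"), date.get("year")) if part
    )
    if date_str:
        answers.append(date_str)
    return answers
```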

anishthite (Member) commented

Also, should we prepend 'Passage: ' to each passage? It seems OpenAI did not do this; I'm not sure whether that was intentional.

StellaAthena (Member, Author) commented

My vote is “no” because I see no reason to do so.
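
So each in-context example would just be rendered as (placeholders only, no 'Passage: ' prefix):

```
<passage text>
Question: <question text>
Answer: <answer text>
```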

StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
StellaAthena (Member, Author) commented

@anishthite I don't see a line for this dataset in lm_eval/tasks/__init__.py. Am I missing something, or is this not quite finished?
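
(For reference, wiring it up would presumably just mean adding an entry to the task registry in lm_eval/tasks/__init__.py; the module name, class name, and registry variable below are assumptions to be checked against the file:)

```python
# Assumed shape of the registry in lm_eval/tasks/__init__.py; the exact
# variable name and import style should be verified against the repo.
from . import drop

TASK_REGISTRY = {
    # ... existing tasks ...
    "drop": drop.DROP,
}
```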

anishthite linked a pull request Oct 24, 2020 that will close this issue
StellaAthena reopened this Jan 5, 2021
StellaAthena added the feature request and good first issue labels Jan 5, 2021
leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
jon-tow self-assigned this Feb 7, 2021
jon-tow moved this from To do, Evaluations to Implement to Deferred Pending Generation in Implementing Evaluations Feb 8, 2021
leogao2 moved this from Deferred to To do, Evaluations to Implement in Implementing Evaluations Feb 11, 2021
jon-tow moved this from To do, Evaluations to Implement to In Progress in Implementing Evaluations Feb 12, 2021
StellaAthena (Member, Author) commented

@jon-tow How is the work on this coming?

leogao2 closed this as completed Mar 7, 2021
Implementing Evaluations automation moved this from In Progress to Done, evaluations Mar 7, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022: Add E2E NLG Cleaned, update required Transformers version