
Implement the CoQA evaluation #17

Closed
1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 1 comment · Fixed by #1 or #53
Assignees
Labels
feature request A feature that isn't implemented yet. good first issue Good for newcomers

Comments

@StellaAthena
Member

StellaAthena commented Sep 16, 2020

From the GPT-3 paper

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

- [x] Data processing code implemented
- [ ] Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
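
For anyone picking this up, here is a rough, self-contained sketch of the prompt-construction side of such a task. The class and method names (`CoQATask`, `doc_to_text`, `doc_to_target`) only illustrate the kind of interface described above; they are not the actual base class in lm_eval/base.py. The CoQA doc layout is assumed to follow the official JSON release (a `story` string plus parallel `questions`/`answers` lists).

```python
# Illustrative sketch only -- not the harness's actual Task API at this commit.
class CoQATask:
    """Builds few-shot style prompts from CoQA documents.

    A "doc" is assumed to be a dict with a "story" string and parallel
    "questions"/"answers" lists, following the official CoQA JSON layout.
    """

    def doc_to_text(self, doc, turn_id):
        # Concatenate the passage with all previous question/answer turns,
        # then pose the current question and leave the answer for the model.
        prompt = doc["story"] + "\n\n"
        for q, a in zip(doc["questions"][:turn_id], doc["answers"][:turn_id]):
            prompt += f"Q: {q}\n\nA: {a}\n\n"
        prompt += f"Q: {doc['questions'][turn_id]}\n\nA:"
        return prompt

    def doc_to_target(self, doc, turn_id):
        # Gold answer for the current conversational turn.
        return " " + doc["answers"][turn_id]


if __name__ == "__main__":
    doc = {
        "story": "CoQA is a conversational question answering dataset.",
        "questions": ["What is CoQA?", "What kind of dataset is it?"],
        "answers": ["a dataset", "conversational question answering"],
    }
    print(CoQATask().doc_to_text(doc, turn_id=1))
```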

@StellaAthena StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
@anishthite anishthite moved this from To do to In progress in Implementing Evaluations Sep 17, 2020
@anishthite
Member

Creating docs and doc->text is done for CoQA; the only thing left to complete is the evaluate function in https://github.com/EleutherAI/lm_evaluation_harness/blob/master/lm_eval/tasks/coqa.py
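
For reference, the metric the missing evaluate function would need is the standard CoQA/SQuAD-style macro-averaged token F1. The snippet below is a simplified, self-contained sketch of that scoring logic, not the code in coqa.py or the official CoQA script.

```python
# Simplified sketch of SQuAD/CoQA-style token-level F1 scoring.
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Articles and punctuation are ignored, so this scores 1.0.
    print(f1_score("the conversational QA dataset", "a conversational QA dataset"))
```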

@leogao2 leogao2 moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Sep 30, 2020
@StellaAthena StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020
@StellaAthena StellaAthena linked a pull request Oct 23, 2020 that will close this issue
@anishthite anishthite linked a pull request Oct 24, 2020 that will close this issue
@StellaAthena StellaAthena moved this from Data integrated, Eval not done to Done in Implementing Evaluations Oct 26, 2020
@leogao2 leogao2 moved this from Done to Data integrated, Eval not done in Implementing Evaluations Dec 1, 2020
@StellaAthena StellaAthena reopened this Jan 5, 2021
@StellaAthena StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021
@leogao2 leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
@thefazzer thefazzer self-assigned this Jan 31, 2021
@thefazzer thefazzer moved this from To do to In Progress in Implementing Evaluations Jan 31, 2021
@leogao2 leogao2 moved this from In Progress to Done, evaluations in Implementing Evaluations Feb 8, 2021
@leogao2 leogao2 moved this from Done, evaluations to Deferred Pending Generation in Implementing Evaluations Feb 8, 2021
@leogao2 leogao2 moved this from Deferred to In Progress in Implementing Evaluations Feb 11, 2021
@leogao2 leogao2 closed this as completed Feb 14, 2021
Implementing Evaluations automation moved this from In Progress to Done, evaluations Feb 14, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024