Implement the SQuAD evaluation #20

Closed · 1 of 2 tasks

StellaAthena opened this issue Sep 16, 2020 · 3 comments · Fixed by #47 or #140
Labels: feature request (a feature that isn't implemented yet), good first issue (good for newcomers)

Comments

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

> Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
>
> GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18] a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school english examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

- [x] Data processing code implemented
- [ ] Evaluation implemented

The evaluation code should be modeled after the interface in `lm_eval/base.py` and the example of the BoolQ task in `lm_eval/tasks/superglue.py`.
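
For concreteness, here is a rough sketch of what such a task class might look like, assuming the `Task` interface exposed by `lm_eval/base.py` at the time (`doc_to_text`, `doc_to_target`, `construct_requests`, `process_results`) and the `rf` request factory. The class name, prompt format, and metric handling are illustrative, not the merged implementation:

```python
# Minimal sketch only; assumes the Task interface and rf request factory
# from lm_eval/base.py, and mean from lm_eval/metrics.py. Not the final code.
from lm_eval.base import Task, rf
from lm_eval.metrics import mean


class SQuAD2(Task):
    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def training_docs(self):
        from datasets import load_dataset  # HuggingFace datasets package
        return load_dataset("squad_v2")["train"]

    def validation_docs(self):
        from datasets import load_dataset
        return load_dataset("squad_v2")["validation"]

    def doc_to_text(self, doc):
        # GPT-3-style prompt: passage, then question, then an answer cue.
        return (
            f"Title: {doc['title']}\n\n"
            f"Background: {doc['context']}\n\n"
            f"Question: {doc['question']}\n\n"
            "Answer:"
        )

    def doc_to_target(self, doc):
        answers = doc["answers"]["text"]
        # SQuAD 2.0 marks unanswerable questions with an empty answer list.
        return " " + (answers[0] if answers else "unanswerable")

    def construct_requests(self, doc, ctx):
        # Span-based answers, so generate greedily until a newline instead
        # of scoring fixed yes/no continuations the way BoolQ does.
        return rf.greedy_until(ctx, ["\n"])

    def process_results(self, doc, results):
        # Exact match against any gold answer; a real evaluation should
        # also report F1 via the official SQuAD 2.0 scoring script.
        prediction = results[0].strip()
        golds = doc["answers"]["text"] or ["unanswerable"]
        return {"em": float(prediction in golds)}

    def aggregation(self):
        return {"em": mean}

    def higher_is_better(self):
        return {"em": True}
```

The main design difference from BoolQ is the request type: BoolQ scores a fixed set of continuations with loglikelihood requests, while SQuAD's free-form spans call for greedy generation.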

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 1, 2020

Note: HuggingFace includes this in its `datasets` package.

https://huggingface.co/datasets/squad_v2
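
For reference, loading it is a one-liner; the fields below are the standard SQuAD 2.0 schema on the Hub:

```python
from datasets import load_dataset

# Pulls both the train and validation splits of SQuAD 2.0 from the Hub.
squad = load_dataset("squad_v2")

# Each row carries the standard SQuAD 2.0 fields:
# id, title, context, question, answers
print(squad["validation"][0]["question"])
```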

cfoster0 (Contributor) commented

I'll take this.

cfoster0 (Contributor) commented

Similar to RACE (#21), HuggingFace splits the questions into one per passage, as opposed to the multi-question-per-passage setup from the paper.
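
If the paper's multi-question-per-passage setup is needed, the HuggingFace rows could be regrouped by passage. A minimal sketch (the `group_by_passage` helper is hypothetical, for illustration only):

```python
from collections import defaultdict

from datasets import load_dataset


def group_by_passage(split):
    """Regroup HuggingFace's one-question-per-row layout back into the
    paper's one-passage-with-many-questions setup."""
    grouped = defaultdict(list)
    for row in split:
        grouped[row["context"]].append(
            {"question": row["question"], "answers": row["answers"]}
        )
    return grouped


passages = group_by_passage(load_dataset("squad_v2")["validation"])
```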

StellaAthena moved this from To do to Data integrated, Eval not done in Implementing Evaluations Oct 22, 2020
StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
StellaAthena linked a pull request Oct 23, 2020 that will close this issue
StellaAthena reopened this Jan 5, 2021
StellaAthena added the feature request and good first issue labels Jan 5, 2021
leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
leogao2 moved this from To do, Evaluations to Implement to In Progress in Implementing Evaluations Feb 8, 2021
anishthite moved this from In Progress to To do, Evaluations to Implement in Implementing Evaluations Feb 17, 2021
anishthite moved this from To do, Evaluations to Implement to In Progress in Implementing Evaluations Feb 17, 2021
leogao2 self-assigned this Mar 28, 2021
Implementing Evaluations automation moved this from In Progress to Done, evaluations Mar 28, 2021
StellaAthena added a commit that referenced this issue Apr 29, 2022
Added bigscience-LAMA evaluation
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
Added bigscience-LAMA evaluation
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
Added bigscience-LAMA evaluation
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024