Implement the SQuAD evaluation #20

Closed · 1 of 2 tasks

StellaAthena opened this issue Sep 16, 2020 · 3 comments · Fixed by #47 or #140
Labels: feature request (a feature that isn't implemented yet), good first issue (good for newcomers)

Comments

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

> Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
>
> GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18] a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school english examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

- [x] Data processing code implemented
- [ ] Evaluation implemented

The evaluation code should be modeled after the interface in `lm_eval/base.py` and the example of the BoolQ task in `lm_eval/tasks/superglue.py`.
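
For concreteness, here is a rough sketch of what such a task class might look like, assuming the `Task` interface exposed by `lm_eval/base.py` at the time (`doc_to_text`, `doc_to_target`, `construct_requests`, `process_results`) and the `rf` request factory. The class name, prompt format, and metric handling are illustrative, not the merged implementation:

```python
# Minimal sketch only; assumes the Task interface and rf request factory
# from lm_eval/base.py, and mean from lm_eval/metrics.py. Not the final code.
from lm_eval.base import Task, rf
from lm_eval.metrics import mean


class SQuAD2(Task):
    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def training_docs(self):
        from datasets import load_dataset  # HuggingFace datasets package
        return load_dataset("squad_v2")["train"]

    def validation_docs(self):
        from datasets import load_dataset
        return load_dataset("squad_v2")["validation"]

    def doc_to_text(self, doc):
        # GPT-3-style prompt: passage, then question, then an answer cue.
        return (
            f"Title: {doc['title']}\n\n"
            f"Background: {doc['context']}\n\n"
            f"Question: {doc['question']}\n\n"
            "Answer:"
        )

    def doc_to_target(self, doc):
        answers = doc["answers"]["text"]
        # SQuAD 2.0 marks unanswerable questions with an empty answer list.
        return " " + (answers[0] if answers else "unanswerable")

    def construct_requests(self, doc, ctx):
        # Span-based answers, so generate greedily until a newline instead
        # of scoring fixed yes/no continuations the way BoolQ does.
        return rf.greedy_until(ctx, ["\n"])

    def process_results(self, doc, results):
        # Exact match against any gold answer; a real evaluation should
        # also report F1 via the official SQuAD 2.0 scoring script.
        prediction = results[0].strip()
        golds = doc["answers"]["text"] or ["unanswerable"]
        return {"em": float(prediction in golds)}

    def aggregation(self):
        return {"em": mean}

    def higher_is_better(self):
        return {"em": True}
```

The main design difference from BoolQ is the request type: BoolQ scores a fixed set of continuations with loglikelihood requests, while SQuAD's free-form spans call for greedy generation.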

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 1, 2020

Note: HuggingFace includes this in its `datasets` package.

https://huggingface.co/datasets/squad_v2
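
For reference, loading it is a one-liner; the fields below are the standard SQuAD 2.0 schema on the Hub:

```python
from datasets import load_dataset

# Pulls both the train and validation splits of SQuAD 2.0 from the Hub.
squad = load_dataset("squad_v2")

# Each row carries the standard SQuAD 2.0 fields:
# id, title, context, question, answers
print(squad["validation"][0]["question"])
```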

cfoster0 (Contributor) commented

I'll take this.

cfoster0 (Contributor) commented

Similar to RACE (#21), HuggingFace splits the questions into one per passage, as opposed to the multi-question-per-passage setup from the paper.
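
If the paper's multi-question-per-passage setup is needed, the HuggingFace rows could be regrouped by passage. A minimal sketch (the `group_by_passage` helper is hypothetical, for illustration only):

```python
from collections import defaultdict

from datasets import load_dataset


def group_by_passage(split):
    """Regroup HuggingFace's one-question-per-row layout back into the
    paper's one-passage-with-many-questions setup."""
    grouped = defaultdict(list)
    for row in split:
        grouped[row["context"]].append(
            {"question": row["question"], "answers": row["answers"]}
        )
    return grouped


passages = group_by_passage(load_dataset("squad_v2")["validation"])
```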

StellaAthena moved this from To do to Data integrated, Eval not done in Implementing Evaluations Oct 22, 2020
StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
StellaAthena linked a pull request Oct 23, 2020 that will close this issue
StellaAthena reopened this Jan 5, 2021
StellaAthena added the feature request and good first issue labels Jan 5, 2021
leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
leogao2 moved this from To do, Evaluations to Implement to In Progress in Implementing Evaluations Feb 8, 2021
anishthite moved this from In Progress to To do, Evaluations to Implement in Implementing Evaluations Feb 17, 2021
anishthite moved this from To do, Evaluations to Implement to In Progress in Implementing Evaluations Feb 17, 2021
leogao2 self-assigned this Mar 28, 2021
Implementing Evaluations automation moved this from In Progress to Done, evaluations Mar 28, 2021
StellaAthena added a commit that referenced this issue Apr 29, 2022
Added bigscience-LAMA evaluation
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
Added bigscience-LAMA evaluation
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
Added bigscience-LAMA evaluation
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024