Implement the RACE evaluation #21

Closed
2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 2 comments · Fixed by #46 or #104

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18] a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
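
For anyone picking this up, here is a rough, illustrative sketch of how a RACE example could be rendered as a prompt plus candidate answers for multiple-choice loglikelihood scoring. The class and method names below are hypothetical and do not mirror the actual interface in lm_eval/base.py; the field names (article, question, options, answer) follow the HF RACE schema.

```python
# Illustrative sketch only -- not the actual lm_eval/base.py interface.
# Shows how a RACE example (passage + question + 4 options) could be turned
# into a prompt and candidate continuations for loglikelihood-based
# multiple-choice scoring. Class/method names here are hypothetical.

class RaceTaskSketch:
    letters = ["A", "B", "C", "D"]

    def doc_to_text(self, doc):
        # Prompt: the passage followed by the question.
        return f"Article: {doc['article']}\n\nQuestion: {doc['question']}\nAnswer:"

    def doc_to_targets(self, doc):
        # Each answer option becomes a candidate continuation; the model's
        # loglikelihood of each continuation decides the prediction.
        return [" " + opt for opt in doc["options"]]

    def gold_index(self, doc):
        # RACE stores the gold answer as a letter ("A".."D").
        return self.letters.index(doc["answer"])


if __name__ == "__main__":
    example = {
        "article": "Tom went to the market because his family had run out of apples...",
        "question": "Why did Tom go to the market?",
        "options": ["To buy apples", "To sell apples", "To meet a friend", "To work"],
        "answer": "A",
    }
    task = RaceTaskSketch()
    print(task.doc_to_text(example))
    print(task.doc_to_targets(example)[task.gold_index(example)])
```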

StellaAthena added the feature request (A feature that isn't implemented yet.) label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
cfoster0 (Contributor) commented Oct 1, 2020

Note: HuggingFace includes this in its datasets package.

https://huggingface.co/datasets/race
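
If it helps, a quick way to poke at the HF copy (assuming the datasets package is installed; the "high" and "middle" configs correspond to the two exam levels):

```python
# Quick look at the HF copy of RACE (assumes `pip install datasets`).
from datasets import load_dataset

race_high = load_dataset("race", "high")   # splits: train / validation / test
print(race_high)                           # split sizes
print(race_high["train"][0].keys())        # article, question, options, answer, ...
```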

leogao2 moved this from To do to In progress in Implementing Evaluations Oct 5, 2020
leogao2 moved this from In progress to To do in Implementing Evaluations Oct 5, 2020
leogao2 (Contributor) commented Oct 5, 2020

One big issue with HF's implementation of this dataset: it creates a separate document for each question, whereas the GPT-3 paper makes one document per passage.

https://github.com/huggingface/datasets/blob/master/datasets/race/race.py#L106
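
One possible workaround is to regroup HF's one-row-per-question layout back into one document per passage before evaluation. Rough sketch, assuming the HF fields article/question/options/answer:

```python
# Hedged sketch: regroup HF's one-row-per-question rows into one document per
# passage, so each passage carries all of its questions (closer to the GPT-3
# setup described above). Assumes the HF fields article/question/options/answer.
from collections import defaultdict
from datasets import load_dataset

def docs_by_passage(split):
    grouped = defaultdict(list)
    for row in split:
        grouped[row["article"]].append(
            {"question": row["question"],
             "options": row["options"],
             "answer": row["answer"]}
        )
    # One document per passage, carrying all of its questions.
    return [{"article": article, "questions": qs} for article, qs in grouped.items()]

docs = docs_by_passage(load_dataset("race", "high", split="validation"))
print(len(docs), "passages")
```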

leogao2 moved this from To do to In progress in Implementing Evaluations Oct 5, 2020
leogao2 moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 5, 2020
StellaAthena added Eval Set and removed feature request (A feature that isn't implemented yet.) labels Oct 23, 2020
StellaAthena linked a pull request Oct 23, 2020 that will close this issue
StellaAthena reopened this Jan 5, 2021
StellaAthena added feature request (A feature that isn't implemented yet.) and good first issue (Good for newcomers) labels Jan 5, 2021
leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
leogao2 moved this from To do to In Progress in Implementing Evaluations Jan 29, 2021
leogao2 moved this from In Progress to Done in Implementing Evaluations Jan 30, 2021
leogao2 closed this as completed Jan 30, 2021
StellaAthena linked a pull request Jan 30, 2021 that will close this issue
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
Fixed issue with write_out for datasets without a training split
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
Fixed issue with write_out for datasets without a training split
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
Fixed issue with write_out for datasets without a training split