
Implement the ARC Challenge evaluation #15

Closed
2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 1 comment
Assignees: jon-tow
Labels: feature request (A feature that isn't implemented yet.), good first issue (Good for newcomers)

Comments

@StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

ARC [CCE+18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the “Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline (55.9%) from UnifiedQA [KKS+20]. On the “Easy” version of the dataset (questions which either of the mentioned baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a fine-tuned RoBERTa baseline from [KKS+20]. However, both of these results are still much worse than the overall SOTAs achieved by the UnifiedQA which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy set.

- [x] Data processing code implemented
- [x] Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
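For orientation, here is a rough sketch of what such a task might look like. This is a minimal sketch only: the hook names (`download`, `doc_to_text`, `construct_requests`, `process_results`, and friends), the `rf.loglikelihood` helper, and the use of HuggingFace `datasets` are assumptions about the harness interface based on the BoolQ pointer above, not the actual implementation.

```python
# Hedged sketch of an ARC-Challenge task. Everything below assumes a Task
# base class in lm_eval/base.py with these hooks and an `rf` request factory;
# field names follow the HuggingFace mirror of ARC. None of this is taken
# from the repository itself.
import datasets
from lm_eval.base import Task, rf
from lm_eval.metrics import mean


class ARCChallenge(Task):
    def download(self):
        # Assumes the HuggingFace mirror of the Allen AI ARC dataset.
        self.data = datasets.load_dataset("ai2_arc", "ARC-Challenge")

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def training_docs(self):
        return self.data["train"]

    def validation_docs(self):
        return self.data["validation"]

    def test_docs(self):
        return self.data["test"]

    def doc_to_text(self, doc):
        return "Question: " + doc["question"] + "\nAnswer:"

    def doc_to_target(self, doc):
        # The gold answer key indexes into the list of answer choices.
        gold = doc["choices"]["label"].index(doc["answerKey"])
        return " " + doc["choices"]["text"][gold]

    def construct_requests(self, doc, ctx):
        # Ask the model for the log-likelihood of each answer choice
        # conditioned on the question context.
        return [rf.loglikelihood(ctx, " " + choice)
                for choice in doc["choices"]["text"]]

    def process_results(self, doc, results):
        # `results` holds one (log-likelihood, ...) tuple per choice;
        # predict the choice the model finds most likely.
        gold = doc["choices"]["label"].index(doc["answerKey"])
        pred = max(range(len(results)), key=lambda i: results[i][0])
        return {"acc": float(pred == gold)}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}
```

Scoring every choice by log-likelihood and taking the argmax mirrors how the GPT-3 paper evaluates multiple-choice tasks, so the resulting accuracy should be directly comparable to the figures quoted above.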

@StellaAthena StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
@cfoster0 (Contributor) commented Oct 1, 2020

Source: https://allenai.org/data/arc
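The same data is mirrored on the HuggingFace Hub, which may be the easiest route for the data-processing half of the checklist. A minimal sketch of loading and inspecting a record, assuming the `ai2_arc` dataset name and its field layout (not something stated in this issue):

```python
# Hedged sketch: assumes the HuggingFace `datasets` mirror of ARC
# (dataset `ai2_arc`, config `ARC-Challenge`) and its field names.
from datasets import load_dataset

arc = load_dataset("ai2_arc", "ARC-Challenge")
doc = arc["validation"][0]
print(doc["question"])          # question stem
print(doc["choices"]["text"])   # answer options
print(doc["choices"]["label"])  # choice labels, e.g. ["A", "B", "C", "D"]
print(doc["answerKey"])         # gold label
```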

@leogao2 leogao2 moved this from To do to In progress in Implementing Evaluations Oct 5, 2020
@leogao2 leogao2 moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 5, 2020
@StellaAthena StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020
@StellaAthena StellaAthena reopened this Jan 5, 2021
@StellaAthena StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021
@leogao2 leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
@jon-tow jon-tow self-assigned this Feb 4, 2021
@jon-tow jon-tow closed this as completed Feb 5, 2021
Implementing Evaluations automation moved this from To do, Evaluations to Implement to Done Feb 5, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023: Fix `mlsum` task names after split update
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023: Fix `mlsum` task names after split update
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024