Implement the HendrycksTest evaluation #120

Closed · leogao2 opened this issue Feb 3, 2021 · 0 comments

leogao2 (Contributor) commented Feb 3, 2021

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

https://arxiv.org/abs/2009.03300
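
For reference, a minimal sketch of the evaluation loop this issue asks for: format each question with lettered choices and an "Answer:" cue, then pick the choice whose letter the model assigns the highest log-likelihood. This is a zero-shot illustration only (the paper reports few-shot results), and the `cais/mmlu` Hub dataset name and `gpt2` placeholder model are assumptions for the sketch, not the harness's actual implementation.

```python
# Minimal sketch (not the harness's real code): score a causal LM on one
# MMLU subject by comparing log-likelihoods of the four answer letters.
# Assumes the dataset is mirrored on the Hugging Face Hub as "cais/mmlu".
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LETTERS = ["A", "B", "C", "D"]

def format_prompt(example, subject):
    # Prompt format follows the paper: header, question, lettered
    # choices, then an "Answer:" cue for the model to complete.
    lines = [
        f"The following are multiple choice questions (with answers) about {subject}.",
        "",
        example["question"],
    ]
    for letter, choice in zip(LETTERS, example["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

@torch.no_grad()
def letter_logprob(prompt, letter):
    # Log-probability of " A"/" B"/" C"/" D" as the continuation of the prompt.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(" " + letter, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    logprobs = model(input_ids).logits.log_softmax(-1)
    # The token at position i is predicted by the logits at position i - 1.
    start = prompt_ids.shape[1]
    target = input_ids[0, start:]
    return logprobs[0, start - 1 : -1].gather(-1, target.unsqueeze(-1)).sum().item()

subject = "abstract_algebra"  # one of the 57 subjects
test_set = load_dataset("cais/mmlu", subject, split="test")

correct = 0
for ex in test_set:
    prompt = format_prompt(ex, subject.replace("_", " "))
    pred = max(range(4), key=lambda i: letter_logprob(prompt, LETTERS[i]))
    correct += int(pred == ex["answer"])  # "answer" is the gold index 0-3

print(f"{subject}: {correct / len(test_set):.3f} accuracy (chance = 0.250)")
```

Per-subject accuracies like this would then be averaged across all 57 subjects; the few-shot variant would prepend k worked dev-set examples (with their answers) to each prompt.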

@leogao2 leogao2 added this to To do, Evaluations to Implement in Implementing Evaluations via automation Feb 3, 2021
@leogao2 leogao2 moved this from To do, Evaluations to Implement to In Progress in Implementing Evaluations Feb 8, 2021
@leogao2 leogao2 closed this as completed Mar 26, 2021
Implementing Evaluations automation moved this from In Progress to Done, evaluations Mar 26, 2021