
Implement the News Article Generation evaluation #29

Closed · 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 0 comments
Labels: feature request (A feature that isn't implemented yet.)

@StellaAthena (Member) commented Sep 16, 2020

This is rather intensive to carry out due to the need for human input and may be skipped.

From the GPT-3 paper:

To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional
sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles
from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative
language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to
distinguish the two is a potentially important measure of quality. In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.
The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.
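As a rough illustration (not from the paper or this repository), the quiz described above boils down to pairing each title/subtitle prompt with either the human article or a model completion and recording a five-way judgment. The field names, the `build_quiz` helper, and the random pairing logic below are assumptions, sketched only to show the shape of the data:

```python
import random
from dataclasses import dataclass

# The five response options quoted from the paper.
RESPONSE_OPTIONS = [
    "very likely written by a human",
    "more likely written by a human",
    "I don't know",
    "more likely written by a machine",
    "very likely written by a machine",
]


@dataclass
class QuizItem:
    title: str
    subtitle: str
    article: str  # either the human-written article or a model completion
    source: str   # "human" or a model identifier; hidden from participants


def build_quiz(prompts, human_articles, model_articles, seed=0):
    """For each title/subtitle, show either the human or the model article."""
    rng = random.Random(seed)
    items = []
    for prompt, human, model in zip(prompts, human_articles, model_articles):
        use_human = rng.random() < 0.5
        items.append(QuizItem(
            title=prompt["title"],
            subtitle=prompt["subtitle"],
            article=human if use_human else model,
            source="human" if use_human else "model",
        ))
    return items
```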

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
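A minimal sketch of one possible shape for that task, assuming the interface exposes the same hooks the BoolQ example uses; the method names, the rf.greedy_until request, and the _load_newser_articles loader are assumptions to verify against lm_eval/base.py before implementing:

```python
# Sketch only -- method names and the rf.greedy_until request type are assumptions
# based on the BoolQ example; check lm_eval/base.py and lm_eval/tasks/superglue.py.
from lm_eval.base import Task, rf


class NewsArticleGeneration(Task):
    """Generate article bodies from title + subtitle prompts for the human quiz."""

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def validation_docs(self):
        # Hypothetical loader: each doc is a dict with "title", "subtitle", "article".
        return self._load_newser_articles()

    def doc_to_text(self, doc):
        # Condition the model on the title and subtitle only.
        return f"Title: {doc['title']}\nSubtitle: {doc['subtitle']}\nArticle:"

    def doc_to_target(self, doc):
        # The human-written article, kept for the quiz rather than automatic scoring.
        return " " + doc["article"]

    def construct_requests(self, doc, ctx):
        # Ask for a free-form completion, stopping at a blank line.
        return rf.greedy_until(ctx, ["\n\n"])

    def process_results(self, doc, results):
        # No automatic metric: the generation is saved and judged by humans offline.
        return {"generation": results[0]}
```

Since the judgments come from human raters rather than an automatic metric, most of the work lives on the data-processing side (assembling the quiz and collecting responses); the task class mainly has to produce completions programmatically, with no cherry-picking, as the paper describes.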

StellaAthena added the "feature request" label Sep 16, 2020
StellaAthena added this to "To do" in Implementing Evaluations via automation Sep 16, 2020
StellaAthena added the "Eval Set" label and removed the "feature request" label Oct 23, 2020
StellaAthena reopened this Jan 5, 2021
StellaAthena added the "feature request" and "good first issue" labels Jan 5, 2021
StellaAthena removed the "good first issue" label Jan 21, 2021
pruksmhc pushed a commit to pruksmhc/lm-evaluation-harness that referenced this issue May 10, 2022