
Implement the News Article Generation evaluation #29

Closed · 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 0 comments
Labels: feature request (A feature that isn't implemented yet.)

@StellaAthena (Member) commented Sep 16, 2020

This is rather intensive to carry out due to the need for human input and may be skipped.

From the GPT-3 paper:

To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional
sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles
from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative
language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to
distinguish the two is a potentially important measure of quality. In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.
The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.
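As a rough illustration (not from the paper or this repository), the quiz described above boils down to pairing each title/subtitle prompt with either the human article or a model completion and recording a five-way judgment. The field names, the `build_quiz` helper, and the random pairing logic below are assumptions, sketched only to show the shape of the data:

```python
import random
from dataclasses import dataclass

# The five response options quoted from the paper.
RESPONSE_OPTIONS = [
    "very likely written by a human",
    "more likely written by a human",
    "I don't know",
    "more likely written by a machine",
    "very likely written by a machine",
]


@dataclass
class QuizItem:
    title: str
    subtitle: str
    article: str  # either the human-written article or a model completion
    source: str   # "human" or a model identifier; hidden from participants


def build_quiz(prompts, human_articles, model_articles, seed=0):
    """For each title/subtitle, show either the human or the model article."""
    rng = random.Random(seed)
    items = []
    for prompt, human, model in zip(prompts, human_articles, model_articles):
        use_human = rng.random() < 0.5
        items.append(QuizItem(
            title=prompt["title"],
            subtitle=prompt["subtitle"],
            article=human if use_human else model,
            source="human" if use_human else "model",
        ))
    return items
```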

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
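A minimal sketch of one possible shape for that task, assuming the interface exposes the same hooks the BoolQ example uses; the method names, the rf.greedy_until request, and the _load_newser_articles loader are assumptions to verify against lm_eval/base.py before implementing:

```python
# Sketch only -- method names and the rf.greedy_until request type are assumptions
# based on the BoolQ example; check lm_eval/base.py and lm_eval/tasks/superglue.py.
from lm_eval.base import Task, rf


class NewsArticleGeneration(Task):
    """Generate article bodies from title + subtitle prompts for the human quiz."""

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def validation_docs(self):
        # Hypothetical loader: each doc is a dict with "title", "subtitle", "article".
        return self._load_newser_articles()

    def doc_to_text(self, doc):
        # Condition the model on the title and subtitle only.
        return f"Title: {doc['title']}\nSubtitle: {doc['subtitle']}\nArticle:"

    def doc_to_target(self, doc):
        # The human-written article, kept for the quiz rather than automatic scoring.
        return " " + doc["article"]

    def construct_requests(self, doc, ctx):
        # Ask for a free-form completion, stopping at a blank line.
        return rf.greedy_until(ctx, ["\n\n"])

    def process_results(self, doc, results):
        # No automatic metric: the generation is saved and judged by humans offline.
        return {"generation": results[0]}
```

Since the judgments come from human raters rather than an automatic metric, most of the work lives on the data-processing side (assembling the quiz and collecting responses); the task class mainly has to produce completions programmatically, with no cherry-picking, as the paper describes.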

StellaAthena added the "feature request" label Sep 16, 2020
StellaAthena added this to "To do" in Implementing Evaluations via automation Sep 16, 2020
StellaAthena added the "Eval Set" label and removed the "feature request" label Oct 23, 2020
StellaAthena reopened this Jan 5, 2021
StellaAthena added the "feature request" and "good first issue" labels Jan 5, 2021
StellaAthena removed the "good first issue" label Jan 21, 2021
pruksmhc pushed a commit to pruksmhc/lm-evaluation-harness that referenced this issue May 10, 2022