Beemo (Benchmark of expert-edited machine-generated outputs) is a novel benchmark of 2195 texts generated by ten instruction-finetuned language models (LMs) and edited by expert annotators for various use cases, ranging from creative writing to text summarization. We make one of the first attempts to address more practical machine-generated text detection scenarios, where the user polishes the model output to make it more human-like.
Refer to our HuggingFace page for general statistics and initial evaluation results, and stay tuned for more details in our upcoming paper 💥.
Our benchmark is named after BMO (abbreviated from "Be MOre", phonetically spelled "Beemo"), one of the main characters of Adventure Time.
- 📊 Curated by: University of Oslo, MIT Lincoln Laboratory, Penn State University, and Toloka.
- 🌐 Language(s): English
- 🗞️ Paper: TBA
- 🪪 License: MIT
- 17.09.2024: the initial release and evaluation of 11 detectors on Beemo.
Beemo's creation approach involves:
- 🦾 Text Generation: prompting an instruction-finetuned LM;
- 👩🏻🔬 Text Editing: editing the LM's output by an expert annotator;
- ✅ Peer-reviewing: peer-reviewing the annotator's edits.
🦾 Text Generation
The No Robots 🙅♂️🤖 dataset is used as the source of prompts and corresponding human-written texts across the following categories: Generation, Rewrite, Summarize, Open QA, and Closed QA. For each sampled prompt, we generate an output with one of ten open-source instruction-finetuned LMs using the default 🤗 HuggingFace inference hyperparameters.
Name | Base | SFT corpus | License | Paper |
---|---|---|---|---|
HuggingFaceH4/zephyr-7b-beta | Mistral-7B-v0.1 | UltraChat, UltraFeedback | MIT | Tunstall et al. (2023) |
allenai/tulu-2-7b | Llama 2 7B | human-written and synthetic | AI2 ImpACT | Ivison et al. (2023) |
allenai/tulu-2-13b | Llama 2 13B | human-written and synthetic | AI2 ImpACT | Ivison et al. (2023) |
google/gemma-2b-it | Gemma 2B | human-written and synthetic | Gemma license | Gemma Team et al. (2024) |
google/gemma-7b-it | Gemma 7B | human-written and synthetic | Gemma license | Gemma Team et al. (2024) |
meta-llama/Llama-2-7b-chat-hf | Llama 2 7B | Misc. | Llama license | Touvron et al. (2023) |
meta-llama/Llama-2-13b-chat-hf | Llama 2 13B | Misc. | Llama license | Touvron et al. (2023) |
meta-llama/Llama-2-70b-chat-hf | Llama 2 70B | Misc. | Llama license | Touvron et al. (2023) |
mistralai/Mistral-7B-Instruct-v0.1 | Mistral-7B-v0.1 | Misc. | Apache-2.0 | Jiang et al. (2023) |
mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral 8x7B | Misc. | Apache-2.0 | Jiang et al. (2024) |

Table 1: Overview of the instruction-finetuned LMs used to create Beemo.
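
Pairing prompts with models can be sketched as follows. This is a minimal illustration, not the pipeline used to build Beemo: it assumes each prompt is paired with one model drawn uniformly at random, and the `assign_models` helper and its fixed seed are hypothetical. The model names are taken from Table 1.

```python
import random

# Model identifiers as listed in Table 1.
MODELS = [
    "HuggingFaceH4/zephyr-7b-beta",
    "allenai/tulu-2-7b",
    "allenai/tulu-2-13b",
    "google/gemma-2b-it",
    "google/gemma-7b-it",
    "meta-llama/Llama-2-7b-chat-hf",
    "meta-llama/Llama-2-13b-chat-hf",
    "meta-llama/Llama-2-70b-chat-hf",
    "mistralai/Mistral-7B-Instruct-v0.1",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
]


def assign_models(prompts, seed=0):
    """Pair each prompt with one model chosen uniformly at random.

    Uniform sampling and the seed are assumptions made for illustration.
    """
    rng = random.Random(seed)  # local generator keeps the pairing reproducible
    return [(prompt, rng.choice(MODELS)) for prompt in prompts]
```

Each resulting `(prompt, model)` pair would then be passed to the chosen model for generation with its default inference settings.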
👩🏻🔬 Text Editing
The machine-generated texts are edited by an in-house team of expert annotators experienced in editing and annotating generated content. Each annotator is given detailed category-specific annotation guidelines before performing the task. The annotation task is to (1) carefully read a given prompt and the LM response and (2) refine the output by correcting factual inconsistencies, removing hallucinations, and improving style, coherence, and fluency. The percentage of required edits ranges between 20% and 40%. Instead of editing, the annotator is asked to label the text as:
- "Perfect" if it does not require any changes and aligns with the prompt intent or
- "Rejected" if it requires more significant improvements or does not follow the prompt closely.
We discard the "Perfect" and "Rejected" examples and create Beemo using only the edited texts.
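
The filtering step can be sketched as below. The record layout, the `label` field, and the `"Edited"` value are assumptions made for illustration; the text above only names the "Perfect" and "Rejected" labels.

```python
# Labels whose examples are dropped, as described above.
DISCARDED_LABELS = {"Perfect", "Rejected"}


def keep_edited(records):
    """Return only the records that were actually edited by an annotator."""
    return [r for r in records if r["label"] not in DISCARDED_LABELS]


# Hypothetical annotation records.
records = [
    {"id": 1, "label": "Perfect"},
    {"id": 2, "label": "Edited"},
    {"id": 3, "label": "Rejected"},
]
edited = keep_edited(records)
```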
✅ Peer-reviewing
Each edited machine-generated response undergoes a peer-reviewing and quality control stage based on manual validation and automatic quality criteria.
An experienced lead editor performs the manual validation. The editor communicates with the team of expert annotators daily, suggests areas for improvement regarding specific categories, and provides recommendations and feedback in a group chat.
If an edited text fails any of the automatic quality criteria listed below, it is returned to the team of expert annotators and lead editor for revision.
- Estimating the number of edits using the `difflib` library: no less than 20% of the text should be edited (Python Docs - difflib).
- Tracking time spent on editing: no less than 2 minutes should be spent editing one machine-generated text.
- Adversarial filtering: at least one commercial AI detector should recognize the edited text as human-written.
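
The three criteria can be combined into a single check, sketched below under stated assumptions. The source does not say how the share of edits is computed with `difflib`. Here it is approximated as one minus the word-level `SequenceMatcher.ratio()`. The function names and the list of detector verdicts are hypothetical.

```python
import difflib

MIN_EDIT_FRACTION = 0.20  # at least 20% of the text edited
MIN_EDIT_SECONDS = 120    # at least 2 minutes of editing time


def edit_fraction(model_output: str, human_edits: str) -> float:
    """Approximate share of edited text as 1 - word-level similarity ratio.

    This particular measure is an assumption, not the benchmark's formula.
    """
    matcher = difflib.SequenceMatcher(
        None, model_output.split(), human_edits.split(), autojunk=False
    )
    return 1.0 - matcher.ratio()


def passes_quality_checks(model_output, human_edits, seconds_spent, detector_says_human):
    """Return True only if all three automatic criteria are satisfied.

    detector_says_human: one boolean per detector, True when that detector
    classifies the edited text as human-written.
    """
    enough_edits = edit_fraction(model_output, human_edits) >= MIN_EDIT_FRACTION
    enough_time = seconds_spent >= MIN_EDIT_SECONDS
    fools_a_detector = any(detector_says_human)
    return enough_edits and enough_time and fools_a_detector
```

A text that fails this check would be sent back for revision, as described above.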
- The prompts (`prompt`) and human-written texts (`human_output`) from No Robots 🙅♂️🤖 are under the original dataset's license: CC-BY-NC-4.0.
- The machine-generated texts (`model_output`) are subject to the underlying instruction-finetuned LLMs' licensing terms.
- The expert-edited machine-generated texts (`human_edits`) are available under the MIT license, unless otherwise specified in the underlying instruction-finetuned LLMs' licensing terms.
- Vladislav Mikhailov
- Ekaterina Artemova