Beemo

Dataset Description

Beemo (Benchmark of expert-edited machine-generated outputs) is a novel benchmark of 2195 texts generated by ten instruction-finetuned language models (LMs) and edited by expert annotators for various use cases, ranging from creative writing to text summarization. We make one of the first attempts to address more practical machine-generated text detection scenarios, where the user polishes the model output to make it more human-like.

Refer to our HuggingFace page for general statistics and initial evaluation results, and stay tuned for more details in our upcoming paper 💥.

Our benchmark is named after BMO (abbreviated from "Be MOre", phonetically spelled "Beemo"), one of the main characters of Adventure Time.

📊 Curated by: University of Oslo, MIT Lincoln Laboratory, Penn State University, and Toloka.
🌐 Language(s): English
🗞️ Paper: TBA
🪪 License: MIT

🔥Updates

17.09.2024: the initial release and evaluation of 11 detectors on Beemo.

Dataset Creation

Beemo's creation approach involves:

🦾 Text Generation: prompting an instruction-finetuned LM;
👩🏻‍🔬 Text Editing: editing the LM's output by an expert annotator;
✅ Peer-reviewing: peer-reviewing the annotator's edits.

🦾 Text Generation

The No Robots 🙅‍♂️🤖 dataset is used as the source of prompts and corresponding human-written texts across the following categories: Generation, Rewrite, Summarize, Open QA, and Closed QA. We randomly sample prompts and, for each one, generate an output with one of ten open-source instruction-finetuned LMs using the default 🤗 HuggingFace inference hyperparameters.
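
The pairing of prompts with models can be sketched as below. This is only an illustration under stated assumptions: the source does not say how the model is chosen for a given prompt, so a uniform random choice with a fixed seed is assumed here, and `assign_models` is a hypothetical helper name. The generation call itself is left out.

```python
import random

# The ten instruction-finetuned LMs listed in Table 1.
MODELS = [
    "HuggingFaceH4/zephyr-7b-beta",
    "allenai/tulu-2-7b",
    "allenai/tulu-2-13b",
    "google/gemma-2b-it",
    "google/gemma-7b-it",
    "meta-llama/Llama-2-7b-chat-hf",
    "meta-llama/Llama-2-13b-chat-hf",
    "meta-llama/Llama-2-70b-chat-hf",
    "mistralai/Mistral-7B-Instruct-v0.1",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
]


def assign_models(prompts, seed=0):
    """Pair every prompt with one model.

    Assumption: the model is drawn uniformly at random; the seed only
    makes this sketch reproducible.
    """
    rng = random.Random(seed)
    return [(prompt, rng.choice(MODELS)) for prompt in prompts]


if __name__ == "__main__":
    example_prompts = ["Summarize the article below.", "Write a haiku about rain."]
    for prompt, model in assign_models(example_prompts):
        print(f"{model} <- {prompt}")
```

Each returned pair would then be passed to the chosen model with its default inference settings.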

Instruction-finetuned LMs

Name	Base	SFT corpus	License	Paper
HuggingFaceH4/zephyr-7b-beta	Mistral-7B-v0.1	UltraChat, UltraFeedback	MIT	Tunstall et al. (2023)
allenai/tulu-2-7b	Llama 2 7B	human-written and synthetic	AI2 ImpACT	Ivison et al. (2023)
allenai/tulu-2-13b	Llama 2 13B	human-written and synthetic	AI2 ImpACT	Ivison et al. (2023)
google/gemma-2b-it	Gemma 2B	human-written and synthetic	Gemma license	Gemma Team et al. (2024)
google/gemma-7b-it	Gemma 7B	human-written and synthetic	Gemma license	Gemma Team et al. (2024)
meta-llama/Llama-2-7b-chat-hf	Llama 2 7B	Misc.	Llama license	Touvron et al. (2023)
meta-llama/Llama-2-13b-chat-hf	Llama 2 13B	Misc.	Llama license	Touvron et al. (2023)
meta-llama/Llama-2-70b-chat-hf	Llama 2 70B	Misc.	Llama license	Touvron et al. (2023)
mistralai/Mistral-7B-Instruct-v0.1	Mistral-7B-v0.1	Misc.	Apache-2.0	Jiang et al. (2023)
mistralai/Mixtral-8x7B-Instruct-v0.1	Mixtral 8x7B	Misc.	Apache-2.0	Jiang et al. (2024)
Table 1: Overview of the instruction-finetuned LMs used to create Beemo.

👩🏻‍🔬 Text Editing

The machine-generated texts are edited by an in-house team of expert annotators who are experienced in editing and annotating generated content. Each annotator receives detailed category-specific annotation guidelines before performing the task. The annotation task is to (1) carefully read a given prompt and the LM response and (2) refine the output by correcting factual inconsistencies, removing hallucinations, and improving style, coherence, and fluency. The percentage of required edits ranges between 20% and 40%. The annotator is asked to label the text as:

"Perfect" if it does not require any changes and aligns with the prompt intent or
"Rejected" if it requires more significant improvements or does not follow the prompt closely.

We discard the "Perfect" and "Rejected" examples and create Beemo using only the edited texts.
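
The discarding step amounts to a simple filter over the annotated examples. The sketch below assumes each example is a dictionary with a `label` key; that layout, the `"Edited"` label, and the `keep_edited` helper are hypothetical, since the source names only the "Perfect" and "Rejected" labels.

```python
# Labels named in the annotation guidelines above; examples carrying them are dropped.
DISCARDED_LABELS = {"Perfect", "Rejected"}


def keep_edited(examples):
    """Return only the examples that are neither "Perfect" nor "Rejected".

    Assumption: each example is a dict with a "label" key.
    """
    return [ex for ex in examples if ex["label"] not in DISCARDED_LABELS]


if __name__ == "__main__":
    annotated = [
        {"id": 1, "label": "Perfect"},
        {"id": 2, "label": "Edited"},    # hypothetical label for edited texts
        {"id": 3, "label": "Rejected"},
    ]
    print(keep_edited(annotated))
```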

✅ Peer-reviewing

Each edited machine-generated response undergoes a peer-reviewing and quality control stage based on manual validation and automatic quality criteria.

🕵🏻 Manual Validation

An experienced lead editor performs the manual validation. The editor communicates with the team of expert annotators daily, suggests areas for improvement regarding specific categories, and provides recommendations and feedback in a group chat.

🔍 Automatic Quality Criteria

If an edited text fails any of the automatic quality criteria listed below, it is returned to the team of expert annotators and the lead editor for revision.

Estimating the number of edits using the difflib library: no less than 20% of the text should be edited (Python Docs - difflib).
Tracking time spent on editing: no less than 2 minutes should be spent editing one machine-generated text.
Adversarial filtering: at least one commercial AI detector should recognize the edited text as human-written.
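
The three criteria above can be combined into one check, sketched below. The source names only the difflib library, so several details are assumptions: the edit share is estimated as one minus `SequenceMatcher.ratio()` over whitespace-split words, the detector verdicts are passed in as plain booleans (no real detector API is called), and `edit_fraction` and `passes_quality_criteria` are hypothetical helper names.

```python
import difflib

MIN_EDIT_FRACTION = 0.20  # at least 20% of the text should be edited
MIN_EDIT_SECONDS = 120    # at least 2 minutes spent on one text


def edit_fraction(model_output, human_edits):
    """Estimate the share of edited text as 1 - SequenceMatcher.ratio().

    Assumption: texts are compared word by word; autojunk is disabled so
    frequent words in long texts are not skipped by the heuristic.
    """
    matcher = difflib.SequenceMatcher(
        None, model_output.split(), human_edits.split(), autojunk=False
    )
    return 1.0 - matcher.ratio()


def passes_quality_criteria(model_output, human_edits, seconds_spent, detector_says_human):
    """Return True only if all three criteria hold.

    detector_says_human holds one boolean per detector; one True is enough.
    """
    return (
        edit_fraction(model_output, human_edits) >= MIN_EDIT_FRACTION
        and seconds_spent >= MIN_EDIT_SECONDS
        and any(detector_says_human)
    )


if __name__ == "__main__":
    original = "the cat sat on the mat"
    edited = "the dog sat on a mat"
    print(round(edit_fraction(original, edited), 3))
    print(passes_quality_criteria(original, edited, 150, [False, True]))
```

A character-level comparison would give different percentages for the same pair of texts, so the 20% threshold only has a fixed meaning once the comparison unit is settled.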

License

The prompts (prompt) and human-written texts (human_output) from No Robots 🙅‍♂️🤖 are under the original dataset's license: CC-BY-NC-4.0.
The machine-generated texts (model_output) are subject to the underlying instruction-finetuned LMs' licensing terms.
The expert-edited machine-generated texts (human_edits) are available under the MIT license, unless otherwise specified in the underlying instruction-finetuned LMs' licensing terms.

Contact us

Vladislav Mikhailov ([email protected])
Ekaterina Artemova ([email protected])
