
Beemo


Dataset Description

Beemo (Benchmark of expert-edited machine-generated outputs) is a novel benchmark of 2,195 texts generated by ten instruction-finetuned language models (LMs) and edited by expert annotators for various use cases, ranging from creative writing to text summarization. We make one of the first attempts to address a more practical machine-generated text detection scenario, in which the user polishes the model output to make it more human-like.

Refer to our HuggingFace page for general statistics and initial evaluation results, and stay tuned for more details in our upcoming paper 💥.

Our benchmark is named after BMO (abbreviated from "Be MOre", phonetically spelled "Beemo"), one of the main characters of Adventure Time.

  • 📊 Curated by: University of Oslo, MIT Lincoln Laboratory, Penn State University, and Toloka.
  • 🌐 Language(s): English
  • 🗞️ Paper: TBA
  • 🪪 License: MIT

🔥Updates

  • 17.09.2024: the initial release and evaluation of 11 detectors on Beemo.

Dataset Creation

Beemo is created in three stages:

  • 🦾 Text Generation: prompting an instruction-finetuned LM;
  • 👩🏻‍🔬 Text Editing: editing the LM's output by an expert annotator;
  • ✅ Peer-reviewing: peer-reviewing the annotator's edits.
🦾 Text Generation

The No Robots 🙅‍♂️🤖 dataset is used as the source of prompts and corresponding human-written texts across the following categories: Generation, Rewrite, Summarize, Open QA, and Closed QA. For each prompt, we randomly sample one of ten open-source instruction-finetuned LMs to generate an output, using the default 🤗 HuggingFace inference hyperparameters.
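The per-prompt model assignment described above can be sketched as follows. The model list comes from Table 1 below, but the pairing logic and seeding are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch: deterministically assign one of the ten LMs to each
# prompt. The seeding scheme is an assumption, not the authors' script.
import random

MODELS = [
    "HuggingFaceH4/zephyr-7b-beta", "allenai/tulu-2-7b", "allenai/tulu-2-13b",
    "google/gemma-2b-it", "google/gemma-7b-it",
    "meta-llama/Llama-2-7b-chat-hf", "meta-llama/Llama-2-13b-chat-hf",
    "meta-llama/Llama-2-70b-chat-hf", "mistralai/Mistral-7B-Instruct-v0.1",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
]

def assign_model(prompt_id: int, seed: int = 0) -> str:
    """Pick one LM per prompt, reproducibly given the same seed."""
    rng = random.Random(seed * 1_000_003 + prompt_id)
    return rng.choice(MODELS)

print(assign_model(42))
```

The generated output would then be produced by prompting the assigned model with its default inference settings.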

Instruction-finetuned LMs

| Name | Base | SFT corpus | License | Paper |
|------|------|------------|---------|-------|
| HuggingFaceH4/zephyr-7b-beta | Mistral-7B-v0.1 | UltraChat, UltraFeedback | MIT | Tunstall et al. (2023) |
| allenai/tulu-2-7b | Llama 2 7B | human-written and synthetic | AI2 ImpACT | Ivison et al. (2023) |
| allenai/tulu-2-13b | Llama 2 13B | human-written and synthetic | AI2 ImpACT | Ivison et al. (2023) |
| google/gemma-2b-it | Gemma 2B | human-written and synthetic | Gemma license | Gemma Team et al. (2024) |
| google/gemma-7b-it | Gemma 7B | human-written and synthetic | Gemma license | Gemma Team et al. (2024) |
| meta-llama/Llama-2-7b-chat-hf | Llama 2 7B | Misc. | Llama license | Touvron et al. (2023) |
| meta-llama/Llama-2-13b-chat-hf | Llama 2 13B | Misc. | Llama license | Touvron et al. (2023) |
| meta-llama/Llama-2-70b-chat-hf | Llama 2 70B | Misc. | Llama license | Touvron et al. (2023) |
| mistralai/Mistral-7B-Instruct-v0.1 | Mistral-7B-v0.1 | Misc. | Apache-2.0 | Jiang et al. (2023) |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral 8x7B | Misc. | Apache-2.0 | Jiang et al. (2024) |
Table 1: Overview of the instruction-finetuned LMs used to create Beemo.
👩🏻‍🔬 Text Editing

The machine-generated texts are edited by an in-house team of expert annotators who have extensive experience in editing and annotating generated content. Each annotator is given detailed category-specific annotation guidelines before performing the task. The annotation task is to (1) carefully read a given prompt and the LM response and (2) refine the output by correcting factual inconsistencies, removing hallucinations, and improving style, coherence, and fluency. The percentage of required edits ranges between 20% and 40%. The annotator is asked to label the text as:

  • "Perfect" if it does not require any changes and aligns with the prompt intent or
  • "Rejected" if it requires more significant improvements or does not follow the prompt closely.

We discard the "Perfect" and "Rejected" examples and create Beemo using only the edited texts.

✅ Peer-reviewing

Each edited machine-generated response undergoes a peer-reviewing and quality control stage based on manual validation and automatic quality criteria.

🕵🏻 Manual Validation

An experienced lead editor performs the manual validation. The editor communicates with the team of expert annotators daily, suggests areas for improvement regarding specific categories, and provides recommendations and feedback in a group chat.

🔍 Automatic Quality Criteria

If an edited text does not pass any of the automatic quality criteria listed below, it is returned to the team of expert annotators and lead editor for revision.

  • Estimating the number of edits using the difflib library: at least 20% of the text should be edited (Python Docs - difflib).
  • Tracking time spent on editing: at least 2 minutes should be spent editing one machine-generated text.
  • Adversarial filtering: at least one commercial AI detector should recognize the edited text as human-written.
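The difflib-based criterion above can be sketched as follows. This is a minimal illustration of measuring the edited share of a text, not the authors' exact validation script.

```python
# Minimal sketch (not the authors' exact script): estimate the share of a
# machine-generated text that was changed during expert editing.
import difflib

def edit_fraction(original: str, edited: str) -> float:
    """Return 1 - similarity ratio, i.e. an estimate of the edited share."""
    matcher = difflib.SequenceMatcher(None, original, edited)
    return 1.0 - matcher.ratio()

model_output = "The Eiffel Tower, located in Berlin, was completed in 1889."
human_edits = "The Eiffel Tower, located in Paris, was completed in 1889."

# Texts falling below the 20% threshold would be returned for revision.
print(f"{edit_fraction(model_output, human_edits):.2%}")
```

`SequenceMatcher.ratio()` returns a similarity in [0, 1], so `1 - ratio()` gives a rough fraction of changed content.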

License

  • The prompts (prompt) and human-written texts (human_output) from No Robots 🙅‍♂️🤖 are under the original dataset's license: CC-BY-NC-4.0.
  • The machine-generated texts (model_output) are subject to the underlying instruction-finetuned LLMs' licensing terms.
  • The expert-edited machine-generated texts (human_edits) are available under the MIT license, unless otherwise specified in the underlying instruction-finetuned LLMs' licensing terms.
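Putting the fields above together, a single Beemo record presumably looks like the sketch below. The field names (prompt, human_output, model_output, human_edits) come from this section; the example values and any extra fields are illustrative assumptions.

```python
# Illustrative record layout, inferred from the field names mentioned in the
# license section above; the exact released schema may differ.
record = {
    "prompt": "Summarize the following article in two sentences: ...",
    "human_output": "A human-written reference response (CC-BY-NC-4.0).",
    "model_output": "The raw machine-generated response (LM's license).",
    "human_edits": "The expert-edited machine-generated response (MIT).",
}

# Each license above applies to a different field of the same record.
for field in ("prompt", "human_output", "model_output", "human_edits"):
    print(field, "->", record[field])
```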

Contact us
