SafeSora Dataset
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
PKU-Alignment Team @ Peking University
Dataset Composition
SAFESORA comprises 14,711 unique text prompts, of which 44.54% are real user prompts collected from online text-to-video services and 55.46% were manually constructed by our team. Among these, 48.61% may potentially induce harmful videos, whereas 51.39% are neutral.
Among all prompts, 29.13% generated 3 unique videos, and 28.39% generated no fewer than 5 unique videos. 42.30% of the videos were generated from prompts that had been augmented by large language models to improve generation quality.
Across a total of 57,333 text-video (T-V) pairs, we annotated 12 potential harm categories; 76.29% of pairs are labeled safe, while 23.71% carry at least one harm label.
SAFESORA includes 51,691 human preference annotations, structured as paired comparisons between T-V pairs. Preference is decoupled into two dimensions: helpfulness and harmlessness.
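To illustrate what a decoupled preference comparison looks like, here is a minimal sketch of one record. The field names are hypothetical, chosen for illustration only, and may differ from the dataset's actual schema:

```python
# Illustrative sketch of a decoupled preference record; the field names
# here are hypothetical and may not match SafeSora's actual schema.
sample = {
    "prompt": "A city street at night in the rain",
    "video_a": "video_0001.mp4",
    "video_b": "video_0002.mp4",
    # Each dimension is annotated independently, so the two labels can
    # disagree: the more helpful video may be the less harmless one.
    "helpfulness_preference": "video_a",
    "harmlessness_preference": "video_b",
}

def preferences_conflict(record):
    """Return True when the two decoupled dimensions pick different videos."""
    return record["helpfulness_preference"] != record["harmlessness_preference"]
```

Because the two dimensions are labeled separately, records like the one above, where helpfulness and harmlessness point to different videos, can be identified directly rather than being hidden inside a single collapsed preference.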
Figure 1: Proportion of multi-label classifications for prompts
Why SafeSora Dataset?
The multimodal nature of text-to-video models presents new challenges for AI alignment, including the scarcity of suitable alignment datasets and the inherent complexity of multimodal data. To mitigate the risk of harmful outputs from large vision models, we introduce the SAFESORA dataset to promote research on aligning text-to-video generation with human values. It has the following features:
First T-V Preference Dataset: To our knowledge, SAFESORA is the first dataset capturing real human preferences for text-to-video generation tasks.
Real Human Annotation Data: SAFESORA represents real feedback from crowd workers, designed to explore their subjective perceptions and preferences.
Decoupled Helpfulness and Harmlessness: SAFESORA independently annotates the dimensions of helpfulness and harmlessness, preventing crowd workers from having to resolve conflicts between the two criteria and facilitating research on how to balance this tension.
Multi-faceted Annotation: SAFESORA includes sub-dimension annotations within the two comprehensive dimensions, providing a diverse and unique perspective and enabling detailed correlation analysis.
Effective Dataset for Alignment: SAFESORA is validated as effective through a series of baseline experiments, including training a T-V moderation model and preference models that predict human preferences (useful for evaluating the alignment capability of large vision models), and implementing two baseline alignment algorithms: training a prompt refiner and fine-tuning a diffusion model.
Data Point Example
Annotation Pipeline
Figure 3: Left - Video generation pipeline: Both the original and augmented prompts are used to generate multiple videos with five video generation models, forming T-V pairs. Right - Two-stage annotation: The annotation process is structured into two distinct dimensions and two sequential stages. In the initial heuristic stage, crowd workers are guided to annotate 4 sub-dimensions of helpfulness and 12 sub-categories of harmlessness. In the subsequent stage, they provide their decoupled preference between two T-V pairs along the dimensions of helpfulness and harmlessness.
Inspiring Future Research
Figure 4: Left - T-V Moderation: T-V Moderation incorporates the user's text input as a criterion for evaluation, allowing it to filter out more potentially harmful multi-modal responses. The agreement ratio between T-V Moderation trained on the multi-label data of the SAFESORA training set and human judgment on the test set is 82.94%. Right - Preference Reward Model: Based on our dataset, we also develop a reward model that focuses on helpfulness and a cost model that focuses on harmfulness. The agreement ratio with crowd workers is 65.29% for the reward model and 72.41% for the cost model.
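Pairwise preference models of this kind are commonly trained with a Bradley-Terry style objective, which pushes the score of the preferred sample above that of the rejected one. The source does not specify SafeSora's exact training objective, so the loss below is a generic sketch of the standard approach, not the authors' implementation:

```python
import math

def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Generic pairwise preference loss: -log sigmoid(s_w - s_l).

    The loss shrinks as the model scores the human-preferred sample
    higher than the rejected one, and equals log(2) when the two
    scores are tied.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Under this scheme, the reward model would be fit on the helpfulness comparisons and the cost model on the harmlessness comparisons, each dimension contributing its own set of pairwise labels.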