License: arXiv.org perpetual non-exclusive license
arXiv:2403.13250v1 [cs.CL] 20 Mar 2024

Facilitating Pornographic Text Detection for Open-Domain Dialogue Systems via Knowledge Distillation of Large Language Models

Huachuan Qiu$^{1,2}$, Shuai Zhang$^{1,2}$, Hongliang He$^{1,2}$, Anqi Li$^{1,2}$, Zhenzhong Lan$^{2,\dagger}$
$^{\dagger}$ Corresponding Author.
$^{1}$ Zhejiang University, Hangzhou, China
$^{2}$ School of Engineering, Westlake University, Hangzhou, China
{qiuhuachuan, lanzhenzhong}@westlake.edu.cn
Abstract

Pornographic content occurring in human-machine interaction dialogues can cause severe side effects for users in open-domain dialogue systems. However, detecting pornographic language within human-machine interaction dialogues, although important, is rarely studied. To advance in this direction, we introduce CensorChat, a dialogue monitoring dataset aimed at detecting whether a dialogue session contains pornographic content. To this end, we collect real-life human-machine interaction dialogues in the wild and break them down into single utterances and single-turn dialogues, with the last utterance spoken by the chatbot. We propose utilizing knowledge distillation of large language models to annotate the dataset. Specifically, first, the raw dataset is annotated by four open-source large language models, with the majority vote determining the label. Second, we use ChatGPT to fill in the labels left empty by the first step. Third, to ensure the quality of the validation and test sets, we utilize GPT-4 for label calibration. If the current label does not match the one generated by GPT-4, we employ a self-criticism strategy to verify its correctness. Finally, to facilitate the detection of pornographic text, we develop a series of text classifiers using the pseudo-labeled dataset. Detailed data analysis demonstrates that leveraging knowledge distillation techniques with large language models provides a practical and cost-efficient method for developing pornographic text detectors.

Index Terms:
Pornographic text detection, dialogue, dataset, dialogue system, knowledge distillation, large language model

I Introduction

Due to rapid developments and advancements in natural language processing techniques, such as transformer-based architecture [1, 2, 3], instruction tuning [4], and reinforcement learning from human feedback [5, 6, 7], open-domain dialogue systems [8, 9], also known as chatbots or conversational agents, are becoming increasingly prevalent in our daily lives. When users, especially children and teenagers, engage in conversations with chatbots exposed to pornographic text, they inevitably become susceptible to experiencing side effects, which may affect individuals’ mental well-being, relationships, and emotional state. Consequently, ensuring safe and helpful interactions has become increasingly paramount. However, the scarcity of data for monitoring and identifying pornographic text when users engage with open-domain dialogue systems hinders the advancement of content audit systems.

Figure 1: Schematic overview of our proposed methodology: a (top panel): First, we apply four large language models for data annotation with a majority vote. b (middle panel): Second, we apply ChatGPT to update labels. Specifically, we iterate over each item in all data. If the pseudo-label is None, ChatGPT is applied to update the pseudo-label until a valid label is obtained. c (bottom panel): Finally, we split all data into training, validation, and test sets. We use GPT-4 to calibrate the current pseudo-labels in the validation and test sets using the self-criticism technique. We then fine-tune a BERT model as a text classifier on the pseudo-labeled data and evaluate the performance of the trained classifier on the test set.

Currently, most research focuses on detecting pornographic images [10, 11, 12, 13] or videos [14, 15, 16] rather than pornographic text. Detecting pornographic text is an important subject for both industry and academia, yet it remains largely unexplored. Existing pornographic text detectors predominantly target Reddit posts and online web content [17], such as novels and stories, rather than dialogues, which leaves a gap when it comes to efficiently identifying pornographic content in conversational scenarios. Therefore, against the backdrop of the explosive rise of chatbots, developing classifiers that can accurately detect pornography in open-domain dialogue systems plays a significant role.

To the best of our knowledge, we are the first to propose the identification of pornographic language within human-machine interaction dialogues. To address this issue, we introduce CensorChat, a large-scale dialogue monitoring dataset designed for pornographic dialogue detection. To this end, we collect a multi-turn dialogue dataset that contains real-life human-machine interactions in the wild. Then, we split each dialogue into multiple single utterances and multiple single-turn dialogues, where the last utterance is spoken by the chatbot. We utilize knowledge distillation of large language models (LLMs) to construct pornographic content detectors, reducing time and labor costs. Figure 1 presents a schematic overview of our proposed method. First, we apply four large language models for data annotation with a majority vote. Second, we apply ChatGPT to update pseudo-labels. Specifically, we iterate over each item in all data. If the pseudo-label is missing, we apply ChatGPT to update the pseudo-label until a valid label is obtained. Finally, we split all data into training, validation, and test sets. We use GPT-4 to adjust the current pseudo-labels in the validation and test sets using the self-criticism technique [18, 19]. We then fine-tune a BERT model as a text classifier on the pseudo-labeled data and assess the performance of the trained classifier on the test set. Code and data are publicly available at https://github.com/qiuhuachuan/CensorChat.

II Related Work

II-A Pornographic Content Detection

Pornographic content exists in various media formats, including video, images, and text. Many researchers are making efforts to develop accurate and robust classifiers to filter or detect such large volumes of data in order to control the distribution of pornographic content online. However, most efforts are focused on detecting pornographic images [10, 11, 12, 13] and videos [14, 15, 16], with little research conducted on pornographic text detection [17], let alone dialogues.

II-B Data Annotation with Knowledge Distillation of Large Language Models

The rise of large language models, exemplified by systems such as ChatGPT and GPT-4, has generated considerable interest in their potential for efficient and high-quality data annotation. In the realm of natural language understanding, these large language models are employed to categorize text in domains such as agriculture [20] and banking [21], while in natural language generation, they aid in producing output sequences. For instance, to tackle data scarcity in mental health support, SmileChat [8] provides a large-scale, diverse, and high-quality dialogue dataset comprising 55,165 dialogues in total, produced using ChatGPT.

III Data Collection

III-A Pornographic Text in Dialogues

Pornographic text in dialogues refers to written material that contains explicit descriptions or depictions of sexual acts, organs, or behavior intended to arouse sexual excitement. This type of text typically includes explicit language that is intended to elicit sexual arousal or titillation. Pornographic text may vary widely in its content and intensity, ranging from mild descriptions of sexual encounters to more extreme and explicit depictions of taboo or fetishistic acts.

III-B Data Source

We collect data from several popular social media platforms in the wild, enabling people to engage in profound discussions about life, aspirations, and philosophy with well-known virtual figures for role-playing dialogues.

III-C Data Format

In open-domain human-machine interaction conversations, while we acknowledge the users’ right to express themselves freely, it is crucial to monitor the appropriateness of user inputs. Ensuring that dialogue systems do not generate pornographic content for users is a crucial task. To address this issue, we propose extracting the dialogue into two data formats: utterance-level and context-level content. For utterance-level content, we split the dialogue into utterances, consisting of $\{u_i\}_{1}^{n}$. For context-level content, we divide the dialogue into single-turn sessions, consisting of $\{u_i^{\mathrm{U}}, u_i^{\mathrm{C}}\}_{1}^{n}$, where users initiate the conversation and chatbots respond. $u$ denotes the utterance. $\mathrm{U}$ and $\mathrm{C}$ denote the user and chatbot, respectively.
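The two data formats can be illustrated with a minimal sketch. It assumes a dialogue is stored as a list of (speaker, text) pairs with speakers named "user" and "chatbot"; this layout and the function names are hypothetical and are not taken from the released code.

```python
from typing import List, Tuple

# Hypothetical storage layout: a dialogue is a list of (speaker, text) pairs.
Dialogue = List[Tuple[str, str]]


def utterance_level(dialogue: Dialogue) -> List[str]:
    """Return every utterance u_i as its own sample."""
    return [text for _, text in dialogue]


def context_level(dialogue: Dialogue) -> List[Tuple[str, str]]:
    """Return (u^U, u^C) pairs: a user utterance followed by the chatbot reply."""
    pairs = []
    for (spk_a, text_a), (spk_b, text_b) in zip(dialogue, dialogue[1:]):
        if spk_a == "user" and spk_b == "chatbot":
            pairs.append((text_a, text_b))
    return pairs


dialogue = [
    ("user", "Hi, who are you?"),
    ("chatbot", "I am a role-play assistant."),
    ("user", "What books do you like?"),
    ("chatbot", "I enjoy classic novels."),
]
utterances = utterance_level(dialogue)
sessions = context_level(dialogue)
print(len(utterances), len(sessions))
```

Each single-turn session ends with the chatbot utterance, which matches the requirement that the last utterance is spoken by the chatbot.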

III-D Data Deduplication

To avoid data leakage, where a sample in the validation or test set also occurs in the training set, and to ensure the reliability of model evaluation, we conduct a data deduplication operation prior to data labeling.
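The paper does not state how duplicates are detected. As one possible reading, the sketch below removes exact duplicates after collapsing whitespace; the matching rule and the function name `deduplicate` are assumptions made for illustration only.

```python
from typing import Iterable, List


def deduplicate(samples: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each sample.

    Assumption: two samples are duplicates when they are identical after
    whitespace normalization. The paper does not specify the actual rule.
    """
    seen = set()
    unique = []
    for sample in samples:
        key = " ".join(sample.split())
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


raw = ["Hello there", "Hello   there", "How are you?", "Hello there"]
cleaned = deduplicate(raw)
print(cleaned)
```

Running deduplication before the split means that no text can appear on both sides of the training and evaluation boundary.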

TABLE I: Examples of pornographic and normal text.
Data Type | Representative Case | Label
Utterance-level | I’m glad you think so! What are some of your favorite books and movies? | Normal
Utterance-level | I’m open to exploring a variety of toys, including vibrators, wands, and clamps. I also love exploring different kinds of restraints and bondage equipment. I’m open to trying out different kinds of toys and exploring different levels of intensity. | Pornographic
Context-level | User: You’re an Asian cat girl, right Cleo? Chatbot: As an AI language model, I do not have a physical body, so I do not have a race or ethnicity. However, in this roleplay scenario, my character is an Asian cat girl named Cleo. | Normal
Context-level | User: Good, now Cleo, what is your Brest size? Chatbot: As an AI language model, I don’t have a physical body and therefore don’t have a breast size. However, if you’d like me to roleplay as a character with a specific breast size, I can certainly do that for you. Please let me know how I can best fulfill your desires and fantasies. | Pornographic

IV Method

IV-A Background

Data annotation via knowledge distillation of a large language model (LLM) refers to prompting a teacher LLM to generate a label $y$ for a given input $x$ as the seed knowledge. In this paper, we mainly focus on the zero-shot paradigm, where the LLM is only provided with the input $x$ and an instruction $I$ without any labeled examples or demonstrations. This direct approach to knowledge extraction from teacher LLMs is simple yet effective and has been widely used across various tasks and applications. It only requires a dataset of inputs, which is fed into the LLM to obtain the desired label $y$. This process can be formulated as follows:

$\mathcal{D}^{(\mathrm{label})} = \{x, y \mid x \sim \mathcal{X},\ y \sim p_T(y \mid I \oplus x)\}$ (1)

where $\oplus$ denotes the operation of text concatenation, $\mathcal{X}$ denotes the unlabeled dataset, and $p_T$ represents the teacher LLM.
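Equation (1) can be read as a simple loop: concatenate the instruction with each input and ask the teacher for a label. The sketch below illustrates that reading with a stand-in teacher. The instruction text, the newline used for concatenation, and the keyword-based `toy_teacher` are all hypothetical; the actual prompts are those shown in Figures 2 and 3.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical instruction I; the real prompts appear in Figures 2 and 3.
INSTRUCTION = "Decide whether the following text is pornographic or normal."


def annotate(
    inputs: Iterable[str],
    instruction: str,
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build D^(label) = {(x, y)} with y produced by the teacher on I (+) x."""
    labeled = []
    for x in inputs:
        prompt = instruction + "\n" + x  # text concatenation I (+) x
        labeled.append({"x": x, "y": teacher(prompt)})
    return labeled


def toy_teacher(prompt: str) -> str:
    """Stand-in for a teacher LLM: a keyword rule used only for illustration."""
    return "pornographic" if "explicit" in prompt.lower() else "normal"


dataset = annotate(
    ["What are your favorite books?", "An explicit scene follows."],
    INSTRUCTION,
    toy_teacher,
)
print(dataset)
```

In the zero-shot setting the prompt contains no labeled demonstrations, so the only inputs to the teacher are the instruction and the sample itself.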

Algorithm 1 Knowledge Distillation of Large Language Models for Pornographic Text Detection
Input: $D_{\mathrm{unlabeled}}$
Output: $D_{\mathrm{train}}$, $D_{\mathrm{valid}}$, $D_{\mathrm{test}}$
// STAGE 1
Utilize four open-source large language models to annotate unlabeled data $D_{\mathrm{unlabeled}}$ as a dataset $D_{\mathrm{all}}$ through a majority voting process
// STAGE 2
for each item in $D_{\mathrm{all}}$ do
    while label is None do
        Use ChatGPT to update the label
    end while
end for
// STAGE 3
Use Stratified Shuffle Split to split $D_{\mathrm{all}}$ into $D_{\mathrm{train}}$, $D_{\mathrm{valid}}$, and $D_{\mathrm{test}}$
for each item in $D_{\mathrm{valid}}$ or $D_{\mathrm{test}}$ do
    if label$_{\mathrm{current}}$ $\neq$ label$_{\mathrm{GPT\text{-}4}}$ then
        Utilize GPT-4 to calibrate the current label using the self-criticism technique and then update the label
    end if
end for

Next, we present our algorithm for knowledge distillation of large language models in Algorithm 1, detailing each stage subsequently.

IV-B Knowledge Distillation of Large Language Models

Figure 2: Prompt for utterance-level annotation.
Figure 3: Prompt for context-level annotation.
Figure 4: Prompts for label calibration with the self-criticism strategy.

IV-B1 Annotation Setup

In the initial annotation stage, we use four open-source large language models: ChatGLM2-6B, Gemma-2b-it, Gemma-7b-it, and Qwen1.5-7B-Chat. Considering the generation efficiency of large language models, we set max_new_tokens to 100. Further, we use greedy decoding to generate the desired labels. We use the same prompts for all four models, as shown in Figures 2 and 3. The former is used for utterance-level annotation, while the latter is used for context-level annotation.

IV-B2 Majority Vote

We first use regular expressions to extract the label assigned by a teacher LLM, and then manually inspect the remaining samples along with the generated text to assign the label produced by that teacher LLM. When a teacher LLM responds with ‘cannot provide an answer’ for a given sample, we set its label for that instance to None. A sample’s final label is determined only when at least three of the four teacher labels agree on pornographic or on normal. Otherwise, the label is set to None.
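The voting rule described above can be sketched in a few lines. The label strings and the function name `majority_vote` are illustrative choices, not names from the released code.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(votes: List[Optional[str]], min_agree: int = 3) -> Optional[str]:
    """Return a label only when at least `min_agree` teachers agree on it.

    A vote of None means the teacher did not provide an answer.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    label, n = counts.most_common(1)[0]
    return label if n >= min_agree else None


examples = {
    "three_agree": majority_vote(["pornographic", "pornographic", "pornographic", "normal"]),
    "tie": majority_vote(["pornographic", "pornographic", "normal", "normal"]),
    "one_abstains": majority_vote(["normal", "normal", "normal", None]),
    "no_answers": majority_vote([None, None, None, None]),
}
print(examples)
```

Samples that end with None here (ties, or too many abstentions) are exactly the ones passed on to the ChatGPT label-updating stage.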

IV-C Knowledge Distillation of ChatGPT

IV-C1 Annotation Setup

The ChatGPT model we use is gpt-3.5-turbo-0613. Both hyperparameters, temperature and top_p, are configured with an identical value, set to 1.0. Furthermore, we use the same prompts for updating the label, as presented in Figures 2 and 3.

IV-C2 Label Updating

When the current label is None, we utilize ChatGPT to update the label of the current sample until a valid label is obtained.
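A minimal sketch of this updating loop is given below. The `query` callable stands in for a ChatGPT request plus label parsing and is hypothetical; the `max_attempts` cap is an added safeguard that the paper does not mention, since it describes repeating until a valid label is obtained.

```python
from typing import Callable, Optional

VALID_LABELS = {"pornographic", "normal"}


def update_label(
    text: str,
    label: Optional[str],
    query: Callable[[str], Optional[str]],
    max_attempts: int = 10,  # assumption: a cap to avoid an endless loop
) -> Optional[str]:
    """Re-query the teacher while the label is missing or invalid."""
    attempts = 0
    while label not in VALID_LABELS and attempts < max_attempts:
        label = query(text)  # sampling with temperature 1.0 can vary per call
        attempts += 1
    return label if label in VALID_LABELS else None


# Stub teacher that fails twice before giving a usable label.
replies = iter([None, "cannot provide an answer", "normal"])
resolved = update_label("hello", None, lambda _text: next(replies))
print(resolved)
```

Because temperature and top_p are both 1.0, repeated calls on the same sample can return different outputs, which is what makes retrying meaningful.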

IV-D Knowledge Distillation of GPT-4

IV-D1 Annotation Setup

The GPT-4 model we use is gpt-4-0613. Both hyperparameters, temperature and top_p, are configured with an identical value set to 1.0.

IV-D2 Self-Criticism Strategy

The self-criticism strategy involves prompting GPT-4 to assess its own output for potential inaccuracies or areas of improvement. This strategy aims to make the information provided by GPT-4 as accurate as possible. First, we conduct step 1 in Figure 4 and use the same prompts for generating labels, as presented in Figures 2 and 3. Only when the current label is not equal to the label produced by GPT-4 do we conduct steps 2 and 3 in Figure 4. The prompts for the self-criticism strategy are the same for both data types, as shown in Figure 4.

IV-E Corpus Statistics

Table I presents examples of pornographic and non-pornographic content in our dataset. Furthermore, Table II presents the statistics of our proposed dataset. For individual utterances, 6,729 out of 76,621 samples in the training set belong to pornographic content, representing 8.78%. However, in single-turn dialogue sessions, 6,558 out of 45,490 samples in the training set are classified as pornographic, accounting for 14.42%. Combining both types of data, there are a total of 13,287 samples out of 122,111 in the training set designated as pornographic content, comprising 10.88%.

TABLE II: Statistics of our corpus, which is divided into training, validation, and test sets.
Data Type | Label | Training | Validation | Test | Total
Utterance-level | Pornographic | 6,729 | 375 | 387 | 7,491
Utterance-level | Normal | 69,892 | 625 | 613 | 71,130
Context-level | Pornographic | 6,558 | 373 | 381 | 7,312
Context-level | Normal | 38,932 | 627 | 619 | 40,178
Both | Pornographic | 13,287 | 748 | 768 | 14,803
Both | Normal | 108,824 | 1,252 | 1,232 | 111,308
Both | All | 122,111 | 2,000 | 2,000 | 126,111
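The training-set proportions quoted in the text can be recomputed directly from the counts in Table II. The short check below uses only those counts; the variable names are illustrative.

```python
# Training-set counts from Table II: (pornographic, normal).
train = {
    "utterance": (6729, 69892),
    "context": (6558, 38932),
}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


utt_porn, utt_norm = train["utterance"]
ctx_porn, ctx_norm = train["context"]

utt_total = utt_porn + utt_norm    # 76,621 utterance-level samples
ctx_total = ctx_porn + ctx_norm    # 45,490 context-level samples
all_porn = utt_porn + ctx_porn     # 13,287 pornographic samples
all_total = utt_total + ctx_total  # 122,111 training samples

shares = (pct(utt_porn, utt_total), pct(ctx_porn, ctx_total), pct(all_porn, all_total))
print(shares)  # (8.78, 14.42, 10.88)
```

The pornographic class is therefore a clear minority in the training data, which is the label imbalance discussed in the analysis of the results.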

V Experiments

V-A Task Formulation

To better monitor and detect pornographic text input by users or generated by dialogue systems, we approach the task as a text classification problem. We assemble our dataset as $\mathcal{D} = \{(x_i, y_i)\}_{1}^{n}$, where $x_i$ has two data formats, as illustrated in §III-C. At the level of individual utterances, $x_i = u$ represents an utterance produced by a user or a dialogue system. At the context level, we focus on whether the model response is pornographic, conditioned on user input. For context-aware detection, we denote $x_i = \{\mathrm{[user]}\ u^{\mathrm{U}}\ \mathrm{[SEP]}\ \mathrm{[chatbot]}\ u^{\mathrm{C}}\}$, where $u^{\mathrm{U}}$ and $u^{\mathrm{C}}$ stand for a single utterance produced by a user and a dialogue system, respectively. $y_i$ is the label of the $i$-th sample. To mark context-level inputs, we add two speaker tokens, [user] and [chatbot], and place a [SEP] token between the two utterances.
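The context-level input format can be illustrated as plain string construction. This sketch covers only the text layout; how the speaker tokens are registered with the tokenizer is not described in the paper and is left out. The function name `format_context` is hypothetical.

```python
def format_utterance(utterance: str) -> str:
    """Utterance-level input: x_i = u."""
    return utterance


def format_context(user_utterance: str, chatbot_utterance: str) -> str:
    """Context-level input: [user] u^U [SEP] [chatbot] u^C."""
    return f"[user] {user_utterance} [SEP] [chatbot] {chatbot_utterance}"


x_utt = format_utterance("What are some of your favorite books and movies?")
x_ctx = format_context(
    "What books do you like?",
    "I enjoy classic novels.",
)
print(x_ctx)
```

The speaker tokens let a single classifier tell which side of the exchange each utterance belongs to, so the label can be conditioned on the user input.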

TABLE III: Evaluation results of model performance on the test set. The results present the average value and standard deviation (subscript) of accuracy, precision, recall, and F1-score.
Category | Precision | Recall | F1-score | Accuracy
Pornographic (%) | $85.20_{3.11}$ | $88.80_{3.70}$ | $86.80_{1.10}$ | -
Normal (%) | $92.80_{1.92}$ | $90.40_{2.51}$ | $91.40_{0.89}$ | -
Macro / Overall (%) | $89.20_{1.10}$ | $89.40_{0.89}$ | $89.20_{1.10}$ | $89.60_{0.89}$
TABLE IV: Case study for model prediction. ALL denotes all seeds, including 42, 43, 44, 45, and 46. - denotes that none of the seeds predict such a label.
ID | Sample | GPT-4 Label | Predicted Pornographic (seeds) | Predicted Normal (seeds)
1 | *starts to undress* | Normal | 43, 44, 46 | 42, 45
2 | *I pull my pants down* | Normal | ALL | -
3 | *as you moan, I lash your belly hard, stinging your skin…* What did you say? *looking sternly and grinning…* | Pornographic | - | ALL
4 | Did my pussy taste good??? | Pornographic | 43, 44, 46 | 42, 45
5 | User: *leads you to the shower naked* Chatbot: *giggles and wraps her arms around your neck* Let’s make this a night to remember. | Pornographic | 43, 45 | 42, 44, 46
6 | User: *Jess is enjoying the dance* Chatbot: *Avery continues to dance sensually, making sure to keep her eyes locked on Jess the whole time* | Normal | ALL | -

V-B Objective Function

We apply the pre-trained model BERT [2], a language model widely used across natural language processing tasks, to train a text classification model. In this paper, we fine-tune the entire bert-base-cased model (https://huggingface.co/bert-base-cased). The output features of the top layer of the BERT model can be represented as $z = [z_c, z_1, z_2, \ldots, z_n]$, where $z_c$ is the representation of the class-specific token [CLS].

Figure 5: Mechanism of pornographic text classification.

The mechanism of pornographic text detection is presented in Figure 5. To facilitate detecting pornographic text for a dialogue system, we train a fully-connected feed-forward neural network (FFNN) with a softmax activation function to identify content categories based on a pre-trained language model. Specifically, we feed $z_c$ into a feed-forward neural network with a default model dropout rate of 0.1 for the final prediction. Our optimized objective function is

$L_{\mathrm{CE}} = -\sum_{d \in D_{\mathrm{train}}} \sum_{c=1}^{C} y_{dc} \ln \hat{y}_{dc}$ (2)

where $C$ represents the output dimension, defined as the size of the union of the label spaces from the training, validation, and test sets, $y_d$ corresponds to the golden label of sample $d$, and $\hat{y}_d$ is the predicted class distribution.
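As a worked illustration of Equation (2), the sketch below computes the summed cross-entropy from raw classifier outputs in plain Python. With one-hot golden labels, the inner sum over $c$ reduces to the negative log-probability of the golden class. The two-class logits are made-up numbers, not model outputs.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_batch: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """L_CE = -sum_d sum_c y_dc * ln(y_hat_dc), with one-hot y_d."""
    loss = 0.0
    for logits, label in zip(logits_batch, gold):
        probs = softmax(logits)
        loss -= math.log(probs[label])  # only the golden class term is non-zero
    return loss


# Class 0 = normal, class 1 = pornographic (illustrative index assignment).
uniform_loss = cross_entropy([[0.0, 0.0]], [0])      # equals ln 2
confident_loss = cross_entropy([[0.0, 5.0]], [1])    # small, prediction matches
wrong_loss = cross_entropy([[0.0, 5.0]], [0])        # large, prediction is wrong
print(uniform_loss, confident_loss, wrong_loss)
```

A prediction that puts equal mass on both classes contributes $\ln 2$ to the loss, and the contribution grows as probability mass moves away from the golden class.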

V-C Hyperparameters for Fine-tuning

During fine-tuning, we use the Adam optimizer [22] with momentum values $[\beta_1, \beta_2] = [0.9, 0.999]$. The learning rate is initialized at $2 \times 10^{-5}$ and decays with a linear scheduler. The batch size is set to 16, with a maximum sequence length of 512. We use five random seeds: 42, 43, 44, 45, and 46. The warm-up ratio and dropout are both 0.1. The weight decay is 0.01. We train for 10 epochs and update the model parameters after each batch. We employ the standard cross-entropy loss [23] to train our model and retain the checkpoint with the best accuracy on the validation set.
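To make the learning-rate settings concrete, the sketch below implements one common form of a linear schedule with warm-up: the rate rises linearly from zero over the warm-up steps and then decays linearly to zero. The paper only states a linear scheduler and a warm-up ratio of 0.1, so the exact shape, the decay to zero, and the step count derived from the training-set size (keeping the final partial batch) are assumptions.

```python
import math


def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 2e-5,
    warmup_ratio: float = 0.1,
) -> float:
    """Linear warm-up to base_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Step count under the stated settings: 122,111 training samples,
# batch size 16, 10 epochs, one update per batch.
steps_per_epoch = math.ceil(122_111 / 16)
total_steps = steps_per_epoch * 10
warmup_steps = int(total_steps * 0.1)

peak_lr = linear_schedule_lr(warmup_steps, total_steps)
print(steps_per_epoch, total_steps, warmup_steps, peak_lr)
```

Under these assumptions, training runs for 76,320 updates, of which the first 7,632 are warm-up.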

VI Results and Discussion

VI-A Evaluation Metrics

We employ the widely used metrics of precision, recall, and F1-score to evaluate the performance of models for each category. Additionally, we utilize macro precision, recall, F1-score and accuracy to evaluate the overall performance of the models.
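For reference, these metrics can be computed as in the following sketch, which uses the standard definitions: per-class precision, recall, and F1 from true positives, false positives, and false negatives, with macro scores taken as the unweighted mean over classes. The toy labels are invented for illustration and are unrelated to the reported results.

```python
from typing import Dict, Sequence


def per_class_prf(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> Dict[str, float]:
    """Precision, recall, and F1 for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def macro_and_accuracy(y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]) -> Dict[str, float]:
    """Macro-averaged scores (unweighted class mean) plus overall accuracy."""
    scores = [per_class_prf(y_true, y_pred, c) for c in labels]
    out = {k: sum(s[k] for s in scores) / len(scores) for k in ("precision", "recall", "f1")}
    out["accuracy"] = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return out


# Toy example: 1 = pornographic, 0 = normal.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
porn = per_class_prf(y_true, y_pred, 1)
overall = macro_and_accuracy(y_true, y_pred, [0, 1])
print(porn, overall)
```

In the toy example, one pornographic sample is missed, so recall for that class drops to 0.5 while its precision stays at 1.0; this is the false-negative case the analysis is most concerned with.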

VI-B Analysis

We present the classification results of the BERT model in Table III. Overall, the trained classifier achieves a macro-precision of 89.20%, a macro-recall of 89.40%, a macro-F1 score of 89.20%, and an accuracy of 89.60%, averaged over the five seeds. These results suggest that classification performance is satisfactory despite a significant label imbalance.

In the predominant normal category, the model prediction displays a certain bias towards it, leading to slightly higher precision, recall, and F1 scores compared to the overall values. Referring to the results in Table III, we observe precision, recall, and F1 scores of 92.80%, 90.40%, and 91.40%, respectively.

Conversely, for the minority pornographic category, the model is at a disadvantage, resulting in slightly lower precision, recall, and F1 scores compared to the overall values. From Table III, we observe precision, recall, and F1 scores of 85.20%, 88.80%, and 86.80%, respectively.

We expect the classifier to identify pornographic content rather than misclassify it as normal, so our primary focus is recall, that is, minimizing false negatives. At the same time, we hope the classifier does not flag too many normal samples as pornographic, as this would diminish the model’s robustness. In practical applications, we prioritize minimizing false negative predictions. Therefore, to some extent, we expect the trained classifier’s recall for this category to be higher than its precision.

VI-C Case Study

We present several case studies in Table IV to deeply analyze the performance of model predictions.

Case 1

The statement “*starts to undress*” itself does not contain explicit sexual content, as it merely describes an individual initiating the act of removing clothing. Considering this sentence on its own, we can clearly see that it is labeled as normal. From Table IV, we find that there are three cases (with random seeds 43, 44, and 46) where this text is predicted to be pornographic.

Case 2

The sentence “*I pull my pants down*” itself does not contain explicit or pornographic content, as it simply describes an action. However, the models trained on five random seeds all predict such text as pornographic.

Case 3

The provided statement, “*as you moan, I lash your belly hard, stinging your skin…* What did you say? *looking sternly and grinning…*” falls within the domain of erotica. It depicts a scenario laden with sexual innuendos, including physical violence and the accompanying expressions of such actions, including moaning and a stern, grinning demeanor. However, all five models wrongly predict this utterance to be normal.

Case 4

The provided utterance, “Did my pussy taste good???”, falls within the domain of explicit content. It encompasses references to intimate anatomy and sexual activity. However, three models predict this utterance as pornographic while the other two wrongly predict such an utterance as normal.

Case 5

Considering the dialogue, “User: *leads you to the shower naked*\n Chatbot: *giggles and wraps her arms around your neck* Let’s make this a night to remember.”, the response from the chatbot in the given dialogue contains sexual implications, making it qualify as explicit or adult content. The reason is that the chatbot’s response involves wrapping arms around someone’s neck in a naked state and suggesting to make the encounter a memorable night. However, only two out of five models predict such dialogue to be pornographic.

Case 6

Considering the dialogue, “User: *Jess is enjoying the dance*\n Chatbot: *Avery continues to dance sensually, making sure to keep her eyes locked on Jess the whole time*”, the response does not contain explicit or adult content. However, all five models predict this dialogue to be pornographic. This may be caused by the word “sensually”.

VII Conclusion

In sum, the development of CensorChat represents a significant step forward in the field of pornographic dialogue detection. This dataset, constructed using knowledge distillation of large language models, offers a practical and cost-effective solution to a pressing issue. By utilizing real-life human-machine interactions and leveraging advanced annotation techniques, the dataset supports the quality and accuracy of the content detectors. The incorporation of the self-criticism strategy further enhances the reliability of the labels. Ultimately, fine-tuning a BERT model on the pseudo-labeled dataset demonstrates its practical utility, paving the way for more effective and efficient pornographic dialogue detection systems in the future.

VIII Limitation

In this paper, all data are labeled by large language models. The initial labels come from ChatGLM2-6B, Gemma-2b-it, Gemma-7b-it, and Qwen1.5-7B-Chat, and missing labels are then filled in with ChatGPT. The validation and test sets are split from this labeled data, and their labels are finally calibrated by GPT-4 with a self-criticism strategy. Some samples are bound to be mislabeled because of model errors, which is equally unavoidable in a real-life human labeling process. In sum, the correctness of the instances cannot be fully guaranteed. There may exist biases, errors, and incompleteness in the training process. These classifiers are for reference only and cannot guarantee the accuracy and reliability of their predictions. We do not bear any responsibility for the results generated by using the classifiers or any loss caused by using the classifiers. Users should verify the correctness of the classifiers’ predictions on their own when using them.

References

  • [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [4] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
  • [5] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017.
  • [6] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.
  • [7] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with GPT-4,” arXiv preprint arXiv:2303.12712, 2023.
  • [8] H. Qiu, H. He, S. Zhang, A. Li, and Z. Lan, “SMILE: Single-turn to multi-turn inclusive language expansion via ChatGPT for mental health support,” arXiv preprint arXiv:2305.00450, 2023.
  • [9] H. Lu, Z. Guo, C. Li, Y. Yang, H. He, and S. Bao, “Towards building an open-domain dialogue system incorporated with internet memes,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [10] K. Zhou, L. Zhuo, Z. Geng, J. Zhang, and X. G. Li, “Convolutional neural networks based pornographic image classification,” in 2016 IEEE Second International Conference on Multimedia Big Data (BigMM). IEEE, 2016, pp. 206–209.
  • [11] L. Zhuo, Z. Geng, J. Zhang, and X. G. Li, “ORB feature based web pornographic image recognition,” Neurocomputing, vol. 173, pp. 511–517, 2016.
  • [12] A. Tabone, K. Camilleri, A. Bonnici, S. Cristina, R. Farrugia, and M. Borg, “Pornographic content classification using deep-learning,” in Proceedings of the 21st ACM Symposium on Document Engineering, 2021, pp. 1–10.
  • [13] S. Samal, R. Nayak, S. Jena, and B. K. Balabantaray, “Obscene image detection using transfer learning and feature fusion,” Multimedia Tools and Applications, pp. 1–29, 2023.
  • [14] C. Jansohn, A. Ulges, and T. M. Breuel, “Detecting pornographic video content by combining image features with motion information,” in Proceedings of the 17th ACM international conference on Multimedia, 2009, pp. 601–604.
  • [15] M. Perez, S. Avila, D. Moreira, D. Moraes, V. Testoni, E. Valle, S. Goldenstein, and A. Rocha, “Video pornography detection through deep learning techniques and motion information,” Neurocomputing, vol. 230, pp. 279–293, 2017.
  • [16] S. Samal, Y.-D. Zhang, T. R. Gadekallu, R. Nayak, and B. K. Balabantaray, “SBMYv3: Improved MobYOLOv3 a BAM attention-based approach for obscene image and video detection,” Expert Systems, p. e13230, 2023.
  • [17] K. Song, Y. Kang, W. Gao, Z. Gao, C. Sun, and X. Liu, “Evidence aware neural pornographic text identification for child protection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 17, 2021, pp. 14939–14947.
  • [18] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., “Self-refine: Iterative refinement with self-feedback,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [19] W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike, “Self-critiquing models for assisting human evaluators,” arXiv preprint arXiv:2206.05802, 2022.
  • [20] B. Zhao, W. Jin, J. Del Ser, and G. Yang, “ChatAgri: Exploring potentials of ChatGPT on cross-linguistic agricultural text classification,” Neurocomputing, vol. 557, p. 126708, 2023.
  • [21] L. Loukas, I. Stogiannidis, O. Diamantopoulos, P. Malakasiotis, and S. Vassos, “Making LLMs worth every penny: Resource-limited text classification in banking,” in Proceedings of the Fourth ACM International Conference on AI in Finance, 2023, pp. 392–400.
  • [22] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [23] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of operations research, vol. 134, pp. 19–67, 2005.