CN112395414B - Text classification method, and classification model training method, apparatus, medium, and device
- Publication number
- CN112395414B (application CN201910759761.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- processed
- data set
- word slot
- samples
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3343—Query execution using phonetics
Abstract
The embodiments of the present disclosure disclose a text classification method, and a training method, apparatus, medium, and device for a classification model. The text classification method includes: acquiring a text to be processed; labeling the text to be processed with word slot categories according to predefined word slot categories; and performing domain classification on the text to be processed according to the word slot category labeling result to obtain the domain category of the text to be processed. The method and device can accurately classify sentences into domains, thereby improving the accuracy of domain classification.
Description
Technical Field
The present disclosure relates to speech technology, and in particular to a text classification method, and a training method, apparatus, medium, and device for a classification model.
Background
Speech recognition technology, also known as automatic speech recognition (Automatic Speech Recognition, ASR), converts human speech into input that a computer can read. In speech recognition, after human speech is converted into text, the text must be semantically understood before it can be converted into a computer-readable input form.
Short text classification is a key step in semantic understanding. It determines the domain category to which a sentence in a text belongs; for example, "play a children's song" belongs to the "music" domain, and "weather today" belongs to the "weather" domain.
Disclosure of Invention
To solve at least one technical problem in the prior art, the embodiments of the present disclosure provide a technical solution for text classification and a technical solution for classification model training.
According to an aspect of an embodiment of the present disclosure, there is provided a text classification method including:
acquiring a text to be processed;

labeling the text to be processed with word slot categories according to predefined word slot categories;

and performing domain classification on the text to be processed according to the word slot category labeling result to obtain the domain category of the text to be processed.
According to another aspect of an embodiment of the present disclosure, there is provided a training method of a classification model, including:
acquiring a first data set, wherein the samples in the first data set are labeled with domain category information;

labeling the samples in the first data set with word slot categories according to predefined word slot categories;

and training a domain classification model using the first data set according to the word slot category labeling result.
According to still another aspect of the embodiments of the present disclosure, there is provided a text classification apparatus including:
a first acquisition module, configured to acquire a text to be processed;

a labeling module, configured to label the text to be processed acquired by the first acquisition module with word slot categories according to predefined word slot categories;

and a classification module, configured to perform domain classification on the text to be processed according to the word slot category labeling result obtained by the labeling module, to obtain the domain category of the text to be processed.
According to still another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a classification model, including:
a second acquisition module, configured to acquire a first data set, wherein the samples in the first data set are labeled with domain category information;

a labeling module, configured to label the samples in the first data set acquired by the second acquisition module with word slot categories according to predefined word slot categories;

and a first training module, configured to train a domain classification model using the first data set according to the word slot category labeling result obtained by the labeling module.
According to a further aspect of the disclosed embodiments, there is provided a computer readable storage medium storing a computer program for performing the method of any one of the embodiments described above.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic device including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any of the embodiments.
Based on the text classification method and apparatus, the computer readable storage medium, and the electronic device provided by the embodiments of the present disclosure, the text to be processed is labeled with word slot categories according to predefined word slot categories, and domain classification is performed on the text to be processed according to the word slot category labeling result. Because the domain classification of the text to be processed is completed according to the labeled word slot categories, without considering the specific words, sentences can be accurately classified into domains, improving the accuracy of domain classification.
According to the training method and apparatus of the classification model, the computer readable storage medium, and the electronic device, the samples in the first data set, which are labeled with domain category information, are labeled with word slot categories according to predefined word slot categories, and a domain classification model is then trained on the first data set according to the word slot category labeling result. When a domain classification model trained by the method of this embodiment classifies a text to be processed, the domain classification is completed according to the labeled word slot categories; sentences in the text to be processed whose words never appeared in the training samples can still be accurately classified into domains according to their word slot categories, improving the accuracy of domain classification.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof in more detail with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a diagram of a scenario to which the present disclosure applies.
Fig. 2 is a flow chart illustrating a text classification method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flow chart illustrating a text classification method according to another exemplary embodiment of the present disclosure.
Fig. 4 is a flow chart illustrating a text classification method according to still another exemplary embodiment of the present disclosure.
Fig. 5 is a flow chart of a training method of a classification model according to an exemplary embodiment of the present disclosure.
Fig. 6 is a flow chart of a training method of a classification model according to another exemplary embodiment of the present disclosure.
Fig. 7 is a flow chart illustrating a training method of a classification model according to still another exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of a text classification apparatus according to an exemplary embodiment of the present disclosure.

Fig. 9 is a schematic structural diagram of a text classification apparatus according to another exemplary embodiment of the present disclosure.

Fig. 10 is a schematic structural diagram of a training apparatus for classification models according to an exemplary embodiment of the present disclosure.

Fig. 11 is a schematic structural diagram of a training apparatus for classification models according to another exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be understood that any component, data, or structure referred to in the embodiments of the present disclosure may generally be understood as one or more, unless explicitly limited or the context indicates otherwise.
In addition, the term "and/or" in this disclosure merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" in this disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure may be applicable to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments that include any of the foregoing, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Summary of the application
In the course of implementing the present disclosure, the inventors found that existing domain classification methods extract features directly from the original sentences in a text and then classify the short text with a domain classification model. When the training samples are few (i.e., the number of training samples is small and they do not cover the categories comprehensively), domain classification is prone to errors for sentences containing words that do not appear in the training samples, which affects the accuracy of domain classification.
For example, assuming "Liu Dehua's Ice Rain" exists in the training samples, the above prior-art classification method can correctly classify the text "Liu Dehua's Ice Rain"; but for the text "Zhou Jielun's Blue and White Porcelain", which does not exist in the training samples, the method is prone to classification errors.
When the present method performs domain classification on a text, the classification is completed according to the word slot categories labeled on the text, without considering the specific words. Therefore, a sentence containing words that exist in the text but not in any training sample can still be accurately classified according to its word slot categories, improving the accuracy of domain classification.
Exemplary System
The embodiments of the present disclosure can be applied to scenarios of voice interaction with robots, children's toys, speakers, etc., and also to scenarios such as search. Fig. 1 is a diagram of a scenario to which the present disclosure applies. As shown in Fig. 1, when an embodiment of the present disclosure is applied to a voice interaction scenario, an audio collection module (such as a microphone) collects the original audio signal; speech recognition is performed on the speech processed by the front-end signal processing module to obtain text information; semantic understanding and domain classification are performed on the text information; and a search is performed in the information base of the corresponding domain based on the domain classification result, after which the search result is output. For example, for the user utterance "Zhou Jielun's Blue and White Porcelain", embodiments of the present disclosure can classify it into the music domain, search for "Zhou Jielun's Blue and White Porcelain" in the music database, and return the result.
In addition, when an embodiment of the present disclosure is applied to a search scenario, a user may input text information, such as "Li Bai's Quiet Night Thoughts"; the server performs semantic understanding and domain classification on the text information, searches the information base of the corresponding category based on the classification result, and outputs the search result. For example, "Li Bai's Quiet Night Thoughts" is classified into the poetry domain, and the server searches the poetry database with this keyword and returns the poem to the user.
Exemplary method
Fig. 2 is a flow diagram of text classification provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device. As shown in Fig. 2, the text classification method of this embodiment includes the following steps:
Step 101, acquiring a text to be processed.
The text to be processed may be text input by a user, such as "I want to listen to Zhou Jielun's songs", or text information obtained by performing speech recognition on speech input by the user. The speech input by the user may be the original audio signal collected by an audio collection module (such as a microphone), or speech processed by a front-end signal processing module.
The processing of the audio signal by the front-end signal processing module may include, but is not limited to: voice activity detection (Voice Activity Detection, VAD), noise reduction, acoustic echo cancellation (Acoustic Echo Cancellation, AEC), dereverberation, sound source localization, beamforming (Beam Forming, BF), etc.
Voice activity detection (VAD), also called voice endpoint detection or voice boundary detection, detects the presence of voice in an audio signal in a noisy environment and accurately locates the start of the voice segment in the audio signal. It is commonly used in speech processing systems such as speech coding and speech enhancement to reduce the speech coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.
Step 102, labeling the text to be processed with word slot categories according to predefined word slot (slot) categories.
Step 103, performing domain classification on the text to be processed according to the word slot category labeling result to obtain the domain category of the text to be processed.
Based on the text classification method provided by the embodiments of the present disclosure, the text to be processed is labeled with word slot categories according to the predefined word slot categories, and domain classification is performed on the text to be processed according to the word slot category labeling result. Because the domain classification of the text to be processed is completed according to the labeled word slot categories, without considering the specific words, sentences can be accurately classified into domains, improving the accuracy of domain classification.
In the embodiments of the present disclosure, word slots may be predefined across all domain categories. For example, Table 1 below shows word slots defined in an embodiment of the present disclosure:
TABLE 1
| Slot category | Meaning | Examples |
| --- | --- | --- |
| artist | Person name | Zhou Jielun, Liu Dehua, … |
| title | Work title | Blue and White Porcelain, Ice Rain, … |
| poi | Position | Zhongguancun, Xituanmen, … |
| time | Time | today, October 3rd, Tuesday, … |
| location | Place | Beijing, Nanjing, … |
| … | … | … |
In step 102 of the embodiment shown in Fig. 2, the text to be processed is labeled with word slot categories according to the predefined word slot categories, for example:

For the text to be processed "Liu Dehua's Ice Rain", word slot category labeling based on step 102 yields: [Liu Dehua:artist] [Ice Rain:title];

for the text to be processed "navigate to Zhongguancun", word slot category labeling based on step 102 yields: navigate to [Zhongguancun:poi];

for the text to be processed "today's weather", word slot category labeling based on step 102 yields: [today:time] weather.
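For illustration only, such a labeling result can be represented as (token, slot) pairs. Below is a minimal Python sketch that substitutes a toy dictionary lookup for the trained sequence labeling model discussed next; the dictionary contents and function name are assumptions, not part of the patent:

```python
# Toy word slot labeler: a dictionary lookup stands in for a trained
# sequence labeling model; SLOT_DICT is purely illustrative.
SLOT_DICT = {
    "Liu Dehua": "artist",
    "Ice Rain": "title",
    "Zhongguancun": "poi",
    "today": "time",
}

def label_slots(tokens):
    """Return (token, slot) pairs; tokens with no slot get the tag 'O'."""
    return [(tok, SLOT_DICT.get(tok, "O")) for tok in tokens]

print(label_slots(["Liu Dehua", "Ice Rain"]))
# [('Liu Dehua', 'artist'), ('Ice Rain', 'title')]
```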
In some embodiments, in step 102, the text to be processed may be input into a sequence labeling model, and word slot categories of the text to be processed may be labeled through the sequence labeling model.
In some alternative examples, the sequence labeling model may be implemented by a hidden Markov model (Hidden Markov Model, HMM), a maximum entropy model (Maximum Entropy Model, MaxEnt), a conditional random field (Conditional Random Field, CRF), or a neural network such as a convolutional neural network (CNN) or a recurrent neural network (RNN); the embodiments of the present disclosure do not limit the implementation of the sequence labeling model.
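As one concrete possibility among the options above, the following is a minimal PyTorch sketch of a BiLSTM tagger that scores one word slot tag per token. The tag set, vocabulary size, and layer dimensions are illustrative assumptions; the patent does not prescribe a specific architecture:

```python
import torch
import torch.nn as nn

TAGS = ["O", "artist", "title", "poi", "time", "location"]  # illustrative tag set

class BiLSTMTagger(nn.Module):
    """Scores one word slot tag per input token."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, len(TAGS))

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden)              # (batch, seq_len, num_tags)

tagger = BiLSTMTagger(vocab_size=10000)
scores = tagger(torch.tensor([[1, 2, 3]]))   # one 3-token sentence
tags = [TAGS[i] for i in scores.argmax(dim=-1)[0].tolist()]
```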
In this embodiment, the text to be processed is labeled with word slot categories by a pre-trained sequence labeling model. When the sequence labeling model performs the labeling by an HMM, MaxEnt, CRF, or the like, the output sequence itself carries contextual correlations; by exploiting these correlations, the sequence labeling model can achieve higher performance on the input text than conventional classification methods, improving the accuracy and efficiency of word slot category labeling and hence the efficiency of the overall text classification.

Fig. 3 is a flow diagram of text classification provided by another exemplary embodiment of the present disclosure. As shown in Fig. 3, on the basis of the embodiment shown in Fig. 2, step 103 may include the following steps:
step 1031, determining sentence patterns corresponding to the text to be processed according to the result of the word slot category labeling.
In some embodiments, the word slot category labeling result obtained in step 102 may be used directly as the sentence pattern corresponding to the text to be processed. For example, the labeling results "[Liu Dehua:artist] [Ice Rain:title]", "navigate to [Zhongguancun:poi]", and "[today:time] weather" may be used directly as the sentence patterns corresponding to the respective texts to be processed.
In other embodiments, for the word slot category labeling result obtained in step 102, the labeled word slot categories may replace the corresponding words in the text to be processed to obtain the sentence pattern corresponding to the text to be processed. For example, for the labeling results "[Liu Dehua:artist] [Ice Rain:title]", "navigate to [Zhongguancun:poi]", and "[today:time] weather", replacing the corresponding words with the labeled word slot categories yields the sentence patterns "[title] of [artist]", "navigate to [poi]", and "[time] weather".
The embodiments of the present disclosure do not limit the form of the sentence pattern corresponding to the text to be processed, as long as the labeled word slot categories are embodied; a sketch of the second convention follows.
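Building on the illustrative (token, slot) representation above, a minimal sketch of replacing each labeled word with its slot category:

```python
def to_sentence_pattern(labeled_tokens):
    """Replace each slotted token with its slot category; keep plain words."""
    return " ".join(f"[{slot}]" if slot != "O" else tok
                    for tok, slot in labeled_tokens)

pairs = [("navigate", "O"), ("to", "O"), ("Zhongguancun", "poi")]
print(to_sentence_pattern(pairs))  # navigate to [poi]
```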
Step 1032, determining the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed.
In this embodiment, the sentence pattern corresponding to the text to be processed is determined according to the word slot category labeling result. Because the sentence pattern embodies both the labeling result and the structural relationship of the whole text to be processed, determining the domain category based on the sentence pattern yields a more accurate domain category, improving the accuracy and efficiency of the overall text classification.
Fig. 4 is a flow diagram of text classification provided by yet another exemplary embodiment of the present disclosure. As shown in Fig. 4, on the basis of the embodiment shown in Fig. 3, step 1032 may include the following steps:
and 10321, extracting features in sentence patterns corresponding to the text to be processed to obtain text features of the text to be processed.
The text features in the embodiments of the present disclosure may be expressed as feature vectors or feature maps; the embodiments of the present disclosure do not limit the expression of the text features.
Step 10322, performing domain classification on the text to be processed based on the text characteristics of the text to be processed, to obtain domain categories of the text to be processed.
For example, based on the text feature "[title] of [artist]" of the text to be processed, domain classification may be performed to obtain a score for each domain, for example, music domain category: 0.95; navigation domain category: 0.10; weather domain category: 0.10; …; the domain category with the highest score is then selected as the domain category of the text to be processed.
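In code, this final selection is a simple argmax over the per-domain scores (the score values here are the illustrative ones from the example above):

```python
# Pick the domain category with the highest classification score.
scores = {"music": 0.95, "navigation": 0.10, "weather": 0.10}
domain = max(scores, key=scores.get)
print(domain)  # music
```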
In some embodiments, in step 10321, the sentence pattern corresponding to the text to be processed may be segmented and its features extracted based on an N-gram model, i.e., a sliding window of fixed length N, where N is an integer greater than 0, for example 2, 3, or 4. Each labeled word slot category counts as one word. For example, for the sentence pattern "navigate to [poi]", N = 4; for "navigate to", N = 3; for "navigation", N = 2 (the counts reflect the character-level segmentation of the original Chinese text, with each slot tag counted as one unit).

For example, segmenting the sentence patterns with an N-gram model, N = 2 to 4, and extracting features: for the sentence pattern "[title] of [artist]", possible text features include "[title] of", "of [artist]", and "[title] of [artist]"; for the sentence pattern "navigate to [poi]", possible text features include "navigate to", "to [poi]", and "navigate to [poi]"; and for the sentence pattern "[time] weather", possible text features include "[time] weather".
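A minimal sketch of this N-gram feature extraction, assuming the sentence pattern is tokenized into words with each slot tag kept as a single token (an assumption; the original Chinese text is segmented at the character level):

```python
def ngram_features(tokens, n_min=2, n_max=4):
    """All contiguous n-grams of the token list; each slot tag is one token."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

print(ngram_features(["navigate", "to", "[poi]"]))
# ['navigate to', 'to [poi]', 'navigate to [poi]']
```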
In addition, when the training samples are few, a word such as "Liu Dehua" often appears in the music domain category. If the existing domain classification method is used, i.e., features are extracted directly from the original sentence and the short text is then classified by a domain classification model, a text that does not belong to the music domain, such as "how tall is Liu Dehua", would also be classified into the music category, producing a rather serious overfitting phenomenon in short text domain classification. According to the embodiments of the present disclosure, the text to be processed is abstracted into a sentence pattern, and the text features are related only to the sentence pattern, not to the specific word "Liu Dehua"; therefore the text to be processed can be classified correctly, reducing the overfitting of short text domain classification.
In some embodiments, in step 10322, the text features of the text to be processed may be input into a domain classification model, and the domain classification model performs domain classification on the text to be processed to obtain its domain category.

In some alternative examples, the domain classification model may be implemented by a support vector machine (Support Vector Machine, SVM), a maximum entropy model (Maximum Entropy Model, MaxEnt), a neural network, etc., where the neural network may be, for example, a convolutional neural network (CNN) or a recurrent neural network (RNN); the embodiments of the present disclosure do not limit the implementation of the domain classification model.

In this embodiment, performing domain classification on the text to be processed with a pre-trained domain classification model improves the accuracy and efficiency of the domain classification result and hence the efficiency of the overall text classification.
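As one concrete possibility, the sketch below trains a linear SVM on word-level n-grams of sentence patterns. The use of scikit-learn, the token pattern, and the three training examples are assumptions for illustration; the patent only names SVM, MaxEnt, and neural networks as options:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Illustrative sentence patterns and their domain category labels.
patterns = ["[title] of [artist]", "navigate to [poi]", "[time] weather"]
domains = ["music", "navigation", "weather"]

# Word-level n-grams (N = 1..4); slot tags like "[poi]" stay single tokens.
vectorizer = CountVectorizer(ngram_range=(1, 4),
                             token_pattern=r"\[[a-z]+\]|\w+")
classifier = LinearSVC().fit(vectorizer.fit_transform(patterns), domains)

print(classifier.predict(vectorizer.transform(["navigate to [poi]"])))
# ['navigation']
```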
Before the text classification method of the embodiments of the present disclosure is performed, the domain classification model and the sequence labeling model may be trained in advance, and the corresponding operations are then performed based on the trained domain classification model and sequence labeling model.
Fig. 5 is a flow chart of a training method of a classification model according to an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device. As shown in Fig. 5, the training method of the classification model of this embodiment includes the following steps:
Step 201, a first data set is acquired.
The first data set includes samples of at least one domain category, and each sample is labeled with domain category information; the labeled domain category information is taken to be accurate.
Step 202, labeling word slot categories on the samples in the first data set according to the predefined word slot categories.
Step 203, training a domain classification model using the first data set according to the word slot category labeling result.
According to the training method of the classification model provided by the embodiments of the present disclosure, the samples in the first data set, which are labeled with domain category information, are labeled with word slot categories according to the predefined word slot categories, and the domain classification model is then trained on the first data set according to the word slot category labeling result. When a domain classification model trained by the method of this embodiment classifies a text to be processed, the domain classification is completed according to the labeled word slot categories, without considering the specific words. Even when the training samples are few, sentences containing words that do not appear in the training samples can still be accurately classified according to their word slot categories, improving the accuracy of domain classification.
In some of these embodiments, step 203 comprises: determining sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling; and training a domain classification model based on the sentence pattern corresponding to the sample in the first data set.
In some of these alternative examples, training the domain classification model based on the sentence pattern corresponding to the sample in the first dataset may include: extracting features in sentence patterns corresponding to samples in the first data set to obtain text features of the samples in the first data set; a domain classification model is trained based on textual features of the samples in the first dataset.
Fig. 6 is a flow chart of a training method of a classification model according to another exemplary embodiment of the present disclosure. As shown in fig. 6, the training method of the classification model of the present embodiment includes the following steps:
step 301, a first dataset is acquired, wherein samples in the first dataset are labeled with domain category information.
The samples in the first data set may be, for example: "Liu Dehua's Ice Rain", "play a song by Liu Dehua", …, "navigate to Zhongguancun", "I want to go to Xidan, navigate there", …, "today's weather", "it is raining in Beijing today", …, etc. The domain category information labeled on a sample identifies the domain category of the sample, which may include, for example, but is not limited to: music, poetry, navigation, weather, etc. After the domain classification model is trained with the samples labeled with domain category information, it can classify texts belonging to the domain categories of those samples.
Step 302, labeling word slot types on the samples in the first data set according to the predefined word slot types.
The word slot category labels are, for example: "[Liu Dehua:artist] [Ice Rain:title]", "navigate to [Zhongguancun:poi]", "[today:time] weather", and so on.
Step 303, determining a sentence pattern corresponding to the sample in the first data set according to the result of the word slot category labeling.
The sentence pattern may be, for example, the word slot category labeling result itself: "[Liu Dehua:artist] [Ice Rain:title]", "navigate to [Zhongguancun:poi]", "[today:time] weather"; or the result of replacing the corresponding words in the sample with the labeled word slot categories: "[title] of [artist]", "navigate to [poi]", "[time] weather", and so on.
Step 304, extracting the features in the sentence patterns corresponding to the samples in the first data set to obtain the text features of the samples in the first data set.
The features in the sentence pattern corresponding to a sample may be text features extracted from the sentence pattern in a preset feature extraction manner. For example, using an N-gram model with N = 2 to 4: for the sentence pattern "[title] of [artist]" corresponding to a sample, possible text features include "[title] of", "of [artist]", and "[title] of [artist]"; for the sentence pattern "navigate to [poi]", possible text features include "navigate to", "to [poi]", and "navigate to [poi]"; and for the sentence pattern "[time] weather", possible text features include "[time] weather".
Step 305, training a domain classification model based on text features of the samples in the first dataset.
In this embodiment, the samples in the first data set are labeled with word slot categories according to the predefined word slot categories, the sentence patterns corresponding to the samples are determined according to the word slot category labeling result, features are extracted from those sentence patterns, and the domain classification model is trained based on the resulting text features. When a domain classification model trained by the method of this embodiment classifies a text to be processed, the domain classification is completed according to the labeled word slot categories and the features of the corresponding sentence pattern, so sentences containing words that do not appear in the training samples can still be accurately classified according to their word slot categories, improving the accuracy and efficiency of the domain classification result and the efficiency of the overall text classification.
In some of these embodiments, step 305 may include: inputting the text characteristics of the samples in the first data set into a field classification model, and carrying out field prediction on the samples in the first data set through the field classification model to obtain field class prediction information of the samples in the first data set; and training the domain classification model according to the difference between the domain category prediction information of the samples in the first data set and the domain category information of the sample labels in the first data set.
The above step 305 may be an iteratively performed process. In some of these alternative examples, parameters of the domain classification model may be adjusted according to a difference between domain category prediction information of the samples in the first data set and domain category information of the sample labels in the first data set until a training completion condition is met, e.g., the difference between the domain category prediction information of the samples in the first data set and the domain category information of the sample labels in the first data set is less than a preset threshold, or a number of training times for the domain classification model reaches a preset number of times.
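A minimal sketch of this iterative training, assuming a simple linear classifier over text features in PyTorch; the feature size, domain count, learning rate, loss threshold, and epoch cap all stand in for the unspecified preset values:

```python
import torch
import torch.nn as nn

NUM_FEATURES, NUM_DOMAINS = 1000, 5            # illustrative sizes
model = nn.Linear(NUM_FEATURES, NUM_DOMAINS)   # domain classification model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                # prediction-vs-label difference

features = torch.rand(32, NUM_FEATURES)        # text features of the samples
labels = torch.randint(0, NUM_DOMAINS, (32,))  # labeled domain categories

for epoch in range(100):                       # preset number of times
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()                           # adjust model parameters
    if loss.item() < 0.01:                     # preset threshold
        break
```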
In some embodiments, in step 202 or 302 of the embodiments shown in Figs. 5-6, the samples in the first data set may be input into a trained sequence labeling model, and the word slot categories of the samples in the first data set may be labeled by the trained sequence labeling model.
In the embodiment, the word slot categories of the samples in the first data set are marked through the trained sequence marking model, so that the accuracy and the efficiency of word slot category marking are improved.
After the training of the domain classification model in the embodiment shown in Fig. 6 is completed, the model may be used to implement step 103 of the embodiments shown in Figs. 2-4, i.e., performing domain classification on the text to be processed according to the word slot category labeling result to obtain the domain category of the text to be processed; for this operation, see the description of the embodiments shown in Figs. 2-4, which is not repeated here.
Fig. 7 is a flow chart illustrating a training method of a classification model according to still another exemplary embodiment of the present disclosure. As shown in fig. 7, the training method of the classification model of the present embodiment includes the following steps:
step 401, a first data set and a second data set are acquired.
Wherein the samples in the first dataset are labeled with domain category information; the samples in the second data set are labeled with word slot class information according to a predefined word slot class.
The samples in the first dataset and the labeled domain category information thereof can be referred to the description of 301 in the embodiment shown in fig. 6, and will not be described herein.
The samples in the second data set may be, for example: "Liu Dehua's Ice Rain", "play a song by Liu Dehua", …, "navigate to Zhongguancun", "I want to go to Xidan, navigate there", …, "today's weather", "it is raining in Beijing today", …, etc. The word slot category information labeled on the samples in the second data set uses the predefined word slots spanning all domain categories, for example artist, title, poi, time, location, and so on; see Table 1 above. After the sequence labeling model is trained with the samples labeled with word slot category information, it can label texts with the corresponding word slot categories.
Step 402, training a sequence annotation model using the second data set.
Step 403, inputting the samples in the first data set into a sequence labeling model, and labeling word slot types of the samples in the first data set through the sequence labeling model.
Step 404, determining the sentence patterns corresponding to the samples in the first data set according to the word slot category labeling result.
Step 405, extracting the features in the sentence patterns corresponding to the samples in the first data set to obtain the text features of the samples in the first data set.
Step 406, training a domain classification model based on the text features of the samples in the first dataset.
In this embodiment, the sequence labeling model is trained in advance with a sample data set, and the trained sequence labeling model labels the word slot information of the text to be processed, improving the accuracy and efficiency of word slot labeling and hence the efficiency of the overall text classification.
In some of these embodiments, step 402 may include:
inputting the samples in the second data set into the sequence labeling model, and performing word slot category prediction on the samples in the second data set through the sequence labeling model to obtain word slot category prediction information of the samples in the second data set;

training the sequence labeling model according to the difference between the word slot category prediction information of the samples in the second data set and the word slot category information labeled on the samples in the second data set.
The above step 402 may be an iteratively performed process. In some optional examples, parameters of the sequence annotation model may be adjusted according to a difference between word slot class prediction information of the sample in the second data set and word slot class information of the sample annotation in the second data set until a training completion condition is met, e.g., a difference between the word slot class prediction information of the sample in the second data set and the word slot class information of the sample annotation in the second data set is less than a preset threshold, or a number of training times of the sequence annotation model reaches a preset number of times.
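The corresponding sketch for the sequence labeling model differs only in that the loss is computed per token tag. It reuses the illustrative BiLSTMTagger and TAGS from the earlier sketch, and the stopping values are again assumptions:

```python
import torch
import torch.nn as nn

# Assumes BiLSTMTagger and TAGS from the sequence labeling sketch above.
tagger = BiLSTMTagger(vocab_size=10000)
optimizer = torch.optim.Adam(tagger.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

token_ids = torch.randint(0, 10000, (32, 8))    # samples in the second data set
tag_ids = torch.randint(0, len(TAGS), (32, 8))  # labeled word slot categories

for epoch in range(100):                        # preset number of times
    optimizer.zero_grad()
    scores = tagger(token_ids)                  # (batch, seq_len, num_tags)
    loss = loss_fn(scores.reshape(-1, len(TAGS)), tag_ids.reshape(-1))
    loss.backward()
    optimizer.step()                            # adjust model parameters
    if loss.item() < 0.01:                      # preset threshold
        break
```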
After the training of the sequence labeling model and the domain classification model in the embodiment shown in Fig. 7 is completed, the two models may be used to implement steps 102 and 103 of the embodiments shown in Figs. 2-4, respectively; for the relevant parts, see the description of the embodiments shown in Figs. 2-4, which is not repeated here.
Any of the methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including, but not limited to: terminal devices, servers, etc. Alternatively, any method provided by the embodiments of the present disclosure may be performed by a processor invoking corresponding instructions stored in a memory. This is not repeated below.
Exemplary apparatus
Fig. 8 is a schematic structural view of a text classification apparatus according to an exemplary embodiment of the present disclosure. The text classification device can be arranged in electronic equipment such as terminal equipment and a server, and can execute the text classification method of any embodiment of the disclosure. As shown in fig. 8, the text classification apparatus includes: the first obtaining module 501, the labeling module 502 and the classifying module 503. Wherein:
a first obtaining module 501, configured to obtain text to be processed.
The labeling module 502 is configured to label the word slot category for the text to be processed acquired by the first acquisition module 501 according to a predefined word slot category.
In some embodiments, the labeling module 502 may include a sequence labeling model for inputting the text to be processed into the sequence labeling model, and labeling word slot categories of the text to be processed by the sequence labeling model.
And the classification module 503 is configured to perform domain classification on the text to be processed according to the result of the word slot class annotation obtained by the annotation module 502, so as to obtain a domain class of the text to be processed.
Based on the text classification apparatus provided by the embodiments of the present disclosure, the text to be processed is labeled with word slot categories according to the predefined word slot categories, and domain classification is performed on the text to be processed according to the word slot category labeling result. Because the domain classification of the text to be processed is completed according to the labeled word slot categories, without considering the specific words, sentences can be accurately classified into domains, improving the accuracy of domain classification.
Fig. 9 is a schematic structural diagram of a text classification apparatus according to another exemplary embodiment of the present disclosure. On the basis of the embodiment shown in Fig. 8, the classification module 503 includes: a first determining unit 5031, configured to determine the sentence pattern corresponding to the text to be processed according to the word slot category labeling result; and a second determining unit 5032, configured to determine the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed.
In some of these embodiments, the second determining unit 5032 may include: the extraction subunit is used for extracting the characteristics in the sentence patterns corresponding to the text to be processed to obtain the text characteristics of the text to be processed; and the classifying subunit is used for carrying out field classification on the text to be processed based on the text characteristics of the text to be processed to obtain the field category of the text to be processed.
In some optional examples, the classification subunit may include a domain classification model, configured to input text features of the text to be processed into the domain classification model, and perform domain classification on the text to be processed through the domain classification model to obtain a domain class of the text to be processed.
Fig. 10 is a schematic structural diagram of a training apparatus for classification models according to an exemplary embodiment of the present disclosure. The training apparatus of the classification model may be provided in an electronic device such as a terminal device or a server, and may execute the training method of the classification model of any embodiment of the present disclosure. As shown in Fig. 10, the training apparatus of the classification model includes: a second acquisition module 601, a labeling module 602, and a first training module 603. Wherein:
the second obtaining module 601 is configured to obtain a first data set, where samples in the first data set are labeled with domain category information.
The labeling module 602 is configured to label the word slot class for the sample in the first dataset acquired by the second acquisition module according to a predefined word slot class.
The first training module 603 is configured to train the domain classification model by using the first data set according to the result of word slot class labeling obtained by the labeling module.
According to the training apparatus of the classification model provided by the embodiments of the present disclosure, the samples in the first data set, which are labeled with domain category information, are labeled with word slot categories according to the predefined word slot categories, and the domain classification model is then trained on the first data set according to the word slot category labeling result. When a domain classification model trained in this way classifies a text to be processed, the domain classification is completed according to the labeled word slot categories, without considering the specific words. Even when the training samples are few, sentences containing words that do not appear in the training samples can still be accurately classified according to their word slot categories, improving the accuracy of domain classification.
Fig. 11 is a schematic structural view of a training apparatus for classification models according to another exemplary embodiment of the present disclosure. On the basis of the embodiment shown in fig. 10, the first training module 603 includes: a third determining unit 6031, configured to determine, according to the result of the word slot class labeling, a sentence pattern corresponding to the sample in the first data set; the first training unit 6032 is configured to train the domain classification model based on the sentence pattern corresponding to the sample in the first dataset.
Referring again to fig. 11, in some embodiments, the first training unit 6032 may include: the extraction subunit is used for extracting the characteristics in the sentence patterns corresponding to the samples in the first data set to obtain the text characteristics of the samples in the first data set; a training subunit for training a domain classification model based on the text features of the samples in the first dataset.
In some of these alternative examples, the training subunit is specifically configured to: inputting the text characteristics of the samples in the first data set into a field classification model, and carrying out field prediction on the samples in the first data set through the field classification model to obtain field class prediction information of the samples in the first data set; and training the domain classification model according to the difference between the domain category prediction information of the samples in the first data set and the domain category information of the sample labels in the first data set.
In some of these alternative examples, the labeling module 602 is specifically configured to: and inputting the samples in the first data set into a sequence labeling model, and labeling word slot types of the samples in the first data set through the sequence labeling model.
Referring again to Fig. 11, the training apparatus for a classification model provided in still another exemplary embodiment further includes: a third acquisition module 604, configured to acquire a second data set, wherein the samples in the second data set are labeled with word slot category information according to the predefined word slot categories; and a second training module 605, configured to train the sequence labeling model with the second data set.
In some of these embodiments, the second training module 605 may include: the prediction unit 6051 is configured to input the samples in the second data set into a sequence labeling model, and perform word slot class prediction on the samples in the second data set through the sequence labeling model to obtain word slot class prediction information of the samples in the second data set; and a second training unit 6052, configured to train the sequence labeling model according to a difference between the word slot class prediction information of the samples in the second data set and the word slot class information of the sample labels in the second data set.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present disclosure is described with reference to fig. 12. The electronic device may be either or both of the first device and the second device, or a stand-alone device independent thereof, which may communicate with the first device and the second device to receive the acquired input signals therefrom.
Fig. 12 illustrates a block diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 12, the electronic device includes one or more processors 701 and memory 702.
The processor 701 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device to perform the desired functions.
Memory 702 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by the processor 701 to implement the methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device may further include: input device 703 and output device 704, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the first device or the second device, the input device 703 may be the microphone or microphone array described above, for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input device 703 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
In addition, the input device 703 may also include, for example, a keyboard, a mouse, and the like.
The output device 704 may output various information to the outside, including the determined distance information, direction information, and the like. The output device 704 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 12; components such as buses and input/output interfaces are omitted. In addition, the electronic device may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in a method according to various embodiments of the present disclosure described in the "exemplary methods" section of the present description.
The computer program product may include program code for performing the operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a method according to various embodiments of the present disclosure described in the above section "exemplary method" of the present disclosure.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments. It should be noted, however, that the advantages, benefits, effects, and the like mentioned in the present disclosure are merely examples, not limitations, and are not to be regarded as necessarily possessed by every embodiment of the present disclosure. Furthermore, the specific details disclosed above are provided only for illustration and ease of understanding, not for limitation; the present disclosure is not limited to practice with the specific details described.
In this specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments; for the same or similar parts between embodiments, the embodiments may be referred to one another. For system embodiments, the description is relatively simple because they essentially correspond to the method embodiments; for relevant points, reference may be made to the description of the method embodiments.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended, mean "including but not limited to", and may be used interchangeably with that phrase. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
The methods and apparatus of the present disclosure may be implemented in many ways, for example, by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of the steps of the methods is for illustration only; the steps of the methods of the present disclosure are not limited to that order unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the methods according to the present disclosure.
It is also noted that in the apparatus, devices, and methods of the present disclosure, components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent aspects of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.
Claims (14)
1. A text classification method, comprising:
acquiring a text to be processed;
labeling word slot categories of the text to be processed according to predefined word slot categories;
performing domain classification on the text to be processed according to the result of the word slot category labeling to obtain the domain category of the text to be processed;
wherein the performing domain classification on the text to be processed according to the result of the word slot category labeling comprises:
determining a sentence pattern corresponding to the text to be processed according to the result of the word slot category labeling;
and determining the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed.
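By way of non-limiting illustration, the sketch below shows one way the sentence pattern recited in claim 1 could be derived from a word slot labeling result: each labeled span is replaced by its word slot category, so texts sharing a structure map to the same pattern. The slot names and the example utterance are hypothetical.

```python
# Minimal sketch of deriving a sentence pattern from word slot labels.
def to_sentence_pattern(tokens, slot_labels):
    """tokens: list of words; slot_labels: parallel list holding None
    for ordinary words or a word slot category string for slot words."""
    pattern, prev = [], None
    for tok, slot in zip(tokens, slot_labels):
        if slot is None:
            pattern.append(tok)
        elif slot != prev:            # collapse multi-token slots
            pattern.append(f"<{slot}>")
        prev = slot
    return " ".join(pattern)

tokens = ["play", "Hey", "Jude", "by", "the", "Beatles"]
labels = [None, "song", "song", None, "artist", "artist"]
print(to_sentence_pattern(tokens, labels))
# -> "play <song> by <artist>"
```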
2. The method of claim 1, wherein the determining the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed comprises:
extracting features from the sentence pattern corresponding to the text to be processed to obtain text features of the text to be processed;
and performing domain classification on the text to be processed based on the text features of the text to be processed to obtain the domain category of the text to be processed.
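A minimal sketch of the feature extraction and domain classification recited in claim 2 follows; the bag-of-words vectorizer and logistic-regression classifier are illustrative assumptions, as the disclosure leaves the concrete feature extractor and domain classification model open.

```python
# Minimal sketch: sentence pattern -> text features -> domain category
# (scikit-learn). Patterns, domains, and model choice are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

patterns = ["play <song> by <artist>", "call <contact>",
            "navigate to <place>", "play <song>"]
domains = ["music", "phone", "map", "music"]

vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(patterns)   # text features of each pattern
clf = LogisticRegression().fit(X, domains)

query = vectorizer.transform(["play <song> by <artist>"])
print(clf.predict(query))  # -> ['music']
```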
3. The method of claim 2, wherein the performing domain classification on the text to be processed based on the text features of the text to be processed comprises:
inputting the text features of the text to be processed into a domain classification model, and performing domain classification on the text to be processed through the domain classification model to obtain the domain category of the text to be processed.
4. A method according to any one of claims 1 to 3, wherein the labeling word slot categories of the text to be processed according to the predefined word slot categories comprises:
inputting the text to be processed into a sequence labeling model, and labeling word slot categories of the text to be processed through the sequence labeling model.
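To illustrate how the output of a sequence labeling model can yield the word slot categories used in claims 1 and 4, the sketch below decodes BIO-style tags into word slot spans; the BIO convention is an assumption of this example rather than a requirement of the claims.

```python
# Minimal sketch of decoding BIO tags from a sequence labeling model
# into (word slot category, text) spans. Tag names are hypothetical.
def decode_bio(tokens, tags):
    """tags like 'B-song', 'I-song', 'O' -> list of (slot, text) pairs."""
    slots, cur_slot, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_slot:
                slots.append((cur_slot, " ".join(cur_toks)))
            cur_slot, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_slot == tag[2:]:
            cur_toks.append(tok)
        else:
            if cur_slot:
                slots.append((cur_slot, " ".join(cur_toks)))
            cur_slot, cur_toks = None, []
    if cur_slot:
        slots.append((cur_slot, " ".join(cur_toks)))
    return slots

print(decode_bio(["play", "Hey", "Jude", "by", "Beatles"],
                 ["O", "B-song", "I-song", "O", "B-artist"]))
# -> [('song', 'Hey Jude'), ('artist', 'Beatles')]
```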
5. A method of training a classification model, comprising:
acquiring a first data set, wherein the samples in the first data set are labeled with domain category information;
labeling word slot categories of the samples in the first data set according to predefined word slot categories;
training a domain classification model with the first data set according to the result of the word slot category labeling;
wherein the training the domain classification model with the first data set according to the result of the word slot category labeling comprises:
determining sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling;
and training the domain classification model based on sentence patterns corresponding to the samples in the first data set.
6. The method of claim 5, wherein the training the domain classification model based on the sentence patterns corresponding to the samples in the first data set comprises:
extracting features from the sentence patterns corresponding to the samples in the first data set to obtain text features of the samples in the first data set;
and training the domain classification model based on the text features of the samples in the first data set.
7. The method of claim 6, wherein the training the domain classification model based on the text features of the samples in the first data set comprises:
inputting the text features of the samples in the first data set into the domain classification model, and performing domain prediction on the samples in the first data set through the domain classification model to obtain domain category prediction information of the samples in the first data set;
and training the domain classification model according to the difference between the domain category prediction information of the samples in the first data set and the domain category information labeled on the samples in the first data set.
8. The method according to any one of claims 5 to 7, wherein the labeling word slot categories of the samples in the first data set according to the predefined word slot categories comprises:
inputting the samples in the first data set into a sequence labeling model, and labeling word slot categories of the samples in the first data set through the sequence labeling model.
9. The method of claim 8, wherein before the inputting the samples in the first data set into the sequence labeling model and labeling word slot categories of the samples in the first data set through the sequence labeling model, the method further comprises:
acquiring a second data set, wherein the samples in the second data set are labeled with word slot category information according to the predefined word slot categories;
and training the sequence labeling model with the second data set.
10. The method of claim 9, wherein the training the sequence labeling model with the second data set comprises:
inputting the samples in the second data set into the sequence labeling model, and performing word slot category prediction on the samples in the second data set through the sequence labeling model to obtain word slot category prediction information of the samples in the second data set;
and training the sequence labeling model according to the difference between the word slot category prediction information of the samples in the second data set and the word slot category information labeled on the samples in the second data set.
11. A domain classification device, comprising:
a first acquisition module, configured to acquire a text to be processed;
a labeling module, configured to label word slot categories of the text to be processed acquired by the first acquisition module according to predefined word slot categories;
a classification module, configured to perform domain classification on the text to be processed according to the result of the word slot category labeling obtained by the labeling module, to obtain the domain category of the text to be processed;
wherein, when performing domain classification on the text to be processed, the classification module is specifically configured to: determine a sentence pattern corresponding to the text to be processed according to the result of the word slot category labeling; and determine the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed.
12. A training apparatus for classification models, comprising:
a second acquisition module, configured to acquire a first data set, wherein the samples in the first data set are labeled with domain category information;
a labeling module, configured to label word slot categories of the samples in the first data set acquired by the second acquisition module according to predefined word slot categories;
a first training module, configured to train a domain classification model with the first data set according to the result of the word slot category labeling obtained by the labeling module;
wherein, when training the domain classification model with the first data set, the first training module is specifically configured to: determine sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling; and train the domain classification model based on the sentence patterns corresponding to the samples in the first data set.
13. A computer-readable storage medium storing a computer program for executing the method of any one of claims 1 to 10.
14. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor being configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of the preceding claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910759761.8A CN112395414B (en) | 2019-08-16 | 2019-08-16 | Text classification method, training method of classification model, training device of classification model, medium and training equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112395414A CN112395414A (en) | 2021-02-23 |
CN112395414B true CN112395414B (en) | 2024-06-04 |
Family
ID=74602945
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910759761.8A Active CN112395414B (en) | 2019-08-16 | 2019-08-16 | Text classification method, training method of classification model, training device of classification model, medium and training equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112395414B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116738298B (en) * | 2023-08-16 | 2023-11-24 | 杭州同花顺数据开发有限公司 | Text classification method, system and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108170733A (en) * | 2017-12-15 | 2018-06-15 | 云蜂科技有限公司 | A kind of method and system classified to short message text |
CN108415897A (en) * | 2018-01-18 | 2018-08-17 | 北京百度网讯科技有限公司 | Classification method of discrimination, device and storage medium based on artificial intelligence |
CN109858035A (en) * | 2018-12-29 | 2019-06-07 | 深兰科技(上海)有限公司 | A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing |
CN109918682A (en) * | 2019-03-29 | 2019-06-21 | 科大讯飞股份有限公司 | A kind of text marking method and device |
CN109918680A (en) * | 2019-03-28 | 2019-06-21 | 腾讯科技(上海)有限公司 | Entity recognition method, device and computer equipment |
CN109918673A (en) * | 2019-03-14 | 2019-06-21 | 湖北亿咖通科技有限公司 | Semantic referee method, device, electronic equipment and computer readable storage medium |
CN110019782A (en) * | 2017-09-26 | 2019-07-16 | 北京京东尚科信息技术有限公司 | Method and apparatus for exporting text categories |
CN110119786A (en) * | 2019-05-20 | 2019-08-13 | 北京奇艺世纪科技有限公司 | Text topic classification method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10380258B2 (en) * | 2016-03-31 | 2019-08-13 | International Business Machines Corporation | System, method, and recording medium for corpus pattern paraphrasing |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |