CN114444609A - Data processing method and device, electronic equipment and computer-readable storage medium
- Publication number: CN114444609A
- Application number: CN202210118785.7A
- Authority: CN (China)
- Prior art keywords: data, training, standard, processed, loss value
- Legal status: Granted (assumed status; not a legal conclusion)
Classifications
- G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
- G06F18/2415 — Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
- G06F18/28 — Determining representative reference patterns, e.g. by averaging or distorting; generating dictionaries
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The embodiments of the present application provide a data processing method and apparatus, an electronic device, and a computer-readable storage medium, relating to the technical fields of artificial intelligence, multimedia, games, and cloud computing. The method comprises the following steps: acquiring data to be processed, the data to be processed being data of a first modality; extracting a first data feature of the data to be processed; matching the first data feature against at least one second data feature in a target database; and determining, from the candidate standard data, target standard data matching the data to be processed according to the matching result corresponding to each second data feature, wherein the target database comprises at least one item of candidate standard data and the second data feature of each item of candidate standard data, the candidate standard data being data of a second modality. Based on the method provided by the embodiments of the present application, matching between data of different modalities can be realized simply and quickly.
Description
Technical Field
The present application relates to the fields of artificial intelligence, multimedia technology, games, and cloud technology, and in particular to a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development and popularization of speech recognition technology, speech recognition has appeared in a wide range of application scenarios. For example, most electronic devices are now equipped with artificial intelligence (AI) voice assistants, which recognize collected speech data to obtain the corresponding text content and can then execute corresponding functions based on the recognized text.
In the prior art, speech recognition is mostly realized through complex speech recognition models: a speech encoder first obtains the coding features of the speech data, and a classification network then predicts the type of the speech data. This can meet the requirements to a certain extent, but the scheme is complex to implement, costly, and poorly extensible, since new types cannot be added; moreover, when the number of speech data types is large, the recognition accuracy is difficult to guarantee.
Disclosure of Invention
The embodiments of the present application provide a data processing method and apparatus, an electronic device, and a computer-readable storage medium, based on which matching between data of different modalities can be realized simply and quickly. The technical scheme provided by the embodiments of the present application is as follows:
in one aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring data to be processed, the data to be processed being data of a first modality;
extracting a first data feature of the data to be processed;
matching the first data feature against at least one second data feature in a target database to obtain a matching result corresponding to each second data feature, wherein the target database comprises at least one item of candidate standard data and the second data feature of each item of candidate standard data, the candidate standard data being data of a second modality;
and determining, from the candidate standard data, target standard data matching the data to be processed according to the matching result corresponding to each second data feature.
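To make the flow above concrete, the following is a minimal sketch of the matching step in Python. It is an illustration only, not the patented implementation: the use of cosine similarity as the matching degree and the list layout of the target database are assumptions of this example.

```python
import numpy as np

def find_target_standard_data(first_feature, target_database):
    """Match a first-modality feature against the second data features in the
    target database and return the best-matching candidate standard data.

    target_database: list of (candidate_standard_data, second_feature) pairs.
    The matching degree here is cosine similarity (an assumption).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    matching_results = [(candidate, cosine(first_feature, feature))
                        for candidate, feature in target_database]
    # Target standard data: the candidate with the highest matching degree.
    return max(matching_results, key=lambda item: item[1])
```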
In another aspect, an embodiment of the present application provides a data processing apparatus, including:
a to-be-processed data acquisition module, configured to acquire data to be processed, the data to be processed being data of a first modality;
a feature acquisition module, configured to extract a first data feature of the data to be processed;
a data identification module, configured to match the first data feature against at least one second data feature in a target database to obtain a matching result corresponding to each second data feature, and to determine, from the candidate standard data, target standard data matching the data to be processed according to the matching result corresponding to each second data feature;
wherein the target database comprises at least one item of candidate standard data and the second data feature of each item of candidate standard data, the candidate standard data being data of a second modality.
Optionally, the data identification module is further configured to determine the data type of the data to be processed according to the first data feature; accordingly, the data identification module may be configured to:
match the first data feature against at least one second data feature in the target database when the data type of the data to be processed is a specified type.
Optionally, the data of the first modality and the data of the second modality are data of different modalities, the data of the first modality includes at least one of text, voice, video or image, and the data of the second modality includes at least one of text, voice, video or image.
Optionally, the candidate standard data is a standard expression matched with first standard data in a standard database, the first standard data is data of a first modality, and one first standard data corresponds to at least one standard expression.
Optionally, the feature acquisition module is further configured to: when newly added first standard data exists in the standard database, acquire at least one standard expression corresponding to the newly added first standard data; extract the second data feature of each standard expression corresponding to the newly added first standard data; and store each such standard expression in the target database in association with its second data feature.
Optionally, the first data feature is extracted through a first feature extraction network, and the second data feature of the candidate standard data is extracted through a second feature extraction network; the first feature extraction network and the second feature extraction network are obtained by a model training module through training in the following way:
acquiring a training data set, wherein the training data set comprises a first training set, and each first sample in the first training set comprises first data of a first modality and second data of a second modality matched with the first data;
performing iterative training on an initial neural network model based on the training data set until a total training loss value meets a preset training end condition, wherein the neural network model comprises a first network model and a second network model; the first network model meeting the training end condition is used as the first feature extraction network, and the second network model meeting the training end condition is used as the second feature extraction network. The training process comprises the following steps:
inputting each first data into a first network model to obtain the characteristics of each first data, and inputting each second data into a second network model to obtain the characteristics of each second data;
determining a first training loss value based on the matching degree of the features of the first data and the features of the second data in each first sample and the matching degree of the features of the first data and the features of the second data in each first negative example; wherein the first negative example comprises first data of one first sample and second data of another first sample;
and if the first training loss value does not meet a first preset condition, adjusting the model parameters of the first network model and the second network model, wherein the total training loss value meeting the preset training end condition comprises the first training loss value meeting the first preset condition.
Optionally, the model training module is configured to, when inputting each first data into the first network model and obtaining the feature of each first data:
for each first data, performing the following operations on the first data through the first network model to obtain the characteristics of the first data:
dividing the first data into at least two pieces of sub-data to obtain a sub-data sequence corresponding to the first data; extracting the feature of each piece of sub-data in the sequence based on a dictionary, and obtaining the feature of the first data based on the features of the sub-data, wherein the dictionary comprises a plurality of data elements, the number of feature values in the feature of each piece of sub-data equals the number of elements in the dictionary, and each feature value represents the probability that the sub-data contains the data element at the corresponding position in the dictionary;
the model training module is further to: for each second datum, determining, based on the dictionary, that the second datum corresponds to a data feature of the dictionary, the data feature characterizing a probability that the second datum corresponds to a respective data element in the dictionary;
the model training module, when determining the first training loss value, is to: and determining a first training loss value based on the matching degree between the characteristics of the sub data of the first data in each first sample and the data characteristics of the second data corresponding to the dictionary, the matching degree between the characteristics of the first data in each first sample and the characteristics of the second data, and the matching degree between the characteristics of the first data in each first negative example and the characteristics of the second data.
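The sub-data features described here are probability distributions over the dictionary elements, so the matching degree between the sub-data feature sequence of first data and the dictionary feature of its paired second data can be scored with a CTC-style alignment error (see the term explanations below). A minimal PyTorch sketch, in which the segment count, dictionary size, and element indices are invented for illustration:

```python
import torch
import torch.nn.functional as F

T, V = 50, 60                  # sub-data segments and dictionary size (assumed)
logits = torch.randn(T, 1, V)  # sub-data features from the first network model
log_probs = F.log_softmax(logits, dim=-1)  # one distribution per segment

# Dictionary indices of the data elements (e.g. phonemes) of the matched
# second data; index 0 is reserved for the CTC blank symbol.
targets = torch.tensor([[7, 23, 5, 41, 19]])

# CTC loss aligns the sub-data feature sequence with the target element
# sequence without requiring a manual frame-level alignment.
ctc_loss = torch.nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.tensor([T]),
                target_lengths=torch.tensor([5]))
```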
Optionally, the model training module, when determining the first training loss value, may be configured to:
determining the difference degree between the characteristics of the first data and the characteristics of the second data of each first sample to obtain a first loss value;
for each piece of first data, determining a first similarity corresponding to the first data and a second similarity corresponding to the first data, wherein the first similarity is the similarity between the feature of the first data and the feature of the second data matching it, and the second similarity is the similarity between the feature of the first data and the feature of the second data in a first negative example in which the first data is located;
acquiring reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
determining a second loss value based on the prediction similarity and the reference label corresponding to each first data, wherein the prediction similarity comprises the first similarity and the second similarity, and the second loss value represents the difference between the prediction similarity and the reference label corresponding to each first data;
and determining a first training loss value according to the first loss value and the second loss value.
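A compact sketch of this two-part loss follows. It assumes a batch layout in which first_feats[i] and second_feats[i] form a positive pair and all mixed indices form first negative examples; the use of MSE for the difference degree, cosine for the similarity, and equal weighting of the two loss values are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def first_training_loss(first_feats, second_feats):
    """first_feats[i] / second_feats[i]: matched features of one first sample."""
    # First loss value: degree of difference between matched features.
    first_loss = F.mse_loss(first_feats, second_feats)

    # Predicted similarities between every first datum and every second datum.
    sim = F.cosine_similarity(first_feats.unsqueeze(1),
                              second_feats.unsqueeze(0), dim=-1)
    # Reference labels: 1 for the matched pair, 0 for first negative examples.
    labels = torch.eye(sim.size(0))

    # Second loss value: two-classification cross entropy over the predicted
    # similarities, mapped from [-1, 1] to [0, 1].
    second_loss = F.binary_cross_entropy((sim + 1) / 2, labels)

    return first_loss + second_loss
```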
Optionally, the candidate standard data is a standard expression of a second modality corresponding to the first standard data of the specified type; the initial neural network model further comprises a classification model; the training data set further comprises a second training set, each second sample in the second training set comprising third data of the first modality and fourth data of the second modality matching the third data, wherein the third data in the second training data set comprises third data of the specified type and third data of the non-specified type, and each second sample further comprises a type label of the third data in the sample; after obtaining the neural network model with the first training loss value satisfying the first preset condition, the model training module is further configured to perform the following training process:
continuing to repeatedly execute training operation on the neural network model based on the second training set until the second training loss value meets a second preset condition, wherein the training total loss value meets the preset training end condition and the second training loss value meets the second preset condition; the training operation comprises:
inputting each third data into the first network model to obtain the characteristics of each third data, inputting each fourth data into the second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into the classification model to obtain the prediction type corresponding to each third data;
determining a second training loss value based on the matching degree of the features of the third data and the features of the fourth data in each second sample, the matching degree of the features of the third data and the features of the fourth data in each second negative example, and the matching degree between the type label and the prediction type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
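For this second (fine-tuning) stage, the second training loss adds a classification term on top of the matching terms. A sketch under the same assumptions as the previous one, with cross entropy assumed for the type-label term and equal weighting assumed between the terms:

```python
import torch.nn.functional as F

def second_training_loss(third_feats, fourth_feats, type_logits, type_labels):
    # Matching terms over second samples and second negative examples,
    # computed as in the first_training_loss sketch above.
    matching_loss = first_training_loss(third_feats, fourth_feats)
    # Classification term: match between type label and predicted type.
    classification_loss = F.cross_entropy(type_logits, type_labels)
    return matching_loss + classification_loss
```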
Optionally, the data of the first modality is speech, the data of the second modality is text, and the data elements are phonemes.
Optionally, the specified type is instruction type speech.
On the other hand, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the method provided in any optional embodiment of the present application.
On the other hand, the embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method provided in any optional embodiment of the present application.
In another aspect, the present application further provides a computer program product including a computer program, where the computer program is executed by a processor to implement the method provided in any optional embodiment of the present application.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the data processing method provided by the embodiment of the application provides a novel data processing thought, when the data to be processed is processed by adopting the method, after the data characteristics of the data to be processed are obtained, the complex and tedious characteristic recognition of the data characteristics can be avoided, but the matching between different modal data can be simply and quickly realized through a characteristic matching mode, the calculated amount can be greatly reduced, and the data processing efficiency is improved. In addition, because the target database stores candidate standard data, the identification accuracy can be well ensured by determining the standard data of the second modality which is matched with the data to be processed of the first modality by the method of the embodiment of the application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a data processing system according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a training process of a neural network model according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of a pre-training phase according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart of a fine tuning training phase according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a method for constructing an instruction vector library according to an embodiment of the present disclosure;
fig. 7 is a schematic view illustrating a recognition process of a voice command according to an embodiment of the present application;
FIGS. 8a and 8b are schematic diagrams of a user interface provided in an example of the present application;
fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device to which the embodiment of the present application is applied.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present; "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" indicates at least one of the items it joins; for example, "A and/or B" may be implemented as "A", as "B", or as "A and B". When a plurality of (two or more) items is described and the relationship between them is not explicitly defined, the description may refer to one, more, or all of them; for example, for the description "parameter A includes A1, A2, A3", parameter A may include A1, A2, or A3, or at least two of the three items A1, A2, A3.
It should be noted that, in the optional embodiments of the present application, when the embodiments are applied to a specific product or technology, user permission or consent must be obtained for related data such as user information (e.g., a user's voice data), and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. That is, if data related to a user is involved in an embodiment of the present application, the data must be obtained with the user's approval and in compliance with the relevant laws, regulations, and standards.
Optionally, the data processing method provided in the embodiment of the present application may be implemented based on an Artificial Intelligence (AI) technology. For example, feature extraction of data to be processed, feature extraction of candidate standard data, and feature extraction of data in a training data set may be implemented by a trained neural network model. AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. As artificial intelligence technology has been researched and developed in a wide variety of fields, it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and will play an increasingly important role.
Optionally, the data processing of the embodiments of the present application may be implemented based on cloud technology; for example, the computation involved in training the neural network model and in processing the data to be processed may be carried out using cloud technology. Cloud technology is a hosting technology that unifies hardware, software, network, and other resources in a wide area network or local area network to realize the computation, storage, processing, and sharing of data. It is the general term for the network, information, integration, management-platform, application, and other technologies applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing refers to a delivery and use mode of IT infrastructure in which required resources are obtained over a network in an on-demand, easily extensible manner; in a broader sense, it refers to a delivery and use mode of services in which required services, whether IT and software, internet-related, or other services, are obtained over a network in the same way. With the diversification of the internet, real-time data streams, and connected devices, and with the growing demand for search services, social networks, mobile commerce, open collaboration, and the like, cloud computing has developed rapidly. Unlike earlier parallel distributed computing, cloud computing is expected to drive revolutionary changes in the internet model and in enterprise management.
For better understanding and description of the solutions provided by the embodiments of the present application, some related terms of art related to the embodiments of the present application will be described below.
Two-classification cross entropy error: i.e., binary cross entropy loss, an objective function/loss function in deep learning used to measure the similarity between the predicted result distribution (the prediction output by the neural network) and the true label (i.e., the sample label). The error is calculated from the sample prediction (usually a probability between 0 and 1) and the true label (0 or 1). Assuming the predicted probability that a sample is true is y and the true label is y', the corresponding error L is: L = -y'·log(y) - (1-y')·log(1-y).
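As a worked example of this formula (values chosen for illustration): if the network predicts y = 0.9 for a sample whose true label is y' = 1, only the first term contributes.

```python
import math

y, y_true = 0.9, 1.0  # predicted probability and true label (example values)
loss = -(y_true * math.log(y) + (1 - y_true) * math.log(1 - y))
print(round(loss, 4))  # 0.1054 = -log(0.9)
```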
Speech-translated text pair: the voice audio (i.e., voice signal/voice data) and the corresponding translated text (text data) constitute a pair.
Speech-instruction pair: the instruction voice and the natural language text expressing the instruction intention, such as the voice "mark article A" and the text "there is article A" can be a pair, and the voice "mark article A" and the text "mark article A" can also be a pair.
MFCC (Mel-Frequency Cepstral Coefficients) features: an audio feature used in speech processing that can serve as neural network model input.
Phoneme: the basic acoustic unit of speech, a speech unit divided according to the natural attributes of speech and analyzed according to the articulatory actions within a syllable; one action constitutes one phoneme.
CNN (Convolutional Neural Networks): CNN is a deep learning network structure that can capture local information of inputs.
Transformer network: a deep learning network structure based on the attention mechanism that can be applied to sequence inputs such as text and speech.
CTC (Connectionist Temporal Classification) error: also referred to as CTC loss, an objective function used in deep learning that allows models to learn alignments automatically.
MSE (Mean Squared Error): a loss function in deep learning used to measure the distance between two vectors.
The following describes technical solutions of various alternative embodiments provided in the present application and technical effects produced by the technical solutions of the present application. It should be noted that the following embodiments may be referred to, referred to or combined with each other, and the description of the same terms, similar features, similar implementation steps and the like in different embodiments is not repeated.
Fig. 1 shows a flowchart of a data processing method provided in an embodiment of the present application. The method may be executed by any electronic device, such as a user terminal or a server, or completed through interaction between the two. For example, the data to be processed may be a user's voice instruction: the user terminal, by executing the method, can conveniently and quickly identify the specific content of the instruction (the target standard data, i.e., the standard text expression of the user's intention) and execute a corresponding operation according to the recognition result. As another example, a server may receive a user's voice instruction from a user terminal and, by executing the method, identify its specific content and execute a corresponding operation. The user terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart household appliance, a vehicle-mounted terminal, a wearable electronic device, an AR/VR device, and the like. The server may be a cloud server or a physical server, and may be a single server or a server cluster.
The method provided by the embodiments of the present application can be applied in any scenario that requires identifying, from data of one modality, the matching data of another modality. A "modality" in the embodiments of the present application refers to the form of data, i.e., the style in which data is presented; for example, voice data is data of one modality and text data is data of another. The method may, for example, be implemented as a functional module/plug-in of an application program such as a game: a user can issue a voice instruction while playing, the module quickly determines the recognition result (target standard data) of the instruction, and the game server executes the corresponding operation and displays the result to the user.
The method can be applied to any game with cross-modality data matching requirements, including but not limited to action, adventure, simulation, role-playing, and casual games. For example, in a tactical competition game, a player may collect various game resources (such as virtual game props) on a game map in a virtual game scene. Optionally, the player may collect game resources by issuing a voice instruction. Based on the solution provided by the embodiments of the present application, the feature of the player's voice instruction (the first data feature) is matched against the intention text features (the features of the standard text expressions corresponding to standard voice instructions, i.e., the second data features in the embodiments of the present application), so that the player's intention (the matched standard text expression, i.e., the target standard data) is determined quickly and accurately, the corresponding operation is executed, and the collection of the game resources indicated by the voice instruction is completed.
The following describes a data processing method provided in the embodiment of the present application with reference to a flowchart shown in fig. 1. As shown in fig. 1, the data processing method provided in the embodiment of the present application may include the following steps S110 to S140.
Step S110: and acquiring data to be processed, wherein the data to be processed is data of a first mode.
Step S120: first data features of the data to be processed are extracted.
The embodiments of the present application do not limit the specific form of the data of the first modality. Optionally, it may include, but is not limited to, at least one of text, voice, video, or images. For example, in some application scenarios the data to be processed may be collected voice data of a user; in others it may be text data input by the user; it may also include both text and image data.
The first data feature of the data to be processed may be extracted through a trained first feature extraction network. The input of the network may be the data to be processed itself, or data obtained by preprocessing it to meet the network's input format requirements. For example, an initial feature of the data to be processed may be extracted by preprocessing and then fed into the feature extraction network, which extracts a first data feature with better expressive capability.
As an example, the data to be processed may be voice data. The voice data may first be subjected to a time-frequency transformation to obtain an audio feature (such as a mel-spectrum feature or an MFCC feature), which is used as the input of the first feature extraction network; the network then produces a high-level feature representation of the voice data (also called a voice representation vector), i.e., the first data feature in this example.
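As a sketch of this preprocessing step (assuming the librosa library and example parameter values; the patent does not prescribe a specific toolkit):

```python
import librosa

# Load an utterance and compute its MFCC features as network input.
waveform, sr = librosa.load("voice_command.wav", sr=16000)  # example file
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)   # (13, n_frames)

# mfcc (or a mel spectrogram) is then fed to the first feature extraction
# network, whose output is the first data feature (the speech representation
# vector) used for matching.
```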
Step S130: and matching the first data characteristic with at least one second data characteristic in the target database to obtain a matching result corresponding to each second data characteristic.
Step S140: and determining target standard data matched with the data to be processed from the candidate standard data according to the matching result corresponding to each second data characteristic.
In an embodiment of the present application, the target database includes at least one candidate standard data and a second data feature of each candidate standard data, and the candidate standard data is data of a second modality.
Likewise, the embodiment of the present application is not limited to a specific form of the data of the second modality, and may include, but is not limited to, at least one of text, voice, video, or image. It will be appreciated that the data of the first modality and the data of the second modality are not data of the same modality. For example, the data of the first modality may be speech data and the data of the second modality may be text data.
It should be noted that the data of the first modality and the data of the second modality may each contain only one type of data, or may include two or more types. When at least one of them includes two types of data, "data of different modalities" means that at least one type differs between them; for example, the data of the first modality may contain text and images while the data of the second modality is voice data. Taking data to be processed that includes two types of data as an example, its first data feature may be obtained by fusing (e.g., concatenating or adding) the features of the two types: if the data to be processed includes voice data and text data, the feature of the voice data and the feature of the text data may be extracted separately and fused to obtain the data feature of the data to be processed.
The standard data in the embodiments of the present application (the candidate standard data described above and the first standard data described below) can be understood as reference data: a standard expression of a piece of information, which may be configured in advance according to application requirements.
In the embodiments of the present application, the second data feature of each item of candidate standard data may likewise be extracted by a trained neural network model: specifically, feature extraction may be performed on each item of candidate standard data through a second feature extraction network. As with the first network, the input may be the candidate standard data itself, or the data after preprocessing. For example, if the candidate standard data is text data, an initial feature representation may be obtained by word embedding (Embedding) or one-hot coding and input to the second feature extraction network to obtain a high-level feature representation of the text data, i.e., the second data feature.
As an alternative, the candidate standard data in the target database may be standard expressions matched with first standard data in the standard database, the first standard data being data of a first modality, and one first standard data corresponding to at least one standard expression.
First standard data and its standard expression can be understood as standard descriptions of the same information in two different data forms. For example, if the first standard data is data in voice form and the standard expression is data in text form, they can be understood as the voice expression and the text expression of the same information: for the information "hello", the first standard data is the voice data of "hello" and the corresponding standard expression is the text content "hello".
As an example, a voice instruction library may be preconfigured in the game application, the voice instruction library may store standard text expressions (also may be understood as recognition results of the voice instructions) corresponding to various standard voice instructions (first standard data in the application scenario) supported in the game application, the game player may input the voice instructions (data to be processed) at the client of the game application, and the game server may find out the voice recognition result matching the voice instruction currently input by the user from the standard text expressions corresponding to the standard voice instructions by performing the above steps S120 to S140.
Optionally, the matching result corresponding to a second data feature may be the matching degree between the first data feature and that second data feature, such as their similarity. After the matching results for all second data features are obtained, the target standard data may be chosen in several ways: the candidate standard data corresponding to the second data feature with the highest matching degree; the candidate standard data corresponding to a set number of second data features ranked from high to low by matching degree; or all candidate standard data whose second data features have a matching degree greater than a set value.
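The three selection strategies just described can be sketched as follows (the function name, k, and threshold are illustrative values, not from the patent):

```python
import numpy as np

def select_targets(matching_degrees, candidates, strategy="best",
                   k=3, threshold=0.8):
    order = np.argsort(matching_degrees)[::-1]  # matching degree, high to low
    if strategy == "best":                      # single highest matching degree
        return [candidates[order[0]]]
    if strategy == "top_k":                     # a set number, ranked high to low
        return [candidates[i] for i in order[:k]]
    # "threshold": every candidate whose matching degree exceeds a set value
    return [candidates[i] for i in order if matching_degrees[i] > threshold]
```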
The data processing method provided by the embodiments of the present application is a new mode of data processing. When it is used to process data to be processed, complex recognition of the data feature is unnecessary: matching between data of different modalities is realized conveniently and quickly through feature matching. Furthermore, because the target database stores candidate standard data, determining the second-modality standard data matching the first-modality data to be processed in this way ensures good recognition accuracy.
In addition, in practical applications, when a corresponding operation needs to be executed based on the data to be processed, the operation can be executed directly on the matched standard data. Compared with existing modes that further recognize data features to obtain recognition results, the recognition result here needs no further normalization, which better meets practical application requirements.
In an optional embodiment of the present application, the data processing method may further include:
when the newly added first standard data exist in the standard database, at least one standard expression corresponding to the newly added first standard data is obtained;
extracting second data characteristics of each standard expression corresponding to the newly added first standard data;
and storing each standard expression corresponding to the newly added first standard data and the second data characteristic association corresponding to each standard expression into the target database.
When new first standard data appears in the standard database (for example, as a game application is continuously updated and optimized, it gains more functions and supports more voice instructions), the standard expression corresponding to the newly added standard data can be acquired, its second data feature extracted, and the standard expression (i.e., the newly added candidate standard data) stored in the target database in association with its second data feature, thereby expanding the target database.
If the prior art were adopted after obtaining the first data feature of the data to be processed (for example, processing the first data feature with a neural network model to directly obtain the matching second-modality data), the neural network model would have to be retrained whenever new first standard data appears, so that it could recognize data to be processed corresponding to the newly added data; such a scheme is complex to implement and costly. With the scheme provided by the embodiments of the present application, when new first standard data appears, only the corresponding standard expression and its second data feature need to be added to the target database; the target standard data corresponding to the data to be processed can then be determined from the candidate standard data by matching the first data feature against the second data features in the updated target database. The scheme is simple to implement and low in cost.
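The update path described here touches only the target database, which a few lines make apparent (a sketch; the database is modeled as a simple in-memory list):

```python
def add_new_standard_data(target_database, new_standard_expressions,
                          second_feature_network):
    """When new first standard data appears, encode and store its standard
    expressions; the feature extraction networks are not retrained."""
    for expression in new_standard_expressions:
        feature = second_feature_network(expression)   # second data feature
        target_database.append((expression, feature))  # stored in association
```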
In an optional embodiment of the present application, the data processing method may further include: determining the data type of the data to be processed according to the first data characteristic; in this case, the matching the first data characteristic with at least one second data characteristic in the target database in step S130 may include:
and when the data type of the data to be processed is a specified type, matching the first data characteristic with at least one second data characteristic in the target database.
In some application scenarios, only data of one or more specified types may need to be processed. To meet this requirement, as an alternative, the data type of the data to be processed can be judged before its data feature is matched against the features of the candidate standard data, and the matching is performed only when the data type is a specified type, which reduces unnecessary processing and saves computing resources. In this alternative, the candidate standard data in the target database may be the standard expressions corresponding to first standard data of the specified types. In practice there may be one specified type or two or more, configured according to actual requirements.
It should be noted that, in implementation, the data type may be judged first and the subsequent matching performed only when the type is a specified type; alternatively, both the type judgment and the matching may be executed, and the subsequent processing then determined from the matching results together with the type judgment. For example, after the first data feature of the data to be processed is extracted, its data type may be determined based on that feature, the first data feature may be matched against the second data features in the target database, and the target standard data may then be determined according to both the determined data type and the matching result corresponding to each second data feature. In the latter case, it may be determined that the data to be processed has no matching target standard data if its data type is not a specified type, or if the matching degree corresponding to each second data feature is smaller than a set value.
In the embodiments of the present application, determining the data type of the data to be processed from its first data feature may also be implemented by a neural network model, for example a classification model. Specifically, the first data feature may be input into a trained classification model, which outputs the probability that the data type belongs to a specified type; whether the data is of the specified type can then be judged from this probability. The classification model may be a binary classification model whose two categories are the specified type and the non-specified type; its output may include a first probability and a second probability that the data type belongs to the specified type and to the non-specified type, respectively, and the data type may be judged from these probabilities, e.g., determined to be the specified type when the first probability is greater than a set probability.
As an alternative, there may be at least two specified types, and the classification model may be a multi-class model whose categories are the non-specified type and each specified type; for example, with two specified types (a first type and a second type), the model has three categories. The model predicts the probabilities that the data to be processed belongs to the non-specified type, the first type, and the second type, and the type with the maximum probability is taken as the data type. Optionally, under this scheme, the target database may include a plurality of sub-libraries, each corresponding to the standard expressions (data of the second modality) of first standard data (data of the first modality) of one specified type. The classification model identifies whether the data to be processed is of a specified type and, if so, which one; the first data feature then only needs to be matched against the second data features of the candidate standard data in the sub-library of that type, rather than against every sub-library, which further reduces the amount of data processing.
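A sketch of this routing logic, with the category order, sub-library layout, and reuse of the earlier matching helper all assumed for illustration:

```python
def classify_and_match(first_feature, classification_model, sub_libraries):
    """Judge the data type first, then match only within the sub-library of
    the predicted specified type (index 0 = non-specified type, assumed)."""
    type_probs = classification_model(first_feature)
    predicted_type = int(type_probs.argmax())
    if predicted_type == 0:          # non-specified type: skip matching
        return None
    candidates = sub_libraries[predicted_type]
    return find_target_standard_data(first_feature, candidates)
```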
As can be seen from the foregoing description, in the embodiment of the present application, for data of a first modality (e.g., data to be processed), data features of the data may be extracted through a first feature extraction network; for the data of the second modality (such as each candidate standard data), the data features can be extracted through a second feature extraction network; the first feature extraction network and the second feature extraction network are obtained by training a neural network model based on a training data set.
In this embodiment, the neural network model includes a first network model and a second network model, which may be iteratively trained based on a training data set; the trained first network model serves as the first feature extraction network and the trained second network model as the second feature extraction network. The model structures are not limited in the embodiments of the present application and may be configured according to application requirements; for example, both may be CNN-based models. Alternatively, the structure may be chosen according to the form of data the model handles: if the data of the first modality is voice data, the first network model may adopt a structure that extracts voice features well, such as a Wav2vec model; if the data of the second modality is text data, the second network model may adopt a structure that works well on text, for example a Transformer-based structure such as a BERT (Bidirectional Encoder Representations from Transformers) model.
Optionally, the neural network model including the first network model and the second network model in this embodiment of the application may be obtained by training in the following manner:
acquiring a training data set, wherein the training data set comprises a first training set, and each first sample in the first training set comprises first data of a first modality and second data of a second modality matched with the first data;
performing iterative training on an initial neural network model based on the training data set until a training total loss value meets a preset training end condition, taking a first network model meeting the training end condition as a first feature extraction network, and taking a second network model meeting the training end condition as a second feature extraction network; the training process may include the steps of:
inputting each first data into a first network model to obtain the characteristic of each first data, and inputting each second data into a second network model to obtain the characteristic of each second data;
determining a first training loss value based on the matching degree of the features of the first data and the features of the second data in each first sample and the matching degree of the features of the first data and the features of the second data in each first negative example; wherein the first negative example comprises first data of one first sample and second data of another first sample;
and if the first training loss value does not meet the first preset condition, adjusting model parameters of the first network model and the second network model, wherein the training total loss value meets the preset training ending condition and the first training loss value meets the first preset condition.
When the neural network model is trained, the first data and the second data of each first sample in the first training set are data of two modalities that match each other, so the first sample may also be referred to as a positive example, that is, a positive sample. The first negative example (a negative sample) consists of first data and second data from different first samples, that is, data of two modalities that do not match; for any first data, it may form a negative example with each of a plurality of other second data (second data other than the second data matching that first data). In the training process, a training loss value is determined based on the degree of matching between the data features of the positive samples and the degree of matching between the data features of the negative samples.
The embodiment of the present application is not limited to the specific form of the loss function selected in the training process, and the purpose of the model training is to make the similarity between the features of the first data and the second data that match each other as large as possible, and the similarity between the features of the first data and the second data that do not match each other as small as possible.
For positive samples, the degree of difference between the features of the first data learned through the first network model and the features of the second data learned through the second network model (for example, 1 minus the similarity, or the mean square error between the two features) may be calculated to obtain the corresponding training loss. For negative samples, an optional way is to calculate the degree of matching between the features of the first data learned through the first network model and the features of the second data learned through the second network model to obtain the corresponding training loss. Through continued training, the degree of matching between the features of positive samples learned by the model becomes higher (i.e., the difference becomes smaller) and the degree of matching between the features of negative samples becomes lower. The manner of calculating the degree of matching or the degree of difference differs for different loss functions.
In an optional embodiment of the application, the determining the first training loss value based on the matching degree of the feature of the first data and the feature of the second data in each first sample and the matching degree of the feature of the first data and the feature of the second data in each first negative example may include:
determining the difference degree between the characteristics of the first data and the characteristics of the second data of each first sample to obtain a first loss value;
for each piece of first data, determining a first similarity corresponding to the first data and a second similarity corresponding to the first data, wherein the first similarity is the similarity between the characteristics of the first data and the characteristics of second data matched with the first data, and the second similarity is the similarity between the characteristics of the first data and the characteristics of the second data in a first negative example where the first data is located;
acquiring reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
determining a second loss value based on the prediction similarity and the reference label corresponding to each first data, wherein the prediction similarity comprises the first similarity and the second similarity, and the second loss value represents the difference between the prediction similarity and the reference label corresponding to each first data;
and determining a first training loss value according to the first loss value and the second loss value.
Alternatively, the first loss value may be the sum of the mean square errors between the features of the first data and the features of the second data in each positive sample; or the similarity between the features of the first data and the features of the second data in each positive sample may be calculated, 1 minus the similarity taken as the difference degree, and the sum of the difference degrees over the positive samples taken as the first loss value. The first loss value constrains the features of the data of the two modalities in a positive sample learned by the model to be as close as possible.
The second loss value may also be referred to as a matching error, and is used to constrain the similarity between the features of the two data in a positive sample learned by the model to be higher than the similarity between the features of the two data in a negative sample. When calculating this part of the loss, the reference label is the real label in training, that is, the result that the model is expected to learn. Specifically, for each first data, the similarity label corresponding to the first similarity in the corresponding real label refers to the ideal similarity between the first data and the second data matching it, which may be, for example, 1 or a relatively high similarity; the similarity label corresponding to each second similarity refers to the ideal similarity between the first data and second data that does not match it, which may be, for example, 0 or a relatively small similarity. The reference label may be pre-configured. Based on the features of the first data and the features of the second data output by the model, the first similarity and each second similarity corresponding to each first data can be obtained through calculation; these similarities can form a similarity vector, and the second loss value is obtained by calculating the difference between the similarity vector and the reference label. For example, the similarity vector can be used as the probability distribution predicted by the model and the reference label as the real probability distribution, i.e., the label, and the second loss value is obtained by calculating the cross-entropy loss between the two.
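As an illustrative sketch of the first and second loss values described above, the following hypothetical PyTorch snippet assumes that features arrive batched so that row i of each tensor comes from the i-th first sample, making diagonal pairs positive examples and off-diagonal pairs first negative examples; all names are for illustration only:

```python
import torch
import torch.nn.functional as F

def first_and_second_loss(first_feats, second_feats):
    """first_feats, second_feats: (B, D) features from the two network models;
    the (i, i) pairs are positive examples, the (i, j), i != j, pairs are
    first negative examples."""
    # First loss value: degree of difference between matched features,
    # here the mean square error over the positive pairs.
    first_loss = F.mse_loss(first_feats, second_feats)

    # Second loss value: cosine similarities between every first datum and
    # every second datum form the predicted similarity vector per row.
    a = F.normalize(first_feats, dim=-1)
    b = F.normalize(second_feats, dim=-1)
    sim = a @ b.t()  # (B, B); sim[i, i] is the first similarity of row i

    # Reference label: ideal similarity 1 for the matched pair, 0 otherwise,
    # expressed as a class index for the cross-entropy loss.
    target = torch.arange(sim.size(0), device=sim.device)
    second_loss = F.cross_entropy(sim, target)

    return first_loss + second_loss
```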
In an optional embodiment of the application, the inputting each first data into the first network model to obtain a characteristic of each first data may include:
for each first data, performing the following operations on the first data through the first network model to obtain the characteristics of the first data:
dividing the first data into at least two subdata to obtain a subdata sequence corresponding to the first data; extracting features of each subdata in the subdata sequence based on a dictionary, wherein the dictionary comprises a plurality of data elements, the number of feature values included in the features of each subdata is equal to the number of the elements in the dictionary, and one feature value represents the probability that the subdata contains the data elements corresponding to the positions of the feature values in the dictionary; obtaining the characteristics of first data based on the characteristics of each subdata;
in this embodiment, the data processing method may further include:
for each second datum, determining, based on the dictionary, the data feature of the second datum corresponding to the dictionary, the data feature characterizing the probability that the second datum corresponds to each data element in the dictionary;
accordingly, the determining the first training loss value may include:
and determining a first training loss value based on the matching degree between the characteristics of the sub data of the first data in each first sample and the data characteristics of the second data corresponding to the dictionary, the matching degree between the characteristics of the first data in each first sample and the characteristics of the second data, and the matching degree between the characteristics of the first data in each first negative example and the characteristics of the second data.
It can be seen that, in this alternative, the first training loss value further includes a loss (which may be referred to as a third loss value) corresponding to the degree of matching between the features of each piece of sub-data of the first data in each first sample and the data feature of the second data corresponding to the dictionary. Based on this loss, the probability of predicting the second data matching the first data from the features of each piece of sub-data of that first data learned by the first network model is maximized; that is, the third loss value constrains the first network model so that the features of each piece of sub-data of the first data learned by the model can predict the second data.
Alternatively, the third loss component can employ the CTC (Connectionist Temporal Classification) error (also called CTC loss), which allows the model to automatically learn the alignment between data from different modalities. In the embodiment of the present application, the data elements in the dictionary are data units that can be used to represent each piece of sub-data of the first data and the second data, and the form of the data elements may be configured as required; optionally, the data elements may include, but are not limited to, pinyin or phonemes. Taking phonemes as an example, the data elements in the dictionary include each phoneme and a blank (a pseudo-token added in the CTC loss to enable automatic alignment between the data). For each piece of sub-data of the first data, the dimension of the feature of the sub-data (i.e., the length of the feature vector) is equal to the number of data elements in the dictionary; the position of each data element in the dictionary is fixed, and the feature of the sub-data characterizes the probability that the sub-data contains the data element at each position. As an illustrative explanation, assuming that there are three elements a, b and c in the dictionary, the length of the feature of one piece of sub-data is 3, which can be expressed as (p1, p2, p3), where p1 is the probability that the sub-data contains a (the element at the first position), p2 the probability that it contains b (the second position), and p3 the probability that it contains c (the third position). For the second data, the data feature corresponding to the dictionary characterizes the probability that the second data corresponds to each data element in the dictionary. When calculating the third loss value corresponding to each positive sample, the probability of obtaining the data feature of the second data corresponding to the dictionary is determined from the feature sequence of the sub-data of the first data (i.e., the feature vectors formed by the feature values); under the constraint of the third loss value, this probability is maximized, so that the features of the sub-data of the first data learned by the first network model can include the semantic information of the second data.
Alternatively, the first data may be speech data, the second data may be text data, and the data elements in the dictionary may be phonemes. The data feature of the second data corresponding to the dictionary may then be the phoneme sequence corresponding to the second data, that is, the sequence composed of the phonemes constituting the second data. When performing feature extraction on the speech data, feature extraction may first be performed on each speech frame (i.e., each piece of sub-data) based on the dictionary to obtain a feature representation of each speech frame; when calculating the CTC loss, the phoneme sequence corresponding to the text data is used as the label, and the CTC loss is calculated from the feature representations of the speech frames and the label. The value of this loss reflects the probability that the phoneme sequence of the text data is predicted from the feature representations of the speech frames: the larger the probability, the smaller the loss.
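As a hedged illustration of this alternative, the sketch below computes such a CTC loss with PyTorch over per-frame features whose dimension equals the dictionary size; the tiny phoneme dictionary and all names are hypothetical:

```python
import torch
import torch.nn.functional as F

# Hypothetical dictionary of data elements: the CTC blank plus a few phonemes.
DICTIONARY = ["<blank>", "b", "i", "a", "o", "j"]  # index 0 is the blank

def third_loss(frame_features, phoneme_targets, input_lengths, target_lengths):
    """frame_features: (T, B, V) per-frame scores, one per dictionary element,
    i.e. the features of each piece of sub-data (speech frame).
    phoneme_targets: (B, S) phoneme ids of the matched text (the label)."""
    log_probs = F.log_softmax(frame_features, dim=-1)
    # ctc_loss sums, over all legal alignment paths, the probability of
    # producing the phoneme sequence from the per-frame distributions;
    # minimizing it maximizes the probability of predicting the second data.
    return F.ctc_loss(log_probs, phoneme_targets,
                      input_lengths, target_lengths, blank=0)
```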
In an optional embodiment of the present application, the candidate standard data may be a standard expression of the second modality corresponding to first standard data of a specified type; the initial neural network model further comprises a classification model. In this case, the training data set further includes a second training set, each second sample in the second training set includes third data of the first modality, fourth data of the second modality matched with the third data, and a type label of the third data, where the third data in the second training set includes third data of the specified type and third data of a non-specified type; after obtaining the neural network model whose first training loss value satisfies the first preset condition, the training process of the model may further include:
continuing to repeatedly execute a training operation on the neural network model based on the second training set until a second training loss value meets a second preset condition, wherein the total training loss value meeting the preset training end condition further includes the second training loss value meeting the second preset condition; the training operation may include:
inputting each third data into the first network model to obtain the characteristics of each third data, inputting each fourth data into the second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into the classification model to obtain the prediction type corresponding to each third data;
determining a second training loss value based on the matching degree of the features of the third data and the features of the fourth data in each second sample, the matching degree of the features of the third data and the features of the fourth data in each second negative example, and the matching degree between the type label and the prediction type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
As can be seen from the foregoing description, in some application scenarios, it is necessary to distinguish the data type of the data to be processed, and further processing may be performed when the data type is a specified type. In order to meet the application requirement, in this optional embodiment of the present application, the neural network model may further include a classification model, in addition to the first network model and the second network model, where the classification model is cascaded with the first network model and is used to determine the type of data input to the first network model according to the features output by the first network model. In this optional embodiment, the process of training the neural network model based on the first training set may be referred to as pre-training, through the pre-training, the first network model and the second network model that substantially satisfy the application requirements may be obtained, and after the neural network model that satisfies the first preset condition is obtained through the pre-training (for convenience of description, the model is referred to as an intermediate model), the fine-tuning training may be continued on the intermediate model based on the second training data set, so as to obtain a model that can better satisfy the requirements of the specific task.
In the fine-tuning training process, a part of the training loss (the matching loss) may be calculated from the degree of matching between the features of the third data and the features of the fourth data in each second sample and the degree of matching between the features of the third data and the features of the fourth data in each second negative example, and another part of the training loss (the classification loss) may be calculated from the type label and the prediction type of each third data; the further training of the model is constrained based on these two training losses. The matching loss may be calculated in the same manner as the second loss value described above, or, of course, in the same manner as the sum of the first loss value and the second loss value described above.
For the classification loss, this part of the loss value represents the difference between the type of the third data predicted by the classification model and the true type of the third data, i.e., the type label. Optionally, the type label of the third data may be 1 or 0, where, for example, 1 represents that the third data is data of the specified type and 0 represents that it is not. The output of the classification model may include a first probability that the third data is of the specified type and a second probability that it is not; the training loss part corresponding to the classification model may be calculated from the two probabilities output by the classification model and the type label of each third data. Optionally, this loss part may be calculated using a binary cross-entropy error, where a smaller error value indicates that the predicted type is closer to the true type.
After the neural network model satisfying the training end condition is obtained through this optional embodiment, in application, the data type of the data to be processed may be identified by the trained classification model. Specifically, the data to be processed may be input into the trained first network model (i.e., the first feature extraction network) to obtain the first data feature of the data to be processed, and the first data feature may be input into the trained classification model to obtain a first probability that the data to be processed belongs to the specified type and a second probability that it does not; whether the data to be processed is data of the specified type may then be determined according to the first probability and the second probability. For example, if the data to be processed is voice data and the specified type is instruction-type voice, then when the classification model determines that the data to be processed is instruction-type voice data, the feature of the data to be processed can be matched with the features of the candidate standard data (such as texts) in the target database, and the text expression matching the voice data can be found, which can also be understood as the recognition result of the voice data.
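A minimal sketch of this application-stage flow, assuming a batch-of-one feature and a two-way classifier head; every name, threshold, and shape here is an illustrative assumption rather than the patent's implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(speech, first_network, classifier, db_feats, db_texts, tau=0.5):
    """db_feats: (N, D) second data features of the candidate standard data;
    db_texts: the corresponding standard expressions."""
    feat = first_network(speech)                     # (1, D) first data feature
    probs = classifier(feat).softmax(-1).squeeze(0)  # [p_specified, p_other]
    if probs[0] <= tau:                              # not the specified type
        return None                                  # skip matching entirely
    sims = F.cosine_similarity(feat, db_feats)       # (N,) matching results
    return db_texts[sims.argmax().item()]            # target standard data
```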
The data processing scheme provided by the embodiment of the application provides a data matching method based on cross-modal retrieval. The method can quickly find the data of another modality matching the data to be processed, namely the target standard data, from the target database using only the data features of the data to be processed, without further deep recognition of the data features. Compared with the prior art, the scheme of the application can effectively reduce the amount of data computation, and the accuracy is also significantly improved.
The method provided by the embodiment of the application is applicable to any scenario requiring cross-modal data processing. For example, it can be applied to the instruction recognition scenario of an AI voice assistant to accurately and quickly recognize a user's voice instruction; it can also be applied to the voice question answering of an AI robot, where the text expression matching the voice input by the user is found so that the answer information corresponding to that text expression can be provided to the user; it can also be applied to cross-modal data retrieval scenarios, for example in a search engine or various applications, where matched audio data is found based on text data input by the user, such as finding and providing the corresponding music according to a search text entered by the user. In addition, the first feature extraction network and the second feature extraction network provided by the embodiment of the application can also be applied to various scenarios requiring the extraction of data features, and can extract data features with better semantic expression capability.
In practical implementation, both the data to be processed and the candidate standard data may include at least one type of data; for example, the data of the first modality is voice and the data of the second modality is text. It should be understood that, in training the first feature extraction network and the second feature extraction network, the data of the first modality (the first data and the third data) and the data of the second modality (the second data and the fourth data) in the training data set should at least include data of the types corresponding to the data to be processed and the candidate standard data, respectively. For example, if the data to be processed may include data of a first type and data of a second type, and the candidate standard data is data of a third type, then at least a portion of the first data and the third data in the training data set should also include data of the first type and data of the second type, and the second data corresponding to that portion of the first data and the fourth data corresponding to that portion of the third data should be data of the third type. That is, the types of sample data in the training data set used when training the feature extraction networks should correspond to the types of data processed after the networks have been trained.
In order to better understand the method provided by the embodiment of the present application and the practical value of the method, the method provided by the embodiment of the present application is described below with reference to a specific scenario embodiment.
The application scenario corresponding to this scenario embodiment is a game scenario, and the method provided by the embodiment of the application can be applied to an AI voice assistant in a game application, where the AI voice assistant can recognize a voice instruction input by a user. In a game scenario, a user can interact with the AI voice assistant through voice when playing the game. For example, when the user says "mark P city" while playing the game on his user terminal, the aim is to make the AI voice assistant mark the location "P city" on the map of the game's virtual scene, and the voice instruction "mark article a" is intended to make the AI voice assistant mark "article a" in the virtual game scene.
Fig. 2 shows a schematic structural diagram of a data processing system applicable in this scenario embodiment of the present application, as shown in fig. 2, the data processing system may include a user terminal 10, a game server 20 and a training server 30, the user terminal 10 may be communicatively connected to the game server 20 through a network, the user terminal 10 may be a user terminal of any game player, and the game server 20 is configured to provide a game service for the player, where the type of the game application is not limited in this embodiment, and may be a game application that requires a user to download and install, a cloud game application, or a game application in an applet. The training server 30 may be communicatively connected to the game server 20 through a network, and the training server 30 may be configured to perform a training operation of the neural network model and provide the trained neural network model to the game server 20.
The AI voice assistant may be deployed in the game server 20 or in the user terminal 10, and optionally, in order to reduce the computing resources of the user terminal 10, the AI voice assistant is deployed at the game server 20 side for example.
In the application scenario, the data to be processed is voice data of a user, the first standard data is a pre-configured standard voice instruction, that is, a voice instruction supported by the game application, and the standard expression corresponding to the first standard data is a text expression (that is, a candidate standard data). An alternative implementation flow of the method provided by the present application in a game scenario is described below with reference to the data processing system shown in fig. 2. The data processing flow in this embodiment may include the following steps S1 to S3:
step S1: and training the neural network model.
This step may be performed by training server 30. Fig. 3 shows a schematic diagram of a training principle of the neural network model in this scenario. As shown in fig. 3, the training process may include two stages of pre-training and fine-tuning training, fig. 4 shows a schematic diagram of the pre-training stage, and fig. 5 shows a schematic diagram of the fine-tuning training stage.
As shown in fig. 3, the neural network model in the present scenario embodiment includes a speech coding module (the first network model), a text coding module (the second network model), and a classification module. Speech data may be passed through the speech coding module to obtain a corresponding speech representation vector (i.e., the feature of the speech data, which may also be referred to as its vector representation), and text data may be passed through the text coding module to obtain a corresponding text representation vector (i.e., the feature of the text data). The specific network structures of these modules are not limited in the application and can be configured according to actual needs.
Alternatively, the speech coding module may adopt a structure based on the Wav2vec2 model; for example, a Wav2vec2 model (Wav2vec2 shown in fig. 4) followed by a pooling module (the pooling operation shown in fig. 4) may serve as the speech coding module, where the Wav2vec2 model is composed of multiple CNN layers and multiple Transformer layers. When the speech coding module performs feature extraction, the speech data is first encoded by the Wav2vec2 model to obtain a vector sequence, each element of which is a vector (which can be understood as the feature vector of a speech segment); the pooling operation then computes the average of the vector sequence to obtain a single vector, namely the speech representation vector. For the text encoding module, optionally, a structure based on the Bert model may be adopted, which is composed of multiple Transformer layers. The two-stage training process of pre-training and fine-tuning of the neural network model is described below with reference to fig. 3 to 5.
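For concreteness, a sketch of such encoders using the Hugging Face transformers library is given below; the checkpoint names and the use of the [CLS] vector for text pooling are assumptions made for illustration (the scenario only fixes mean pooling on the speech side):

```python
import torch
from transformers import Wav2Vec2Model, BertModel

class SpeechEncoder(torch.nn.Module):
    """Wav2vec2 followed by a mean-pooling operation, as in Fig. 4."""
    def __init__(self, checkpoint="facebook/wav2vec2-base"):  # placeholder
        super().__init__()
        self.wav2vec2 = Wav2Vec2Model.from_pretrained(checkpoint)

    def forward(self, waveform):                             # (B, num_samples)
        frames = self.wav2vec2(waveform).last_hidden_state   # (B, T, D)
        return frames.mean(dim=1)          # speech representation vector

class TextEncoder(torch.nn.Module):
    """Bert-based text coding module; [CLS] pooling is an assumption."""
    def __init__(self, checkpoint="bert-base-chinese"):  # placeholder
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # text representation vector
```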
Pre-training process
The pre-training is based on a first training set comprising a large amount of speech-translated text data (i.e. a first sample comprising speech data and text data matching the speech data). For convenience of description, the pre-training process will be described below with reference to the speech data (the first data in fig. 3) in the training data set as sample speech and the text data (the second data in fig. 3) corresponding to the speech data as translated text.
As shown in fig. 3 and 4, one training pass may include: inputting each sample speech into the speech coding module to obtain the speech representation vector of each sample speech, and inputting each translated text into the text coding module to obtain the text representation vector of each translated text; then calculating the training loss of the pre-training stage (the first training loss value), that is, the total error shown in fig. 4, based on the speech representation vectors and the text representation vectors. If the total error satisfies the first preset condition, the pre-training stage may be ended to obtain the intermediate model; if not, the model parameters of the speech coding module and the text coding module are adjusted (the update parameters shown in fig. 4) and the training pass is repeated until the total error satisfies the first preset condition.
As shown in fig. 4, the pre-training in the present scenario embodiment adopts multiple training targets, i.e., multiple training errors, specifically including the CTC error (corresponding to the third loss value), the matching error (corresponding to the second loss value), and the distillation error (corresponding to the first loss value); the first training loss value of the pre-training stage is the sum of these three errors, whose meanings are as follows:
(1) CTC error: this error is calculated based on the intermediate vectors of the speech coding module (the features of each piece of sub-data) and the label (the data feature of the translated text corresponding to the dictionary), and can represent the similarity between the sample speech and the translated text in a positive sample. The label is generated from the translated text and is the Chinese pinyin sequence of that text; pinyin is used instead of characters to reduce the size of the dictionary, i.e., the number of data elements in the dictionary and hence the length of the label. The vector sequence output by the Wav2vec2 model, i.e., the features of the speech segments of the speech data, and the label are used as the inputs for calculating the CTC error. The CTC error $L_{ctc}$ corresponding to a positive sample can be expressed as follows:

$$L_{ctc} = -\log p(y \mid x) = -\log \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid c_t)$$

In the present embodiment, $y$ denotes the label, $x$ denotes the sample speech, and $c_1, c_2, \ldots, c_T$ denote the representation vectors of the speech segments (speech frames) output after the sample speech is encoded by the Wav2vec2 model of the speech coding module. $\pi$ denotes a legal sequence corresponding to the sample speech $x$, which can be understood as a path through which the label $y$ (i.e., the pinyin sequence) is obtained from the representation vectors of the speech segments, and $B^{-1}(y)$ is the set of all such paths. $L_{ctc}$ thus characterizes the probability, summed over all such paths, that the label $y$ is obtained from the representation vectors of the speech segments of the sample speech.

For a speech-translated text pair, this error therefore represents the likelihood that the label $y$ can be predicted from the representation vectors of the speech segments output by the Wav2vec2 model. During the training process, the purpose of the CTC error is to enable the vector sequence output by the Wav2vec2 model to predict the pinyin of the translated text, so that these vectors contain semantic information.
(2) Matching error: the vector representations corresponding to the sample speech and the translated text are used as the inputs of this error. The aim of the matching error is to make the similarity between the two vector representations of a speech-translated text pair (a positive example) higher than that of a speech and a negative-example translated text (a negative example); optionally, the similarity can be calculated using cosine similarity. During training, a batch of speech-translated text data is processed at one time: each speech-translated text pair in the batch is a positive example, and the other speeches and translated texts in the batch pair up to form the negative examples, i.e., the speech-negative-example translated texts.
Alternatively, the matching error may be a multi-class cross-entropy loss. Specifically, after obtaining the speech vector representation of each sample speech and the text vector representation of each translated text through the model, for each sample speech, a first similarity is computed between the two vector representations of the positive example to which the speech belongs, and a second similarity is computed for each negative example to which the speech belongs, i.e., the similarity between the speech representation vector of that speech and the text representation vector of each other text (each translated text except the one matching the speech). As an illustrative description, assuming there are 10 speech-translated text pairs in a batch, then for each sample speech there are 9 second similarities and one first similarity, and these 10 similarities can form a distribution as the prediction result. For example, with the first similarity as the first value of the distribution, the prediction result distribution can be represented as a distribution vector [p1, p2, …, p10], where p1 is the first similarity and the other 9 values are the second similarities; the true distribution (true label) corresponding to this prediction result distribution can be represented as [1, 0, …, 0], where 1 is the true label of the similarity between the sample speech and the translated text matching it, and 0 is the true label of the similarity between the sample speech and each other text. With the true distribution as the real label and the prediction result distribution as the predicted distribution of the model, the matching error corresponding to each sample speech can be calculated through the multi-class cross-entropy loss.
In the training process, the similarity between the vector representations of the positive examples learned by the model can be higher than the similarity between the vector representations of the negative examples through the constraint of the matching error.
(3) Distillation error: the two vector representations corresponding to a positive example are used as the inputs of this error, and the mean square error (MSE) of the two vectors is calculated; the purpose of this error is to bring the two vector representations of a speech-translated text pair close together.
After the three errors are calculated, the three error values are averaged or summed to obtain the overall error; if the overall error does not satisfy the first preset condition, the model parameters of the speech coding module and the text coding module are updated, and the training is repeated.
After the intermediate model satisfying the first preset condition is obtained, training of each part of the intermediate model, namely fine tuning training, may be continued based on the second training set.
Fine tuning training process
Fig. 5 shows a schematic diagram of the fine-tuning training process. As shown in fig. 3 and fig. 5, in the fine-tuning training stage, in addition to continuing to train the speech coding module and the text coding module, a classification model (the rejection classification module shown in fig. 5) also needs to be trained, and a neural network model satisfying the second preset condition is obtained through the fine-tuning training. The speech coding module and the classification module obtained at this point may be deployed in the AI voice assistant of the game server 20 for recognizing the category of the voice data input by a game player and extracting the speech representation vector of the voice data. Optionally, the text coding module may also be deployed in the game server for extracting the text representation vectors of the standard text expressions corresponding to the standard voice instructions; or the text coding module may be deployed in another device, which extracts the text representation vector of the standard text expression corresponding to each standard voice instruction and then provides those vectors to the game server for use. The structure of the classification module is not limited in this application; optionally, the rejection classification module may consist of one fully connected layer. The process of the fine-tuning training is described below with reference to fig. 3 and 5.
In the fine-tuning training stage, because the rejection classification module is used for judging whether the sample speech input to the speech coding module is an instruction, each second sample in the second training set includes not only the sample speech and the translated text corresponding to the sample speech but also the type label of the sample speech. For example, the type label may be 1, indicating instruction speech, or 0, indicating non-instruction speech.
As shown in fig. 5, in the fine-tuning training stage, the speech representation vector of each sample speech (instruction speech and non-instruction speech) is likewise obtained through the speech coding module, and the corresponding text representation vector is obtained by passing the translated text of each sample speech through the text coding module. The speech representation vector of each sample speech is passed through the rejection classification module to obtain the probability that the sample speech is instruction speech. Optionally, there may be two objective functions (i.e., loss functions) at this stage: the rejection classification error may adopt a binary cross-entropy error, calculated from the difference between the probability, predicted by the classification model, that the sample speech is instruction speech and the type label of the sample speech; the matching error may be the same as the matching error in the pre-training stage. The two errors may then be averaged or summed to obtain the second training loss value (the total error in fig. 5). If the total error satisfies the second preset condition, the trained model is obtained; if not, the model parameters of the speech coding module, the text coding module and the classification model are updated, and the fine-tuning training process is repeated.
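A hedged sketch of this second training loss value, combining the in-batch matching error from the pre-training stage with the binary cross-entropy rejection error; function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def second_training_loss(speech_vecs, text_vecs, reject_logits, type_labels):
    """speech_vecs, text_vecs: (B, D), row i forming the i-th positive pair;
    reject_logits: (B,) rejection-classifier outputs;
    type_labels: (B,) 1.0 for instruction speech, 0.0 otherwise."""
    # Matching error: multi-class cross entropy over cosine similarities,
    # identical in form to the pre-training matching error.
    sim = F.normalize(speech_vecs, dim=-1) @ F.normalize(text_vecs, dim=-1).t()
    match_err = F.cross_entropy(sim, torch.arange(sim.size(0), device=sim.device))
    # Rejection classification error: binary cross entropy against the labels.
    reject_err = F.binary_cross_entropy_with_logits(reject_logits, type_labels)
    return match_err + reject_err   # summing; averaging is equally possible
```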
Step S2: and constructing a target database.
Before the model is applied and data to be processed is handled, the target database needs to be constructed. The target database in this application scenario comprises the instruction library and the instruction vector library shown in fig. 2. The instruction library stores the standard expressions (candidate standard data) in text form corresponding to the standard voice instructions, and the instruction vector library stores the text vector representations (second data features) of the standard expressions in the instruction library. The instruction library may be constructed according to instruction intention categories: the instruction library contains a plurality of specific intentions (corresponding to standard voice instructions, each of which can be understood as one intention of the user), each intention may be treated as one category and assigned a category id, and for each intention, one or more natural-language texts expressing that intention may be constructed as the standard expressions of the intention. For example, the voice instruction "mark article a" can be taken as one category, and its standard expressions can include "mark article a", "there is article a", and the like.
For the instruction vector library, the text vector representations can be extracted by the trained text encoding module. As shown in fig. 6, each standard expression in the instruction library may be input into the text encoding module to obtain the corresponding text vector representation, and the text vector representation of the standard expression corresponding to each intention is stored into the instruction vector library.
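As an illustration, a small sketch of this construction step, reusing the TextEncoder sketched earlier; the dictionary-of-intentions layout and all names are assumptions:

```python
import torch

@torch.no_grad()
def build_instruction_vector_library(text_encoder, tokenizer, instruction_library):
    """instruction_library: {category_id: [standard expressions]}."""
    vector_library = []   # (category_id, standard expression, text vector)
    for category_id, expressions in instruction_library.items():
        for text in expressions:
            enc = tokenizer(text, return_tensors="pt")
            vec = text_encoder(enc["input_ids"], enc["attention_mask"])
            vector_library.append((category_id, text, vec.squeeze(0)))
    return vector_library
```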
Based on the scheme provided by the embodiment of the application, the instruction library can be updated and expanded very conveniently and quickly. For example, when a new voice instruction is added, only the text expression corresponding to the new voice instruction needs to be added to the instruction library, its text vector representation extracted by the text coding module and recorded into the instruction vector library; the model does not need to be retrained. This well satisfies the application requirement when the set of instructions in the target database (each standard voice instruction can be regarded as one category) is not fixed.
Optionally, in the fine-tuning training phase, the instruction speech-translated text pairs in the second training set may include pairs formed by a standard voice instruction in the instruction library and a standard expression corresponding to that standard voice instruction.
Step S3: and processing the data to be processed.
In the application stage, each voice input of the user (i.e., a voice instruction, that is, the data to be processed in this application scenario) may be processed with the flow shown in fig. 7. Specifically, the trained speech coding module is used to obtain the speech representation vector (i.e., the first data feature) of the voice input; this vector representation is used as a query vector, and the similarity (i.e., the matching score) between the query vector and each vector representation in the instruction vector library is calculated, from which the maximum matching score and the text vector representation corresponding to it (the instruction and matching score shown in fig. 7) are selected. In addition, the speech representation vector is input into the rejection classification module to obtain a rejection score (such as the probability that the voice input is a voice instruction). Then, according to the matching score and the rejection score, the rule judgment module judges, according to a pre-configured judgment rule, whether the standard expression corresponding to the text vector representation with the maximum similarity is the target standard data matching the voice input. For example, the judgment rule may be that the matching score is greater than a first threshold and the rejection score is greater than a second threshold; if the condition is satisfied, the standard expression corresponding to the maximum similarity is determined as the target standard data, that is, the voice input by the user is considered to be the voice instruction corresponding to that standard expression (the instruction output at the end of fig. 7), and the action corresponding to the standard voice instruction can be performed according to the user's voice input. For example, if the standard expression is "mark article a", the game server may perform the corresponding marking action and present the marking result to the player through the user interface of the user terminal (i.e., the interactive interface of the game application).
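The pre-configured judgment rule can be sketched as below; the threshold values are purely illustrative assumptions:

```python
def rule_judgment(match_score, reject_score, best_expression,
                  match_threshold=0.8, reject_threshold=0.5):
    """Returns the target standard data, or None to reject the input."""
    if match_score > match_threshold and reject_score > reject_threshold:
        return best_expression   # treat input as this standard instruction
    return None                  # not a supported voice instruction
```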
As an example, fig. 8a and 8b are schematic diagrams illustrating the user interface in a game scenario. During a game, a player may perform game operations by initiating voice instructions or by manual operation; specifically, the player may control his player character through an input control device on or external to his terminal device. As shown in fig. 8a, during the game the player may click the "voice assistant" control (the control 81 in fig. 8a) to open or close the AI voice assistant, and while the AI voice assistant is open, the player may initiate voice instructions. In the game scene, if the player wants to mark a virtual item a in the scene, he can control his player character (which can mark the item a through the character's designated virtual prop) to aim at the item a; after aiming, he can issue the voice instruction "mark the item a", or mark the item a by clicking the "mark" control (the control 82 in fig. 8a). In the voice mode, after receiving the voice instruction "mark the item a", the AI voice assistant can execute the data processing method provided by the embodiment of the present application (as described in any optional embodiment corresponding to steps S120 to S140 above), or send the voice instruction to the game server, which determines, by executing the method, that the real intention of the user (i.e., the target standard data) is to mark "item a"; the AI voice assistant can then mark the item a in the game scene according to that real intention. As shown in fig. 8b, after the item a is marked, corresponding marking information may be displayed, including but not limited to attribute information of the item a (such as its category), marking prompt information (such as "abc marks the item a" in fig. 8b and the mark 83 floating above the item a, where abc is the player's name in the game, i.e., the player's nickname), the current distance of the player character from the item in the game scene (5 meters shown in the figure), and the like.
In practical applications, if a player is currently in a team game, after the player marks the item a, other players in the team where the player is located may also see corresponding prompt information on their user interfaces, such as "abc marks the item a", information on relative positions of player characters of other players and the item a in a game scene, and the like.
As an alternative, in order to accelerate retrieval, when determining the standard expression with the highest matching score by calculating the similarity between the query vector and the vector representations in the instruction vector library, the Faiss retrieval method (an efficient similarity search method) may be adopted. Accordingly, after the text vector representation of each standard expression is obtained through the text encoding module, the instruction vector library may be built with the feature vector library construction method of Faiss, so as to improve retrieval efficiency.
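A minimal sketch of such a Faiss-backed instruction vector library, assuming the text vectors are L2-normalized float32 arrays so that inner product equals cosine similarity; names are illustrative:

```python
import faiss
import numpy as np

def build_faiss_index(text_vectors: np.ndarray):
    """text_vectors: (N, D) float32, one row per standard expression."""
    index = faiss.IndexFlatIP(text_vectors.shape[1])  # inner-product index
    index.add(text_vectors)
    return index

def search_instruction(index, query_vector: np.ndarray, k: int = 1):
    """query_vector: (1, D) float32 speech representation vector."""
    scores, ids = index.search(query_vector, k)
    return scores[0], ids[0]   # matching scores and instruction-library ids
```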
It can be seen that, with the method provided by the embodiment of the application, no speech recognition model is needed when recognizing voice data: the text expression matching the voice data can be recognized quickly and accurately based on the similarity between the speech representation vector of the voice data and the text representation vectors of the text data, the standard voice instruction corresponding to that text expression can be taken as the real intention of the voice data, and the corresponding operation is executed according to that intention. This method effectively reduces computational cost and improves processing efficiency. In addition, the cross-modal retrieval mode of retrieving text with speech solves well the problem of poor recognition accuracy when there are many instruction categories, and satisfies practical application requirements regarding instruction categories: when a new instruction category appears, the model does not need to be retrained, and only the new instruction text needs to be added to the retrieval library (i.e., the instruction vector library). Furthermore, through the two-stage pre-training plus fine-tuning training mode and the multi-objective optimization scheme, the speech coding module can learn the semantic information of speech well, which improves recognition accuracy.
In practical applications, besides the staged training method provided above, the two stages may also be combined, or trained alternately.
In order to verify the effect of the method provided by the embodiment of the application, the method was compared with an existing scheme in a game scenario. The test used 1469 manually labeled data items; the existing scheme adopts a pipeline of Automatic Speech Recognition (ASR) plus Natural Language Understanding (NLU). Accuracy, recall rate and false recall rate were adopted as the evaluation indexes, where the false recall rate is the proportion of non-instruction speech for which the model gives an instruction recognition result. Higher accuracy and recall rate are better, and a lower false recall rate is better. Table 1 below shows the test results of the scheme of the embodiment of the present application and the existing scheme. It should be noted that the test data was screened so that the non-instruction voice data closely resembles voice instructions, which is why the false recall rate of both schemes is relatively high.
TABLE 1 Comparison on manually labeled data

| Model | Accuracy | Recall rate | False recall rate |
| --- | --- | --- | --- |
| Existing scheme | 38.02% | 61.97% | 64.20% |
| This application | 42.86% | 76.40% | 60.34% |
According to the test results, compared with the existing scheme, the accuracy and recall rate of the neural network model provided by the embodiment of the application are significantly improved, and the false recall rate is noticeably reduced.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application also provides a data processing apparatus, as shown in fig. 9, the data processing apparatus 100 may include a to-be-processed data obtaining module 110, a feature obtaining module 120, and a data identifying module 130.
A to-be-processed data obtaining module 110, configured to obtain to-be-processed data, where the to-be-processed data is data in a first modality;
a feature obtaining module 120, configured to extract a first data feature of the data to be processed;
the data identification module 130 is configured to match the first data feature with at least one second data feature in the target database to obtain a matching result corresponding to each second data feature, and determine, from the candidate standard data, the target standard data matched with the to-be-processed data according to the matching result corresponding to each second data feature;
the target database comprises at least one candidate standard data and a second data characteristic of each candidate standard data, and the candidate standard data are data of a second modality.
Optionally, the data identification module is further configured to: determining the data type of the data to be processed according to the first data characteristic; accordingly, the data recognition module, when matching the first data characteristic with at least one second data characteristic in the target database, is configured to:
and when the data type of the data to be processed is a specified type, matching the first data characteristic with at least one second data characteristic in the target database.
Optionally, the data of the first modality and the data of the second modality are data of different modalities, the data of the first modality includes at least one of text, voice, video or image, and the data of the second modality includes at least one of text, voice, video or image.
Optionally, the candidate standard data is a standard expression matched with first standard data in a standard database, the first standard data is data of a first modality, and one first standard data corresponds to at least one standard expression.
Optionally, the feature obtaining module is further configured to: when the newly added first standard data exist in the standard database, at least one standard expression corresponding to the newly added first standard data is obtained; extracting second data characteristics of each standard expression corresponding to the newly added first standard data; and storing each standard expression corresponding to the newly added first standard data and the second data characteristic association corresponding to each standard expression into the target database.
Optionally, the first data feature is obtained by extracting through a first feature extraction network; the second data characteristic of the candidate standard data is extracted through a second characteristic extraction network; the first feature extraction network and the second feature extraction network are obtained by training a model training module in the following mode:
acquiring a training data set, wherein the training data set comprises a first training set, and each first sample in the first training set comprises first data of a first modality and second data of a second modality matched with the first data;
performing iterative training on an initial neural network model based on a training data set until a training total loss value meets a preset training end condition, wherein the neural network model comprises a first network model and a second network model, the first network model meeting the training end condition is used as a first feature extraction network, and the second network model meeting the training end condition is used as a second feature extraction network; the training process comprises the following steps:
inputting each first data into a first network model to obtain the characteristics of each first data, and inputting each second data into a second network model to obtain the characteristics of each second data;
determining a first training loss value based on the matching degree of the features of the first data and the features of the second data in each first sample and the matching degree of the features of the first data and the features of the second data in each first negative example; wherein the first negative example comprises first data of one first sample and second data of another first sample;
if the first training loss value does not meet the first preset condition, adjusting model parameters of the first network model and the second network model, wherein the total training loss value meets a preset training ending condition, including the first training loss value meeting the first preset condition.
Optionally, the model training module is configured to execute, when inputting each first data into the first network model and obtaining the feature of each first data:
for each first data, performing the following operations on the first data through the first network model to obtain the characteristics of the first data: dividing the first data into at least two subdata to obtain a subdata sequence corresponding to the first data; extracting features of each subdata in the subdata sequence based on a dictionary, wherein the dictionary comprises a plurality of data elements, the number of feature values included in the features of each subdata is equal to the number of the elements in the dictionary, and one feature value represents the probability that the subdata contains the data elements corresponding to the positions of the feature values in the dictionary; obtaining the characteristics of first data based on the characteristics of each subdata;
the model training module may be further operable to: for each second datum, determining, based on the dictionary, that the second datum corresponds to a data feature of the dictionary, the data feature characterizing a probability that the second datum corresponds to a respective data element in the dictionary;
the model training module, in determining the first training loss value, may be to: and determining a first training loss value based on the matching degree between the characteristics of the sub data of the first data in each first sample and the data characteristics of the second data corresponding to the dictionary, the matching degree between the characteristics of the first data in each first sample and the characteristics of the second data, and the matching degree between the characteristics of the first data in each first negative example and the characteristics of the second data.
Optionally, the model training module, when determining the first training loss value based on the matching degree of the features of the first data and the features of the second data in each first sample and the matching degree of the features of the first data and the features of the second data in each first negative example, may be configured to:
determining the degree of difference between the features of the first data and the features of the second data of each first sample to obtain a first loss value;
for each piece of first data, determining a first similarity and a second similarity corresponding to the first data, wherein the first similarity is the similarity between the features of the first data and the features of the second data matched with the first data, and the second similarity is the similarity between the features of the first data and the features of the second data in the first negative example in which the first data is located;
acquiring reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
determining a second loss value based on the prediction similarity and the reference label corresponding to each first data, wherein the prediction similarity comprises the first similarity and the second similarity, and the second loss value represents the difference between the prediction similarity and the reference label corresponding to each first data;
and determining a first training loss value according to the first loss value and the second loss value.
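Read together, these steps admit a simple realization: a first loss value measuring the difference between matched features, and a second loss value comparing predicted similarities against 1/0 reference labels, combined into the first training loss. A hedged sketch follows; the distance, the similarity measure, and the weights alpha and beta are assumptions, not the patent's prescription.

```python
import torch
import torch.nn.functional as F

def combined_first_training_loss(f1, f2, neg_f2, alpha=1.0, beta=1.0):
    # f1, f2: features of the first and matching second data in each first
    # sample; neg_f2: features of the second data in the first negative
    # example containing each first datum. alpha/beta are assumed weights.

    # First loss value: degree of difference between the features of the
    # first data and the features of the matched second data.
    loss1 = F.mse_loss(f1, f2)

    # Predicted similarities: the first similarity (matched pair) and the
    # second similarity (negative pair in which the first datum appears).
    sim_pos = F.cosine_similarity(f1, f2, dim=-1)
    sim_neg = F.cosine_similarity(f1, neg_f2, dim=-1)

    # Reference labels: 1 for each first similarity, 0 for each second one.
    preds = (torch.cat([sim_pos, sim_neg]) + 1.0) / 2.0  # map [-1,1] -> [0,1]
    labels = torch.cat([torch.ones_like(sim_pos), torch.zeros_like(sim_neg)])

    # Second loss value: difference between predicted similarities and labels.
    loss2 = F.binary_cross_entropy(preds, labels)

    # First training loss from the first and second loss values.
    return alpha * loss1 + beta * loss2

# Example call on random features for a batch of 8 first samples.
f1, f2, neg_f2 = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
loss = combined_first_training_loss(f1, f2, neg_f2)
```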
Optionally, the candidate standard data are standard expressions of the second modality corresponding to first standard data of a specified type; the initial neural network model further comprises a classification model. The training data set further comprises a second training set, each second sample in the second training set comprising third data of the first modality, fourth data of the second modality matching the third data, and a type label for the third data, wherein the third data in the second training set comprise third data of the specified type and third data of a non-specified type. After a neural network model whose first training loss value satisfies the first preset condition is obtained, the model training module is further configured to perform the following training process:
continuing to repeatedly execute a training operation on the neural network model based on the second training set until a second training loss value meets a second preset condition; the total training loss value meeting the preset training end condition further includes the second training loss value meeting the second preset condition. The training operation comprises:
inputting each third data into the first network model to obtain the characteristics of each third data, inputting each fourth data into the second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into the classification model to obtain the prediction type corresponding to each third data;
determining a second training loss value based on the matching degree of the features of the third data and the features of the fourth data in each second sample, the matching degree of the features of the third data and the features of the fourth data in each second negative example, and the matching degree between the type label and the prediction type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
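A sketch of this second training stage: the encoders from the first stage are kept, a classification model is stacked on the first network model's features, and the second training loss adds a type-classification term (e.g. specified instruction-type speech vs. other speech) to the cross-modal matching terms. The linear head, the in-batch construction of second negative examples, and the weight lam are assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical classification model on top of the first network model's
# features: specified type vs. non-specified type.
classifier = torch.nn.Linear(256, 2)

def second_training_loss(f3, f4, type_labels, lam=1.0, temperature=0.07):
    # f3: features of the third data (first modality); f4: features of the
    # matching fourth data (second modality); type_labels: type label of
    # each third datum. lam and temperature are assumed hyperparameters.
    f3n, f4n = F.normalize(f3, dim=-1), F.normalize(f4, dim=-1)

    # Matching terms over second samples (diagonal) and second negative
    # examples (off-diagonal, formed in-batch as in the first stage).
    logits = f3n @ f4n.t() / temperature
    match_loss = F.cross_entropy(logits, torch.arange(f3.size(0)))

    # Classification term: predicted type of each third datum vs. its label.
    cls_loss = F.cross_entropy(classifier(f3), type_labels)
    return match_loss + lam * cls_loss

# Example step over a batch of 8 second samples.
f3, f4 = torch.randn(8, 256), torch.randn(8, 256)
type_labels = torch.randint(0, 2, (8,))
loss = second_training_loss(f3, f4, type_labels)
```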
Optionally, the data of the first modality is speech, the data of the second modality is text, and the data elements are phonemes.
Optionally, the specified type is instruction type speech.
The apparatus of the embodiments of the present application can execute the methods provided by the embodiments of the present application, and the implementation principles are similar. The actions executed by the modules of the apparatus correspond to the steps of the methods of the embodiments of the present application; for a detailed description of the functions of the modules and their beneficial effects, reference may be made to the description of the corresponding methods above, which is not repeated here.
An embodiment of the present application further provides an electronic device, where the electronic device includes a memory, a processor, and a computer program stored on the memory, and the processor executes the computer program to implement the steps of the method provided in any optional embodiment of the present application.
Fig. 10 is a schematic structural diagram of an electronic device to which embodiments of the present application are applicable. As shown in Fig. 10, the electronic device 4000 includes a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. In practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The memory 4003 may be a ROM (Read-Only Memory) or another type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store a computer program and that can be read by a computer, without limitation.
The memory 4003 stores a computer program for executing the methods provided by the embodiments of the present application, and its execution is controlled by the processor 4001. When executing the computer program stored in the memory 4003, the processor 4001 can implement the steps of any one of the foregoing method embodiments of the present application.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps and corresponding contents of any one of the foregoing method embodiments of the present application can be implemented.
Embodiments of the present application further provide a computer program product, where the computer program product includes a computer program, and when the computer program is executed by a processor, the steps and corresponding contents of any one of the foregoing method embodiments of the present application can be implemented.
It should be noted that the terms "first," "second," "third," "fourth," "1," "2," and the like (if any) in the description and claims of this application and the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than illustrated or otherwise described herein.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing describes only optional implementations of some of the implementation scenarios of this application. It should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of this application, without departing from that technical idea, also fall within the protection scope of the embodiments of this application.
Claims (13)
1. A data processing method, comprising:
acquiring data to be processed, wherein the data to be processed is data of a first modality;
extracting a first data characteristic of the data to be processed;
matching the first data characteristic with at least one second data characteristic in a target database to obtain a matching result corresponding to each second data characteristic, wherein the target database comprises at least one candidate standard data and the second data characteristic of each candidate standard data, and the candidate standard data are data of a second modality;
and determining target standard data matched with the data to be processed from the candidate standard data according to the matching result corresponding to each second data feature.
2. The method of claim 1, further comprising:
determining the data type of the data to be processed according to the first data characteristic;
the matching the first data characteristic with at least one second data characteristic in a target database comprises:
and when the data type of the data to be processed is a specified type, matching the first data characteristic with at least one second data characteristic in a target database.
3. The method of claim 1, wherein the data of the first modality and the data of the second modality are data of different modalities, the data of the first modality comprising at least one of text, voice, video or images, and the data of the second modality comprising at least one of text, voice, video or images.
4. A method according to any one of claims 1 to 3, wherein the candidate standard data are standard expressions that match first standard data in a standard database, the first standard data being data of the first modality, and one piece of first standard data corresponding to at least one standard expression.
5. The method of claim 4, further comprising:
when newly added first standard data exists in the standard database, acquiring at least one standard expression corresponding to the newly added first standard data;
extracting second data characteristics of each standard expression corresponding to the newly added first standard data;
and storing, into the target database, each standard expression corresponding to the newly added first standard data in association with the second data characteristic corresponding to that standard expression.
6. The method of claim 1, wherein the first data feature is extracted through a first feature extraction network; the second data characteristic of the candidate standard data is extracted through a second characteristic extraction network; the first feature extraction network and the second feature extraction network are trained by:
acquiring a training data set, wherein the training data set comprises a first training set, and each first sample in the first training set comprises first data of a first modality and second data of a second modality matched with the first data;
performing iterative training on an initial neural network model based on the training data set until a total training loss value meets a preset training end condition, wherein the initial neural network model comprises a first network model and a second network model, the first network model that meets the training end condition is used as the first feature extraction network, and the second network model that meets the training end condition is used as the second feature extraction network; the training process comprises the following steps:
inputting each first data into a first network model to obtain the characteristics of each first data, and inputting each second data into a second network model to obtain the characteristics of each second data;
determining a first training loss value based on the matching degree of the features of the first data and the features of the second data in each first sample and the matching degree of the features of the first data and the features of the second data in each first negative example; wherein the first negative example comprises first data of one first sample and second data of another first sample;
if the first training loss value does not meet a first preset condition, adjusting model parameters of the first network model and the second network model, wherein the total training loss value meeting the preset training end condition includes the first training loss value meeting the first preset condition.
7. The method of claim 6, wherein said inputting each of said first data into a first network model to obtain a characteristic of each of said first data comprises:
for each first data, performing the following operations on the first data through the first network model to obtain the characteristics of the first data:
dividing the first data into at least two subdata to obtain a subdata sequence corresponding to the first data;
extracting features of each subdata in the subdata sequence based on a dictionary, wherein the dictionary comprises a plurality of data elements, the number of feature values included in the features of each subdata is equal to the number of the elements in the dictionary, and one feature value represents the probability that the subdata contains the data element corresponding to the position of the feature value in the dictionary;
obtaining the characteristics of the first data based on the characteristics of each subdata;
the method further comprises the following steps:
for each of the second data, determining, based on the dictionary, a data feature of the second data corresponding to the dictionary, wherein the data feature characterizes the probability that the second data corresponds to each data element in the dictionary;
the determining a first training loss value comprises:
and determining a first training loss value based on the matching degree between the characteristics of each subdata of the first data in each first sample and the data characteristics of the second data corresponding to the dictionary, the matching degree between the characteristics of the first data in each first sample and the characteristics of the second data, and the matching degree between the characteristics of the first data in each first negative example and the characteristics of the second data.
8. The method of claim 6, wherein determining the first training loss value based on the degree of matching between the features of the first data and the features of the second data in each of the first samples and the degree of matching between the features of the first data and the features of the second data in each of the first negative examples comprises:
determining the difference degree between the characteristics of the first data and the characteristics of the second data of each first sample to obtain a first loss value;
for each piece of first data, determining a first similarity corresponding to the first data and a second similarity corresponding to the first data, wherein the first similarity is the similarity between the characteristics of the first data and the characteristics of second data matched with the first data, and the second similarity is the similarity between the characteristics of the first data and the characteristics of the second data in a first negative example in which the first data is located;
acquiring reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
determining a second loss value based on the prediction similarity and the reference label corresponding to each piece of the first data, wherein the prediction similarity comprises the first similarity and the second similarity, and the second loss value represents the difference between the prediction similarity and the reference label corresponding to each piece of the first data;
determining the first training loss value according to the first loss value and the second loss value.
9. The method according to any one of claims 6 to 8, wherein the candidate standard data is a standard expression of a second modality corresponding to a specified type of first standard data; the initial neural network model further comprises a classification model;
the training data set further comprises a second training set, each second sample in the second training set comprising third data of the first modality, fourth data of the second modality matching the third data, and a type label of the third data, wherein the third data in the second training set comprises third data of a specified type and third data of a non-specified type;
after obtaining the neural network model with the first training loss value satisfying the first preset condition, the training process further includes:
continuing to repeatedly execute a training operation on the neural network model based on the second training set until a second training loss value meets a second preset condition, wherein the total training loss value meeting the preset training end condition further includes the second training loss value meeting the second preset condition; the training operation comprises:
inputting each third data into a first network model to obtain the characteristics of each third data, inputting each fourth data into a second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into a classification model to obtain the prediction type corresponding to each third data;
determining a second training loss value based on the matching degree of the features of the third data and the features of the fourth data in each second sample, the matching degree of the features of the third data and the features of the fourth data in each second negative example, and the matching degree between the type label and the prediction type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
10. A data processing apparatus, characterized in that the apparatus comprises:
the device comprises a to-be-processed data acquisition module, a to-be-processed data acquisition module and a to-be-processed data acquisition module, wherein the to-be-processed data acquisition module is used for acquiring to-be-processed data which is data in a first mode;
the characteristic acquisition module is used for extracting first data characteristics of the data to be processed;
the data identification module is used for matching the first data characteristic with at least one second data characteristic in a target database to obtain a matching result corresponding to each second data characteristic, and determining target standard data matched with the data to be processed from each candidate standard data according to the matching result corresponding to each second data characteristic;
the target database comprises at least one candidate standard data and a second data characteristic of each candidate standard data, and the candidate standard data are data of a second modality.
11. An electronic device, characterized in that the electronic device comprises a memory in which a computer program is stored and a processor which executes the computer program to implement the method of any of claims 1 to 9.
12. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1 to 9.
13. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210118785.7A CN114444609B (en) | 2022-02-08 | 2022-02-08 | Data processing method, device, electronic equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114444609A (en) | 2022-05-06
CN114444609B CN114444609B (en) | 2024-10-01 |
Family
ID=81371679
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210118785.7A Active CN114444609B (en) | 2022-02-08 | 2022-02-08 | Data processing method, device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114444609B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103077714A (en) * | 2013-01-29 | 2013-05-01 | 华为终端有限公司 | Information identification method and apparatus |
US20210366460A1 * | 2018-03-28 | 2021-11-25 | Telepathy Labs, Inc. | Text-to-speech synthesis system and method |
CN109582775A (en) * | 2018-12-04 | 2019-04-05 | 平安科技(深圳)有限公司 | Information input method, device, computer equipment and storage medium |
CN111986661A (en) * | 2020-08-28 | 2020-11-24 | 西安电子科技大学 | Deep neural network speech recognition method based on speech enhancement in complex environment |
CN113392341A (en) * | 2020-09-30 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Cover selection method, model training method, device, equipment and storage medium |
CN113421551A (en) * | 2020-11-16 | 2021-09-21 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, computer readable medium and electronic equipment |
CN112699213A (en) * | 2020-12-23 | 2021-04-23 | 平安普惠企业管理有限公司 | Speech intention recognition method and device, computer equipment and storage medium |
CN113590850A (en) * | 2021-01-29 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Multimedia data searching method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
JIA Quanye et al., "A Recurrent Neural Network-Based Entity Recognition Algorithm for Speech Text of Power Grid Customer Service", 供用电 (Distribution & Utilization), No. 06, 5 June 2020 (2020-06-05), pages 13-20 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117725248A (en) * | 2023-04-17 | 2024-03-19 | 书行科技(北京)有限公司 | Image-based text matching method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40070386; Country of ref document: HK |
GR01 | Patent grant | |