CN116863920B - Voice recognition method, device, equipment and medium based on double-flow self-supervision network - Google Patents
- Publication number
- CN116863920B (application CN202310874348.2A)
- Authority
- CN
- China
- Prior art keywords
- module
- voice
- representation
- model
- feature
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a voice recognition method, device, equipment and medium based on a double-flow self-supervision network, comprising the following steps: encoding and quantizing target acoustic features with an encoding and quantization module to obtain a voice vector; performing reconstruction prediction on the voice vector with a reconstruction prediction module to obtain a first voice representation, and simultaneously predicting the voice vector with an autoregressive model in a contrastive prediction module to obtain a second voice representation; fusing the first voice representation and the second voice representation with a feature fusion sub-module to obtain a fused voice representation; and, based on the target acoustic features, recognizing the fused voice representation with the first sub-model combined with the connectionist temporal classifier in the CTC module to obtain the transcribed text. The invention attends to the detailed context information of the voice as well as the difference information among different voice features, improves the robustness of self-supervised learning, and effectively combines the complementary advantages of generative and discriminative self-supervised learning.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a medium for speech recognition based on a dual-flow self-supervision network.
Background
Voice is the most direct and effective way of conveying information, and is the dominant means by which people exchange emotions and thoughts. Automatic speech recognition (Automatic Speech Recognition, ASR) technology refers to correctly recognizing a speech signal as the corresponding text content or command, allowing a machine to understand human language and perform the relevant operations. ASR integrates knowledge from multiple disciplines, covering basic subjects such as mathematics and statistics, acoustics and phonetics, and frontier subjects such as computer science and artificial intelligence; it is a key link in human-machine language communication and information exchange and has strong practical value. With the wide application of computers, ASR has become a key technology for simple and convenient human-computer intelligent interaction, and is widely applied in many real scenarios such as search query, automatic navigation, self-service, machine translation and automatic driving, spanning the industrial, cultural and commercial fields.
ASR has undergone two periods of development: the traditional approach and the deep learning approach. The traditional approach mainly integrates an acoustic model, a pronunciation model and a language model to find the word sequence most likely to have produced a given speech observation. With the rapid development of deep learning, the performance of speech tasks using deep learning has gradually surpassed traditional algorithms. End-to-end speech recognition (E2E ASR) models based on deep neural networks remove the need for alignment preprocessing of labeled speech data and can directly learn the mapping between input speech waveforms or features and output text. E2E ASR simplifies the model training process and improves recognition accuracy by virtue of its powerful modeling and learning capacity. Notably, unlike conventional ASR systems, the performance of an E2E model depends heavily on the amount of available target annotation corpus. However, the workload of speech data collection and manual labeling is enormous, and factors such as minority languages or dialects lead to low-resource application scenarios with insufficient labeled corpus. This presents a significant challenge for developing an effective E2E ASR system. Current end-to-end speech recognition schemes for limited annotation data mainly learn the underlying structure of speech from a large amount of unlabeled data through a pre-training strategy, and then perform supervised training on the limited labeled data. According to the supervision mode used, such schemes can be divided into the following categories:
(1) Unsupervised learning. The huge workload of data collection and labeling leads to application scenarios with insufficient labeled corpus, which significantly reduces the modeling capacity of the model. Unsupervised learning does not depend on labeled data; it discovers relationships among data samples through the structure or characteristics of the data, and can therefore alleviate, to a certain extent, the performance degradation caused by insufficient labeled data. However, since unsupervised learning uses unlabeled data to capture the distribution or structure of the data, supervision information is absent from the model prediction process, which increases prediction bias and limits its application in practical scenarios with limited labeled data.
(2) Semi-supervised learning. Semi-supervised learning combines supervised and unsupervised learning. Unlike unsupervised learning, semi-supervised learning partially labels the unlabeled data in order to alleviate the prediction bias caused by insufficient supervision information in unsupervised learning. That is, a model is first trained on the labeled data and then used to predict labels for the unlabeled data, thereby creating pseudo labels. The labeled data and the newly generated pseudo-labeled data are then combined as new training data to relieve the lack of supervision information in unsupervised learning. The performance of semi-supervised training, however, depends heavily on the accuracy of the model's pseudo-label predictions.
(3) Self-supervised learning. Self-supervised learning mainly uses auxiliary tasks to mine supervision information from large-scale unlabeled data and trains the model with the constructed supervision information. It can learn more semantic relationships and characterizations valuable to downstream tasks than unsupervised or semi-supervised learning. However, the speech signal has a complex underlying structure (including phonemes, syllables, words, prosodic features, sentence context information, etc.) containing relevant information at different time scales. Current self-supervised learning schemes cannot simultaneously attend to the difference information among different features and the context information of the data distribution itself, so their prediction accuracy and robustness are poor.
In summary, in order to promote the application of end-to-end speech recognition in practical scenarios with limited annotation data and improve the completeness with which self-supervised learning captures the underlying structure of speech, the above problems need to be studied in depth and a reasonable solution provided.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, device, equipment and medium based on a double-flow self-supervision network, which are used for overcoming the defects of the prior art.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
In a first aspect, the present invention provides a method for speech recognition based on a dual-flow self-supervision network, comprising:
Acquiring target acoustic characteristics and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
Encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a voice vector;
Carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation;
and based on the target acoustic characteristics, combining the first sub-model with a connection time sequence classifier in the CTC module to recognize the fused voice representation, and obtaining a transcribed text.
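For illustration only (and not as part of the claimed implementation), the following minimal PyTorch-style sketch shows how the recognition flow of this aspect could be wired together; the module stand-ins, layer types and tensor shapes are assumptions of the sketch rather than the concrete network described later in the embodiments.

```python
# Illustrative sketch (not the claimed implementation): wiring of the recognition
# flow of the first aspect, with assumed module stand-ins and tensor shapes.
import torch
import torch.nn as nn

class DualStreamASRSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=5000):
        super().__init__()
        self.encode_quantize = nn.Linear(feat_dim, hidden)      # stands in for the encoding and quantization module
        self.reconstruction_branch = nn.Linear(hidden, hidden)  # stands in for the reconstruction prediction module
        self.autoregressive_branch = nn.GRU(hidden, hidden, batch_first=True)  # autoregressive model of the contrastive branch
        self.fusion = nn.Linear(2 * hidden, hidden)             # stands in for the feature fusion sub-module
        self.ctc_head = nn.Linear(hidden, vocab + 1)            # stands in for the CTC module (+1 for blank)

    def forward(self, acoustic_features):                       # (batch, time, feat_dim)
        speech_vector = self.encode_quantize(acoustic_features)
        first_repr = self.reconstruction_branch(speech_vector)      # first voice representation
        second_repr, _ = self.autoregressive_branch(speech_vector)  # second voice representation
        fused = self.fusion(torch.cat([first_repr, second_repr], dim=-1))
        return self.ctc_head(fused).log_softmax(dim=-1)         # per-frame log-probs for CTC decoding

x = torch.randn(2, 120, 80)                                     # two utterances of 120 frames
logits = DualStreamASRSketch()(x)
print(logits.shape)                                             # torch.Size([2, 120, 5001])
```

In the embodiments below, each stand-in is replaced by the concrete module (Conformer encoder with quantization layer, RPM, CPM with GFF fusion, and the CTC module).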
Optionally, the feature fusion submodule comprises a gating circulation unit and an adaptive fusion layer;
accordingly, the fusing the first voice representation and the second voice representation by using the feature fusion sub-module to obtain a fused voice representation, including:
Respectively carrying out feature selection on the first voice representation and the second voice representation by using the gating circulating unit, and correspondingly obtaining a first selected feature and a second selected feature;
and carrying out self-adaptive fusion on the first selected feature and the second selected feature by using the self-adaptive fusion layer.
Optionally, the pre-trained speech recognition model is obtained by training in the following manner:
acquiring an acoustic characteristic sample and a pre-constructed voice recognition model;
Inputting the acoustic feature sample into the pre-constructed speech recognition model;
Calculating to obtain reconstruction loss based on the first voice representation output by the reconstruction prediction module and the acoustic feature sample;
obtaining a contrast loss based on the fused voice representation output by the feature fusion submodule and the acoustic feature sample calculation;
calculating to obtain diversity loss based on codebook information of the acoustic feature samples;
Performing iterative updating on initial network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module according to the reconstruction loss, the comparison loss and the diversity loss to obtain updated network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module;
Taking the updated network parameters as voice characterization extracted by a feature extractor of the CTC module, and training and decoding the CTC module based on the acoustic feature sample and the labeling data so as to obtain a trained voice recognition model;
Or carrying out iterative updating on the randomly initialized network parameters in the coding and quantizing module, the reconstruction prediction module, the comparison prediction module and the CTC module according to the reconstruction loss, the comparison loss and the diversity loss, so as to obtain a trained speech recognition model.
Optionally, the encoding and quantization module includes an encoder and a vector quantization layer, the encoder being obtained based on Conformer networks;
accordingly, the encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a speech vector includes:
encoding the target acoustic feature with the encoder to obtain a potential speech representation;
Discretizing the potential speech representation by the vector quantization layer to obtain the speech vector.
Optionally, the encoder includes multiple layers Conformer, each layer Conformer including:
The system comprises a first feed-forward layer, a first residual-and-normalization module, a multi-head self-attention layer, a second residual-and-normalization module, a convolution module, a third residual-and-normalization module, a second feed-forward layer, a fourth residual-and-normalization module and a Layernorm layer which are connected in sequence; residual connections are formed between the first and second residual-and-normalization modules, between the second and third residual-and-normalization modules, and between the third and fourth residual-and-normalization modules.
Optionally, the pre-trained speech recognition model further comprises a random masking module;
Accordingly, after the acquisition of the target acoustic features, the method further comprises:
Performing time random masking and frequency random masking processing on the target acoustic features by using the random masking module to obtain target masked acoustic features;
The encoding and quantizing module is used for encoding and quantizing the target acoustic feature to obtain a speech vector, and the encoding and quantizing module comprises:
And encoding and quantizing the target mask acoustic features by using the encoding and quantizing module to obtain a voice vector.
In a second aspect, the present invention also provides a voice recognition device based on a dual-flow self-supervision network, including:
the acoustic feature and model acquisition module is used for acquiring target acoustic features and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
The coding and quantizing module is used for coding and quantizing the target acoustic feature by utilizing the coding and quantizing module to obtain a voice vector;
the reconstruction and comparison module is used for carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
the fusion module is used for fusing the first voice representation and the second voice representation by utilizing the characteristic fusion sub-module to obtain a fused voice representation;
And the classification module is used for identifying the fused voice representation by combining the first sub-model with a connection time sequence classifier in the CTC module based on the target acoustic characteristics to obtain a transcribed text.
In a third aspect, the present invention also provides an electronic device, including a memory and a processor, where the processor and the memory are in communication with each other, the memory storing program instructions executable by the processor, the processor invoking the program instructions to perform the above-described voice recognition method based on a dual-flow self-supervising network.
In a fourth aspect, the present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements a method for speech recognition based on a dual-flow self-supervision network as described above.
The invention has the beneficial effects that: the invention provides a voice recognition method, device, equipment and medium based on a double-flow self-supervision network, in which a dual-channel structure is designed by connecting a reconstruction prediction module (Reconstruction Prediction Module, RPM) and a contrastive prediction module (Contrastive Prediction Module, CPM) in parallel after the encoding and quantization module. Reconstruction prediction serves as an auxiliary task of contrastive prediction, and the two branches each predict speech frames from the speech vector, so that detailed speech context information is attended to while the difference information among different speech features is captured by modeling the attribution relations among different speech representations. In addition, in order to effectively exploit the two-channel speech representations, the representations of the two branches are fused by the feature fusion submodule, which fuses them adaptively through a parameter learning strategy and controls the exposure of the various speech features with weights. Finally, the dual-flow self-supervised learning network provided by the invention can well initialize the weights of an ASR model. Compared with other self-supervised learning methods, the voice recognition method provided by the invention achieves competitive prediction accuracy. Moreover, in scenarios with limited annotated data, it is comparable to the most advanced self-supervised learning methods.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a prior art speech recognition method based on mask reconstruction;
FIG. 2 is a flow chart of a prior art contrast prediction-based speech recognition method;
FIG. 3 is a schematic flow chart of a voice recognition method based on a dual-flow self-supervision network according to an embodiment of the present invention;
FIG. 4 is a second schematic flow chart of a voice recognition method based on a dual-flow self-supervision network according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a feature fusion sub-module according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an encoder according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the prior art, self-supervised learning can learn valuable speech characterization for downstream tasks, and also can use a trained model for initializing an ASR task, and model parameters learned by self-supervised pre-training are found to be an effective method for initializing an ASR model. At present, the self-supervision learning method mainly comprises two main branches, namely generation type self-supervision learning and discriminant type self-supervision learning.
Generative self-supervised learning generates or reconstructs the input data based on some limited speech frames, including predicting future input from past input, predicting masked content from unmasked content, or predicting the original speech from corrupted speech. The following description is divided into autoregressive prediction of speech frames and masked prediction of speech frames.
The inspiration for autoregressive prediction comes mainly from language models (LM) for text, extended to the speech domain. Unlike conventional linear regression, an autoregressive model encodes the temporal information of the past acoustic sequence. The model then conditions on past speech frames while predicting future speech frames. It encodes only the information of previous time steps rather than the entire input, and may therefore lose global context information.
Mask reconstruction is largely inspired by masked language models and was then extended to the speech field. Some tokens in the input are randomly replaced with mask tokens, and the model then relies on the unmasked tokens to recover the hidden speech features from the corrupted or masked token features. The masking strategy is similar to BERT, and masking can generally be performed along both the time and frequency dimensions. This allows the model to encode information throughout the input to learn the underlying structure of the speech. However, the speech signal has complex underlying structures (including phonemes, syllables, words, prosodic features, sentence context information, etc.); a mask-reconstruction model encodes all the information in the speech signal, and thus also encodes information that is redundant for a particular ASR task.
For the reasons described above, learning to reconstruct the original speech signal may not be the best way to find the underlying structure of the speech. The contrast model learns the phonetic representation by maximizing the similarity between the given speech and the positive samples while minimizing the similarity between the given speech and the negative samples, thereby distinguishing the target samples (positive) and the interference samples (negative) of the given speech. The following is divided into contrast-based predictive coding (Contrastive Predictive Coding, CPC) and wav2vec2.0 based methods.
Methods based on contrastive prediction. Contrastive predictive coding adopts a unidirectional modeling form in feature space: a nonlinear encoder is first used to map the input speech sequence to a hidden space in which the speech representation has only a low temporal resolution. The potential representations of the speech are then encoded with an autoregressive model to obtain a speech context representation, and the potential features of future frames are predicted by a prediction network combined with the historical context representation of the speech. Finally, the closeness between the predicted result and the real features is judged by maximizing the mutual information between several future frames of the audio and their context characterization. This not only allows the model to learn the basic shared information characterization between different parts of the encoded (high-dimensional) speech signal, but also discards low-level information and more localized noise. On the basis of contrastive predictive coding, wav2vec2.0 learns speech characterizations using the InfoNCE loss combined with masking operations to maximize the similarity between the contextualized representation and the original speech representation. It focuses on learning the mapping between input and output, so the characteristics of the training data itself are insufficiently captured, resulting in a lack of contextual information.
Speech recognition based on mask reconstruction trains an acoustic model with a mask-reconstruction self-supervised pre-training strategy, and the resulting acoustic model is then used for speech representation extraction or fine-tuning to perform speech recognition. The main algorithm flow is shown in fig. 1, and the specific steps of speech recognition based on mask reconstruction are as follows:
First, each input speech feature is treated as an image of dimension T×F, where T is the number of frames and F is the number of frequency bins. Masking is performed along both dimensions using both time and frequency random masking strategies. For the time mask, T_n consecutive time steps starting from a randomly chosen T_I, i.e. (T_I, T_I+T_n), are masked in each sequence, with a total of 15% of the speech frames masked without overlap. Of the masked positions, 80% of the frames are replaced by zero vectors, 10% are replaced by frames from random positions, and the rest remain unchanged. Similar to the time mask, the frequency mask randomly sets the values of consecutive frequency-bin blocks to zero across all time steps of the input sequence: f consecutive mel frequency channels [F_I, F_I+f] are masked, where the width f is sampled uniformly from {0, 1, …, F} to select the masking frequency block, and F_I is chosen randomly from [0, F−f].
In addition, during mask reconstruction, part of the features are randomly masked, and autoregressive model networks such as RNN/LSTM/Transformer are encouraged, by using the time and frequency masking strategies alone or in combination, to fully learn the spatio-temporal information in the input features, i.e. the global context information and spatial information of the speech.
Finally, the speech knowledge learned by self-supervision is incorporated into the speech recognition network by means of representation extraction or fine-tuning. For representation extraction, the network parameters of the self-supervised autoregressive model are frozen and the model is used as a feature extractor for the ASR network; the extracted speech representation is fed as input features to the ASR network for supervised training to obtain text output. For fine-tuning, the self-supervised autoregressive model and a randomly initialized ASR network are jointly trained with supervision, and the network parameters are updated to obtain the final text output.
In the above mask-reconstruction-based speech recognition method, because the speech signal contains complex underlying structures (including phonemes, syllables, prosodic features, sentence context information, etc.), predicting the masked features from context information alone captures insufficient information about prosodic features and other factors that affect ASR performance.
To fully exploit context for predicting masked features, all information in the speech signal has to be encoded to learn the intrinsic characteristics of the speech data, which results in higher learning costs and more computational resources than discriminative self-supervised learning.
In addition, reconstruction prediction encodes all the information in the speech signal, including information that is redundant for a particular ASR task, making the prediction less robust.
In addition to the above mask-reconstruction-based speech recognition, there is a contrastive-prediction-based speech recognition method that combines wav2vec2.0 with a downstream speech recognition network to perform speech recognition by speech representation extraction or fine-tuning, where wav2vec2.0 comprises a feature encoder, a quantization module and a Transformer context representation network. The main algorithm flow is shown in fig. 2, and the specific steps of the contrastive-prediction-based speech recognition method are as follows:
First, a feature extractor consisting of a seven-layer convolutional network encodes the raw audio into a sequence of frame features, and each frame feature is converted into discrete features through a vector quantization module to serve as the self-supervision target.
Further, the vector quantization module is used to discretize the output of the feature encoder. It contains G groups of codebooks, each containing V variables. For each continuous vector output by the feature encoder, one variable is selected from each group of codebooks; the G selected variables are concatenated and a linear transformation is applied to obtain the final discrete feature.
A Transformer is then used to obtain the speech context characterization. Before being input into the Transformer, the output of the feature encoder undergoes a masking operation in which masked positions are replaced by a trainable embedded token; no masking is performed for vector quantization. A contrastive loss function is computed from the context characterization and the discrete features, so that the masked positions can be identified in the Transformer output among candidate discrete features containing interference terms sampled from other masked time steps. Finally, the text output of the speech recognition task is obtained by representation extraction or fine-tuning.
However, the contrastive-prediction-based speech recognition method has the following drawback: discriminative learning focuses on learning the mapping between input and output by comparing similarity metrics between the target sample (positive) and interference samples (negative), without fully considering the inherent structure of the data, so its ability to handle missing masked data is weak.
To sum up, in the prior art, the following problems exist in the self-supervised learning voice recognition algorithm:
1. Information capture of the underlying speech structure is incomplete. Generative reconstruction prediction focuses on the distribution of the data by reconstructing masked data from speech context information, but owing to the complex nature of speech signals it still fails to capture information such as prosodic features that affect ASR performance. The discriminative model focuses on the difference information of the data by comparing the degree of similarity between target samples (positive) and interference samples (negative) and searches for a classification surface; it therefore focuses on learning the mapping between input and output, the autocorrelation characteristics of the speech signal are not fully considered, and the ability to handle missing masked data is weak. Existing self-supervised learning schemes thus capture the underlying structure information of speech incompletely.
2. The advantages of generative and discriminative self-supervision are not effectively combined. Different types of self-supervised models show different advantages on different downstream tasks. Existing technical schemes lack an effective way to fuse the two kinds of self-supervised learning. Therefore, in order to better exploit the potential of both models, an efficient fusion strategy is necessary.
3. Reconstruction prediction encodes information that is redundant for a specific ASR task. In the prior art, reconstruction prediction uses context information to predict masked features, which requires encoding all information in the speech signal to learn the intrinsic characteristics of the speech data; this encodes information that is redundant for the ASR task and makes the prediction less robust.
The invention aims to solve the problems of incomplete capture of the underlying speech structure information and poor robustness of model prediction results in self-supervised learning, improves upon the defects and shortcomings of existing self-supervised learning schemes, and promotes wider application in practice.
The following describes a voice recognition method based on a double-flow self-supervision network according to the present invention with reference to the accompanying drawings.
Term interpretation:
Self-supervised learning (Self-Supervised Learning, SSL): a learning paradigm that mines the intrinsic characteristics of unlabeled data as supervision information by designing auxiliary tasks, thereby improving the feature extraction capability of the model.
End-to-end speech recognition (End-to-End Automatic Speech Recognition, E2E ASR): a speech recognition system based on an end-to-end network model, which directly maps the input speech waveform sequence to the output text through a neural network model. Each module in the system does not need to be trained separately as in traditional speech recognition algorithms, which simplifies the speech recognition pipeline, solves the problem of automatically aligning the input and output sequences, and removes the need for forced-alignment preprocessing of the input sequence.
Attention mechanism (Attention Mechanism): a method designed, in deep learning, to imitate the way the human visual system can naturally and effectively find salient regions in complex scenes. Attention mechanisms in deep learning include spatial attention, channel attention, self-attention, and the like.
Example 1
FIG. 3 is a schematic flow chart of a voice recognition method based on a dual-flow self-supervision network according to an embodiment of the present invention; FIG. 4 is a second schematic flow chart of a voice recognition method based on a dual-flow self-supervision network according to an embodiment of the present invention; as shown in fig. 3 and 4, the voice recognition method based on a dual-flow self-supervision network includes the following steps:
S301, acquiring target acoustic characteristics and a pre-trained voice recognition model.
The pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module. The target acoustic features are the voice data to be recognized.
S302, the coding and quantizing module is utilized to code and quantize the target acoustic feature, and a voice vector is obtained.
In this step, the target acoustic features are encoded and quantized using the encoding and quantization module, so that more meaningful phonetic unit information is learned to enrich the speech representation. The resulting speech vector is then input into the reconstruction prediction module and the contrastive prediction module of the dual-channel structure.
S303, carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; and simultaneously, predicting the voice vector by utilizing an autoregressive model in the comparison and prediction module to obtain a second voice representation.
In this step, reconstruction prediction is jointly trained as an auxiliary task to contrastive prediction.
S304, fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation.
In this step, the feature fusion submodule adaptively fuses the two speech representations through a parameter learning strategy, exploring the attribution relations between different speech representations while focusing on context information, so as to capture the difference information of different features.
S305, based on the target acoustic characteristics, the fused voice representation is identified by combining the first sub-model and a connection time sequence classifier in the CTC module, and a transcribed text is obtained.
In this step, the final text output is produced by the connectionist temporal classification (Connectionist Temporal Classification, CTC for short) classifier.
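As an illustrative aid, the following sketch shows one common way to turn the per-frame CTC outputs into a token sequence by greedy (best-path) decoding; the blank index and vocabulary size are assumptions, and the invention is not limited to this decoding strategy.

```python
# Illustrative sketch: greedy (best-path) CTC decoding of per-frame log-probabilities.
# The blank index and vocabulary mapping are assumptions of this example.
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """log_probs: (batch, time, vocab) frame-level log-probabilities."""
    best = log_probs.argmax(dim=-1)            # most likely symbol per frame
    results = []
    for seq in best:
        tokens, prev = [], blank
        for t in seq.tolist():
            if t != prev and t != blank:       # collapse repeats, drop blanks
                tokens.append(t)
            prev = t
        results.append(tokens)
    return results

# Example: random frame posteriors for a 2-utterance batch.
decoded = ctc_greedy_decode(torch.randn(2, 50, 30).log_softmax(dim=-1))
```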
It should be noted that the voice recognition method provided by the invention can be applied to many real scenarios such as human-computer interaction, machine translation, automatic driving and smart home, spanning the industrial, cultural and commercial fields. For example, the smart speakers released by Internet companies such as Google, Amazon, Baidu, Alibaba and iFLYTEK are products of the successful deployment of ASR technology.
According to the voice recognition method based on the double-flow self-supervision network, a double-channel structure is designed by combining a reconstruction prediction module (Reconstruction Prediction Module, RPM) and a comparison prediction module (Contrastive Prediction Module, CPM) in parallel after a coding and quantization module. The reconstruction prediction is used as an auxiliary task of comparison prediction to respectively predict voice frames of voice vectors, so that detailed voice context information is focused while different characteristic difference information is captured by modeling attribution relations among different voice representations. In addition, in order to effectively utilize the two-channel speech representation, the speech representations of the two branches are fused through a feature fusion submodule, and the feature fusion submodule adaptively fuses the speech representations of the two branches through a parameter learning strategy and controls the exposure of various speech features by using weights. Finally, the dual-flow self-supervision learning network provided by the invention can well initialize the weight of the ASR model. Compared with other self-supervision learning methods, the voice recognition method provided by the invention can achieve competitive prediction accuracy. In addition, in the limited marked data scenario, it is comparable to the most advanced self-supervised learning method.
Optionally, the pre-trained speech recognition model further comprises a random masking module;
Accordingly, after the acquisition of the target acoustic features, the method further comprises:
And performing time random masking and frequency random masking processing on the target acoustic features by using the random masking module to obtain target masked acoustic features. That is, for the target acoustic feature x, a masked acoustic feature x̃ is obtained using both time and frequency random masking strategies.
Specifically, for the time mask, a start index T_I is randomly selected and speech with a maximum width of T_n is masked, i.e. (T_I, T_I+T_n) is masked in each sequence, accounting for 15% of the entire sequence. Of the masked frames, 80% are replaced by zero vectors and 10% are replaced by other speech frames randomly sampled from the same utterance. Similarly, the frequency mask randomly sets the values of the consecutive frequencies (F_I, F_I+f) to zero over all time steps, where the width f is sampled uniformly from {0, 1, …, F} to select the masking frequency.
The encoding and quantizing module is used for encoding and quantizing the target acoustic feature to obtain a speech vector, and the encoding and quantizing module comprises:
And encoding and quantizing the target mask acoustic features by using the encoding and quantizing module to obtain a voice vector.
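For illustration, a minimal sketch of the time and frequency random masking described above is given below, assuming the 15%/80%/10% proportions stated; the concrete mask widths and bookkeeping are simplifications and not the exact procedure of the embodiment.

```python
# Illustrative sketch of the time and frequency random masking described above
# (about 15% of frames masked; 80% of masked frames zeroed, 10% swapped with random
# frames of the same utterance, the rest left unchanged). Widths are assumptions.
import torch

def mask_features(x: torch.Tensor, time_width: int = 10, freq_max: int = 27) -> torch.Tensor:
    """x: (time, freq) acoustic features; returns the masked copy x_tilde."""
    x = x.clone()
    T, F = x.shape
    # --- time masking: keep masking spans until ~15% of frames are covered ---
    n_target = int(0.15 * T)
    masked = 0
    while masked < n_target:
        t0 = torch.randint(0, max(1, T - time_width), (1,)).item()
        span = slice(t0, t0 + time_width)
        r = torch.rand(1).item()
        if r < 0.8:
            x[span] = 0.0                                   # zero vectors
        elif r < 0.9:
            src = torch.randint(0, max(1, T - time_width), (1,)).item()
            x[span] = x[src:src + time_width]               # frames from a random position
        # else: leave the span unchanged
        masked += time_width
    # --- frequency masking: zero a random block of consecutive channels ---
    f = torch.randint(0, freq_max + 1, (1,)).item()
    if f > 0:
        f0 = torch.randint(0, max(1, F - f), (1,)).item()
        x[:, f0:f0 + f] = 0.0
    return x

x_tilde = mask_features(torch.randn(200, 80))
```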
Optionally, the encoder includes multiple layers Conformer, each layer Conformer including:
The system comprises a first feed-forward layer, a first residual-and-normalization module, a multi-head self-attention layer, a second residual-and-normalization module, a convolution module, a third residual-and-normalization module, a second feed-forward layer, a fourth residual-and-normalization module and a Layernorm layer which are connected in sequence; residual connections are formed between the first and second residual-and-normalization modules, between the second and third residual-and-normalization modules, and between the third and fourth residual-and-normalization modules.
Specifically, the present invention uses a Conformer-based encoder structure consisting of N layers, each consisting of a multi-head self-attention layer (Multi-Head Self-Attention, MHSA), a convolution module (Convolution module, Conv), feed-forward layers (Feed forward module, FFN) and residual-and-normalization layers (Add & Norm), as shown in fig. 6. The overall Conformer structure replaces the original feed-forward layer with two half-step feed-forward layers, one preceding the multi-head attention layer and the second following the convolution module. The second feed-forward module is followed by a Layernorm layer. Thus, given an input H, the output H_X obtained via a Conformer layer is defined as follows:
H' = H + ½FFN(H)    (1)
H'' = H' + MHSA(H')    (2)
H''' = H'' + Conv(H'')    (3)
H_X = Layernorm(H''' + ½FFN(H'''))    (4)
Multi-head self-attention is effectively a multi-channel parallel self-attention mechanism. For the self-attention mechanism, the masked spectrogram feature representation H input to the attention layer is first mapped linearly to obtain the query, key and value (Q, K, V):
Q = HW_Q,  K = HW_K,  V = HW_V    (5)
where W_Q, W_K and W_V are learnable parameter matrices. Dot-product attention is then computed through the softmax function:
Attention(Q, K, V) = softmax(QK^T / √d_k)V    (6)
The multi-head self-attention mechanism then equally divides the attention input into h different attention channels for parallel computation and concatenates all channel attention results:
Multihead(Q, K, V) = concat(head_1, …, head_h)W_o    (7)
where W_Q, W_K, W_V and W_o are learnable parameter matrices and √d_k is a scaling factor. In general, h = 8 parallel attention spaces or heads are used. In practice, d_k = d_model/h is set so that the computational complexity of multi-head attention is the same as that of single self-attention, where d_model denotes the dimension of the input vector. The convolution module consists of a Pointwise convolution, a Depthwise convolution, a GLU activation layer and a Swish activation layer. The feed-forward layer consists of two linear transformations with a ReLU activation in between:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2    (8)
where x represents the input of the feed-forward layer, W_1 and W_2 are learnable parameter matrices, and b_1 and b_2 are bias terms.
Although the linear transformations are identical across positions, different layers use different parameters. In addition, a residual connection followed by layer normalization is used around each sub-layer to achieve more stable and faster convergence.
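As a purely illustrative sketch, one Conformer layer following equations (1) to (4) could be written as below; the model dimension, number of heads, convolution kernel size and the use of BatchNorm inside the convolution module are assumptions of this sketch.

```python
# Illustrative sketch of one Conformer layer following equations (1)-(4):
# half-step FFN -> MHSA -> convolution module -> half-step FFN -> Layernorm,
# each wrapped in a residual connection. Dimensions and kernel size are assumptions.
import torch
import torch.nn as nn

class Transpose(nn.Module):                     # (B, T, C) <-> (B, C, T) helper for Conv1d
    def forward(self, x):
        return x.transpose(1, 2)

class ConformerLayerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, kernel_size=31, d_ff=1024):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                  nn.SiLU(), nn.Linear(d_ff, d_model))          # Swish = SiLU
        self.mhsa_norm = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Sequential(                                              # pointwise-GLU-depthwise-Swish-pointwise
            nn.LayerNorm(d_model), Transpose(), nn.Conv1d(d_model, 2 * d_model, 1), nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2, groups=d_model),
            nn.BatchNorm1d(d_model), nn.SiLU(), nn.Conv1d(d_model, d_model, 1), Transpose())
        self.ffn2 = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                  nn.SiLU(), nn.Linear(d_ff, d_model))
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, h):                       # h: (batch, time, d_model)
        h = h + 0.5 * self.ffn1(h)              # eq. (1)
        a = self.mhsa_norm(h)
        h = h + self.mhsa(a, a, a, need_weights=False)[0]    # eq. (2)
        h = h + self.conv(h)                    # eq. (3)
        return self.out_norm(h + 0.5 * self.ffn2(h))         # eq. (4)

h_out = ConformerLayerSketch()(torch.randn(2, 100, 256))
```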
To focus more on language/phonetic unit information, the present invention attaches a quantization layer after the Conformer. The Conformer output potential speech representation H_X is first mapped to logits l ∈ R^(G×V) by a linear layer, where G is the number of codebooks and V is the size of each codebook. The speech discretization representation v_t is then obtained by selecting one variable from each fixed-size codebook C = {c_1, …, c_V}, concatenating the resulting vectors and applying a linear transformation. The probability of selecting the v-th code in the g-th codebook is defined as follows:
p_(g,v) = exp((l_(g,v) + n_v)/τ) / Σ_(k=1..V) exp((l_(g,k) + n_k)/τ)    (9)
where τ is a temperature and n_v = −log(−log(u_v)) with u_v sampled from a uniform distribution, as in Gumbel-softmax quantization.
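For illustration, the following sketch shows a product-quantization layer with G codebooks of V entries in the spirit described above; the use of straight-through Gumbel-softmax selection and the temperature value are assumptions of this sketch, not necessarily the exact quantization of the embodiment.

```python
# Illustrative sketch of a product-quantization layer with G codebooks of V entries,
# using straight-through Gumbel-softmax selection; the temperature and codebook sizes
# are assumptions of this sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizerSketch(nn.Module):
    def __init__(self, d_in=256, groups=2, entries=320, d_code=128, d_out=256):
        super().__init__()
        self.G, self.V = groups, entries
        self.to_logits = nn.Linear(d_in, groups * entries)                # logits l in R^{G x V}
        self.codebook = nn.Parameter(torch.randn(groups, entries, d_code))
        self.project = nn.Linear(groups * d_code, d_out)

    def forward(self, h, tau: float = 2.0):
        B, T, _ = h.shape
        logits = self.to_logits(h).view(B, T, self.G, self.V)
        probs = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)      # one-hot selection per codebook
        codes = torch.einsum('btgv,gvd->btgd', probs, self.codebook)      # pick one entry per group
        v = self.project(codes.reshape(B, T, -1))                         # concatenate and apply linear map
        return v, logits

v, logits = QuantizerSketch()(torch.randn(2, 100, 256))
```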
optionally, the pre-trained speech recognition model is obtained by training in the following manner:
acquiring an acoustic characteristic sample and a pre-constructed voice recognition model;
the acoustic feature samples are input to the pre-constructed speech recognition model.
And calculating and obtaining reconstruction loss based on the first voice representation output by the reconstruction prediction module and the acoustic characteristic sample.
And obtaining contrast loss based on the fused voice representation output by the feature fusion submodule and the acoustic feature sample.
And calculating the diversity loss based on the codebook information of the acoustic feature samples.
And carrying out iterative updating on initial network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module according to the reconstruction loss, the comparison loss and the diversity loss to obtain updated network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module.
And taking the updated network parameters as voice characterization extracted by a feature extractor of the CTC module, and training and decoding the CTC module based on the acoustic feature sample and the labeling data so as to obtain a trained voice recognition model.
Or carrying out iterative updating on the randomly initialized network parameters in the coding and quantizing module, the reconstruction prediction module, the comparison prediction module and the CTC module according to the reconstruction loss, the comparison loss and the diversity loss, so as to obtain a trained speech recognition model.
In this embodiment, the present invention constructs a dual-stream structure based on the reconstruction prediction module and the contrastive prediction module after the encoding and quantization module. The reconstruction prediction module mainly consists of a prediction network P_net, whose aim is to reconstruct the acoustic features x_t from the masked features x̃. The prediction network in the present invention consists of a position-wise feed-forward network (Position-wise Feed-Forward Network, FFN). An L1 reconstruction loss is then calculated between the input x and the network output x̂ of P_net to update the network parameters of the encoder and the prediction network:
L_Reconstruction = Σ_t |x_t − x̂_t|    (10)
where x_t represents the original acoustic feature input, x̃_t represents the speech feature after masking (the target masked acoustic feature), and x̂_t is the reconstruction output of P_net. The parameters of the Conformer encoder in the encoding and quantization module are retained for the ASR task, while the prediction network P_net is discarded. The reconstruction prediction module effectively improves the accuracy of speech recognition prediction by reconstructing the masked speech frames from the context of previous and future content.
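As an illustrative aid, a minimal sketch of the reconstruction branch, i.e. a position-wise feed-forward prediction network P_net and the L1 loss of equation (10), is given below; restricting the loss to masked frames and the layer sizes are assumptions of the sketch.

```python
# Illustrative sketch of the reconstruction branch: a position-wise feed-forward
# prediction network P_net and an L1 loss between the original features x and the
# reconstruction x_hat; masking-aware averaging is an assumption of this sketch.
import torch
import torch.nn as nn

class PredictionNetSketch(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, feat_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, feat_dim))

    def forward(self, first_repr):                  # first speech representation, (B, T, d_model)
        return self.net(first_repr)                 # reconstructed acoustic features x_hat

def reconstruction_loss(x, x_hat, mask=None):
    """L1 loss of eq. (10); `mask` (B, T) optionally limits the loss to masked frames."""
    loss = (x - x_hat).abs().sum(dim=-1)            # per-frame L1 distance
    if mask is not None:
        loss = loss * mask
        return loss.sum() / mask.sum().clamp(min=1)
    return loss.mean()

x = torch.randn(2, 100, 80)
x_hat = PredictionNetSketch()(torch.randn(2, 100, 256))
l_rec = reconstruction_loss(x, x_hat)
```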
The contrastive prediction module uses an autoregressive model to summarize the discrete representations into a new context vector c_t. The present invention does not directly use the context vector c_t to calculate the contrastive prediction, but uses the GFF module to fuse the output of the RPM with the autoregressive network output of the CPM to obtain the speech representation c_GFF, thereby improving the accuracy of the predicted speech representation. The contrastive loss is then calculated using the fused speech representation c_GFF, which helps to learn more comprehensive speech structure information. The model uses the contrastive loss to identify the true quantized speech representation v_t among a set of K+1 candidate representations, which includes v_t and K interference terms. The interference terms are uniformly sampled from other masked time steps of the same utterance. The contrastive loss is defined as:
L_Contrastive = −log( exp(sim(c_GFF, v_t)/κ) / Σ_(ṽ∈V_t) exp(sim(c_GFF, ṽ)/κ) )    (11)
In L_Contrastive, sim(·,·) denotes the cosine similarity between two vectors and κ is a temperature hyper-parameter. In addition, a diversity loss is used to increase the use of the quantized codebook representations: it balances the probability of using all entries in each codebook by maximizing the entropy of the average softmax distribution p̄_g over the codebook entries of each codebook in a batch of audio:
L_Diversity = (1/(G·V)) Σ_(g=1..G) Σ_(v=1..V) p̄_(g,v) log p̄_(g,v)    (12)
where p̄_(g,v) represents the probability of selecting the v-th code in the g-th codebook.
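For illustration, the contrastive loss of equation (11) and the diversity loss of equation (12) could be computed as sketched below; the number of distractors, the way they are sampled and the temperature value are assumptions of this sketch.

```python
# Illustrative sketch of the contrastive loss of eq. (11) and the diversity loss of
# eq. (12). Distractor sampling and the temperature are assumptions of this sketch.
import torch
import torch.nn.functional as F

def contrastive_loss(c_gff, v, num_distractors=10, kappa=0.1):
    """c_gff, v: (B, T, D) fused representations and quantized targets."""
    B, T, D = v.shape
    idx = torch.randint(0, T, (B, T, num_distractors), device=v.device)   # distractor time steps
    distractors = torch.gather(v.unsqueeze(1).expand(B, T, T, D), 2,
                               idx.unsqueeze(-1).expand(-1, -1, -1, D))   # (B, T, K, D)
    candidates = torch.cat([v.unsqueeze(2), distractors], dim=2)          # true target at index 0
    sims = F.cosine_similarity(c_gff.unsqueeze(2).expand_as(candidates), candidates, dim=-1) / kappa
    targets = torch.zeros(B * T, dtype=torch.long, device=v.device)       # index 0 is the true target
    return F.cross_entropy(sims.view(B * T, -1), targets)                 # = -log softmax at the target

def diversity_loss(logits):
    """logits: (B, T, G, V); negative entropy of the batch-averaged softmax distribution."""
    avg = logits.softmax(dim=-1).mean(dim=(0, 1))                         # p_bar_g over the batch
    G, V = avg.shape
    return (avg * torch.log(avg + 1e-7)).sum() / (G * V)

c_gff, v = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
l_c = contrastive_loss(c_gff, v)
l_d = diversity_loss(torch.randn(2, 50, 2, 320))                          # logits from the quantizer
```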
The final training target L_Total of the invention consists of three parts, the reconstruction loss L_Reconstruction, the contrast loss L_Contrastive and the diversity loss L_Diversity, so that the two self-supervision tasks are solved simultaneously. The training loss to be minimized is:

L_Total = L_Contrastive + α·L_Diversity + β·L_Reconstruction (13)
where α and β are learnable hyper-parameters. L_Contrastive is calculated from the speech representation and the acoustic features, where the noise samples of the acoustic features are uniformly sampled from other masked time steps of the same utterance. For L_Diversity, α is set to 0.1 to balance its weight. L_Reconstruction is calculated from the acoustic feature X and the reconstructed output X̂.
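Carrying the names over from the sketches above, the objective of equation (13) reduces to a weighted sum; the value of β here is an assumption, since the patent does not fix it in this passage.

```python
alpha, beta = 0.1, 1.0   # alpha as stated above; beta assumed for illustration
loss_total = (contrastive_loss(c_gff, candidates)
              + alpha * diversity_loss(avg_probs)
              + beta * l1_reconstruction_loss(x, x_hat, mask))
loss_total.backward()
```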
After the final training objective L_Total is determined, the parameters of each module can be trained in two different ways: the speech knowledge learned by the dual-stream self-supervised network can be incorporated into the ASR task for training and decoding either by representation extraction or by fine-tuning, so as to achieve end-to-end speech recognition with limited annotation data.
Representation extraction means that, when training the downstream ASR task, the parameters of DSSLNet are frozen and the network is used as a feature extractor that extracts speech representations for training the CTC module; these representations are essentially the hidden states of the last layer of the DSSLNet encoder. The extracted representations replace features such as FBANK/MFCC as the input fed to the CTC module for training and decoding, producing the text output.
Fine-tuning uses the downstream CTC module to fine-tune DSSLNet. The output of DSSLNet is connected to the CTC module and the parameters of DSSLNet are not frozen; DSSLNet is then updated and trained together with the randomly initialized CTC module, and training and decoding produce the text output.
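The two ways of attaching the pre-trained network to the downstream CTC task can be sketched as follows; dsslnet and ctc_head stand for the pre-trained DSSLNet and a randomly initialized CTC classification head, and are placeholders rather than the patent's actual classes.

```python
import torch
import torch.nn as nn

def build_optimizer(dsslnet: nn.Module, ctc_head: nn.Module,
                    finetune: bool, lr: float = 1e-4) -> torch.optim.Optimizer:
    """finetune=False: representation extraction (DSSLNet frozen, only the CTC
    module is trained). finetune=True: DSSLNet and CTC module updated jointly."""
    if not finetune:
        for p in dsslnet.parameters():
            p.requires_grad = False
        params = list(ctc_head.parameters())
    else:
        params = list(dsslnet.parameters()) + list(ctc_head.parameters())
    return torch.optim.Adam(params, lr=lr)

def ctc_training_step(dsslnet, ctc_head, feats, feat_lens, targets, target_lens):
    hidden = dsslnet(feats)                         # last-layer hidden states (B, T, D)
    log_probs = ctc_head(hidden).log_softmax(-1)    # (B, T, vocab)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    return ctc(log_probs.transpose(0, 1),           # CTCLoss expects (T, B, vocab)
               targets, feat_lens, target_lens)
```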
Optionally, the encoding and quantization module includes an encoder and a vector quantization layer, the encoder being obtained based on Conformer networks;
accordingly, the encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a speech vector includes:
encoding the target acoustic feature with the encoder to obtain a potential speech representation;
Discretizing the potential speech representation by the vector quantization layer to obtain the speech vector.
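A sketch of this encoding and quantization step is shown below. The Conformer encoder itself is represented by an arbitrary encoder module, and the Gumbel-softmax product quantizer (G codebooks of V entries, matching the codebook notation used for the diversity loss) is an assumption about the form of the vector quantization layer, not the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelVectorQuantizer(nn.Module):
    """Discretizes the latent speech representation with G codebooks of V
    entries each; dimensions and temperature are illustrative assumptions."""
    def __init__(self, d_model: int = 256, groups: int = 2, entries: int = 320):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.to_logits = nn.Linear(d_model, groups * entries)
        self.codebook = nn.Parameter(torch.randn(groups, entries, d_model // groups))

    def forward(self, latent: torch.Tensor, tau: float = 2.0):
        B, T, _ = latent.shape
        logits = self.to_logits(latent).view(B, T, self.groups, self.entries)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)  # one entry per codebook
        quantized = torch.einsum("btgv,gvd->btgd", onehot, self.codebook)
        avg_probs = logits.softmax(dim=-1).mean(dim=(0, 1))            # p̄_{g,v} for the diversity loss
        return quantized.reshape(B, T, -1), avg_probs

def encode_and_quantize(encoder: nn.Module, quantizer: GumbelVectorQuantizer,
                        feats: torch.Tensor):
    latent = encoder(feats)      # potential (latent) speech representation
    return quantizer(latent)     # discretized speech vectors plus code usage
```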
According to the voice recognition method based on the double-flow self-supervision network provided by the embodiment of the invention, a GRU feature fusion module consisting of a gated recurrent unit (GRU) and an adaptive fusion layer is provided; adaptive feature fusion is achieved by controlling the exposure of different features, so as to reduce the redundant information that reconstruction prediction introduces for a specific ASR task.
Optionally, the feature fusion submodule includes a gated recurrent unit and an adaptive fusion layer.
Accordingly, the fusing the first voice representation and the second voice representation by using the feature fusion sub-module to obtain a fused voice representation, including:
Feature selection is performed on the first voice representation and the second voice representation respectively by using the gated recurrent unit, correspondingly obtaining a first selected feature and a second selected feature.
And carrying out self-adaptive fusion on the first selected feature and the second selected feature by using the self-adaptive fusion layer.
Specifically, the feature fusion submodule designed by the invention (GRU Feature Fusion, abbreviated GFF) avoids a large amount of redundant information in the fused features. The module comprises a gated recurrent unit (GRU) and an adaptive fusion layer, as shown in Fig. 5. The workflow of the GFF module is split into two steps.
First, the first phonetic representation and the second phonetic representation are fed in, where the GRU consists of a reset gate r_t and an update gate z_t. Through the gating mechanism of the GRU, the most useful information is selected from a large number of feature maps and then selectively aggregated according to the gating results, so the output of this step passes information selectively. Second, the outputs of the GRU are fused by the adaptive fusion layer.
Specifically, when processing the first phonetic representation, two gating signals are obtained from the output O_Recon of the current RPM (i.e., the first phonetic representation) and the hidden state h_{t-1} passed from the previous time step:
r_t = σ(W_r · [h_{t-1}, O_Recon]) (14)
z_t = σ(W_z · [h_{t-1}, O_Recon]) (15)
where σ is the sigmoid function and W_r and W_z are the weights of the reset gate and the update gate, respectively.
After the gating signals are obtained, O_Recon is concatenated with the reset hidden state, where the reset gate determines how much past information needs to be remembered. The output of the current hidden node is then obtained through the tanh activation function.
Finally, in the 'update memory' stage, the candidate state and the updated hidden state are expressed as:

h̃_t = tanh(W · [r_t ⊙ h_{t-1}, O_Recon]) (16)

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t (17)

where W denotes the learnable parameters of the GRU, tanh denotes the activation function, and h_{t-1} denotes the hidden state from the previous time step; the update gate z_t controls how much of the previous information is forgotten and how much of the current input information is selected.
Similarly, the same calculation is performed for the second phonetic representation, so that h p is finally obtained, which is not described in detail herein.
After h_q and h_p are obtained, adaptive fusion is performed by the adaptive fusion layer as follows:

O_GFF = η·h_p + μ·h_q (18)

where η and μ denote learnable fusion weights, and h_p and h_q denote the results of processing the RPM output O_Recon and the CPM output O_Con with the GRU, respectively.
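Under the equations above, a compact sketch of the GFF submodule might look as follows; whether the two branches share a single GRU and how η and μ are initialized are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class GRUFeatureFusion(nn.Module):
    """GRU feature fusion (GFF): a GRU filters each branch's representation,
    then a learnable weighted sum fuses the two, as in equation (18)."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.gru = nn.GRU(d_model, d_model, batch_first=True)  # shared GRU (assumption)
        self.eta = nn.Parameter(torch.tensor(0.5))              # weight for the RPM branch
        self.mu = nn.Parameter(torch.tensor(0.5))               # weight for the CPM branch

    def forward(self, o_recon: torch.Tensor, o_con: torch.Tensor) -> torch.Tensor:
        # o_recon: RPM output, o_con: CPM output, both (B, T, d_model)
        h_p, _ = self.gru(o_recon)   # gated selection over the reconstruction branch
        h_q, _ = self.gru(o_con)     # gated selection over the contrastive branch
        return self.eta * h_p + self.mu * h_q   # O_GFF = η·h_p + μ·h_q
```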
Example 2
On the basis of Embodiment 1, this Embodiment 2 provides a voice recognition device based on a dual-flow self-supervision network; the device corresponds to the voice recognition method based on the dual-flow self-supervision network and includes:
the acoustic feature and model acquisition module is used for acquiring target acoustic features and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
The coding and quantizing module is used for coding and quantizing the target acoustic feature by utilizing the coding and quantizing module to obtain a voice vector;
the reconstruction and comparison module is used for carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
the fusion module is used for fusing the first voice representation and the second voice representation by utilizing the characteristic fusion sub-module to obtain a fused voice representation;
And the classification module is used for identifying the fused voice representation by combining the first sub-model with a connection time sequence classifier in the CTC module based on the target acoustic characteristics to obtain a transcribed text.
Specific details refer to the description of the voice recognition method based on the dual-flow self-supervision network, and are not repeated here.
Example 3
Embodiment 3 of the invention provides an electronic device comprising a memory and a processor that communicate with each other, the memory storing program instructions executable by the processor; the processor calls the program instructions to execute a voice recognition method based on a double-flow self-supervision network, the method comprising the following steps:
Acquiring target acoustic characteristics and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
Encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a voice vector;
Carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation;
and based on the target acoustic characteristics, combining the first sub-model with a connection time sequence classifier in the CTC module to recognize the fused voice representation, and obtaining a transcribed text.
Example 4
Embodiment 4 of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method for voice recognition based on a dual-flow self-supervision network, the method comprising the following flow steps:
Acquiring target acoustic characteristics and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
Encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a voice vector;
Carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation;
and based on the target acoustic characteristics, combining the first sub-model with a connection time sequence classifier in the CTC module to recognize the fused voice representation, and obtaining a transcribed text.
In summary, in the voice recognition method based on the dual-stream self-supervision network provided by the embodiment of the invention, a dual-channel structure is designed after the encoding and quantization module by combining the reconstruction prediction module (Reconstruction Prediction Module, RPM) and the contrastive prediction module (Contrastive Prediction Module, CPM) in parallel. Reconstruction prediction serves as an auxiliary task to contrastive prediction, and the two branches predict speech frames from the speech vectors separately, so that detailed context information is attended to while complementary difference information between features is captured by modeling the relations among different speech representations. In addition, to use the two-channel speech representations effectively, the speech representations of the two branches are fused by the feature fusion submodule, which adaptively fuses them through a parameter learning strategy and controls the exposure of the various speech features with weights. Finally, the dual-stream self-supervised learning network proposed by the invention provides a good weight initialization for the ASR model. Compared with other self-supervised learning methods, the voice recognition method provided by the invention achieves competitive prediction accuracy, and in the limited labeled data scenario it is comparable to state-of-the-art self-supervised learning methods.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the device embodiment is described relatively briefly because it is substantially similar to the method embodiment, and reference may be made to the description of the method embodiment for the relevant parts. The method and device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (9)
1. A voice recognition method based on a dual-flow self-supervision network, comprising:
Acquiring target acoustic characteristics and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
Encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a voice vector;
Carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation;
and based on the target acoustic characteristics, combining the first sub-model with a connection time sequence classifier in the CTC module to recognize the fused voice representation, and obtaining a transcribed text.
2. The voice recognition method based on the double-flow self-supervision network according to claim 1, wherein the feature fusion submodule comprises a gated recurrent unit and an adaptive fusion layer;
accordingly, the fusing the first voice representation and the second voice representation by using the feature fusion sub-module to obtain a fused voice representation, including:
Respectively performing feature selection on the first voice representation and the second voice representation by using the gated recurrent unit, correspondingly obtaining a first selected feature and a second selected feature;
and carrying out self-adaptive fusion on the first selected feature and the second selected feature by using the self-adaptive fusion layer.
3. The voice recognition method based on the dual-stream self-supervision network according to claim 1, wherein the pre-trained voice recognition model is trained by:
acquiring an acoustic characteristic sample and a pre-constructed voice recognition model;
Inputting the acoustic feature sample into the pre-constructed speech recognition model;
Calculating to obtain reconstruction loss based on the first voice representation output by the reconstruction prediction module and the acoustic feature sample;
obtaining a contrast loss based on the fused voice representation output by the feature fusion submodule and the acoustic feature sample calculation;
calculating to obtain diversity loss based on codebook information of the acoustic feature samples;
Performing iterative updating on initial network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module according to the reconstruction loss, the comparison loss and the diversity loss to obtain updated network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module;
Taking the updated network parameters as voice characterization extracted by a feature extractor of the CTC module, and training and decoding the CTC module based on the acoustic feature sample and the labeling data so as to obtain a trained voice recognition model;
Or carrying out iterative updating on the randomly initialized network parameters in the coding and quantizing module, the reconstruction prediction module, the comparison prediction module and the CTC module according to the reconstruction loss, the comparison loss and the diversity loss, so as to obtain a trained speech recognition model.
4. The method of claim 1, wherein the coding and quantization module comprises an encoder and a vector quantization layer, the encoder being obtained based on Conformer networks;
accordingly, the encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a speech vector includes:
encoding the target acoustic feature with the encoder to obtain a potential speech representation;
Discretizing the potential speech representation by the vector quantization layer to obtain the speech vector.
5. The dual-stream, self-supervising network based speech recognition method according to claim 4, wherein the encoder comprises a plurality of layers Conformer, each layer Conformer comprising:
a first feedforward layer, a first residual and normalization module, a multi-head self-attention layer, a second residual and normalization module, a convolution module, a third residual and normalization module, a second feedforward layer, a fourth residual and normalization module and a Layernorm layer which are connected in sequence; residual connections are provided between the first residual and normalization module and the second residual and normalization module, between the second and the third residual and normalization modules, and between the third and the fourth residual and normalization modules.
6. The voice recognition method based on the dual-stream self-supervision network according to any one of claims 1-5, wherein the pre-trained voice recognition model further comprises a random masking module;
Accordingly, after the acquisition of the target acoustic features, the method further comprises:
Performing time random masking and frequency random masking processing on the target acoustic features by using the random masking module to obtain target masked acoustic features;
And the encoding and quantizing module is used for encoding and quantizing the target mask acoustic features to obtain a voice vector.
7. A voice recognition device based on a dual-flow self-supervision network, comprising:
the acoustic feature and model acquisition module is used for acquiring target acoustic features and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
The coding and quantizing module is used for coding and quantizing the target acoustic feature by utilizing the coding and quantizing module to obtain a voice vector;
the reconstruction and comparison module is used for carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
the fusion module is used for fusing the first voice representation and the second voice representation by utilizing the characteristic fusion sub-module to obtain a fused voice representation;
And the classification module is used for identifying the fused voice representation by combining the first sub-model with a connection time sequence classifier in the CTC module based on the target acoustic characteristics to obtain a transcribed text.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a voice recognition method based on a dual stream self supervising network according to any one of claims 1 to 6 when executing the program.
9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements a dual-flow self-supervising network based speech recognition method according to any one of claims 1 to 6.