CN116416967B - Method for improving Chongqing dialect voice recognition through transfer learning - Google Patents
Method for improving Chongqing dialect voice recognition through transfer learning
- Publication number: CN116416967B (application CN202111651840.0A)
- Authority: CN (China)
- Priority/filing date: 2021-12-30
- Publication dates: CN116416967A, 2023-07-11; CN116416967B (grant), 2024-09-24
- Legal status: Active
Classifications
- G10L15/005 — Speech recognition; Language recognition
- G10L15/063 — Speech recognition; Creation of reference templates; Training of speech recognition systems
- G10L15/26 — Speech recognition; Speech to text systems
- G10L25/18 — Speech or voice analysis techniques; extracted parameters being spectral information of each sub-band
Abstract
The invention discloses a method for improving Chongqing dialect voice recognition through transfer learning, which comprises the following steps: 1) acquiring voice data; 2) obtaining a voice spectrogram; 3) vectorizing the voice spectrogram to obtain a vector v; 4) acquiring the input X of a Transformer model; 5) converting X into the parameters Q, K and V; 6) inputting Q, K and V into the two encoders of the Transformer model to obtain the encoder outputs Y1 and Y2; 7) inputting Y1 and Y2 into the decoder of the Transformer model to obtain a speech recognition text; 8) determining the input x of a pinyin BERT model from that text; 9) inputting x into the pinyin BERT model to obtain the speech recognition result. By adopting a pipeline design, the invention decouples the acoustic model from the language model in ASR, which broadens the choice of ASR models.
Description
Technical Field
The invention relates to the field of speech recognition, and in particular to a method for improving Chongqing dialect voice recognition through transfer learning.
Background
Speech recognition technology dates back to the 1950s and has since achieved good results. Likewise, natural language processing has evolved alongside deep learning, moving gradually from statistical models to deep semantic models, and is now widely applied to classic NLP scenarios such as NLG tasks and named entity recognition.
Artificial intelligence products are widely used across the IT industry, and ASR is an important component of artificial intelligence: it lets a computer "understand" human speech. Advances in ASR help people communicate with more AI products and realize human-computer interaction, so that people can enjoy the convenience and efficiency that technological development brings to daily life.
Implementations of ASR can be classified as either pipeline or end-to-end, the main difference being the recognition unit of the acoustic model. The size of the recognition unit (word, syllable/semi-syllable, or phoneme model) strongly affects the amount of speech training data required, the recognition rate, and the flexibility. For a system with a medium or larger vocabulary, a small recognition unit means less computation, a smaller model, and relatively less training data, but locating and segmenting the corresponding speech segments is harder and the recognition rules become more complex. Large recognition units tend to capture co-articulation within the model, which benefits the recognition rate but requires correspondingly more training data.
In summary, statistics-based language models are constrained by corpus size, their effect is limited, and the expressive power of statistical information is limited at the semantic level. In the prior art, the language model is not integrated with the acoustic model; most deep-learning acoustic models adopt CNN- or RNN-like structures, whose computational efficiency is limited. Models such as BERT have limited effect in NLG tasks because their bidirectional attention mechanism is ill-suited to text generation.
Disclosure of Invention
The invention aims to provide a method for improving Chongqing dialect voice recognition through transfer learning, which comprises the following steps:
1) Voice data is acquired. The voice data includes dialects.
2) Performing a Fourier transform on the voice data to obtain a voice spectrogram.
3) Vectorizing the voice spectrogram with a VGG network to obtain a vector v.
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data.
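A minimal sketch of steps 2)-3), assuming PyTorch with torchaudio and a torchvision VGG16 trunk (the patent names only "a VGG network" and does not fix the toolkit, the STFT size, or the VGG variant):

```python
import torch
import torchaudio
import torchvision

def speech_vector(wav_path: str) -> torch.Tensor:
    """Sketch of v = VGG(DFT(A)): spectrogram via STFT, then VGG features."""
    waveform, sr = torchaudio.load(wav_path)                        # A: raw audio
    spec = torchaudio.transforms.Spectrogram(n_fft=512)(waveform)   # |DFT(A)|^2 spectrogram
    spec = spec.log1p()                                             # compress dynamic range
    img = spec.unsqueeze(0).repeat(1, 3, 1, 1)                      # 3-channel "image" for VGG
    vgg = torchvision.models.vgg16(weights=None).features           # convolutional trunk only
    with torch.no_grad():
        v = vgg(torch.nn.functional.interpolate(img, size=(224, 224)))
    return v.flatten(1)                                             # vector v
```

In practice the VGG trunk would be trained jointly with the rest of the acoustic model rather than used frozen as in this sketch.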
4) The input X of the Transformer model is obtained. The Transformer model includes an encoder1, an encoder2 and a decoder.
The input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Where PE is a position-encoding function and Fbank(·) denotes a speech feature extraction operation.
5) And converting the input X to obtain parameters Q, K and V.
Q = XW^Q, K = XW^K, V = XW^V (3)
6) The parameter Q, the parameter K and the parameter V are input into the encoder1 and the encoder2 of the Transformer model to obtain the encoder output Y1 and the encoder output Y2, respectively.
Each encoder includes a multi-head attention layer and a forward propagation (feed-forward) layer.
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The parameter head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Wherein h is the number of attention heads; W_i^Q, W_i^K, W_i^V are the projection weights of the i-th head.
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V (6)
Where sqrt(d_k) is a normalization parameter.
The output FFN(x) of the forward propagation layer is as follows:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (7)
The input x of the forward propagation layer is as follows:
x = norm(X + MultiHead(Q, K, V)) (8)
The output Y of the encoder is as follows:
Y = FFN(x) (9)
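A compact PyTorch sketch of one encoder block implementing equations (4)-(9); the model width, number of heads h and feed-forward size are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    """Multi-head attention + forward propagation layer, eqs. (4)-(9)."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.wq = nn.Linear(d_model, d_model)   # W^Q
        self.wk = nn.Linear(d_model, d_model)   # W^K
        self.wv = nn.Linear(d_model, d_model)   # W^V
        self.wo = nn.Linear(d_model, d_model)   # W^O
        self.w1 = nn.Linear(d_model, d_ff)      # W_1, b_1
        self.w2 = nn.Linear(d_ff, d_model)      # W_2, b_2
        self.norm = nn.LayerNorm(d_model)

    def forward(self, X):                        # X: (batch, seq, d_model)
        B, T, _ = X.shape
        split = lambda t: t.view(B, T, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.wq(X)), split(self.wk(X)), split(self.wv(X))
        att = F.softmax(Q @ K.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # eq. (6)
        heads = (att @ V).transpose(1, 2).reshape(B, T, -1)                 # Concat(head_1..head_h)
        x = self.norm(X + self.wo(heads))                                   # eq. (8)
        return self.w2(F.relu(self.w1(x)))                                  # eqs. (7), (9)
```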
7) The encoder output Y1 and the encoder output Y2 are input to the decoder of the Transformer model to obtain a speech recognition text.
8) Based on the speech recognition text, an input x of the pinyin BERT model is determined.
The input x of the pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W_F + PE (10)
Where CE denotes character embedding, GE denotes glyph embedding, PYE denotes pinyin embedding, PE denotes the position embedding of the BERT input, W_F denotes a fully connected layer, and Concat denotes vector concatenation.
The glyph embedding GE is shown below:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G (11)
Where I_1, I_2, I_3 denote glyph images (one per font), W_G denotes a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
Pinyin embedding PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Wherein S represents a pinyin sequence, max-pooling represents maximum pooling, and CNN represents a convolution calculation.
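A small sketch of equation (12), assuming each romanized pinyin character (letters plus the tone digit) is first mapped to a trainable embedding; the vocabulary size, embedding width and kernel width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PinyinEmbedding(nn.Module):
    """PYE = max-pooling(CNN(S)) over a romanized pinyin character sequence."""
    def __init__(self, vocab_size=32, emb_dim=128, out_dim=768, kernel=2):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, emb_dim)          # ids for 'm','a','o','1',...
        self.conv = nn.Conv1d(emb_dim, out_dim, kernel_size=kernel)

    def forward(self, S):                      # S: (batch, pinyin_len) character ids
        x = self.char_emb(S).transpose(1, 2)   # (batch, emb_dim, pinyin_len)
        x = self.conv(x)                       # CNN(S)
        return x.max(dim=-1).values            # max-pooling over the sequence
```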
9) And inputting the input x into the pinyin BERT model to obtain a voice recognition result.
The speech recognition result p(x_1, x_2, x_3, ..., x_n) is as follows:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3) (13)
Where each factor such as p(x_2|x_1) is a conditional probability distribution over the speech recognition text.
The technical effects of the invention are as follows. By replacing the statistical language model in ASR with a model pre-trained on a large-scale corpus, semantic-level information can be captured more comprehensively; by adopting a pipeline design, the acoustic model in ASR is decoupled from the language model, which broadens the choice of ASR models.
By incorporating position embeddings into the acoustic model, the acoustic model acquires a certain language-modeling capability, and its effectiveness in extracting acoustic information and completing decoding is enhanced.
By introducing pinyin, glyph and similar embeddings, which match characteristics of Chinese ASR such as identical initials, identical finals and identical pronunciations, the invention captures linguistic information from all angles and improves the accuracy of the language model during decoding.
The invention applies the UniLM model to the ASR scenario and improves the accuracy of ASR decoding by exploiting the effectiveness of UniLM in text generation tasks.
Motivated by the remarkable performance that pre-training methods have achieved on a large number of NLP tasks in recent years, the invention uses a Transformer as the acoustic model to obtain a preliminary ASR result, and then combines it with a pre-trained language model (UniLM) obtained by pinyin pre-training, adapted to the corpus of the target language scenario, to produce the final ASR output.
Drawings
FIG. 1 is a speech recognition process;
FIG. 2 is a speech feature processing flow;
FIG. 3 is a diagram of the Transformer structure;
FIG. 4 shows the input organization;
FIG. 5 shows the input information fusion;
FIG. 6 shows the pinyin embedding.
Detailed Description
The present invention is further described below with reference to examples, but this should not be construed as limiting the scope of the subject matter of the invention to the following examples. Various substitutions and alterations made according to ordinary skill and customary means in the art, without departing from the technical spirit of the invention, are intended to be included within the scope of the invention.
Example 1:
Referring to FIGS. 1 to 6, a method for enhancing Chongqing dialect speech recognition by transfer learning comprises the following steps:
1) Voice data is acquired. The voice data includes dialects.
2) Performing a Fourier transform on the voice data to obtain a voice spectrogram.
3) Vectorizing the voice spectrogram with a VGG network to obtain a vector v.
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data.
4) The input X of the Transformer model is obtained. The Transformer model includes an encoder1, an encoder2 and a decoder.
The input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Where PE is a position-encoding function and Fbank() represents a speech feature extraction operation.
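The text does not define Fbank further; a common reading is log mel filter-bank features, computed for example with torchaudio as in the sketch below (the file name and the number of mel bins are assumptions):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# hypothetical input file; 80 mel bins is a typical choice for ASR front-ends
waveform, sr = torchaudio.load("sample.wav")
fbank = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sr)  # (frames, 80)
```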
5) And converting the input X to obtain parameters Q, K and V.
Q = XW^Q, K = XW^K, V = XW^V (3)
6) The parameter Q, the parameter K and the parameter V are input into the encoder1 and the encoder2 of the Transformer model to obtain the encoder output Y1 and the encoder output Y2, respectively.
Each encoder includes a multi-head attention layer and a forward propagation (feed-forward) layer.
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The parameter head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Wherein h is the number of attention heads; W_i^Q, W_i^K, W_i^V are the projection weights of the i-th head.
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V (6)
Where sqrt(d_k) is a normalization parameter.
The output FFN(x) of the forward propagation layer is as follows:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (7)
The input x of the forward propagation layer is as follows:
x = norm(X + MultiHead(Q, K, V)) (8)
The output Y of the encoder is as follows:
Y = FFN(x) (9)
7) The encoder output Y1 and the encoder output Y2 are input to the decoder of the Transformer model to obtain a speech recognition text.
8) Based on the speech recognition text, an input x of the pinyin BERT model is determined.
The input x of the pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W_F + PE (10)
Where CE denotes character embedding, GE denotes glyph embedding, PYE denotes pinyin embedding, PE denotes the position embedding of the BERT input, W_F denotes a fully connected layer, and Concat denotes vector concatenation.
The glyph embedding GE is shown below:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G (11)
Where I_1, I_2, I_3 denote glyph images (one per font), W_G denotes a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
Pinyin embedding PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Wherein S represents a pinyin sequence, max-pooling represents maximum pooling, and CNN represents a convolution calculation.
9) And inputting the input x into the pinyin BERT model to obtain a voice recognition result.
The speech recognition result p(x_1, x_2, x_3, ..., x_n) is as follows:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3) (13)
Where each factor such as p(x_2|x_1) is a conditional probability distribution over the speech recognition text.
Example 2:
A method for enhancing Chongqing dialect speech recognition by transfer learning, comprising the steps of:
1) From the audio, signal processing and a Fourier transform are used to obtain the spectrogram of a single audio file, and a vector representation of the whole spectrogram is extracted through a VGG network.
The formula can be expressed as:
V=VGG(DFT(A))
Where A is an audio file, DFT is the discrete Fourier transform, VGG is a VGG network, and V is the vector representation output by the VGG.
2) From the spectrogram, the position information of each spectral unit in the original image is obtained, vectorized by an embedding, and then input into the Transformer together with the Fbank features.
The calculation flow and formula of the encoder are as follows:
The Transformer input X consists of two parts, a position code and the Fbank features, where PE is the position-encoding function (one plausible form of PE is sketched after the formula below):
X=PE(DFT(A))+Fbank(V)
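The exact form of PE is not specified in the patent; a sinusoidal position encoding in the style of the original Transformer is one plausible choice, sketched here for illustration:

```python
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position encoding, one plausible realisation of PE(.)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    idx = torch.arange(0, d_model, 2, dtype=torch.float)
    angle = pos / torch.pow(10000.0, idx / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe          # added frame by frame to the Fbank features to form X
```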
converting input X to Q, K, V:
Q = XW^Q, K = XW^K, V = XW^V
Attention calculation formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
multi-head attention layer:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
Wherein:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h
Forward propagation layer:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
wherein:
x=norm(X+MultiHead(Q,K,V))
Output of the encoder:
Y=FFN(x)
The calculation process of the decoder is similar to that of the encoder; referring to FIG. 3, its detailed description is omitted here.
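The patent does not state how the decoder combines the two encoder outputs Y1 and Y2; one simple reading is a decoder layer with two successive cross-attention steps, one per encoder, sketched below purely as an illustrative assumption:

```python
import torch
import torch.nn as nn

class DualCrossDecoderLayer(nn.Module):
    """Hypothetical decoder layer attending to both encoder outputs Y1 and Y2."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.cross1 = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.cross2 = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, Y1, Y2, tgt_mask=None):
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        x = self.norms[1](x + self.cross1(x, Y1, Y1)[0])   # attend to encoder 1 output
        x = self.norms[2](x + self.cross2(x, Y2, Y2)[0])   # attend to encoder 2 output
        return self.norms[3](x + self.ffn(x))
```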
3) The most distinctive characteristics of Chinese characters lie in two aspects: the glyph and the pinyin. Chinese characters form a typical logographic script; from their origin, the glyph itself carries part of the semantics. For example, the characters for "river" and "lake" both contain the water radical (three-dot water), indicating that they are all related to water. As for pronunciation, the pinyin of a Chinese character can reflect its meaning to a certain extent and helps to distinguish word senses. For example, the character 乐 has two pronunciations, yuè and lè: the former means "music" and is a noun, while the latter means "happy" and is an adjective. For such a polyphone, if only the character 乐 is input, the model cannot know whether it should mean "music" or "happy"; additional pronunciation information is needed for disambiguation. Based on these two characteristics of Chinese characters, their glyph and pinyin information are integrated into the pre-training process on the Chinese corpus. The glyph vector of a Chinese character is built from several different fonts, and the pinyin vector is derived from the corresponding romanized sequence of pinyin characters. The two are fused together with the character vector to obtain a final fusion vector, which serves as the input of the pre-training model. The model is trained with two strategies, Whole Word Masking and Character Masking, so that it more comprehensively establishes the connections among Chinese characters, glyphs, pronunciations and context.
X = Concat(CE, GE, PYE)W_F + PE
Where CE is the character embedding, GE is the glyph embedding, PYE is the pinyin embedding, PE is the position embedding, W_F is a fully connected layer, X is the BERT input, and Concat denotes vector concatenation.
The fusion layer (Fusion Layer) at the bottom fuses the glyph embedding (Glyph Embedding) and the pinyin embedding (Pinyin Embedding) with the character embedding (Char Embedding) to obtain a fusion embedding (Fusion Embedding), which is then added to the position embedding to form the input of the model. The glyph embedding is built from images of the Chinese character in different fonts. Each image is 24 x 24 in size; three fonts (FangSong, semi-cursive script and clerical script) are vectorized, and the images are concatenated and passed through the fully connected layer W_G to obtain the glyph embedding.
The process is as shown in fig. 5:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G
Where I_1, I_2, I_3 are the glyph images, W_G is a fully connected layer, GE is the glyph embedding, and flatten converts a two-dimensional image into a one-dimensional vector.
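A sketch of the fusion step of equations (10)-(11): the three 24 x 24 glyph images are flattened and projected by W_G, concatenated with the character and pinyin embeddings, projected by W_F, and added to the position embedding (the hidden size is an illustrative assumption):

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """x = Concat(CE, GE, PYE) W_F + PE, with GE built from three 24x24 glyph images."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_g = nn.Linear(3 * 24 * 24, d_model)   # W_G: glyph projection
        self.w_f = nn.Linear(3 * d_model, d_model)   # W_F: fusion projection

    def forward(self, CE, glyph_imgs, PYE, PE):
        # glyph_imgs: (..., 3, 24, 24), one image per font for each character
        GE = self.w_g(glyph_imgs.flatten(-3))                     # eq. (11)
        return self.w_f(torch.cat([CE, GE, PYE], dim=-1)) + PE    # eq. (10)
```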
Pinyin embedding first converts the pinyin of each Chinese character into a sequence of romanized characters, which also encodes the tone, using pypinyin. For example, for the Chinese character for "cat", the pinyin character sequence is "mao1". For polyphones such as 乐, pypinyin can accurately identify the correct pinyin in the current context (a short usage example follows).
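A minimal pypinyin usage example of the tone-numbered romanization described above (the outputs shown in the comments are the expected results):

```python
from pypinyin import pinyin, Style

print(pinyin("猫", style=Style.TONE3))    # expected: [['mao1']]
print(pinyin("音乐", style=Style.TONE3))  # expected: [['yin1'], ['yue4']]  (乐 as "music")
print(pinyin("快乐", style=Style.TONE3))  # expected: [['kuai4'], ['le4']]  (乐 as "happy")
```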
The process is as shown in fig. 6:
PYE = max-pooling(CNN(S))
Where S is a pinyin sequence, max-pooling is maximum pooling, CNN is a convolution calculation, and PYE is the pinyin embedding.
4) The final ASR recognition result is generated in combination with the pre-trained UniLM model. Compared with generation models built on a unidirectional language model, BERT by itself cannot satisfy the requirements of a language model because of its bidirectional decoding; however, the decoding direction can be controlled explicitly through the attention mask (Mask attention), changing it from bidirectional to unidirectional:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3)
Any generation order of x_1, x_2, ..., x_n is possible. In principle each ordering corresponds to one model, so there are n! such sequential language models, each corresponding to perturbing the original lower-triangular mask in some way. Since Attention provides an n x n attention matrix, there is enough freedom to mask this matrix in different ways and thereby realize these diverse orderings, which satisfies the requirements of a language model.
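A minimal sketch of the idea: in a UniLM-style seq2seq mask, the recognized-text "source" positions attend to each other freely, while generated positions attend only to the source and to the already-generated prefix; changing the mask changes the factorization order (the helper below is an illustration, not the patent's exact masking scheme):

```python
import torch

def unilm_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Source tokens see each other fully; target tokens see the source
    plus only the already-generated target prefix (lower triangle)."""
    n = src_len + tgt_len
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :src_len] = True                                        # everyone attends to the source
    tri = torch.tril(torch.ones(tgt_len, tgt_len)).bool()
    mask[src_len:, src_len:] = tri                                  # causal part for generation
    return mask                                                     # True = attention allowed

print(unilm_mask(2, 3).int())
```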
Claims (7)
1. A method for enhancing Chongqing dialect speech recognition by transfer learning, comprising the steps of:
1) Acquiring voice data;
2) Performing Fourier transform on the voice data to obtain a voice spectrogram;
3) Vectorizing the voice spectrogram by utilizing a VGG network to obtain a vector v;
4) Acquiring an input X of a Transformer model; the Transformer model comprises an encoder1, an encoder2 and a decoder;
5) Converting the input X to obtain a parameter Q, a parameter K and a parameter V;
6) Inputting the parameter Q, the parameter K and the parameter V into the encoder1 and the encoder2 of the Transformer model to respectively obtain an encoder output Y1 and an encoder output Y2;
7) Inputting the encoder output Y1 and the encoder output Y2 into the decoder of the Transformer model to obtain a voice recognition text;
8) Determining input x of a pinyin BERT model based on the speech recognition text;
9) Inputting an input x into a pinyin BERT model to obtain a voice recognition result;
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data;
the input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Wherein PE is a position coding function;
the parameters Q, K, V are as follows:
Q = XW^Q, K = XW^K, V = XW^V (3).
2. The method for enhancing Chongqing dialect speech recognition by transfer learning of claim 1, wherein: the voice data includes dialects.
3. The method for enhancing Chongqing dialect speech recognition by transfer learning of claim 1, wherein: the encoder comprises a multi-head attention layer and a forward propagation layer;
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The parameter head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Wherein h is the number of attention heads; W_i^Q, W_i^K, W_i^V are the projection weights of the i-th head;
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V (6)
where sqrt(d_k) is a normalization parameter;
the output FFN(x') of the forward propagation layer is as follows:
FFN(x') = max(0, x'W_1 + b_1)W_2 + b_2 (7)
The input x' of the forward propagation layer is as follows:
x' = norm(X + MultiHead(Q, K, V)) (8)
the output Y of the encoder is as follows:
Y = FFN(x') (9).
4. the method for enhancing Chongqing dialect speech recognition by transfer learning as recited in claim 1, wherein the input x of the Pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W_F + PE' (10)
Wherein, CE represents character embedding; GE represents glyph embedding; PYE represents pinyin embedding; PE' represents position embedding; W_F represents a fully connected layer; Concat denotes vector concatenation.
5. The method for enhancing speech recognition of Chongqing dialect by transfer learning of claim 4, wherein the glyph embedding GE is as follows:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G (11)
Wherein I_1, I_2, I_3 represent glyph images; W_G represents a fully connected layer; flatten converts a two-dimensional image into a one-dimensional vector.
6. The method for enhancing speech recognition of Chongqing dialect by transfer learning of claim 4, wherein the pinyin-embedded PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Wherein S represents a Pinyin sequence; max-pooling represents maximum pooling; CNN represents a convolution calculation.
7. The method for enhancing Chongqing dialect speech recognition by transfer learning as set forth in claim 1, wherein the speech recognition result p(x_1, x_2, x_3, ..., x_n) is as follows:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3) (13)
Where each factor such as p(x_2|x_1) is a conditional probability distribution over the speech recognition text.