CN116416967B - Method for improving Chongqing dialect voice recognition through transfer learning - Google Patents
Method for improving Chongqing dialect voice recognition through transfer learning
- Publication number: CN116416967B (application CN202111651840.0A)
- Authority: CN (China)
- Priority/filing date: 2021-12-30
- Publication dates: CN116416967A, 2023-07-11; CN116416967B (grant), 2024-09-24
- Legal status: Active
Classifications
- G10L15/005 — Speech recognition; Language recognition
- G10L15/063 — Speech recognition; Creation of reference templates; Training of speech recognition systems
- G10L15/26 — Speech recognition; Speech to text systems
- G10L25/18 — Speech or voice analysis techniques; extracted parameters being spectral information of each sub-band
Abstract
The invention discloses a method for improving Chongqing dialect voice recognition through transfer learning, which comprises the following steps: 1) acquiring voice data; 2) obtaining a voice spectrogram; 3) vectorizing the voice spectrogram to obtain a vector v; 4) acquiring the input X of a Transformer model; 5) converting X into the parameters Q, K and V; 6) inputting Q, K and V into the two encoders of the Transformer model to obtain the encoder outputs Y1 and Y2; 7) inputting Y1 and Y2 into the decoder of the Transformer model to obtain a speech recognition text; 8) determining the input x of a pinyin BERT model from that text; 9) inputting x into the pinyin BERT model to obtain the speech recognition result. By adopting a pipeline design, the invention decouples the acoustic model from the language model in ASR, which broadens the choice of ASR models.
Description
Technical Field
The invention relates to the field of speech recognition, and in particular to a method for improving Chongqing dialect voice recognition through transfer learning.
Background
Speech recognition technology dates back to the 1950s and has since achieved good results. Likewise, natural language processing has evolved alongside deep learning, moving gradually from statistical models to deep semantic models, and is now widely applied to classic NLP scenarios such as NLG tasks and named entity recognition.
Artificial intelligence products are widely used across the IT industry, and ASR is an important component of artificial intelligence: it lets a computer "understand" human speech. Advances in ASR help people communicate with more AI products and realize human-computer interaction, so that people can enjoy the convenience and efficiency that technological development brings to daily life.
Implementations of ASR can be classified as either pipeline or end-to-end, the main difference being the recognition unit of the acoustic model. The size of the recognition unit (word, syllable/semi-syllable, or phoneme model) strongly affects the amount of speech training data required, the recognition rate, and the flexibility. For a system with a medium or larger vocabulary, a small recognition unit means less computation, a smaller model, and relatively less training data, but locating and segmenting the corresponding speech segments is harder and the recognition rules become more complex. Large recognition units tend to capture co-articulation within the model, which benefits the recognition rate but requires correspondingly more training data.
In summary, statistics-based language models are constrained by corpus size, their effect is limited, and the expressive power of statistical information is limited at the semantic level. In the prior art, the language model is not integrated with the acoustic model; most deep-learning acoustic models adopt CNN- or RNN-like structures, whose computational efficiency is limited. Models such as BERT have limited effect in NLG tasks because their bidirectional attention mechanism is ill-suited to text generation.
Disclosure of Invention
The invention aims to provide a method for improving Chongqing dialect voice recognition through transfer learning, which comprises the following steps:
1) Voice data is acquired. The voice data includes dialects.
2) Performing a Fourier transform on the voice data to obtain a voice spectrogram.
3) Vectorizing the voice spectrogram with a VGG network to obtain a vector v.
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data.
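A minimal sketch of steps 2)-3), assuming PyTorch with torchaudio and a torchvision VGG16 trunk (the patent names only "a VGG network" and does not fix the toolkit, the STFT size, or the VGG variant):

```python
import torch
import torchaudio
import torchvision

def speech_vector(wav_path: str) -> torch.Tensor:
    """Sketch of v = VGG(DFT(A)): spectrogram via STFT, then VGG features."""
    waveform, sr = torchaudio.load(wav_path)                        # A: raw audio
    spec = torchaudio.transforms.Spectrogram(n_fft=512)(waveform)   # |DFT(A)|^2 spectrogram
    spec = spec.log1p()                                             # compress dynamic range
    img = spec.unsqueeze(0).repeat(1, 3, 1, 1)                      # 3-channel "image" for VGG
    vgg = torchvision.models.vgg16(weights=None).features           # convolutional trunk only
    with torch.no_grad():
        v = vgg(torch.nn.functional.interpolate(img, size=(224, 224)))
    return v.flatten(1)                                             # vector v
```

In practice the VGG trunk would be trained jointly with the rest of the acoustic model rather than used frozen as in this sketch.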
4) The input X of the Transformer model is obtained. The Transformer model includes an encoder1, an encoder2 and a decoder.
The input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Where PE is a position-encoding function and Fbank(·) denotes a speech feature extraction operation.
5) And converting the input X to obtain parameters Q, K and V.
Q = XW^Q, K = XW^K, V = XW^V (3)
6) The parameter Q, the parameter K and the parameter V are input into the encoder1 and the encoder2 of the Transformer model to obtain the encoder output Y1 and the encoder output Y2, respectively.
Each encoder includes a multi-head attention layer and a forward propagation (feed-forward) layer.
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The parameter head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Wherein h is the number of attention heads; W_i^Q, W_i^K, W_i^V are the projection weights of the i-th head.
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V (6)
Where sqrt(d_k) is a normalization parameter.
The output FFN(x) of the forward propagation layer is as follows:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (7)
The input x of the forward propagation layer is as follows:
x = norm(X + MultiHead(Q, K, V)) (8)
The output Y of the encoder is as follows:
Y = FFN(x) (9)
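A compact PyTorch sketch of one encoder block implementing equations (4)-(9); the model width, number of heads h and feed-forward size are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    """Multi-head attention + forward propagation layer, eqs. (4)-(9)."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.wq = nn.Linear(d_model, d_model)   # W^Q
        self.wk = nn.Linear(d_model, d_model)   # W^K
        self.wv = nn.Linear(d_model, d_model)   # W^V
        self.wo = nn.Linear(d_model, d_model)   # W^O
        self.w1 = nn.Linear(d_model, d_ff)      # W_1, b_1
        self.w2 = nn.Linear(d_ff, d_model)      # W_2, b_2
        self.norm = nn.LayerNorm(d_model)

    def forward(self, X):                        # X: (batch, seq, d_model)
        B, T, _ = X.shape
        split = lambda t: t.view(B, T, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.wq(X)), split(self.wk(X)), split(self.wv(X))
        att = F.softmax(Q @ K.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # eq. (6)
        heads = (att @ V).transpose(1, 2).reshape(B, T, -1)                 # Concat(head_1..head_h)
        x = self.norm(X + self.wo(heads))                                   # eq. (8)
        return self.w2(F.relu(self.w1(x)))                                  # eqs. (7), (9)
```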
7) The encoder output Y1 and the encoder output Y2 are input to the decoder of the Transformer model to obtain a speech recognition text.
8) Based on the speech recognition text, an input x of the pinyin BERT model is determined.
The input x of the pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W_F + PE (10)
Where CE denotes character embedding, GE denotes glyph embedding, PYE denotes pinyin embedding, PE denotes the position embedding of the BERT input, W_F denotes a fully connected layer, and Concat denotes vector concatenation.
The glyph embedding GE is shown below:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G (11)
Where I_1, I_2, I_3 denote glyph images (one per font), W_G denotes a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
Pinyin embedding PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Wherein S represents a pinyin sequence, max-pooling represents maximum pooling, and CNN represents a convolution calculation.
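A small sketch of equation (12), assuming each romanized pinyin character (letters plus the tone digit) is first mapped to a trainable embedding; the vocabulary size, embedding width and kernel width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PinyinEmbedding(nn.Module):
    """PYE = max-pooling(CNN(S)) over a romanized pinyin character sequence."""
    def __init__(self, vocab_size=32, emb_dim=128, out_dim=768, kernel=2):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, emb_dim)          # ids for 'm','a','o','1',...
        self.conv = nn.Conv1d(emb_dim, out_dim, kernel_size=kernel)

    def forward(self, S):                      # S: (batch, pinyin_len) character ids
        x = self.char_emb(S).transpose(1, 2)   # (batch, emb_dim, pinyin_len)
        x = self.conv(x)                       # CNN(S)
        return x.max(dim=-1).values            # max-pooling over the sequence
```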
9) And inputting the input x into the pinyin BERT model to obtain a voice recognition result.
The speech recognition result p(x_1, x_2, x_3, ..., x_n) is as follows:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3) (13)
Where each factor such as p(x_2|x_1) is a conditional probability distribution over the speech recognition text.
The technical effects of the invention are as follows. By replacing the statistical language model in ASR with a model pre-trained on a large-scale corpus, semantic-level information can be captured more comprehensively; by adopting a pipeline design, the acoustic model in ASR is decoupled from the language model, which broadens the choice of ASR models.
By incorporating position embeddings into the acoustic model, the acoustic model acquires a certain language-modeling capability, and its effectiveness in extracting acoustic information and completing decoding is enhanced.
By introducing pinyin, glyph and similar embeddings, which match characteristics of Chinese ASR such as identical initials, identical finals and identical pronunciations, the invention captures linguistic information from all angles and improves the accuracy of the language model during decoding.
The invention applies the UniLM model to the ASR scenario and improves the accuracy of ASR decoding by exploiting the effectiveness of UniLM in text generation tasks.
Motivated by the remarkable performance that pre-training methods have achieved on a large number of NLP tasks in recent years, the invention uses a Transformer as the acoustic model to obtain a preliminary ASR result, and then combines it with a pre-trained language model (UniLM) obtained by pinyin pre-training, adapted to the corpus of the target language scenario, to produce the final ASR output.
Drawings
FIG. 1 is a speech recognition process;
FIG. 2 is a speech feature processing flow;
FIG. 3 is a diagram of the Transformer structure;
FIG. 4 shows the input organization;
FIG. 5 shows the input information fusion;
FIG. 6 shows the pinyin embedding.
Detailed Description
The present invention is further described below with reference to examples, but this should not be construed as limiting the scope of the subject matter of the invention to the following examples. Various substitutions and alterations made according to ordinary skill and customary means in the art, without departing from the technical spirit of the invention, are intended to be included within the scope of the invention.
Example 1:
Referring to FIGS. 1 to 6, a method for enhancing Chongqing dialect speech recognition by transfer learning comprises the following steps:
1) Voice data is acquired. The voice data includes dialects.
2) Performing a Fourier transform on the voice data to obtain a voice spectrogram.
3) Vectorizing the voice spectrogram with a VGG network to obtain a vector v.
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data.
4) The input X of the Transformer model is obtained. The Transformer model includes an encoder1, an encoder2 and a decoder.
The input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Where PE is a position-encoding function and Fbank() represents a speech feature extraction operation.
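The text does not define Fbank further; a common reading is log mel filter-bank features, computed for example with torchaudio as in the sketch below (the file name and the number of mel bins are assumptions):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# hypothetical input file; 80 mel bins is a typical choice for ASR front-ends
waveform, sr = torchaudio.load("sample.wav")
fbank = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sr)  # (frames, 80)
```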
5) And converting the input X to obtain parameters Q, K and V.
Q = XW^Q, K = XW^K, V = XW^V (3)
6) The parameter Q, the parameter K and the parameter V are input into the encoder1 and the encoder2 of the Transformer model to obtain the encoder output Y1 and the encoder output Y2, respectively.
Each encoder includes a multi-head attention layer and a forward propagation (feed-forward) layer.
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The parameter head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Wherein h is the number of attention heads; W_i^Q, W_i^K, W_i^V are the projection weights of the i-th head.
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V (6)
Where sqrt(d_k) is a normalization parameter.
The output FFN(x) of the forward propagation layer is as follows:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (7)
The input x of the forward propagation layer is as follows:
x = norm(X + MultiHead(Q, K, V)) (8)
The output Y of the encoder is as follows:
Y = FFN(x) (9)
7) The encoder output Y1 and the encoder output Y2 are input to the decoder of the Transformer model to obtain a speech recognition text.
8) Based on the speech recognition text, an input x of the pinyin BERT model is determined.
The input x of the pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W_F + PE (10)
Where CE denotes character embedding, GE denotes glyph embedding, PYE denotes pinyin embedding, PE denotes the position embedding of the BERT input, W_F denotes a fully connected layer, and Concat denotes vector concatenation.
The glyph embedding GE is shown below:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G (11)
Where I_1, I_2, I_3 denote glyph images (one per font), W_G denotes a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
Pinyin embedding PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Wherein S represents a pinyin sequence, max-pooling represents maximum pooling, and CNN represents a convolution calculation.
9) And inputting the input x into the pinyin BERT model to obtain a voice recognition result.
The speech recognition result p(x_1, x_2, x_3, ..., x_n) is as follows:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3) (13)
Where each factor such as p(x_2|x_1) is a conditional probability distribution over the speech recognition text.
Example 2:
A method for enhancing Chongqing dialect speech recognition by transfer learning, comprising the steps of:
1) From the audio, signal processing and a Fourier transform are used to obtain the spectrogram of a single audio file, and a vector representation of the whole spectrogram is extracted through a VGG network.
The formula can be expressed as:
V=VGG(DFT(A))
Where A is an audio file, DFT is the discrete Fourier transform, VGG is a VGG network, and V is the vector representation output by the VGG.
2) From the spectrogram, the position information of each spectral unit in the original image is obtained, vectorized by an embedding, and then input into the Transformer together with the Fbank features.
The calculation flow and formula of the encoder are as follows:
The Transformer input X consists of two parts, a position code and the Fbank features, where PE is the position-encoding function (one plausible form of PE is sketched after the formula below):
X=PE(DFT(A))+Fbank(V)
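The exact form of PE is not specified in the patent; a sinusoidal position encoding in the style of the original Transformer is one plausible choice, sketched here for illustration:

```python
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position encoding, one plausible realisation of PE(.)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    idx = torch.arange(0, d_model, 2, dtype=torch.float)
    angle = pos / torch.pow(10000.0, idx / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe          # added frame by frame to the Fbank features to form X
```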
converting input X to Q, K, V:
Q = XW^Q, K = XW^K, V = XW^V
Attention calculation formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
multi-head attention layer:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
Wherein:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h
Forward propagation layer:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
wherein:
x=norm(X+MultiHead(Q,K,V))
Output of the encoder:
Y=FFN(x)
The calculation process of the decoder is similar to that of the encoder; referring to FIG. 3, its detailed description is omitted here.
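The patent does not state how the decoder combines the two encoder outputs Y1 and Y2; one simple reading is a decoder layer with two successive cross-attention steps, one per encoder, sketched below purely as an illustrative assumption:

```python
import torch
import torch.nn as nn

class DualCrossDecoderLayer(nn.Module):
    """Hypothetical decoder layer attending to both encoder outputs Y1 and Y2."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.cross1 = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.cross2 = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, Y1, Y2, tgt_mask=None):
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        x = self.norms[1](x + self.cross1(x, Y1, Y1)[0])   # attend to encoder 1 output
        x = self.norms[2](x + self.cross2(x, Y2, Y2)[0])   # attend to encoder 2 output
        return self.norms[3](x + self.ffn(x))
```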
3) The most distinctive characteristics of Chinese characters lie in two aspects: the glyph and the pinyin. Chinese characters form a typical logographic script; from their origin, the glyph itself carries part of the semantics. For example, the characters for "river" and "lake" both contain the water radical (three-dot water), indicating that they are all related to water. As for pronunciation, the pinyin of a Chinese character can reflect its meaning to a certain extent and helps to distinguish word senses. For example, the character 乐 has two pronunciations, yuè and lè: the former means "music" and is a noun, while the latter means "happy" and is an adjective. For such a polyphone, if only the character 乐 is input, the model cannot know whether it should mean "music" or "happy"; additional pronunciation information is needed for disambiguation. Based on these two characteristics of Chinese characters, their glyph and pinyin information are integrated into the pre-training process on the Chinese corpus. The glyph vector of a Chinese character is built from several different fonts, and the pinyin vector is derived from the corresponding romanized sequence of pinyin characters. The two are fused together with the character vector to obtain a final fusion vector, which serves as the input of the pre-training model. The model is trained with two strategies, Whole Word Masking and Character Masking, so that it more comprehensively establishes the connections among Chinese characters, glyphs, pronunciations and context.
X = Concat(CE, GE, PYE)W_F + PE
Where CE is the character embedding, GE is the glyph embedding, PYE is the pinyin embedding, PE is the position embedding, W_F is a fully connected layer, X is the BERT input, and Concat denotes vector concatenation.
The fusion layer (Fusion Layer) at the bottom fuses the glyph embedding (Glyph Embedding) and the pinyin embedding (Pinyin Embedding) with the character embedding (Char Embedding) to obtain a fusion embedding (Fusion Embedding), which is then added to the position embedding to form the input of the model. The glyph embedding is built from images of the Chinese character in different fonts. Each image is 24 x 24 in size; three fonts (FangSong, semi-cursive script and clerical script) are vectorized, and the images are concatenated and passed through the fully connected layer W_G to obtain the glyph embedding.
The process is as shown in fig. 5:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G
Where I_1, I_2, I_3 are the glyph images, W_G is a fully connected layer, GE is the glyph embedding, and flatten converts a two-dimensional image into a one-dimensional vector.
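A sketch of the fusion step of equations (10)-(11): the three 24 x 24 glyph images are flattened and projected by W_G, concatenated with the character and pinyin embeddings, projected by W_F, and added to the position embedding (the hidden size is an illustrative assumption):

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """x = Concat(CE, GE, PYE) W_F + PE, with GE built from three 24x24 glyph images."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_g = nn.Linear(3 * 24 * 24, d_model)   # W_G: glyph projection
        self.w_f = nn.Linear(3 * d_model, d_model)   # W_F: fusion projection

    def forward(self, CE, glyph_imgs, PYE, PE):
        # glyph_imgs: (..., 3, 24, 24), one image per font for each character
        GE = self.w_g(glyph_imgs.flatten(-3))                     # eq. (11)
        return self.w_f(torch.cat([CE, GE, PYE], dim=-1)) + PE    # eq. (10)
```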
Pinyin embedding first converts the pinyin of each Chinese character into a sequence of romanized characters, which also encodes the tone, using pypinyin. For example, for the Chinese character for "cat", the pinyin character sequence is "mao1". For polyphones such as 乐, pypinyin can accurately identify the correct pinyin in the current context (a short usage example follows).
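A minimal pypinyin usage example of the tone-numbered romanization described above (the outputs shown in the comments are the expected results):

```python
from pypinyin import pinyin, Style

print(pinyin("猫", style=Style.TONE3))    # expected: [['mao1']]
print(pinyin("音乐", style=Style.TONE3))  # expected: [['yin1'], ['yue4']]  (乐 as "music")
print(pinyin("快乐", style=Style.TONE3))  # expected: [['kuai4'], ['le4']]  (乐 as "happy")
```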
The process is as shown in fig. 6:
PYE = max-pooling(CNN(S))
Where S is a pinyin sequence, max-pooling is maximum pooling, CNN is a convolution calculation, and PYE is the pinyin embedding.
4) The final ASR recognition result is generated in combination with the pre-trained UniLM model. Compared with generation models built on a unidirectional language model, BERT by itself cannot satisfy the requirements of a language model because of its bidirectional decoding; however, the decoding direction can be controlled explicitly through the attention mask (Mask attention), changing it from bidirectional to unidirectional:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3)
Any generation order of x_1, x_2, ..., x_n is possible. In principle each ordering corresponds to one model, so there are n! such sequential language models, each corresponding to perturbing the original lower-triangular mask in some way. Since Attention provides an n x n attention matrix, there is enough freedom to mask this matrix in different ways and thereby realize these diverse orderings, which satisfies the requirements of a language model.
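A minimal sketch of the idea: in a UniLM-style seq2seq mask, the recognized-text "source" positions attend to each other freely, while generated positions attend only to the source and to the already-generated prefix; changing the mask changes the factorization order (the helper below is an illustration, not the patent's exact masking scheme):

```python
import torch

def unilm_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Source tokens see each other fully; target tokens see the source
    plus only the already-generated target prefix (lower triangle)."""
    n = src_len + tgt_len
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :src_len] = True                                        # everyone attends to the source
    tri = torch.tril(torch.ones(tgt_len, tgt_len)).bool()
    mask[src_len:, src_len:] = tri                                  # causal part for generation
    return mask                                                     # True = attention allowed

print(unilm_mask(2, 3).int())
```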
Claims (7)
1. A method for enhancing Chongqing dialect speech recognition by transfer learning, comprising the steps of:
1) Acquiring voice data;
2) Performing Fourier transform on the voice data to obtain a voice spectrogram;
3) Vectorizing the voice spectrogram by utilizing a VGG network to obtain a vector v;
4) Acquiring an input X of a Transformer model; the Transformer model comprises an encoder1, an encoder2 and a decoder;
5) Converting the input X to obtain a parameter Q, a parameter K and a parameter V;
6) Inputting the parameter Q, the parameter K and the parameter V into the encoder1 and the encoder2 of the Transformer model to respectively obtain an encoder output Y1 and an encoder output Y2;
7) Inputting the encoder output Y1 and the encoder output Y2 into the decoder of the Transformer model to obtain a voice recognition text;
8) Determining input x of a pinyin BERT model based on the speech recognition text;
9) Inputting an input x into a pinyin BERT model to obtain a voice recognition result;
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data;
the input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Wherein PE is a position coding function;
the parameters Q, K, V are as follows:
Q = XW^Q, K = XW^K, V = XW^V (3).
2. The method for enhancing Chongqing dialect speech recognition by transfer learning of claim 1, wherein: the voice data includes dialects.
3. The method for enhancing Chongqing dialect speech recognition by transfer learning of claim 1, wherein: the encoder comprises a multi-head attention layer and a forward propagation layer;
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The parameter head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Wherein h is the number of attention heads; W_i^Q, W_i^K, W_i^V are the projection weights of the i-th head;
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V (6)
where sqrt(d_k) is a normalization parameter;
the output FFN(x') of the forward propagation layer is as follows:
FFN(x') = max(0, x'W_1 + b_1)W_2 + b_2 (7)
The input x' of the forward propagation layer is as follows:
x' = norm(X + MultiHead(Q, K, V)) (8)
the output Y of the encoder is as follows:
Y = FFN(x') (9).
4. the method for enhancing Chongqing dialect speech recognition by transfer learning as recited in claim 1, wherein the input x of the Pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W_F + PE' (10)
Wherein, CE represents character embedding; GE represents glyph embedding; PYE represents pinyin embedding; PE' represents position embedding; W_F represents a fully connected layer; Concat denotes vector concatenation.
5. The method for enhancing speech recognition of Chongqing dialect by transfer learning of claim 4, wherein the glyph embedding GE is as follows:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G (11)
Wherein I_1, I_2, I_3 represent glyph images; W_G represents a fully connected layer; flatten converts a two-dimensional image into a one-dimensional vector.
6. The method for enhancing speech recognition of Chongqing dialect by transfer learning of claim 4, wherein the pinyin-embedded PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Wherein S represents a Pinyin sequence; max-pooling represents maximum pooling; CNN represents a convolution calculation.
7. The method for enhancing Chongqing dialect speech recognition by transfer learning as set forth in claim 1, wherein the speech recognition result p(x_1, x_2, x_3, ..., x_n) is as follows:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3) (13)
Where each factor such as p(x_2|x_1) is a conditional probability distribution over the speech recognition text.