CN116416967B - Method for improving Chongqing dialect voice recognition through transfer learning - Google Patents

Method for improving Chongqing dialect voice recognition through transfer learning Download PDF

Info

Publication number
CN116416967B
CN116416967B
Authority
CN
China
Prior art keywords
follows
model
input
pinyin
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111651840.0A
Other languages
Chinese (zh)
Other versions
CN116416967A (en)
Inventor
张美伟
余娟
吕洋
李文沅
余维华
王香霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Chongqing Medical University
Original Assignee
Chongqing University
Chongqing Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University, Chongqing Medical University filed Critical Chongqing University
Priority to CN202111651840.0A priority Critical patent/CN116416967B/en
Publication of CN116416967A publication Critical patent/CN116416967A/en
Application granted granted Critical
Publication of CN116416967B publication Critical patent/CN116416967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for improving Chongqing dialect voice recognition through transfer learning, which comprises the following steps: 1) acquiring speech data; 2) obtaining a speech spectrogram; 3) vectorizing the speech spectrogram to obtain a vector v; 4) acquiring the input X of a Transformer model; 5) inputting the parameters Q, K and V into the encoders of the Transformer model to obtain the encoder outputs Y1 and Y2; 6) inputting the encoder outputs Y1 and Y2 into the decoder of the Transformer model to obtain a speech recognition text; 7) determining the input x of a pinyin BERT model; 8) inputting x into the pinyin BERT model to obtain the speech recognition result. Through a pipeline design, the acoustic model and the language model in ASR are decoupled, enhancing the diversity of ASR model choices.

Description

Method for improving Chongqing dialect voice recognition through transfer learning
Technical Field
The invention relates to the field of speech recognition, and in particular to a method for improving Chongqing dialect voice recognition through transfer learning.
Background
Speech recognition technology dates back to the 1950s and has since achieved good results. Similarly, natural language processing has developed alongside deep learning, evolving from statistical models to deep semantic models, and is now widely applied to classic NLP scenarios such as natural language generation (NLG) and named entity recognition.
Artificial intelligence products are widely used across IT fields. ASR is an important component of artificial intelligence: it lets a computer "understand" human speech. Advances in ASR help people communicate with more AI products and realize human-computer interaction, so that people can enjoy the convenience and efficiency that technological development brings to daily life.
ASR implementations can be classified as pipeline or end-to-end designs; the main difference lies in the recognition unit of the acoustic model. The size of the recognition unit (word model, semi-syllable model, or phoneme model) strongly affects the amount of speech training data required, the recognition rate, and the flexibility of the system. For a speech recognition system with a medium or larger vocabulary, a small recognition unit keeps the computation, model memory, and required training data relatively small, but locating and segmenting the corresponding speech segments is difficult and the recognition rules become more complex. A large recognition unit tends to absorb co-articulation effects into the model, which helps the recognition rate but requires correspondingly more training data.
In summary, statistics-based language models are limited by corpus size, their effect is limited, and statistical information has limited expressive power at the semantic level. In the prior art, the language model is not integrated with the acoustic model, most deep-learning acoustic models adopt CNN or RNN-like structures with limited computational efficiency, and models such as BERT, whose attention mechanism is bidirectional, are of limited use in text generation (NLG) tasks.
Disclosure of Invention
The invention aims to provide a method for improving Chongqing dialect voice recognition through transfer learning, which comprises the following steps:
1) Voice data is acquired. The voice data includes dialects.
2) A Fourier transform is performed on the voice data to obtain a speech spectrogram.
3) The speech spectrogram is vectorized with a VGG network to obtain a vector v.
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data.
4) An input X of the Transformer model is obtained. The Transformer model includes encoder1, encoder2, and a decoder.
The input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Where PE is a position-coding function and Fbank() represents a speech feature extraction operation.
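As an illustration of eqs (1)-(2), the following is a minimal Python sketch of the input pipeline. The vgg_features and fbank functions are hypothetical stand-ins (a single linear map and an identity, respectively) for the VGG network and the filterbank extractor; only the spectrogram and a sinusoidal position coding are computed for real, and all dimensions are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def spectrogram(audio, sr=16000, n_fft=400, hop=160):
    # DFT(A): magnitude spectrogram of the waveform, as in eq (1)
    _, _, Z = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(Z).T                                    # (frames, freq_bins)

def positional_encoding(n_pos, d_model):
    # PE(.): sinusoidal position coding used in eq (2)
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

def vgg_features(spec, d_model):
    # hypothetical stand-in for v = VGG(DFT(A)) in eq (1): one linear projection
    rng = np.random.default_rng(0)
    return spec @ (rng.normal(size=(spec.shape[1], d_model)) * 0.01)

def fbank(v):
    # hypothetical stand-in for the Fbank feature extraction in eq (2)
    return v

audio = np.random.default_rng(1).normal(size=16000)       # one second of dummy audio
spec = spectrogram(audio)
v = vgg_features(spec, d_model=64)
X = positional_encoding(v.shape[0], 64) + fbank(v)        # eq (2): Transformer input
```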
5) The input X is converted to obtain the parameters Q, K and V:
Q = XW^Q, K = XW^K, V = XW^V (3)
6) The parameters Q, K and V are input into the encoder1 and the encoder2 of the Transformer model to obtain the encoder outputs Y1 and Y2, respectively.
Each encoder includes a multi-head attention layer and a forward propagation layer.
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The head head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Where h is the number of attention heads; W_i^Q, W_i^K and W_i^V are the weights of the i-th head.
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T/√d_k)V (6)
Where √d_k is a normalization parameter.
The output FFN(x) of the forward propagation layer is as follows:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (7)
The input x of the forward propagation layer is as follows:
x = norm(X + MultiHead(Q, K, V)) (8)
The output Y of the encoder is as follows:
Y = FFN(x) (9)
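The following is a minimal numpy sketch of one encoder layer implementing eqs (3)-(9); the weights are random stand-ins and the hyper-parameters (d_model = 64, 4 heads, etc.) are illustrative assumptions, not values from the patent.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # norm(.) in eq (8)
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attention(Q, K, V):
    # eq (6): scaled dot-product attention; sqrt(d_k) is the normalization parameter
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def encoder_layer(X, p):
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]                     # eq (3)
    heads = [attention(Q @ h["W_iQ"], K @ h["W_iK"], V @ h["W_iV"])        # eq (5)
             for h in p["heads"]]
    multi = np.concatenate(heads, axis=-1) @ p["W_O"]                      # eq (4)
    x = layer_norm(X + multi)                                              # eq (8)
    return np.maximum(0, x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]    # eqs (7), (9)

rng = np.random.default_rng(0)
d_model, d_k, n_heads, d_ff, T = 64, 16, 4, 128, 10
p = {k: rng.normal(size=s) * 0.1 for k, s in
     [("W_Q", (d_model, d_model)), ("W_K", (d_model, d_model)), ("W_V", (d_model, d_model)),
      ("W_O", (n_heads * d_k, d_model)), ("W_1", (d_model, d_ff)), ("W_2", (d_ff, d_model))]}
p["b_1"], p["b_2"] = np.zeros(d_ff), np.zeros(d_model)
p["heads"] = [{k: rng.normal(size=(d_model, d_k)) * 0.1 for k in ("W_iQ", "W_iK", "W_iV")}
              for _ in range(n_heads)]
Y = encoder_layer(rng.normal(size=(T, d_model)), p)   # Y as in eq (9); in the patent, Y1 and Y2
print(Y.shape)                                        # come from two such encoders -> (10, 64)
```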
7) The encoder outputs Y1 and Y2 are input into the decoder of the Transformer model to obtain the speech recognition text.
8) Based on the speech recognition text, an input x of the pinyin BERT model is determined.
The input x of the pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W^F + PE (10)
Where CE denotes the character embedding, GE the glyph embedding, PYE the pinyin embedding, and PE the position embedding; W^F is a fully connected layer and Concat denotes vector concatenation.
The glyph embedding GE is shown below:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W^G (11)
Where I_1, I_2 and I_3 are glyph images of the character in different fonts, W^G is a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
Pinyin embedding PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Where S is the pinyin character sequence, max-pooling denotes max pooling, and CNN denotes a convolution operation.
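As an illustration of eqs (10)-(12), here is a minimal numpy sketch of the fusion input for one character. The embedding size, the pinyin-character embeddings and all weight matrices are random illustrative stand-ins; only the 24 x 24 glyph image size is taken from the description.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                                   # hypothetical embedding size

# Glyph embedding, eq (11): three 24 x 24 font images, flattened, concatenated, projected by W_G
I1, I2, I3 = (rng.random((24, 24)) for _ in range(3))      # stand-ins for the three font bitmaps
W_G = rng.normal(size=(3 * 24 * 24, d)) * 0.01
GE = np.concatenate([I1.ravel(), I2.ravel(), I3.ravel()]) @ W_G

# Pinyin embedding, eq (12): width-2 1-D convolution over the pinyin character sequence S,
# followed by max-pooling over positions
S = rng.random((8, 16))                  # 8 romanized pinyin characters, 16-dim char embeddings
kernel = rng.normal(size=(2, 16, d)) * 0.1
conv = np.stack([(S[t:t + 2][..., None] * kernel).sum(axis=(0, 1))
                 for t in range(S.shape[0] - 1)])
PYE = conv.max(axis=0)

# Fusion, eq (10): concatenate char, glyph and pinyin embeddings, project by W_F, add position PE
CE, PE = rng.normal(size=d), rng.normal(size=d)
W_F = rng.normal(size=(3 * d, d)) * 0.1
x = np.concatenate([CE, GE, PYE]) @ W_F + PE
print(x.shape)                           # (32,)
```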
9) The input x is fed into the pinyin BERT model to obtain the speech recognition result.
The speech recognition result p(x1, x2, x3, ..., xn) is as follows:
p(x1, x2, ..., xn) = p(x1)p(x2|x1)p(x3|x1, x2)...p(xn|x1, x2, ..., xn-1)
= p(x3)p(x1|x3)p(x2|x3, x1)...p(xn|x3, x1, ..., xn-1)
= ...
= p(xn-1)p(x1|xn-1)p(xn|xn-1, x1)...p(x2|xn-1, x1, ..., x3) (13)
Where p(x2|x1) denotes a conditional probability of the speech recognition text.
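To make eq (13) concrete, the following toy numpy check (illustrative only, not part of the method) verifies that two different chain-rule factorization orders of a small joint distribution give the same joint probability:

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))            # toy joint distribution over three binary variables
joint /= joint.sum()

def cond(joint, target, given):
    """Conditional table p(x_target | x_given) obtained by marginalizing and normalizing."""
    axes = tuple(i for i in range(joint.ndim) if i != target and i not in given)
    marg = joint.sum(axis=axes, keepdims=True)
    return marg / marg.sum(axis=target, keepdims=True)

x = (1, 0, 1)                            # one concrete outcome (x1, x2, x3)

# order x1 -> x2 -> x3:  p(x1) p(x2|x1) p(x3|x1,x2)
p_a = (cond(joint, 0, ())[x[0], 0, 0]
       * cond(joint, 1, (0,))[x[0], x[1], 0]
       * cond(joint, 2, (0, 1))[x[0], x[1], x[2]])

# order x3 -> x1 -> x2:  p(x3) p(x1|x3) p(x2|x3,x1)
p_b = (cond(joint, 2, ())[0, 0, x[2]]
       * cond(joint, 0, (2,))[x[0], 0, x[2]]
       * cond(joint, 1, (2, 0))[x[0], x[1], x[2]])

print(np.isclose(p_a, joint[x]), np.isclose(p_b, joint[x]))   # True True: both orders give the joint
```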
The technical effect of the invention is that, by replacing the statistical language model in ASR with a model pre-trained on a large-scale corpus, semantic-level information can be captured more comprehensively; by decoupling the acoustic model through a pipeline design, the diversity of ASR model choices is enhanced.
By placing position embeddings in the acoustic model, the acoustic model gains a certain language-modeling capability, while its effectiveness in extracting acoustic information and completing decoding is enhanced.
By introducing pinyin and glyph embeddings, which match characteristics of Chinese ASR errors such as identical initials, identical finals, and identical pronunciations, the invention captures linguistic information from multiple perspectives and improves the accuracy of the language model during decoding.
The invention applies the UniLM model to the ASR scenario, leveraging the effectiveness of the UniLM algorithm in text generation tasks to improve the accuracy of ASR decoding.
Motivated by the remarkable performance of pre-training methods on a wide range of NLP tasks in recent years, the invention uses a Transformer as the acoustic model to obtain a preliminary ASR result, and then combines it with a pre-trained language model (UniLM, obtained through pinyin-aware pre-training on a corpus of the target language scenario) to produce the final ASR output.
Drawings
FIG. 1 is a speech recognition process;
FIG. 2 is a speech feature processing flow;
FIG. 3 is a diagram of the Transformer structure;
FIG. 4 shows the input arrangement;
FIG. 5 is an input information fusion;
FIG. 6 is a Pinyin embedding.
Detailed Description
The present invention is further described below with reference to examples, but this should not be construed as limiting the scope of the above subject matter of the invention to the following examples. Various substitutions and alterations made according to ordinary knowledge and customary means in the art, without departing from the technical spirit of the invention, are all intended to be included within the scope of the invention.
Example 1:
Referring to FIGS. 1 to 6, a method for improving Chongqing dialect speech recognition through transfer learning comprises the following steps:
1) Voice data is acquired. The voice data includes dialects.
2) A Fourier transform is performed on the voice data to obtain a speech spectrogram.
3) The speech spectrogram is vectorized with a VGG network to obtain a vector v.
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data.
4) An input X of the Transformer model is obtained. The Transformer model includes encoder1, encoder2, and a decoder.
The input X of the Transformer is as follows:
X=PE(DFT(A))+Fbank(v) (2)
Where PE is a position-coding function and Fbank() represents a speech feature extraction operation.
5) The input X is converted to obtain the parameters Q, K and V:
Q = XW^Q, K = XW^K, V = XW^V (3)
6) The parameters Q, K and V are input into the encoder1 and the encoder2 of the Transformer model to obtain the encoder outputs Y1 and Y2, respectively.
Each encoder includes a multi-head attention layer and a forward propagation layer.
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
The head head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
Where h is the number of attention heads; W_i^Q, W_i^K and W_i^V are the weights of the i-th head.
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T/√d_k)V (6)
Where √d_k is a normalization parameter.
The output FFN(x) of the forward propagation layer is as follows:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (7)
The input x of the forward propagation layer is as follows:
x = norm(X + MultiHead(Q, K, V)) (8)
The output Y of the encoder is as follows:
Y = FFN(x) (9)
7) The encoder outputs Y1 and Y2 are input into the decoder of the Transformer model to obtain the speech recognition text.
8) Based on the speech recognition text, an input x of the pinyin BERT model is determined.
The input x of the pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W^F + PE (10)
Where CE denotes the character embedding, GE the glyph embedding, PYE the pinyin embedding, and PE the position embedding; W^F is a fully connected layer and Concat denotes vector concatenation.
The glyph embedding GE is shown below:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W^G (11)
Where I_1, I_2 and I_3 are glyph images of the character in different fonts, W^G is a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
Pinyin embedding PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Where S is the pinyin character sequence, max-pooling denotes max pooling, and CNN denotes a convolution operation.
9) The input x is fed into the pinyin BERT model to obtain the speech recognition result.
The speech recognition result p(x1, x2, x3, ..., xn) is as follows:
p(x1, x2, ..., xn) = p(x1)p(x2|x1)p(x3|x1, x2)...p(xn|x1, x2, ..., xn-1)
= p(x3)p(x1|x3)p(x2|x3, x1)...p(xn|x3, x1, ..., xn-1)
= ...
= p(xn-1)p(x1|xn-1)p(xn|xn-1, x1)...p(x2|xn-1, x1, ..., x3) (13)
Where p(x2|x1) denotes a conditional probability of the speech recognition text.
Example 2:
A method for enhancing Chongqing dialect speech recognition by transfer learning, comprising the steps of:
1) From the audio, signal processing and a Fourier transform are used to obtain the spectrogram of a single audio file, and a vector representation of the whole spectrogram is extracted through a VGG network.
The formula can be expressed as:
V = VGG(DFT(A))
where A is the audio file, DFT is the discrete Fourier transform, VGG is the VGG network, and V is the vector representation output by the VGG.
2) From the spectrogram, the position information of each spectral unit in the original image is obtained, vectorized by an embedding, and input into the Transformer together with the Fbank features.
The calculation flow and formulas of the encoder are as follows.
The Transformer input X consists of two parts, a position code and the Fbank features, where PE is the position coding function:
X=PE(DFT(A))+Fbank(V)
Converting the input X into Q, K and V:
Q = XW^Q, K = XW^K, V = XW^V
Attention calculation formula:
Attention(Q, K, V) = softmax(QK^T/√d_k)V
Multi-head attention layer:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
wherein:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h
Forward propagation layer:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
wherein:
x = norm(X + MultiHead(Q, K, V))
Output of the encoder:
Y = FFN(x)
The calculation process of the decoder is similar to that of the encoder, and a detailed description is omitted; refer to FIG. 3.
3) The most distinctive characteristics of Chinese characters lie in two aspects: glyph and pinyin. Chinese characters are typical ideographs; from their origin, the glyph itself carries part of the semantics. For example, the characters for "river" and "lake" both contain the "three drops of water" radical, indicating that both are related to water. In terms of pronunciation, the pinyin of a character can to some extent reflect its meaning and helps to distinguish word senses. For example, the character 乐 has two pronunciations, yuè and lè: the former means "music" and is a noun; the latter means "happy" and is an adjective. For such a polyphonic character, if only the character itself is given, the model cannot know whether it should mean "music" or "happy"; additional pronunciation information is needed for disambiguation. Based on these two characteristics of Chinese characters, the glyph and pinyin information of each character is integrated into the pre-training process on the Chinese corpus. The glyph vector of a character is formed from several different fonts, and the pinyin vector is derived from the corresponding romanized pinyin character sequence. The two are fused with the character vector to obtain the final fusion vector, which serves as the input of the pre-training model. The model is trained with two masking strategies, Whole Word Masking and Character Masking, so that it establishes more comprehensive connections among Chinese characters, glyphs, pronunciations, and context.
X = Concat(CE, GE, PYE)W^F + PE
where CE is the character embedding, GE the glyph embedding, PYE the pinyin embedding, PE the position embedding, W^F a fully connected layer, X the BERT input, and Concat vector concatenation.
The fusion layer at the bottom fuses the glyph embedding and the pinyin embedding with the character embedding to obtain a fusion embedding, which is then added to the position embedding to form the input of the model. The glyph embedding is built from images of the Chinese character rendered in different fonts. Each image is 24 x 24 pixels; the character is vectorized in three fonts (FangSong, semi-cursive regular script, and clerical script), the images are concatenated, and the glyph embedding is obtained after passing through the fully connected layer W^G.
The process is as shown in fig. 5:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W^G
where I_1, I_2, I_3 are the font images, W^G is a fully connected layer, GE is the glyph embedding, and flatten converts a two-dimensional image into a one-dimensional vector.
For the pinyin embedding, the pinyin of each Chinese character is first converted into a sequence of romanized characters (including the tone) using pypinyin. For example, for the character 猫 ("cat"), the pinyin character sequence is "mao1". For polyphonic characters such as 乐 ("happy"/"music"), pypinyin can identify the correct pinyin from the current context quite accurately.
The process is as shown in fig. 6:
PYE = max-pooling(CNN(S))
where S is the pinyin character sequence, CNN is a convolution operation, max-pooling is max pooling, and PYE is the pinyin embedding.
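A minimal usage sketch of the pypinyin conversion described above (the outputs shown in the comments are what the library is expected to return, assuming pypinyin is installed):

```python
from pypinyin import pinyin, Style

print(pinyin("猫", style=Style.TONE3))     # expected: [['mao1']]
# For the polyphone 乐, the surrounding word lets pypinyin pick the right reading:
print(pinyin("音乐", style=Style.TONE3))   # expected: [['yin1'], ['yue4']]  ("music")
print(pinyin("快乐", style=Style.TONE3))   # expected: [['kuai4'], ['le4']]  ("happy")
```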
4) The final ASR result is generated in combination with the pre-trained UniLM model. Compared with a generation model built on a conventional language model, BERT alone cannot satisfy the language-model requirement because its decoding is bidirectional; however, by controlling the decoding direction with attention masks, the attention can be changed from bidirectional to unidirectional:
p(x1, x2, ..., xn) = p(x1)p(x2|x1)p(x3|x1, x2)...p(xn|x1, x2, ..., xn-1)
= p(x3)p(x1|x3)p(x2|x3, x1)...p(xn|x3, x1, ..., xn-1)
= ...
= p(xn-1)p(x1|xn-1)p(xn|xn-1, x1)...p(x2|xn-1, x1, ..., x3)
Any generation order of x1, x2, ..., xn is possible. In principle each order corresponds to one model, so there are n! sequential language models; implementing one of them amounts to perturbing the original lower-triangular mask in a particular way. Since attention provides an n×n attention matrix, there are enough degrees of freedom to mask this matrix in different ways and thereby obtain diverse decoding orders, which satisfies the requirements of a language model.
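As a sketch of the masking idea just described (illustrative, with a hypothetical helper name): for a chosen generation order, position j is visible to position i only if j is generated no later than i, so the usual lower-triangular mask is simply the special case of the left-to-right order.

```python
import numpy as np

def order_mask(order):
    """n x n attention mask for one generation order (1 = attend allowed, 0 = masked)."""
    n = len(order)
    rank = np.empty(n, dtype=int)
    rank[np.asarray(order)] = np.arange(n)     # rank[p]: step at which position p is generated
    return (rank[None, :] <= rank[:, None]).astype(int)

print(order_mask([0, 1, 2, 3]))   # lower-triangular mask: ordinary left-to-right language model
print(order_mask([2, 0, 3, 1]))   # one of the n! permuted orders obtained by re-masking attention
```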

Claims (7)

1. A method for enhancing Chongqing dialect speech recognition by transfer learning, comprising the steps of:
1) Acquiring voice data;
2) Performing a Fourier transform on the voice data to obtain a speech spectrogram;
3) Vectorizing the speech spectrogram by utilizing a VGG network to obtain a vector v;
4) Acquiring an input X of a Transformer model; the Transformer model comprises an encoder1, an encoder2 and a decoder;
5) Converting the input X to obtain a parameter Q, a parameter K and a parameter V;
6) Inputting the parameter Q, the parameter K and the parameter V into the encoder1 and the encoder2 of the Transformer model to obtain an encoder output Y1 and an encoder output Y2, respectively;
7) Inputting the encoder output Y1 and the encoder output Y2 into the decoder of the Transformer model to obtain a speech recognition text;
8) Determining input x of a pinyin BERT model based on the speech recognition text;
9) Inputting an input x into a pinyin BERT model to obtain a voice recognition result;
Vector v is as follows:
v=VGG(DFT(A)) (1)
Wherein A is voice data;
the input X of the Transformer is as follows:
X = PE(DFT(A)) + Fbank(v) (2)
Wherein PE is a position coding function;
the parameters Q, K and V are as follows:
Q = XW^Q, K = XW^K, V = XW^V (3).
2. The method for enhancing Chongqing dialect speech recognition by transfer learning of claim 1, wherein: the voice data includes dialects.
3. The method for enhancing Chongqing dialect speech recognition by transfer learning of claim 1, wherein: the encoder comprises a multi-head attention layer and a forward propagation layer;
The output MultiHead(Q, K, V) of the multi-head attention layer is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O (4)
the head head_i is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h (5)
wherein h is the number of attention heads; W_i^Q, W_i^K and W_i^V are the weights of the i-th head;
Attention(Q, K, V) is as follows:
Attention(Q, K, V) = softmax(QK^T/√d_k)V (6)
wherein √d_k is a normalization parameter;
the output FFN(x') of the forward propagation layer is as follows:
FFN(x') = max(0, x'W_1 + b_1)W_2 + b_2 (7)
the input x' of the forward propagation layer is as follows:
x' = norm(X + MultiHead(Q, K, V)) (8)
the output Y of the encoder is as follows:
Y = FFN(x') (9).
4. the method for enhancing Chongqing dialect speech recognition by transfer learning as recited in claim 1, wherein the input x of the Pinyin BERT model is as follows:
x = Concat(CE, GE, PYE)W^F + PE' (10)
Wherein CE represents the character embedding; GE represents the glyph embedding; PYE represents the pinyin embedding; PE' represents the position embedding; W^F represents a fully connected layer; Concat denotes vector concatenation.
5. The method for enhancing speech recognition of Chongqing dialect by transfer learning of claim 4, wherein the glyph embedding GE is as follows:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W^G (11)
Wherein I_1, I_2 and I_3 represent glyph images; W^G represents a fully connected layer; flatten converts a two-dimensional image into a one-dimensional vector.
6. The method for enhancing speech recognition of Chongqing dialect by transfer learning of claim 4, wherein the pinyin-embedded PYE is as follows:
PYE=max-pooling(CNN(S)) (12)
Wherein S represents a Pinyin sequence; max-pooling represents maximum pooling; CNN represents a convolution calculation.
7. The method for enhancing Chongqing dialect speech recognition by transfer learning as set forth in claim 1, wherein the speech recognition result p(x1, x2, x3, ..., xn) is as follows:
p(x1, x2, ..., xn) = p(x1)p(x2|x1)p(x3|x1, x2)...p(xn|x1, x2, ..., xn-1)
= p(x3)p(x1|x3)p(x2|x3, x1)...p(xn|x3, x1, ..., xn-1)
= ...
= p(xn-1)p(x1|xn-1)p(xn|xn-1, x1)...p(x2|xn-1, x1, ..., x3) (13)
Where p(x2|x1) denotes a conditional probability of the speech recognition text.
CN202111651840.0A 2021-12-30 2021-12-30 Method for improving Chongqing dialect voice recognition through transfer learning Active CN116416967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111651840.0A CN116416967B (en) 2021-12-30 2021-12-30 Method for improving Chongqing dialect voice recognition through transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111651840.0A CN116416967B (en) 2021-12-30 2021-12-30 Method for improving Chongqing dialect voice recognition through transfer learning

Publications (2)

Publication Number Publication Date
CN116416967A CN116416967A (en) 2023-07-11
CN116416967B true CN116416967B (en) 2024-09-24

Family

ID=87053265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111651840.0A Active CN116416967B (en) 2021-12-30 2021-12-30 Method for improving Chongqing dialect voice recognition through transfer learning

Country Status (1)

Country Link
CN (1) CN116416967B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416968B (en) * 2021-12-30 2024-09-24 Chongqing University Chongqing dialect voice recognition method using a Transformer composed of double encoders
CN118038851A (en) * 2023-12-27 2024-05-14 暗物质(北京)智能科技有限公司 Multiparty speech recognition method, system, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679491A (en) * 2017-09-29 2018-02-09 华中师范大学 A 3D convolutional neural network sign language recognition method fusing multi-modal data
CN112151030A (en) * 2020-09-07 2020-12-29 中国人民解放军军事科学院国防科技创新研究院 Multi-mode-based complex scene voice recognition method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11238845B2 (en) * 2018-11-21 2022-02-01 Google Llc Multi-dialect and multilingual speech recognition
US10916242B1 (en) * 2019-08-07 2021-02-09 Nanjing Silicon Intelligence Technology Co., Ltd. Intent recognition method based on deep learning network
CN110232439B (en) * 2019-08-07 2019-12-24 南京硅基智能科技有限公司 Intention identification method based on deep learning network
CN111798871B (en) * 2020-09-08 2020-12-29 共道网络科技有限公司 Session link identification method, device and equipment and storage medium
CN112418011A (en) * 2020-11-09 2021-02-26 腾讯科技(深圳)有限公司 Method, device and equipment for identifying integrity of video content and storage medium
CN112767958B (en) * 2021-02-26 2023-12-26 华南理工大学 Zero-order learning-based cross-language tone conversion system and method
CN113723166A (en) * 2021-03-26 2021-11-30 腾讯科技(北京)有限公司 Content identification method and device, computer equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679491A (en) * 2017-09-29 2018-02-09 华中师范大学 A 3D convolutional neural network sign language recognition method fusing multi-modal data
CN112151030A (en) * 2020-09-07 2020-12-29 中国人民解放军军事科学院国防科技创新研究院 Multi-mode-based complex scene voice recognition method and device

Also Published As

Publication number Publication date
CN116416967A (en) 2023-07-11

Similar Documents

Publication Publication Date Title
CN110534089B (en) Chinese speech synthesis method based on phoneme and prosodic structure
CN111837178B (en) Speech processing system and method for processing speech signal
US11538455B2 (en) Speech style transfer
CN113439301A (en) Reconciling between analog data and speech recognition output using sequence-to-sequence mapping
CN116416967B (en) Method for improving Chongqing dialect voice recognition through transfer learning
Zhang et al. Understanding pictograph with facial features: end-to-end sentence-level lip reading of Chinese
WO2019161011A1 (en) Speech style transfer
KR20210146089A (en) Method for generating multi persona model and providing for conversation styling using the multi persona model
Yu et al. Acoustic modeling based on deep learning for low-resource speech recognition: An overview
JP7112075B2 (en) Front-end training method for speech synthesis, computer program, speech synthesis system, and front-end processing method for speech synthesis
Abdelhamid et al. End-to-end arabic speech recognition: A review
Hasegawa-Johnson et al. Image2speech: Automatically generating audio descriptions of images
Abdelmaksoud et al. Convolutional neural network for arabic speech recognition
CN112990353A (en) Chinese character confusable set construction method based on multi-mode model
Zhao et al. End-to-end-based Tibetan multitask speech recognition
CN112420050B (en) Voice recognition method and device and electronic equipment
KR20210122070A (en) Voice synthesis apparatus and method for 'Call me' service using language feature vector
CN115101046A (en) Method and device for synthesizing voice of specific speaker
Effendi et al. End-to-end image-to-speech generation for untranscribed unknown languages
Suyanto et al. End-to-End speech recognition models for a low-resourced Indonesian Language
Shivakumar et al. A study on impact of language model in improving the accuracy of speech to text conversion system
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN106971721A (en) A kind of accent speech recognition system based on embedded mobile device
Naderi et al. Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method
Huzaifah et al. An analysis of semantically-aligned speech-text embeddings

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant