CN110751944B - Method, device, equipment and storage medium for constructing voice recognition model - Google Patents
Method, device, equipment and storage medium for constructing a voice recognition model
- Publication number: CN110751944B
- Application number: CN201910884620.9A
- Authority
- CN
- China
- Prior art keywords
- voice
- recognition model
- layer
- residual
- voice information
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application relates to the field of artificial intelligence and provides a method, device, equipment and storage medium for constructing a voice recognition model, wherein the method comprises the following steps: acquiring a plurality of training voice samples; constructing a voice recognition model from an independent convolution layer, a convolution residual layer, a full connection layer and an output layer; inputting the training voice information into the voice recognition model, and updating the neuron weights of the voice recognition model through natural language processing (NLP) technology, the voice information, and the text labels corresponding to the voice information, to obtain a target model; evaluating the error of the target model by $L(S) = -\ln \prod_{(h(x),z)\in S} p(z|h(x)) = -\sum_{(h(x),z)\in S} \ln p(z|h(x))$; adjusting the neuron weights of the target model until the error is smaller than a threshold, and taking the neuron weights whose error is smaller than the threshold as the ideal weights; and deploying the target model and the ideal weights to a client. The influence of tone in the voice information on the predicted text is reduced, and the amount of computation in the recognition process of the voice recognition model is reduced.
Description
Technical Field
The present application relates to the field of intelligent decision making, and in particular, to a method, apparatus, device, and storage medium for constructing a speech recognition model.
Background
Speech recognition is used to convert speech into text. With the continuous development of deep learning technology, the application range of speech recognition is also becoming wider and wider.
Currently, deep neural networks (DNN) have become a research hotspot in the field of automatic speech recognition. Convolutional neural networks (CNN) and recurrent neural networks (RNN) perform well in building speech recognition models, and deep learning has become the mainstream approach to speech recognition.
In deep neural networks, the depth of the network is often closely related to recognition accuracy, because a traditional deep neural network extracts multi-level features at low, middle and high levels (low/mid/high-level): the more layers the network has, the richer the extracted features. However, as the network deepens, a degradation phenomenon appears, so that the accuracy of voice recognition quickly saturates, and beyond that point deeper networks yield higher error rates. In addition, existing speech recognition models need to align the speech training samples before training, matching each frame of speech data with its corresponding label, to ensure that the loss function used in training can accurately estimate the training error of the speech recognition model. This alignment process is tedious and requires significant time and cost.
Disclosure of Invention
According to the method, the characteristics of unlabeled data are acquired and introduced into supervised learning, so that the usable sample data are expanded, the utilization efficiency of the unlabeled data is improved, and the accuracy of model prediction is improved.
In a first aspect, the present application provides a method for constructing a speech recognition model, including:
acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
The method comprises the steps that a voice recognition model is built through an independent convolution layer, a convolution residual layer, a full connection layer and an output layer, wherein the convolution residual layer comprises a plurality of residual stacking layers which are sequentially connected, each residual stacking layer comprises a plurality of residual modules which are sequentially connected, each residual module comprises a plurality of hidden layers which are sequentially connected and a bypass channel which bypasses a plurality of weight layers which are sequentially connected;
sequentially inputting a plurality of the voice samples into the voice recognition model, taking the voice information and the text labels corresponding to the voice information as the input and output of the voice recognition model respectively, and continuously training the neuron weights of the voice recognition model through the input and output until all of the voice samples have been input into the voice recognition model, whereupon training of the voice recognition model ends; after training ends, taking the voice recognition model with the trained neuron weights as a target model;
evaluating the error of the target model by $L(S) = -\ln \prod_{(h(x),z)\in S} p(z|h(x)) = -\sum_{(h(x),z)\in S} \ln p(z|h(x))$, wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training voice samples, and the predicted text refers to the text information calculated and output by the target model according to the neuron weights after the voice information is input into the target model;
adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight;
and deploying the target model and the ideal weight to a client.
In some possible designs, before the plurality of speech samples are input into the speech recognition model, the method further comprises:
the training voice information is processed in frames according to preset framing parameters, sentences corresponding to the training voice information are obtained, and the preset framing parameters comprise frame duration, frame number and front and back frame repetition duration;
And converting the statement according to a preset two-dimensional parameter and a filter bank characteristic extraction algorithm to obtain two-dimensional voice information.
In some possible designs, the framing of the training speech information according to the preset framing parameters includes:
performing discrete Fourier transform on the two-dimensional voice information to obtain a linear spectrum X(k) corresponding to the two-dimensional voice information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\,\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the frequency range of the band-pass filter, $f_h$ is the highest frequency of the frequency range of the band-pass filter, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, $F_{mel}$ is $F_{mel}(f)=1125\ln(1+f/700)$, and the inverse function of $F_{mel}$ is $F_{mel}^{-1}(b)=700(e^{b/1125}-1)$, b being an integer;

according to $E(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^2 H_m(k)\right)$ for $0\le m\le M$, calculating the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
In some possible designs, the fully connected layer includes a classification function, which refers to $\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\ j=1,\dots,K$, where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolution residual layer into another K-dimensional real vector $\delta(z)_j$, such that each element lies in the range (0, 1) and all elements sum to 1.
In some possible designs, the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolution layer and w_s is the weight of the residual module.
In some possible designs, F(x, w_i) employs a ReLU function as the activation function of the independent convolution layer, the mathematical expression of the ReLU function being ReLU(x) = max(0, x).
In some possible designs, the adjusting weights of neurons of the target model includes:
and adjusting the weights of the neurons by a stochastic gradient descent method.
In a second aspect, the present application provides an apparatus for constructing a speech recognition model, having the function of implementing the method for constructing a speech recognition model provided in the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function described above, which may be software and/or hardware.
The device for constructing the voice recognition model comprises:
The device comprises an acquisition module and a processing module, wherein the acquisition module is configured to acquire a plurality of training voice samples, and the training voice samples comprise voice information and text labels corresponding to the voice information;
a processing module, configured to construct a speech recognition model from an independent convolution layer, a convolution residual layer, a full connection layer and an output layer, wherein the convolution residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel bypassing the sequentially connected weight layers; to sequentially input the voice samples into the speech recognition model through an input/output module, taking the voice information and the text labels corresponding to the voice information as the input and output of the speech recognition model respectively, and continuously train the neuron weights of the speech recognition model through the input and output until all of the voice samples have been input, after which the speech recognition model with the trained neuron weights is taken as a target model; and to evaluate the error of the target model by $L(S) = -\ln \prod_{(h(x),z)\in S} p(z|h(x)) = -\sum_{(h(x),z)\in S} \ln p(z|h(x))$, wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity of the predicted text to the text label, S is the plurality of training voice samples, and the predicted text is the text information calculated and output by the target model according to the neuron weights after the voice information is input into the target model;
And adjusting the weight of the neuron of the target model until the error is smaller than a threshold value, setting the weight of the neuron with the error smaller than the threshold value as an ideal weight, and deploying the target model and the ideal weight to a client.
In some possible designs, the processing module is further to:
the training voice information is processed in frames according to preset framing parameters, sentences corresponding to the training voice information are obtained, and the preset framing parameters comprise frame duration, frame number and front and back frame repetition duration;
And converting the statement according to a preset two-dimensional parameter and a filter bank characteristic extraction algorithm to obtain two-dimensional voice information.
In some possible designs, the processing module is further to:
performing discrete Fourier transform on the two-dimensional voice information to obtain a linear spectrum X(k) corresponding to the two-dimensional voice information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\,\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the frequency range of the band-pass filter, $f_h$ is the highest frequency of the frequency range of the band-pass filter, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, $F_{mel}$ is $F_{mel}(f)=1125\ln(1+f/700)$, and the inverse function of $F_{mel}$ is $F_{mel}^{-1}(b)=700(e^{b/1125}-1)$, b being an integer;

according to $E(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^2 H_m(k)\right)$ for $0\le m\le M$, calculating the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
In some possible designs, the fully connected layer includes a classification function, which refers to $\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\ j=1,\dots,K$, where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolution residual layer into another K-dimensional real vector $\delta(z)_j$, such that each element lies in the range (0, 1) and all elements sum to 1.
In some possible designs, the processing module is further configured such that: the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolution layer and w_s is the weight of the residual module.
In some possible designs, F(x, w_i) employs a ReLU function as the activation function of the independent convolution layer, the mathematical expression of the ReLU function being ReLU(x) = max(0, x).
In some possible designs, the adjusting weights of neurons of the target model includes:
and adjusting the weights of the neurons by a stochastic gradient descent method.
In a further aspect, the present application provides an apparatus for constructing a speech recognition model, which comprises at least one connected processor, a memory, and an input-output unit, wherein the memory is used for storing program codes, and the processor is used for calling the program codes in the memory to execute the method in the above aspects.
In yet another aspect, the application provides a computer storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of the above aspects.
According to the application, the input information x is passed directly to the output of the hidden layers through the bypass channel. The bypass channel carries no weights, which protects the integrity of the input information x and allows deeper neural networks to be trained: the whole network only needs to train on the difference between input and output, i.e., after the input information x is passed through, each residual module only learns the residual F(x). This simplifies the training objective and difficulty, makes the neural network stable and easy to train, and allows the performance of the voice recognition model to improve gradually as the depth of the network increases. The predicted text of the voice recognition model is evaluated with a CTC loss function, so the precise mapping between the pronunciation phonemes in the text label and the sequence of the training voice information need not be considered; the voice recognition model can be trained with only an input sequence and an output sequence, saving the production cost of the training voice sample set. In addition, a triangular band-pass filter is used to smooth the spectrum of the training voice information, eliminating harmonics in the training voice information and highlighting the formants of the original sound, which avoids the influence of tone in the voice information on the predicted text and reduces the amount of computation on the voice information during recognition.
Drawings
FIG. 1 is a flow chart of a method for constructing a speech recognition model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an apparatus for constructing a speech recognition model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for constructing a speech recognition model according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. The terms "first", "second" and the like in the description, the claims and the above figures are used to distinguish between similar elements and not necessarily to describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion: a process, method, system, article or apparatus that comprises a list of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules not listed or inherent to such process, method, article or apparatus. The partitioning of modules in the present application is only one logical partitioning; other partitionings are possible in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not implemented.
In order to solve the technical problems, the application mainly provides the following technical scheme:
According to the application, the input information x is passed directly to the output of the hidden layers through the bypass channel. The bypass channel carries no weights, which protects the integrity of the input information x and allows deeper neural networks to be trained: the whole network only needs to train on the difference between input and output, i.e., after the input information x is passed through, each residual module only learns the residual F(x). This simplifies the training objective and difficulty, makes the neural network stable and easy to train, and allows the performance of the voice recognition model to improve gradually as the depth of the network increases. The predicted text of the voice recognition model is evaluated with a CTC loss function, so the precise mapping between the pronunciation phonemes in the text label and the sequence of the training voice information need not be considered; the voice recognition model can be trained with only an input sequence and an output sequence, saving the production cost of the training voice sample set. In addition, a triangular band-pass filter is used to smooth the spectrum of the training voice information, eliminating harmonics in the training voice information and highlighting the formants of the original sound, which avoids the influence of tone in the voice information on the predicted text of the voice recognition model and reduces the amount of computation on the voice information during recognition.
Referring to fig. 1, the following provides a method for constructing a speech recognition model, which includes:
101. A plurality of training speech samples is obtained.
The training speech samples include speech information and text labels corresponding to the speech information.
The text labels are used for labeling pronunciation phonemes of the training speech information.
To obtain the voice information and its text label, the recorded content of a pre-recorded voice is transcribed into text; the words in the text are numbered according to their order, and each word is annotated with its pronunciation phonemes, thereby obtaining the text label. Each pronunciation phoneme in the text label corresponds to one or more frames of data in the recording.
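For illustration only (this sketch and its phoneme inventory are assumptions, not part of the patented method), such (voice, text-label) training pairs might be assembled as follows:

```python
# Minimal sketch: build a text label whose pronunciation phonemes
# can later be matched to one or more frames of the recording.
PHONEME_INVENTORY = ["sil", "n", "i", "h", "ao"]  # hypothetical inventory
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEME_INVENTORY)}

def make_text_label(words_with_phonemes):
    """Number the words in order and map each word's phonemes to ids."""
    label = []
    for index, (word, phonemes) in enumerate(words_with_phonemes):
        label.append({
            "word_index": index,
            "word": word,
            "phoneme_ids": [PHONEME_TO_ID[p] for p in phonemes],
        })
    return label

# e.g. a two-word utterance transcribed as "ni hao"
sample_label = make_text_label([("ni", ["n", "i"]), ("hao", ["h", "ao"])])
```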
102. And constructing a voice recognition model through the independent convolution layer, the convolution residual layer, the full connection layer and the output layer.
The convolution residual layer comprises a plurality of serially connected residual stack layers. The residual stacking layer comprises a plurality of residual modules connected in sequence. The residual error module comprises a plurality of hidden layers which are connected in sequence and a bypass channel which bypasses the plurality of weight layers which are connected in sequence.
The independent convolution layer is used for extracting acoustic features from the voice information, eliminating non-maximum values in the acoustic features and reducing complexity of the acoustic features. Acoustic features include pronunciation of specific syllables, user readthrough habits, speech spectrum, and the like.
The convolution residual layer is used to map the acoustic features to the hidden layer feature space.
The full-connection layer is used for integrating the acoustic features mapped to the hidden layer feature space to acquire the meaning of the acoustic features, and outputting probabilities corresponding to various text types according to the meaning.
The output layer is used for outputting the text corresponding to the voice information according to the probabilities corresponding to the various text types.
The voice recognition model in this embodiment adds bypass channels to a plurality of sequentially connected hidden layers, so as to solve the problem that training accuracy drops as the number of network layers of a traditional neural network increases. The convolution residual layer of the speech recognition model is provided with a plurality of bypass channels; a bypass channel serves as a branch line of the hidden layers and realizes cross-layer connection between hidden layers, i.e., the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly.
Specifically, as shown in fig. 2, within one residual module the cross-layer connection typically spans only 2 to 3 hidden layers, though spanning more hidden layers is not excluded. Spanning only 1 hidden layer is of little use, and the experimental results are not ideal.
Assume the input of the residual module is x and the desired output is H(x), i.e., H(x) is the desired underlying mapping, which is usually difficult to learn directly. If the input x is passed directly to the output as the initial result, the goal the residual module must learn becomes F(x) = H(x) - x. Thus, compared with a traditional neural network, the speech recognition model in this embodiment changes the learning objective: it no longer learns a complete output, but rather the difference between the optimal solution H(x) and the identity mapping x, i.e., the residual F(x) = H(x) - x.
From an overall functional point of view, if all weights of the residual module are represented by {w_i}, the output of the residual module is actually calculated as:
y = F(x, {w_i}) + x
Taking the case of spanning 2 hidden layers as an example and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where the ReLU function is the activation function of the residual module.
It will be appreciated that F(x, {w_i}) must have the same dimension as x. If the dimensions differ, an additional weight matrix w_s can be introduced to linearly project x so that F(x, {w_i}) matches the dimension of x; accordingly, the residual module computes: y = F(x, {w_i}) + w_s·x
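As a minimal, non-authoritative sketch of this residual computation and of the layer arrangement described above (the use of 1-D convolutions, the channel sizes and the class count are assumptions, not specified by the patent), a PyTorch version might look like:

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """One residual module spanning two weight layers: y = F(x, {w_i}) + w_s*x."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.w1 = nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1)
        self.w2 = nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        # Bypass channel: identity when dimensions match, else a 1x1 projection w_s.
        self.ws = (nn.Identity() if in_channels == out_channels
                   else nn.Conv1d(in_channels, out_channels, kernel_size=1))

    def forward(self, x):
        f = self.w2(self.relu(self.w1(x)))  # F(x, {w_i}) = w2 . ReLU(w1 . x)
        return f + self.ws(x)               # cross-layer connection adds x back

class SpeechRecognitionModel(nn.Module):
    """Independent convolution layer -> convolution residual layer ->
    full connection layer -> (log-)softmax output layer."""
    def __init__(self, n_mels=40, n_classes=29):
        super().__init__()
        self.independent_conv = nn.Conv1d(n_mels, 64, kernel_size=5, padding=2)
        self.residual_layer = nn.Sequential(
            ResidualModule(64, 64),
            ResidualModule(64, 128),
            ResidualModule(128, 128),
        )
        self.fully_connected = nn.Linear(128, n_classes)

    def forward(self, x):                    # x: (batch, n_mels, frames)
        h = torch.relu(self.independent_conv(x))
        h = self.residual_layer(h)
        h = h.transpose(1, 2)                # (batch, frames, channels)
        return self.fully_connected(h).log_softmax(dim=-1)
```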
A plurality of voice samples are sequentially input into the voice recognition model, the voice information and the text label corresponding to the voice information are used as the input and the output reference of the model respectively, and the neuron weights of the voice recognition model are trained continuously until all of the voice samples have been input, at which point training ends. After training, the speech recognition model with the trained neuron weights is taken as the target model.
In the training process, the weight of neurons in the voice recognition model is randomly initialized, training voice information is used as input of the voice recognition model, and a text label corresponding to the training voice information is used as output reference of the voice recognition model. The training voice information is transmitted forward in a voice recognition model, the voice recognition model randomly classifies the training voice information by utilizing neurons initialized by each layer, and finally, a prediction text corresponding to the training voice information is obtained. And updating the weight of the neuron according to the difference between the predicted text and the text label output by the voice recognition model, and continuing the next iteration until the weight of the neuron approaches the required value.
103. The error of the target model is evaluated by $L(S) = -\ln \prod_{(h(x),z)\in S} p(z|h(x)) = -\sum_{(h(x),z)\in S} \ln p(z|h(x))$.

Here L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity of the predicted text to the text label, and S is the plurality of training voice samples. The predicted text refers to the text information calculated and output by the target model according to the neuron weights after the voice information is input into the target model.
The CTC loss function is used to measure the degree of inconsistency between the predicted text output by the speech recognition model and the actual text label; its advantage is that no forced alignment of the input data with the output data is required. Unlike the cross-entropy criterion, which requires frame-level alignment between the input features and the target labels, the CTC loss function automatically learns the alignment between the speech data and the label sequence (e.g., phonemes or characters), so the data need not be forcibly aligned and the input need not have the same length as the label. Evaluating the predicted text of the speech recognition model with the CTC loss function means the precise mapping between the pronunciation phonemes in the text label and the sequence of the training voice information need not be considered: the model can be trained with only an input sequence and an output sequence, saving the production cost of the training voice sample set.
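A brief sketch of how such a loss might be computed in practice (torch.nn.CTCLoss is used here as a stand-in for the patent's loss; the tensor shapes and vocabulary size are assumptions):

```python
import torch
import torch.nn as nn

# CTC needs only the input sequence and the label sequence; no frame alignment.
ctc_loss = nn.CTCLoss(blank=0)

frames, batch, n_classes = 100, 4, 29                               # assumed shapes
log_probs = torch.randn(frames, batch, n_classes).log_softmax(-1)   # h(x)
targets = torch.randint(1, n_classes, (batch, 20))                  # label sequences z
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

# L(S) = -sum over S of ln p(z | h(x))
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```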
104. And adjusting the weight of the neuron of the target model until the error is smaller than the threshold value, and setting the weight of the neuron with the error smaller than the threshold value as an ideal weight.
The error over the training voice sample set is calculated according to the CTC loss function and back-propagated through the speech recognition model using a gradient descent algorithm to update target parameters such as weights and thresholds, continuously improving the recognition accuracy of the speech recognition model until the convergence requirement is met.
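A hedged sketch of this update loop, reusing the SpeechRecognitionModel and ctc_loss from the sketches above (the learning rate, the threshold value and the batch variables `features`, `targets` and the length tensors are assumptions):

```python
import torch

model = SpeechRecognitionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
threshold = 0.1                                   # assumed error threshold

for epoch in range(100):
    # `features`, `targets`, `input_lengths`, `target_lengths` come from the
    # training voice sample set (not defined here).
    log_probs = model(features).permute(1, 0, 2)  # (frames, batch, classes)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()                               # back-propagate the CTC error
    optimizer.step()                              # gradient-descent weight update
    if loss.item() < threshold:                   # error below threshold:
        break                                     # keep these ideal weights

torch.save(model.state_dict(), "ideal_weights.pt")  # deployable to a client
```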
105. And deploying the target model and the ideal weight to the client.
Compared with the prior art, the input information x is passed directly to the output of the hidden layers through the bypass channel. The bypass channel carries no weights, which protects the integrity of the input information x and allows deeper neural networks to be trained: the whole network only needs to train on the difference between input and output, i.e., after the input information x is passed through, each residual module only learns the residual F(x). This simplifies the training objective and difficulty, makes the neural network stable and easy to train, and allows the performance of the voice recognition model to improve gradually as the depth of the network increases. The predicted text of the voice recognition model is evaluated with a CTC loss function, so the precise mapping between the pronunciation phonemes in the text label and the sequence of the training voice information need not be considered; the voice recognition model can be trained with only an input sequence and an output sequence, saving the production cost of the training voice sample set. In addition, a triangular band-pass filter is used to smooth the spectrum of the training voice information, eliminating harmonics in the training voice information and highlighting the formants of the original sound, which avoids the influence of tone in the voice information on the predicted text of the voice recognition model and reduces the amount of computation on the voice information during recognition.
In some embodiments, before inputting the plurality of speech samples into the speech recognition model, the method further comprises:
the method comprises the steps of processing training voice information in frames according to preset frame dividing parameters, obtaining sentences corresponding to the training voice information, wherein the preset frame dividing parameters comprise frame duration, frame number and front and back frame repetition duration;
and converting sentences according to a preset two-dimensional parameter and a filter bank characteristic extraction algorithm to obtain two-dimensional voice information.
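For instance, framing with a 25 ms frame duration and a 10 ms overlap between adjacent frames (these values are illustrative assumptions, not prescribed by the patent) might be sketched as:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, overlap_ms=10):
    """Split a 1-D waveform into overlapping frames (a 2-D array)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = frame_len - int(sample_rate * overlap_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    return np.stack([signal[i * step:i * step + frame_len]
                     for i in range(n_frames)])
```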
In some embodiments, the framing of the training voice information according to the preset framing parameters includes:
Discrete Fourier transform is performed on the two-dimensional voice information to obtain a linear spectrum X(k) corresponding to the two-dimensional voice information;

the linear spectrum is filtered through a preset band-pass filter to obtain a target linear spectrum, wherein when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\,\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the frequency range of the band-pass filter, $f_h$ is the highest frequency of the frequency range of the band-pass filter, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, $F_{mel}$ is $F_{mel}(f)=1125\ln(1+f/700)$, and the inverse function of $F_{mel}$ is $F_{mel}^{-1}(b)=700(e^{b/1125}-1)$, b being an integer;

according to $E(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^2 H_m(k)\right)$ for $0\le m\le M$, the logarithmic energy corresponding to the target linear spectrum is calculated to obtain a spectrogram, wherein X(k) is the linear spectrum.
In the above embodiments, the human response to sound pressure is logarithmic, and humans are less sensitive to small changes at high sound pressure than at low sound pressure. Furthermore, using logarithms reduces the sensitivity of the extracted features to variations in input sound energy, since the distance between the sound source and the microphone, and hence the sound energy captured by the microphone, varies. The spectrogram is a visual representation of the time-frequency distribution of sound energy that effectively exploits the correlation between the time and frequency domains; the feature vector sequence obtained through spectrogram analysis extracts acoustic features well, and feeding it into the voice recognition model makes the subsequent computation more accurate. A triangular band-pass filter is used to smooth the spectrum of the training voice information, eliminating harmonics in the training voice information and highlighting the formants of the original sound. As a result, the tone or pitch of a segment of sound in the training voice information is not reflected in the acoustic features, i.e., differences in tone in the voice information cannot affect the predicted text of the voice recognition model, and the amount of computation on the voice information during recognition is reduced.
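A sketch of this filter-bank computation, following the formulas above (the DFT length, sampling rate, band edges and number of filters are assumptions):

```python
import numpy as np

def f_mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)          # F_mel(f)

def f_mel_inv(b):
    return 700.0 * (np.exp(b / 1125.0) - 1.0)        # inverse of F_mel

def filter_centers(M, f_l, f_h, N, f_s):
    """Center bins f(m) for m = 0..M+1, equally spaced on the mel scale."""
    mels = f_mel(f_l) + np.arange(M + 2) * (f_mel(f_h) - f_mel(f_l)) / (M + 1)
    return np.floor((N / f_s) * f_mel_inv(mels)).astype(int)

def log_mel_spectrogram(frames, M=26, f_l=0.0, f_h=8000.0, N=512, f_s=16000.0):
    X = np.fft.rfft(frames, n=N)                     # linear spectrum X(k)
    power = np.abs(X) ** 2
    f = filter_centers(M, f_l, f_h, N, f_s)
    H = np.zeros((M, N // 2 + 1))                    # triangular filters H_m(k)
    for m in range(1, M + 1):
        rise = np.arange(f[m - 1], f[m])
        fall = np.arange(f[m], f[m + 1])
        H[m - 1, f[m - 1]:f[m]] = (rise - f[m - 1]) / (f[m] - f[m - 1])
        H[m - 1, f[m]:f[m + 1]] = (f[m + 1] - fall) / (f[m + 1] - f[m])
    return np.log(power @ H.T + 1e-10)               # log energies E(m)
```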
In some embodiments, the fully connected layer includes a classification function. The classification function refers to $\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\ j=1,\dots,K$, where j is a natural number. The classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolution residual layer into another K-dimensional real vector $\delta(z)_j$ such that each element lies in (0, 1) and all elements sum to 1.
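A two-line check of this property (the values are arbitrary):

```python
import torch

z = torch.tensor([1.0, 2.0, 3.0])                     # a K-dimensional vector
delta = torch.exp(z) / torch.exp(z).sum()             # delta(z)_j in (0, 1)
assert torch.isclose(delta.sum(), torch.tensor(1.0))  # elements sum to 1
```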
In some embodiments, the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is:

y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolution layer and w_s is the weight of the residual module.
In the above embodiments, the speech recognition model adds bypass channels to a plurality of sequentially connected hidden layers, so as to solve the problem that training accuracy drops as the number of network layers of a traditional neural network increases. The convolution residual layer of the speech recognition model is provided with a plurality of bypass channels; the bypass channels serve as branch lines of the hidden layers and realize cross-layer connections between hidden layers, i.e., the input of a hidden layer is connected directly to a later layer, so that the later layer can learn the residual directly.

In particular, within one residual module the cross-layer connection typically spans only 2 to 3 hidden layers, though spanning more hidden layers is not excluded. Spanning only 1 hidden layer is of little use, and the experimental results are not ideal.

Assume the input of the residual module is x and the desired output is H(x), i.e., H(x) is the desired underlying mapping, which is usually difficult to learn directly. If the input x is passed directly to the output as the initial result, the goal the residual module must learn becomes F(x) = H(x) - x. Thus, compared with a traditional neural network, the speech recognition model in this embodiment changes the learning objective: it no longer learns a complete output, but rather the difference between the optimal solution H(x) and the identity mapping x, i.e., the residual F(x) = H(x) - x. From an overall functional point of view, if all weights of the residual module are represented by {w_i}, the output of the residual module is computed as y = F(x, {w_i}) + x; taking the case of spanning 2 hidden layers and ignoring the bias, F(x, {w_i}) = w_2·δ(w_1·x) = w_2·ReLU(w_1·x), where ReLU(·) is the activation function of the residual module.

It will be appreciated that F(x, {w_i}) must have the same dimension as x. If the dimensions differ, an additional weight matrix w_s can be introduced to linearly project x so that F(x, {w_i}) matches the dimension of x; accordingly, the residual module computes: y = F(x, {w_i}) + w_s·x
In some embodiments, F(x, w_i) employs a ReLU function as the activation function of the independent convolution layer, the mathematical expression of the ReLU function being ReLU(x) = max(0, x).
In the above embodiment, the neural network may be trained by the above formula.
In some embodiments, adjusting weights of neurons of the target model includes:
the weights of the neurons are adjusted by a stochastic gradient descent method.
In this embodiment, the stochastic gradient descent algorithm effectively avoids redundant computation and takes less time. Of course, those skilled in the art may use other algorithms.
A schematic structure of an apparatus 20 for constructing a speech recognition model is shown in fig. 2; it is applicable to constructing a speech recognition model. The apparatus for constructing a speech recognition model in the embodiment of the present application can implement the steps of the method for constructing a speech recognition model performed in the embodiment corresponding to fig. 1. The functions implemented by the apparatus 20 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, which may be software and/or hardware. The apparatus may include an input/output module 201 and a processing module 202; for the functional implementation of the processing module 202 and the input/output module 201, reference may be made to the operations performed in the embodiment corresponding to fig. 1, which are not repeated here. The input/output module 201 may be used to control the input, output and acquisition operations of the apparatus 20.
In some embodiments, the input/output module 201 may be configured to obtain a plurality of training voice samples, where the training voice samples include voice information and text labels corresponding to the voice information;
The processing module 202 may be configured to construct a speech recognition model from an independent convolution layer, a convolution residual layer, a full connection layer and an output layer, wherein the convolution residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel bypassing the sequentially connected weight layers; to sequentially input a plurality of voice samples into the speech recognition model through the input/output module, taking the voice information and the text labels corresponding to the voice information as the input and output of the model respectively, and continuously train the neuron weights of the speech recognition model through the input and output until all of the voice samples have been input and training ends, after which the speech recognition model with the trained neuron weights is taken as the target model; to evaluate the error of the target model by $L(S) = -\ln \prod_{(h(x),z)\in S} p(z|h(x)) = -\sum_{(h(x),z)\in S} \ln p(z|h(x))$, wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between the predicted text and the text label, S is the plurality of training voice samples, and the predicted text is the text information calculated and output by the target model according to the neuron weights after the voice information is input into the target model; and to adjust the neuron weights of the target model until the error is smaller than a threshold, take the neuron weights whose error is smaller than the threshold as the ideal weights, and deploy the target model and the ideal weights to a client.
In some embodiments, the processing module 202 is further configured to:
the training voice information is processed in frames according to preset framing parameters, sentences corresponding to the training voice information are obtained, and the preset framing parameters comprise frame duration, frame number and front and back frame repetition duration;
And converting the statement according to a preset two-dimensional parameter and a filter bank characteristic extraction algorithm to obtain two-dimensional voice information.
In some embodiments, the processing module 202 is further configured to:
performing discrete Fourier transform on the two-dimensional voice information to obtain a linear spectrum X(k) corresponding to the two-dimensional voice information;

filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein when the center frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\le f(m+1)\\ 0, & k>f(m+1)\end{cases}$$

the expression of f(m) is:

$$f(m)=\frac{N}{f_s}\,F_{mel}^{-1}\!\left(F_{mel}(f_l)+m\,\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)$$

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, $f_l$ is the lowest frequency of the frequency range of the band-pass filter, $f_h$ is the highest frequency of the frequency range of the band-pass filter, N is the DFT length, $f_s$ is the sampling frequency of the band-pass filter, $F_{mel}$ is $F_{mel}(f)=1125\ln(1+f/700)$, and the inverse function of $F_{mel}$ is $F_{mel}^{-1}(b)=700(e^{b/1125}-1)$, b being an integer;

according to $E(m)=\ln\!\left(\sum_{k=0}^{N-1}\lvert X(k)\rvert^2 H_m(k)\right)$ for $0\le m\le M$, calculating the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum.
In some embodiments, the fully connected layer includes a classification function, which refers to $\delta(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}},\ j=1,\dots,K$, where j is a natural number; the classification function compresses the K-dimensional speech frequency-domain signal vector z output by the convolution residual layer into another K-dimensional real vector $\delta(z)_j$, such that each element lies in the range (0, 1) and all elements sum to 1.
In some embodiments, the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is: y = F(x, w_i) + w_s·x, where F(x, w_i) is the output of the independent convolution layer and w_s is the weight of the residual module.
In some embodiments, F(x, w_i) employs a ReLU function as the activation function of the independent convolution layer, where the mathematical expression of the ReLU function is ReLU(x) = max(0, x).
In some embodiments, the adjusting weights of neurons of the target model comprises:
and adjusting the weights of the neurons by a stochastic gradient descent method.
The construction apparatus in the embodiment of the present application has been described above from the point of view of modularized functional entities; the following describes an apparatus for constructing a speech recognition model from the point of view of hardware. As shown in fig. 3, it includes: a processor, a memory, an input/output unit (which may also be a transceiver, not labeled in fig. 3) and a computer program stored in the memory and executable on the processor. For example, the computer program may be a program corresponding to the method for constructing a speech recognition model in the embodiment corresponding to fig. 1. When the apparatus implements the functions of the apparatus 20 for constructing a speech recognition model shown in fig. 2, the processor, when executing the computer program, implements the steps of the method performed by the apparatus 20 in the embodiment corresponding to fig. 2, or implements the functions of the modules in the apparatus 20 of the embodiment corresponding to fig. 2.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements various functions of the computer device by running or executing the computer program and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, video data, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another volatile solid-state storage device.
The input/output unit may be replaced by a receiver and a transmitter, which may be the same physical entity or different physical entities. When they are the same physical entity, they may be collectively referred to as an input/output unit. The input/output unit may be a transceiver.
The memory may be integrated in the processor or may be provided separately from the processor.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM), comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server or a network device, etc.) to perform the method according to the embodiments of the present application.
While the embodiments of the present application have been described above with reference to the drawings, the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many modifications may be made by those of ordinary skill in the art, in light of the present application, without departing from the spirit of the present application and the scope of the appended claims, whether by using equivalent structures or equivalent flow transformations of the description and drawings of the present application or by applying them directly or indirectly to other relevant technical fields; all such modifications fall within the protection of the present application.
Claims (9)
1. A method of constructing a speech recognition model, the method comprising:
acquiring a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
The method comprises the steps that a voice recognition model is built through an independent convolution layer, a convolution residual layer, a full connection layer and an output layer, wherein the convolution residual layer comprises a plurality of residual stacking layers which are sequentially connected, each residual stacking layer comprises a plurality of residual modules which are sequentially connected, each residual module comprises a plurality of hidden layers which are sequentially connected and a bypass channel which bypasses a plurality of weight layers which are sequentially connected; the independent convolution layer is used for extracting acoustic features from the voice information and eliminating non-maximum values in the acoustic features; the acoustic features include pronunciation of syllables, user readthrough habits and speech spectrum; the convolution residual layer is used for mapping the acoustic features to a hidden layer feature space; the full-connection layer is used for integrating acoustic features mapped to the hidden layer feature space to acquire the meaning of the acoustic features, and outputting probabilities corresponding to various text types according to the meaning; the output layer is used for outputting the text corresponding to the voice information according to the probabilities corresponding to the various text types;
sequentially inputting a plurality of the voice samples into the voice recognition model, taking the voice information and the text labels corresponding to the voice information as the input and output of the voice recognition model respectively, and continuously training the neuron weights of the voice recognition model through the input and output until all of the voice samples have been input into the voice recognition model, whereupon training of the voice recognition model ends; after training ends, taking the voice recognition model with the trained neuron weights as a target model;
evaluating the error of the target model by $L(S) = -\ln \prod_{(h(x),z)\in S} p(z|h(x)) = -\sum_{(h(x),z)\in S} \ln p(z|h(x))$, wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between a predicted text and the text label, S is the plurality of training voice samples, and the predicted text refers to the text information calculated and output by the target model according to the neuron weights after the voice information is input into the target model;
adjusting the weights of the neurons of the target model until the error is smaller than a threshold value, and taking the neuron weights for which the error is smaller than the threshold value as ideal weights;
deploying the target model and the ideal weights to a client;
the fully connected layer comprises a classification function, wherein the classification function is the softmax $\delta(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$, j is a natural number, and the classification function compresses the K-dimensional voice frequency-domain signal vector z output by the convolution residual layer into another K-dimensional real vector δ(z) such that each element lies in the range (0, 1) and all the elements sum to 1; wherein the elements are δ(z)_j, j = 1, …, K;
the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is:
$y = F(x, W_i) + W_s \cdot x$, where F(x, W_i) is the output of the independent convolution layer, W_s is the weight of the bypass channel of the residual module, and W_i denotes all the weights of the residual module, as illustrated in the sketch below.
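For readers who want a concrete picture of the architecture recited in claim 1, the following is a minimal PyTorch sketch, not the patented implementation: the channel counts, kernel sizes, module count and vocabulary size are invented placeholders, and max pooling is used as one plausible reading of "eliminating non-maximum values".

```python
import torch
import torch.nn as nn


class ResidualModule(nn.Module):
    """One residual module: stacked weight layers plus a bypass channel,
    computing y = F(x, W_i) + W_s * x."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x, W_i): the sequentially connected hidden (weight) layers
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),  # ReLU activation, as in claim 4
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # W_s: a 1x1 convolution standing in for the weighted bypass channel
        self.bypass = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + self.bypass(x))


class SpeechRecognitionModel(nn.Module):
    """Independent convolution layer -> convolution residual layer ->
    fully connected layer -> (log-)softmax output."""

    def __init__(self, vocab_size: int = 4000, channels: int = 32,
                 num_modules: int = 4):
        super().__init__()
        # Independent convolution layer; max pooling is one plausible way
        # to discard non-maximum values in the extracted features
        self.independent_conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Convolution residual layer: sequentially connected residual modules
        self.residual = nn.Sequential(
            *[ResidualModule(channels) for _ in range(num_modules)]
        )
        # Fully connected layer mapping hidden features to text-type scores
        self.fc = nn.Linear(channels, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq_bins, time_frames), e.g. a spectrogram
        h = self.residual(self.independent_conv(x))
        h = h.mean(dim=2).transpose(1, 2)  # -> (batch, time, channels)
        return torch.log_softmax(self.fc(h), dim=-1)
```

The 1×1 bypass convolution plays the role of W_s, so each module computes y = F(x, W_i) + W_s·x, and the final log-softmax realises the classification function of the fully connected layer.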
2. The method of claim 1, wherein before the inputting of the plurality of voice samples into the voice recognition model, the method further comprises:
processing the training voice information in frames according to preset framing parameters to obtain sentences corresponding to the training voice information, wherein the preset framing parameters comprise the frame duration, the number of frames and the overlap duration between adjacent frames;
and performing feature extraction and conversion on the sentences according to preset two-dimensional parameters and a filter bank to obtain two-dimensional voice information (see the illustrative sketch below).
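As an illustration of claim 2's framing step, here is a minimal sketch; the 25 ms frame duration and 10 ms hop (i.e. 15 ms overlap between adjacent frames) are assumed example values, not parameters taken from the claims.

```python
import numpy as np


def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split raw speech into overlapping frames.

    frame_ms and hop_ms are assumed placeholder values standing in for
    the preset framing parameters of claim 2."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])


# Usage: one second of 16 kHz audio -> a (98, 400) two-dimensional array
frames = frame_signal(np.random.randn(16000), sample_rate=16000)
```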
3. The method according to claim 2, wherein after the framing processing of the training voice information according to the preset framing parameters, the method further comprises:
performing a discrete Fourier transform on the two-dimensional voice information to obtain a linear spectrum X(k) corresponding to the two-dimensional voice information;
filtering the linear spectrum through a preset band-pass filter to obtain a target linear spectrum, wherein when the centre frequency of the band-pass filter is f(m), the transfer function of the band-pass filter is:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

and the expression of f(m) is:

$$f(m) = \left(\frac{N}{f_s}\right) F_{mel}^{-1}\!\left(F_{mel}(f_l) + m \cdot \frac{F_{mel}(f_h) - F_{mel}(f_l)}{M + 1}\right)$$

wherein the band-pass filter comprises a plurality of band-pass filters with triangular filtering characteristics, f_l is the lowest frequency of the frequency range of the band-pass filter, f_h is the highest frequency of that range, N is the length of the discrete Fourier transform, f_s is the sampling frequency, F_mel is the Mel frequency $F_{mel}(f) = 1125 \ln(1 + f/700)$, and the inverse function of F_mel is $F_{mel}^{-1}(b) = 700\,(e^{b/1125} - 1)$, where b is an integer;

according to $S(m) = \ln\!\left(\sum_{k=0}^{N-1} |X(k)|^2 H_m(k)\right), \; 0 \le m \le M$, calculating the logarithmic energy corresponding to the target linear spectrum to obtain a spectrogram, wherein X(k) is the linear spectrum (a numerical sketch of these steps follows this claim).
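To make claim 3 concrete, here is a minimal numerical sketch assuming standard Mel filter-bank conventions; n_fft, n_filters and the frequency range are illustrative defaults rather than values from the claims.

```python
import numpy as np


def mel(f):
    """F_mel(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * np.log(1.0 + f / 700.0)


def mel_inv(b):
    """Inverse Mel: 700 (e^(b/1125) - 1)."""
    return 700.0 * (np.exp(b / 1125.0) - 1.0)


def log_mel_energies(frames, sample_rate, n_fft=512, n_filters=26,
                     f_low=0.0, f_high=None):
    """DFT, triangular Mel band-pass filtering, and log energy S(m)."""
    if f_high is None:
        f_high = sample_rate / 2.0
    X = np.fft.rfft(frames, n=n_fft)      # linear spectrum X(k)
    power = np.abs(X) ** 2                # |X(k)|^2
    # Centre frequencies f(m), equally spaced on the Mel scale
    mel_points = np.linspace(mel(f_low), mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_inv(mel_points) / sample_rate).astype(int)
    # Triangular transfer functions H_m(k)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):     # rising edge of the triangle
            H[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):    # falling edge of the triangle
            H[m - 1, k] = (right - k) / max(right - centre, 1)
    # Log energy S(m) = ln( sum_k |X(k)|^2 H_m(k) ); epsilon avoids log(0)
    return np.log(power @ H.T + 1e-10)


# Usage with framed audio of shape (n_frames, frame_len) -> (n_frames, 26)
feats = log_mel_energies(np.random.randn(98, 400), sample_rate=16000)
```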
4. The method of claim 1, wherein F(x, W_i) employs a ReLU function as the activation function of the independent convolution layer, the mathematical expression of the ReLU function being ReLU(x) = max(0, x).
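As a quick numerical check of the ReLU expression in claim 4 (the input values are illustrative only):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(np.maximum(0.0, x))  # ReLU(x) = max(0, x) -> [0.  0.  0.  1.5]
```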
5. The method of claim 1, wherein said adjusting weights of neurons of the target model comprises:
adjusting the weights of the neurons by a stochastic gradient descent method (an illustrative training step follows this claim).
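Claim 5's weight adjustment can be pictured with the sketch below. This is an assumed training step using PyTorch's SGD optimizer and CTCLoss, whose negative log-likelihood matches the form of L(S) in claim 1; SpeechRecognitionModel is the placeholder class from the sketch after claim 1, and all names and shapes are invented for illustration.

```python
import torch
import torch.nn as nn

model = SpeechRecognitionModel()  # placeholder model from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
ctc_loss = nn.CTCLoss(blank=0)


def train_step(spectrograms, targets, input_lens, target_lens):
    optimizer.zero_grad()
    log_probs = model(spectrograms)          # (batch, time, vocab)
    # nn.CTCLoss expects (time, batch, vocab); input_lens must reflect
    # the time dimension remaining after the pooling inside the model
    loss = ctc_loss(log_probs.transpose(0, 1),
                    targets, input_lens, target_lens)
    loss.backward()                          # gradients of the error L(S)
    optimizer.step()                         # stochastic gradient update
    return loss.item()
```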
6. An apparatus for constructing a speech recognition model, the apparatus comprising:
an input/output module, configured to acquire a plurality of training voice samples, wherein the training voice samples comprise voice information and text labels corresponding to the voice information;
a processing module, configured to construct a voice recognition model from an independent convolution layer, a convolution residual layer, a fully connected layer and an output layer, wherein the convolution residual layer comprises a plurality of sequentially connected residual stacking layers, each residual stacking layer comprises a plurality of sequentially connected residual modules, and each residual module comprises a plurality of sequentially connected hidden layers and a bypass channel that skips the sequentially connected weight layers; the independent convolution layer is used for extracting acoustic features from the voice information and eliminating non-maximum values in the acoustic features; the acoustic features include the pronunciation of syllables, the user's connected-speech (liaison) habits and the speech spectrum; the convolution residual layer is used for mapping the acoustic features to a hidden-layer feature space; the fully connected layer is used for integrating the acoustic features mapped to the hidden-layer feature space to obtain the meaning of the acoustic features, and for outputting probabilities corresponding to various text types according to the meaning; the output layer is used for outputting the text corresponding to the voice information according to the probabilities corresponding to the various text types; the processing module is further configured to sequentially input the plurality of voice samples into the voice recognition model through the input/output module, take the voice information and the text label corresponding to the voice information as the input and the expected output of the voice recognition model respectively, and continuously train the neuron weights of the voice recognition model through the input and output until all the voice samples have been input into the voice recognition model, at which point the training of the voice recognition model is finished; after the training is finished, the voice recognition model with the trained neuron weights is taken as a target model; the error of the target model is evaluated through $L(S) = -\ln \prod_{(h(x),z)\in S} p(z|h(x)) = -\sum_{(h(x),z)\in S} \ln p(z|h(x))$, wherein L(S) is the error, x is the voice information, z is the text label, p(z|h(x)) is the similarity between a predicted text and the text label, and S is the plurality of training voice samples; the predicted text refers to the text information calculated and output by the target model according to the neuron weights after the voice information is input to the target model;
the processing module is further configured to adjust the weights of the neurons of the target model until the error is smaller than a threshold value, take the neuron weights for which the error is smaller than the threshold value as ideal weights, and deploy the target model and the ideal weights to a client;
the fully connected layer comprises a classification function, wherein the classification function is the softmax $\delta(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$, j is a natural number, and the classification function compresses the K-dimensional voice frequency-domain signal vector z output by the convolution residual layer into another K-dimensional real vector δ(z) such that each element lies in the range (0, 1) and all the elements sum to 1; wherein the elements are δ(z)_j, j = 1, …, K;
the input of the residual module is x, the output of the residual module is y, and the mathematical expression of the residual module is:
$y = F(x, W_i) + W_s \cdot x$, where F(x, W_i) is the output of the independent convolution layer, W_s is the weight of the bypass channel of the residual module, and W_i denotes all the weights of the residual module.
7. The apparatus of claim 6, wherein the processing module is further configured to:
frame the training voice information according to preset framing parameters to obtain sentences corresponding to the training voice information, wherein the preset framing parameters comprise the frame duration, the number of frames and the overlap duration between adjacent frames;
and perform feature extraction and conversion on the sentences according to preset two-dimensional parameters and a filter bank to obtain two-dimensional voice information.
8. An apparatus for constructing a speech recognition model, the apparatus comprising:
at least one processor, a memory and an input/output unit;
wherein the memory is configured to store program code, and the processor is configured to invoke the program code stored in the memory to perform the method of any of claims 1-5.
9. A computer storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-5.
Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910884620.9A (CN110751944B) | 2019-09-19 | 2019-09-19 | Method, device, equipment and storage medium for constructing voice recognition model
PCT/CN2019/119128 (WO2021051628A1) | 2019-09-19 | 2019-11-18 | Method, apparatus and device for constructing speech recognition model, and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910884620.9A (CN110751944B) | 2019-09-19 | 2019-09-19 | Method, device, equipment and storage medium for constructing voice recognition model
Publications (2)

Publication Number | Publication Date
---|---
CN110751944A | 2020-02-04
CN110751944B | 2024-09-24
Family

ID=69276643

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910884620.9A (CN110751944B, Active) | Method, device, equipment and storage medium for constructing voice recognition model | 2019-09-19 | 2019-09-19

Country Status (2)

Country | Link
---|---
CN | CN110751944B
WO | WO2021051628A1
Families Citing this family (8)

Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN111354345B * | 2020-03-11 | 2021-08-31 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating speech model and speech recognition
CN111862942B * | 2020-07-28 | 2022-05-06 | 思必驰科技股份有限公司 | Method and system for training a mixed Mandarin and Sichuan-dialect speech recognition model
CN112597764B * | 2020-12-23 | 2023-07-25 | 青岛海尔科技有限公司 | Text classification method and device, storage medium and electronic device
CN114678010A * | 2020-12-24 | 2022-06-28 | 暗物智能科技(广州)有限公司 | Training method, recognition method and device for a Hausa-language voice model
CN113012706B * | 2021-02-18 | 2023-06-27 | 联想(北京)有限公司 | Data processing method and device and electronic equipment
CN113053361B * | 2021-03-18 | 2023-07-04 | 北京金山云网络技术有限公司 | Speech recognition method, model training method, device, equipment and medium
CN113744729A * | 2021-09-17 | 2021-12-03 | 北京达佳互联信息技术有限公司 | Speech recognition model generation method, device, equipment and storage medium
CN118430527B * | 2024-07-05 | 2024-09-06 | 青岛珞宾通信有限公司 | Voice recognition method based on PDA-side edge computing
Citations (3)

Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN108694951A * | 2018-05-22 | 2018-10-23 | 华南理工大学 | Speaker discrimination method based on multi-stream hierarchical fusion transform features and a long short-term memory network
CN108847223A * | 2018-06-20 | 2018-11-20 | 陕西科技大学 | Speech recognition method based on a deep residual neural network
CN109767759A * | 2019-02-14 | 2019-05-17 | 重庆邮电大学 | End-to-end speech recognition method based on a modified CLDNN structure
Family Cites Families (10)

Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN106910497B * | 2015-12-22 | 2021-04-16 | 阿里巴巴集团控股有限公司 | Chinese word pronunciation prediction method and device
US11488009B2 * | 2017-10-16 | 2022-11-01 | Illumina, Inc. | Deep learning-based splice site classification
US20190130896A1 * | 2017-10-26 | 2019-05-02 | Salesforce.Com, Inc. | Regularization techniques for end-to-end speech recognition
US10573295B2 * | 2017-10-27 | 2020-02-25 | Salesforce.Com, Inc. | End-to-end speech recognition with policy learning
CN108550375A * | 2018-03-14 | 2018-09-18 | 鲁东大学 | Emotion identification method, device and computer equipment based on voice signals
CN109346061B * | 2018-09-28 | 2021-04-20 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio detection method, device and storage medium
CN109919005B * | 2019-01-23 | 2024-08-16 | 平安科技(深圳)有限公司 | Livestock identity recognition method, electronic device and readable storage medium
CN109840287B * | 2019-01-31 | 2021-02-19 | 中科人工智能创新技术研究院(青岛)有限公司 | Cross-modal information retrieval method and device based on neural networks
CN110010133A * | 2019-03-06 | 2019-07-12 | 平安科技(深圳)有限公司 | Voiceprint detection method, device, equipment and storage medium based on short text
CN110148408A * | 2019-05-29 | 2019-08-20 | 上海电力学院 | Chinese speech recognition method based on deep residuals
Non-Patent Citations (1)

Title
---
Deep Residual Learning for Image Recognition; Zhang Xiangyu et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); pp. 770-778 *
Also Published As

Publication Number | Publication Date
---|---
CN110751944A | 2020-02-04
WO2021051628A1 | 2021-03-25
Similar Documents

Publication | Title
---|---
CN110751944B | Method, device, equipment and storage medium for constructing voice recognition model
CN110491416B | Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107680582B | Acoustic model training method, voice recognition method, device, equipment and medium
CN110459225B | Speaker recognition system based on CNN fusion characteristics
CN110910283A | Method, device, equipment and storage medium for generating legal document
CN106782511A | Rectified-linear deep autoencoder network speech recognition method
Miao et al. | Improvements to speaker adaptive training of deep neural networks
CN112183107B | Audio processing method and device
CN111899757B | Single-channel voice separation method and system for target speaker extraction
CN113129908B | End-to-end macaque voiceprint verification method and system based on recurrent frame-level feature fusion
CN109147774B | Improved time-delay neural network acoustic model
Khdier et al. | Deep learning algorithms based voiceprint recognition system in noisy environment
CN114299986B | Small sample voice recognition method and system based on cross-domain transfer learning
Hasannezhad et al. | PACDNN: A phase-aware composite deep neural network for speech enhancement
CN112767927A | Method, device, terminal and storage medium for extracting voice features
CN117672176A | Rereading controllable voice synthesis method and device based on voice self-supervision learning characterization
CN118280371B | Voice interaction method and system based on artificial intelligence
CN117789699B | Speech recognition method, device, electronic equipment and computer readable storage medium
JPH10509526A | Decision tree classifier designed using hidden Markov models
Ramalingam et al. | IEEE FEMH voice data challenge 2018
Zhipeng et al. | Voiceprint recognition based on BP neural network and CNN
CN116364085A | Data enhancement method, device, electronic equipment and storage medium
Long et al. | Offline to online speaker adaptation for real-time deep neural network based LVCSR systems
CN112951256A | Voice processing method and device
CN117854509B | Training method and device for whisper speaker recognition model
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant