CN117877483A - Training method of spoken language scoring model, spoken language scoring method and related equipment - Google Patents
Training method of spoken language scoring model, spoken language scoring method and related equipment
- Publication number
- CN117877483A (application number CN202311745175.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- spoken language
- voice
- training
- scoring model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The application discloses a training method of a spoken language scoring model, which comprises the steps of obtaining voice recognition data, wherein the voice recognition data comprises voice modal data and text modal data corresponding to the voice modal data; pre-training an initial spoken language scoring model by utilizing the voice recognition data, wherein the initial spoken language scoring model comprises an acoustic encoder and a text encoder, and the voice mode corresponding to the acoustic encoder after pre-training is aligned with the text mode corresponding to the text encoder after pre-training; acquiring spoken language evaluation data; and performing model optimization on the initial spoken language scoring model after pre-training by using the spoken language evaluation data to obtain a target spoken language scoring model. The application also discloses a spoken language scoring method and related equipment. The method and the device improve the efficiency and accuracy of automatic spoken language assessment.
Description
Technical Field
The disclosed embodiments of the present application relate to the field of artificial intelligence technology, and more particularly, to a training method of a spoken language scoring model, a spoken language scoring method, and related devices.
Background
Computer-aided pronunciation training (CAPT) has become an effective tool for non-native (L2) speakers learning foreign languages. As an important component of CAPT, automatic speech assessment plays a significant role in helping autonomous language learners improve their spoken proficiency. The automatic speech assessment task has also expanded from the initial constrained tasks, such as reading sentences or words aloud, to more semi-open tasks such as topic discussion, retelling after listening, picture-based speaking, and question answering after listening. The students' answers in these tasks contain prompt information from the questions. Traditional open automatic spoken language assessment methods rely on extracting acoustic features (e.g., pronunciation accuracy and fluency) or text features (e.g., grammar and content) from ASR-transcribed text as inputs to a regressor or classifier that scores the spoken answers. In recent years, with the continuous improvement of ASR accuracy, automatic speech assessment has gradually shifted to operating on the transcription result, treating scoring as a natural language processing (NLP) problem. However, both acoustic-feature-based and text-feature-based assessment schemes depend heavily on ASR accuracy, and ASR errors produce cascading errors that greatly degrade the performance of automatic spoken language assessment.
Disclosure of Invention
In view of the above, embodiments of the present application provide a training method of a spoken language scoring model, a spoken language scoring method, and related devices to address the above problems.
The first aspect of the application discloses a training method of a spoken language scoring model, comprising the following steps: acquiring voice recognition data, wherein the voice recognition data comprises voice modal data and text modal data corresponding to the voice modal data; pre-training an initial spoken language scoring model by utilizing the voice recognition data, wherein the initial spoken language scoring model comprises an acoustic encoder and a text encoder, and the voice mode corresponding to the acoustic encoder after pre-training is aligned with the text mode corresponding to the text encoder after pre-training; acquiring spoken language evaluation data; and performing model optimization on the initial spoken language scoring model after pre-training by using the spoken language evaluation data to obtain a target spoken language scoring model.
In some embodiments, the pre-training the initial spoken language scoring model based on the speech recognition data includes: initializing the acoustic encoder and the text encoder; and training the initial spoken language scoring model by using a loss function.
In some embodiments, the loss function comprises a speech recognition loss function, the initializing the acoustic encoder comprising: initializing the acoustic encoder with a preset encoder; and connecting the voice recognition loss function at a preset position of the acoustic encoder so as to learn the representation capability of the text mode.
In some embodiments, the text encoder comprises a speech-text contrastive encoding module and a speech-text matching encoding module, wherein the speech-text contrastive encoding module and the speech-text matching encoding module each comprise at least one Transformer sub-layer and share model parameters; the initializing the text encoder includes: adding a cross-attention layer between the self-attention layer and the feed-forward layer of each Transformer sub-layer in the speech-text matching encoding module to receive the output information of the acoustic encoder.
In some embodiments, the loss function comprises a speech-text contrastive loss function, and the training the initial spoken scoring model with the loss function comprises: calculating the distance from each speech vector to the text vector matched with it to obtain the loss from the voice modal data to the text modal data corresponding thereto; calculating the distance from each text vector to the speech vector matched with it to obtain the loss from the text modal data to the voice modal data corresponding thereto; and minimizing the sum of the two losses to train the initial spoken language scoring model.
In some embodiments, the loss function comprises a speech-text matching loss function, and the training the initial spoken scoring model with the loss function comprises: performing binary classification on the output of the text encoder with the speech-text matching loss function to determine whether the voice modal data and the text modal data match.
In some embodiments, the spoken evaluation data includes speech response data and corresponding test question text data, and the model optimizing the initial spoken scoring model after pre-training by using the spoken evaluation data includes: inputting the spoken evaluation data into the initial spoken scoring model after pre-training; and based on the output of the initial spoken language scoring model after the pre-training, carrying out spoken language score prediction by using a mean square error.
The second aspect of the application discloses a spoken language scoring method, comprising: acquiring spoken language test data; inputting the spoken language test data into a spoken language scoring model to output a corresponding spoken language score; wherein the spoken language scoring model is derived based on the training method of the spoken language scoring model described in the first aspect.
A third aspect of the present application discloses an electronic device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the training method of the spoken language scoring model described in the first aspect, or to implement the spoken language scoring method described in the second aspect.
A fourth aspect of the present application discloses a non-transitory computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the training method of the spoken scoring model described in the first aspect, or implement the spoken scoring method described in the second aspect.
The beneficial effects of this application are as follows: the initial spoken language scoring model, which comprises an acoustic encoder and a text encoder, is pre-trained with voice recognition data so that the voice modality corresponding to the pre-trained acoustic encoder is aligned with the text modality corresponding to the pre-trained text encoder; the pre-trained initial spoken language scoring model is then optimized with spoken language evaluation data to obtain the target spoken language scoring model. Aligning the voice modality with the text modality through pre-training improves the efficiency and accuracy of automatic spoken language assessment.
Drawings
The application will be further described with reference to the accompanying drawings and embodiments, in which:
FIG. 1 is a flow chart of a training method of a spoken language scoring model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of an acoustic encoder according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text encoder according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a pre-training model according to an embodiment of the present application;
FIG. 5 is a flow chart of a spoken scoring method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a non-volatile computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The term "and/or" in this application merely describes an association relation between associated objects and indicates that three relations may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship. Further, "a plurality" herein means two or more. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the group consisting of A, B, and C. Furthermore, the terms "first," "second," and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
In order to enable those skilled in the art to better understand the technical solutions of the present application, the technical solutions of the present application are described in further detail below with reference to the accompanying drawings and the detailed description.
Referring to fig. 1, fig. 1 is a flowchart of a training method of a spoken language scoring model according to an embodiment of the present application. The method may be executed by an electronic device with computing capability, such as a microcomputer, a server, or a mobile device such as a notebook computer or a tablet computer.
It should be noted that, if there are substantially the same results, the method of the present application is not limited to the flow sequence shown in fig. 1.
In some possible implementations, the method may be implemented by a processor invoking computer readable instructions stored in a memory, as shown in fig. 1, and may include the steps of:
S11: Acquiring voice recognition data, wherein the voice recognition data comprises voice modal data and text modal data corresponding to the voice modal data.
The voice recognition data is acquired, where the voice recognition data includes voice modal data and text modal data corresponding to the voice modal data. For example, a large amount of voice recognition data is acquired; the voice modal data may be a piece of speech, such as an utterance of "the weather is nice today", and the corresponding text modal data is the text "the weather is nice today".
S12: and pre-training an initial spoken language scoring model by utilizing voice recognition data, wherein the initial spoken language scoring model comprises an acoustic encoder and a text encoder, and the voice mode corresponding to the pre-trained acoustic encoder is aligned with the text mode corresponding to the pre-trained text encoder.
The initial spoken language scoring model includes an acoustic encoder employing a convolution-augmented Transformer (Conformer) network structure. The Conformer blocks of the acoustic encoder include a plurality of Conformer sub-modules, each of which mainly includes a multi-head self-attention module (Multi-Head Self-Attention), a convolution module (Convolution), a feed-forward layer (Feed Forward), and a layer normalization module (Layer Norm), where the multi-head self-attention module is responsible for modeling global context dependencies and the convolution module is responsible for capturing correlations between local features.
The initial spoken scoring model includes a text encoder that includes a plurality of Transformer sub-modules, where a Transformer sub-module includes a self-attention layer (Self-Attention) and a feed-forward network (Feed Forward), or a self-attention layer (Self-Attention), a feed-forward network (Feed Forward), and a cross-attention layer (Cross-Attention).
The initial spoken language scoring model is pre-trained with the voice recognition data; that is, the acquired voice recognition data is used to pre-train the initial spoken language scoring model to obtain a pre-trained initial spoken language scoring model, where the voice modality corresponding to the pre-trained acoustic encoder is aligned with the text modality corresponding to the pre-trained text encoder. In other words, after pre-training, the voice modality and the text modality of the initial spoken language scoring model can represent each other.
S13: and acquiring spoken language evaluation data.
Spoken evaluation data, namely spoken evaluation task data, is acquired, such as voice data and text data corresponding to spoken question-and-answer, voice data and text data corresponding to reading-aloud evaluation, voice data and text data corresponding to topic statements, and the like.
S14: and performing model optimization on the initial spoken language scoring model after pre-training by using the spoken language evaluation data to obtain a target spoken language scoring model.
And performing model optimization on the initial spoken language scoring model after pre-training by using the spoken language evaluation data, for example, performing model optimization on the initial spoken language scoring model after pre-training based on the spoken language evaluation data so as to realize prediction of corresponding spoken language scoring, and further obtaining a target spoken language scoring model.
In this embodiment, the initial spoken language scoring model is pre-trained by using voice recognition data, where the initial spoken language scoring model includes an acoustic encoder and a text encoder, a voice mode corresponding to the pre-trained acoustic encoder is aligned with a text mode corresponding to the pre-trained text encoder, and further, the pre-trained initial spoken language scoring model is model-optimized by using spoken language evaluation data to obtain a target spoken language scoring model, and by aligning the pre-trained voice mode with the text mode, performance of automatic spoken language evaluation, such as efficiency and accuracy of automatic spoken language evaluation, is improved.
In some embodiments, pre-training the initial spoken scoring model based on the speech recognition data includes: initializing an acoustic encoder and a text encoder; and training the initial spoken language scoring model by using the loss function.
The initial spoken language scoring model is pre-trained based on the voice recognition data so that the voice modality corresponding to the pre-trained acoustic encoder is aligned with the text modality corresponding to the pre-trained text encoder. Specifically, the pre-training comprises initializing the acoustic encoder and the text encoder and then training the initial spoken language scoring model with a loss function; that is, the acoustic encoder and the text encoder of the initial spoken language scoring model are initialized, and the model is pre-trained with a preset loss function, where the loss function quantifies the difference between the model's prediction and the true match.
In some embodiments, the loss function comprises a speech recognition loss function, initializing an acoustic encoder, comprising: initializing an acoustic encoder by using a preset encoder; and connecting a voice recognition loss function at a preset position of the acoustic encoder to learn the representation capability of the text mode.
The initial spoken language scoring model includes an acoustic encoder. As shown in fig. 2, a schematic structural diagram of the acoustic encoder according to an embodiment of the present application, the input audio is framed with, for example, 10 ms of raw audio per frame, from which 40-dimensional filter-bank features are obtained; these pass through a SpecAug module and a convolutional subsampling module that downsamples to 40 ms per frame, then through a linear and dropout module, and are finally input to the Conformer Blocks module of the acoustic encoder. The Conformer Blocks include N Conformer sub-modules; as shown on the right side of fig. 2, each Conformer sub-module mainly includes a multi-head self-attention module (Multi-Head Self-Attention), a convolution module (Convolution), a feed-forward layer (Feed Forward), and a layer normalization module (Layer Norm), where the multi-head self-attention module is responsible for modeling global context dependencies and the convolution module is responsible for capturing correlations between local features.
It will be appreciated that, in the acoustic encoder, the self-attention mechanism of the Transformer can model the global correlations of the audio, while convolution can capture local correlations; combining the two thus enhances the Transformer's modeling of context dependence with the different receptive fields obtained through convolution.
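As a concrete illustration of this structure, the following is a minimal sketch of a single Conformer sub-module in PyTorch; the dimensions, kernel size, and the use of a single feed-forward layer are simplifying assumptions for illustration, not the exact configuration of the present application.

```python
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Minimal Conformer sub-module sketch: multi-head self-attention
    for global context, a convolution module for local correlations,
    a feed-forward layer, and layer normalization."""
    def __init__(self, dim=256, heads=4, kernel_size=15):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Pointwise conv -> GLU -> depthwise conv -> pointwise conv over time.
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1),
            nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.BatchNorm1d(dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, 1),
        )
        self.ff_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim)
        # Self-attention models global context dependencies.
        h = self.attn_norm(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        # Convolution captures correlations between local features.
        c = self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + c
        x = x + self.ff(self.ff_norm(x))
        return self.out_norm(x)
```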
The loss function includes a speech recognition loss function, i.e., a CTC (Connectionist Temporal Classification) loss function, which enables the model to map an input sequence to an output sequence by introducing the CTC criterion into the neural network model, without requiring knowledge of the alignment between the input sequence and the output sequence.
Further, the acoustic encoder is initialized with a preset encoder, e.g., the encoder of an ASR (Automatic Speech Recognition) model. A pre-trained ASR encoder already contains content-related information and therefore matches the spoken language evaluation task; initializing with an ASR encoder accelerates the convergence of the multi-modal pre-training and saves cost. The speech recognition loss function is connected at a preset position of the acoustic encoder so that it learns a representation adapted to the text modality. For example, if the acoustic encoder has 12 Conformer layers, the speech recognition loss function, i.e., a speech recognition CTC target, may be connected at the 6th Conformer layer, so that the spoken language scoring model retains its ASR capability during training and the 6th layer learns a representation adapted to the text modality. It should be noted that the acoustic encoder needs to provide frame-level and sentence-level representations, respectively, to interact with the text encoder; to this end, a learnable vector h0 of the same dimension as the hidden layer is spliced at the starting position of the layer-6 hidden-layer output.
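The intermediate CTC branch and the spliced sentence-level vector can be sketched as follows; this is an illustrative PyTorch sketch under assumed dimensions and vocabulary size, with hypothetical module and variable names, not the exact implementation of the present application.

```python
import torch
import torch.nn as nn

class IntermediateCTC(nn.Module):
    """Sketch: attach a speech-recognition CTC objective to an
    intermediate encoder layer (e.g., Conformer layer 6) and prepend a
    learnable vector h0, of the same dimension as the hidden layer, as
    a sentence-level representation."""
    def __init__(self, dim=256, vocab_size=5000):
        super().__init__()
        self.h0 = nn.Parameter(torch.zeros(1, 1, dim))   # learnable sentence-level vector
        self.ctc_proj = nn.Linear(dim, vocab_size)       # frame-level token logits
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, hidden, hidden_lens, targets, target_lens):
        # hidden: (batch, time, dim) output of the intermediate layer.
        batch = hidden.size(0)
        hidden = torch.cat([self.h0.expand(batch, -1, -1), hidden], dim=1)
        # CTC is computed on the frame positions only (h0 is skipped).
        log_probs = self.ctc_proj(hidden[:, 1:]).log_softmax(-1)
        # nn.CTCLoss expects (time, batch, vocab) inputs.
        loss = self.ctc(log_probs.transpose(0, 1), targets, hidden_lens, target_lens)
        return loss, hidden  # hidden now carries h0 at position 0
```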
In some embodiments, the text encoder comprises a speech-text contrastive encoding module and a speech-text matching encoding module, wherein the two modules each comprise at least one Transformer sub-layer and share model parameters; initializing the text encoder includes: adding a cross-attention layer between the self-attention layer and the feed-forward layer of each Transformer sub-layer in the speech-text matching encoding module to receive the output information of the acoustic encoder.
The initial spoken scoring model includes a text encoder whose structure is consistent with a standard BERT model. As shown in fig. 3, a schematic structural diagram of the text encoder according to an embodiment of the present application, the text encoder includes a speech-text contrastive encoding module and a speech-text matching encoding module, each comprising a plurality of Transformer sub-modules (i.e., Transformer sub-layers), and the two modules share model parameters. To initialize the text encoder, a cross-attention layer is added between the self-attention layer and the feed-forward layer of each Transformer sub-layer in the speech-text matching encoding module to receive the output of the acoustic encoder; that is, the acoustic information is injected into the text encoder through the cross-attention layer. Specifically, the M Transformer sub-modules in the speech-text contrastive encoding module each include a self-attention layer (Self-Attention) and a feed-forward network (Feed Forward), while the M Transformer sub-modules in the speech-text matching encoding module each include a self-attention layer (Self-Attention), a cross-attention layer (Cross-Attention), and a feed-forward network (Feed Forward).
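One such matching-encoder sub-layer can be sketched in PyTorch as follows; the hidden size and number of attention heads are illustrative assumptions.

```python
import torch.nn as nn

class MatchingSubLayer(nn.Module):
    """Sketch of a Transformer sub-layer of the speech-text matching
    encoding module: a cross-attention layer is inserted between
    self-attention and the feed-forward network to inject the acoustic
    encoder's output."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # text: (batch, text_len, dim); audio: (batch, frames, dim)
        t, _ = self.self_attn(text, text, text)
        text = self.norm1(text + t)
        # Queries come from the text; keys/values from acoustic hidden states.
        c, _ = self.cross_attn(text, audio, audio)
        text = self.norm2(text + c)
        return self.norm3(text + self.ff(text))
```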
In some embodiments, the loss function comprises a speech-text contrastive loss function, and training the initial spoken scoring model with the loss function comprises: calculating the distance from each speech vector to the text vector matched with it to obtain the loss from the voice modal data to the text modal data corresponding thereto; calculating the distance from each text vector to the speech vector matched with it to obtain the loss from the text modal data to the voice modal data corresponding thereto; and minimizing the sum of the two losses to train the initial spoken language scoring model.
The loss function includes a speech-text contrastive loss function, i.e., an ATC (Audio-Text Contrastive) loss function, which learns to align the latent feature spaces of speech and text by contrasting positive and negative speech-text pairs. It will be appreciated that, during training, the model tries to maximize the similarity score of correctly matched speech-text pairs while minimizing the similarity to negative samples (i.e., unmatched speech-text pairs). For example, a student's spoken answer is passed through the acoustic encoder to obtain an acoustic feature vector, the answer text is passed through the text encoder to obtain a text feature vector, and the dot product of the two modalities' feature vectors is computed as the similarity score.
Specifically, the distance from each speech vector to its matched text vector is calculated to obtain the loss from the voice modal data to the corresponding text modal data, i.e., the speech-to-text loss, where, within a batch, corresponding speech-text pairs are positive examples and non-corresponding pairs are negative examples. Likewise, the distance from each text vector to its matched speech vector is calculated to obtain the loss from the text modal data to the corresponding voice modal data, i.e., the text-to-speech loss, with the same assignment of positive and negative examples within the batch. The sum of the two losses, i.e., the total loss produced by the speech-text contrast, is minimized to train the initial spoken language scoring model.
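A minimal PyTorch sketch of this symmetric contrastive objective follows, assuming sentence-level vectors from the two encoders; the temperature value is an illustrative assumption. Each speech vector is pulled toward its paired text vector and pushed away from every other text in the batch, and vice versa, which is what aligns the two latent spaces.

```python
import torch
import torch.nn.functional as F

def atc_loss(audio_vecs, text_vecs, temperature=0.07):
    """Sketch of the speech-text contrastive (ATC) loss: within a batch,
    matched speech-text pairs are positives and all other pairings are
    negatives; the speech-to-text and text-to-speech losses are summed."""
    a = F.normalize(audio_vecs, dim=-1)   # (batch, dim) sentence-level speech vectors
    t = F.normalize(text_vecs, dim=-1)    # (batch, dim) sentence-level text vectors
    sim = a @ t.T / temperature           # dot-product similarity scores
    labels = torch.arange(sim.size(0), device=sim.device)  # diagonal = positives
    loss_a2t = F.cross_entropy(sim, labels)    # speech -> text loss
    loss_t2a = F.cross_entropy(sim.T, labels)  # text -> speech loss
    return loss_a2t + loss_t2a
```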
In some embodiments, the loss function comprises a speech-text matching loss function, and training the initial spoken scoring model with the loss function comprises: performing binary classification on the output of the text encoder with the speech-text matching loss function to determine whether the voice modal data and the text modal data match.
The loss function includes a speech-text matching loss function, i.e., an ATM (Audio-Text Matching) loss function. The output of the text encoder is binary-classified with the speech-text matching loss function to determine whether the voice modal data and the text modal data match; that is, the correlation of the speech-text multi-modal information is modeled by binary classification of speech-text matching. For example, the hidden layer of the text encoder applies a cross-attention mechanism over the hidden layer of the acoustic encoder, and binary classification is performed at the [CLS] position of the text encoder: the target value is 1 if the voice data corresponds to the text data, and 0 if it does not.
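A minimal PyTorch sketch of this matching head (the hidden dimension is an illustrative assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class ATMHead(nn.Module):
    """Sketch of the speech-text matching (ATM) objective: binary
    classification on the text encoder's [CLS] hidden state, with
    target 1 for matched speech-text pairs and 0 otherwise."""
    def __init__(self, dim=256):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)

    def forward(self, cls_hidden, labels):
        # cls_hidden: (batch, dim) [CLS] output of the matching encoding
        # module, which has cross-attended to the acoustic hidden states.
        logits = self.classifier(cls_hidden)
        return F.cross_entropy(logits, labels)
```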
In some embodiments, the spoken evaluation data includes speech answer data and corresponding test question text data, and model optimization is performed on the initial spoken scoring model after pre-training by using the spoken evaluation data, including: inputting the spoken language evaluation data into a pre-trained initial spoken language scoring model; and based on the output of the initial spoken language scoring model after the pre-training, carrying out spoken language score prediction by using a mean square error.
The spoken evaluation data includes speech response data and corresponding test question text data, covering spoken question-and-answer, reading-aloud evaluation, topic statement, and other spoken evaluation tasks; that is, two modalities of information: the student's spoken answer and the text information, where the text information includes the question and reference answer of a spoken question-and-answer item, the text of a reading-aloud item, the topic, reference answer, and scoring rubric of a topic-statement item, and the like.
Further, model optimization is performed on the pre-trained initial spoken language scoring model with the spoken evaluation data: for example, the spoken evaluation data is input into the pre-trained initial spoken language scoring model, and, based on the model's output, spoken score prediction is performed with the mean square error. That is, the text information is encoded by the text encoder, which then applies a cross-attention mechanism over the hidden layers of the acoustic encoder; the result is average-pooled and projected to one dimension, and the score of the student's response is predicted using the mean square error (MSE) loss. This realizes multi-modal end-to-end evaluation, omits the explicit recognition step, and reduces cascading errors.
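A minimal PyTorch sketch of this fine-tuning head follows; the average pooling over the text length and the hidden dimension are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ScoringHead(nn.Module):
    """Sketch of the fine-tuning stage: pool the matching encoder's
    output, project it to one dimension, and train against human
    scores with the mean square error (MSE)."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, hidden, scores):
        # hidden: (batch, text_len, dim) text-encoder output after
        # cross-attending to the acoustic encoder's hidden states.
        pooled = hidden.mean(dim=1)           # average pooling
        pred = self.proj(pooled).squeeze(-1)  # predicted spoken score
        return F.mse_loss(pred, scores), pred
```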
For ease of understanding, the multi-modal pre-training process of the present application is illustrated in fig. 4, a schematic structural diagram of the pre-training model according to an embodiment of the present application; the initial spoken language scoring model includes an acoustic encoder and a text encoder. The acoustic encoder has 2N Conformer sub-modules, and a speech recognition CTC loss function may be attached at the Nth layer, where N may be 6. The text encoder includes a speech-text contrastive encoding module and a speech-text matching encoding module. The Transformer sub-modules in the speech-text contrastive encoding module include a self-attention layer (Self-Attention) and a feed-forward network (Feed Forward), and the ATC loss function connects the acoustic encoder and the text encoder, e.g., the acoustic encoder and the speech-text contrastive encoding module. The Transformer sub-modules in the speech-text matching encoding module include a self-attention layer (Self-Attention), a feed-forward network (Feed Forward), and a cross-attention layer (Cross-Attention) between the self-attention layer and the feed-forward network; the ATM loss function models the correlation of the speech-text multi-modal information by binary classification of speech-text matching.
Referring to fig. 5, fig. 5 is a flowchart of a spoken language scoring method according to an embodiment of the present application, and the method may be applied to an electronic device with computing functions. It should be noted that, if there are substantially the same results, the method of the present application is not limited to the flow sequence shown in fig. 5.
In some possible implementations, the method may be implemented by a processor invoking computer readable instructions stored in a memory, as shown in fig. 5, and may include the steps of:
S51: Obtaining spoken language test data.
The spoken language test data includes speech response data and corresponding test question text data, covering spoken question-and-answer, reading-aloud evaluation, topic statement, and other spoken evaluation tasks; that is, two modalities of information: the student's spoken answer and the text information, where the text information includes the question and reference answer of a spoken question-and-answer item, the text of a reading-aloud item, the topic and reference answer of a topic-statement item, and the like.
S52: the spoken language test data is input into a spoken language scoring model to output a corresponding spoken language score.
The speech response data and the corresponding test question text data are input into a spoken language scoring model to obtain the spoken score of the corresponding test taker, where the spoken language scoring model is obtained by the training method of the spoken language scoring model described above: voice recognition data comprising voice modal data and corresponding text modal data is acquired; the initial spoken language scoring model, comprising an acoustic encoder and a text encoder, is pre-trained with the voice recognition data so that the voice modality corresponding to the pre-trained acoustic encoder is aligned with the text modality corresponding to the pre-trained text encoder; spoken evaluation data is acquired; and model optimization is performed on the pre-trained initial spoken language scoring model with the spoken evaluation data to obtain the target spoken language scoring model. The specific details are not repeated here.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the processor 62 being configured to execute program instructions stored in the memory 61 to implement the steps of the training method embodiment of the spoken language scoring model, or to implement the steps of the spoken language scoring method embodiment described above. In one specific implementation scenario, the electronic device 60 may include, but is not limited to, a microcomputer, a server, etc.; this is not limited herein.
Specifically, the processor 62 is configured to control itself and the memory 61 to implement the steps of the training method embodiment of the spoken language scoring model, or to implement the steps of the spoken language scoring method embodiment described above. The processor 62 may also be referred to as a CPU (Central Processing Unit), and may be an integrated circuit chip with signal processing capabilities. The processor 62 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be jointly implemented by integrated circuit chips.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a non-volatile computer readable storage medium according to an embodiment of the present application. The non-transitory computer readable storage medium 70 is used to store a computer program 701, which computer program 701, when executed by a processor, for example by the processor 62 in the above-described fig. 6 embodiment, is used to implement the steps of the above-described training method embodiment for a spoken language scoring model, or to implement the steps of the above-described spoken language scoring method embodiment.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
In the several embodiments provided in this application, it should be understood that the disclosed methods and related devices may be implemented in other ways. For example, the above-described embodiments of related devices are merely illustrative; e.g., the division of modules or units is merely a logical functional division, and there may be other divisions in actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the illustrated or discussed mutual coupling, direct coupling, or communication connection may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Those skilled in the art will readily appreciate that many modifications and variations are possible in the device and method while maintaining the teachings of the present application. Accordingly, the above disclosure should be viewed as limited only by the scope of the appended claims.
Claims (10)
1. A method of training a spoken language scoring model, comprising:
acquiring voice recognition data, wherein the voice recognition data comprises voice modal data and text modal data corresponding to the voice modal data;
pre-training an initial spoken language scoring model by utilizing the voice recognition data, wherein the initial spoken language scoring model comprises an acoustic encoder and a text encoder, and the voice mode corresponding to the acoustic encoder after pre-training is aligned with the text mode corresponding to the text encoder after pre-training;
acquiring spoken language evaluation data;
and performing model optimization on the initial spoken language scoring model after pre-training by using the spoken language evaluation data to obtain a target spoken language scoring model.
2. The method of claim 1, wherein the pre-training an initial spoken scoring model based on the speech recognition data comprises:
initializing the acoustic encoder and the text encoder; and
and training the initial spoken language scoring model by using a loss function.
3. The method of claim 2, wherein the loss function comprises a speech recognition loss function, and wherein initializing the acoustic encoder comprises:
initializing the acoustic encoder with a preset encoder;
and connecting the voice recognition loss function at a preset position of the acoustic encoder so as to learn the representation capability of the text mode.
4. The method of claim 2, wherein the text encoder comprises a speech-text contrastive encoding module and a speech-text matching encoding module, wherein the speech-text contrastive encoding module and the speech-text matching encoding module each comprise at least one Transformer sub-layer, and wherein the speech-text contrastive encoding module and the speech-text matching encoding module share model parameters;
the initializing the text encoder includes:
adding a cross-attention layer between the self-attention layer and the feed-forward layer of each Transformer sub-layer in the speech-text matching encoding module to receive the output information of the acoustic encoder.
5. The method of claim 2, wherein the loss function comprises a speech-text contrastive loss function, and the training the initial spoken scoring model with the loss function comprises:
calculating the distance from each speech vector to the text vector matched with the speech vector to obtain the loss from the voice modal data to the text modal data corresponding thereto;
calculating the distance from each text vector to the speech vector matched with the text vector to obtain the loss from the text modal data to the voice modal data corresponding thereto;
minimizing the sum of the loss from the voice modal data to the text modal data corresponding thereto and the loss from the text modal data to the voice modal data corresponding thereto, to train the initial spoken language scoring model.
6. The method of claim 2, wherein the loss function comprises a speech-text matching loss function, and the training the initial spoken scoring model with the loss function comprises:
performing binary classification on the output of the initial spoken language scoring model with the speech-text matching loss function to determine whether the voice modal data and the text modal data match.
7. The method of claim 1, wherein the spoken evaluation data includes speech response data and corresponding test question text data, and wherein the model optimizing the initial spoken scoring model after pre-training using the spoken evaluation data comprises:
inputting the spoken evaluation data into the initial spoken scoring model after pre-training;
and based on the output of the initial spoken language scoring model after the pre-training, carrying out spoken language score prediction by using a mean square error.
8. A method of spoken language scoring, comprising:
acquiring spoken language test data;
inputting the spoken language test data into a spoken language scoring model to output a corresponding spoken language score;
wherein the spoken language scoring model is derived based on the training method of the spoken language scoring model of any one of claims 1 to 7.
9. An electronic device comprising a memory and a processor coupled to each other, the processor configured to execute program instructions stored in the memory to implement the method of training the spoken scoring model of any one of claims 1-7 or to implement the method of spoken scoring of claim 8.
10. A non-transitory computer readable storage medium having program instructions stored thereon, which when executed by a processor, implement the method of training the spoken scoring model of any one of claims 1 to 7, or implement the method of spoken scoring of claim 8.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311745175.0A | 2023-12-18 | 2023-12-18 | Training method of spoken language scoring model, spoken language scoring method and related equipment
Publications (1)

Publication Number | Publication Date
---|---
CN117877483A | 2024-04-12
Family
ID=90576392
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202311745175.0A | Training method of spoken language scoring model, spoken language scoring method and related equipment | 2023-12-18 | 2023-12-18

Country Status (1)

Country | Link
---|---
CN | CN117877483A (en), Pending
Cited By (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN118675550A * | 2024-08-23 | 2024-09-20 | 成都佳发安泰教育科技股份有限公司 | English listening and speaking evaluation method, device, equipment and storage medium
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |