CN108417210B

CN108417210B - Word embedding language model training method, word recognition method and system

Info

Publication number: CN108417210B
Application number: CN201810022130.3A
Authority: CN
Inventors: 俞凯; 陈瑞年
Original assignee: Shanghai Jiaotong University; AI Speech Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2020-06-26
Anticipated expiration: 2038-01-10
Also published as: CN108417210A

Abstract

The invention discloses a word embedding language model training method, which comprises the following steps: determining attributes of all words in a corpus to generate a word list, wherein the attributes comprise part-of-speech classifications of all words, probability distribution of all part-of-speech classifications and probability distribution of all words under the part-of-speech classifications; generating word vectors of all words in the word list; generating part-of-speech classification vectors corresponding to part-of-speech classifications of all words in the vocabulary; and training by taking word vectors of words in the word list and part-of-speech classification vectors of the words in the word list as input and taking probability distribution of part-of-speech classifications to which the words in the word list belong and probability distribution of the words in the word list under the part-of-speech classifications as output to obtain the word embedding language model. In the embodiment of the invention, even if the OOV words are encountered, the language model is carried out, and the OOV words can be accurately identified through morphological information of the OOV words and syntactic level information of part of speech classification.

Description

Word embedding language model training method, word recognition method and system

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a word embedding language model training method, a word recognition method and a word recognition system.

Background

The language models currently applied in speech recognition systems are mainly used for scoring the words and sentences recognized by speech recognition and adding the scores of the acoustic models to obtain the optimal recognized result. The existing language model based on the neural network is slow in training speed and needs to have a fixed known word list during training. In the conventional language model, each word in the vocabulary used for training is represented by a one-hot vector, for example, a vocabulary size is 1 ten thousand (i.e., there are 1 ten thousand words in the table), so that a ten thousand dimensional vector is used to represent a word and only the bit corresponding to the word is 1. Then the vector is input into a neural network and multiplied by a word embedding matrix to be converted into a real vector, and finally the training of a language model is realized; when the words are recognized, the words are converted into real number vectors for recognition.

However, the inventor finds in the implementation of the present invention that the vocabulary used for training the language model often cannot cover all words, so when using the conventional language model, if the word to be recognized is a word that is not entered into the vocabulary (e.g., out-of-vocabulary, OOV, which appears in the future in practical applications), the conventional language model cannot correctly and reliably recognize the word (because the word to be recognized does not exist in the vocabulary, there is no vector corresponding to the word to be recognized at all, and thus the vector cannot be multiplied by the word embedding vector to obtain a real vector corresponding to the OOV), unless the word to be recognized is added to the vocabulary, and a new language model is retrained using the new vocabulary.

To solve the problems of the conventional language model, it is currently most common to represent all the out-of-vocabulary words with a special token < unk >. A certain vocabulary is first obtained and a special < unk > character is added. All OOV words in the training set are then replaced with < unk > for training. During use, all OOV words are also replaced with < unk >.

In the process of implementing the present invention, the inventor finds that the words beyond the vocabulary are usually rare words, so the training data of the words is very little, and the prior art abandons the linguistic information of the OOV words, so the final recognition result of the words is also very inaccurate.

Disclosure of Invention

Embodiments of the present invention provide a method and system for adding a new word in a neural network language model, so as to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a method for training a word-embedded language model, including:

determining attributes of all words in a corpus to generate a word list, wherein the attributes comprise part-of-speech classifications of all words, probability distribution of all part-of-speech classifications and probability distribution of all words under the part-of-speech classifications;

generating word vectors of all words in the word list;

generating part-of-speech classification vectors corresponding to part-of-speech classifications of all words in the vocabulary;

and training by taking word vectors of words in the word list and part-of-speech classification vectors of the words in the word list as input and taking probability distribution of part-of-speech classifications to which the words in the word list belong and probability distribution of the words in the word list under the part-of-speech classifications as output to obtain the word embedding language model.

In a second aspect, an embodiment of the present invention further provides a word recognition method, where the word embedding language model in the foregoing embodiment of the present invention is adopted in the method, and the method includes:

generating a word vector of a word to be recognized;

determining part-of-speech classification vectors of part-of-speech classifications of the words to be recognized;

and inputting the word vector and the part-of-speech classification vector of the word to be recognized into the word embedding language model so as to obtain the probability distribution of the part-of-speech classification to which the word to be recognized belongs and the probability distribution of the word to be recognized under the part-of-speech classification.

In a third aspect, an embodiment of the present invention further provides a word embedding language model training system, including:

the word list generation program module is used for determining the attributes of all words in the corpus to generate a word list, wherein the attributes comprise part-of-speech classifications of all words, probability distribution of all part-of-speech classifications and probability distribution of all words under the part-of-speech classifications;

the word vector generating program module is used for generating word vectors of all words in the word list;

a part-of-speech classification vector generation program module for generating part-of-speech classification vectors corresponding to part-of-speech classifications of all words in the vocabulary;

and the model training program module is used for taking word vectors of the words in the word list and part-of-speech classification vectors of the words in the word list as input, and taking the probability distribution of the part-of-speech classification to which the words in the word list belong and the probability distribution of the words in the word list under the part-of-speech classification to which the words belong as output to train so as to obtain the word embedding language model.

In a fourth aspect, an embodiment of the present invention further provides a word recognition system, including:

word embedding language model;

the word vector generating program module is used for generating word vectors of the words to be recognized;

the word list generation program module is used for determining part-of-speech classification vectors of part-of-speech classification of the words to be recognized;

and the word recognition program module is used for inputting the word vector and the part of speech classification vector of the word to be recognized into the word embedding language model so as to obtain the probability distribution of the part of speech classification to which the word to be recognized belongs and the probability distribution of the word to be recognized under the part of speech classification.

In a fifth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the word embedding language model training methods and/or the word recognition methods described above in the present invention.

In a sixth aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the word embedding language model training methods and/or word recognition methods of the present invention described above.

In a seventh aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute any one of the above word embedding language model training method and/or the word recognition method.

The embodiment of the invention has the beneficial effects that: in the embodiment of the invention, when the language model is trained, the words in the corpus are not directly taken for training, but the attributes of all the words are determined at first, wherein the attributes comprise part-of-speech classification of all the words, probability distribution of all the part-of-speech classification and probability distribution of all the words under the part-of-speech classification; morphological information and syntactic level information of words are comprehensively considered during language model training, particularly, introduction of the syntactic level information is comprehensively considered, and commonalities of the words belonging to the same part-of-speech classification are comprehensively considered, so that even if the trained language model meets OOV words, the trained language model can be accurately identified through the morphological information of the OOV words and the syntactic level information of the part-of-speech classification in practical application.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

FIG. 1 is a flow diagram of one embodiment of a method for training a word-embedding language model of the present invention;

FIG. 2 is a flow diagram of another embodiment of a method for training a word-embedding language model in accordance with the present invention;

FIG. 3 is a flow diagram of one embodiment of a method of word recognition in accordance with the present invention;

FIG. 4 is a flow diagram of another embodiment of a method of word recognition of the present invention;

FIG. 5 is a flow diagram of yet another embodiment of a word recognition method of the present invention;

FIG. 6 is a block diagram illustrating an embodiment of a word embedding language model according to the present invention;

FIG. 7 is a functional block diagram of an embodiment of a word embedding language model training system of the present invention;

FIG. 8 is a functional block diagram of another embodiment of a word embedding language model training system of the present invention;

FIG. 9 is a functional block diagram of one embodiment of a word recognition system of the present invention;

FIG. 10 is a functional block diagram of another embodiment of a word recognition system of the present invention;

FIG. 11 is a functional block diagram of yet another embodiment of a word recognition system of the present invention;

fig. 12 is a schematic structural diagram of an embodiment of an electronic device according to the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

As used in this disclosure, "module," "device," "system," and the like can refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

As shown in fig. 1, an embodiment of the present invention provides a method for training a word-embedded language model, where the method includes:

and S11, determining the attributes of all the words in the corpus to generate a word list, wherein the attributes comprise part of speech classification of all the words, probability distribution of all the part of speech classification and probability distribution of all the words under the part of speech classification.

In the word list in this embodiment, all words from the corpus are stored according to part-of-speech classifications, which may include nouns, adjectives, verbs, adverbs, and the like, and the proportion of the words included in all the part-of-speech classifications to all the words from the corpus is determined by statistics, that is, the probability distribution of all the part-of-speech classifications is determined.

Then, the proportion of the words belonging to each part of speech classification to all the words under the part of speech classification is further counted, namely, the probability distribution of all the words under the part of speech classification is determined.

And S12, generating word vectors of all words in the word list. Morphological information of the words is obtained by obtaining word vectors of the words.

S13, generating part-of-speech classification vectors corresponding to part-of-speech classifications of all words in the word list; syntactic level information (i.e., semantic information) of words belonging to a corresponding part-of-speech classification can be determined by determining part-of-speech classification vectors such that all words belonging to the same part-of-speech classification share the same part-of-speech classification vector.

S14, taking word vectors of words in the word list and part-of-speech classification vectors of the words in the word list as input, and taking probability distribution of part-of-speech classification to which the words in the word list belong and probability distribution of the words in the word list under the part-of-speech classification as output to train so as to obtain the word embedding language model.

The two vectors are directly spliced together as input, and a normalized exponential function of factorization is adopted to calculate the probability of each word at an output layer. The probability distribution of each semantic classification is calculated first, and then the probability distribution of the words under the parts of speech is calculated. Finally, the two probability distributions are multiplied to obtain the required probability distribution.

In the embodiment of the invention, when the language model is trained, the words in the corpus are not directly taken for training, but the attributes of all the words are determined at first, wherein the attributes comprise part-of-speech classification of all the words, probability distribution of all the part-of-speech classification and probability distribution of all the words under the part-of-speech classification; and morphological information and syntactic level information of words are comprehensively considered during language model training, particularly the introduction of the syntactic level information, certain commonalities of the words on semantics are utilized, parameters of the words in a word list can be utilized by an OOV word through a parameter sharing method, and the commonalities of the words belonging to the same part of speech classification are comprehensively considered, so that even if the OOV word is encountered, the trained language model can be accurately identified in practical application through the morphological information of the OOV word and the syntactic level information of the part of speech classification.

In addition, because we introduce extra information (semantic classification, morphological decomposition), we can directly model unseen new words, and the modeling is more accurate. However, in the conventional approach, modeling a new word requires collecting new data and then retraining the model, which is very time consuming. Therefore, in the actual use process, the model can greatly save the time for adding new words. The method can be further integrated into a voice recognition system to realize quick word list updating, and meanwhile, the recognition accuracy of low-frequency words is improved.

As shown in fig. 2, in some embodiments, the generating a word vector for all words in the word list includes:

s21, judging whether the words obtained from the word list are low-frequency words or not;

s22, if yes, resolving the words obtained from the word list into words, and coding the resolved words to determine corresponding word vectors;

and S23, if not, extracting the vector of the words obtained from the word list as a word vector.

In the embodiment of the invention, different processing methods are adopted for high-frequency words and low-frequency words: for high frequency words, each word has its own independent word vector (e.g., one-hot vectors may be used); for low-frequency words, firstly, the words are morphologically decomposed (for Chinese, i.e. decomposed into words), then, some sequence coding methods are further adopted to convert the words into vectors with Fixed length, and common coding methods include character-level Fixed-size conventional forgetting coding (FOFE), direct addition of word vectors, coding by using a cyclic or convolutional neural network, and the like. In the embodiment of the invention, a word is not regarded as a one-hot vector alone, but classified from a syntax level (pop, part-of-speed tag) (namely, representation of the syntax level of one-hot), and then, from a morphological level, a FOFE code is coded.

The distinction between high-frequency words and low-frequency words is made because not all the meanings of a word can be well represented by its word, so the embodiments of the present invention avoid the impact on the performance of the speech model under the above circumstances.

In addition, a set of parameters needs to be set for each word in the conventional language model. In the method of the embodiment of the present invention, since the low-frequency words are all decomposed into words (morphological decomposition), we only need to set parameters for all the high-frequency words and words, which brings about the effect that the required parameter amount is greatly reduced (usually by about 80%), and the advantage brought by the small parameter amount is that the word embedding language model obtained by the embodiment of the present invention can be embedded into some smaller devices (such as mobile phones).

Morphological decomposition of words may also be performed using phonemes from the beginning, but since some homophones have very different meanings, the phoneme decomposition method is not very effective, and the problems are overcome by using the method of the embodiment of the present invention.

In some embodiments, the training with the word vectors of the words in the vocabulary and the part-of-speech classification vectors of the words in the vocabulary as inputs and the probability distribution of the part-of-speech classifications to which the words in the vocabulary belong and the probability distribution of the words in the vocabulary under the part-of-speech classifications as outputs to obtain the word-embedded language model includes:

inputting word vectors of words in the word list and part-of-speech classification vectors of the words in the word list into a long-term and short-term memory network;

inputting the output of the long and short term memory network into a part of speech classifier to obtain the probability distribution of part of speech classification to which the words in the word list belong;

and inputting the output of the long-time and short-time memory network into a word classifier to obtain the probability distribution of the words in the word list under the part-of-speech classification.

The trained word embedded language model comprises a long-time and short-time memory network, a part-of-speech classifier and a word classifier.

As shown in fig. 3, an embodiment of the present invention further provides a word recognition method, where the method uses the word embedding language model in the embodiment of the present invention, and the method includes:

s31, generating word vectors of the words to be recognized;

s32, determining part-of-speech classification vectors of part-of-speech classifications of the words to be recognized;

s33, inputting the word vector and the part of speech classification vector of the word to be recognized into the word embedding language model to obtain the probability distribution of the part of speech classification to which the word to be recognized belongs and the probability distribution of the word to be recognized under the part of speech classification.

The language model adopted in the embodiment of the invention does not directly take the words in the corpus for training when training, but firstly determines the attributes of all the words, including part-of-speech classification of all the words, the probability distribution of all the part-of-speech classification and the probability distribution of all the words under the part-of-speech classification; and morphological information and syntactic level information of words are comprehensively considered during language model training, particularly the introduction of the syntactic level information, certain commonalities of the words on semantics are utilized, parameters of the words in a word list can be utilized by an OOV word through a parameter sharing method, and the commonalities of the words belonging to the same part of speech classification are comprehensively considered, so that even if the OOV word is encountered, the trained language model can be accurately identified in practical application through the morphological information of the OOV word and the syntactic level information of the part of speech classification.

As shown in fig. 4, in some embodiments, when the word to be recognized belongs to a vocabulary for training the word embedding language model, the generating a word vector of the word to be recognized includes:

s41, judging whether the words to be recognized are low-frequency words or not;

s42, if yes, the words to be recognized are disassembled into words, and the words obtained through disassembly are coded to be used for determining corresponding word vectors;

and S43, if not, extracting the vector of the word to be recognized as a word vector.

As shown in fig. 5, in some embodiments, when the word to be recognized does not belong to a vocabulary used for training the word embedding language model, the generating the word vector of the word to be recognized includes:

s51, determining the attribute of the word to be recognized to update the word list;

s52, the words to be recognized are disassembled into words, and the words obtained through disassembly are coded so as to be used for determining corresponding word vectors.

The method for rapidly adding the new words in the word embedding language model is realized in the embodiment. The method can be further integrated into a voice recognition system to realize quick word list updating, and meanwhile, the recognition accuracy of low-frequency words is improved.

The following further describes the embodiment of the present invention by comparing the processing of OOV words by the conventional LSTM (Long Short-Term Memory) language model with the technical solution of the embodiment of the present invention.

Introduction of LSTM language model:

the deep learning method is widely applied to the language model and has great success. A Long Short Term Memory (LSTM) network is a Recurrent Neural Network (RNN) architecture that is particularly well suited for sequences. Let V be a vocabulary, at each timestamp t, the word w is entered_tFrom a one-hot vector e_tRepresentation, then word embedding can be obtained as x_t：

x_t＝E_ie_t(1)

Wherein E_i∈R^m×|V|An input word embedding matrix, and m represents the gradation of the input word embedding. Specifically, one step of the LSTM is to convert x_t，h_t-1，c_t-1As input, and generates h_t，c_t. The details of the calculations are omitted herein. The probability distribution of the next word is computed on the output layer by affine transformation of the hidden layer followed by the softmax function:

wherein E is_o ^jIs Eo ∈ R^m×|V|Is also called output embedding, and b^jIs the bias term. We have found that the bias terms of the output layer play an important role, being highly correlated with the frequency of words.

Since most of the computational cost depends on the output layer, a factorized softmax output layer is proposed to increase the speed of the language model. This approach is based on the assumption that words can be mapped to classes. Let S be a class set. Unlike equation (2), the probability distribution of the next word of the factorized output layer is calculated as follows:

P(w_t+1＝j|w_1:t)＝P(s_t+1＝s_j|h_t)P(w_t+1＝j|s_j，h_t) (3)

wherein s is_jRepresents the word w_t+1Class V of_sjIs of class s_jA set of all words of (a). Here, the probability calculation of a word is divided into two stages: we first estimate the probability distribution of the class and then compute the probability of a particular word from the desired class. In fact a word may belong to multiple classes. But in this context, each word maps to a different class, i.e., all classes are mutually exclusive. Common parts of speech are frequency-based classes or classes derived from data-driven methods.

OOV word processing

As previously mentioned, two methods have been used in the classic LSTM language model to handle OOV word problems:

1. a special class of < UNK > is used to replace all OOV words, using another measure called alignment complexity:

wherein V_OOVIs a collection of words for all OOV words. We refer to this method as "unk" in the experiment.

2. The model is retrained with the updated vocabulary. Since OOV words have no or few positive examples in the training set, their probabilities will be assigned a small value after training. This approach may be similar to the smoothing approach used in n-gram language models. We refer to this method as "retraining" in the experiment.

Both of these conventional methods have their disadvantages: in the uk LSTM language model, the probability of misjudging OOV words is due to the frequency of OOV words not matching the training data and the test data. Furthermore, this approach ignores the linguistic information of the OOV words. The main problem with retrained LSTM language models is that they are time consuming.

In the traditional LSTM language model, word embedding for each word is independent, which creates two problems. First, new words cannot be embedded with training words. Second, this is rare due to the lack of training data. The motivation for structured word embedding is to use parameter sharing to solve both problems. Unlike the data-driven approach, the parameter sharing approach must be based on explicit rules. By using syntactic and morphological rules, we can easily find the shared parameters of OOV words and build their own structured word embedding in our model.

Morphological syntax structured embedding:

at a syntactic level, each word is assigned to a part-of-speech (POS) class. All words in the same POS (part-of-speech) class share the same POS class embedding, called syntactic embedding. A part of speech is a word with similar grammatical features. Therefore, we assume that syntactic embedding represents the basic syntactic function of a word.

For each word, we labeled its POS tag with several example sentences and selected the most common tag as a part of speech (POS tag is also available from dictionary). Example sentences for words in the vocabulary (IV) are selected from the training set. For OOV words, illustrative sentences may be composed of or selected from other data sources, such as network data. Unlike data-driven approaches, POS tag-based syntactic embedding can be easily generated for OOV words using rules.

Character (or sub-word) representations are widely used in many nlp (natural Language processing) tasks as an additional feature to improve the performance of low frequency words, particularly in morphologically rich languages. But for high frequency words, the improvement is limited. Morphological embedding is established herein to further capture the semantics of low frequency words. This is based on the assumption that the sparsity of data for low frequency words is less severe at the character level. For high frequency words, word embedding is preserved. Therefore, hybrid embedding, i.e., morphological embedding of low frequency words and word embedding of high frequency words, should be in the same dimension.

In previous literature, word embedding was combined with sub-word level features to obtain enhanced embedding of all words. In contrast, the morphological embedding of low frequency words proposed herein relies only on character-level features. Thus, it has the ability to model OOV words.

The proposed morphological embedding utilizes character-level fixed-size conventional forgetting encoding (FOFE) character information. In our model, all low frequency words are represented by the character sequence e₁: t represents, wherein e_tIs a one-hot representation of the character at the timestamp t. FOFE is based on a simple recursive formula (z)₀0) the entire sequence is encoded:

z_t＝αz_t-1+e_t(1≤t≤T) (7)

where 0< α <1 is a constant forgetting factor that controls the impact of history on the final time step in addition, a feed-Forward Neural Network (FNN) is used to convert the character-level FOFE code into the final morphological embeddings.

Combining structured embedding with the LSTM language model:

fig. 6 is a schematic structural diagram of a word embedding language model according to an embodiment of the present invention.

At the input level, structured embedding of input words is obtained by concatenating its syntactic embedding with word embedding (for high frequency words) or morphological embedding (for low frequency words).

In the output layer it is easy to use a factorised softmax structure. The output class embedding matrix Ec in formula (4) is replaced by syntactic embedding, and the output embedding matrix Eo in formula (5) is replaced by word and shape embedding.

Once training is complete, syntactic and morphological embedding of OOV words is readily available. To calculate the probability of an OOV word, we need to reconstruct the output layer parameters in equation (5): eo, b. All embedding and bias terms b in Eo are retained for IV words, and the embedding of OOV words in Eo is filled by its morphological embedding. In experiments, we find that the bias term is highly correlated with the word frequency, which means that the higher the word frequency, the larger the bias value. In this context, we use the bias term for OOV words as a small constant value of experience.

By utilizing structured word embedding, OOV words can be incorporated into the LSTM language model without requiring retraining. As we have mentioned earlier, the data sparsity of OOV words can also be mitigated by sharing parameters in the proposed model during training.

The structured word embedding language model provided by the embodiment of the invention also realizes parameter compression. In the LSTM language model, word embedding of low frequency words occupies a significant portion of the model parameters but is not well-trained. By replacing low frequency words with character representations, the number of parameters can be greatly reduced.

In the LSTM language model, the number of parameters for word embedding is 2 × | V | × H, whereas in the structured embedded LSTM language model, the total number of parameters is (| Vh | + | Vchar | + | S |) × H³Where Vh denotes a high frequency word, Vchar denotes a character set, and S denotes a POS tag group. Experiments have shown that parameters can be reduced by nearly 90% when V60000, Vh 8000, Vchar 5000, and S32.

In order to verify that the method and the system of the present invention can achieve the expected effect, the inventor performs a test based on a Short Message Service (SMS) data set to evaluate a word embedding language model (hereinafter, referred to as a structured word embedding LSTM language model).

TABLE 1 data set information

Table 1 gives details of the data set. Two different sized vocabularies are used for each data set. Complete vocabulary set V_fCovering all the words that appear in the corpus. Small vocabulary set V_sIs V_fToA subset of. In this word (IV) is defined as V_sThe word in (1) means V, out-of-vocabulary (OOV)_fInstead of V_sThe word in (1). The sms-30m dataset was also used as a training set and Mandarin spontaneous dialog test set (about 25 hours, 3K utterances) for ASR (automatic speech recognition) re-execution tasks.

1. By small vocabulary sets V_sTo train the LSTM language model and all OOV words are treated as a single entity<UNK>Symbol, called "UNK".

2. By complete vocabulary set V_fThe LSTM language model is retrained, referred to as "retraining".

For LSTM language models with structured word embedding, a small vocabulary set V is used in the training phase_sUpdating model vocabulary to V in testing stage_f。

To keep with the size of the proposed model, the input embedding size of the LSTM baseline is set to 600 and the output embedding size is set to 300. In the LSTM language model with structured embedding, the syntactic embedding size is set to 300. We use 1-layer 5000-300FNN for FOFE encoding, where 5000 is the character set V_cSet FOFE's α to 0.7, set the bias term for the new word to 0, fine tune these two empirical parameters in the active set the most frequent 8192 words are selected as high frequency words, with the other words considered low frequency words in our model.

Complexity evaluation

The results of the confusability evaluations are shown in Table 2. In particular, for "unk" LSTM, the calculation of the PPL for OOV words is replaced by equation (6). The results show that the proposed Structure Embedding (SE) method has similar performance to the uk LSTM. Retrained LSTM, however, performs worse. For further investigation, we performed PPL calculations for the in-vocabulary (IV) and out-of-vocabulary (OOV) words separately for each model. The results of the experiment are shown in table 3. The un LSTM performs best in IV words at the expense of OOV words, because its OOV words have a very high PPL. Retrained LSTM greatly increases the PPL of OOV words and decreases in IV words relative to uk LSTM. Our approach further improves PPL in OOV words with similar performance in IV words.

TABLE 2 comparison of complexity between different OOV binding methods

TABLE 3 complexity resolution of words inside and outside the vocabulary

Fast vocabulary update in ASR

In an Automatic Speech Recognition (ASR) system, a back-off n-gram is used as a language model to generate a lattice from which an n-best list is generated. The n-best list can then be re-adjusted using the neural network language model for better performance. Typically, the n-gram and neural network language models share the same vocabulary. Thus, both the n-gram and neural network language models need to be retrained when the vocabulary is updated. Compared with the neural network language model, the training time of the n-element language model can be ignored.

This experiment was divided into two stages. In the first stage, the LSTM language model and the LSTM language model with Structured Embedding (SE) are trained using the small vocabulary representations as uk LSTM and SE, respectively. An n-gram language model trained with Vs is used to generate the n-best list. We then used the uk LSTM to perform n-best list re-scoring. In the second stage, the vocabulary Vs is expanded to a larger vocabulary Vf. As the vocabulary changes, we need to retrain the unkLSTM and n-gram models. But the vocabulary of LSTM with SE is reconstructed without retraining. The retrained LSTM and the LSTM with SE are then used to redefine the n-best list generated by the new n-gram.

TABLE 4 character error Rate comparison and decomposition of in-vocabulary sentences and out-of-vocabulary sentences

The results of the experiment are shown in table 4. With the benefit of lexical expansion, retrained LSTM achieved an absolute 0.38% CER improvement in all sentences. The proposed LSTM model with structured embedding (LSTM with SE) achieves the best performance. To investigate what sentences can be derived from the proposed model to get the greatest benefit, we divided the re-scored sentences into two categories, called intra-lexical sentences (IVS) and extra-lexical sentences (OOVS), depending on whether all words are present in Vs. As shown in Table 4, the unk LSTM trained with Vs has a higher CER for out-of-vocabulary sentences because the n-grams constructed from Vs cannot produce these OOV words. By expanding the vocabulary, the retrained LSTM yields a significant improvement in CER for out-of-vocabulary sentences. Compared to retrained LSTM, the proposed model outperforms CER on both IV and OOV sentences. Moreover, the improvement in CER for OOV sentences (1.13% absolute) is significantly higher than the improvement in CER for IV sentences (0.13% absolute), meaning LSTM with SE has better ability to model OOV words. Please note that by using the proposed structured word embedded LSTM language model, the model retraining time in the conventional method can be saved and better performance can be achieved.

It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

As shown in FIG. 7, an embodiment of the present invention further provides a word embedding language model training system 700, including:

a word list generation program module 710, configured to determine attributes of all words in the corpus to generate a word list, where the attributes include part-of-speech classifications of all words, probability distributions of all part-of-speech classifications, and probability distributions of all words under the part-of-speech classifications to which the words belong;

a word vector generator module 720, configured to generate word vectors of all words in the word list;

a part-of-speech classification vector generation program module 730 for generating part-of-speech classification vectors corresponding to the part-of-speech classifications of all the words in the vocabulary;

and the model training program module 740 is configured to train by taking word vectors of words in the word list and part-of-speech classification vectors of the words in the word list as inputs and taking probability distribution of part-of-speech classifications to which the words in the word list belong and probability distribution of the words in the word list under the part-of-speech classifications to which the words belong as outputs, so as to obtain the word embedding language model.

As shown in fig. 8, in some embodiments, the word vector generator module 720 includes:

a frequency word determination program unit 721 configured to determine whether a word obtained from the vocabulary is a low frequency word;

a first word vector generating program unit 722, configured to, when it is determined that a word obtained from the word list is a low-frequency word, disassemble the word obtained from the word list into words, and encode the disassembled words to determine corresponding word vectors;

a second word vector generating program unit 723, configured to, when it is determined that a word obtained from the word list is a high-frequency word, extract a vector of the word obtained from the word list as a word vector.

As shown in fig. 9, an embodiment of the present invention further provides a word recognition system 900, including:

the word embedding language model 910 described in the above embodiments of the present invention;

a word vector generating program module 920, configured to generate a word vector of a word to be recognized;

a vocabulary generating program module 930 for determining a part-of-speech classification vector of the part-of-speech classification of the word to be recognized;

a word recognition program module 940, configured to input the word vector and the part-of-speech classification vector of the word to be recognized into the word embedding language model, so as to obtain a probability distribution of the part-of-speech classification to which the word to be recognized belongs and a probability distribution of the word to be recognized under the part-of-speech classification to which the word to be recognized belongs.

As shown in fig. 10, in some embodiments, when the word to be recognized belongs to a vocabulary for training the word embedding language model, the word vector generator module 920 includes:

a frequency word judging program unit 921 for judging whether the word to be recognized is a low frequency word;

the first word vector generation program unit 922 is configured to, when it is determined that a word obtained from the word list is a low-frequency word, parse the word to be recognized into a word, and encode the parsed word to determine a corresponding word vector;

and a second word vector generation program unit 923 configured to, when it is determined that the word obtained from the word list is a high-frequency word, extract a vector of the word to be recognized as a word vector.

As shown in fig. 11, in some embodiments, when the word to be recognized does not belong to the vocabulary used for training the word embedding language model, the word vector generator module 920 includes:

a vocabulary updating program unit 921' for determining the attribute of the word to be recognized to update the vocabulary;

the word vector generator 922' is configured to disassemble the word to be recognized into words, and encode the disassembled words to determine corresponding word vectors.

In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the word embedding language model training method and/or the word recognition method of the present invention.

In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above word embedding language model training methods and/or word recognition methods.

In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of a word embedding language model training method and/or a word recognition method.

In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the program is executed by a processor to perform the steps of a word embedding language model training method and/or a word recognition method.

The system and/or the input method system for implementing the language model construction according to the embodiment of the present invention may be used to execute the method and/or the input method for implementing the language model construction according to the embodiment of the present invention, and accordingly achieve the technical effects achieved by the method and/or the input method for implementing the language model construction according to the embodiment of the present invention, and are not described herein again.

In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).

Fig. 12 is a schematic hardware structure diagram of an electronic device for a word embedding language model training method and/or a word recognition method according to another embodiment of the present application, where as shown in fig. 12, the device includes:

one or more processors 1210 and a memory 1220, with one processor 1210 being an example in fig. 12.

The apparatus for performing the method for implementing the word embedding language model training method and/or the word recognition method may further include: an input device 1230 and an output device 1240.

The processor 1210, memory 1220, input device 1230, and output device 1240 may be connected by a bus or other means, such as by a bus connection in fig. 12.

The memory 1220 is a non-volatile computer-readable storage medium, and can be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method for implementing the word embedding language model training and/or the word recognition method in the embodiments of the present application. The processor 1210 executes various functional applications of the server and data processing by executing nonvolatile software programs, instructions and modules stored in the memory 1220, that is, implementing the above method embodiments to implement the word embedding language model training method and/or the word recognition method.

The memory 1220 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of the device for implementing word embedding language model training and/or the word recognition device, and the like. Further, the memory 1220 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 1220 optionally includes memory located remotely from processor 1210, and such remote memory may be connected to the word embedding language model training device and/or the word recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 1230 may receive input numeric or character information and generate signals related to user settings and function controls of the word embedding language model training device and/or the word recognition device. The output device 1240 may include a display device such as a display screen.

The one or more modules are stored in the memory 1220 and, when executed by the one or more processors 1210, perform the word embedding language model training method and/or the word recognition method of any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.

(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.

(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.

(5) And other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions substantially or contributing to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A method of word embedding language model training, comprising:

judging whether the words obtained from the word list are low-frequency words or not;

if so, resolving the words obtained from the word list into words, and encoding the words obtained by resolving so as to determine corresponding word vectors;

if not, extracting the vector of the word obtained from the word list as a word vector;

2. A method of word recognition using the word embedding language model of claim 1, the method comprising:

generating a word vector of a word to be recognized;

inputting the word vector and the part-of-speech classification vector of the word to be recognized into the word embedding language model so as to obtain the probability distribution of the part-of-speech classification to which the word to be recognized belongs and the probability distribution of the word to be recognized under the part-of-speech classification;

when the word to be recognized belongs to a vocabulary for training the word embedding language model, the generating a word vector of the word to be recognized includes:

judging whether the words to be recognized are low-frequency words or not;

if so, resolving the words to be recognized into characters, and coding the resolved characters to determine corresponding word vectors;

and if not, extracting the vector of the word to be recognized as a word vector.

3. The method of claim 2, wherein, when the word to be recognized does not belong to a vocabulary for training the word embedding language model, the generating a word vector for the word to be recognized comprises:

determining the attribute of the word to be recognized so as to update the word list;

and resolving the words to be recognized into words, and coding the words obtained by resolving so as to determine corresponding word vectors.

4. A word embedding language model training system, comprising:

a frequency word judgment program unit for judging whether the word obtained from the word list is a low frequency word;

a first word vector generation program unit, configured to, when it is determined that a word obtained from the word list is a low-frequency word, parse the word obtained from the word list into words, and encode the parsed words to determine corresponding word vectors;

a second word vector generation program unit configured to, when it is determined that a word obtained from the word list is a high-frequency word, extract a vector of the word obtained from the word list as a word vector;

5. A word recognition system comprising:

the word embedding language model recited in claim 4;

the word recognition program module is used for inputting the word vector and the part of speech classification vector of the word to be recognized into the word embedding language model so as to obtain the probability distribution of the part of speech classification to which the word to be recognized belongs and the probability distribution of the word to be recognized under the part of speech classification to which the word to be recognized belongs;

when the word to be recognized belongs to a vocabulary for training the word embedding language model, the word vector generation program module comprises:

the frequency word judging program unit is used for judging whether the words to be identified are low-frequency words or not;

the first word vector generation program unit is used for decomposing the word to be identified into characters when the word obtained from the word list is judged to be a low-frequency word, and coding the decomposed characters so as to determine a corresponding word vector;

and the second word vector generation program unit is used for extracting the vector of the word to be identified as a word vector when the word obtained from the word list is judged to be a high-frequency word.

6. The system of claim 5, wherein when the word to be recognized does not belong to a vocabulary for training the word embedding language model, the word vector generator module comprises:

the word list updating program unit is used for determining the attribute of the word to be recognized so as to update the word list;

and the word vector generation program unit is used for decomposing the words to be recognized into characters and coding the characters obtained by decomposition so as to determine corresponding word vectors.

7. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-3.

8. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 3.