CN111767731A - Training method and device of grammar error correction model and grammar error correction method and device - Google Patents
- Publication number
- CN111767731A (application number CN202010655492.3A)
- Authority
- CN
- China
- Prior art keywords
- error correction
- statement
- sentence
- training
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The application relates to a training method and device for a grammar error correction model, a grammar error correction method and device, a computing device, and a computer-readable storage medium. The training method comprises the following steps: performing data expansion processing based on a first training set to obtain a second training set; acquiring a second source sample statement and a second target sample statement based on the second training set; inputting the second source sample statement into a grammar error correction model to generate an error correction sample statement; determining a loss value based on the error correction sample statement and the second target sample statement; and iteratively training the grammar error correction model based on the loss value until a training stop condition is reached. Performing data enhancement processing on the existing training set expands the training set automatically and effectively reduces manual labor.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular to a training method and apparatus for a grammar error correction model, a grammar error correction method and apparatus, a computing device, and a computer-readable storage medium.
Background
When a neural network model is used for Chinese grammar error correction, a large amount of labeled data is often needed. When labeled data is lacking, annotators are usually employed to label the data, and manual labeling is time-consuming and labor-intensive.
The technical problem in the prior art is as follows: having a machine automatically correct Chinese sentences that contain grammar errors often fails to achieve the expected effect, and one important reason is the lack of a large amount of labeled data. Chinese grammar errors are of many kinds, and different annotators may annotate the same error differently, so an automatic way of expanding the training set is needed.
Disclosure of Invention
In view of the above, the present application provides a training method and apparatus for a grammar error correction model, a grammar error correction method and apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical defects in the prior art.
Specifically, the application provides the following technical scheme:
the application provides a training method of a grammar error correction model, which comprises the following steps:
performing data expansion processing based on the first training set to obtain a second training set;
acquiring a second source sample statement and a second target sample statement based on the second training set;
inputting the second source sample statement into a grammar error correction model to generate an error correction sample statement;
determining a loss value based on the error correction sample statement and the second target sample statement;
and carrying out iterative training on the grammar error correction model based on the loss value until a training stop condition is reached.
Optionally, for the training method, wherein the first training set comprises a first source sample statement and a first target sample statement;
the performing data expansion processing based on the first training set to obtain a second training set includes:
preprocessing the first source sample statement and the first target sample statement;
assigning weights to word units based on the occurrence frequency of the word units in the first training set to construct a dictionary;
corrupting the sentences contained in the source sample sentences of the first training set according to the dictionary to obtain data-expanded second source sample sentences; and constructing the second training set according to the second source sample statement and a second target sample statement corresponding to the second source sample statement.
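The dictionary construction step can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_weighted_dictionary` and the choice of relative frequency as the weight are assumptions, since the text only states that weights are assigned based on occurrence frequency.

```python
from collections import Counter


def build_weighted_dictionary(sentences):
    """Map each word unit to a weight derived from its frequency.

    `sentences` is a list of token lists. Using relative frequency as the
    weight is an assumption made for this sketch.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


# Toy corpus of already-segmented sentences (hypothetical data).
corpus = [["我", "学", "中文"], ["我", "起床"]]
dictionary = build_weighted_dictionary(corpus)
```

With this toy corpus, the word unit that appears twice receives twice the weight of each word unit that appears once.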
Optionally, for the training method, wherein the corruption process comprises a word insertion process and/or a word substitution process;
the corrupting of the sentences contained in the source sample sentences of the first training set according to the dictionary to obtain a data-expanded second source sample sentence includes:
performing word insertion processing on the first source sample statement according to the dictionary to obtain a second source sample statement of data expansion; and/or performing word substitution processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence of data expansion.
Optionally, with respect to the training method, performing word insertion processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence with data expansion includes:
a1, acquiring the first source sample statement and the sentence length n of the first source sample statement;
a2, generating a corresponding first array based on the sentence length n of the first source sample sentence;
wherein each value in the first array is a randomly generated value in the range of (0, 1);
each numerical value in the first array has a subscript i corresponding to its position in the first array, and the subscript i is an integer in the range [0, n-1];
a3, acquiring subscript i corresponding to a numerical value smaller than a first threshold value in the first array according to the preset first threshold value;
a4, randomly selecting a word unit in the dictionary based on the weight, inserting the word unit into the ith position in the first source sample sentence, and generating a second source sample sentence with data expansion after word insertion processing.
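Steps a1 to a4 can be sketched as below. The function name `insert_words`, the default threshold of 0.1, and the optional `rng` argument are assumptions for illustration. The sketch interprets "inserting at the ith position" as placing the sampled word unit immediately before the original word unit at index i. Note also that Python's `random.random()` draws from [0, 1) rather than the open interval (0, 1) named in the text.

```python
import random


def insert_words(tokens, dictionary, threshold=0.1, rng=None):
    """Corrupt a token list by weighted random word insertion (steps a1-a4)."""
    rng = rng or random.Random()
    n = len(tokens)                                    # a1: sentence length n
    draws = [rng.random() for _ in range(n)]           # a2: one random value per position
    selected = {i for i, value in enumerate(draws) if value < threshold}  # a3
    words = list(dictionary)
    weights = [dictionary[word] for word in words]
    corrupted = []
    for i, token in enumerate(tokens):                 # a4: weighted insertion at index i
        if i in selected:
            corrupted.append(rng.choices(words, weights=weights, k=1)[0])
        corrupted.append(token)
    return corrupted
```

A threshold of 0 leaves the sentence unchanged, while a threshold of 1 inserts a word unit before every position, so the threshold acts as the per-position insertion probability.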
Optionally, with respect to the training method, the performing word substitution processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence with data expansion includes:
b1, acquiring the first source sample statement and the sentence length n of the first source sample statement;
b2, generating a corresponding second array based on the sentence length of the first source sample sentence, wherein each numerical value in the second array is a numerical value in a randomly generated (0,1) range;
each numerical value in the second array has a subscript i corresponding to its position in the second array, and the subscript i is an integer in the range [0, n-1];
b3, acquiring subscript i corresponding to a numerical value smaller than a second threshold value in the second array according to the preset second threshold value;
b4, randomly selecting a word unit in the dictionary based on the weight, replacing the word unit at the ith position in the first source sample sentence with the randomly selected word unit, and generating a second source sample sentence with data expansion after word replacement processing.
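Steps b1 to b4 differ from word insertion only in that the selected position is overwritten rather than extended. A minimal sketch follows; the name `substitute_words`, the default threshold, and the `rng` argument are assumptions, and `random.random()` draws from [0, 1) rather than (0, 1).

```python
import random


def substitute_words(tokens, dictionary, threshold=0.1, rng=None):
    """Corrupt a token list by weighted random word substitution (steps b1-b4)."""
    rng = rng or random.Random()
    n = len(tokens)                                    # b1: sentence length n
    draws = [rng.random() for _ in range(n)]           # b2: one random value per position
    words = list(dictionary)
    weights = [dictionary[word] for word in words]
    corrupted = list(tokens)
    for i, value in enumerate(draws):
        if value < threshold:                          # b3: position i is selected
            # b4: replace the word unit at index i with a weighted random pick
            corrupted[i] = rng.choices(words, weights=weights, k=1)[0]
    return corrupted
```

Unlike insertion, substitution preserves the sentence length, so the corrupted sentence stays aligned position by position with the original.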
Optionally, for the training method, wherein the first training set comprises a first source sample statement and a first target sample statement;
the performing data expansion processing based on the first training set to obtain a second training set further includes:
c1, preprocessing the first source sample statement and the first target sample statement;
c2, constructing a reverse training set in the form of < a first target sample statement, a first source sample statement > based on the first source sample statement and the first target sample statement;
c3, performing reverse training on the grammar error correction model based on the reverse training set, wherein the first target sample sentence is used as the input of the grammar error correction model and the first source sample sentence is used as the target output of the grammar error correction model, and the parameters of the grammar error correction model are fixed after training for a preset number of training generations (epochs);
c4, inputting the first target sample sentences in the reverse training set into the grammar error correction model with fixed parameters, and generating a preset number of candidate error correction sentences through beam search;
c5, reordering the preset number of candidate error correction statements, and selecting the candidate error correction statements at a preset rank as second source sample statements;
c6, constructing the second training set according to the second source sample statement and the second target sample statement corresponding to the second source sample statement.
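The data flow of steps c2 and c4 to c6 can be sketched as follows. The reverse-trained model and the reranking score are represented by caller-supplied functions, because the text does not fix their interfaces; `generate_candidates`, `score`, `toy_generate`, and `toy_score` are hypothetical names, and the toy versions are stand-ins rather than a trained model or a language model.

```python
def build_reverse_pairs(pairs):
    """c2: turn <source, target> pairs into <target, source> pairs."""
    return [(target, source) for source, target in pairs]


def back_translate(pairs, generate_candidates, score, beam_width=4, rank=0):
    """c4-c6: generate candidates from each target, rerank, keep one per pair."""
    expanded = []
    for _source, target in pairs:
        candidates = generate_candidates(target, beam_width)    # c4
        ranked = sorted(candidates, key=score, reverse=True)    # c5
        expanded.append((ranked[rank], target))                 # c6
    return expanded


def toy_generate(target, k):
    # Stand-in for the frozen reverse model: drop one token at a time.
    tokens = target.split()
    return [" ".join(tokens[:i] + tokens[i + 1:]) for i in range(min(k, len(tokens)))]


def toy_score(sentence):
    # Stand-in for a language model score: shorter strings score higher.
    return -len(sentence)


first_training_set = [("I learns Chinese", "I learn Chinese")]
reverse_set = build_reverse_pairs(first_training_set)
second_training_set = back_translate(first_training_set, toy_generate, toy_score)
```

The generated sentence becomes the new source, and the original correct sentence remains its target, so each new pair still maps an erroneous sentence to its correction.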
Optionally, for the training method, the preprocessing the first source sample statement and the first target sample statement includes:
performing word segmentation processing on the first source sample sentence and the first target sample sentence, and performing separation processing on each word unit;
removing sentences with the sentence length larger than a preset threshold value in the first training set;
and removing the same sentences in the first source sample sentences and the first target sample sentences.
Optionally, for the training method, the inputting the second source sample statement into a grammar error correction model to generate an error correction sample statement includes:
inputting the second source sample statement into an encoder of the grammar error correction model for encoding to generate an encoding vector;
inputting the encoding vector into a decoder of the grammar error correction model for decoding to obtain a preset number of candidate error correction sample sentences;
reordering the candidate error correction sample sentences of the preset number;
and taking the candidate error correction sample statement with the highest score as an error correction sample statement according to the reordering result.
Optionally, for the training method, the iteratively training the grammar error correction model based on the loss value until a training stop condition is reached includes:
judging whether the loss value is smaller than a preset threshold value or not;
if not, continuing to obtain a sample sentence to be processed and a label sentence for training;
if yes, stopping training.
The application provides a grammar error correction method, which comprises the following steps:
obtaining a source statement;
inputting the source sentences into a grammar error correction model to generate grammar error correction sentences;
wherein the grammar error correction model is trained by the training method of any one of claims 1-9.
Optionally, for the grammar error correction method, the inputting the source sentence into a grammar error correction model to generate a grammar error correction sentence includes:
inputting the source sentences to an encoder of the grammar error correction model for encoding to generate encoding vectors;
and inputting the coding vector to a decoder of the grammar error correction model for decoding, and generating the grammar error correction statement.
The application provides a training apparatus for a grammar error correction model, including:
the data expansion module is configured to perform data expansion processing based on the first training set to obtain a second training set;
an obtaining module configured to obtain a second source sample statement and a second target sample statement based on the second training set;
a grammar error correction module configured to input the second source sample statement to a grammar error correction model, and generate an error correction sample statement;
a loss determination module configured to determine a loss value based on the error correction sample statement and the second target sample statement;
an iterative training module configured to iteratively train the grammar error correction model based on the loss value until a training stop condition is reached.
The application provides a grammar error correction device, including:
an obtaining module configured to obtain a source sentence;
a grammar error correction module configured to input the source sentences to a grammar error correction model to generate grammar error correction sentences;
wherein the grammar error correction model is trained by the training method of any one of claims 1-9.
The present application further provides a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any preceding paragraph when executing the instructions.
The present application also provides a computer readable storage medium storing computer instructions, wherein the instructions, when executed by a processor, implement the steps of the method of any preceding paragraph.
Advantageous effects:
According to the training method of the error correction model, the training set is expanded automatically by performing data enhancement processing on the existing training set, which effectively reduces manual labor.
In the data enhancement processing, corpus corruption processing and reverse translation processing are applied in combination, so that the data of the existing training set is effectively expanded.
During corpus corruption processing, weights are assigned to the word units in the dictionary, and the word unit to be inserted or substituted is randomly selected based on those weights, so that word units with larger weights are more likely to be selected. Compared with selecting a word unit uniformly at random, weight-based selection takes word frequency into account and better matches the objective regularities of grammatical errors, which further improves the accuracy of the model.
According to the grammar error correction method, grammar error correction of the source sentences can be achieved by using the trained grammar error correction model, and grammar error correction accuracy is improved.
According to the training device of the grammar error correction model, the training set is expanded automatically by performing data enhancement processing on the existing training set, which effectively reduces manual labor. In the training device, the corpus corruption unit and the reverse translation unit are applied in combination, so that the data of the training set is effectively expanded. In the word insertion processing subunit and the word substitution processing subunit of the training device, the word unit to be inserted or substituted is randomly selected based on the weights of the word units in the dictionary, so that word units with larger weights are more likely to be selected. Compared with selecting a word unit uniformly at random, selecting a word unit by weight before insertion or substitution better matches the objective regularities of grammatical errors, which further improves the accuracy of the training device.
The grammar error correction device provided by the application can realize grammar error correction of the source sentences by utilizing the trained grammar error correction model, and improves grammar error correction accuracy.
Drawings
FIG. 1 is a schematic flowchart of a grammar error correction model training method provided in the first embodiment of the present application;
FIG. 2 is a schematic flowchart of a grammar error correction model training method provided in the second embodiment of the present application;
FIG. 3 is a schematic flowchart of a grammar error correction model training method provided in the third embodiment of the present application;
FIG. 4 is a model diagram of a grammar error correction model training method provided in the third embodiment of the present application;
FIG. 5 is a flowchart illustrating a grammar error correction method provided in the fourth embodiment of the present application;
FIG. 6 is a block diagram of a training apparatus for a grammar error correction model provided in the fifth embodiment of the present application;
FIG. 7 is a block diagram of a grammar error correction apparatus provided in the sixth embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computing device provided in the seventh embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the present application; the present application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if," as used herein, may be interpreted as "when" or "in response to a determination," depending on the context.
First, the terms involved in one or more embodiments of the present invention are explained.
Transformer model: essentially an Encoder-Decoder structure, in which the encoder is formed by sequentially connecting 6 encoding layers and the decoder is formed by sequentially connecting 6 decoding layers. As with all generative models, the encoder receives the original input text and outputs the encoded vectors to the decoder, which generates the decoded vectors and produces the final output text.
Beam search: an algorithm commonly used in the field of machine translation to find the best translation result (the result with the largest probability). Beam search has one parameter, the beam width, which indicates how many candidates are retained at each step of generating a translation result.
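To make the role of the beam width concrete, the following is a minimal sketch of beam search over a toy next-token table. The function `beam_search`, the table `TABLE`, and the helper `toy_probs` are hypothetical; a real decoder would supply the next-token probabilities.

```python
import math


def beam_search(next_token_probs, beam_width, max_len, eos="</s>"):
    """Keep the `beam_width` highest-scoring partial sequences at each step."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        expanded = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                expanded.append((tokens, score))  # finished hypotheses carry over
                continue
            for token, prob in next_token_probs(tokens).items():
                expanded.append((tokens + [token], score + math.log(prob)))
        beams = sorted(expanded, key=lambda beam: beam[1], reverse=True)[:beam_width]
    return beams


# Toy conditional distribution keyed by the previous token (hypothetical data).
TABLE = {
    None: {"I": 0.6, "we": 0.4},
    "I": {"learn": 0.7, "learns": 0.3},
    "we": {"learn": 0.9, "learns": 0.1},
    "learn": {"</s>": 1.0},
    "learns": {"</s>": 1.0},
}


def toy_probs(prefix):
    return TABLE[prefix[-1] if prefix else None]


beams = beam_search(toy_probs, beam_width=2, max_len=3)
```

With a beam width of 2, two hypotheses survive each step, so the search returns a ranked list of candidates rather than a single greedy output.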
Word unit (token): before any actual processing, the input text needs to be segmented into language units such as words, punctuation marks, numbers or letters, which are called word units. For an English text, a word unit may be a word, a punctuation mark, a number, etc.; for a Chinese text, the smallest word unit may be a character, a punctuation mark, a number, etc.
Sentence length: the sentence length (number of words) refers to the number of word units included in a sentence.
Array (array): the array is used for storing a group of data, and the types of each element stored in the array are required to be the same, so that the size of the memory occupied by each element is consistent. The values in the array are accessed by a combination of array name and subscript. Once an array is created, its length cannot be changed, with the effective subscripts of the array elements ranging from 0 to n-1.
Encoder (encoder): and converting the sentence to be translated into a coding vector from words.
Decoder (decoder): the encoded vector is generated into a decoded vector, and the decoded vector is converted into an answer.
Language Model (Language Model): is a probabilistic model used to compute the probability of a sentence. Using a language model, it can be determined which word sequence is more likely, or given several words, the next most likely word can be predicted.
Reordering (Rerank) refers to the ordering of probability value magnitudes for multiple candidate sentences based on a language model.
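The two definitions above can be illustrated together with a small sketch. The bigram model with add-one smoothing is an assumption chosen for brevity; the text does not specify which language model is used, and the names `BigramLM` and `rerank` are hypothetical.

```python
import math
from collections import Counter


class BigramLM:
    """Toy bigram language model with add-one smoothing (an assumed choice)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded[:-1])
            self.bigrams.update(zip(padded[:-1], padded[1:]))
        self.vocab_size = len(set(self.unigrams) | {"</s>"})

    def log_prob(self, tokens):
        """Log-probability of a whole sentence under the model."""
        padded = ["<s>"] + tokens + ["</s>"]
        total = 0.0
        for prev, cur in zip(padded[:-1], padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def rerank(candidates, lm):
    """Order candidate sentences from most to least probable."""
    return sorted(candidates, key=lm.log_prob, reverse=True)


lm = BigramLM([["I", "learn", "Chinese"], ["I", "learn", "English"]])
ranked = rerank([["learn", "I", "Chinese"], ["I", "learn", "Chinese"]], lm)
```

Here the candidate whose word order matches the training sentences receives the higher probability and is ranked first.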
In the present application, a training method and apparatus for a grammar error correction model, a grammar error correction method and apparatus, a computing device, and a computer-readable storage medium are provided, and each is described in detail in the following embodiments.
Example one
The present embodiment provides a training method of an error correction model, which, as shown in fig. 1, includes steps S101 to S105, and each step will be described in detail below.
S101, performing data expansion processing based on the first training set to obtain a second training set.
In this embodiment, the first training set is an existing training set with less data. It is divided into a source end (source sentence end) and a target end (target sentence end). In the first training set, the source side (source sentence side) includes a first source sample sentence, and the target side (target sentence side) includes a first target sample sentence. The first source sample statement is a statement with a grammar error, and the first target sample statement is a target sample statement with a correct grammar, which corresponds to the first source sample statement one by one.
For example, in the first training set, the first source sample statement is "learn up regardless of things", where "learn up" has grammatical errors; the first target sample sentence corresponding to the first source sample sentence is "to learn regardless of the fact".
Specifically, in the first training set, the first source sample statement and the first target sample statement exist as a statement pair, that is, <first source sample statement, first target sample statement>. When a person skilled in the art trains a grammar error correction model, the first source sample sentence at the source end (source sentence end) is generally used as the input of the error correction model, and the first target sample sentence at the target end (target sentence end) is used as the output label of the error correction model.
Specifically, the number of < first source sample statement, first target sample statement > statement pairs contained in the first training set is limited.
Further, in order to expand the data amount of the training set, a data expansion process is performed based on the first training set to obtain a second training set.
Specifically, the operations of the data expansion process include, but are not limited to: performing corpus corruption processing on the corpus at the source end (source sentence end) of the first training set (for example, performing random word insertion and word replacement processing on the first source sample sentence); or performing reverse translation based on the sentence pairs in the first training set, and expanding the first training set with the sentences obtained from the reverse translation.
Further, in this embodiment, the second source sample statements obtained by data expansion and the second target sample statements corresponding to the second source sample statements together form the second training set.
Specifically, in this embodiment, when performing data expansion processing based on the first training set, word insertion, word replacement, and reverse translation may each be performed on the same first source sample sentence to obtain three new sentences. The first target sample sentence corresponding to that first source sample sentence then corresponds to four source sample sentences in total. These four source sample sentences are used as second source sample sentences, and the four identical target sample sentences corresponding to them are the second target sample sentences.
For example, the original sentence pair in the first training set is < i learn some chinese after getting up five and a half times a day in the morning, i begin to learn some chinese after getting up five and a half times a day in the morning >, and then the first source sample sentence (denoted as w1) is respectively subjected to word insertion, word replacement, and reverse translation processing, so as to obtain new sentences:
w2, after I get up half a day five times in the morning, I just learn some Chinese, wherein the 'good' word unit is an inserted word unit;
w3, after I get up half a day in the morning, I learn some texts and replace the original text characters with book characters;
w4, and after getting up half a day and in the morning, I learn some Chinese.
Then w1 to w4 are each paired with the original first target sample statement (p1), and the resulting statement pairs are as follows:
<w1, p1>, <w2, p1′>, <w3, p1″>, <w4, p1‴>, where p1′ to p1‴ are the same as p1.
Then w1 to w4 are the second source sample statements, and p1 to p1‴ are the second target sample statements corresponding to the second source sample statements.
Further, the second source sample statement and the corresponding second target sample statement constitute a second training set.
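The pairing in this example can be sketched as a small helper that attaches the same target sentence to the original source sentence and to each augmented variant. The names `expand_pair`, `toy_insert`, `toy_substitute`, and `toy_reverse_translate` are hypothetical, and the toy augmenters are deterministic stand-ins for the randomized or model-based operations described in the text.

```python
def expand_pair(source, target, augmenters):
    """Return the original pair plus one pair per augmenter, all sharing `target`."""
    sources = [source] + [augment(source, target) for augment in augmenters]
    return [(s, target) for s in sources]


# Deterministic stand-ins for word insertion, word substitution and
# reverse translation (hypothetical; real versions are random or model-based).
def toy_insert(source, target):
    return source[:1] + ["<ins>"] + source[1:]


def toy_substitute(source, target):
    return ["<sub>"] + source[1:]


def toy_reverse_translate(source, target):
    return target[:-1]


w1 = ["I", "learn", "Chinese"]
p1 = ["I", "start", "to", "learn", "Chinese"]
second_training_set = expand_pair(
    w1, p1, [toy_insert, toy_substitute, toy_reverse_translate]
)
```

The result contains four pairs, corresponding to <w1, p1> through <w4, p1‴> above, each with an identical target.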
S102, acquiring a second source sample statement and a second target sample statement based on the second training set.
Specifically, a statement pair is obtained from the second training set produced by data expansion, and the statement pair has the form <second source sample statement, second target sample statement>.
Further, the second source sample statement is used as the input of the grammar error correction model, and the second target sample statement in the same statement pair is used as the label statement for the output of the grammar error correction model.
S103, inputting the second source sample statement into a grammar error correction model to generate an error correction sample statement.
In this embodiment, the grammar error correction model is specifically a Transformer neural network model. The Transformer neural network model comprises an Encoder and a Decoder.
Specifically, in the present embodiment, the second source sample statement in a statement pair of the second training set is input into the grammar error correction model, and the error correction sample statement is generated through model processing.
S104, determining a loss value based on the error correction sample statement and the second target sample statement.
Specifically, the error correction sample statement generated by the grammar error correction model is compared with the second target sample statement (i.e., the label statement) corresponding to the second source sample statement input into the model, and a loss value is calculated by a loss function.
In practical applications, the loss function may be, for example, a categorical cross-entropy function, a maximum entropy function, or the like, which is not limited in this application.
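As one illustration of the cross-entropy option, the sketch below computes a token-level loss from per-position predicted distributions. Averaging over target tokens and representing each distribution as a plain dictionary are assumptions made for this example; the name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(predicted_distributions, target_tokens):
    """Average negative log-probability assigned to each target token."""
    total = 0.0
    for distribution, token in zip(predicted_distributions, target_tokens):
        total -= math.log(distribution[token])
    return total / len(target_tokens)


# Hypothetical model output: one probability distribution per output position.
predictions = [
    {"I": 0.5, "we": 0.5},
    {"learn": 0.25, "learns": 0.75},
]
loss = cross_entropy(predictions, ["I", "learn"])
```

The loss shrinks toward 0 as the model assigns probability closer to 1 to every token of the label statement.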
And S105, carrying out iterative training on the grammar error correction model based on the loss value until a training stopping condition is reached.
Specifically, the threshold value of the loss value may be set in advance as a condition for stopping training. For example, the threshold value is set to 0.2. This is not limited by the present application.
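The stop condition can be sketched as a loop that keeps training until the loss falls below the preset threshold. The names `train_until_threshold` and `make_toy_step` are hypothetical, the `max_steps` cap is an added safeguard that the text does not mention, and the toy step function merely halves a stored loss in place of a real training step.

```python
def train_until_threshold(step_fn, loss_threshold=0.2, max_steps=10_000):
    """Call `step_fn` (one training step returning a loss) until loss < threshold."""
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = step_fn()
        if loss < loss_threshold:   # training stop condition reached
            return step, loss
    return max_steps, loss          # safety cap (an assumption of this sketch)


def make_toy_step(initial_loss=1.0, decay=0.5):
    # Stand-in for a real training step: the loss halves on every call.
    state = {"loss": initial_loss}

    def step():
        state["loss"] *= decay
        return state["loss"]

    return step


steps, final_loss = train_until_threshold(make_toy_step(), loss_threshold=0.2)
```

In this toy run the losses are 0.5, 0.25 and 0.125, so training stops at the third step, the first one below the example threshold of 0.2.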
In this embodiment, steps S101 to S105 provide a training method for a grammar error correction model in which data expansion processing generates a second training set with more data from the existing, limited first training set. The training data of the model is thereby effectively expanded, the large amount of labor required for manual labeling is saved, and the model training effect is further improved.
Example two
The embodiment provides a training method of a grammar error correction model, which is shown in fig. 2 and specifically includes step S201 to step S207.
S201, performing data expansion processing based on the first training set to obtain a second training set.
In this embodiment, the first training set includes a first source sample statement and a first target sample statement.
The first training set is an already existing training set with less data. It is divided into a source end (source sentence end) and a target end (target sentence end). In the first training set, the source side (source sentence side) includes a first source sample sentence, and the target side (target sentence side) includes a first target sample sentence. The first source sample statement is a statement with a grammar error, and the first target sample statement is a target sample statement with a correct grammar, which corresponds to the first source sample statement one by one.
Specifically, in the first training set, the first source sample statement and the first target sample statement exist in a statement pair, that is, < first source sample statement, first target sample statement >. In the application, in the process of training the grammar error correction model based on the existing first training set, a first source sample sentence at a source end (a source sentence end) is used as an error correction model input, and a first target sample sentence at a target end (a target sentence end) is used as an output label of the error correction model.
Further, in this embodiment, the first source sample statement and the first target sample statement are preprocessed first.
The pretreatment specifically comprises the following steps: performing word segmentation processing on the first source sample sentence and the first target sample sentence, and performing separation processing on each word unit;
removing sentences with the sentence length larger than a preset threshold value in the first training set;
and removing the same sentences in the first source sample sentences and the first target sample sentences.
Further, the word units in all sentences included in the first source sample sentences and the first target sample sentences are separated from each other by spaces; overlong or overly short sentences in the first training set are deleted, where, for example, a sentence with more than 30 word units is overlong and a sentence with fewer than 5 word units is overly short; in addition, when a first source sample sentence and its first target sample sentence in the first training set are the same sentence, that sentence pair is deleted.
In this embodiment, preprocessing the existing first training set in this way avoids the adverse influence of overlong or overly short sentences on the training effect and improves the accuracy of model training. Moreover, a sentence pair whose source sentence and target sentence are the same cannot train the grammar error correction capability of the model and only adds load to the training process, so deleting identical first source sample sentences and first target sample sentences from the first training set makes the model training more accurate.
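As a non-limiting illustration, the preprocessing described above may be sketched in Python as follows. The function name preprocess and its parameter names are hypothetical, and the sketch assumes that the sentences have already been segmented into word units separated by spaces; the limits of 5 and 30 word units follow the example given above.

```python
def preprocess(pairs, min_len=5, max_len=30):
    """Filter (source, target) sentence pairs of the first training set.

    Assumption: every sentence is already segmented, with word units
    separated by spaces.
    """
    kept = []
    for source, target in pairs:
        src_units = source.split()
        tgt_units = target.split()
        # Drop pairs in which either side is overlong or overly short.
        if not (min_len <= len(src_units) <= max_len):
            continue
        if not (min_len <= len(tgt_units) <= max_len):
            continue
        # Drop pairs whose source and target sentences are identical.
        if src_units == tgt_units:
            continue
        kept.append((" ".join(src_units), " ".join(tgt_units)))
    return kept
```

Applying both length limits to the source side and the target side is one possible reading; the limits may equally be applied to one side only.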
Further, in this embodiment, a second training set is obtained by performing data expansion processing based on the preprocessed first training set.
Specifically, the data expansion processing includes decomposed corpus processing and reverse translation processing. The two processing methods are completely independent and can be used in combination in actual application; for example, the decomposed corpus processing is performed first, and the statement pairs obtained by it are then subjected to reverse translation processing. Alternatively, only the decomposed corpus processing or only the reverse translation processing may be performed. This is not limited by the present application.
Specifically, the performing data expansion processing based on the first training set to obtain the second training set includes:
preprocessing the first source sample statement and the first target sample statement;
carrying out weight assignment on the word units based on the occurrence frequency of the word units in the first training set to construct a dictionary;
decomposing sentences contained in the source sample sentences of the first training set according to the dictionary to obtain second source sample sentences of data expansion; and constructing the second training set according to the second source sample statement and a second target sample statement corresponding to the second source sample statement.
The preprocessing of the first source sample statement and the first target sample statement is already described in the foregoing, and is not repeated here.
In this embodiment, a dictionary is constructed from the word units appearing in all sentences contained in the preprocessed first training set, and weights are assigned according to the frequency with which the word units appear in the first training set: the more often a word unit occurs, the greater its weight, and the greater the weight of a word unit, the greater the probability that the word unit is selected. For example, if one word unit occurs in the first training set more frequently than another word unit, the weight of the former is greater than the weight of the latter.
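As a non-limiting illustration, the construction of the weighted dictionary may be sketched as follows. The function name build_weighted_dictionary is hypothetical, and using the relative frequency of a word unit as its weight is an assumption; the description only requires that a more frequent word unit receives a greater weight.

```python
from collections import Counter


def build_weighted_dictionary(sentences):
    """Map each word unit to a weight that grows with its frequency.

    Assumption: the weight is the relative frequency of the word unit
    in the (space-separated) sentences of the first training set.
    """
    counts = Counter(unit for sentence in sentences for unit in sentence.split())
    total = sum(counts.values())
    return {unit: count / total for unit, count in counts.items()}
```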
Further, the statement pair in the second training set exists in the form of < second source sample statement, second target sample statement >. Specifically, in the second training set, there are a plurality of identical second target sample sentences, and the identical second target sample sentences respectively correspond to different second source sample sentences.
Further, in this embodiment, the decomposing the sentences included in the source sample sentences of the first training set according to the dictionary to obtain a second source sample sentence of data expansion includes:
performing word insertion processing on the first source sample statement according to the dictionary to obtain a second source sample statement of data expansion; and/or performing word substitution processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence of data expansion.
Further, the performing word insertion processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence with data expansion includes:
a1, acquiring the first source sample statement and the sentence length n of the first source sample statement;
a2, generating a corresponding first array based on the sentence length n of the first source sample sentence;
wherein each value in the first array is a randomly generated value in the range of (0, 1);
each numerical value in the first array has a subscript i corresponding to its position in the first array, and the subscript i is an integer in the range [0, n-1];
a3, acquiring subscript i corresponding to a numerical value smaller than a first threshold value in the first array according to the preset first threshold value;
a4, randomly selecting a word unit in the dictionary based on the weight, inserting the word unit into the ith position in the first source sample sentence, and generating a second source sample sentence with data expansion after word insertion processing.
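As a non-limiting illustration, steps a1 to a4 may be sketched as follows. The function name insert_words and its parameters are hypothetical. Two assumptions are made: random.random draws from [0, 1) rather than strictly (0, 1), and the selected word units are inserted from right to left, so that every subscript i still refers to a position of the original sentence.

```python
import random


def insert_words(units, dictionary, threshold=0.3, rng=random):
    """Return a copy of `units` with weighted-random word units inserted."""
    # a1/a2: one random value per position of the sentence of length n.
    draws = [rng.random() for _ in range(len(units))]
    # a3: subscripts whose value is smaller than the first threshold.
    positions = [i for i, value in enumerate(draws) if value < threshold]
    words = list(dictionary)
    weights = [dictionary[word] for word in words]
    result = list(units)
    # a4: insert a weighted-random word unit at each selected position.
    # Going right to left keeps the remaining subscripts valid.
    for i in reversed(positions):
        new_unit = rng.choices(words, weights=weights, k=1)[0]
        result.insert(i, new_unit)
    return result
```

Passing a seeded random.Random instance as rng makes the expansion reproducible.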
Further, the performing word substitution processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence with data expansion includes:
b1, acquiring the first source sample statement and the sentence length n of the first source sample statement;
b2, generating a corresponding second array based on the sentence length of the first source sample sentence, wherein each numerical value in the second array is a numerical value in a randomly generated (0,1) range;
each numerical value in the second array has a subscript i corresponding to its position in the second array, and the subscript i is an integer in the range [0, n-1];
b3, acquiring subscript i corresponding to a numerical value smaller than a second threshold value in the second array according to the preset second threshold value;
b4, randomly selecting a word unit in the dictionary based on the weight, replacing the word unit at the ith position in the first source sample sentence with the randomly selected word unit, and generating a second source sample sentence with data expansion after word replacement processing.
Further, in this embodiment, in the word insertion and word substitution processing, the "randomly selecting a word unit in the dictionary based on weight" specifically refers to: in the process of randomly selecting word units, word units with larger weights in the dictionary are easier to select, for example, word units with weight 0.9 are easier to select than word units with weight 0.1.
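As a non-limiting illustration, steps b1 to b4 and the weight-based random selection may be sketched as follows. The names pick_word and substitute_words are hypothetical, and the weights are assumed to be stored as a mapping from word unit to weight.

```python
import random


def pick_word(dictionary, rng=random):
    """Randomly select one word unit; larger weights are selected more often."""
    words = list(dictionary)
    weights = [dictionary[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


def substitute_words(units, dictionary, threshold=0.3, rng=random):
    """Return a copy of `units` with some positions replaced."""
    # b1/b2: one random value per position of the sentence.
    draws = [rng.random() for _ in range(len(units))]
    result = list(units)
    for i, value in enumerate(draws):
        # b3: positions whose value is smaller than the second threshold.
        if value < threshold:
            # b4: replace the word unit at position i.
            result[i] = pick_word(dictionary, rng)
    return result
```

Because substitution does not change the sentence length, the positions can be processed in any order.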
For example, the first source sample statement is: "i do not have high level, but I compete with other students" (denoted as $m_0$);
The first target sample sentence is: "although I is not high, I should compete with other students" (denoted as $p_0$).
In the case of the word insertion processing, the length of the first source sample sentence is 17, so a first array (0.5, 0.5, 0.1, 0.7, 0.5, ..., 0.6) is randomly generated based on the sentence length, the subscripts of the numerical values in the array being 0 to 16. With a preset first threshold of 0.3, the value 0.1 in the generated first array is smaller than the first threshold, and its subscript i is 2. A word unit "not" is then randomly selected from the constructed dictionary based on the weights and inserted at position 2 of the original first source sample sentence, generating the data-expanded second source sample sentence "I is not high, but I wants to compete with other students" (denoted as $m_1$).
Further, in this embodiment, when the first array generated for the first source sample statement contains a plurality of values smaller than the preset first threshold, the word units selected from the dictionary may be inserted into the first source sample statement simultaneously at all the corresponding subscripts, or may be inserted in batches or one by one. If the word units are inserted in batches or one by one, the ith position is updated after each batch or each word unit is inserted, to account for the word units already inserted.
In the case of the word substitution processing, the length of the first source sample sentence is 17, so a second array (0.5, 0.5, 0.1, 0.7, 0.5, ..., 0.6) is randomly generated based on the sentence length, the subscripts of the numerical values in the array being 0 to 16. With a preset second threshold of 0.3, the value 0.1 in the generated second array is smaller than the second threshold, and its subscript i is 2. A word unit "you" is then randomly selected from the constructed dictionary based on the weights, and the word unit at position 2 of the original first source sample sentence is replaced with "you", generating the data-expanded second source sample sentence "I is not high but competed with other students" (denoted as $m_2$).
In this embodiment, the data-expanded second source sample sentences $m_1$ and $m_2$ are obtained through the word insertion and word substitution processing, and the second target sample statements corresponding to $m_1$ and $m_2$ are both the same as $p_0$.
In the training method for the grammar error correction model provided by this embodiment, weights are assigned to the word units based on their frequency of occurrence when the dictionary is constructed, and the word units to be inserted or substituted during the decomposed corpus processing are randomly selected from the dictionary based on those weights, so that word units with larger weights are more likely to be selected. This accords with the objective pattern of grammatical errors, namely that words which occur more frequently are more likely to be involved in grammatical mistakes, and thus further improves the model training effect.
Further, the performing data expansion processing based on the first training set to obtain a second training set further includes:
c1, preprocessing the first source sample statement and the first target sample statement;
c2, constructing a reverse training set in the form of < a first target sample statement, a first source sample statement > based on the first source sample statement and the first target sample statement;
c3, performing reverse training on the grammar error correction model based on the reverse training set, wherein the first target sample sentence is used as the input of the grammar error correction model and the first source sample sentence is used as its target output, and the parameters of the grammar error correction model are fixed after training for a preset number of epochs;
c4, inputting the first target sample sentences in the reverse training set into the grammar error correction model with fixed parameters, and generating a preset number of candidate error correction sentences through beam search;
c5, reordering the candidate error correction statements in the preset number, and selecting the candidate error correction statements in the preset sequence as second source sample statements;
c6, constructing the second training set according to the second source sample statement and the second target sample statement corresponding to the second source sample statement.
The preprocessing of the first source sample statement and the first target sample statement is described in detail in the foregoing.
Specifically, the step c5 of reordering the preset number of candidate error correction statements uses a trained Language Model (LM).
Specifically, for any sequence of words, the language model can calculate the probability that the sequence is a sentence. For example, word sequence A, "weather of today | true | good | o", is obviously a sentence, and a well-trained language model gives it a high probability; word sequence B, "today | fruit | learns | not as", is clearly not a sentence, and a well-trained language model gives it a very small probability.
Further, suppose a language model is created for Chinese, with V representing a dictionary, V = {sun, sunrise, moon, cloud, human}. In practical applications the dimension of V is very high and can reach tens of thousands or hundreds of thousands.
A sentence composed of words is represented as $w_1 w_2 w_3 \ldots w_n$, where each $w_i$ belongs to the dictionary V.
The role of the language model is: given the dictionary V, it can compute the probability $p(w_1 w_2 w_3 \ldots w_n)$ that an arbitrary word sequence is a sentence, where $p \ge 0$. Thus, for each given word sequence, the corresponding $p(w_1 w_2 w_3 \ldots w_n)$ can be calculated with the language model, and a plurality of word sequences can then be reordered based on the probability p.
Further, the simplest way for the language model to learn $p(w_1 w_2 w_3 \ldots w_n)$ from data is counting. Specifically, assuming there are N sentences in total in the training set, the number of times the sequence $w_1 w_2 w_3 \ldots w_n$ occurs in the training set, $\mathrm{count}(w_1 w_2 w_3 \ldots w_n)$, is counted, and then $p(w_1 w_2 w_3 \ldots w_n) = \mathrm{count}(w_1 w_2 w_3 \ldots w_n)/N$. However, this method has almost no predictive capability: once a word sequence does not appear in the training set, the model outputs a probability of 0, which is obviously unreasonable.
Thus, to learn $p(w_1 w_2 w_3 \ldots w_n)$ more reasonably, an n-gram language model may be employed. The principle of the n-gram language model is as follows: expand p according to the chain rule:
$$p(w_1 w_2 w_3 \ldots w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})$$
Further, to simplify the conditional probability $p(w_i \mid w_1, \ldots, w_{i-1})$, the first-order Markov assumption is introduced: each word depends only on the previous word;
then $p(w_i \mid w_1, \ldots, w_{i-1}) = p(w_i \mid w_{i-1})$;
Further, a second-order Markov assumption may also be introduced, under which each word depends on the two preceding words; the principle is similar to that of the first-order Markov assumption and is not described in detail here. In general, when each word depends on the N-1 preceding words (here N denotes the order of the n-gram model, not the number of sentences), the following is finally obtained:
$$p(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-N+1}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-N+1}, \ldots, w_{i-1})}$$
where count(·) represents the number of occurrences in the training set. Note that the n-gram model has a very large number of parameters, on the order of $|V|^N$, and in practice many of them do not appear in the training set, i.e. $\mathrm{count}(w_{i-N+1}, \ldots, w_{i-1}, w_i) = 0$, so that the model assigns a probability of 0 to many sentences when making predictions. To avoid this, some smoothing needs to be applied to the case count(·) = 0; the simplest method is to increase the number of occurrences of every phrase by 1.
The foregoing is a specific principle of the n-gram language model.
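As a non-limiting illustration, a language model under the first-order Markov assumption with add-one smoothing, and its use for reordering candidate sentences, may be sketched as follows. The class name BigramLM, the sentence-start marker "<s>", and the use of log probabilities are assumptions made for this sketch; the language model actually employed is not limited to this form.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one smoothing (illustrative sketch)."""

    def __init__(self, sentences):
        self.vocab = set()
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for sentence in sentences:
            # "<s>" is a hypothetical sentence-start marker.
            units = ["<s>"] + sentence.split()
            self.vocab.update(units)
            self.context_counts.update(units[:-1])
            self.bigram_counts.update(zip(units, units[1:]))

    def log_prob(self, sentence):
        """Sum of log p(w_i | w_{i-1}) with every count increased by 1."""
        units = ["<s>"] + sentence.split()
        total = 0.0
        for prev, cur in zip(units, units[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + len(self.vocab)
            total += math.log(numerator / denominator)
        return total

    def rerank(self, candidates):
        """Order candidate sentences from most to least probable."""
        return sorted(candidates, key=self.log_prob, reverse=True)
```

Because of the smoothing, a word sequence that never appears in the training data still receives a non-zero probability.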
In summary, through the above reverse translation process, the second source sample sentence of the data extension can be generated based on the first target sample sentence in the existing first training set, and the training set is further expanded.
For example, for a sentence pair <$m_1$, $p_1$> in the first training set, the first source sample statement and the first target sample statement are exchanged to generate a reverse training set, which comprises a plurality of sentence pairs of the form <$p_1$, $m_1$>;
the first target sample statement $p_1$ in the reverse training set is then input into the grammar error correction model for reverse translation training, with the first source sample statement $m_1$ used as the output target label of the reverse training, and after training for a certain number of epochs the parameters of the reverse translation grammar error correction model are fixed;
the first target sample statement $p_1$ in a sentence pair of the reverse training set is then input into the reverse grammar error correction model with fixed parameters, and a preset number of candidate error correction sentences are generated through beam search; for example, with the beam size set to 12, 12 candidate error correction sentences corresponding to the input first target sample sentence are generated;
the 12 generated candidate error correction sentences are then sorted with the trained language model, and the candidate error correction sentences at a preset rank are selected as second source sample sentences, for example the candidate error correction sentence ranked 4th.
Further, in this embodiment, one or more error correction statements may be selected according to the ordering of the candidate error correction statements, which is not limited in this application.
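As a non-limiting illustration, the data flow of the reverse translation processing may be sketched as follows. The reverse model and the language model are not implemented here: generate_candidates and score are hypothetical callables standing in for the beam search of the reverse grammar error correction model with fixed parameters and for the trained language model, respectively.

```python
def build_reverse_training_set(pairs):
    """Swap <source, target> pairs into <target, source> pairs."""
    return [(target, source) for source, target in pairs]


def back_translate(targets, generate_candidates, score, beam_size=12, rank=4):
    """Create <second source, second target> pairs from target sentences.

    generate_candidates(sentence, beam_size) -> list of candidate sentences
    score(sentence) -> float, higher meaning more probable
    Both callables are placeholders supplied by the caller.
    """
    expanded = []
    for target in targets:
        candidates = generate_candidates(target, beam_size)
        ranked = sorted(candidates, key=score, reverse=True)
        # Select the candidate at the preset rank (1-based), if present.
        if len(ranked) >= rank:
            expanded.append((ranked[rank - 1], target))
    return expanded
```

Choosing a candidate that is not ranked first is what yields a sentence containing errors, which then serves as the second source sample sentence for the unchanged target sentence.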
Further, the second training set is constructed according to the second source sample statement and a second target sample statement corresponding to the second source sample statement. Wherein the second target sample sentence is a sentence input in the reverse translation process.
Further, the existing data of the first training set is expanded through the decomposed corpus processing (word insertion, word substitution) and the reverse translation processing, so as to obtain a second training set.
In the embodiment, through data expansion processing, the second training set can be automatically generated based on the first training set, so that the training set is expanded, and additional manpower is not needed for labeling, so that the time and the labor cost are saved; and when decomposed corpus processing is carried out, random selection is carried out based on the weight of the word units in the dictionary, so that the method is closer to the mistake making habit in actual life, and the accuracy and the confidence coefficient of model training can be further improved.
S202, acquiring a second source sample statement and a second target sample statement based on the second training set.
Specifically, a statement pair is obtained based on a second training set obtained after data expansion, and the form of the statement pair is < a second source sample statement and a second target sample statement >.
Further, the second source sample statement is used as a sample statement of the syntax error correction model, and the second target sample statement is used as a tag statement of the syntax error correction model.
S203, inputting the second source sample statement into the encoder of the grammar error correction model for encoding, and generating an encoding vector.
Specifically, the second source sample statement is input to the embedding layer of the grammar error correction model to generate an embedding vector;
and inputting the embedded vector into an encoder of the grammar error correction model for encoding to generate an encoding vector.
And S204, inputting the coding vector into a decoder of the syntax error correction model for decoding to obtain a preset number of candidate error correction sample sentences.
S205, reordering the candidate error correction sample sentences of the preset number.
Specifically, the preset number of candidate error correction sample sentences are reordered by using the trained language model.
The process of reordering the language models is described in detail in the foregoing, and is not repeated herein.
And S206, taking the candidate error correction sample statement with the highest score as an error correction sample statement according to the reordering result.
For example, the second source sample statement "do you know when to go to class?" is embedded by the embedding layer (embedding layer) of the grammar error correction model to generate an embedded vector, which is then input to the encoder (Encoder) of the grammar error correction model to generate an encoded vector; the encoded vector is then input into the decoder (Decoder) of the grammar error correction model to generate a preset number of candidate error correction sample statements, for example 12; the 12 candidate error correction sample sentences are then reordered with the trained language model, and according to the reordering result the candidate with the highest score is taken as the error correction sample sentence of the grammar error correction model, for example the 5th of the 12 candidates if its score is the highest.
And S207, carrying out iterative training on the grammar error correction model based on the loss value until a training stop condition is reached.
The method specifically comprises the following steps: judging whether the loss value is smaller than a preset threshold value or not;
if not, continuing to obtain a sample sentence to be processed and a label sentence for training;
if yes, stopping training.
For example, if the preset threshold is 0.2, the training is stopped when the loss value is less than 0.2.
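As a non-limiting illustration, the training stop condition may be sketched as follows. The function name train_until, the callable train_step (assumed to perform one parameter update and return the loss value), and the safety limit max_steps are hypothetical and are not part of the described method.

```python
def train_until(train_step, batches, loss_threshold=0.2, max_steps=10_000):
    """Train until the loss value is smaller than the preset threshold.

    train_step(batch) -> float is a placeholder for one training
    iteration of the grammar error correction model.
    """
    steps = 0
    loss = float("inf")
    for batch in batches:
        loss = train_step(batch)
        steps += 1
        # Stop condition: loss below threshold (or the assumed step limit).
        if loss < loss_threshold or steps >= max_steps:
            break
    return steps, loss
```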
The embodiment provides a training method of a grammar error correction model, which achieves the purpose of automatically expanding a training set by performing data enhancement processing on an existing training set and effectively reduces manual labor.
In the data enhancement processing process, the decomposed corpus processing and the reverse translation processing are combined for application, so that the data of the existing training set is effectively expanded.
And when the decomposed corpus is processed, the weight assignment is carried out on the word units in the dictionary in the word insertion and word replacement processes, the word units inserted or replaced are randomly selected based on the weight of the word units in the dictionary, the word units with large weight in the dictionary are easier to select, and compared with the method of directly and randomly selecting one word unit, the selection is carried out based on the weight, so that the method is more in line with the objective law of grammatical errors, and the accuracy of model training is further improved.
Example three
The embodiment provides a grammar error correction model training method, which is shown in fig. 3 and includes the following steps:
S301, preprocessing data.
Specifically, preprocessing a first source sample statement and a first target sample statement in an existing first training set includes:
performing word segmentation processing on the first source sample sentence and the first target sample sentence, and performing separation processing on each word unit;
removing sentences with the sentence length larger than a preset threshold value in the first training set;
and removing the same sentences in the first source sample sentences and the first target sample sentences.
Further, the word units in all sentences included in the first source sample sentences and the first target sample sentences are separated from each other by spaces; overlong or overly short sentences in the first training set are deleted, where, for example, a sentence with more than 35 word units is overlong and a sentence with fewer than 5 word units is overly short; in addition, when a first source sample sentence and its first target sample sentence in the first training set are the same sentence, that sentence pair is deleted.
And S302, enhancing data.
Specifically, data expansion processing is performed on the basis of the preprocessed first training set to obtain a second training set.
In this embodiment, the data enhancement includes the following ways: reverse translation and decomposed corpus processing.
Further, the decomposed corpus includes: word insertion, word substitution.
Further, in the training method of the grammar error correction model provided in this embodiment, the decomposed corpus may further include operations such as word exchange.
Specifically, the two manners of processing the reverse translation and the decomposed corpus are completely independent, and can be used alternately in the actual application process, for example, the decomposed corpus is processed first, and then the statement pair obtained by the decomposed corpus is subjected to the reverse translation; only the decomposed corpus processing or only the reverse translation processing may be performed. This is not limited by the present application.
Through the data enhancement, the embodiment obtains the second training set on the basis of the existing first training set.
S303, generating candidate error correction sample sentences based on the Transformer-based Chinese grammar error correction model.
And based on the obtained second training set, inputting a second source sample sentence into the Chinese grammar error correction model to generate a candidate error correction sample sentence.
Further, the basic structure of the Chinese grammar error correction model is a Transformer model structure.
The process by which the Transformer-based Chinese grammar error correction model generates candidate error correction sample statements is shown by the structure inside the dashed box in FIG. 4.
Specifically, the second source sample sentence "which movie you like well?" is first embedded by an embedding layer (Embedding layer) to generate an embedded vector;
the embedded vector is then input into an encoder (Encoder) for encoding to generate an encoded vector;
the encoded vector is then input to a decoder (Decoder) for decoding to generate a preset number of candidate error correction sample statements; for example, 10 candidate error correction sample statements $p_1$ to $p_{10}$ are generated for the input second source sample statement "which movie you like well?".
S304, reordering with the language model.
As shown by the structure outside the dashed box in FIG. 4, the trained language model is used to reorder (Rerank) the candidate error correction sample sentences $p_1$ to $p_{10}$; the reordering process of the language model has been described in detail above and is not repeated here.
Finally, based on the reordering result, the candidate with the highest score among $p_1$ to $p_{10}$ is selected as the error correction sample statement. For example, if $p_7$ has the highest score among $p_1$ to $p_{10}$, $p_7$ is taken as the error correction sample statement.
And S305, performing iterative training.
A loss value is calculated from the error correction sample statement obtained above (such as $p_7$) and the second target sample statement corresponding to the second source sample statement of step S303;
and then carrying out iterative training on the grammar error correction model based on the loss value until a training stop condition is reached.
The embodiment provides a training method of a grammar error correction model through the steps, and the aim of automatically expanding the training set is fulfilled by performing data enhancement processing on the existing training set, so that the manual labor is effectively reduced. In the data enhancement processing process, the decomposed corpus processing and the reverse translation processing are combined for application, so that the data of the existing training set is effectively expanded.
Example four
Based on the grammar error correction model obtained by the aforementioned training method, the present embodiment provides a grammar error correction method, as shown in fig. 5, including the following steps:
S501, obtaining a source sentence.
Specifically, the source sentence may be obtained, for example, from various forum websites or from the compositions in papers submitted by students. This is not limited by the present application.
S502, inputting the source sentences into a grammar error correction model to generate grammar error correction sentences.
Further, the inputting the source sentence into a grammar error correction model to generate a grammar error correction sentence includes:
inputting the source sentences to an encoder of the grammar error correction model for encoding to generate encoding vectors;
and inputting the coding vector to a decoder of the grammar error correction model for decoding, and generating the grammar error correction statement.
Further, the inputting the source sentence into a grammar error correction model to generate a grammar error correction sentence includes:
inputting the source sentences to an embedding layer of a grammar error correction model for embedding, and generating embedding vectors of the source sentences;
inputting the embedded vector into an encoder of a syntax error correction model for encoding to generate an encoding vector;
and inputting the coding vector into a decoder of a syntax error correction model for decoding to obtain a syntax error correction statement.
Specifically, the basic structure of the grammar error correction model employed in the present application is a Transformer model structure.
The Transformer model comprises an embedded layer, an encoder and a decoder.
(1) And performing embedding layer processing on the source sentence, more specifically, segmenting the source sentence to obtain a plurality of word units, then performing word embedding processing on each word unit, and finally obtaining a word vector of each word unit.
Word embedding is actually a type of technique that represents individual word units as real-valued vectors in a predetermined vector space. Each word unit is mapped to a vector (initial randomization).
The usual step of using an embedding layer is generally to pre-process the source sentence first, converting each word unit into a one-hot form of encoding. The word vector corresponding to the word unit is actually one part of the algorithm model, the word vector is represented by a predefined dimension, and the size is initialized randomly. Here, the embedding layer is actually an input layer of the syntax error correction model.
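As a non-limiting illustration, the relation between the one-hot encoding and the randomly initialized word vectors may be sketched as follows. The function names, the initialization range of plus or minus 0.1, and the vector dimension are assumptions for this sketch. Multiplying a one-hot code by the embedding table selects exactly one row of the table, which is the word vector of the word unit.

```python
import random


def build_embedding(vocab, dim, rng):
    """Assign each word unit an index and a randomly initialized vector."""
    index = {unit: i for i, unit in enumerate(vocab)}
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
    return index, table


def one_hot(position, size):
    """One-hot code with a 1.0 at `position` and 0.0 elsewhere."""
    return [1.0 if j == position else 0.0 for j in range(size)]


def embed(units, index, table):
    """Multiply each one-hot code by the table to obtain word vectors."""
    dim = len(table[0])
    vectors = []
    for unit in units:
        code = one_hot(index[unit], len(table))
        vectors.append(
            [sum(c * row[d] for c, row in zip(code, table)) for d in range(dim)]
        )
    return vectors
```

In practice the multiplication is replaced by a direct row lookup, and the table is updated during training together with the other model parameters.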
(2) Specifically, the encoder of the Transformer model comprises six encoding layers in total.
Specifically, each encoding layer in the encoder includes 1 multi-head self-attention layer (multi-head self-attention) and 1 fully connected feed-forward network (FFN).
For the encoded vectors input to the multi-head attention layer, there are 3 different vectors corresponding to each word unit, namely the word vectors Q (query), K (key) and V (value). The multi-head attention layer is calculated by projecting the word vectors Q, K and V through h different linear transformations and finally splicing the different attention results.
In the attention calculation of the encoder, the word vectors Q (query), K (key) and V (value) are all equal to each other, and they are the first encoding vector output from the previous encoding layer. For the first coding layer, the word vectors Q, K and V are the vector output by the embedding layer (word embedding) multiplied by the weight matrices.
Specifically, the multi-head attention layer is computed as follows:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (1)$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O \qquad (2)$$
where Q, K, V are the word vectors corresponding to the input encoding vector;
$\mathrm{head}_i$ is the self-attention result of each head of the multi-head attention layer;
MultiHead is the output of the multi-head attention layer;
Concat is the concatenation function;
$W_i^Q$, $W_i^K$, $W_i^V$ are the weight matrices of the linear transformations applied to the word vectors Q, K, V. For example, the three different word vectors Q, K, V of each word unit, each 64-dimensional, are obtained by multiplying the embedding vector by the three different weight matrices $W_i^Q$, $W_i^K$, $W_i^V$, each of which has dimensions 512 × 64;
$W^O$ is the weight matrix of the final linear transformation, with dimensions 512 × 512.
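By way of illustration only, the multi-head attention computation of equations (1) and (2) can be sketched in plain Python as follows. The Attention function is assumed here to be the standard scaled dot-product attention of the Transformer, which the text above does not spell out; the toy sizes (model dimension 8, 2 heads) and the random weights are likewise assumptions, chosen only to keep the example small.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Assumed scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Equations (1) and (2): per-head attention, concatenation, then W^O."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    concat = [[value for head in heads for value in head[i]] for i in range(len(q))]
    return matmul(concat, w_o)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, h = 3, 8, 2           # toy sizes; the text uses 512 and 64
d_k = d_model // h
x = rand_matrix(n, d_model, rng)  # stands in for the encoder input
w_q = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_k = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_v = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_o = rand_matrix(h * d_k, d_model, rng)
output = multi_head(x, x, x, w_q, w_k, w_v, w_o)  # encoder case: Q = K = V
```

With the dimensions given in the text, each per-head projection matrix would be 512 × 64 and the output matrix 512 × 512, which implies eight heads whose 64-dimensional results are concatenated back to 512 dimensions.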
The output of the multi-head attention layer is then input to the fully connected feed-forward network (FFN), which is computed as follows:
$$\mathrm{FFN}(x) = \max(0, xH_1 + b_1)H_2 + b_2 \qquad (3)$$
where $H_1$ and $H_2$ are parameter matrices obtained by training;
$b_1$ and $b_2$ are parameters obtained by training;
x is the output of the multi-head attention layer;
FFN(x) is the output of the fully connected layer.
The encoding vector output by the j-th encoding layer is obtained from the output of the fully connected layer.
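By way of illustration only, the feed-forward computation of equation (3) can be sketched as follows. The small hand-picked matrices are assumptions chosen so that the result can be checked by hand; in the model, the parameter matrices and biases are obtained by training.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(m, b):
    """Add the bias vector b to every row of m."""
    return [[v + bj for v, bj in zip(row, b)] for row in m]

def relu(m):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in m]

def ffn(x, h1, b1, h2, b2):
    """FFN(x) = max(0, x H1 + b1) H2 + b2."""
    hidden = relu(add_bias(matmul(x, h1), b1))
    return add_bias(matmul(hidden, h2), b2)

x = [[1.0, -2.0]]                          # one position, model dimension 2
h1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]   # 2 x 3
b1 = [0.0, 0.0, 0.0]
h2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2
b2 = [0.5, -0.5]
y = ffn(x, h1, b1, h2, b2)                 # [[1.5, -0.5]]
```

Here x H1 gives [1, -2, -3], the ReLU keeps only the first component, and the second linear map plus bias yields [1.5, -0.5].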
(3) The decoder of the Transformer model comprises six decoding layers in total.
Each decoding layer includes three sublayers: the first is a masked multi-head self-attention layer, the second is a multi-head attention layer, and the third is a feed-forward network. The multi-head attention layer and the feed-forward layer operate as described in detail above for the encoding layer, so the description is not repeated here.
Note that in the attention calculation of the decoding layers, the word vectors Q, K and V have equal dimensions. For the first decoding layer, the word vector Q corresponds to the reference vector input to the decoder, and the word vectors K and V come from the encoding vectors of the source sentence output by the encoder; for every decoding layer other than the first, the word vector Q comes from the decoding vector output by the previous decoding layer, and the word vectors K and V come from the encoding vectors of the source sentence output by the encoder.
Comparing the structure of the decoding layer with that of the encoding layer shows that the decoding layer has one additional masked multi-head self-attention layer. The masked multi-head attention layer operates in essentially the same way as the multi-head attention layer, except that a mask operation is added. The mask prevents the model from using future output words while performing grammar error correction; for example, the first word cannot refer to the result of the second word. The mask sets this information to 0, ensuring that the output at each position i depends only on the positions before i (position i itself is excluded because the decoder input is shifted right by one position in addition to being masked).
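By way of illustration only, the effect of the mask can be sketched as follows. The sketch assumes the common implementation in which masked scores are set to negative infinity before the softmax, so that the corresponding attention weights become 0; the uniform toy scores are also an assumption.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Softmax over each row, with disallowed positions forced to weight 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else NEG_INF for s, ok in zip(row, allowed)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # toy attention scores
weights = masked_softmax(scores, causal_mask(3))
# Row 0 sees only position 0; row 1 sees positions 0 and 1; row 2 sees all three.
```

Because the decoder input is shifted right by one position, allowing position i to attend to shifted positions up to and including i corresponds to depending only on the output words before i.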
Further, the grammar error correction model is obtained by training through the training method provided by the foregoing embodiment. The training process of the syntax error correction model has been described in detail in the foregoing embodiment, and is not repeated in this embodiment.
The embodiment provides a grammar error correction method, which can realize grammar error correction of a source sentence by using a trained grammar error correction model, and improve grammar error correction accuracy.
EXAMPLE five
The embodiment provides a training device for a grammar error correction model which, as shown in FIG. 6, includes the following modules:
a data expansion module 610 configured to perform data expansion processing based on the first training set to obtain a second training set.
Wherein the first training set comprises a first source sample statement and a first target sample statement.
Specifically, the data expansion module 610 includes: a corpus corruption unit and a reverse translation unit.
The corpus corruption unit is configured to:
preprocessing the first source sample statement and the first target sample statement;
carrying out weight assignment on the word units based on the occurrence frequency of the word units in the first training set to construct a dictionary;
performing corruption processing on the sentences contained in the source sample sentences of the first training set according to the dictionary to obtain data-expanded second source sample sentences; and constructing the second training set according to the second source sample statement and a second target sample statement corresponding to the second source sample statement.
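By way of illustration only, the construction of a frequency-weighted dictionary can be sketched as follows. The use of relative frequency as the weight and the whitespace segmentation are assumptions; the text states only that weights are assigned based on the occurrence frequency of the word units.

```python
from collections import Counter

def build_weighted_dictionary(sentences):
    """Map each word unit to a weight equal to its relative frequency."""
    counts = Counter(unit for sentence in sentences for unit in sentence.split())
    total = sum(counts.values())
    return {unit: count / total for unit, count in counts.items()}

dictionary = build_weighted_dictionary(["he goes to school", "she goes home"])
```

Word units that occur more often in the first training set receive a larger weight and are therefore more likely to be drawn during the subsequent corruption processing.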
The corpus corruption unit comprises a word insertion processing subunit and/or a word replacement processing subunit.
The word insertion processing subunit is configured to perform word insertion processing on the first source sample statement according to the dictionary to obtain a second source sample statement of data expansion.
The word insertion processing subunit is further configured to:
acquiring the first source sample statement and the sentence length n of the first source sample statement;
generating a corresponding first array based on the sentence length n of the first source sample sentence;
wherein each value in the first array is a randomly generated value in the range of (0, 1);
each numerical value in the first array has a subscript i corresponding to the position sequence of the numerical value in the first array, and the value range of the subscript i is an integer in the range of (0, n-1);
obtaining a subscript i corresponding to a numerical value smaller than a first threshold value in the first array according to a preset first threshold value;
and randomly selecting a word unit in the dictionary based on the weight, inserting the word unit into the ith position in the first source sample statement, and generating a data-expanded second source sample statement after the word insertion processing.
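By way of illustration only, the word insertion steps above can be sketched as follows. The toy dictionary, the threshold values and the right-to-left insertion order are assumptions; note also that Python's `random()` draws from [0, 1) rather than the open interval (0, 1).

```python
import random

def insert_words(units, dictionary, threshold, rng):
    """Insert a weighted-random dictionary word at every selected position."""
    n = len(units)
    first_array = [rng.random() for _ in range(n)]  # one draw per subscript i
    positions = [i for i, value in enumerate(first_array) if value < threshold]
    words, weights = zip(*dictionary.items())
    result = list(units)
    # Insert from right to left so that earlier subscripts remain valid.
    for i in reversed(positions):
        result.insert(i, rng.choices(words, weights=weights, k=1)[0])
    return result

dictionary = {"the": 0.5, "to": 0.3, "a": 0.2}  # toy weighted dictionary
rng = random.Random(0)
source = "he goes to school".split()
unchanged = insert_words(source, dictionary, threshold=0.0, rng=rng)
doubled = insert_words(source, dictionary, threshold=1.0, rng=rng)
```

A threshold of 0.0 selects no positions and a threshold of 1.0 selects every position; intermediate values control how heavily the source sentence is corrupted.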
The word substitution processing subunit is configured to perform word substitution processing on the first source sample statement according to the dictionary to obtain a second source sample statement of data expansion.
The word substitution processing subunit is further configured to:
acquiring the first source sample statement and the sentence length n of the first source sample statement;
generating a corresponding second array based on the sentence length of the first source sample sentence, wherein each numerical value in the second array is a numerical value in a (0,1) range generated randomly;
each numerical value in the second array has a subscript i corresponding to the position sequence of the numerical value in the second array, and the value range of the subscript i is an integer in the range of (0, n-1);
obtaining subscript i corresponding to a numerical value smaller than a second threshold value in the second array according to a preset second threshold value;
and randomly selecting a word unit in the dictionary based on the weight, replacing the word unit at the ith position in the first source sample sentence with the randomly selected word unit, and generating a data-expanded second source sample sentence after the word substitution processing.
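By way of illustration only, the word substitution steps above can be sketched as follows. The toy dictionary and the threshold values are assumptions made for this example.

```python
import random

def substitute_words(units, dictionary, threshold, rng):
    """Replace the word unit at every selected position with a weighted-random word."""
    second_array = [rng.random() for _ in units]  # one draw per subscript i
    words, weights = zip(*dictionary.items())
    return [rng.choices(words, weights=weights, k=1)[0] if value < threshold else unit
            for unit, value in zip(units, second_array)]

dictionary = {"the": 0.5, "to": 0.3, "a": 0.2}  # toy weighted dictionary
rng = random.Random(0)
source = "he goes to school".split()
unchanged = substitute_words(source, dictionary, threshold=0.0, rng=rng)
replaced = substitute_words(source, dictionary, threshold=1.0, rng=rng)
```

Unlike insertion, substitution preserves the sentence length, so the corrupted sentence stays aligned position by position with the original.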
The reverse translation unit is configured to:
preprocessing the first source sample statement and the first target sample statement;
constructing a reverse training set in the form of < a first target sample statement, a first source sample statement > based on the first source sample statement and a first target sample statement;
carrying out reverse training on the grammar error correction model based on the reverse training set, wherein the first target sample sentence is used as the input of the grammar error correction model and the first source sample sentence is used as the target output of the grammar error correction model, and fixing the parameters of the grammar error correction model after a preset number of training epochs;
inputting the first target sample statement in the reverse training set into the grammar error correction model with fixed parameters, and generating a preset number of candidate error correction statements through beam search;
reordering the preset number of candidate error correction statements, and selecting the candidate error correction statements at preset positions in the resulting order as second source sample statements;
and constructing the second training set according to the second source sample statement and a second target sample statement corresponding to the second source sample statement.
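By way of illustration only, the reverse translation steps above can be sketched as follows. The function `toy_reverse_model` is a hypothetical stand-in for the reverse-trained model with fixed parameters: it returns a fixed next-word table and ignores the input sentence. The end-of-sentence marker, the beam size, the ranking by log-probability and the preset rank are likewise assumptions.

```python
import math

def make_reverse_pairs(pairs):
    """Turn <source, target> pairs into <target, source> pairs."""
    return [(target, source) for source, target in pairs]

def beam_search(step_fn, beam_size, max_len, eos="</s>"):
    """Return up to beam_size (tokens, log_prob) hypotheses, best first."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypothesis
                continue
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

def toy_reverse_model(tokens):
    """Hypothetical frozen reverse model: next-word probabilities for a prefix."""
    table = {
        (): {"he": 0.6, "him": 0.4},
        ("he",): {"go": 0.7, "goes": 0.3},
        ("him",): {"go": 0.6, "goes": 0.4},
    }
    return table.get(tuple(tokens), {"</s>": 1.0})

pairs = [("he go", "he goes")]             # <first source, first target>
reverse_pairs = make_reverse_pairs(pairs)  # <first target, first source>
candidates = beam_search(toy_reverse_model, beam_size=3, max_len=3)
preset_rank = 1                            # assumption: take the second-ranked candidate
second_source = " ".join(t for t in candidates[preset_rank][0] if t != "</s>")
second_training_pair = (second_source, reverse_pairs[0][0])
```

Selecting a candidate other than the top-ranked one yields an erroneous sentence that differs from the original source sentence, which is what makes the resulting pair new training data.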
The corpus corruption unit and the reverse translation unit are further configured to:
performing word segmentation processing on the first source sample sentence and the first target sample sentence, and performing separation processing on each word unit;
removing sentences with the sentence length larger than a preset threshold value in the first training set;
and removing the same sentences in the first source sample sentences and the first target sample sentences.
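By way of illustration only, the preprocessing steps above can be sketched as follows. Whitespace segmentation and the length threshold of 5 are assumptions, and the third step is interpreted here as discarding pairs whose source and target sentences are identical.

```python
def preprocess(pairs, max_len):
    """Segment, drop over-long sentences, and drop pairs with identical sides."""
    cleaned = []
    for source, target in pairs:
        src_units, tgt_units = source.split(), target.split()  # word segmentation
        if len(src_units) > max_len or len(tgt_units) > max_len:
            continue  # sentence length exceeds the preset threshold
        if src_units == tgt_units:
            continue  # source already equals target; nothing to correct
        cleaned.append((" ".join(src_units), " ".join(tgt_units)))
    return cleaned

first_training_set = [
    ("he go to school", "he goes to school"),
    ("I am fine", "I am fine"),
    ("a b c d e f", "a b c d e f g"),
]
cleaned = preprocess(first_training_set, max_len=5)
```

Only the first pair survives: the second has identical source and target sentences, and the third exceeds the length threshold.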
An acquisition module 620 configured to acquire a second source sample statement and a second target sample statement based on the second training set.
A grammar error correction module 630 configured to input the second source sample statement to a grammar error correction model and generate an error correction sample statement.
The syntax error correction module 630 is further configured to:
inputting the second source sample statement into an encoder of the syntax error correction model for encoding to generate an encoding vector;
inputting the coding vector into a decoder of the syntax error correction model for decoding to obtain a preset number of candidate error correction sample sentences;
reordering the candidate error correction sample sentences of the preset number;
and taking the candidate error correction sample statement with the highest score as an error correction sample statement according to the reordering result.
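By way of illustration only, the reordering and selection steps above can be sketched as follows. The scoring function, a length-normalized log-probability, and the candidate scores are assumptions; the text does not specify how candidates are scored during reordering.

```python
def length_normalized(candidate):
    """Hypothetical reordering score: log-probability divided by sentence length."""
    sentence, log_prob = candidate
    return log_prob / len(sentence.split())

def rerank(candidates, score_fn):
    """Sort candidates from highest to lowest score."""
    return sorted(candidates, key=score_fn, reverse=True)

candidates = [
    ("he goes to the school", -2.5),  # -0.5 per word
    ("he go school", -2.1),           # -0.7 per word
    ("he goes to school", -1.6),      # -0.4 per word
]
ranked = rerank(candidates, length_normalized)
error_correction_sample = ranked[0][0]  # highest-scoring candidate
```

Normalizing by length keeps the reordering from favoring short candidates merely because they accumulate fewer log-probability terms.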
A loss determination module 640 configured to determine a loss value based on the error correction sample statement and the second target sample statement.
An iterative training module 650 configured to iteratively train the syntax error correction model based on the loss value until a training stop condition is reached.
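By way of illustration only, iterative training until a stop condition is reached can be sketched as follows. A one-parameter linear model with a mean squared error loss stands in for the grammar error correction model, and the learning rate, loss threshold and step limit are assumptions; the sketch shows only the control flow of computing a loss, checking it against a threshold, and updating the parameters.

```python
def train(samples, lr=0.1, loss_threshold=1e-6, max_steps=10_000):
    """Gradient descent on y = w * x until the loss falls below the threshold."""
    w = 0.0
    for step in range(1, max_steps + 1):
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < loss_threshold:
            break  # training stop condition reached
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad  # parameter update based on the loss value
    return w, loss, step

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data with true w = 2
w, final_loss, steps = train(samples)
```

The step limit acts as a second stop condition so that training terminates even if the loss never falls below the threshold.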
In the training device for the grammar error correction model provided by this embodiment, data enhancement processing is performed on the existing training set so that the training set is expanded automatically, which effectively reduces manual labor. The corpus corruption unit and the reverse translation unit are applied in combination, so that the data of the training set is effectively expanded. In the word insertion processing subunit and the word substitution processing subunit, the word unit to be inserted or substituted is randomly selected based on the weights of the word units in the dictionary, so that word units with larger weights are more likely to be selected. Compared with directly selecting a word unit at random, weight-based selection better matches the objective regularities of grammatical errors and further improves the accuracy of the training device.
EXAMPLE six
The embodiment provides a syntax error correction device, as shown in fig. 7, including the following modules:
an obtaining module 710 configured to obtain a source sentence;
a grammar error correction module 720 configured to input the source sentence into a grammar error correction model, generating a grammar error correction sentence;
wherein the grammar error correction model is obtained by training with the training method provided by the foregoing embodiment, which is not described again in this embodiment.
The syntax error correction module 720 is further configured to:
inputting the source sentences to an encoder of the grammar error correction model for encoding to generate encoding vectors;
and inputting the coding vector to a decoder of the grammar error correction model for decoding, and generating the grammar error correction statement.
The embodiment provides a grammar error correction device, which can realize grammar error correction of a source sentence by using a trained grammar error correction model, and improve grammar error correction accuracy.
EXAMPLE seven
The present embodiment also provides a computing device 800, as shown in FIG. 8.
FIG. 8 is a block diagram illustrating a configuration of a computing device 800 according to an embodiment of the present description. The components of the computing device 800 include, but are not limited to, memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.
Computing device 800 also includes access device 840, access device 840 enabling computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 840 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 8 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
Wherein, the processor 820 may execute the steps of the grammar error correction model training method or the steps of the grammar error correction method provided by the foregoing embodiments. The specific steps are not described in detail in this embodiment.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the grammar error correction model training method or of the grammar error correction method described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the above-mentioned training method for syntax error correction or the technical solution of the syntax error correction method, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the above description of the technical solution of the training method for syntax error correction or the technical solution of the syntax error correction method.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.
Claims (15)
1. A method for training a grammar error correction model, comprising:
performing data expansion processing based on the first training set to obtain a second training set;
acquiring a second source sample statement and a second target sample statement based on the second training set;
inputting the second source sample statement into a grammar error correction model to generate an error correction sample statement;
determining a loss value based on the error corrected sample statement and the second target sample statement;
and carrying out iterative training on the grammar error correction model based on the loss value until a training stop condition is reached.
2. A training method as claimed in claim 1 wherein the first training set comprises a first source sample statement and a first target sample statement;
the performing data expansion processing based on the first training set to obtain a second training set includes:
preprocessing the first source sample statement and the first target sample statement;
carrying out weight assignment on the word units based on the occurrence frequency of the word units in the first training set to construct a dictionary;
performing corruption processing on the sentences contained in the source sample sentences of the first training set according to the dictionary to obtain data-expanded second source sample sentences; and constructing the second training set according to the second source sample statement and a second target sample statement corresponding to the second source sample statement.
3. Training method according to claim 2, wherein the corruption processing comprises a word insertion process and/or a word substitution process;
the performing corruption processing on the sentences contained in the source sample sentences of the first training set according to the dictionary to obtain a second source sample sentence of data expansion includes:
performing word insertion processing on the first source sample statement according to the dictionary to obtain a second source sample statement of data expansion; and/or performing word substitution processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence of data expansion.
4. The training method according to claim 3, wherein the performing word insertion processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence with data expansion comprises:
a1, acquiring the first source sample statement and the sentence length n of the first source sample statement;
a2, generating a corresponding first array based on the sentence length n of the first source sample sentence;
wherein each value in the first array is a randomly generated value in the range of (0, 1);
each numerical value in the first array has a subscript i corresponding to the position sequence of the numerical value in the first array, and the value range of the subscript i is an integer in the range of (0, n-1);
a3, acquiring subscript i corresponding to a numerical value smaller than a first threshold value in the first array according to the preset first threshold value;
a4, randomly selecting a word unit in the dictionary based on the weight, inserting the word unit into the ith position in the first source sample sentence, and generating a second source sample sentence with data expansion after word insertion processing.
5. The training method according to claim 3, wherein the performing word substitution processing on the first source sample sentence according to the dictionary to obtain a second source sample sentence with data expansion comprises:
b1, acquiring the first source sample statement and the sentence length n of the first source sample statement;
b2, generating a corresponding second array based on the sentence length of the first source sample sentence, wherein each numerical value in the second array is a numerical value in a randomly generated (0,1) range;
each numerical value in the second array has a subscript i corresponding to the position sequence of the numerical value in the second array, and the value range of the subscript i is an integer in the range of (0, n-1);
b3, acquiring subscript i corresponding to a numerical value smaller than a second threshold value in the second array according to the preset second threshold value;
b4, randomly selecting a word unit in the dictionary based on the weight, replacing the word unit at the ith position in the first source sample sentence with the randomly selected word unit, and generating a second source sample sentence with data expansion after word replacement processing.
6. A training method as claimed in claim 1 wherein the first training set comprises a first source sample statement and a first target sample statement;
the performing data expansion processing based on the first training set to obtain a second training set further includes:
c1, preprocessing the first source sample statement and the first target sample statement;
c2, constructing a reverse training set in the form of < a first target sample statement, a first source sample statement > based on the first source sample statement and the first target sample statement;
c3, performing reverse training on the grammar error correction model based on the reverse training set, wherein the first target sample sentence is used as the input of the grammar error correction model and the first source sample sentence is used as the target output of the grammar error correction model, and fixing the parameters of the grammar error correction model after a preset number of training epochs;
c4, inputting the first target sample sentences in the reverse training set into the grammar error correction model with fixed parameters, and generating a preset number of candidate error correction sentences through beam search;
c5, reordering the preset number of candidate error correction statements, and selecting the candidate error correction statements at preset positions in the resulting order as second source sample statements;
c6, constructing the second training set according to the second source sample statement and the second target sample statement corresponding to the second source sample statement.
7. Training method according to claim 2 or 6, wherein the preprocessing of the first source sample statement and the first target sample statement comprises:
performing word segmentation processing on the first source sample sentence and the first target sample sentence, and performing separation processing on each word unit;
removing sentences with the sentence length larger than a preset threshold value in the first training set;
and removing the same sentences in the first source sample sentences and the first target sample sentences.
8. The training method of claim 1, wherein the inputting the second source sample statement into a syntax error correction model to generate an error corrected sample statement comprises:
inputting the second source sample statement into an encoder of the syntax error correction model for encoding to generate an encoding vector;
inputting the coding vector into a decoder of the syntax error correction model for decoding to obtain a preset number of candidate error correction sample sentences;
reordering the candidate error correction sample sentences of the preset number;
and taking the candidate error correction sample statement with the highest score as an error correction sample statement according to the reordering result.
9. The training method according to claim 1, wherein iteratively training the grammar error correction model based on the loss value until a training stop condition is reached comprises:
judging whether the loss value is smaller than a preset threshold value or not;
if not, continuing to obtain a sample sentence to be processed and a label sentence for training;
if yes, stopping training.
10. A method for syntax error correction, comprising:
obtaining a source statement;
inputting the source sentences into a grammar error correction model to generate grammar error correction sentences;
wherein the grammar error correction model is trained by the training method of any one of claims 1-9.
11. The method of grammar error correction according to claim 10, wherein said inputting said source sentence into a grammar error correction model to generate a grammar error correction sentence comprises:
inputting the source sentences to an encoder of the grammar error correction model for encoding to generate encoding vectors;
and inputting the coding vector to a decoder of the grammar error correction model for decoding, and generating the grammar error correction statement.
12. An apparatus for training a grammar error correction model, comprising:
the data expansion module is configured to perform data expansion processing based on the first training set to obtain a second training set;
an obtaining module configured to obtain a second source sample statement and a second target sample statement based on the second training set;
a syntax error correction module configured to input the second source sample statement to a syntax error correction model, and generate an error correction sample statement;
a loss determination module configured to determine a loss value based on the error correction sample statement and the second target sample statement;
an iterative training module configured to iteratively train the grammar error correction model based on the loss value until a training stop condition is reached.
13. An apparatus for syntax error correction, comprising:
an obtaining module configured to obtain a source sentence;
a grammar error correction module configured to input the source sentences to a grammar error correction model to generate grammar error correction sentences;
wherein the grammar error correction model is trained by the training method of any one of claims 1-9.
14. A computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-9 or claims 10-11 when executing the instructions.
15. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 9 or claims 10 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010655492.3A CN111767731A (en) | 2020-07-09 | 2020-07-09 | Training method and device of grammar error correction model and grammar error correction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010655492.3A CN111767731A (en) | 2020-07-09 | 2020-07-09 | Training method and device of grammar error correction model and grammar error correction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111767731A true CN111767731A (en) | 2020-10-13 |
Family
ID=72725775
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010655492.3A Pending CN111767731A (en) | 2020-07-09 | 2020-07-09 | Training method and device of grammar error correction model and grammar error correction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111767731A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329447A (en) * | 2020-10-29 | 2021-02-05 | 语联网(武汉)信息技术有限公司 | Training method of Chinese error correction model, and Chinese error correction method and device |
CN112364990A (en) * | 2020-10-29 | 2021-02-12 | 北京语言大学 | Method and system for realizing grammar error correction and less sample field adaptation through meta-learning |
CN112560846A (en) * | 2020-12-23 | 2021-03-26 | 北京百度网讯科技有限公司 | Error correction corpus generation method and device and electronic equipment |
CN113221545A (en) * | 2021-05-10 | 2021-08-06 | 北京有竹居网络技术有限公司 | Text processing method, device, equipment, medium and program product |
CN113723080A (en) * | 2021-07-26 | 2021-11-30 | 山东建筑大学 | English article automatic grammar error correction method based on reverse translation |
CN113807081A (en) * | 2021-09-18 | 2021-12-17 | 北京云上曲率科技有限公司 | Method and device for correcting chat text content based on context |
CN113822044A (en) * | 2021-09-29 | 2021-12-21 | 深圳市木愚科技有限公司 | Grammar error correction data generating method, device, computer equipment and storage medium |
CN114510925A (en) * | 2022-01-25 | 2022-05-17 | 森纵艾数(北京)科技有限公司 | Chinese text error correction method, system, terminal equipment and storage medium |
CN114861597A (en) * | 2022-05-17 | 2022-08-05 | 北京飞象星球科技有限公司 | Training method and device for problem solving model for filling up null question |
CN115062611A (en) * | 2022-05-23 | 2022-09-16 | 广东外语外贸大学 | Training method, device, equipment and storage medium of grammar error correction model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120101804A1 (en) * | 2010-10-25 | 2012-04-26 | Xerox Corporation | Machine translation using overlapping biphrase alignments and sampling |
CN109657251A (en) * | 2018-12-17 | 2019-04-19 | 北京百度网讯科技有限公司 | Method and apparatus for translating sentence |
CN110046350A (en) * | 2019-04-12 | 2019-07-23 | 百度在线网络技术(北京)有限公司 | Grammatical bloopers recognition methods, device, computer equipment and storage medium |
CN110162767A (en) * | 2018-02-12 | 2019-08-23 | 北京京东尚科信息技术有限公司 | The method and apparatus of text error correction |
CN110427619A (en) * | 2019-07-23 | 2019-11-08 | 西南交通大学 | It is a kind of based on Multichannel fusion and the automatic proofreading for Chinese texts method that reorders |
CN111062205A (en) * | 2019-12-16 | 2020-04-24 | 北京大学 | Dynamic mask training method in Chinese automatic grammar error correction |
2020-07-09: CN application CN202010655492.3A, published as CN111767731A, status Pending
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364990A (en) * | 2020-10-29 | 2021-02-12 | 北京语言大学 | Method and system for realizing grammar error correction and less sample field adaptation through meta-learning |
CN112364990B (en) * | 2020-10-29 | 2021-06-04 | 北京语言大学 | Method and system for realizing grammar error correction and less sample field adaptation through meta-learning |
CN112329447A (en) * | 2020-10-29 | 2021-02-05 | 语联网(武汉)信息技术有限公司 | Training method of Chinese error correction model, and Chinese error correction method and device |
CN112329447B (en) * | 2020-10-29 | 2024-03-26 | 语联网(武汉)信息技术有限公司 | Training method of Chinese error correction model, chinese error correction method and device |
CN112560846B (en) * | 2020-12-23 | 2022-03-15 | 北京百度网讯科技有限公司 | Error correction corpus generation method and device and electronic equipment |
CN112560846A (en) * | 2020-12-23 | 2021-03-26 | 北京百度网讯科技有限公司 | Error correction corpus generation method and device and electronic equipment |
CN113221545A (en) * | 2021-05-10 | 2021-08-06 | 北京有竹居网络技术有限公司 | Text processing method, device, equipment, medium and program product |
CN113221545B (en) * | 2021-05-10 | 2023-08-08 | 北京有竹居网络技术有限公司 | Text processing method, device, equipment, medium and program product |
CN113723080B (en) * | 2021-07-26 | 2023-10-10 | 山东建筑大学 | English article automatic grammar error correction method based on reverse translation |
CN113723080A (en) * | 2021-07-26 | 2021-11-30 | 山东建筑大学 | English article automatic grammar error correction method based on reverse translation |
CN113807081A (en) * | 2021-09-18 | 2021-12-17 | 北京云上曲率科技有限公司 | Method and device for correcting chat text content based on context |
CN113822044A (en) * | 2021-09-29 | 2021-12-21 | 深圳市木愚科技有限公司 | Grammar error correction data generating method, device, computer equipment and storage medium |
CN114510925A (en) * | 2022-01-25 | 2022-05-17 | 森纵艾数(北京)科技有限公司 | Chinese text error correction method, system, terminal equipment and storage medium |
CN114861597A (en) * | 2022-05-17 | 2022-08-05 | 北京飞象星球科技有限公司 | Training method and device for problem solving model for filling up null question |
CN114861597B (en) * | 2022-05-17 | 2024-07-12 | 北京飞象星球科技有限公司 | Training method and device for blank-filling problem solving model |
CN115062611A (en) * | 2022-05-23 | 2022-09-16 | 广东外语外贸大学 | Training method, device, equipment and storage medium of grammar error correction model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111767731A (en) | | Training method and device of grammar error correction model and grammar error correction method and device |
CN109190131B (en) | | Neural machine translation-based English word and case joint prediction method thereof |
CN109359309B (en) | | Translation method and device, and translation model training method and device |
CN109933808B (en) | | Neural machine translation method based on dynamic configuration decoding |
CN107798624B (en) | | Technical label recommendation method in software question-and-answer community |
CN111783423B (en) | | Training method and device for solving problem model, and solving problem method and device |
CN111401084A (en) | | Method and device for machine translation and computer readable storage medium |
CN110569505B (en) | | Text input method and device |
CN109710953B (en) | | Translation method and device, computing equipment, storage medium and chip |
US11797761B2 (en) | | Device, method and program for natural language processing |
CN110457719B (en) | | Translation model result reordering method and device |
CN111599340A (en) | | Polyphone pronunciation prediction method and device and computer readable storage medium |
CN111401064B (en) | | Named entity identification method and device and terminal equipment |
CN113536801A (en) | | Reading understanding model training method and device and reading understanding method and device |
CN115293138B (en) | | Text error correction method and computer equipment |
CN115831102A (en) | | Speech recognition method and device based on pre-training feature representation and electronic equipment |
CN115204143A (en) | | Method and system for calculating text similarity based on prompt |
CN110298046B (en) | | Translation model training method, text translation method and related device |
CN113268989B (en) | | Multi-tone word processing method and device |
CN114238549A (en) | | Training method and device of text generation model, storage medium and computer equipment |
CN114626378A (en) | | Named entity recognition method and device, electronic equipment and computer readable storage medium |
CN113449529A (en) | | Translation model training method and device, and translation method and device |
CN115917554A (en) | | System and method for bi-directional translation using a sum-product network |
CN115906854A (en) | | Multi-level confrontation-based cross-language named entity recognition model training method |
CN113792550B (en) | | Method and device for determining predicted answers, reading and understanding method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20201013 |