CN112084317A - Method and apparatus for pre-training a language model - Google Patents

Method and apparatus for pre-training a language model

Info

Publication number
CN112084317A
CN112084317A
Authority
CN
China
Prior art keywords
sample
word
statement
task
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011009914.6A
Other languages
Chinese (zh)
Other versions
CN112084317B (en)
Inventor
王福东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011009914.6A priority Critical patent/CN112084317B/en
Publication of CN112084317A publication Critical patent/CN112084317A/en
Application granted granted Critical
Publication of CN112084317B publication Critical patent/CN112084317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3329 — Natural language query formulation or dialogue systems (G Physics › G06 Computing; Calculating or Counting › G06F Electric digital data processing › G06F16/00 Information retrieval › G06F16/30 Unstructured textual data › G06F16/33 Querying › G06F16/332 Query formulation)
    • G06F16/3344 — Query execution using natural language analysis (… › G06F16/33 Querying › G06F16/3331 Query processing › G06F16/334 Query execution)
    • G06F16/355 — Class or cluster creation or modification (… › G06F16/30 Unstructured textual data › G06F16/35 Clustering; Classification)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this specification provide a method and apparatus for pre-training a language model. The method comprises: acquiring a first sentence of a first role and a second sentence of a second role from a historical dialogue record, the record containing the sentences of each turn of a multi-turn dialogue; splicing the first sentence and the second sentence into a first sample; masking a preset proportion of the words in the first sample to obtain a second sample; superimposing, for any word in the second sample, its word embedding vector, word-type embedding vector, position embedding vector, and an additional embedding vector to obtain the word's initial expression vector; and inputting the initial expression vectors of the words in the second sample into the language model and pre-training it on at least one pre-training task including a first task, where the first task predicts the masked words in the second sample. After such pre-training, the language model is better suited to language characterization in the dialogue domain.

Description

Method and apparatus for pre-training a language model
Technical Field
One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for pre-training a language model.
Background
With the development of artificial intelligence, robots are increasingly used in place of humans to converse with users, and such conversations often span multiple rounds, i.e., multi-turn dialogue. During a multi-turn dialogue between a robot and a user, an intention recognition model identifies the intention expressed by the user's sentence, the robot replies with a corresponding response sentence, and through this continuous interaction a predetermined business goal is accomplished, such as answering the user's question or prompting the user to perform a predetermined action.
The intention recognition model is a classification model that determines the intention expressed by a user's sentence based on the language representation produced by a language model. Existing language models are general-purpose models trained on public encyclopedia corpora and cannot represent sentences in the dialogue domain well; accordingly, the intention recognition model cannot accurately recognize the intentions expressed by the user's sentences, and the intended business goal cannot be met.
Therefore, an improved approach is desired that makes a language model, after pre-training, better suited to language characterization in the dialogue domain.
Disclosure of Invention
One or more embodiments of this specification describe a method and apparatus for pre-training a language model, so that, after pre-training, the language model is better suited to language characterization in the dialogue domain.
In a first aspect, a method for pre-training a language model for language characterization in the field of dialog is provided, the method comprising:
acquiring a first statement of a first role in a historical dialogue record of a dialogue field and a second statement of a second role in the historical dialogue record; wherein the historical conversation record comprises statements of each of the multiple turns of conversation of the first character and the second character;
splicing the first sentence and the second sentence into a first sample; masking words with a preset proportion in the first sample by using preset words to obtain a second sample;
superposing a word embedded vector of any word in the second sample, the word type embedded vector of the word, the position embedded vector of the word and an additional embedded vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of the round to which the statement corresponding to the word belongs, a role embedded vector of the role to which the statement corresponding to the word belongs, and a pinyin embedded vector of the pinyin corresponding to the word;
inputting an initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
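As a concrete, non-limiting illustration of the splicing and masking steps above, the following minimal Python sketch builds a first sample and masks it into a second sample. The character-level tokenization, the special tokens, and the 15% proportion are assumptions in the spirit of BERT-style pre-processing rather than requirements of this method.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def splice(first_sentence: str, second_sentence: str) -> list:
    # Splice the first sentence and the second sentence into one first sample
    # (Chinese text is tokenized per character here, an assumption).
    return [CLS] + list(first_sentence) + [SEP] + list(second_sentence) + [SEP]

def mask(first_sample: list, ratio: float = 0.15) -> tuple:
    # Mask a preset proportion of the words to obtain the second sample;
    # the masked originals become the labels for the first task.
    positions = [i for i, t in enumerate(first_sample) if t not in (CLS, SEP)]
    chosen = set(random.sample(positions, max(1, int(len(positions) * ratio))))
    second_sample = [MASK if i in chosen else t for i, t in enumerate(first_sample)]
    labels = {i: first_sample[i] for i in chosen}
    return second_sample, labels

# Example from FIG. 3: the robot says "pay back", the user answers "no money".
second_sample, labels = mask(splice("请还款", "没有钱"))
```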
In one possible embodiment, the words masked in the second sample are used as sample labels for determining the predicted loss of the first task.
In a possible implementation, the pre-training task further includes a second task, and the second task is configured to predict whether the first sentence and the second sentence are two sentences connected in sequence.
Further, the first sample corresponds to a positive sample of the second task, and the first statement and the second statement are two statements connected in sequence; or, the first sample corresponds to a negative sample of the second task, and the first statement and the second statement are not two statements connected in sequence.
In one possible embodiment, the pre-training task further comprises a third task for predicting the pinyin for the masked words in the second sample.
Further, pinyin for the masked words in the second sample is used as a sample label for determining the predicted loss of the third task.
In a possible implementation manner, the additional embedded vector includes at least one of a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of pinyin corresponding to the word;
the pre-training task further comprises a fourth task, and the fourth task is used for predicting whether the first statement and the second statement are two statements of the same turn.
Further, the first sample corresponds to a positive sample of the fourth task, and the first statement and the second statement are two statements of the same turn; or, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same turn.
In one possible embodiment, after the pre-training the language model based on at least one pre-training task including the first task, the method further comprises:
acquiring a third statement of the first role and a fourth statement of the second role in the historical conversation record; the third sentence and the fourth sentence belong to the same turn;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the pre-trained language model to obtain a language representation vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a prediction intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
Further, after the fine-tuning the language model, the method further includes:
acquiring a fifth statement of a first role and a sixth statement of a second role in the current conversation; the fifth sentence and the sixth sentence belong to the same turn;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the language model after fine tuning to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain a prediction intention category corresponding to the fourth sample.
In a second aspect, an apparatus for pre-training a language model for language characterization in the field of dialogs is provided, the apparatus comprising:
the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a first statement of a first role in a historical dialogue record of a dialogue field and a second statement of a second role in the historical dialogue record; wherein the historical conversation record comprises statements of each of the multiple turns of conversation of the first character and the second character;
the first sample generation unit splices the first statement and the second statement acquired by the first acquisition unit into a first sample; masking words with a preset proportion in the first sample by using preset words to obtain a second sample;
an initial expression unit, configured to superimpose a word embedding vector of any word in the second sample obtained by the first sample generation unit, a word type embedding vector of the word, a position embedding vector of the word, and an additional embedding vector corresponding to the word, so as to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of the round to which the statement corresponding to the word belongs, a role embedded vector of the role to which the statement corresponding to the word belongs, and a pinyin embedded vector of the pinyin corresponding to the word;
and the pre-training unit is used for inputting the initial word expression vector of each word in the second sample obtained by the initial expression unit into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and apparatus provided in the embodiments of this specification, a first sentence of a first role and a second sentence of a second role are acquired from a historical dialogue record of the dialogue domain, the record containing the sentences of each turn of the multi-turn dialogue between the two roles. The first sentence and the second sentence are spliced into a first sample, and a preset proportion of the words in the first sample are masked with preset words to obtain a second sample. For each word in the second sample, the word embedding vector, word-type embedding vector, position embedding vector, and an additional embedding vector are superimposed to obtain the word's initial expression vector; the additional embedding vector comprises at least one of a turn embedding vector of the turn to which the word's sentence belongs, a role embedding vector of the role to which the word's sentence belongs, and a pinyin embedding vector of the word's pinyin. Finally, the initial expression vectors of the words in the second sample are input into the language model, which is pre-trained on at least one pre-training task including a first task that predicts the masked words in the second sample. Because the second sample is built from a historical dialogue record of the dialogue domain, the pre-trained language model is better suited to language characterization in that domain; and because the additional embedding vectors carry dialogue-specific information (turn, role, and pinyin), the model learns to extract that information, further improving its characterization of the dialogue domain.
Drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show merely some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a method of pre-training a language model, according to one embodiment;
FIG. 3 illustrates a process diagram of a pre-trained language model according to one embodiment;
FIG. 4 shows a schematic block diagram of an apparatus to pre-train a language model, according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves a pre-trained language model for language characterization in the field of dialog. Referring to fig. 1, at least a sentence of a user is input to a language model, a corresponding language representation vector is output through the language model, the language representation vector is input to an intention recognition model, and a corresponding predicted intention category is output through the intention recognition model. It can be understood that the intention recognition model is a classification model, and is based on the language representation obtained by the language model, so whether the language model can well represent the sentence in the dialogue domain has a great influence on the recognition effect of the intention recognition model.
The language model may adopt the architecture of the Bidirectional Encoder Representations from Transformers (BERT) model. A BERT model is generally obtained by pre-training on pre-training tasks, with training data drawn from encyclopedia corpora. The pre-training tasks include word-mask training, in which several words in a passage are masked and then predicted, and next-sentence prediction training, in which the model judges whether two sentences stand in a context relationship. A BERT model trained in this way is rather general-purpose and cannot represent sentences in the dialogue domain well.
In the embodiments of this specification, training samples are constructed from historical dialogue records of the dialogue domain, and the language model is pre-trained on those samples with at least one pre-training task. Moreover, dialogue-specific information is incorporated when the initial word expression vector of each word in a training sample is determined, and these vectors are input into the language model for pre-training. The language model can therefore better extract dialogue-specific information and, once pre-trained, is better suited to language characterization in the dialogue domain.
Fig. 2 shows a flow diagram of a method for pre-training a language model for language characterization in the field of dialog, which may be based on the implementation scenario shown in fig. 1, according to an embodiment. As shown in fig. 2, the method for pre-training the language model in this embodiment includes the following steps: step 21, acquiring a first statement of a first role in a historical dialogue record of a dialogue field and a second statement of a second role in the historical dialogue record; wherein the historical conversation record comprises statements of each of the multiple turns of conversation of the first character and the second character; step 22, splicing the first statement and the second statement into a first sample; masking words with a preset proportion in the first sample by using preset words to obtain a second sample; step 23, superposing the word embedding vector of any word in the second sample, the word type embedding vector of the word, the position embedding vector of the word and the additional embedding vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of the round to which the statement corresponding to the word belongs, a role embedded vector of the role to which the statement corresponding to the word belongs, and a pinyin embedded vector of the pinyin corresponding to the word; and 24, inputting the initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample. Specific execution modes of the above steps are described below.
Firstly, in step 21, acquiring a first statement of a first role in a historical dialogue record of a dialogue field and a second statement of a second role in the historical dialogue record; wherein the historical conversation record comprises statements of each of the multiple turns of conversation of the first character and the second character. It will be appreciated that the two parties to a conversation typically belong to different roles, for example, one role is customer service and the other role is user.
In the embodiments of this specification, the historical dialogue record corresponds to one session between the first role and the second role. Taking a dialogue between customer service and a user as an example, the historical dialogue record includes multiple rounds of dialogue between the robot customer service and the user and, when the robot customer service cannot achieve the predetermined goal, further rounds between human customer service and the user. Each round of dialogue comprises a customer-service sentence and a user sentence, beginning with the customer-service sentence.
It is understood that the first sentence and the second sentence may belong to the same round of dialog or may belong to different rounds of dialog. In the embodiments of the present specification, a sentence is not limited to a single sentence, and may be a single word, a single sentence, or two sentences, etc., based on the actual expression in the dialog. The above statements are the actual expressions of the parties in the conversation and may therefore also be referred to as dialogs.
In the embodiments of this specification, the historical dialogue record may come from an intelligent outbound scenario, in which the robot interacts with the user through outbound telephone calls to complete an outbound task with a specific goal; or from a user inbound scenario, in which the user calls in and interacts with a robot or human customer service to consult on specific problems.
Then, in step 22, the first sentence and the second sentence are spliced into a first sample; and shielding words with a preset proportion in the first sample by using preset words to obtain a second sample. It will be appreciated that this second sample corresponds to the pre-training task of the word mask training of the BERT model.
The predetermined ratio may be a small value, for example, 10% or 15%.
The preset word may be an ordinary Chinese character or a special mark, for example the special mark "[MASK]". In one example, a first proportion of the selected words are replaced with the "[MASK]" mark, a second proportion are replaced with randomly sampled words, and a third proportion are left unreplaced.
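The three-way replacement scheme in this example can be sketched as follows. The 80%/10%/10% split is the convention of the original BERT recipe and is assumed here, since the proportions are only named abstractly above.

```python
import random

def replace_selected_word(word: str, vocabulary: list,
                          p_mark: float = 0.8, p_random: float = 0.1) -> str:
    # Applied only to words already selected for masking.
    r = random.random()
    if r < p_mark:
        return "[MASK]"                    # first proportion: the special mark
    if r < p_mark + p_random:
        return random.choice(vocabulary)   # second proportion: a randomly sampled word
    return word                            # third proportion: left unreplaced
```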
Then, in step 23, the word embedding vector of any word in the second sample, the word-type embedding vector of the word, the position embedding vector of the word, and the additional embedding vector corresponding to the word are superimposed to obtain the initial word expression vector of the word; the additional embedding vector comprises at least one of a turn embedding vector of the turn to which the word's sentence belongs, a role embedding vector of the role to which the word's sentence belongs, and a pinyin embedding vector of the word's pinyin. Superimposing a word's word embedding vector, word-type embedding vector, and position embedding vector to obtain its initial expression vector is the approach of the standard BERT model.
Introducing the pinyin embedding vector of a word's pinyin into the word's initial expression vector helps the pre-trained language model suppress automatic speech recognition (ASR) errors, since words confused by ASR typically share the same or similar pinyin.
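In code, the superposition of step 23 amounts to an element-wise sum over six embedding tables followed by normalization. A minimal PyTorch sketch, with illustrative table sizes (the hidden size 768 follows BERT-base and is an assumption):

```python
import torch
import torch.nn as nn

class InitialWordExpression(nn.Module):
    """Superimposes the three BERT embeddings and the three additional
    dialogue embeddings of step 23. All table sizes are illustrative."""
    def __init__(self, n_words=21128, n_types=2, n_positions=512,
                 n_turns=32, n_roles=2, n_pinyins=1500, dim=768):
        super().__init__()
        self.word = nn.Embedding(n_words, dim)
        self.word_type = nn.Embedding(n_types, dim)
        self.position = nn.Embedding(n_positions, dim)
        self.turn = nn.Embedding(n_turns, dim)      # turn (round) embedding vector
        self.role = nn.Embedding(n_roles, dim)      # role embedding vector
        self.pinyin = nn.Embedding(n_pinyins, dim)  # pinyin embedding vector
        self.norm = nn.LayerNorm(dim)

    def forward(self, word_ids, type_ids, pos_ids, turn_ids, role_ids, pinyin_ids):
        x = (self.word(word_ids) + self.word_type(type_ids)
             + self.position(pos_ids) + self.turn(turn_ids)
             + self.role(role_ids) + self.pinyin(pinyin_ids))
        return self.norm(x)  # normalization before the encoder, as in BERT
```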
Finally, in step 24, the initial word expression vector of each word in the second sample is input into the language model, which is pre-trained based on at least one pre-training task including a first task for predicting the masked words in the second sample. It will be appreciated that since only a preset proportion of the words in the second sample are masked, the language model can predict each masked word from its context. This first task corresponds to the word-mask pre-training task of a typical BERT model, and performing it enables the language model to better characterize language in the dialogue domain.
In one example, the masked words in the second sample serve as sample labels for determining the prediction loss of the first task.
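With the masked originals as sample labels, the prediction loss of the first task is an ordinary cross-entropy evaluated only at masked positions. A minimal sketch, assuming PyTorch and the common convention of marking unlabeled positions with the ignore index -100:

```python
import torch
import torch.nn.functional as F

def first_task_loss(logits: torch.Tensor, label_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); label_ids: (batch, seq_len) holding the
    # original word id at masked positions and -100 everywhere else.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           label_ids.view(-1), ignore_index=-100)
```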
In one example, the pre-training task further includes a second task for predicting whether the first sentence and the second sentence are two sequentially connected sentences. This second task corresponds to the next-sentence prediction pre-training task of a typical BERT model.
Further, the first sample may correspond to a positive sample of the second task, in which the first sentence and the second sentence are two sequentially connected sentences; or to a negative sample of the second task, in which they are not. The historical dialogue record shown in Table 1 below illustrates what counts as two sequentially connected sentences.
Table 1: Historical dialogue record

Role              Sentence     Turn
Customer service  Statement 1  1
User              Statement 2  1
Customer service  Statement 3  2
User              Statement 4  2
Customer service  Statement 5  3
User              Statement 6  3
Referring to Table 1, the sentences in the historical dialogue record are listed in time order: statement 1 and statement 2 are two sequentially connected sentences, as are statement 2 and statement 3, but statement 1 and statement 3 are not.
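Drawing positive and negative pairs for the second task from such a record can be sketched as follows; the even sampling split between positive and negative pairs is an assumption.

```python
import random

def second_task_pair(record: list) -> tuple:
    """record: sentences in time order (at least three), e.g. the six
    statements of Table 1. Returns (first_sentence, second_sentence, label);
    label 1 marks a positive sample (sequentially connected), 0 a negative one."""
    i = random.randrange(len(record) - 1)
    if random.random() < 0.5:
        return record[i], record[i + 1], 1   # e.g. statement 1 + statement 2
    j = random.choice([k for k in range(len(record)) if k not in (i, i + 1)])
    return record[i], record[j], 0           # e.g. statement 1 + statement 3
```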
In one example, the pre-training tasks further include a third task for predicting the pinyin of the masked words in the second sample. This third task is adapted to a scenario specific to the dialogue domain: during a dialogue, speech is routinely recognized into text, ASR errors sometimes occur in that process, and the third task can effectively suppress such errors.
Further, pinyin for the masked words in the second sample is used as a sample label for determining the predicted loss of the third task.
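As a sketch of how such labels can be produced, the pypinyin package is a common pinyin lookup for Chinese; the text does not name a particular tool, so this helper is an assumption.

```python
from pypinyin import lazy_pinyin  # assumed pinyin lookup, not named by the patent

def third_task_labels(masked_words: dict) -> dict:
    # masked_words maps a masked position to the original word, e.g. {5: "钱"};
    # the returned pinyin strings serve as the third task's sample labels.
    return {pos: lazy_pinyin(word)[0] for pos, word in masked_words.items()}

# third_task_labels({5: "钱"}) -> {5: "qian"}
```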
In one example, the additional embedding vector includes at least one of a role embedding vector of the role to which the word's sentence belongs and a pinyin embedding vector of the word's pinyin; the pre-training task further comprises a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same turn. The fourth task is likewise adapted to the dialogue domain and helps the language model express turn information.
Further, the first sample corresponds to a positive sample of the fourth task, and the first statement and the second statement are two statements of the same turn; or, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same turn.
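Positive and negative samples for the fourth task can be drawn analogously by grouping the record's sentences by turn; again the even split is an assumption.

```python
import random

def fourth_task_pair(record: list) -> tuple:
    """record: (sentence, turn) pairs in time order, as in Table 1 (assumes at
    least two turns, one with two sentences). Returns (first_sentence,
    second_sentence, label); label 1 marks two sentences of the same turn."""
    by_turn = {}
    for sentence, turn in record:
        by_turn.setdefault(turn, []).append(sentence)
    if random.random() < 0.5:
        same = random.choice([s for s in by_turn.values() if len(s) >= 2])
        return same[0], same[1], 1                # e.g. statement 1 + statement 2
    t1, t2 = random.sample(sorted(by_turn), 2)
    return by_turn[t1][0], by_turn[t2][0], 0      # e.g. statement 1 + statement 3
```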
In the embodiment of the present specification, in order to make the language model better fit to the target task, the language model also needs to be fine-tuned and trained on the target task, which may be, but is not limited to, an intention recognition task.
In one example, after the pre-training the language model based on at least one pre-training task including a first task, the method further comprises:
acquiring a third statement of the first role and a fourth statement of the second role in the historical conversation record; the third sentence and the fourth sentence belong to the same turn;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the pre-trained language model to obtain a language representation vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a prediction intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
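A minimal fine-tuning step matching this procedure might look as follows. Pooling the language characterization vector from the first position and using cross-entropy against the actual intention category are assumptions, since the text fixes neither choice.

```python
import torch
import torch.nn.functional as F

def fine_tune_step(language_model, intent_model, optimizer,
                   third_sample: torch.Tensor, actual_intent: torch.Tensor) -> float:
    """One fine-tuning step on a third sample. Hypothetical callables:
    language_model returns (batch, seq_len, dim) hidden states, and
    intent_model is a classification head over the pooled vector."""
    hidden = language_model(third_sample)
    characterization = hidden[:, 0]                 # language characterization vector
    logits = intent_model(characterization)         # predicted intention category scores
    loss = F.cross_entropy(logits, actual_intent)   # against the actual intention category
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```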
In the embodiment of the specification, after the language model is subjected to fine tuning training on the target task, the target task can be executed based on the language model.
In one example, after the fine-tuning the language model, the method further comprises:
acquiring a fifth statement of a first role and a sixth statement of a second role in the current conversation; the fifth sentence and the sixth sentence belong to the same turn;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the language model after fine tuning to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain a prediction intention category corresponding to the fourth sample.
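After fine-tuning, the same path runs without gradient computation. A sketch under the same assumptions:

```python
import torch

@torch.no_grad()
def predict_intent(language_model, intent_model, fourth_sample: torch.Tensor) -> torch.Tensor:
    """Prediction on a fourth sample spliced from the current dialogue:
    returns the index of the predicted intention category."""
    characterization = language_model(fourth_sample)[:, 0]
    return intent_model(characterization).argmax(dim=-1)
```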
FIG. 3 illustrates the process of pre-training the language model according to one embodiment. Referring to FIG. 3, the robot's utterance (context) and the corresponding user's utterance (query) are extracted from the historical dialogue logs of different outbound application scenarios and spliced into one sample; historical dialogue logs may also be called historical dialogue records. In the example in the figure, the robot says "pay back" and the user answers "no money". For a sample, the word embedding vector, word-type embedding vector, and position embedding vector of each word are obtained first; these are the three embedding vectors of the original BERT model. On this basis, three additional embedding vectors are added: a turn embedding vector of the turn to which the word's sentence belongs, a role embedding vector of the role to which the word's sentence belongs, and a pinyin embedding vector of the word's pinyin. The turn embedding vector helps the language model better learn dialogue knowledge across different turns; the role embedding vector introduces role information that helps the model learn the different language styles of the robot and the user; and the pinyin embedding vector suppresses sample instability caused by ASR errors. All the embedding vectors are then summed, normalized, and input into the language model. On top of the traditional pre-training tasks of the BERT model, two pre-training tasks are added. The traditional tasks comprise the first task, which predicts a missing word from the surrounding text, and the second task, a binary classification task that predicts whether the robot utterance and the user utterance are sequentially connected. The two added tasks comprise the third task, which predicts the pinyin of a missing word from the pinyin of the surrounding words, and the fourth task, a binary classification task that predicts whether the robot utterance and the user utterance belong to the same turn.
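Since the four tasks of FIG. 3 are trained jointly, the overall pre-training objective can be written as a weighted sum of the four task losses. Equal weights are an assumption, as the text does not state how the losses are combined.

```python
def total_pretraining_loss(word_loss, sequence_loss, pinyin_loss, turn_loss,
                           weights=(1.0, 1.0, 1.0, 1.0)):
    """Joint objective over the four tasks of FIG. 3: masked-word prediction,
    sequential-connection prediction, pinyin prediction, same-turn prediction."""
    w1, w2, w3, w4 = weights
    return w1 * word_loss + w2 * sequence_loss + w3 * pinyin_loss + w4 * turn_loss
```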
According to the method provided in the embodiments of this specification, a first sentence of a first role and a second sentence of a second role are acquired from a historical dialogue record of the dialogue domain, the record containing the sentences of each turn of the multi-turn dialogue between the two roles; the first sentence and the second sentence are spliced into a first sample, and a preset proportion of its words are masked with preset words to obtain a second sample; for each word in the second sample, the word embedding vector, word-type embedding vector, position embedding vector, and additional embedding vector are superimposed to obtain the word's initial expression vector, the additional embedding vector comprising at least one of the turn embedding vector, the role embedding vector, and the pinyin embedding vector; finally, the initial expression vectors of the words in the second sample are input into the language model, which is pre-trained on at least one pre-training task including the first task of predicting the masked words. Because the second sample is constructed from a historical dialogue record and the additional embedding vectors carry dialogue-specific information, the pre-trained language model better extracts that information and is better suited to language characterization in the dialogue domain.
According to an embodiment of another aspect, an apparatus for pre-training a language model is also provided, and the apparatus is used for executing the method for pre-training a language model provided by the embodiment of the present specification. FIG. 4 shows a schematic block diagram of an apparatus to pre-train a language model, according to one embodiment. As shown in fig. 4, the apparatus 400 includes:
a first obtaining unit 41, configured to obtain a first sentence of a first character in a history dialog record of a dialog field, and a second sentence of a second character in the history dialog record; wherein the historical conversation record comprises statements of each of the multiple turns of conversation of the first character and the second character;
a first sample generation unit 42, which splices the first sentence and the second sentence acquired by the first acquisition unit 41 into a first sample; masking words with a preset proportion in the first sample by using preset words to obtain a second sample;
an initial expression unit 43, configured to superimpose a word embedding vector of any word in the second sample obtained by the first sample generation unit 42, a word type embedding vector of the word, a position embedding vector of the word, and an additional embedding vector corresponding to the word, so as to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of the round to which the statement corresponding to the word belongs, a role embedded vector of the role to which the statement corresponding to the word belongs, and a pinyin embedded vector of the pinyin corresponding to the word;
a pre-training unit 44, configured to input the initial word expression vector of each word in the second sample obtained by the initial expression unit 43 into the language model, and pre-train the language model based on at least one pre-training task including a first task, where the first task is used to predict a word masked in the second sample.
Optionally, as an embodiment, the masked words in the second sample are used as sample labels for determining the predicted loss of the first task.
Optionally, as an embodiment, the pre-training task further includes a second task, and the second task is configured to predict whether the first sentence and the second sentence are two sentences connected in sequence.
Further, the first sample corresponds to a positive sample of the second task, and the first statement and the second statement are two statements connected in sequence; or, the first sample corresponds to a negative sample of the second task, and the first statement and the second statement are not two statements connected in sequence.
Optionally, as an embodiment, the pre-training task further includes a third task, and the third task is used for predicting pinyin of the masked words in the second sample.
Further, pinyin for the masked words in the second sample is used as a sample label for determining the predicted loss of the third task.
Optionally, as an embodiment, the additional embedded vector includes at least one of a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
the pre-training task further comprises a fourth task, and the fourth task is used for predicting whether the first statement and the second statement are two statements of the same turn.
Further, the first sample corresponds to a positive sample of the fourth task, and the first statement and the second statement are two statements of the same turn; or, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same turn.
Optionally, as an embodiment, the apparatus further includes:
a second obtaining unit, configured to obtain a third sentence of the first role and a fourth sentence of the second role in the historical dialog record after the pre-training unit pre-trains the language model based on at least one pre-training task including the first task; the third sentence and the fourth sentence belong to the same turn;
a second sample generation unit, configured to splice the third statement and the fourth statement acquired by the second acquisition unit into a third sample;
the language characterization unit is used for inputting the initial word expression vector of each word in the third sample obtained by the second sample generation unit into the pre-trained language model to obtain a language characterization vector of the third sample;
the prediction unit is used for inputting the language representation vector of the third sample obtained by the language representation unit into an intention recognition model to obtain a prediction intention category corresponding to the third sample;
and the fine tuning unit is used for fine tuning the language model according to the actual intention category corresponding to the third sample and the prediction intention category obtained by the prediction unit.
Further, the apparatus further comprises:
a third obtaining unit, configured to obtain a fifth statement of the first role and a sixth statement of the second role in the current dialog after the language model is fine-tuned by the fine-tuning unit; the fifth sentence and the sixth sentence belong to the same turn;
a third sample generation unit, configured to splice the fifth statement and the sixth statement acquired by the third acquisition unit into a fourth sample;
the language characterization unit is further configured to input the fourth sample obtained by the third sample generation unit into the language model after the fine tuning, so as to obtain a language characterization vector of the fourth sample;
the prediction unit is further configured to input the language characterization vector of the fourth sample obtained by the language characterization unit into the intention recognition model, so as to obtain a predicted intention category corresponding to the fourth sample.
With the apparatus provided in the embodiments of this specification, the first acquisition unit 41 first acquires a first sentence of a first role and a second sentence of a second role from a historical dialogue record of the dialogue domain, the record containing the sentences of each turn of the multi-turn dialogue between the two roles; the first sample generation unit 42 then splices the first sentence and the second sentence into a first sample and masks a preset proportion of its words with preset words to obtain a second sample; the initial expression unit 43 superimposes, for each word in the second sample, the word embedding vector, word-type embedding vector, position embedding vector, and additional embedding vector to obtain the word's initial expression vector, the additional embedding vector comprising at least one of the turn embedding vector, the role embedding vector, and the pinyin embedding vector; finally, the pre-training unit 44 inputs the initial expression vectors of the words in the second sample into the language model and pre-trains it on at least one pre-training task including the first task of predicting the masked words. As above, because the second sample comes from a historical dialogue record and the additional embedding vectors carry dialogue-specific information, the pre-trained language model is better suited to language characterization in the dialogue domain.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing describes the objects, technical solutions, and advantages of the present invention in further detail. It should be understood that the above are merely exemplary embodiments of the present invention and do not limit its scope; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (22)

1. A method of pre-training a language model for language characterization in the field of dialog, the method comprising:
acquiring a first statement of a first role in a historical dialogue record of a dialogue field and a second statement of a second role in the historical dialogue record; wherein the historical conversation record comprises statements of each of the multiple turns of conversation of the first character and the second character;
splicing the first sentence and the second sentence into a first sample; masking words with a preset proportion in the first sample by using preset words to obtain a second sample;
superposing a word embedded vector of any word in the second sample, the word type embedded vector of the word, the position embedded vector of the word and an additional embedded vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of the round to which the statement corresponding to the word belongs, a role embedded vector of the role to which the statement corresponding to the word belongs, and a pinyin embedded vector of the pinyin corresponding to the word;
inputting an initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
2. The method of claim 1, wherein the masked words in the second sample serve as sample tags for determining the predicted loss of the first task.
3. The method of claim 1, wherein the pre-training task further comprises a second task to predict whether the first statement and the second statement are two statements that are serially connected.
4. The method of claim 3, wherein the first sample corresponds to a positive sample of the second task, the first statement and the second statement being two statements connected in sequence; or, the first sample corresponds to a negative sample of the second task, and the first statement and the second statement are not two statements connected in sequence.
5. The method of claim 1, wherein the pre-training tasks further comprise a third task for predicting pinyin for the masked words in the second sample.
6. The method of claim 5, wherein pinyins of the masked words in the second sample are used as sample labels for determining the predicted loss of the third task.
7. The method of claim 1, wherein the additional embedded vector comprises at least one of a character embedded vector of a character to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
the pre-training task further comprises a fourth task, and the fourth task is used for predicting whether the first statement and the second statement are two statements of the same turn.
8. The method of claim 7, wherein the first sample corresponds to a positive sample of the fourth task, the first statement and the second statement being two statements of a same round; or, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same turn.
9. The method of claim 1, wherein after the pre-training the language model based on at least one pre-training task comprising a first task, the method further comprises:
acquiring a third statement of the first role and a fourth statement of the second role in the historical conversation record; the third sentence and the fourth sentence belong to the same turn;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the pre-trained language model to obtain a language representation vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a prediction intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
10. The method of claim 9, wherein after the fine-tuning the language model, the method further comprises:
acquiring a fifth statement of a first role and a sixth statement of a second role in the current conversation; the fifth sentence and the sixth sentence belong to the same turn;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the language model after fine tuning to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain a prediction intention category corresponding to the fourth sample.
11. An apparatus for pre-training a language model for language characterization in the field of dialog, the apparatus comprising:
the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a first statement of a first role in a historical dialogue record of a dialogue field and a second statement of a second role in the historical dialogue record; wherein the historical conversation record comprises statements of each of the multiple turns of conversation of the first character and the second character;
the first sample generation unit splices the first statement and the second statement acquired by the first acquisition unit into a first sample; masking words with a preset proportion in the first sample by using preset words to obtain a second sample;
an initial expression unit, configured to superimpose a word embedding vector of any word in the second sample obtained by the first sample generation unit, a word type embedding vector of the word, a position embedding vector of the word, and an additional embedding vector corresponding to the word, so as to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of the round to which the statement corresponding to the word belongs, a role embedded vector of the role to which the statement corresponding to the word belongs, and a pinyin embedded vector of the pinyin corresponding to the word;
and the pre-training unit is used for inputting the initial word expression vector of each word in the second sample obtained by the initial expression unit into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
12. The apparatus of claim 11, wherein the masked words in the second sample serve as sample tags for determining the predicted loss of the first task.
13. The apparatus of claim 11, wherein the pre-training task further comprises a second task to predict whether the first statement and the second statement are two statements that are consecutive.
14. The apparatus of claim 13, wherein the first sample corresponds to a positive sample of the second task, the first statement and the second statement being two statements connected in sequence; or, the first sample corresponds to a negative sample of the second task, and the first statement and the second statement are not two statements connected in sequence.
15. The apparatus of claim 11, wherein the pre-training tasks further comprise a third task for predicting pinyin for the masked words in the second sample.
16. The apparatus of claim 15, wherein pinyin for the masked words in the second sample is used as a sample label for determining a predicted loss for the third task.
17. The apparatus of claim 11, wherein the additional embedded vector comprises at least one of a character embedded vector of a character to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
the pre-training task further comprises a fourth task, and the fourth task is used for predicting whether the first statement and the second statement are two statements of the same turn.
18. The apparatus of claim 17, wherein the first sample corresponds to a positive sample of the fourth task, the first statement and the second statement being two statements of a same round; or, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same turn.
19. The apparatus of claim 11, wherein the apparatus further comprises:
a second obtaining unit, configured to obtain a third sentence of the first role and a fourth sentence of the second role in the historical dialog record after the pre-training unit pre-trains the language model based on at least one pre-training task including the first task; the third sentence and the fourth sentence belong to the same turn;
a second sample generation unit, configured to splice the third statement and the fourth statement acquired by the second acquisition unit into a third sample;
a language characterization unit, configured to input the initial word representation vector of each word in the third sample obtained by the second sample generation unit into the pre-trained language model, so as to obtain a language characterization vector of the third sample;
a prediction unit, configured to input the language characterization vector of the third sample obtained by the language characterization unit into an intention recognition model, so as to obtain a predicted intention category corresponding to the third sample;
and a fine-tuning unit, configured to fine-tune the language model according to the actual intention category corresponding to the third sample and the predicted intention category obtained by the prediction unit.
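
A condensed sketch of the fine-tuning stage of claim 19, assuming the language model and a linear intention-recognition head are PyTorch modules, and that the characterization vector at position 0 (a [CLS]-style convention) represents the sample; all names are hypothetical:

    import torch.nn as nn
    from torch.optim import AdamW

    def fine_tune(language_model, intent_model, batches, epochs=3):
        loss_fn = nn.CrossEntropyLoss()
        # fine-tuning updates the language model together with the intent head
        opt = AdamW(list(language_model.parameters())
                    + list(intent_model.parameters()), lr=2e-5)
        for _ in range(epochs):
            for third_sample, actual_intent in batches:
                reps = language_model(third_sample)    # language characterization vectors
                logits = intent_model(reps[:, 0])      # predicted intention logits
                loss = loss_fn(logits, actual_intent)  # predicted vs. actual category
                opt.zero_grad()
                loss.backward()
                opt.step()
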
20. The apparatus of claim 19, wherein the apparatus further comprises:
a third acquisition unit, configured to acquire a fifth statement of the first role and a sixth statement of the second role in the current dialog after the language model is fine-tuned by the fine-tuning unit; the fifth statement and the sixth statement belong to the same turn;
a third sample generation unit, configured to splice the fifth statement and the sixth statement acquired by the third acquisition unit into a fourth sample;
the language characterization unit is further configured to input the fourth sample obtained by the third sample generation unit into the fine-tuned language model, so as to obtain a language characterization vector of the fourth sample;
the prediction unit is further configured to input the language characterization vector of the fourth sample obtained by the language characterization unit into the intention recognition model, so as to obtain a predicted intention category corresponding to the fourth sample.
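
And a sketch of the prediction stage of claim 20: the current turn's two statements are spliced into the fourth sample and passed through the fine-tuned model; the tokenize helper mapping text to an id tensor is hypothetical:

    import torch

    @torch.no_grad()
    def predict_intent(language_model, intent_model, tokenize, fifth, sixth):
        fourth_sample = tokenize(fifth + sixth)  # spliced fourth sample, shape [1, L]
        rep = language_model(fourth_sample)      # language characterization vector
        logits = intent_model(rep[:, 0])
        return logits.argmax(dim=-1).item()      # predicted intention category
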
21. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-10.
22. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-10.
CN202011009914.6A 2020-09-23 2020-09-23 Method and apparatus for pre-training language model Active CN112084317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011009914.6A CN112084317B (en) 2020-09-23 2020-09-23 Method and apparatus for pre-training language model

Publications (2)

Publication Number Publication Date
CN112084317A true CN112084317A (en) 2020-12-15
CN112084317B CN112084317B (en) 2023-11-14

Family

ID=73739659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011009914.6A Active CN112084317B (en) 2020-09-23 2020-09-23 Method and apparatus for pre-training language model

Country Status (1)

Country Link
CN (1) CN112084317B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001057851A1 (en) * 2000-02-02 2001-08-09 Famoice Technology Pty Ltd Speech system
US20200242302A1 (en) * 2019-01-29 2020-07-30 Ricoh Company, Ltd. Intention identification method, intention identification apparatus, and computer-readable recording medium
CN109992648A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 The word-based depth text matching technique and device for migrating study
CN111291166A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for training language model based on Bert

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Pengyuan; LU Chunhui; WANG Ruimin: "Chinese prosodic structure prediction based on a pre-trained language representation model", Journal of Tianjin University (Science and Technology), no. 03 *
XU Feifei; FENG Dongsheng: "Research on text word vectors and pre-trained language models", Journal of Shanghai University of Electric Power, no. 04 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905772A (en) * 2021-02-10 2021-06-04 网易有道信息技术(北京)有限公司 Semantic correlation analysis method and device and related products
CN112905772B (en) * 2021-02-10 2022-04-19 网易有道信息技术(北京)有限公司 Semantic correlation analysis method and device and related products
CN113177113A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Task type dialogue model pre-training method, device, equipment and storage medium
CN113177113B (en) * 2021-05-27 2023-07-25 中国平安人寿保险股份有限公司 Task type dialogue model pre-training method, device, equipment and storage medium
CN113554168A (en) * 2021-06-29 2021-10-26 北京三快在线科技有限公司 Model training method, vector generating method, model training device, vector generating device, electronic equipment and storage medium
CN113609275A (en) * 2021-08-24 2021-11-05 腾讯科技(深圳)有限公司 Information processing method, device, equipment and storage medium
CN113609275B (en) * 2021-08-24 2024-03-26 腾讯科技(深圳)有限公司 Information processing method, device, equipment and storage medium
CN113688245A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Method, device and equipment for processing pre-training language model based on artificial intelligence
CN113688245B (en) * 2021-08-31 2023-09-26 中国平安人寿保险股份有限公司 Processing method, device and equipment of pre-training language model based on artificial intelligence
WO2024109546A1 (en) * 2022-11-22 2024-05-30 北京猿力未来科技有限公司 Dialogue detection model training method and device

Also Published As

Publication number Publication date
CN112084317B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN110413746B (en) Method and device for identifying intention of user problem
CN111309889B (en) Method and device for text processing
CN112084317A (en) Method and apparatus for pre-training a language model
WO2019200923A1 (en) Pinyin-based semantic recognition method and device and human-machine conversation system
CN111177324B (en) Method and device for carrying out intention classification based on voice recognition result
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN111739519B (en) Speech recognition-based dialogue management processing method, device, equipment and medium
CN111339781A (en) Intention recognition method and device, electronic equipment and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN110019742B (en) Method and device for processing information
CN111159364B (en) Dialogue system, dialogue device, dialogue method, and storage medium
CN113268610B (en) Intent jump method, device, equipment and storage medium based on knowledge graph
US11636272B2 (en) Hybrid natural language understanding
CN111625634A (en) Word slot recognition method and device, computer-readable storage medium and electronic device
CN110704597B (en) Dialogue system reliability verification method, model generation method and device
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN110781072A (en) Code auditing method, device and equipment based on machine learning and storage medium
CN118133233A (en) Business process mining method and device, electronic equipment and medium
CN117556057A (en) Knowledge question-answering method, vector database construction method and device
CN113326359A (en) Training method and device for dialogue response and response strategy matching model
CN117370512A (en) Method, device, equipment and storage medium for replying to dialogue
KR102448733B1 (en) Dialog system for response selecting considering turn configuration in context and the method thereof
CN111091011B (en) Domain prediction method, domain prediction device and electronic equipment
CN115510213A (en) Question answering method and system for working machine and working machine
CN116051151A (en) Customer portrait determining method and system based on machine reading understanding and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant