CN118350463B - Question-answer model training method, text processing method and reward model training method - Google Patents

Question-answer model training method, text processing method and reward model training method

Info

Publication number: CN118350463B
Application number: CN202410779372.2A
Authority: CN (China)
Prior art keywords: model, sample, question, reward, answer
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN118350463A
Inventors: 陈奕名, 刘海燕, 林金曙
Current Assignee: Hundsun Technologies Inc
Original Assignee: Hundsun Technologies Inc
Application filed by Hundsun Technologies Inc
Priority to CN202410779372.2A
Publication of CN118350463A
Application granted
Publication of CN118350463B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation


Abstract

An embodiment of this specification provides a question-answer model training method, a text processing method and a reward model training method. The question-answer model training method comprises the following steps: extracting a sample question from a question-answer sample pair, and determining thinking chain data corresponding to the sample question; updating the sample question into a target sample question using the thinking chain data, and inputting the target sample question into an initial question-answer model for processing to obtain a predicted answer; scoring the predicted answer against the sample answer in the question-answer sample pair using a reward model associated with the initial question-answer model to obtain an optimization score; and tuning the parameters of the initial question-answer model based on the optimization score until a target question-answer model satisfying a training stop condition is obtained.

Description

Question-answer model training method, text processing method and reward model training method
Technical Field
The embodiment of the specification relates to the technical field of machine learning, in particular to a question-answer model training method, a text processing method and a reward model training method.
Background
With the development of computer technology, large models are being applied in more and more scenarios. By using large amounts of data and computing resources, such models can satisfy users across a variety of tasks; for example, text generation, text classification, named entity recognition and sentiment analysis can all be implemented by trained large models. In the prior art, reinforcement learning is a critical part of the training process of a large model. Reinforcement learning is a machine learning method in which the model interacts with an environment and is learned and optimized according to the rewards given by that environment. The bottleneck in the reinforcement learning process is the reward model: as the key component of reinforcement learning, it determines the reward the model receives after taking an action in the environment. If the reward model can accurately reflect how good a behavior is, reinforcement learning can effectively optimize the large model. However, because practical application environments are complex, it is difficult for the reward model to accurately reflect the quality of the current behavior, so the large model optimized on this basis does not achieve good prediction performance. An effective scheme is therefore needed to solve these problems.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a question-answering model training method. One or more embodiments of the present specification relate to a text processing method, a reward model training method, a question-answer model training apparatus, a text processing apparatus, a reward model training apparatus, a computing device, a computer-readable storage medium, and a computer program product, to solve the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present disclosure, there is provided a question-answering model training method, including:
extracting sample questions from a question-answer sample pair, and determining thinking chain data corresponding to the sample questions;
updating the sample questions into target sample questions by using the thinking chain data, and inputting the target sample questions into an initial question-answer model for processing to obtain predicted answers;
scoring the predicted answer according to the sample answer in the question-answer sample pair by using a reward model associated with the initial question-answer model to obtain an optimization score;
and adjusting parameters of the initial question-answer model based on the optimization score until a target question-answer model meeting training stopping conditions is obtained.
According to a second aspect of embodiments of the present specification, there is provided a text processing method, including:
receiving a question text uploaded by a client;
determining question domain information corresponding to the question text, and selecting a target question-answer model matching the question domain information, wherein the target question-answer model is obtained through the question-answer model training method;
and inputting the question text into the target question-answer model for processing, obtaining an answer text, and feeding the answer text back to the client.
According to a third aspect of embodiments of the present specification, there is provided a reward model training method, comprising:
obtaining a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample;
inputting the reward sample into an initial reward model for scoring to obtain a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the reward sample;
determining a target loss function preset for the initial reward model, wherein the target loss function comprises a vector constraint term for constraining a target score vector;
and calculating the score vector sample sequence and the score vector prediction sequence using the target loss function, and adjusting the parameters of the initial reward model according to the calculation result until a reward model meeting a reward training stop condition is obtained.
According to a fourth aspect of embodiments of the present specification, there is provided a question-answering model training device, including:
an extraction module configured to extract a sample question from a question-answer sample pair and determine thinking chain data corresponding to the sample question;
an updating module configured to update the sample question into a target sample question using the thinking chain data, and input the target sample question into an initial question-answer model for processing to obtain a predicted answer;
a scoring module configured to score the predicted answer against the sample answer in the question-answer sample pair using a reward model associated with the initial question-answer model to obtain an optimization score;
and a parameter tuning module configured to tune the parameters of the initial question-answer model based on the optimization score until a target question-answer model meeting a training stop condition is obtained.
According to a fifth aspect of embodiments of the present specification, there is provided a text processing apparatus comprising:
a receiving module configured to receive a question text uploaded by a client;
a determining module configured to determine question domain information corresponding to the question text and select a target question-answer model matching the question domain information, wherein the target question-answer model is obtained through the question-answer model training method;
and a sending module configured to input the question text into the target question-answer model for processing, obtain an answer text, and feed the answer text back to the client.
According to a sixth aspect of embodiments of the present specification, there is provided a reward model training apparatus, comprising:
a sample acquisition module configured to acquire a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample;
a sample scoring module configured to input the reward sample into an initial reward model for scoring to obtain a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the reward sample;
a function determination module configured to determine a target loss function preset for the initial reward model, wherein the target loss function comprises a vector constraint term for constraining a target score vector;
and a model tuning module configured to calculate the score vector sample sequence and the score vector prediction sequence using the target loss function, and tune the initial reward model according to the calculation result until a reward model meeting a reward training stop condition is obtained.
According to a seventh aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
The memory is configured to store computer-executable instructions that, when executed by the processor, implement the steps of the question-answer model training method, the text processing method, or the reward model training method described above.
According to an eighth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the question-answer model training method, text processing method, or reward model training method described above.
According to a ninth aspect of embodiments of the present specification, there is provided a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the question-answer model training method, text processing method or reward model training method described above.
In order that the reward model can accurately reflect, during question-answer model training, how well the trained question-answer model predicts on the current sample pair, the question-answer model training method provided by this embodiment first extracts a sample question from a question-answer sample pair and determines the thinking chain data corresponding to the sample question. On this basis, so that even short questions and answers can still be used to reinforce the question-answer model, the sample question is updated into a target sample question using the thinking chain data; the target sample question embodies more text content and is then input into the initial question-answer model for processing to obtain a predicted answer. Because the sample question has been updated into the target sample question, the reward model can give a score that accurately reflects the prediction accuracy of the model when it scores: the reward model associated with the initial question-answer model scores the predicted answer against the sample answer in the question-answer sample pair to obtain an optimization score. Finally, the parameters of the initial question-answer model are tuned based on the optimization score until a target question-answer model satisfying the training stop condition is obtained. In the reinforcement learning stage of the question-answer model, the sample question is updated through the thinking chain data so that it carries richer text information; after the question-answer model makes a prediction, the reward model can therefore give a more accurate score for the sample question, and the scoring result accurately reflects the degree to which the question-answer model needs to be optimized. Subsequent training performed on this basis can effectively improve model training accuracy, so that the model can meet downstream business needs in the application stage.
Drawings
FIG. 1 is a schematic diagram of a question-answering model training method according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of question-answering model training provided in one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of scoring results of a reward model in a question-answering model training method according to one embodiment of the present disclosure;
FIG. 4 is a flow chart of a text processing method provided by one embodiment of the present description;
FIG. 5 is a flow chart of a reward model training method provided by one embodiment of the present description;
FIG. 6 is a process flow diagram of a question-answering model training method provided by one embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a question-answering model training device according to one embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a text processing device according to one embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a reward model training device according to one embodiment of the present disclosure;
FIG. 10 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of this specification; therefore, this specification is not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second", and similarly, "second" may also be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
Furthermore, it should be noted that, user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, presented data, etc.) according to one or more embodiments of the present disclosure are information and data authorized by a user or sufficiently authorized by each party, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation entries for the user to select authorization or denial.
In one or more embodiments of the present specification, a large model refers to a deep learning model with large-scale model parameters, typically including hundreds of millions, billions, trillions, or even more model parameters. A large model may also be called a foundation model: it is pre-trained on a large-scale unlabeled corpus to produce a pre-trained model with more than one hundred million parameters, which can adapt to a wide range of downstream tasks and has good generalization capability, such as a large language model (Large Language Model, LLM) or a multi-modal pre-trained model.
When a large model is actually applied, the pre-trained model can be adapted to different tasks with only a small number of samples. Large models can be widely used in fields such as natural language processing (Natural Language Processing, NLP) and computer vision, in particular in computer-vision tasks such as visual question answering (Visual Question Answering, VQA), image captioning (IC) and image generation, and in natural-language-processing tasks such as text-based sentiment classification, text summarization and machine translation. The main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, intelligent design, and the like.
In this specification, a question-answering model training method is provided. One or more embodiments of the present specification relate to a text processing method, a reward model training method, a question-answer model training apparatus, a text processing apparatus, a reward model training apparatus, a computing device, a computer-readable storage medium, and a computer program product, which are described in detail in the following embodiments one by one.
Referring to the schematic diagram shown in FIG. 1, in order that the reward model can accurately reflect, during question-answer model training, how well the trained question-answer model predicts on the current sample pair, the question-answer model training method provided by this embodiment first extracts a sample question from a question-answer sample pair and determines the thinking chain data corresponding to the sample question. On this basis, so that even short questions and answers can still be used to reinforce the question-answer model, the sample question is updated into a target sample question using the thinking chain data; the target sample question embodies more text content and is then input into the initial question-answer model for processing to obtain a predicted answer. Because the sample question has been updated into the target sample question, the reward model can give a score that accurately reflects the prediction accuracy of the model when it scores: the reward model associated with the initial question-answer model scores the predicted answer against the sample answer in the question-answer sample pair to obtain an optimization score. Finally, the parameters of the initial question-answer model are tuned based on the optimization score until a target question-answer model satisfying the training stop condition is obtained. In the reinforcement learning stage of the question-answer model, the sample question is updated through the thinking chain data so that it carries richer text information; after the question-answer model makes a prediction, the reward model can therefore give a more accurate score for the sample question, and the scoring result accurately reflects the degree to which the question-answer model needs to be optimized. Subsequent training performed on this basis can effectively improve model training accuracy, so that the model can meet downstream business needs in the application stage.
Referring to fig. 2, fig. 2 shows a flowchart of a question-answering model training method according to one embodiment of the present disclosure, which specifically includes the following steps.
Step S202, sample questions are extracted from the question-answer sample pairs, and thinking chain data corresponding to the sample questions are determined.
The question-answer model training method provided by this embodiment can be applied to a large model used in any question-answer scenario, such as a large model that answers financial questions in a financial scenario, a large model that answers questions in a teaching scenario, or a large model that answers development questions in a development scenario. This embodiment is not limited in this respect.
Specifically, the question-answer sample pair is a sample pair used to train a question-answer model associated with a target field, and it comprises a sample question and a sample answer; the sample question is used as the model input, and the sample answer is used as the label in the model training stage or the reinforcement learning stage. In practice, the sample question and the sample answer exist in text form and are used to train the question-answer model. Correspondingly, the thinking chain data (i.e., chain-of-thought data) is data used to rewrite the sample question. Its purpose is to enrich a short sample question so that it carries richer text information, which is convenient for subsequent model training and makes it easier for the model to capture semantic associations in the text. That is, the thinking chain data can be understood as a data structure for updating the sample question; it is used to replace, update, or add characters to the sample question so that the sample question becomes a target sample question with richer text information for subsequent use. The thinking chain data can also be understood as adding, to the sample question, question content that makes the model think, so that the model can learn this knowledge and improve its prediction accuracy and generalization capability. In other words, a short sample question is made more elaborate, which prompts the model to capture semantic associations in the more elaborate text and thereby improves model training accuracy.
On this basis, in order that the reward model can accurately reflect, during question-answer model training, how well the trained question-answer model predicts on the current sample pair, a sample question is first extracted from the question-answer sample pair and the thinking chain data corresponding to the sample question is determined. Then, so that even short questions and answers can still be used to reinforce the question-answer model, the sample question is updated into a target sample question using the thinking chain data; the target sample question embodies more text content and is input into the initial question-answer model for processing to obtain a predicted answer. Because the sample question has been updated into the target sample question, the reward model can give a score that accurately reflects the prediction accuracy of the model; that is, the reward model associated with the initial question-answer model scores the predicted answer against the sample answer in the question-answer sample pair to obtain an optimization score. Finally, the parameters of the initial question-answer model are tuned based on the optimization score until a target question-answer model satisfying the training stop condition is obtained.
Furthermore, when determining the thinking chain data corresponding to the sample question, in order to ensure that the updated sample question can still be used for model training and to avoid redundant information affecting model training accuracy, the thinking chain data can be selected according to the question type. In this embodiment, the specific implementation is as follows:
determining a question type corresponding to the sample question, and selecting, from a target database, candidate thinking chain data matching the question type as the thinking chain data;
specifically, the problem type is a type of the field to which the specified position problem belongs, and different types correspond to different thinking chain data. Correspondingly, the target database specifically refers to a database for storing multiple question type associated candidate thinking chain data. Accordingly, the candidate mental chain data specifically refers to the mental chain data selected for the sample problem in the target database. In specific implementation, the operation of selecting the candidate mental chain data can be realized by adopting a mode of calculating text similarity, namely, the text similarity between the sample problem and the mental chain data is calculated, then the candidate mental chain data with the highest priority is sorted according to the similarity, and the candidate mental chain data with the highest priority is selected as the mental chain data corresponding to the sample problem.
Based on the above, after the sample question is extracted from the question-answer sample pair, the question type corresponding to the sample question is determined first, and then the candidate thinking chain data matching the question type is selected as the thinking chain data from a target database storing a large amount of thinking chain data, so that the sample question can later be updated into the target sample question and the training of the question-answer model can proceed.
For example, suppose the question-answer sample pair for training the question-answer model includes the sample question {What is the result of 1×1+2×2? Answer only the result} and the sample answer {5}. On this basis, in order to improve the prediction accuracy of the question-answer model, thinking chain data matching the sample question, such as {The thinking process needs to be answered}, can be loaded, so that the sample question can later be rewritten in combination with the thinking chain data to contain richer question information. In the question-answer model training stage, the question and the answer can then be combined to score the question-answer model accurately, so that a question-answer model meeting the usage requirements is trained.
In practical applications, different thinking chain data can be set for different scenarios. For example, in a financial scenario, the thinking chain data "return the thinking process and output the result in the required format" can be uniformly appended to the end of each question, so that the model gives a new answer in combination with the new question. In this process, one group of answers can be produced by an already-trained large model and another group by the large model to be trained; positive answers are then selected from the answers of the trained large model and negative answers from the answers of the large model to be trained according to preset rules, where a positive answer is required to contain steps and logic, and a negative answer may exhibit problems such as unreasonable question analysis, wrong classification, or wrong output format. Positive and negative sample pairs are thus constructed, and the large model to be trained is continuously trained in combination with the reward model, so that a large model meeting the requirements can be trained and deployed in the application scenario.
In conclusion, by providing multiple pieces of thinking chain data and supporting selection by question type in the model training stage, it can be effectively ensured that updating the sample question does not distort the meaning of the question, the purpose of reinforcement learning is achieved, and model training accuracy is improved.
And step S204, updating the sample questions into target sample questions by using the thinking chain data, and inputting the target sample questions into an initial question-answer model for processing to obtain predicted answers.
Specifically, after the sample question and its corresponding thinking chain data are obtained from the question-answer sample pair, in order to enable the question-answer model to learn the thinking process and to allow the reward model to give accurate scores for questions of different lengths, the sample question can be updated into the target sample question using the thinking chain data, thereby adding to the sample question the thinking knowledge that the model should learn. The target sample question can then be input into the initial question-answer model for processing; after the predicted answer is obtained, the reward model is used for scoring, and the model optimization is completed in combination with the scoring result.
The target sample question refers to the question text obtained by updating the sample question with the thinking chain data; it carries text content that enables the question-answer model to learn thinking knowledge. Correspondingly, the initial question-answer model refers to a large language model capable of giving an answer based on a question, and the predicted answer refers to the predicted answer text output by the initial question-answer model for the target sample question. It should be noted that, since the target sample question carries text content corresponding to the thinking chain data, whose purpose is to make the model learn thinking knowledge, the predicted answer output by the initial question-answer model contains both the predicted answer text for the sample question and the text of the reasoning used to answer it. The target sample question and the currently obtained predicted answer can therefore be combined to complete scoring with higher precision, and the question-answer model can determine from the score which parameters need to be strengthened or weakened, thereby ensuring model prediction accuracy and generalization capability.
Furthermore, when updating the sample question, in order to avoid errors during updating that would affect the question content the sample question needs to express, the update can be realized by matching word units. In this embodiment, the specific implementation is as follows:
determining, in the sample question, a word unit to be changed that matches the thinking chain data, and updating the word unit to be changed using the thinking chain data to obtain the target sample question.
Specifically, the word unit to be changed refers to a word unit in the sample question that matches the thinking chain data and needs to be changed. Based on the above, after the thinking chain data is obtained, the word unit to be changed that matches the thinking chain data is determined in the sample question; the word unit to be changed is then updated using the thinking chain data, the target sample question is obtained from the update result, and the model is trained accordingly.
Following the above example, after the sample question {What is the result of 1×1+2×2? Answer only the result} and the thinking chain data {The thinking process needs to be answered} are obtained, the sample question is updated, and according to the update result the target sample question {What is the result of 1×1+2×2? The thinking process needs to be answered} is obtained. Further, the target sample question can be input into the question-answer model for processing, and the obtained predicted answer is {First we calculate the first step 1×1=1, then the second step 2×2=4, and the third step adds the results of the first two steps together, 1+4=5, so the final result is 5}. The obtained predicted answer contains not only the answer to the question but also the solution steps. Scoring with the reward model on this basis can then realize reinforcement learning of the question-answer model and improve the answer accuracy of the model.
In summary, the sample question is updated by matching the word unit to be changed, so that the core content of the updated target sample question remains the same as that of the original sample question while thinking knowledge for the model to learn is provided, achieving the goal that samples of different lengths can all improve model prediction accuracy.
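The update step itself can be pictured with the following minimal sketch; the rule of replacing a trailing instruction such as "Answer only the result" with the thinking chain text is an assumed matching rule for illustration, since the embodiment only requires that the matched word unit be updated without changing the core content of the question.

def update_sample_question(sample_question: str, thinking_chain: str,
                           unit_to_change: str = "Answer only the result") -> str:
    # If the word unit to be changed is found in the sample question, replace it with
    # the thinking chain text; otherwise simply append the thinking chain text.
    if unit_to_change in sample_question:
        return sample_question.replace(unit_to_change, thinking_chain)
    return sample_question.rstrip() + " " + thinking_chain

target_question = update_sample_question(
    "What is the result of 1*1+2*2? Answer only the result.",
    "The thinking process needs to be answered.")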
Step S206, scoring the predicted answer according to the sample answer in the question-answer sample pair by using the reward model associated with the initial question-answer model to obtain an optimization score.
Specifically, after the predicted answer output by the initial question-answer model is obtained, and considering that the purpose of the reward model is to help the initial question-answer model be optimized more accurately in the reinforcement stage, the scoring result of the reward model is the standard for determining how well the question-answer model has been optimized. To improve the scoring accuracy of the reward model, the reward model completes scoring by combining the predicted answer, which contains both the step content and the answer content. That is, the reward model associated with the initial question-answer model scores the predicted answer against the sample answer in the question-answer sample pair and outputs an optimization score that reflects the current prediction ability of the model, so that the parameter tuning of the model can be completed in combination with this score. The higher the optimization score, the higher the model prediction accuracy; conversely, the lower the optimization score, the lower the model prediction accuracy. In addition to being an overall score, the optimization score can be a sequence formed by the score corresponding to each token, so that the prediction accuracy of the initial question-answer model on each token can be shown at a finer granularity and subsequent optimization is more targeted.
The reward model is deployed and trained for the reinforcement learning stage of the initial question-answer model; its input is the result of splicing the predicted answer, the target sample question and the sample answer, and its output is the optimization score.
Further, when scoring with the reward model, in order to obtain an optimization score with higher accuracy and better interpretability, the question and the answer can be spliced first and then input into the reward model for scoring. In this embodiment, the specific implementation is as follows:
loading a reward model associated with the initial question-answer model, and splicing the target sample questions and the predicted answers into an optimized text; and extracting sample answers from the question-answer sample pair, and scoring the optimized text according to the sample answers by utilizing the reward model to obtain an optimized score.
Specifically, the optimized text refers to the text content obtained after the target sample question and the predicted answer are spliced; it is input into the reward model, which is used to evaluate the accuracy of the current output of the question-answer model.
On this basis, after the predicted answer containing both the step information and the answer information is obtained, the reward model associated with the initial question-answer model is loaded, the target sample question and the predicted answer are spliced, and the optimized text is obtained from the splicing result; the sample answer is extracted from the question-answer sample pair; and then the reward model scores the optimized text against the sample answer to obtain the optimization score, so that the model can be optimized for subsequent use.
Following the above example, after the predicted answer {First we calculate the first step 1×1=1, then the second step 2×2=4, and the third step adds the results of the first two steps together, 1+4=5, so the final result is 5} is obtained, it is spliced with the target sample question {What is the result of 1×1+2×2? The thinking process needs to be answered}, yielding the optimized text {What is the result of 1×1+2×2? The thinking process needs to be answered; First we calculate the first step 1×1=1, then the second step 2×2=4, and the third step adds the results of the first two steps together, 1+4=5, so the final result is 5}. Further, the sample answer {5} is extracted from the question-answer sample pair, and the sample answer {5} and the optimized text are input into the reward model together for scoring. Assuming the question-answer model gives a set of answers, 5 and 3, for this question, the score distribution for each token in the two predicted answers is shown in FIG. 3. The score of each token reflects the predictive ability of the question-answer model, so that accurate model optimization can be performed later.
In conclusion, by splicing the texts and having the reward model score the optimized text, the reward model can be guaranteed to give more accurate scores for the predicted answer; these scores reflect the prediction capability of the question-answer model, enabling more accurate model tuning afterwards.
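A minimal sketch of the splice-and-score step is given below; the reward_model callable, which takes the optimized text and the sample answer and returns one optimization score per token, is a hypothetical interface standing in for whatever reward model implementation is actually used.

from typing import Callable, List

def build_optimized_text(target_question: str, predicted_answer: str) -> str:
    # Splice the target sample question and the predicted answer into the optimized text.
    return target_question + "; " + predicted_answer

def score_prediction(reward_model: Callable[[str, str], List[float]],
                     target_question: str, predicted_answer: str,
                     sample_answer: str) -> List[float]:
    optimized_text = build_optimized_text(target_question, predicted_answer)
    # The reward model scores the optimized text against the sample answer and
    # returns one optimization score per token.
    return reward_model(optimized_text, sample_answer)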
In addition, in order for the reward model to fully and accurately reflect the scores of texts of different lengths and to improve its generalization capability, so that the question-answer model can be better reinforced, the reward model can be trained in the following way. In this embodiment, the specific implementation is as follows:
Obtaining a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample; inputting the reward sample into an initial reward model to score, and obtaining a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the reward sample; and calculating a loss value according to the score vector sample sequence and the score vector prediction sequence, and adjusting parameters of the initial reward model based on the loss value until a reward model meeting the condition of stopping reward training is obtained.
Specifically, the reward sample specifically refers to a positive/negative sample composed of a question text, an answer text and a real answer text, and is used for training a reward model. Correspondingly, the score vector sample sequence specifically refers to a sequence consisting of standard score vectors corresponding to each word unit in the reward sample, and the standard score vector corresponding to each word unit is the vector expression of the true score corresponding to each word unit. Correspondingly, the score vector prediction sequence specifically refers to a sequence formed by the obtained predicted score vectors after scoring each word unit by the reward model. That is, the predicted score vector included in the score vector predicted sequence is a vector expression of the predicted score corresponding to each word unit.
Based on the method, in the training stage of the reward model, in order to enable the reward model to have more accurate scoring capability, so that the question-answer model can be more accurately tuned in the reinforcement learning stage, a reward sample and a score vector sample sequence corresponding to the reward sample can be acquired first, wherein the score vector sample sequence consists of standard score vectors corresponding to each word unit in the reward sample; and then, inputting the reward sample into the initial reward model for scoring, and obtaining a score vector prediction sequence obtained after the reward sample is scored by the reward model, wherein the score vector prediction sequence consists of a predicted score vector corresponding to each word unit in the reward sample. After the score vector predicted sequence output by the reward model is obtained, the score vector sample sequence and the score vector predicted sequence can be combined to calculate a loss value, and then the initial reward model is subjected to parameter adjustment based on the loss value until the reward model meeting the condition of stopping reward training is obtained.
In practical application, when calculating the loss value by combining the prediction result and the label, the calculation is completed by the loss function of the following formula (1):
Loss = -Σ_i log( σ( V_t(i) - V_p(i) ) )    (1)
where Loss is the loss value, V_p and V_t are the predicted score vector and the true score vector output by the reward model after scoring each token in the reward sample (if their lengths differ, they may be padded to the same length), V_t(i) and V_p(i) are their i-th components, and σ(·) denotes the sigmoid function.
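The following sketch shows one way formula (1) could be computed with PyTorch; the per-token pairwise form, the zero-padding convention and the tensor names are assumptions inferred from the description above, not a reference implementation.

import torch
import torch.nn.functional as F

def pairwise_token_loss(v_true: torch.Tensor, v_pred: torch.Tensor) -> torch.Tensor:
    # v_true / v_pred: 1-D tensors of the per-token scores the reward model assigns to
    # the true answer and to the predicted answer; zero-pad the shorter one (assumption).
    n = max(v_true.numel(), v_pred.numel())
    v_true = F.pad(v_true, (0, n - v_true.numel()))
    v_pred = F.pad(v_pred, (0, n - v_pred.numel()))
    # Formula (1): push every true-answer token score above the corresponding
    # predicted-answer token score.
    return -torch.log(torch.sigmoid(v_true - v_pred)).sum()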
In addition, the reward training stop condition refers to a condition for stopping the training of the reward model, which includes but is not limited to a loss-value comparison condition, a validation-set verification condition, or an iteration-count condition; it may be selected according to actual requirements in a specific implementation, and this embodiment is not limited in this regard.
In summary, by training the reward model in this way, it can output more accurate scores for a given predicted answer, so that in the reinforcement learning stage the question-answer model can complete its parameter tuning according to these more accurate scores.
Further, considering that the loss function used in the reward model training phase should focus more on the score of the last input token, in order to enhance this characteristic and to enable the reward model to give more accurate scores for texts of different lengths, the loss calculation can be done with a loss function that contains a vector constraint term. In this embodiment, the specific implementation is as follows:
determining a target loss function preset for the initial reward model, wherein the target loss function comprises a vector constraint term for constraining a target score vector; and calculating the score vector sample sequence and the score vector prediction sequence using the target loss function to obtain a loss value, and executing the step of tuning the initial reward model based on the loss value.
Specifically, the target loss function refers to a loss function that includes a vector constraint term constraining the target score vector; it may be understood as the loss function obtained by adding the vector constraint term to formula (1) above. The target score vector refers to the score corresponding to the last token of the reward sample input into the reward model, and the vector constraint term is a term that constrains the score of this last token.
On this basis, it can be seen from formula (1) that the loss function mainly increases the distance between the overall score vectors; however, in the subsequent reinforcement learning process, the question-answer model actually attends to the score of the last token. Therefore, in order to further encourage a difference in the scores of the last token, a vector constraint term can be added to the loss function. The resulting target loss function is then used to compute the loss value from the score vector sample sequence and the score vector prediction sequence, and the step of tuning the initial reward model based on the loss value is performed.
In practical applications, after a term encouraging the score difference of the last token is added to the loss function of formula (1), the loss function corresponding to formula (2) is obtained:
Loss = -Σ_i log( σ( V_t(i) - V_p(i) ) ) - λ · log( σ( V_t(N) - V_p(N) ) )    (2)
where N is the index of the last token and λ weights the vector constraint term. The value of λ may be set according to actual requirements, for example 0.4; this embodiment is not limited in this regard.
Combining formulas (1) and (2): in the ideal state of the reward model, every token of a positive sample receives a high score and every token of a negative sample receives a low score. In practice, however, because the reward model is not sufficiently trained, the average token score of a positive sample may be higher than that of a negative sample while the score of an individual token is lower than the corresponding negative-sample score. To constrain this case, a vector constraint term is added on the basis of the loss function of formula (1), increasing the constraint on the score of the last token precisely because that score is the one applied in the reinforcement learning stage. This ensures that the loss function of formula (2) can effectively tune the reward model, and applying the result in the training stage of the question-answer model allows scoring to be more accurate.
In summary, by tuning the reward model with a target loss function that includes the vector constraint term, a constraint on the score that matters most in the reinforcement learning stage is added to the loss, and the reward model is optimized on this basis; when the reward model then scores, it can give a reasonable score under this constraint, improving the model optimization effect in the reinforcement learning stage.
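Continuing the previous sketch under the same assumptions, the vector constraint term of formula (2) only adds a λ-weighted copy of the last-token term:

import torch

def constrained_token_loss(v_true: torch.Tensor, v_pred: torch.Tensor,
                           lam: float = 0.4) -> torch.Tensor:
    # Formula (2): formula (1) plus a weighted constraint on the last token's score,
    # reusing pairwise_token_loss from the earlier sketch.
    base = pairwise_token_loss(v_true, v_pred)
    last = -torch.log(torch.sigmoid(v_true[-1] - v_pred[-1]))
    return base + lam * last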
Further, to enhance the generalization ability of the reward model, the loss function may also be optimized in conjunction with SimCSE. In this embodiment, the specific implementation is as follows:
determining an initial loss function preset for the initial reward model; updating the initial loss function in response to a model optimization request submitted for the initial reward model, wherein the model optimization request is used for adding semantic learning information to the initial loss function; and determining a target loss function according to the update result, and executing the step of determining the target loss function preset for the initial reward model.
Specifically, the optimization request refers to a request, submitted after the initial loss function is obtained, for adding semantic learning information to the initial loss function. It is used to change the loss function in the optimization stage of the reward model, so that the changed loss function gives the reward model better generalization capability.
Based on this, the initial loss function preset for the initial reward model is determined first; the initial loss function is then updated in response to a model optimization request submitted for the initial reward model, where the model optimization request is used to add semantic learning information to the initial loss function; and the target loss function is determined according to the update result, after which the step of determining the target loss function preset for the initial reward model is executed.
In practical applications, SimCSE (Simple Contrastive Learning of Sentence Embeddings) may be used when optimizing the loss function corresponding to formula (2). The main idea of SimCSE is to train sentence-embedding models by contrastive learning: two independent copies of the same sentence are taken as a positive sample pair, and the model is trained by maximizing the cosine similarity of the embedding vectors of the two copies while minimizing the similarity to other sentences (negative samples). The advantage of SimCSE is that it is simple and efficient; it does not require complex data preprocessing or labeling, only the original text data. In addition, SimCSE can be used in conjunction with pre-trained language models (e.g., BERT, RoBERTa), further improving performance. SimCSE uses a contrastive loss function of the following form, formula (3):
Lsimcse = -log( exp( sim(u, v) / τ ) / Σ exp( sim(u, v') / τ ) )    (3)
where u and v are the embedding vectors of two independent copies of the same sentence, v' is the embedding vector of another sentence, sim(u, v) denotes the cosine similarity of u and v, τ is a temperature parameter, and Σ sums over all negative sample pairs. The goal of this loss function is to maximize the similarity of positive sample pairs and minimize the similarity of negative sample pairs; by optimizing it, the model learns to generate embedding vectors that accurately represent sentence semantics. Therefore, optimizing the loss function of formula (2) on the basis of the loss function of formula (3) yields the target loss function shown in formula (4):
Loss = -Σ_i log( σ( V_t(i) - V_p(i) ) ) - λ · log( σ( V_t(N) - V_p(N) ) ) + Lsimcse    (4)
The generalization capability of the reward model can be improved through the loss function shown in formula (4), so that the prediction accuracy of the question-answer model can be improved in the reinforcement learning stage.
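A sketch of the SimCSE term and of the combined target loss of formula (4) follows, reusing the loss functions from the sketches above; the batch layout (two embeddings per sentence, with the other rows of the batch acting as negatives) and the plain summation of the terms are illustrative assumptions.

import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    # emb_a / emb_b: [batch, dim] embeddings of two independent copies of each sentence;
    # the diagonal entries are the positive pairs, every other row acts as a negative.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    sim = a @ b.t() / tau
    labels = torch.arange(a.size(0))
    return F.cross_entropy(sim, labels)   # formula (3)

def target_loss(v_true, v_pred, emb_a, emb_b, lam: float = 0.4) -> torch.Tensor:
    # Formula (4): the constrained pairwise loss of formula (2) plus the SimCSE term.
    return constrained_token_loss(v_true, v_pred, lam) + simcse_loss(emb_a, emb_b)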
Step S208, performing parameter tuning on the initial question-answer model based on the optimization score until a target question-answer model meeting the training stop condition is obtained.
Specifically, after the optimization score output by the reward model is obtained, because it reflects the strength of the current prediction capability of the question-answer model, the direction of parameter tuning in the reinforcement learning stage can be determined from it, and the initial question-answer model can be tuned based on the optimization score until a target question-answer model meeting the training stop condition is obtained.
The training stop condition refers to a condition for stopping the training of the question-answer model, which includes but is not limited to a loss-value comparison condition, a validation-set verification condition, or an iteration-count condition; it may be selected according to actual requirements in a specific implementation, and this embodiment is not limited in this regard.
Furthermore, in the parameter tuning stage, a model optimization strategy can be constructed by combining the optimization score and the sample answer, and the parameters the question-answer model needs to strengthen or weaken can be determined explicitly from this strategy, thereby improving model prediction accuracy. In this embodiment, the specific implementation is as follows:
Constructing a model optimization strategy aiming at the initial question-answer model according to the optimization scores and the sample answers; adjusting parameters of the initial question-answer model according to the model optimization strategy, and detecting whether the initial question-answer model after parameter adjustment meets the training stop condition; if not, taking the initial question-answering model after the parameter adjustment as the initial question-answering model, and executing the step of extracting sample questions in the question-answering sample pair; if yes, taking the initial question-answering model after the parameter adjustment as a target question-answering model.
Specifically, the model optimization strategy specifically refers to an optimization strategy for planning an initial question-answer model by combining an optimization score and a sample answer, and the strategy can strengthen parameters to be lifted in the model and weaken parameters which are too strong in the model, so that the model prediction accuracy is improved, and the model overfitting is avoided.
Based on the above, in the model parameter adjustment stage, a model optimization strategy can be constructed for the initial question-answer model according to the optimization score and the sample answer; afterwards, the initial question-answer model can be called according to a model optimization strategy, and whether the initial question-answer model after the call meets the training stopping condition is detected; if not, the question-answering model is required to be trained continuously, so that the initial question-answering model after the parameter adjustment can be used as the initial question-answering model, and the step S202 is executed in a return manner; and (3) until a certain parameter tuning process is performed, the model meets the training stopping condition, and the initial question-answer model after parameter tuning can be used as a target question-answer model to be deployed in a specific business scene for use.
Following the above example, after the reward model scores the optimized text {What is the result of 1*1+2*2? I need to give the thinking process for the answer: first, step one computes 1*1=1, step two computes 2*2=4, and step three adds the two results, 1+4=5, so the final result is 5} and the optimization score is obtained, the question-answer model can be tuned with this optimization score. After tuning, the question-answer model is verified on a validation set; if it is determined that the expected prediction accuracy has not been reached, new samples are selected to continue training, until the validation set confirms that the expected prediction accuracy is satisfied, at which point the question-answer model can be deployed to a downstream business scenario to provide question-answer services for users.
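To make the loop just described more concrete, the following is a minimal sketch of the reinforcement-learning tuning stage. The chain-of-thought update, the parameter update, and the validation check are supplied as caller-provided callables (select_cot, update_with_cot, apply_update, evaluate); these names, the validation-accuracy stop condition, and the PPO-style comment are illustrative assumptions rather than the patent's reference implementation.

```python
import random

def train_until_stop(model, reward_model, qa_pairs, valid_set,
                     select_cot, update_with_cot, apply_update, evaluate,
                     target_accuracy=0.9, max_iterations=1000):
    """Hypothetical sketch of the reinforcement-learning tuning loop (step S208).

    select_cot, update_with_cot, apply_update and evaluate are caller-supplied
    callables standing in for steps the patent describes only in prose.
    """
    for _ in range(max_iterations):
        question, sample_answer = random.choice(qa_pairs)
        cot = select_cot(question)                         # thinking chain data
        target_question = update_with_cot(question, cot)   # target sample question
        predicted_answer = model.generate(target_question)

        # Splice the target sample question and predicted answer into the
        # "optimized text" and let the reward model score it against the sample answer.
        optimized_text = target_question + predicted_answer
        optimization_score = reward_model.score(optimized_text, sample_answer)

        # Model optimization strategy: strengthen or weaken parameters according
        # to the optimization score (for example, a PPO-style policy update).
        model = apply_update(model, target_question, predicted_answer,
                             sample_answer, optimization_score)

        # Training stop condition: a validation-set check here; a loss-value
        # comparison or an iteration-count condition would work equally well.
        if evaluate(model, valid_set) >= target_accuracy:
            break
    return model  # target question-answer model
```

Passing the step-specific operations in as callables keeps the sketch self-contained while leaving the concrete optimization strategy unspecified, as the patent does.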
In order that the reward model can accurately reflect, during question-answer model training, the prediction accuracy achieved by the question-answer model on the current sample, the question-answer model training method provided by this embodiment may first extract a sample question from the question-answer sample pair and determine the thinking chain data corresponding to the sample question; on this basis, so that short questions and answers can still provide effective reinforcement for training the question-answer model, the sample question may be updated into a target sample question using the thinking chain data, so that more text content is embodied in the target sample question, and the target sample question may then be input into the initial question-answer model for processing to obtain a predicted answer; because the sample question has been updated into the target sample question, the reward model can give a score that accurately reflects the model's prediction accuracy, that is, the reward model associated with the initial question-answer model scores the predicted answer against the sample answer in the question-answer sample pair to obtain an optimization score; finally, the initial question-answer model is parameter-tuned based on the optimization score until a target question-answer model satisfying the training stop condition is obtained. In the reinforcement learning stage of the question-answer model, the sample question is updated through thinking chain data so that it carries richer text information; after the question-answer model makes its prediction, the reward model can therefore give a more accurate score for that sample, the scoring result accurately reflects the degree to which the question-answer model needs to be optimized, and subsequent training on this basis can effectively improve model training accuracy, so that downstream business use can be satisfied in the application stage.
Referring to fig. 4, fig. 4 shows a flowchart of a text processing method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step S402, receiving a question text uploaded by a client;
Step S404, determining problem domain information corresponding to the problem text, and selecting a target question-answering model matched with the problem domain information, wherein the target question-answering model is obtained through a question-answering model training method;
And step S406, inputting the question text into the target question-answering model for processing, obtaining an answer text, and feeding back the answer text to the client.
Specifically, the question text refers to the text to be replied to that is uploaded by the client. Correspondingly, the question field information refers to information describing the field to which the question belongs; different question-answer models can be deployed for different fields, so that a question can be answered by a model specialized for it. Correspondingly, the answer text refers to the answer obtained by performing prediction for the question text.
Based on the above, after the question text uploaded by the client is received, the question field information corresponding to the question text can be determined; a target question-answer model matching the question field information can then be selected, and the question text is input into the target question-answer model for processing to obtain an answer text, which is fed back to the client for the user's reference.
For example, the user inputs a question text {who is the author of song A}; a question-answer model matching the music field may be selected to process it and obtain an answer text {A}, and an introduction about A may also be appended and fed back to the user's client for viewing, so that the answer content the user sees at the client may be {the author of song A is A, male, of a certain age, born in a certain place, whose representative works are such-and-such}.
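As a rough sketch of this routing behaviour, the example below assumes a hypothetical domain classifier and a registry mapping question field information to deployed target question-answer models; neither component is named by the patent.

```python
from typing import Any, Callable, Dict

def handle_question(question_text: str,
                    classify_domain: Callable[[str], str],
                    model_registry: Dict[str, Any]) -> str:
    """Route a client question to a domain-matched target question-answer model.

    classify_domain and model_registry are illustrative assumptions; the patent
    only states that question field information is determined and a matching
    target question-answer model is selected.
    """
    domain = classify_domain(question_text)            # question field information
    qa_model = model_registry.get(domain, model_registry["general"])
    answer_text = qa_model.generate(question_text)     # assumed generate() interface
    return answer_text                                 # fed back to the client

# Example (hypothetical models):
# answer = handle_question("who is the author of song A", classify_domain,
#                          {"music": music_qa_model, "general": general_qa_model})
```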
Referring to fig. 5, fig. 5 shows a flowchart of a reward model training method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step S502, a reward sample and a score vector sample sequence corresponding to the reward sample are obtained, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample;
Step S504, inputting the rewarding sample into an initial rewarding model to score, and obtaining a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the rewarding sample;
step S506, determining a target loss function preset by the initial rewarding model, wherein the target loss function comprises a vector constraint item for constraining a target score vector;
and step S508, calculating the score vector sample sequence and the score vector prediction sequence by using the objective loss function, and adjusting parameters of the initial reward model according to a calculation result until the reward model meeting the reward training stop condition is obtained.
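Since steps S506 and S508 describe the target loss only in general terms, the following is a hedged sketch of one plausible form: a per-word-unit regression loss plus an extra constraint term on the target score vector (the score vector of the last word unit). The use of mean-squared error and the constraint_weight coefficient are assumptions, not details given by the patent.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(predicted_scores: torch.Tensor,
                      standard_scores: torch.Tensor,
                      constraint_weight: float = 1.0) -> torch.Tensor:
    """Sketch of a target loss for reward-model training (steps S506-S508).

    Both tensors have shape [seq_len, dim]: one score vector per word unit of
    the reward sample. MSE and constraint_weight are illustrative assumptions.
    """
    # Base term: fit each word unit's predicted score vector to its standard score vector.
    base_loss = F.mse_loss(predicted_scores, standard_scores)

    # Vector constraint term: additionally constrain the target score vector,
    # i.e. the score vector corresponding to the last word unit.
    constraint_term = F.mse_loss(predicted_scores[-1], standard_scores[-1])

    return base_loss + constraint_weight * constraint_term
```

A training step would score a reward sample with the initial reward model, compute this loss against the score vector sample sequence, and back-propagate to tune the model until the reward training stop condition is met.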
It should be noted that the reward model training method provided in this embodiment is the training method corresponding to the reward model applied in the question-answer model training method; for its description, reference may be made to the description of reward model training in the foregoing embodiments, and details are not repeated here.
The following further describes the question-answer model training method with reference to fig. 6, taking its application in a question-answer interaction scenario as an example. Fig. 6 is a flowchart of a processing procedure of a question-answer model training method according to an embodiment of the present disclosure, which specifically includes the following steps (a code sketch of the question-update steps S604 to S608 is given after the step list).
Step S602, extracting sample questions from the question-answer sample pair.
Step S604, determining a question type corresponding to the sample question, and selecting candidate thinking chain data matched with the question type in the target database as the thinking chain data.
Step S606, determining the word unit to be changed in the sample question that matches the thinking chain data.
Step S608, updating the word unit to be changed with the thinking chain data to obtain the target sample question.
Step S610, inputting the target sample questions into the initial question-answering model for processing, and obtaining predicted answers.
Step S612, loading a reward model associated with the initial question-answer model, and splicing the target sample questions and the predicted answers into optimized text.
Step S614, sample answers are extracted from the question-answer sample pairs, and the optimized text is scored according to the sample answers by using the reward model, so as to obtain optimized scores.
And step S616, the initial question-answer model is subjected to parameter adjustment based on the optimization score until a target question-answer model meeting the training stop condition is obtained.
Step S618, receiving the question text uploaded by the client.
And step S620, inputting the question text into a target question-answer model for processing, obtaining an answer text, and feeding back the answer text to the client.
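Steps S604 to S608 are the part of this flow that goes beyond a standard reinforcement-learning setup, so the sketch below illustrates one way the sample question might be rewritten with thinking chain data; the COT_TEMPLATES dictionary, the classify_question_type callable, and the default word_unit_to_change value are all hypothetical stand-ins for the target database and matching rule the patent leaves unspecified.

```python
from typing import Callable, Dict

# Hypothetical target database: question type -> candidate thinking chain data.
COT_TEMPLATES: Dict[str, str] = {
    "arithmetic": "Think step by step and show each calculation, then state the result of",
    "factoid": "First recall the relevant facts and reason about them, then answer:",
}

def build_target_sample_question(
        sample_question: str,
        classify_question_type: Callable[[str], str],
        word_unit_to_change: str = "What is the result of") -> str:
    """Rewrite a sample question into a target sample question (steps S604-S608)."""
    question_type = classify_question_type(sample_question)           # step S604
    cot = COT_TEMPLATES.get(question_type, COT_TEMPLATES["factoid"])  # step S604
    if word_unit_to_change in sample_question:                        # step S606
        # Step S608: replace the matched word unit with the thinking chain data.
        return sample_question.replace(word_unit_to_change, cot, 1)
    # Fallback when no word unit matches: prepend the thinking chain data.
    return f"{cot} {sample_question}"
```

Under these assumptions, the sample question {What is the result of 1*1+2*2?} would become {Think step by step and show each calculation, then state the result of 1*1+2*2?}, which is then fed to the initial question-answer model in step S610.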
In summary, in order that the reward model can accurately reflect, during question-answer model training, the prediction accuracy achieved by the question-answer model on the current sample pair, a sample question may first be extracted from the question-answer sample pair and the thinking chain data corresponding to the sample question determined; on this basis, so that short questions and answers can still provide effective reinforcement for training the question-answer model, the sample question may be updated into a target sample question using the thinking chain data, so that more text content is embodied in the target sample question, and the target sample question may then be input into the initial question-answer model for processing to obtain a predicted answer; because the sample question has been updated into the target sample question, the reward model can give a score that accurately reflects the model's prediction accuracy, that is, the reward model associated with the initial question-answer model scores the predicted answer against the sample answer in the question-answer sample pair to obtain an optimization score; finally, the initial question-answer model is parameter-tuned based on the optimization score until a target question-answer model satisfying the training stop condition is obtained. In the reinforcement learning stage of the question-answer model, the sample question is updated through thinking chain data so that it carries richer text information; after the question-answer model makes its prediction, the reward model can therefore give a more accurate score for that sample, the scoring result accurately reflects the degree to which the question-answer model needs to be optimized, and subsequent training on this basis can effectively improve model training accuracy, so that downstream business use can be satisfied in the application stage.
Corresponding to the method embodiment, the present disclosure further provides an embodiment of a question-answering model training device, and fig. 7 shows a schematic structural diagram of a question-answering model training device provided in one embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
An extraction module 702 configured to extract a sample question in a question-answer sample pair and determine thought chain data corresponding to the sample question;
The updating module 704 is configured to update the sample questions into target sample questions by using the thinking chain data, and input the target sample questions into an initial question-answer model for processing to obtain predicted answers;
A scoring module 706 configured to score the predicted answer according to the sample answers in the question-answer sample pair by using a reward model associated with the initial question-answer model, so as to obtain an optimized score;
a tuning module 708 configured to tune the initial question-answer model based on the optimization score until a target question-answer model satisfying a training stop condition is obtained.
In an alternative embodiment, the extraction module 702 is further configured to:
Determining a problem type corresponding to the sample problem, and selecting candidate thinking chain data matched with the problem type in a target database as thinking chain data;
wherein the update module 704 is further configured to: and determining a word unit to be changed, which is matched with the thinking chain data, in the sample problem, and updating the word unit to be changed by utilizing the thinking chain data to obtain a target sample problem.
In an alternative embodiment, the scoring module 706 is further configured to:
loading a reward model associated with the initial question-answer model, and splicing the target sample questions and the predicted answers into an optimized text; and extracting sample answers from the question-answer sample pair, and scoring the optimized text according to the sample answers by utilizing the reward model to obtain an optimized score.
In an alternative embodiment, the parameter tuning module 708 is further configured to:
Constructing a model optimization strategy for the initial question-answer model according to the optimization score and the sample answer; adjusting the parameters of the initial question-answer model according to the model optimization strategy, and detecting whether the parameter-adjusted initial question-answer model satisfies the training stop condition; if not, taking the parameter-adjusted initial question-answer model as the initial question-answer model and returning to the step of extracting a sample question from the question-answer sample pair; if yes, taking the parameter-adjusted initial question-answer model as the target question-answer model.
In an alternative embodiment, the apparatus further comprises:
A reward model training module configured to acquire a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample; input the reward sample into an initial reward model for scoring to obtain a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the reward sample; and calculate a loss value according to the score vector sample sequence and the score vector prediction sequence, and tune the initial reward model based on the loss value until a reward model satisfying the reward training stop condition is obtained.
In an alternative embodiment, the reward model training module is further configured to:
Determining a target loss function preset by the initial rewarding model, wherein the target loss function comprises a vector constraint item for constraining a target score vector; and calculating the score vector sample sequence and the score vector prediction sequence by using the target loss function to obtain a loss value, and executing a step of tuning the initial rewarding model based on the loss value.
In an alternative embodiment, the reward model training module is further configured to:
Determining an initial loss function preset by the initial rewarding model; updating the initial loss function in response to a model optimization request submitted for the initial rewards model, wherein the model optimization request is used for adding semantic learning information in the initial loss function; and determining a target loss function according to the updating result, and executing the step of determining the target loss function preset by the initial rewarding model.
In order that the reward model can accurately reflect, during question-answer model training, the prediction accuracy achieved by the question-answer model on the current sample, the question-answer model training device provided by this embodiment may first extract a sample question from the question-answer sample pair and determine the thinking chain data corresponding to the sample question; on this basis, so that short questions and answers can still provide effective reinforcement for training the question-answer model, the sample question may be updated into a target sample question using the thinking chain data, so that more text content is embodied in the target sample question, and the target sample question may then be input into the initial question-answer model for processing to obtain a predicted answer; because the sample question has been updated into the target sample question, the reward model can give a score that accurately reflects the model's prediction accuracy, that is, the reward model associated with the initial question-answer model scores the predicted answer against the sample answer in the question-answer sample pair to obtain an optimization score; finally, the initial question-answer model is parameter-tuned based on the optimization score until a target question-answer model satisfying the training stop condition is obtained. In the reinforcement learning stage of the question-answer model, the sample question is updated through thinking chain data so that it carries richer text information; after the question-answer model makes its prediction, the reward model can therefore give a more accurate score for that sample, the scoring result accurately reflects the degree to which the question-answer model needs to be optimized, and subsequent training on this basis can effectively improve model training accuracy, so that downstream business use can be satisfied in the application stage.
The above is a schematic scheme of a question-answering model training device of this embodiment. It should be noted that, the technical solution of the question-answering model training device and the technical solution of the question-answering model training method belong to the same concept, and details of the technical solution of the question-answering model training device which are not described in detail can be referred to the description of the technical solution of the question-answering model training method.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of a text processing device, and fig. 8 shows a schematic structural diagram of a text processing device provided in one embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
a receiving module 802 configured to receive the question text uploaded by the client;
A determining module 804, configured to determine problem domain information corresponding to the problem text, and select a target question-answering model matched with the problem domain information, where the target question-answering model is obtained through the question-answering model training method;
and a sending module 806, configured to input the question text into the target question-answer model for processing, obtain an answer text, and feed back the answer text to the client.
The above is an exemplary scheme of a text processing apparatus of the present embodiment. It should be noted that, the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept, and details of the technical solution of the text processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the text processing method.
Corresponding to the method embodiment, the present disclosure further provides an embodiment of a reward model training device, and fig. 9 shows a schematic structural diagram of the reward model training device provided in one embodiment of the present disclosure. As shown in fig. 9, the apparatus includes:
An obtaining sample module 902, configured to obtain a reward sample and a score vector sample sequence corresponding to the reward sample, where the score vector sample sequence includes a standard score vector corresponding to each word unit in the reward sample;
A scoring sample module 904 configured to input the bonus sample into an initial bonus model for scoring, and obtain a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the bonus sample;
A determining function module 906 configured to determine a target loss function preset by the initial reward model, where the target loss function includes a vector constraint term that constrains a target score vector;
And a parameter tuning model module 908, configured to calculate the score vector sample sequence and the score vector prediction sequence by using the target loss function, and tune the initial reward model according to the calculation result until obtaining a reward model meeting a reward training stop condition.
The above is an exemplary scheme of the bonus model training apparatus of the present embodiment. It should be noted that, the technical solution of the reward model training device and the technical solution of the reward model training method belong to the same concept, and details of the technical solution of the reward model training device, which are not described in detail, can be referred to the description of the technical solution of the reward model training method.
Fig. 10 illustrates a block diagram of a computing device 1000 provided in accordance with one embodiment of the present description. The components of the computing device 1000 include, but are not limited to, a memory 1010 and a processor 1020. Processor 1020 is coupled to memory 1010 via bus 1030 and database 1050 is used to store data.
Computing device 1000 also includes access device 1040, which access device 1040 enables computing device 1000 to communicate via one or more networks 1060. Examples of such networks include public switched telephone networks (PSTN, public Switched Telephone Network), local area networks (LAN, local Area Network), wide area networks (WAN, wide Area Network), personal area networks (PAN, personal Area Network), or combinations of communication networks such as the internet. The access device 1040 may include one or more of any type of network interface, wired or wireless, such as a network interface card (NIC, network interface controller), such as an IEEE802.11 wireless local area network (WLAN, wireless Local Area Network) wireless interface, a worldwide interoperability for microwave access (Wi-MAX, worldwide Interoperability for Microwave Access) interface, an ethernet interface, a universal serial bus (USB, universal Serial Bus) interface, a cellular network interface, a bluetooth interface, near Field Communication (NFC).
In one embodiment of the present description, the above-described components of computing device 1000, as well as other components not shown in FIG. 10, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 10 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1000 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, personal Computer). Computing device 1000 may also be a mobile or stationary server.
The processor 1020 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the question-answering model training method, the text processing method, or the reward model training method described above.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the question-answer model training method, the text processing method or the rewarding model training method belong to the same concept, and the details of the technical solution of the computing device, which are not described in detail, can be described by referring to the technical solution of the question-answer model training method, the text processing method or the rewarding model training method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the question-answer model training method, the text processing method, or the reward model training method described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the question-answer model training method, the text processing method or the rewarding model training method belong to the same concept, and the details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the question-answer model training method, the text processing method or the rewarding model training method.
An embodiment of the present disclosure also provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the question-answer model training method, the text processing method, or the reward model training method described above.
The foregoing is a schematic version of a computer program product of this embodiment. It should be noted that, the technical solution of the computer program product and the technical solution of the question-answer model training method, the text processing method or the rewarding model training method belong to the same concept, and the details of the technical solution of the computer program product, which are not described in detail, can be referred to the description of the technical solution of the question-answer model training method, the text processing method or the rewarding model training method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of patent practice in the relevant jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention.

Claims (13)

1. A question-answering model training method, comprising:
extracting sample questions from a question-answer sample pair, and determining thinking chain data corresponding to the sample questions;
Determining a word unit to be changed, which is matched with the thinking chain data, in the sample questions, updating the word unit to be changed by utilizing the thinking chain data to obtain target sample questions, and inputting the target sample questions into an initial question-answering model for processing to obtain predicted answers, wherein the predicted answers comprise predicted answer texts corresponding to the sample questions and answer thought texts corresponding to the sample questions;
Scoring the predicted answers according to sample answers in the question-answer sample pair by using a reward model associated with the initial question-answer model to obtain optimized scores;
Adjusting parameters of the initial question-answer model based on the optimization score until a target question-answer model meeting training stopping conditions is obtained;
Wherein training of the reward model comprises: obtaining a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample; inputting the reward sample into an initial reward model to score, and obtaining a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the reward sample; calculating a loss value of the score vector sample sequence and the score vector prediction sequence by using a preset target loss function, and adjusting parameters of the initial reward model based on the loss value until a reward model meeting a reward training stop condition is obtained; the target loss function comprises a vector constraint term for constraining a target score vector corresponding to the last word unit.
2. The method for training a question-answering model according to claim 1, wherein the determining of the thinking chain data corresponding to the sample question includes:
And determining a problem type corresponding to the sample problem, and selecting candidate thinking chain data matched with the problem type in a target database as the thinking chain data.
3. The method for training a question-answer model according to claim 1, wherein said scoring said predicted answer according to the sample answer in said question-answer sample pair using a reward model associated with said initial question-answer model to obtain an optimized score, comprising:
Loading a reward model associated with the initial question-answer model, and splicing the target sample questions and the predicted answers into an optimized text;
and extracting sample answers from the question-answer sample pair, and scoring the optimized text according to the sample answers by utilizing the reward model to obtain an optimized score.
4. The method for training a question-answering model according to claim 1, wherein the step of tuning the initial question-answering model based on the optimization score until a target question-answering model satisfying a training stop condition is obtained includes:
Constructing a model optimization strategy aiming at the initial question-answer model according to the optimization scores and the sample answers;
Adjusting parameters of the initial question-answer model according to the model optimization strategy, and detecting whether the initial question-answer model after parameter adjustment meets the training stop condition;
if not, taking the initial question-answering model after the parameter adjustment as the initial question-answering model, and executing the step of extracting sample questions in the question-answering sample pair;
if yes, taking the initial question-answering model after the parameter adjustment as a target question-answering model.
5. The question-answering model training method according to claim 1, further comprising:
Determining an initial loss function preset by the initial rewarding model;
updating the initial loss function in response to a model optimization request submitted for the initial rewards model, wherein the model optimization request is used for adding semantic learning information in the initial loss function;
And determining a target loss function according to the updating result, and executing the step of calculating a loss value for the score vector sample sequence and the score vector prediction sequence by using a preset target loss function.
6. A text processing method, comprising:
receiving a problem text uploaded by a client;
determining problem domain information corresponding to the problem text, and selecting a target question-answering model matched with the problem domain information, wherein the target question-answering model is obtained through the question-answering model training method according to any one of claims 1 to 5;
And inputting the question text to the target question-answering model for processing, obtaining an answer text, and feeding back the answer text to the client.
7. A method of rewarding model training comprising:
Obtaining a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample;
Inputting the reward sample into an initial reward model to score, and obtaining a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the reward sample;
Determining a target loss function preset by the initial rewarding model, wherein the target loss function comprises a vector constraint item for constraining a target score vector corresponding to the last word unit;
And calculating the score vector sample sequence and the score vector prediction sequence by using the target loss function, and adjusting parameters of the initial reward model according to a calculation result until a reward model meeting a reward training stop condition is obtained.
8. A question-answering model training device, comprising:
an extraction module configured to extract a sample question in a question-answer sample pair and determine thought chain data corresponding to the sample question;
The updating module is configured to determine a word unit to be changed, which is matched with the thinking chain data, in the sample questions, update the word unit to be changed by utilizing the thinking chain data to obtain target sample questions, and input the target sample questions into an initial question-answering model for processing to obtain predicted answers, wherein the predicted answers comprise predicted answer texts corresponding to the sample questions and answer thought texts corresponding to the sample questions;
the scoring module is configured to score the predicted answers according to sample answers in the question-answer sample pair by using a reward model associated with the initial question-answer model to obtain an optimized score;
the parameter tuning module is configured to tune the initial question-answer model based on the optimization score until a target question-answer model meeting training stop conditions is obtained;
Wherein training of the reward model comprises: obtaining a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample; inputting the reward sample into an initial reward model to score, and obtaining a score vector prediction sequence, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the reward sample; calculating a loss value of the score vector sample sequence and the score vector prediction sequence by using a preset target loss function, and adjusting parameters of the initial reward model based on the loss value until a reward model meeting a reward training stop condition is obtained; the target loss function comprises a vector constraint term for constraining a target score vector corresponding to the last word unit.
9. A text processing apparatus, comprising:
The receiving module is configured to receive the problem text uploaded by the client;
a determining module configured to determine problem domain information corresponding to the problem text and select a target question-answering model matched with the problem domain information, wherein the target question-answering model is obtained by the question-answering model training method according to any one of claims 1 to 5;
And the sending module is configured to input the question text into the target question-answer model for processing, obtain an answer text and feed the answer text back to the client.
10. A bonus model training apparatus, comprising:
The acquisition sample module is configured to acquire a reward sample and a score vector sample sequence corresponding to the reward sample, wherein the score vector sample sequence comprises a standard score vector corresponding to each word unit in the reward sample;
the scoring sample module is configured to input the bonus sample into an initial bonus model for scoring, and a score vector prediction sequence is obtained, wherein the score vector prediction sequence comprises a predicted score vector corresponding to each word unit in the bonus sample;
The determining function module is configured to determine a target loss function preset by the initial rewarding model, wherein the target loss function comprises a vector constraint item for constraining a target score vector corresponding to the last word unit;
And the parameter tuning model module is configured to calculate the score vector sample sequence and the score vector prediction sequence by using the target loss function, and tune the initial reward model according to a calculation result until a reward model meeting a reward training stop condition is obtained.
11. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer executable instructions, the processor being configured to execute the computer executable instructions, which when executed by the processor, implement the steps of the method of any one of claims 1 to 7.
12. A computer readable storage medium, characterized in that it stores computer executable instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
13. A computer program product comprising a computer program or instructions which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202410779372.2A 2024-06-17 2024-06-17 Question-answer model training method, text processing method and rewarding model training method Active CN118350463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410779372.2A CN118350463B (en) 2024-06-17 2024-06-17 Question-answer model training method, text processing method and rewarding model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410779372.2A CN118350463B (en) 2024-06-17 2024-06-17 Question-answer model training method, text processing method and rewarding model training method

Publications (2)

Publication Number Publication Date
CN118350463A CN118350463A (en) 2024-07-16
CN118350463B true CN118350463B (en) 2024-09-27

Family

ID=91821634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410779372.2A Active CN118350463B (en) 2024-06-17 2024-06-17 Question-answer model training method, text processing method and rewarding model training method

Country Status (1)

Country Link
CN (1) CN118350463B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118657222A (en) * 2024-08-20 2024-09-17 腾讯科技(深圳)有限公司 Data question-answering method, data question-answering model training method, device and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786070A (en) * 2023-12-15 2024-03-29 广州云趣信息科技有限公司 Customer service question-answering model training method, question-answering method, system, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7149688B2 (en) * 2002-11-04 2006-12-12 Speechworks International, Inc. Multi-lingual speech recognition with cross-language context modeling
CN111639163A (en) * 2020-04-29 2020-09-08 深圳壹账通智能科技有限公司 Problem generation model training method, problem generation method and related equipment
CN113204611A (en) * 2021-04-06 2021-08-03 北京百度网讯科技有限公司 Method for establishing reading understanding model, reading understanding method and corresponding device
CN114003706A (en) * 2021-07-13 2022-02-01 北京金山数字娱乐科技有限公司 Keyword combination generation model training method and device
CN118153659A (en) * 2024-03-12 2024-06-07 北京达佳互联信息技术有限公司 Question-answering model training method and device, electronic equipment, storage medium and product

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786070A (en) * 2023-12-15 2024-03-29 广州云趣信息科技有限公司 Customer service question-answering model training method, question-answering method, system, equipment and medium

Also Published As

Publication number Publication date
CN118350463A (en) 2024-07-16

Similar Documents

Publication Publication Date Title
CN110188358B (en) Training method and device for natural language processing model
EP4024274A1 (en) Image description method and apparatus, computing device, and storage medium
CN118350463B (en) Question-answer model training method, text processing method and rewarding model training method
CN107798624B (en) Technical label recommendation method in software question-and-answer community
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN116579339B (en) Task execution method and optimization task execution method
CN115392259B (en) Microblog text sentiment analysis method and system based on confrontation training fusion BERT
CN111666376B (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN116450796A (en) Intelligent question-answering model construction method and device
CN118093834B (en) AIGC large model-based language processing question-answering system and method
CN116501858B (en) Text processing and data query method
CN117875395A (en) Training method, device and storage medium of multi-mode pre-training model
Liu et al. Cross-domain slot filling as machine reading comprehension: A new perspective
CN116975288A (en) Text processing method and text processing model training method
CN115391520A (en) Text emotion classification method, system, device and computer medium
CN117291185A (en) Task processing method, entity identification method and task processing data processing method
CN117573842A (en) Document retrieval method and automatic question-answering method
KR20240128104A (en) Generating output sequences with inline evidence using language model neural networks
CN116186220A (en) Information retrieval method, question and answer processing method, information retrieval device and system
CN114003708B (en) Automatic question-answering method and device based on artificial intelligence, storage medium and server
CN113407664A (en) Semantic matching method, apparatus and medium
Kim et al. Two-Stream Network for Korean Natural Language Understanding.
CN116467500B (en) Data relation identification, automatic question-answer and query sentence generation method
CN118314585A (en) Component classification method and device, computing device and computer program product
CN118468862A (en) Question correction method, large language model training method and device applied to question correction

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 43rd Floor, 1888 Binxing Road, Binjiang District, Hangzhou City, Zhejiang Province 310056; Applicant after: HUNDSUN TECHNOLOGIES Inc.; Country or region after: China
Address before: 11, building 310053, Hang Seng tower, 3588 Jiangnan Avenue, Hangzhou, Zhejiang, Binjiang District; Applicant before: HUNDSUN TECHNOLOGIES Inc.; Country or region before: China
GR01 Patent grant