CN114490926A - Method and device for determining similar questions, storage medium and terminal
- Publication number
- CN114490926A CN114490926A CN202111668984.7A CN202111668984A CN114490926A CN 114490926 A CN114490926 A CN 114490926A CN 202111668984 A CN202111668984 A CN 202111668984A CN 114490926 A CN114490926 A CN 114490926A
- Authority
- CN
- China
- Prior art keywords
- covariance matrix
- vector
- target
- text
- semantic model
- Prior art date
- 2021-12-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Human Computer Interaction (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method, an apparatus, a storage medium and a terminal for determining similar questions. The method comprises the following steps: receiving a target question text to be processed; inputting the target question text into a pre-trained semantic model and outputting a target vector corresponding to the target question text; calculating a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector; and determining the similar questions corresponding to the target question text based on the covariance matrix transformation vector. By converting the question text into a sentence vector and applying a covariance matrix transformation with pre-generated parameters, the application ensures the isotropy of the sentence vector, that is, the sentence vector is not distorted by other influencing factors, which improves the accuracy of similar-question recommendation.
Description
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a storage medium and a terminal for determining similar questions.
Background
In building question-answering systems, the corpus is the core asset. Only with sufficient corpora can a good model be trained so that it recognizes the questions belonging to its domain. For question-answering tasks, the quantity of the corpus matters especially: more corpora make the product more intelligent and able to answer the varied, unexpected questions users pose. The quantity and quality of the question-answer corpus therefore have a decisive influence on the end-to-end performance of a question-answering system and on user experience. Here, quantity means the corpus is large enough, and quality means the corpus is well curated and covers the many ways users phrase their questions.
In the prior art, related questions are generally recommended to a user by retrieval-based recall, that is, retrieval, recall and recommendation through a search engine. For example, when a user enters a question-answer pair, the database is typically searched to find which similar questions it contains, and these may be recommended to enterprise users. Most existing similar-question recommendation systems do not convert questions into sentence vectors but process them only at the level of keywords and the like; even when questions are converted into sentence vectors, no subsequent processing is performed to ensure the isotropy of the sentence vectors, so the vectors drift with other influencing factors, which seriously degrades the accuracy of similar-question recommendation.
Disclosure of Invention
The embodiments of the present application provide a method and apparatus for determining similar questions, a storage medium and a terminal. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended neither to identify key or critical elements nor to delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description presented later.
In a first aspect, an embodiment of the present application provides a method for determining similar questions, the method comprising:
receiving a target question text to be processed;
inputting the target question text into a pre-trained semantic model, and outputting a target vector corresponding to the target question text;
calculating a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector;
and determining the similar questions corresponding to the target question text based on the covariance matrix transformation vector.
Optionally, the pre-trained semantic model is generated according to the following steps:
obtaining a BERT network, and initializing the weights of the BERT network to obtain a semantic model;
acquiring an unlabeled data set and a question text library, and pre-training the semantic model according to the unlabeled data set and the question text library to obtain a pre-trained semantic model;
constructing a positive sample and negative samples for each question text in the question text library, and generating a plurality of training samples;
inputting each training sample into the pre-trained semantic model, and outputting a plurality of sample parameter vectors;
calculating a loss value from the plurality of sample parameter vectors;
and generating the pre-trained semantic model when the loss value reaches a preset threshold.
Optionally, pre-training the semantic model according to the unlabeled data set and the question text library to obtain a pre-trained semantic model comprises:
performing word segmentation on each unlabeled datum in the unlabeled data set to obtain a sub-word sequence of each unlabeled datum;
inputting the unlabeled data set into a preset word2vec network for negative-sampling training, and outputting a word vector for each word;
calculating the cosine similarity between each sub-word in the sub-word sequence and the word vector of each word, and determining a similarity set for each sub-word according to the cosine similarity;
replacing words in the sub-word sequence corresponding to each sub-word according to the similarity set of that sub-word, to obtain final unlabeled data;
inputting the final unlabeled data and all question sentences in the question text library into the semantic model for training, and obtaining an initial semantic model after training finishes;
and randomly combining each unlabeled datum in the unlabeled data set with all question texts in the question text library, inputting the combination into the initial semantic model for training, and obtaining the pre-trained semantic model after training finishes.
Optionally, the pre-trained semantic model comprises a BERT network, a GRU network and a pooling layer;
and inputting each training sample into the pre-trained semantic model and outputting a plurality of sample parameter vectors comprises:
calculating a final vector for each parameter in each training sample;
inputting the final vector of each parameter into the BERT network, the GRU network and the pooling layer in sequence, outputting each sample parameter vector, and generating a plurality of sample parameter vectors.
Optionally, the pre-generated covariance matrix parameters are obtained according to the following steps:
inputting all question sentences in the question text library into the pre-trained semantic model, and outputting a set of sentence vectors;
transforming each sentence vector in the set according to a preset covariance matrix transformation formula to obtain a transformed data covariance matrix;
solving the transformed data covariance matrix to obtain a first solution parameter μ and a second solution parameter W;
and determining the first solution parameter μ and the second solution parameter W as the pre-generated covariance matrix parameters.
Optionally, the method further comprises:
calculating a covariance matrix transformation result corresponding to each sentence vector in the set according to the pre-generated covariance matrix parameters;
and storing the covariance matrix transformation result corresponding to each sentence vector in a database to obtain a covariance matrix transformation result set of the question bank.
Optionally, determining the similar questions corresponding to the target question text based on the covariance matrix transformation vector comprises:
evenly distributing the covariance matrix transformation result set of the question bank across a plurality of preset service nodes;
calculating the cosine similarity between the covariance matrix transformation vector and the covariance matrix transformation results on each service node, and generating a plurality of cosine similarities corresponding to each service node;
sorting the cosine similarities corresponding to each service node, and extracting a preset number of them to obtain an initial similarity set;
sorting the similarities in the initial similarity set, and extracting a preset number of them to obtain a plurality of target similarities;
and determining the question texts corresponding to the target similarities as the similar questions corresponding to the target question text.
In a second aspect, an embodiment of the present application provides an apparatus for determining similar questions, the apparatus comprising:
a question text receiving module, configured to receive a target question text to be processed;
a question text input module, configured to input the target question text into a pre-trained semantic model and output a target vector corresponding to the target question text;
a covariance matrix transformation vector calculation module, configured to calculate a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector;
and a similar question determination module, configured to determine the similar questions corresponding to the target question text based on the covariance matrix transformation vector.
In a third aspect, an embodiment of the present application provides a computer storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the above method steps.
In a fourth aspect, an embodiment of the present application provides a terminal, which may include a processor and a memory, wherein the memory stores a computer program adapted to be loaded by the processor to perform the above method steps.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in the embodiment of the application, an apparatus for determining similar questions first receives a target question text to be processed; it then inputs the target question text into a pre-trained semantic model and outputs a target vector corresponding to the target question text, calculates a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector, and finally determines the similar questions corresponding to the target question text based on the covariance matrix transformation vector. By converting the question text into a sentence vector and applying a covariance matrix transformation with pre-generated parameters, the application ensures the isotropy of the sentence vector, that is, the sentence vector is not distorted by other influencing factors, which improves the accuracy of similar-question recommendation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic flowchart of a method for determining similar questions according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a semantic model training method according to an embodiment of the present disclosure;
FIG. 3 is a diagram of a semantic model according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an apparatus for determining similar questions according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance; the specific meanings of these terms in the present invention can be understood by those skilled in the art according to the specific context. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes an association relationship between associated objects and means that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The application provides a method, an apparatus, a storage medium and a terminal for determining similar questions, so as to solve the above problems in the related art. In the technical solution provided by the application, the question text is converted into a sentence vector and a covariance matrix transformation with pre-generated parameters is applied to it, which ensures the isotropy of the sentence vector, that is, the sentence vector is not distorted by other influencing factors, thereby improving the accuracy of similar-question recommendation.
The method for determining similar questions provided by the embodiments of the present application will be described in detail below with reference to fig. 1 to 3. The method may be implemented by a computer program running on a similar-question determination device based on the von Neumann architecture. The computer program may be integrated into an application or run as a separate tool-type application.
Referring to fig. 1, a schematic flowchart of a method for determining similar questions is provided in an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application may include the following steps:
S101, receiving a target question text to be processed;
here, text refers to a representation of written language, usually a sentence or a combination of sentences with a complete, systematic meaning; a text may be a sentence, a paragraph or a chapter.
In general, a text is a word composed of several characters, a sentence composed of several words, or a paragraph composed of several sentences. A user can describe an idea in language text, and such a description turns a complicated idea into instructions that are easy for others to understand. Different modes of expression can make complex ideas accessible and communication easier to follow. One or more natural-language units contained in the text may be called sentences, or the text may be split into sentences by its punctuation, that is, content ending with a period, question mark, exclamation mark, comma and the like is taken as one sentence.
The target question text is the language text input by the user at the user terminal. It may be generated by the user editing text in the terminal's text-editing software, or from voice information recorded with the terminal's voice-recording software; the target question text can be produced in many ways, which are not limited here.
In one possible implementation, the user opens the chat interface by tapping an application with the question-answering system installed on the user terminal, then taps the text input box to bring up a text editor, and types the idea to be expressed into the box as a textual description; the user terminal generates the target question text from this operation. Note that there are many ways to obtain the target question text to be understood, which are likewise not limited here.
S102, inputting the target question text into a pre-trained semantic model, and outputting a target vector corresponding to the target question text;
the pre-trained semantic model is a mathematical model that converts the question text input by the user into a vector. The model comprises a BERT network, a GRU network and a pooling layer.
In the embodiment of the application, the pre-trained semantic model is generated as follows: a BERT network is first obtained and its weights are initialized to obtain a semantic model; an unlabeled data set and a question text library are then acquired, and the semantic model is pre-trained according to them to obtain a pre-trained semantic model; a positive sample and negative samples are constructed for each question text in the question text library to generate a plurality of training samples; each training sample is input into the pre-trained semantic model to output a plurality of sample parameter vectors; a loss value is calculated from the sample parameter vectors; and finally, the pre-trained semantic model is generated when the loss value reaches a preset threshold.
In one possible implementation, after the target question text to be processed is received, it may be input into the pre-trained semantic model; after processing by the model's BERT network, GRU network and pooling layer, the target vector corresponding to the target question text is output.
S103, calculating a covariance matrix transformation vector corresponding to the target question text according to the pre-generated covariance matrix parameters and the target vector;
in one possible implementation, the pre-generated covariance matrix parameters are the first solution parameter μ and the second solution parameter W. Once μ and W are available, the covariance matrix transformation vector corresponding to the target question text is calculated from the target vector $x_q$ according to the preset covariance matrix transformation formula $\tilde{x}_q = (x_q - \mu)W$ (see the derivation of μ and W below).
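As a minimal sketch of this query-time step, assuming the transformation formula $\tilde{x}_q = (x_q - \mu)W$ above and NumPy arrays for the pre-generated parameters (the function name is illustrative, not from the patent):

```python
import numpy as np

def cov_transform(x_q: np.ndarray, mu: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply the covariance matrix transformation to the target vector.

    x_q: target vector output by the semantic model, shape (d,)
    mu:  first solution parameter (mean of the question-bank vectors), shape (d,)
    w:   second solution parameter (whitening matrix), shape (d, d)
    """
    return (x_q - mu) @ w
```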
And S104, determining the similar questions corresponding to the target question text based on the covariance matrix transformation vector.
In the embodiment of the application, after $\tilde{x}_q$ is obtained, scores between $\tilde{x}_q$ and the covariance matrix transformation result set $\{\tilde{x}_i\}$ of the question bank are calculated by cosine similarity, and the most similar questions are recommended to the user.
In one possible implementation, the covariance matrix transformation result set of the question bank is first evenly distributed across a plurality of preset service nodes. The cosine similarity between the covariance matrix transformation vector and the covariance matrix transformation results on each service node is then calculated, generating a plurality of cosine similarities for each service node. The cosine similarities on each service node are sorted and a preset number of them are extracted to obtain an initial similarity set. Next, the similarities in the initial similarity set are sorted and a preset number of them are extracted to obtain a plurality of target similarities. Finally, the question texts corresponding to the target similarities are determined as the similar questions corresponding to the target question text.
For example, $\{\tilde{x}_i\}$ is distributed over N service nodes master, node1, node2, node3, ..., node(N-1). Assuming that K similar questions need to be recommended to the user, each node extracts its Top-K results by cosine similarity. The master service node then merges the Top-K results of all N nodes and sorts them by similarity to obtain the final Top-K results, which are recommended to the user; a sketch of this two-stage recall follows.
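The sketch below illustrates the per-node Top-K scoring and the merge on the master node; the function names and in-memory representation are assumptions for illustration, not an API from the patent:

```python
import numpy as np

def topk_on_node(x_tilde: np.ndarray, node_vectors: np.ndarray, node_ids: list, k: int):
    """Per-node recall: Top-K question ids by cosine similarity to the query vector."""
    q = x_tilde / np.linalg.norm(x_tilde)
    v = node_vectors / np.linalg.norm(node_vectors, axis=1, keepdims=True)
    sims = v @ q                                   # cosine similarity to every stored vector
    top = np.argsort(-sims)[:k]
    return [(node_ids[i], float(sims[i])) for i in top]

def merge_on_master(per_node_topk: list, k: int):
    """Master node: merge all nodes' Top-K lists and re-sort for the final Top-K."""
    merged = [hit for node_hits in per_node_topk for hit in node_hits]
    return sorted(merged, key=lambda hit: -hit[1])[:k]
```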
In another possible implementation, a ball tree is constructed over $\{\tilde{x}_i\}$, specifically as follows: 1) first construct a hypersphere, namely the smallest sphere that can contain all samples; 2) select the point in the sphere farthest from the center of the sphere, then select a second point farthest from the first point, assign every point in the sphere to the closer of these two cluster centers, and then compute the center of each cluster and the minimum radius the cluster needs to contain all of its data points; this yields two sub-hyperspheres, corresponding to the left and right subtrees of a KD tree; 3) recursively perform step 2) on the two sub-hyperspheres to obtain the ball tree. When each node computes its Top-K results, searching within the spheres constructed above greatly increases the calculation speed.
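Rather than hand-rolling the recursive construction, an off-the-shelf ball tree can stand in for it. The sketch below uses scikit-learn's BallTree with placeholder data; it relies on the fact that Euclidean and cosine rankings coincide on L2-normalized vectors, an implementation choice assumed here rather than stated in the patent:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Placeholder for this node's share of the covariance-transformed sentence vectors.
vectors = np.random.randn(100_000, 768).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit norm -> cosine order

tree = BallTree(vectors, leaf_size=40)    # recursive sphere construction as in steps 1)-3)

query = np.random.randn(1, 768).astype(np.float32)
query /= np.linalg.norm(query)
dist, idx = tree.query(query, k=10)       # Top-K search pruned by the nested spheres
```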
It should be noted that both of the above approaches greatly shorten the response time of similar-question recommendation, so the system retains good response times even with massive data, which greatly improves usability in a real engineering environment and improves the user experience.
In the embodiment of the application, the apparatus for determining similar questions first receives a target question text to be processed, then inputs the target question text into a pre-trained semantic model and outputs a target vector corresponding to it, next calculates a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector, and finally determines the similar questions corresponding to the target question text based on the covariance matrix transformation vector. By converting the question text into a sentence vector and applying a covariance matrix transformation with pre-generated parameters, the application ensures the isotropy of the sentence vector, that is, the sentence vector is not distorted by other influencing factors, which improves the accuracy of similar-question recommendation.
Referring to fig. 2, a schematic flow chart of a semantic model training method is provided in the embodiment of the present application.
As shown in fig. 2, the method of the embodiment of the present application may include the following steps:
s201, acquiring a bert network, and initializing the weight of the bert network to obtain a semantic model;
in one possible implementation, the model framework of the bert network is obtained first, then the weight of the model is initialized, and the initialization is initialized by using the weight of the bert-base-chip model of google.
S202, acquiring an unlabeled data set and a question text library, and pre-training the semantic model according to the unlabeled data set and the question text library to obtain a pre-trained semantic model;
in the embodiment of the application, an unlabeled data set and a question text library are first acquired. Each unlabeled datum in the unlabeled data set is segmented to obtain its sub-word sequence. The unlabeled data set is then input into a preset word2vec network for negative-sampling training, and a word vector is output for each word. The cosine similarity between each sub-word in the sub-word sequence and the word vector of each word is calculated, and a similarity set is determined for each sub-word from these cosine similarities. Words in the sub-word sequence corresponding to each sub-word are replaced according to that sub-word's similarity set to obtain the final unlabeled data. The final unlabeled data and all question sentences in the question text library are then input into the semantic model for training, and an initial semantic model is obtained when training finishes. Finally, each unlabeled datum in the unlabeled data set is randomly combined with all question texts in the question text library and input into the initial semantic model for training; the pre-trained semantic model is obtained when training finishes.
In one possible implementation, a large amount of in-domain unlabeled data is collected for the current domain of the question recommendation system to obtain an unlabeled data set $\{x_{domain}\}$. Each unlabeled datum $\{x_{domain}\}_i$ is segmented with a CRF algorithm, and after segmentation each unlabeled datum has a sub-word sequence $\{piece_1, piece_2, piece_3, \ldots, piece_M\}$. Candidate tokens of the sub-word sequence are masked with an N-Gram masking strategy, where the percentages for masking one to four consecutive pieces are set to 45%, 35%, 15% and 5% respectively.
When the sub-word sequence is masked, similar sub-words are used as masks. The word2vec algorithm is first applied to the unlabeled data with negative-sampling training to obtain a word vector for each word; the cosine similarity between each piece and the word vector of each word is calculated to obtain a similarity set $\{p_i\}$, and the first quartile of $\{p_i\}$ after ascending sorting is taken as the gate value M.
Specifically, when a sub-word is masked, let $P_{sim}$ be the similarity of the most similar word obtained by the cosine similarity calculation. When $P_{sim} > M$, the piece is masked by substituting that most similar word; when $P_{sim} < M$, it is masked by random substitution. Pieces in an input sentence are masked with a probability of 15%, of which 90% are replaced by similar words in the above manner, 5% are replaced by random words, and the rest are kept unchanged. The final unlabeled data obtained after replacement is input into the semantic model for pre-training, with epoch set to 80. After this pre-training finishes, all sentences of the question bank $\{q_i\}$ are input into the pre-trained model to continue pre-training in the same manner as the in-domain pre-training, with epoch set to 80. After that, the unlabeled data set $\{x_{domain}\}$ and all sentences of the question bank $\{q_i\}$ are shuffled together and input into the model to pre-train again in the same way, with epoch set to 100. After pre-training finishes in the above manner, the pre-trained semantic model is obtained.
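A minimal sketch of the similarity-based masking step, using gensim's word2vec in negative-sampling mode; the toy corpus, hyperparameters and helper names are illustrative assumptions, and the gate-value computation is a rough stand-in for the quartile rule described above:

```python
import random
import numpy as np
from gensim.models import Word2Vec

# Toy stand-in for the segmented in-domain unlabeled data.
corpus = [
    ["how", "to", "reset", "my", "password"],
    ["how", "do", "i", "change", "my", "password"],
    ["where", "to", "download", "the", "invoice"],
]

# negative > 0 selects negative-sampling training, as described above.
w2v = Word2Vec(sentences=corpus, vector_size=64, negative=5, sg=1, min_count=1)

# Gate value M: first quartile of the ascending-sorted similarity set.
p_set = [w2v.wv.most_similar(w, topn=1)[0][1] for w in w2v.wv.index_to_key]
gate_m = float(np.percentile(sorted(p_set), 25))

def mask_piece(piece: str) -> str:
    """Similar-word substitution when P_sim > M, random substitution otherwise."""
    most_similar, p_sim = w2v.wv.most_similar(piece, topn=1)[0]
    if p_sim > gate_m:
        return most_similar
    return random.choice(w2v.wv.index_to_key)

# N-Gram span length: mask 1-4 consecutive pieces with probability 45/35/15/5.
span = random.choices([1, 2, 3, 4], weights=[45, 35, 15, 5])[0]
```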
S203, constructing a positive sample and negative samples for each question text in the question text library, and generating a plurality of training samples;
in one possible implementation, for all questions $\{q_1, q_2, \ldots, q_N\}$ in the question bank, one positive sample $positive_i$ and two negative samples $negative1_i$ and $negative2_i$ are selected for each question $q_i$. The positive and negative samples are screened from the question bank, or constructed manually where necessary; finally, N training samples are obtained, each of which can be expressed as $(q_i, positive_i, negative1_i, negative2_i)$.
S204, inputting each training sample into the pre-trained semantic model, and outputting a plurality of sample parameter vectors;
the pre-trained semantic model comprises a BERT network, a GRU network and a pooling layer.
In the embodiment of the application, the final vector of each parameter in each training sample is calculated first; the final vector of each parameter is then input into the BERT network, the GRU network and the pooling layer in sequence, each sample parameter vector is output, and finally a plurality of sample parameter vectors are generated.
Specifically, when calculating the final vector of each parameter in each training sample, each parameter is first passed through token embedding, position embedding and segment embedding (embedding denotes vectorization) to obtain several embedding results, and these embedding results are summed to obtain the final embedding, that is, the final vector of that parameter in the training sample.
For example, when computing the final embedding of $q_i$ in the training sample $(q_i, positive_i, negative1_i, negative2_i)$, the token, position and segment embeddings are combined as $Embedding(q_i) = TokenEmbedding(q_i) + PosEmbedding(q_i) + SegEmbedding(q_i)$. As shown in fig. 3, the final vectors of the parameters in each training sample, embedding-q, embedding-positive, embedding-negative1 and embedding-negative2, are each input in turn into the BERT network, the GRU network and the pooling layer, and the corresponding sample parameter vectors out-q, out-positive, out-negative1 and out-negative2 are output.
For example, when generating out-q, embedding-q is input into BERT to obtain its hidden representation, which is further encoded by the GRU; the GRU outputs are then average-pooled to obtain the final output out-q.
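A sketch of this BERT, GRU and average-pooling tower in PyTorch; the hidden size and class name are assumptions, and the bert-base-chinese checkpoint follows the initialization mentioned above:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SemanticTower(nn.Module):
    """BERT hidden states -> GRU encoding -> masked average pooling."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        # token + position + segment embeddings are summed inside BertModel
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.gru = nn.GRU(input_size=768, hidden_size=hidden, batch_first=True)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.gru(h)                        # further encode BERT's hidden states
        mask = attention_mask.unsqueeze(-1).float()
        return (h * mask).sum(1) / mask.sum(1)    # average pooling -> out-q

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
batch = tokenizer(["如何重置密码"], return_tensors="pt", padding=True)
out_q = SemanticTower()(batch["input_ids"], batch["attention_mask"])  # shape (1, 768)
```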
S205, calculating a loss value according to the plurality of sample parameter vectors;
in one possible implementation, as shown in fig. 3, after out-q, out-positive, out-negative1 and out-negative2 are obtained, the loss is calculated using the proposed quaternary contrastive loss function.
Specifically, the loss is computed over the quadruplet $(q_i, positive_i, negative1_i, negative2_i)$: let $v_q$, $v_{pos}$, $v_{neg1}$ and $v_{neg2}$ denote the respective output vectors after the model, i.e., out-q, out-positive, out-negative1 and out-negative2; $\epsilon$ denotes the margin, with size 1; $\|\cdot\|$ denotes the distance metric, for which the Euclidean distance is used; and $\alpha$ denotes a scaling coefficient with value range 0 to 1.
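The exact formula is not reproduced in the text, so the sketch below is one plausible quadruplet form assembled from the stated ingredients (Euclidean distance, margin ε = 1, scaling coefficient α), not the patent's definitive loss:

```python
import torch.nn.functional as F

def quaternary_contrastive_loss(v_q, v_pos, v_neg1, v_neg2, margin=1.0, alpha=0.5):
    """Assumed form: pull the positive closer than each negative by the margin,
    with the second negative term scaled by alpha."""
    d_pos = F.pairwise_distance(v_q, v_pos)
    d_neg1 = F.pairwise_distance(v_q, v_neg1)
    d_neg2 = F.pairwise_distance(v_q, v_neg2)
    loss = F.relu(d_pos - d_neg1 + margin) + alpha * F.relu(d_pos - d_neg2 + margin)
    return loss.mean()
```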
It should be noted that the model is first pre-trained with a new pre-training scheme, and the multi-tower structure based on the pre-trained model shown in fig. 3 is then trained with the proposed quaternary contrastive loss function, which greatly improves the model's ability to accurately capture the deep semantic information of user questions and to mine latent relations between a user question and other questions.
And S206, generating a pre-trained semantic model when the loss value reaches a preset threshold value.
In one possible implementation, a pre-trained semantic model is generated when the loss value reaches a preset threshold.
In another possible implementation, when the loss value does not reach the preset threshold, the loss value is back-propagated to update the model parameters, and the step of inputting each training sample into the pre-trained semantic model is performed again, so that the model of fig. 3 continues to be trained.
Further, after the pre-trained semantic model is obtained, the pre-generated covariance matrix parameters can be obtained according to the following steps: first, all question sentences in the question text library are input into the pre-trained semantic model and a set of sentence vectors is output; then, each sentence vector in the set is transformed according to a preset covariance matrix transformation formula to obtain the covariance matrix of the transformed data; the transformed data covariance matrix is then solved to obtain the first solution parameter μ and the second solution parameter W; finally, μ and W are determined as the pre-generated covariance matrix parameters.
Further, the covariance matrix transformation result corresponding to each sentence vector in the set is calculated according to the pre-generated covariance matrix parameters, and each result is stored in a database to obtain the covariance matrix transformation result set of the question bank.
Specifically, all question sentences $\{q_i\}$ in the question bank are passed through the BERT, GRU and pooling layers of the pre-trained semantic model, and the output sentence vectors are denoted $\{x_i\}_{i=1}^{N}$. The transformation formula
$$\tilde{x}_i = (x_i - \mu)W$$
is applied to $\{x_i\}$ so that the mean of $\{\tilde{x}_i\}$ is 0 and its covariance matrix is the identity matrix, where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$. The covariance matrix of the raw data is written as
$$\Sigma = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^{T}(x_i - \mu).$$
From $W^{T}\Sigma W = I$ it can be deduced that $\Sigma = (W^{T})^{-1}W^{-1} = (W^{-1})^{T}W^{-1}$. Since the covariance matrix $\Sigma$ is a positive semi-definite symmetric matrix, it has the SVD decomposition $\Sigma = U\Lambda U^{T}$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix whose diagonal elements are all positive; the solution is therefore completed directly by letting $W = U\Lambda^{-1/2}$.
After μ and W are obtained in the above manner, $\tilde{x}_i = (x_i - \mu)W$ is computed for each $x_i$; that is, all question sentences $\{q_i\}$ in the question bank pass through the cov-transform layer to produce their covariance matrix transformation results $\{\tilde{x}_i\}$, and these sentence vectors are saved in a database for use in the recall stage.
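A NumPy sketch of solving μ and W exactly as derived above; the small epsilon guarding near-zero eigenvalues is an implementation detail added here, not from the patent:

```python
import numpy as np

def fit_cov_transform(sentence_vectors: np.ndarray, eps: float = 1e-8):
    """Solve mu and W so that (x - mu) @ W has zero mean and identity covariance."""
    mu = sentence_vectors.mean(axis=0)
    centered = sentence_vectors - mu
    sigma = centered.T @ centered / len(sentence_vectors)   # covariance of the raw data
    u, lam, _ = np.linalg.svd(sigma)                        # Sigma = U Lambda U^T
    w = u @ np.diag((lam + eps) ** -0.5)                    # W = U Lambda^{-1/2}
    return mu, w

# The transformed question bank is then precomputed and stored for the recall stage:
# x_tilde = (sentence_vectors - mu) @ w
```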
In the embodiment of the application, the apparatus for determining similar questions first receives a target question text to be processed, then inputs the target question text into a pre-trained semantic model and outputs a target vector corresponding to it, next calculates a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector, and finally determines the similar questions corresponding to the target question text based on the covariance matrix transformation vector. By converting the question text into a sentence vector and applying a covariance matrix transformation with pre-generated parameters, the application ensures the isotropy of the sentence vector, that is, the sentence vector is not distorted by other influencing factors, which improves the accuracy of similar-question recommendation.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 4, a schematic structural diagram of an apparatus for determining similar questions according to an exemplary embodiment of the present invention is shown. The apparatus may be implemented as all or part of a terminal in software, hardware, or a combination of both. The apparatus 1 comprises a question text receiving module 10, a question text input module 20, a covariance matrix transformation vector calculation module 30 and a similar question determination module 40.
The question text receiving module 10 is configured to receive a target question text to be processed;
the question text input module 20 is configured to input the target question text into a pre-trained semantic model and output a target vector corresponding to the target question text;
the covariance matrix transformation vector calculation module 30 is configured to calculate a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector;
and the similar question determination module 40 is configured to determine the similar questions corresponding to the target question text based on the covariance matrix transformation vector.
It should be noted that, when the apparatus for determining similar questions provided in the above embodiments performs the method for determining similar questions, the division into the above functional modules is only an example; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments for determining similar questions provided above belong to the same concept; details of the implementation process are given in the method embodiments and are not repeated here.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the embodiment of the application, the apparatus for determining similar questions first receives a target question text to be processed, then inputs the target question text into a pre-trained semantic model and outputs a target vector corresponding to it, next calculates a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector, and finally determines the similar questions corresponding to the target question text based on the covariance matrix transformation vector. By converting the question text into a sentence vector and applying a covariance matrix transformation with pre-generated parameters, the application ensures the isotropy of the sentence vector, that is, the sentence vector is not distorted by other influencing factors, which improves the accuracy of similar-question recommendation.
The present invention also provides a computer-readable medium having stored thereon program instructions which, when executed by a processor, implement the method for determining similar questions provided by the above method embodiments.
The present invention also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for determining similar questions of the above method embodiments.
Please refer to fig. 5, which provides a schematic structural diagram of a terminal according to an embodiment of the present application. As shown in fig. 5, terminal 1000 can include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.
Wherein a communication bus 1002 is used to enable connective communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1005 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the above method embodiments, and the like; the data storage area may store the data referred to in the above method embodiments. Optionally, the memory 1005 may be at least one storage device located remotely from the processor 1001. As shown in fig. 5, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an application for determining similar questions.
In the terminal 1000 shown in fig. 5, the user interface 1003 is mainly used to provide an input interface for the user and acquire the data the user inputs, while the processor 1001 may be configured to invoke the application for determining similar questions stored in the memory 1005 and specifically perform the following operations:
receiving a target question text to be processed;
inputting the target question text into a pre-trained semantic model, and outputting a target vector corresponding to the target question text;
calculating a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector;
and determining the similar questions corresponding to the target question text based on the covariance matrix transformation vector.
In one embodiment, when generating the pre-trained semantic model, the processor 1001 specifically performs the following operations:
obtaining a BERT network, and initializing the weights of the BERT network to obtain a semantic model;
acquiring an unlabeled data set and a question text library, and pre-training the semantic model according to the unlabeled data set and the question text library to obtain a pre-trained semantic model;
constructing a positive sample and negative samples for each question text in the question text library, and generating a plurality of training samples;
inputting each training sample into the pre-trained semantic model, and outputting a plurality of sample parameter vectors;
calculating a loss value from the plurality of sample parameter vectors;
and generating the pre-trained semantic model when the loss value reaches a preset threshold.
In one embodiment, when pre-training the semantic model according to the unlabeled data set and the question text library to obtain a pre-trained semantic model, the processor 1001 specifically performs the following operations:
performing word segmentation on each unlabeled datum in the unlabeled data set to obtain a sub-word sequence of each unlabeled datum;
inputting the unlabeled data set into a preset word2vec network for negative-sampling training, and outputting a word vector for each word;
calculating the cosine similarity between each sub-word in the sub-word sequence and the word vector of each word, and determining a similarity set for each sub-word according to the cosine similarity;
replacing words in the sub-word sequence corresponding to each sub-word according to the similarity set of that sub-word, to obtain final unlabeled data;
inputting the final unlabeled data and all question sentences in the question text library into the semantic model for training, and obtaining an initial semantic model after training finishes;
and randomly combining each unlabeled datum in the unlabeled data set with all question texts in the question text library, inputting the combination into the initial semantic model for training, and obtaining the pre-trained semantic model after training finishes.
In one embodiment, when inputting each training sample into the pre-trained semantic model and outputting a plurality of sample parameter vectors, the processor 1001 specifically performs the following operations:
calculating a final vector for each parameter in each training sample;
inputting the final vector of each parameter into the BERT network, the GRU network and the pooling layer in sequence, and outputting each sample parameter vector;
and generating a plurality of sample parameter vectors.
In one embodiment, when generating the pre-generated covariance matrix parameters, the processor 1001 specifically performs the following operations:
inputting all question sentences in the question text library into the pre-trained semantic model, and outputting a set of sentence vectors;
transforming each sentence vector in the set according to a preset covariance matrix transformation formula to obtain a transformed data covariance matrix;
solving the transformed data covariance matrix to obtain a first solution parameter μ and a second solution parameter W;
and determining the first solution parameter μ and the second solution parameter W as the pre-generated covariance matrix parameters.
In one embodiment, the processor 1001 further performs the following operations:
calculating a covariance matrix transformation result corresponding to each sentence vector in the set according to the pre-generated covariance matrix parameters;
and storing the covariance matrix transformation result corresponding to each sentence vector in a database to obtain a covariance matrix transformation result set of the question bank.
In one embodiment, when determining the similar questions corresponding to the target question text based on the covariance matrix transformation vector, the processor 1001 specifically performs the following operations:
evenly distributing the covariance matrix transformation result set of the question bank across a plurality of preset service nodes;
calculating the cosine similarity between the covariance matrix transformation vector and the covariance matrix transformation results on each service node, and generating a plurality of cosine similarities corresponding to each service node;
sorting the cosine similarities corresponding to each service node, and extracting a preset number of them to obtain an initial similarity set;
sorting the similarities in the initial similarity set, and extracting a preset number of them to obtain a plurality of target similarities;
and determining the question texts corresponding to the target similarities as the similar questions corresponding to the target question text.
In the embodiment of the application, the apparatus for determining similar questions first receives a target question text to be processed, then inputs the target question text into a pre-trained semantic model and outputs a target vector corresponding to it, next calculates a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector, and finally determines the similar questions corresponding to the target question text based on the covariance matrix transformation vector. By converting the question text into a sentence vector and applying a covariance matrix transformation with pre-generated parameters, the application ensures the isotropy of the sentence vector, that is, the sentence vector is not distorted by other influencing factors, which improves the accuracy of similar-question recommendation.
It will be understood by those skilled in the art that all or part of the processes of the above-described method embodiments can be implemented by a computer program instructing related hardware. The program for determining similar questions can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit its scope; the present application is not limited thereto, and all equivalent variations and modifications made to it fall within its scope.
Claims (10)
1. A method for determining similar questions, the method comprising:
receiving a target question text to be processed;
inputting the target question text into a pre-trained semantic model, and outputting a target vector corresponding to the target question text;
calculating a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector;
and determining the similar questions corresponding to the target question text based on the covariance matrix transformation vector.
2. The method of claim 1, wherein generating the pre-trained semantic model comprises:
obtaining a BERT network, and initializing the weights of the BERT network to obtain a semantic model;
acquiring an unlabeled data set and a question text library, and pre-training the semantic model according to the unlabeled data set and the question text library to obtain a pre-trained semantic model;
constructing a positive sample and negative samples for each question text in the question text library, and generating a plurality of training samples;
inputting each training sample into the pre-trained semantic model, and outputting a plurality of sample parameter vectors;
calculating a loss value from the plurality of sample parameter vectors;
and generating the pre-trained semantic model when the loss value reaches a preset threshold.
3. The method of claim 2, wherein pre-training the semantic model according to the unlabeled data set and the question text library to obtain a pre-trained semantic model comprises:
performing word segmentation on each unlabeled datum in the unlabeled data set to obtain a sub-word sequence of each unlabeled datum;
inputting the unlabeled data set into a preset word2vec network for negative-sampling training, and outputting a word vector for each word;
calculating the cosine similarity between each sub-word in the sub-word sequence and the word vector of each word, and determining a similarity set for each sub-word according to the cosine similarity;
replacing words in the sub-word sequence corresponding to each sub-word according to the similarity set of that sub-word, to obtain final unlabeled data;
inputting the final unlabeled data and all question sentences in the question text library into the semantic model for training, and obtaining an initial semantic model after training finishes;
and randomly combining each unlabeled datum in the unlabeled data set with all question texts in the question text library, inputting the combination into the initial semantic model for training, and obtaining the pre-trained semantic model after training finishes.
4. The method of claim 2, wherein the pre-trained semantic model comprises a BERT network, a GRU network and a pooling layer;
and inputting each training sample into the pre-trained semantic model and outputting a plurality of sample parameter vectors comprises:
calculating a final vector for each parameter in each training sample;
inputting the final vector of each parameter into the BERT network, the GRU network and the pooling layer in sequence, and outputting each sample parameter vector;
and generating a plurality of sample parameter vectors.
5. The method of claim 2, wherein obtaining the pre-generated covariance matrix parameters comprises:
inputting all question sentences in the question text library into the pre-trained semantic model respectively, and outputting a sentence vector set;
transforming each sentence vector in the sentence vector set according to a preset covariance matrix transformation formula to obtain a transformed data covariance matrix;
solving the transformed data covariance matrix to obtain a first solving parameter mu and a second solving parameter W;
and determining the first solving parameter mu and the second solving parameter W as the pre-generated covariance matrix parameters.
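The claim's transformation formula is preset but not reproduced here. One plausible reading, sketched for illustration, follows the standard whitening recipe: mu is the mean of the sentence vectors and W is derived from an SVD of their covariance matrix so that the transformed vectors are decorrelated:

```python
import numpy as np

def compute_whitening_params(sentence_vecs: np.ndarray):
    """Derive mu and W from the sentence vector set so that the transformed
    vectors (x - mu) @ W have approximately an identity covariance matrix."""
    mu = sentence_vecs.mean(axis=0, keepdims=True)      # first solving parameter mu
    cov = np.cov((sentence_vecs - mu).T)                # data covariance matrix
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s))                   # second solving parameter W
    return mu, W
```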
6. The method of claim 5, further comprising:
calculating a covariance matrix transformation result corresponding to each sentence vector in the sentence vector set according to the pre-generated covariance matrix parameters;
and storing the covariance matrix transformation result corresponding to each sentence vector in a database to obtain a covariance matrix transformation result set for the question library.
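A short sketch of this offline step, with a plain dict standing in for the database, which the claim does not specify:

```python
def build_transform_result_set(bank_texts, bank_vecs, mu, W):
    """Pre-compute and store the covariance matrix transformation result
    of every sentence vector in the question library."""
    database = {}
    for text, vec in zip(bank_texts, bank_vecs):
        database[text] = (vec - mu) @ W
    return database
```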
7. The method of claim 6, wherein determining the similar question corresponding to the target question text based on the covariance matrix transformation vector comprises:
evenly distributing the covariance matrix transformation result set of the question library across a plurality of preset service nodes;
calculating the cosine similarity between the covariance matrix transformation vector and each covariance matrix transformation result on each service node, and generating a plurality of cosine similarities for each service node;
sorting the plurality of cosine similarities for each service node, and extracting a preset number of them to obtain an initial similarity set;
sorting the similarities in the initial similarity set, and extracting a preset number of cosine similarities to obtain a plurality of target similarities;
and determining the question texts corresponding to the plurality of target similarities as the similar questions corresponding to the target question text.
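A sketch of claim 7's two-stage top-k: each service node ranks its own shard locally, then the per-node candidates (the initial similarity set) are merged and re-sorted to yield the target similarities. In-process arrays stand in for the service nodes; `shards` is assumed to be a list of (global_offset, vector_array) pairs:

```python
import numpy as np

def node_top_k(query, shard_vecs, k):
    """One service node: cosine similarity of the query against its shard, local top-k."""
    sims = shard_vecs @ query / (
        np.linalg.norm(shard_vecs, axis=1) * np.linalg.norm(query))
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

def distributed_top_k(query, shards, k):
    """Merge each node's local candidates, then re-sort and keep the global top-k."""
    candidates = []
    for offset, shard_vecs in shards:        # shards were evenly distributed beforehand
        candidates += [(offset + i, s) for i, s in node_top_k(query, shard_vecs, k)]
    candidates.sort(key=lambda pair: -pair[1])
    return candidates[:k]
```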
8. An apparatus for determining similar questions, the apparatus comprising:
a question text receiving module, configured to receive a target question text to be processed;
a question text input module, configured to input the target question text into a pre-trained semantic model and output a target vector corresponding to the target question text;
a covariance matrix transformation vector calculation module, configured to calculate a covariance matrix transformation vector corresponding to the target question text according to pre-generated covariance matrix parameters and the target vector;
and a similar question determining module, configured to determine a similar question corresponding to the target question text based on the covariance matrix transformation vector.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to perform the method steps according to any of claims 1-7.
10. A terminal, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111668984.7A | 2021-12-30 | 2021-12-30 | Method and device for determining similar problems, storage medium and terminal
Publications (1)
Publication Number | Publication Date
---|---
CN114490926A (en) | 2022-05-13
Family
ID=81507499
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111668984.7A (Pending) | Method and device for determining similar problems, storage medium and terminal | 2021-12-30 | 2021-12-30
Country Status (1)
Country | Link
---|---
CN (1) | CN114490926A (en)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115019235A (en) * | 2022-06-15 | 2022-09-06 | 天津市国瑞数码安全系统股份有限公司 | Method and system for scene division and content detection |
CN116414958A (en) * | 2023-02-06 | 2023-07-11 | 飞算数智科技(深圳)有限公司 | Text corpus generation method and device, storage medium and electronic equipment |
Similar Documents
Publication | Title
---|---
CN110795543B (en) | Unstructured data extraction method, device and storage medium based on deep learning
CN109376222B (en) | Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111444320A (en) | Text retrieval method and device, computer equipment and storage medium
CN111046275B (en) | User label determining method and device based on artificial intelligence and storage medium
CN109271493A (en) | Language text processing method, device and storage medium
CN110457708B (en) | Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN110795913B (en) | Text encoding method, device, storage medium and terminal
CN110096567A (en) | Multi-turn dialogue reply selection method and system based on QA knowledge base reasoning
CN113705313A (en) | Text recognition method, device, equipment and medium
CN113255328B (en) | Training method and application method of language model
CN118575173A (en) | Enhancing machine learning language models using search engine results
CN108304376B (en) | Text vector determination method and device, storage medium and electronic device
CN113836295B (en) | Text abstract extraction method, system, terminal and storage medium
CN113569018A (en) | Question and answer pair mining method and device
CN115131698A (en) | Video attribute determination method, device, equipment and storage medium
CN116402166B (en) | Training method and device of prediction model, electronic equipment and storage medium
CN114490926A (en) | Method and device for determining similar problems, storage medium and terminal
CN110852066B (en) | Multi-language entity relation extraction method and system based on adversarial training mechanism
CN116955591A (en) | Recommendation language generation method, related device and medium for content recommendation
CN117494815A (en) | Document-oriented trusted large language model training and inference method and device
CN110795544A (en) | Content search method, device, equipment and storage medium
CN114490949B (en) | Document retrieval method, device, equipment and medium based on BM25 algorithm
CN117609612A (en) | Resource recommendation method and device, storage medium and electronic equipment
CN116680379A (en) | Text processing method, text processing device, electronic equipment and computer readable storage medium
CN114417824B (en) | Chapter-level relation extraction method and system based on dependency syntax pre-training model
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination