CN115186065A

CN115186065A - Target word retrieval method and device

Info

Publication number: CN115186065A
Application number: CN202210842766.9A
Authority: CN
Inventors: 綦红镀
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2022-07-18
Filing date: 2022-07-18
Publication date: 2022-10-14

Abstract

The application discloses a target word retrieval method and device, which can be applied in the field of artificial intelligence. The method comprises: obtaining an original search phrase; determining a candidate set according to the original search phrase; and querying each query phrase in the candidate set to determine a query result corresponding to each query phrase. By introducing latent semantics, the method extracts the latent semantic information contained in the corpus of the search library and constructs, from the original query, a candidate set of semantically similar queries. This addresses the lack of semantic matching capability in existing full-text search and improves the intelligence and user experience of full-text search.

Description

Target word retrieval method and device

Technical Field

The application relates to the technical field of computers, in particular to a target word retrieval method and device.

Background

Full-text retrieval is retrieval that finds arbitrary content in an entire book or article stored in a database. It can return information about chapters, sections, paragraphs, sentences, words and the like as required; in effect, every word in the book is given a label, so that various statistics and analyses can be carried out. The TF-IDF algorithm cannot uncover deep semantic relations among words, so a traditional ES (Elasticsearch) search engine cannot handle the case where several words share one meaning; it remains at the level of low-level keyword search and cannot provide semantic-level search for the user. For example, when a user searches for "automobile", a traditional full-text search returns only records containing the word "automobile", whereas records containing the word "car" may also be what the user wants.

That is, in current target word search methods, the search result is usually limited to the literal wording of the request entered by the user and cannot capture the real intention behind it. Such word retrieval methods have poor recall and poor precision.

Disclosure of Invention

In view of this, the embodiment of the present application provides a method and an apparatus for retrieving a target word, which aim to implement accurate full-text retrieval of the target word.

In a first aspect, an embodiment of the present application provides a method for retrieving a target word, where the method includes:

acquiring an original retrieval phrase;

determining a candidate set according to the original search phrase; the candidate set comprises the original search phrase and a plurality of first search phrases, wherein the first search phrases are phrases semantically similar to the original search phrase;

and querying each query phrase in the candidate set, and determining a query result corresponding to each query phrase.

Optionally, the determining a candidate set according to the original search phrase includes:

acquiring a latent semantic calculation model;

determining a plurality of first search phrases through the latent semantic calculation model and a first rule, wherein the first search phrases are phrases semantically similar to the original search phrase; the first rule is used for determining the number of the first search phrases;

merging the original search phrase with the plurality of first search phrases to form a candidate set.

Optionally, the determining the query result corresponding to each query phrase includes:

determining a text search record according to the query phrase, wherein the text search record comprises text related terms;

and determining the query result according to a second rule, wherein the second rule is used for determining the number of the text related terms in the query result.

Optionally, after determining the query result corresponding to each query phrase, the method further includes:

and combining the query results corresponding to each query phrase, and taking the combined results as a final query result set.

Optionally, the latent semantic calculation model is a latent semantic analysis model or a word vector model.

In a second aspect, an embodiment of the present application provides an apparatus for retrieving a target word, where the apparatus includes:

the original retrieval phrase acquisition module is used for acquiring an original retrieval phrase;

a candidate set determining module, configured to determine a candidate set according to the original search phrase; the candidate set comprises an original search phrase and a plurality of first search phrases, wherein the first search phrases are phrases which are similar to the original search phrase in semantic meaning;

and the query result determining module is used for querying each query phrase in the candidate set and determining the query result corresponding to each query phrase.

Optionally, the candidate set determining module includes:

the calculation model acquisition module is used for acquiring a latent semantic calculation model;

a first search phrase determination module, configured to determine a plurality of first search phrases according to the latent semantic calculation model and a first rule, where the first search phrases are phrases semantically similar to the original search phrase; the first rule is used for determining the number of the first search phrases;

and the candidate set forming module is used for combining the original search phrase and the plurality of first search phrases to form a candidate set.

Optionally, the query result determining module includes:

the text search record determining module is used for determining text search records according to the query phrases, wherein the text search records comprise text related terms;

and the query result determining module is used for determining the query result according to a second rule, and the second rule is used for determining the number of the text related terms in the query result.

Optionally, the apparatus further comprises:

and the merging module is used for merging the query results corresponding to each query phrase, and taking the merged results as a final query result set.

The embodiments of the application provide a target word retrieval method and device. When the method is executed, an original search phrase is obtained; a candidate set is determined according to the original search phrase; and each query phrase in the candidate set is queried to determine a query result corresponding to each query phrase. Thus, after the user inputs the target search phrase, the system analyzes it to obtain semantically similar phrases, and uses the original search phrase together with these semantically similar phrases as search conditions to obtain a plurality of search results. This achieves comprehensive retrieval of the target word: the search result combines keyword search and semantic search, improving the recall and precision of conventional full-text search.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments or the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow diagram of a method for retrieving a target term provided by an embodiment of the present application;

FIG. 2 is another flow diagram of a method for retrieving a target term provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a target word retrieval device provided by an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

As mentioned above, current full-text search engines, such as the underlying layer of ES (Elasticsearch) search, use the TF-IDF algorithm to calculate relevance. However, the inventor has found that this approach is limited: the TF-IDF algorithm cannot uncover deep semantic relations among words, so the retrieval method cannot capture the real intention behind the sentence input by the user, and it suffers from poor recall and low precision.

To solve this problem, the embodiments of the application provide a method and a device for retrieving a target word. When the method is executed, an original search phrase is obtained; a candidate set is determined according to the original search phrase; and each query phrase in the candidate set is queried to determine a query result corresponding to each query phrase. Thus, after the user inputs the target search phrase, the system analyzes it to obtain semantically similar phrases, and uses the original search phrase together with these semantically similar phrases as search conditions to obtain a plurality of search results. This achieves comprehensive retrieval of the target word: the search result combines keyword search and semantic search, improving the recall and precision of conventional full-text search.

The method provided by the embodiments of the application is executed by a search engine and a background server; for example, the background server comprises a retrieval system with a retrieval function and an integration function. After the retrieval system obtains the semantic calculation model for the search library, it analyzes the original search phrase to obtain semantically similar phrases, inputs each query object into a search engine such as ES (Elasticsearch) for retrieval, and forms a query result according to relevance. The background server may be a single server device or a server cluster composed of a plurality of servers.

The following describes a method for retrieving a target word provided by the present application, by using an embodiment. Referring to fig. 1, fig. 1 is a flowchart of a method for retrieving a target word according to an embodiment of the present application, including:

S101: The original search phrase is obtained.

The original search phrase is an initial search target phrase. In a specific application scenario, the original search phrase may be input by a user, or may be set by the system according to a query requirement.

S102: determining a candidate set from the original search phrase.

The candidate set comprises an original search phrase and a plurality of first search phrases, wherein the first search phrases are phrases semantically similar to the original search phrase.

Assume the original search phrase is "big data investigation". Using the latent semantic model, the system calculates the 2 records in the search library that are semantically most similar to the phrase to be searched, namely "investigation" and "learning"; these two records are the first search phrases. The system then places the original search phrase and the first search phrases obtained in this way into the candidate set. In an actual application scenario, the system can assign a different mark to each first search phrase according to its degree of relevance to the original search phrase; the marks can be used later to distinguish search results of different similarity.

How the candidate set is determined from the original search phrase is described in detail below and is not repeated here.

S103: and querying each query phrase in the candidate set, and determining a query result corresponding to each query phrase.

The query phrases are the phrases in the candidate set, queried one by one. The original search phrase and the plurality of first search phrases are extracted from the candidate set in turn, and each is used as a query phrase. A query result corresponding to each query phrase is then determined.

In an actual application scenario, the system may merge the query results corresponding to each query phrase to form a final query result set.
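Steps S101 to S103 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `retrieve`, `toy_expand` and `toy_search` are hypothetical, and the two toy functions stand in for the latent semantic model and the full-text search engine, neither of which is specified at this point in the description.

```python
from typing import Callable, Dict, List


def retrieve(
    original_phrase: str,
    expand: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run steps S101-S103 for one original search phrase."""
    # S102: the candidate set holds the original phrase followed by the
    # semantically similar "first search phrases".
    candidate_set = [original_phrase] + [
        p for p in expand(original_phrase) if p != original_phrase
    ]
    # S103: query every phrase in the candidate set, one by one.
    return {phrase: search(phrase) for phrase in candidate_set}


# Hypothetical stand-in for the latent semantic model.
def toy_expand(phrase: str) -> List[str]:
    return ["hadoop investigation", "spark learning"]


# Hypothetical stand-in for the full-text search engine.
def toy_search(phrase: str) -> List[str]:
    return [phrase + " notes"]


results = retrieve("big data investigation", toy_expand, toy_search)
```

Passing the expansion and search steps in as functions keeps the sketch independent of any particular model or engine; the returned dictionary keeps the candidate set order, with the original phrase first.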

The following describes in detail a method for retrieving a target word provided in an embodiment of the present application. Referring to FIG. 2, FIG. 2 is another schematic flow chart of the target word retrieval method provided by an embodiment of the present application. The specific process is as follows:

S201: The original search phrase is obtained.

The system obtains the original search phrase to be searched.

S202: A latent semantic calculation model is acquired.

A latent semantic model is computed offline from the corpus in the search library. The model may be an LSA model or a Word2vec model, which is not limited herein. The following steps use LSA as an example and specifically include:

The document set is analyzed to establish a word-document matrix A. Let A be an m×n text data matrix (n << m), indicating that the corpus contains m words and n documents. The rows of A represent words and the columns represent documents; typically, an element of a word-document matrix is the number of occurrences of the word in the document.

Singular value decomposition (SVD) is performed on the word-document matrix, the decomposed matrices are reduced in dimension, and the latent semantic space (LSA) model is constructed from the reduced matrices. The formula is as follows:

$$A_{m \times n} = U_{m \times k} \, \Sigma_{k \times k} \, V^{T}_{k \times n}$$

In the formula, $A_{m \times n}$ is the m×n text data matrix, which is decomposed into the product of 3 matrices: $U_{m \times k}$ is the word-topic matrix, $\Sigma_{k \times k}$ is the diagonal matrix, and $V^{T}_{k \times n}$ is the topic-document matrix. A can be decomposed into k eigenvalues (the singular values on the diagonal of Σ), where k can be regarded as the number of topics. After sorting, the r largest eigenvalues are selected, and the value of r can be calculated according to the following formula:

$$\frac{P_r}{P} \geq 95\%, \qquad P_r = \sum_{i=1}^{r} \sigma_i^{2}, \qquad P = \sum_{i=1}^{k} \sigma_i^{2}$$

In the formula, $\sigma_i$ is the i-th largest eigenvalue of the diagonal matrix, $P_r$ is the sum of squares of the r largest eigenvalues of the diagonal matrix, and $P$ is the sum of squares of all eigenvalues of the diagonal matrix. The r calculated in this way retains more than 95% of the information content of the original matrix, and r is far less than k.

Thus,

$$A_{m \times n} \approx U_{m \times r} \, \Sigma_{r \times r} \, V^{T}_{r \times n}$$

approximates the matrix A. U is the word-topic matrix: each row corresponds to a word and each column represents a latent semantic, whose meaning is formed by combining the m words with different weights. Because the columns of U are mutually independent, the r latent semantics form a semantic space. The larger a value in U, the more relevant the word is to that latent semantic, so the correlation between words and latent semantics can be read from $U_{m \times r}$. Each row of $V^{T}$ represents a topic category, and each non-zero element represents the relevance of a topic to a document, so the relevance of the documents to the topics can be read from $V^{T}_{r \times n}$. $\Sigma V^{T}$ is a topic-document matrix in which each column represents a document mapped into the semantic space. Each singular value in Σ indicates the importance of the corresponding latent semantic, and the matrix Σ represents the correlation between the document topics and the keywords.
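The LSA construction described above can be illustrated with NumPy. The sketch below is not the implementation of the application; it assumes a small hand-made word-document count matrix, and the function name `build_lsa` and the `energy` parameter (the 95% share of the squared singular values) are hypothetical names chosen for this example.

```python
import numpy as np


def build_lsa(A: np.ndarray, energy: float = 0.95):
    """Truncated SVD of a word-document matrix A (m words x n documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Cumulative share of the squared singular values, i.e. P_r / P.
    share = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest r whose share reaches the requested threshold.
    r = min(int(np.searchsorted(share, energy)) + 1, len(s))
    return U[:, :r], s[:r], Vt[:r, :]


# Toy corpus (an assumption for illustration): 5 words x 4 documents.
# Documents 0-1 use mostly words 0-1, documents 2-3 mostly words 2-3.
A = np.array(
    [
        [2, 1, 0, 0],
        [1, 2, 0, 0],
        [0, 0, 3, 1],
        [0, 0, 1, 3],
        [1, 1, 1, 1],
    ],
    dtype=float,
)

U_r, s_r, Vt_r = build_lsa(A)
# Rank-r approximation of A from the three reduced matrices.
A_approx = U_r @ np.diag(s_r) @ Vt_r
```

In a real corpus the matrix is large and sparse, so a sparse truncated SVD routine would normally replace the dense `np.linalg.svd` call used here.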

S203: A plurality of first search phrases is determined through the latent semantic calculation model and a first rule.

The first rule is used for determining the number of the first search phrases. In a specific application scenario, the first rule may be set by the user, or may be set by the system according to the query requirement.

In some possible implementations, a user may select or remove the semantically similar search records provided by the system according to personal search requirements, or may combine different first search phrases with the original search phrase as required. For example, when the user deletes the second most semantically similar search phrase, the third most similar one moves up to fill its place. The first search phrases can also be set by the user, either by selecting other semantically similar search phrases found by the system or by entering the content of a first search phrase directly.
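The removal-and-backfill behaviour can be sketched as follows. This is a minimal sketch assuming the system already holds a list of similar phrases ranked from most to least similar; the function name `select_first_phrases` and the third phrase "flink tutorial" are hypothetical and serve only to show the backfill.

```python
from typing import Iterable, List


def select_first_phrases(
    ranked: List[str], n: int, removed: Iterable[str] = ()
) -> List[str]:
    """Keep the n most similar phrases, skipping those the user removed."""
    removed_set = set(removed)
    # Phrases stay in similarity order, so the next one moves up
    # automatically when an earlier one is removed.
    kept = [p for p in ranked if p not in removed_set]
    return kept[:n]


# Ranked from most to least similar; the third entry is made up.
ranked = ["hadoop investigation", "spark learning", "flink tutorial"]

default_choice = select_first_phrases(ranked, n=2)
after_removal = select_first_phrases(ranked, n=2, removed=["spark learning"])
```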

For example, when the first rule indicates that the number of first search phrases is 2, in an actual application scenario the system uses the latent semantic model to calculate the 2 records in the search library that are semantically most similar to the phrase to be searched. Suppose the query is query = "big data investigation", and the 2 most similar records calculated through the latent semantic model are similar_top_2 = {"hadoop investigation", "spark learning"}. Here, "investigation" and "learning" are the first search phrases corresponding to the current first rule.

Here, Hadoop and Spark are the representative marks of the different first search phrases. In practical applications, the system can use these marks to distinguish first search phrases of different similarity, and in some possible implementations the system can set and adaptively modify the representative marks of the first search phrases.

In practice, both Hadoop and Spark are big data frameworks. Spark is a fast, general-purpose computing engine designed for large-scale data processing. Hadoop is a distributed system infrastructure developed by the Apache Foundation that processes data in a reliable, efficient, and scalable manner. During application, the data retrieved by the system can be further processed with these two big data frameworks.

As a further optimization, in S203, the N records in the search library that are semantically most similar to the phrase to be searched are calculated through the latent semantic model, specifically as follows:

For a given query, a pseudo document is constructed from the words $A_q$ contained in the query:

$$V_q = A_q U \Sigma^{-1}$$

The cosine similarity between the pseudo document and each column of V is then calculated to obtain the N documents most similar to the given query. Suppose the text vector corresponding to the t-th column of V is $V_t$; the cosine similarity between the pseudo document vector $V_q$ and $V_t$ is calculated by the formula:

$$\cos(V_q, V_t) = \frac{V_q \cdot V_t}{|V_q| \, |V_t|}$$

In the formula, $V_q$ and $V_t$ are the vector representations of the query text and of the text corresponding to the t-th column of the semantic space matrix, $|V_q|$ and $|V_t|$ are the moduli of the vectors $V_q$ and $V_t$ respectively, and $\cos(V_q, V_t)$ is the cosine similarity between the query text vector and the document vector.
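The pseudo-document construction and the cosine ranking can be tried out with NumPy. The following is a sketch under assumptions, not the implementation of the application: the toy word-document matrix, the function name `top_n_documents` and its parameters are hypothetical, the query is given as a vector of word counts over the same vocabulary as the matrix, and all retained singular values are assumed to be non-zero.

```python
import numpy as np


def top_n_documents(A, query_counts, n_top, r):
    """Return (document index, cosine similarity) for the n_top closest documents."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_r, s_r = U[:, :r], s[:r]
    docs = Vt[:r, :].T  # one r-dimensional row per document
    # Fold the query into the latent space: V_q = A_q U Sigma^{-1}.
    v_q = (query_counts @ U_r) / s_r
    # Cosine similarity between V_q and every document vector V_t.
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(v_q)
    cos = (docs @ v_q) / np.where(norms == 0.0, 1.0, norms)
    best = np.argsort(-cos)[:n_top]
    return [(int(t), float(cos[t])) for t in best]


# Toy corpus (an assumption for illustration): 5 words x 4 documents.
A = np.array(
    [
        [2, 1, 0, 0],
        [1, 2, 0, 0],
        [0, 0, 3, 1],
        [0, 0, 1, 3],
        [1, 1, 1, 1],
    ],
    dtype=float,
)

# Query with the same word counts as document 0.
query = np.array([2, 1, 0, 0, 1], dtype=float)
hits = top_n_documents(A, query, n_top=2, r=2)
```

With two latent dimensions, documents 0 and 1 of this toy corpus map to the same point in the semantic space, so both are returned for a query that literally matches only document 0. This is the semantic matching effect the method relies on.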

S204: merging the original search phrase with the plurality of first search phrases to form a candidate set.

The original search phrase is merged with the plurality of first search phrases corresponding to the first rule to form a candidate set. For example, merging the original search phrase "big data investigation" with the first search phrases "hadoop investigation" and "spark learning" gives candidate_list = {"big data investigation", "hadoop investigation", "spark learning"}.

S205: Each query phrase in the candidate set is queried, and text search records are determined according to the query phrases.

A text search record comprises a plurality of text related terms, which are the terms found by the query that are relevant to the query phrase. The text search record is the collection of these related terms.

Specifically, the system takes the phrases to be queried out of the candidate set in turn and determines the text search record corresponding to the current query phrase. For example, when the query phrase is the original search phrase, that is, query1 = "big data investigation", the text search record corresponding to the current query phrase may be "big data development status investigation", "big data related component research", "big data and artificial intelligence relationship", "big data development prospect exploration", and "application expansion of big data". The text search record is the full-text search result for the current query phrase; that is, if the full text contains N items directly related to the current query phrase, the text search record may contain N items.
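If the full-text engine is Elasticsearch, each query phrase could be turned into a `match` query body as sketched below. This is an assumption for illustration only: the description does not specify the request format, the field name "content" and the function name `build_match_query` are hypothetical, and sending the request to a cluster is deliberately left out.

```python
from typing import Any, Dict, List


def build_match_query(
    phrase: str, field: str = "content", size: int = 10
) -> Dict[str, Any]:
    """Build a search request body with a full-text match query."""
    return {
        "size": size,  # maximum number of hits to return
        "query": {"match": {field: phrase}},
    }


candidate_set: List[str] = [
    "big data investigation",
    "hadoop investigation",
    "spark learning",
]

# One request body per query phrase, in candidate set order.
bodies = [build_match_query(p) for p in candidate_set]
```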

In some possible implementations, the system ranks the text related terms by their relevance to the query phrase. In other words, while acquiring the text related terms, the system sorts them by relevance to form an ordered text search record. In the current text search record, the text related terms are therefore ordered by relevance: with the query phrase "big data investigation" as the relevance criterion, "big data development status investigation" is more relevant than "big data related component research", and relevance decreases progressively down the list.

In other possible implementations, the system only retrieves the full text for the query phrase and does not sort the related terms during retrieval, so the text related terms in the resulting text search record have no relevance order. To extract text related terms by relevance in this case, after obtaining the number required by the second rule, the system can either sort the text related terms in the text search record, or set a relevance threshold and filter the text related terms directly, so as to obtain the number of text related terms required by the second rule. That is, the system selects from the unordered text search record the text related terms whose relevance meets the standard.

S206: The query result is determined according to a second rule.

The second rule is used for determining the number of the text related terms in the query result. In a specific application scenario, the second rule may be set by the user, or may be set by the system according to the query requirement.

For example, when the second rule indicates that the number of text related terms is 3, in an actual application scenario the system takes the 3 text records with the highest relevance to form the query result. Assuming the text search record corresponding to the current query phrase is "big data development status investigation", "big data related component research", "big data and artificial intelligence relationship", "big data development prospect exploration", "application expansion of big data", the 3 most relevant text records are selected according to the selection rule for text related terms described in step S205, giving the result set result1 = {"big data development status investigation", "big data related component research", "big data and artificial intelligence relationship"}.
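The second rule can be sketched as a top-k selection with an optional relevance threshold. The sketch assumes each text related term arrives with a numeric relevance score; the scores below are invented for illustration, and `apply_second_rule`, `top_k` and `min_score` are hypothetical names.

```python
from typing import List, Optional, Tuple


def apply_second_rule(
    hits: List[Tuple[str, float]],
    top_k: int,
    min_score: Optional[float] = None,
) -> List[str]:
    """Keep at most top_k terms, most relevant first."""
    if min_score is not None:
        # Threshold variant: drop terms whose relevance is too low.
        hits = [h for h in hits if h[1] >= min_score]
    # Sorting also covers the case of an unordered text search record.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Unordered text search record with made-up relevance scores.
record = [
    ("big data and artificial intelligence relationship", 0.71),
    ("application expansion of big data", 0.40),
    ("big data development status investigation", 0.93),
    ("big data development prospect exploration", 0.55),
    ("big data related component research", 0.84),
]

result1 = apply_second_rule(record, top_k=3)
```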

S207: The query results corresponding to each query phrase are merged, and the merged results are taken as the final query result set.

According to the step S206, query results corresponding to a plurality of query texts in the candidate set are determined.

For example, the query phrases are extracted from the candidate set in turn and input into a full-text search engine such as ES (Elasticsearch) for retrieval, and the 3 text records with the highest relevance are extracted to form each query result. For the query phrases query1 = "big data investigation", query2 = "hadoop investigation", and query3 = "spark learning", the corresponding result sets are result1 = {"big data development status investigation", "big data related component research", "big data and artificial intelligence relationship"}, result2 = {"hadoop investigation", "hadoop technical investigation", "hadoop fast entry"}, and result3 = {"spark learning", "spark learning note", "spark basic course"}. In the current step, the query results corresponding to each query phrase are merged to form the search result set: result_list = {"big data development status investigation", "big data related component research", "big data and artificial intelligence relationship", "hadoop investigation", "hadoop technical investigation", "hadoop fast entry", "spark learning", "spark learning note", "spark basic course"}.
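The merge in S207 can be sketched as an order-preserving concatenation. Dropping repeated records is an assumption added here (the description does not say how duplicates are handled), and the function name `merge_results` is hypothetical.

```python
from typing import Iterable, List


def merge_results(results_per_query: Iterable[List[str]]) -> List[str]:
    """Concatenate per-query results, keeping the first copy of each record."""
    merged: List[str] = []
    seen = set()
    for results in results_per_query:
        for record in results:
            if record not in seen:  # assumed de-duplication step
                seen.add(record)
                merged.append(record)
    return merged


result1 = [
    "big data development status investigation",
    "big data related component research",
    "big data and artificial intelligence relationship",
]
result2 = ["hadoop investigation", "hadoop technical investigation", "hadoop fast entry"]
result3 = ["spark learning", "spark learning note", "spark basic course"]

result_list = merge_results([result1, result2, result3])
```

Because the per-query lists are passed in candidate set order, the results for the original search phrase stay at the front of the merged list.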

In an actual application scenario, the query results may be displayed by category on the user-side search interface. For example, the results for the original search phrase and for first search phrases of different similarity may be distinguished: the query result of the original search phrase is placed in the first row, followed by the query results of the first search phrases in rows ordered by decreasing similarity. Alternatively, the system can assign different colors to different phrases to distinguish them on the display interface.
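One way to prepare such a classified display is to tag every result with the phrase that produced it and with that phrase's position in the candidate set. The sketch below assumes the candidate set is already ordered with the original search phrase first and the first search phrases by decreasing similarity; `rows_for_display` and the dictionary keys are hypothetical names.

```python
from typing import Dict, List


def rows_for_display(
    candidate_set: List[str], results: Dict[str, List[str]]
) -> List[dict]:
    """Flatten per-phrase results into display rows, original phrase first."""
    rows = []
    for rank, phrase in enumerate(candidate_set):
        for text in results.get(phrase, []):
            # rank 0 marks the original search phrase; a user interface
            # could map the rank to a row position or a color.
            rows.append({"rank": rank, "source_phrase": phrase, "text": text})
    return rows


candidate_set = ["big data investigation", "hadoop investigation", "spark learning"]
results = {
    "spark learning": ["spark learning note"],
    "big data investigation": ["big data development status investigation"],
    "hadoop investigation": ["hadoop technical investigation"],
}

rows = rows_for_display(candidate_set, results)
```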

The foregoing provides some specific implementation manners of a retrieval method based on latent semantic analysis for the embodiments of the present application, and based on this, the present application also provides a corresponding apparatus. The device provided by the embodiment of the present application will be described in terms of functional modularity.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a target word retrieval device according to an embodiment of the present application.

In this embodiment, the apparatus may include:

an original search phrase obtaining module 300, configured to obtain an original search phrase;

a candidate set determining module 301, configured to determine a candidate set according to the original search phrase; the candidate set comprises an original search phrase and a plurality of first search phrases, wherein the first search phrases are phrases which are similar to the original search phrase in semantic meaning;

a query result determining module 302, configured to query each query phrase in the candidate set from the target text, and determine a query result corresponding to each query phrase.

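Optionally, the candidate set determining module includes:

a calculation model acquisition module, configured to acquire a latent semantic calculation model;

a first search phrase determination module, configured to determine a plurality of first search phrases according to the latent semantic calculation model and a first rule, where the first search phrases are phrases semantically similar to the original search phrase; the first rule is used for determining the number of the first search phrases;

and a candidate set forming module, configured to merge the original search phrase and the plurality of first search phrases to form a candidate set.

Optionally, the query result determining module includes:

a text search record determining module, configured to determine text search records according to the query phrases, where the text search records comprise text related terms;

and a query result determining module, configured to determine the query result according to a second rule, where the second rule is used for determining the number of the text related terms in the query result.

Optionally, the apparatus further comprises:

a merging module, configured to merge the query results corresponding to each query phrase and take the merged results as a final query result set.

Optionally, the latent semantic calculation model is a latent semantic analysis model or a word vector model.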

It should be noted that the target word retrieval method and device provided by the invention can be used in the field of artificial intelligence. The above description is only an example, and does not limit the application field of the target word retrieval method and apparatus provided by the present invention.

The above provides a detailed description of a method and apparatus for retrieving a target word. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It should also be noted that, in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for retrieving a target word, the method comprising:

acquiring an original retrieval phrase;

determining a candidate set according to the original retrieval phrase; the candidate set comprises an original search phrase and a plurality of first search phrases which are semantically similar to the original search phrase

2. The method of claim 1, wherein said determining a candidate set from said original search phrase comprises:

acquiring a latent semantic calculation model;

determining a plurality of first search phrases through the latent semantic calculation model and a first rule, wherein the first search phrases are phrases semantically similar to the original search phrase; the first rule is used for determining the number of the first search phrases;

3. The method for retrieving the target term according to claim 1, wherein the determining the query result corresponding to each query phrase comprises:

4. The method for retrieving the target term in claim 1, wherein after determining the query result corresponding to each query phrase, the method further comprises:

5. The method for retrieving target words according to claim 2, wherein the latent semantic calculation model is a latent semantic analysis model or a word vector model.

6. An apparatus for retrieving a target word, the apparatus comprising:

and the query result determining module is used for querying each query phrase in the candidate set and determining a query result corresponding to each query phrase.

7. The apparatus of claim 6, wherein the candidate set determination module comprises:

8. The apparatus of claim 6, wherein the query result determination module comprises:

9. The apparatus of claim 6, further comprising:

10. The apparatus of claim 7, wherein the latent semantic calculation model is a latent semantic analysis model or a word vector model.