CN114218948A

CN114218948A - Keyword recognition method and device, equipment, medium and product thereof

Info

Publication number: CN114218948A
Application number: CN202111536935.8A
Authority: CN
Inventors: 王�锋
Original assignee: Guangzhou Huaduo Network Technology Co Ltd
Current assignee: Guangzhou Huaduo Network Technology Co Ltd
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2022-03-22

Abstract

The application discloses a keyword recognition method and a corresponding apparatus, device, medium and product. The method comprises the following steps: acquiring a text to be recognized on which named entity recognition is to be performed; vectorizing the text to be recognized to obtain an embedded vector for each character in the text, wherein the embedded vector comprises the character vector of the character and word vectors obtained by classifying all possible word segments containing the character according to the position at which the character occurs in each word segment and encoding each class; and extracting a text feature vector from the embedded vectors using a named entity recognition model trained to convergence, and extracting keywords corresponding to named entities from the text to be recognized according to the text feature vector. Because the word segments containing each character are classified and encoded according to the position of the character within them, the semantic representation capability of the resulting embedded vectors is improved, which in turn allows the named entity recognition model to recognize named entities more accurately.

Description

Keyword recognition method and device, equipment, medium and product thereof

Technical Field

The present application relates to the field of e-commerce information technologies, and in particular, to a keyword recognition method and a corresponding apparatus, computer device, computer-readable storage medium, and computer program product.

Background

Named entity recognition plays an important role in search, recommendation, user profile analysis and similar tasks in the e-commerce field. For example, when a user searches, the search box suggests words associated with what the user has typed, guiding the user toward the commodity words they want and improving search efficiency. The suggested words must be based on the commodities in a commodity library, and these commodity words can be recognized by a named entity recognition method. From a user's search, the tags of the search keywords, such as commodity words and brands, can be recognized and used together with term weighting (TermWeight) to refine search results. In addition, user behaviors such as searching for commodities, clicking on commodities, purchasing and placing orders can be analyzed with a nested entity recognition method to identify behavioral preferences for user profile analysis.

Named entity recognition (NER) can be solved as a sequence labeling problem in which entity boundaries and class labels are predicted jointly. Unlike English, Chinese has no explicit word boundaries, and different segmentations of the same character sequence carry different meanings, which makes the NER task more difficult. An intuitive approach is to perform word segmentation first and then perform word-level sequence labeling, but this approach propagates word segmentation errors to the labeling stage. The alternative is to perform sequence labeling directly at the character level, but this approach ignores word-level information.

Many efforts have been made to improve Chinese NER performance by using dictionaries. A representative example is Lattice LSTM, which set new benchmarks on several public Chinese NER datasets. The Lattice LSTM structure fuses word information into character information. It improves on the LSTM by allowing nodes to receive information from farther away, extends the model from a chain structure to a graph structure, and introduces multiple word segmentation results into the model so that word information can be passed to distant nodes, leaving the model to explore the paths. Lattice LSTM retains all possible dictionary matching results, which avoids the error propagation caused by heuristically selecting one matching result for each character, and it also introduces a pre-trained model to improve performance. However, the implementation principle of Lattice LSTM has several problems. First, in Lattice LSTM the j-th character can only obtain information from words ending with the j-th character and from the state of the preceding time step; it cannot obtain information from other words that contain the j-th character. Second, the nearest context is important for predicting the current token, but in Lattice LSTM the representation of a word fuses the state information of all previous time steps, including both characters and words, which can interfere with the meaning of the word. Third, because of this encoding mechanism, Lattice LSTM trains relatively slowly and needs a large number of data samples to converge.

Thus, there is still room for improvement, at least in the encoding process of named entity recognition related models.

Disclosure of Invention

A primary object of the present application is to solve at least one of the above problems and provide a keyword recognition method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.

In order to meet various purposes of the application, the following technical scheme is adopted in the application:

A keyword recognition method adapted to one of the objects of the present application includes the following steps:

acquiring a text to be recognized on which named entity recognition is to be performed;

vectorizing the text to be recognized to obtain an embedded vector for each character in the text to be recognized, wherein the embedded vector comprises the character vector of the character and word vectors obtained by classifying all possible word segments containing the character according to the position at which the character occurs in each word segment and encoding each class;

and extracting a text feature vector from the embedded vectors using a named entity recognition model trained to convergence, and extracting keywords corresponding to named entities from the text to be recognized according to the text feature vector.

In a further embodiment, vectorizing the text to be recognized to obtain an embedded vector corresponding to each character in the text to be recognized includes the following steps:

matching the text to be recognized against a preset dictionary to obtain all word segments corresponding to each character in the text to be recognized;

for each character in the text to be recognized, dividing all word segments of the character into a plurality of word segment subsets according to the position at which the character occurs in each word segment containing it;

for each character in the text to be recognized, compressing the word vectors of the word segments in each word segment subset of the character to obtain one classification vector per word segment subset;

and concatenating the character vector of each character in the text to be recognized with all of its classification vectors to obtain the embedded vector corresponding to that character.
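As an illustration only, the dictionary matching in the first of these steps might be sketched in Python as follows. The function names `match_lexicon` and `words_per_character`, the `max_len` limit and the toy lexicon are assumptions made for this sketch and are not specified by the application; the sketch simply enumerates every substring and keeps those found in the dictionary, without choosing any single segmentation.

```python
from typing import List, Set, Tuple


def match_lexicon(text: str, lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return every (start, end, word) span of `text` found in `lexicon`.

    `end` is exclusive. Overlapping matches are all kept.
    """
    spans = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


def words_per_character(text: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Map each character position to the matched words that cover it."""
    covering: List[List[str]] = [[] for _ in text]
    for start, end, word in match_lexicon(text, lexicon, max_len):
        for pos in range(start, end):
            covering[pos].append(word)
    return covering


# Hypothetical toy lexicon for the sample text "Zhongshan West Road".
LEXICON = {"中山", "中山西", "中山西路", "山西", "山西路", "山"}
TEXT = "中山西路"

covering = words_per_character(TEXT, LEXICON)
for char, words in zip(TEXT, covering):
    print(char, words)
```

For the second character of the sample text, all six lexicon entries are returned, since each of them covers that character.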

In an embodiment, dividing, for each character in the text to be recognized, all word segments of the character into a plurality of word segment subsets according to the position at which the character occurs in each word segment containing it includes the following steps:

for each character in the text to be recognized, constructing a first word segment subset from all word segments whose first character is the character;

for each character in the text to be recognized, constructing a second word segment subset from all word segments whose last character is the character;

for each character in the text to be recognized, constructing a third word segment subset from all word segments that contain the character in a middle position;

and, for each character in the text to be recognized, constructing a fourth word segment subset from the word segment that consists only of the character.
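The four subsets described above can be sketched as follows, assuming the dictionary matches are already available as (start, end) spans with an exclusive end. The labels B, M, E and S for the first, middle, last and single-character positions, the function name `position_subsets` and the sample spans are choices made for this sketch.

```python
from typing import Dict, List, Set, Tuple


def position_subsets(text: str, spans: List[Tuple[int, int]]) -> List[Dict[str, Set[str]]]:
    """Group the matched words of each character by the character's position in the word.

    B: the character is the first character of a multi-character word
    M: the character lies strictly inside the word
    E: the character is the last character of a multi-character word
    S: the word consists of the character alone
    """
    subsets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in text]
    for start, end in spans:  # `end` is exclusive
        word = text[start:end]
        if end - start == 1:
            subsets[start]["S"].add(word)
            continue
        subsets[start]["B"].add(word)
        subsets[end - 1]["E"].add(word)
        for pos in range(start + 1, end - 1):
            subsets[pos]["M"].add(word)
    return subsets


# Hypothetical matches for the text "Zhongshan West Road".
TEXT = "中山西路"
SPANS = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4)]

shan = position_subsets(TEXT, SPANS)[1]  # subsets of the second character
print(shan)
```

A single pass over the matched spans fills the subsets of every character at once, so no per-character search is needed.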

In a specific embodiment, when compressing, for each character in the text to be recognized, the word vectors of the word segments in each word segment subset of the character to obtain the classification vector of each word segment subset, the compression is performed in either of the following manners:

averaging the word vectors of all word segments in the word segment subset to obtain the classification vector corresponding to the word segment subset;

or weighting the word vectors of all word segments in the word segment subset and then averaging them to obtain the classification vector corresponding to the word segment subset.
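The two compression manners can be sketched in plain Python as follows. The application does not state where the weights come from or how the weighted average is normalized; this sketch assumes non-negative weights (for instance word frequencies, which is purely an assumption) normalized by their sum, and it returns a zero vector for an empty subset, which is also an assumption.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector], dim: int) -> Vector:
    """Average the word vectors of one subset; an empty subset gives a zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def weighted_mean_pool(vectors: Sequence[Vector], weights: Sequence[float], dim: int) -> Vector:
    """Weight each word vector, then divide by the total weight (assumed normalization)."""
    total = sum(weights)
    if not vectors or total == 0:
        return [0.0] * dim
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


# Two hypothetical 2-dimensional word vectors belonging to the same subset.
subset_vectors = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(subset_vectors, 2))                       # [2.0, 3.0]
print(weighted_mean_pool(subset_vectors, [1.0, 3.0], 2))  # [2.5, 3.5]
```

Either manner yields one fixed-size classification vector per subset, regardless of how many word segments the subset contains.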

In a deepened embodiment, a named entity recognition model trained to a convergence state is adopted to extract a text feature vector from the embedded vector, and keywords corresponding to named entities in the text to be recognized are extracted from the text to be recognized according to the text feature vector, and the method comprises the following steps:

extracting deep semantic information of the embedded vector by adopting a text feature extraction model in the named entity model to obtain a corresponding text feature vector;

and performing part-of-speech tagging by adopting a conditional random field model in the named entity model according to the text feature vector, and extracting a plurality of keywords representing named entities from the text to be recognized according to part-of-speech tagging results.
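A minimal sketch of the tagging and extraction stage is given below, assuming the text feature extraction model has already produced a per-character score for each tag. It uses Viterbi decoding, the usual way to obtain the best tag sequence from a linear-chain conditional random field, followed by span extraction. The tag set `B-ENT`/`I-ENT`/`O`, the scores and the function names are hypothetical; the B tag here marks the beginning of an entity and is unrelated to the position classes used during encoding.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[i][t] is the score of tag t at character i;
    transitions[(s, t)] is the score of moving from tag s to tag t
    (pairs that are not listed count as 0.0).
    """
    best = [{t: emissions[0][t] for t in tags}]
    backpointers = []
    for scores in emissions[1:]:
        prev, cur, ptr = best[-1], {}, {}
        for t in tags:
            s = max(tags, key=lambda p: prev[p] + transitions.get((p, t), 0.0))
            cur[t] = prev[s] + transitions.get((s, t), 0.0) + scores[t]
            ptr[t] = s
        best.append(cur)
        backpointers.append(ptr)
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract_keywords(text: str, tags: List[str]) -> List[str]:
    """Collect the substrings tagged B-* followed by any number of I-* tags."""
    keywords, start = [], None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            keywords.append(text[start:i])
            start = None
        if tag.startswith("B-"):
            start = i
    return keywords


# Hypothetical scores for the text "go to Zhongshan West Road".
TAGS = ["B-ENT", "I-ENT", "O"]
TEXT = "去中山西路"
EMISSIONS = [
    {"B-ENT": 0.0, "I-ENT": 0.0, "O": 2.0},
    {"B-ENT": 1.5, "I-ENT": 2.0, "O": 0.0},
    {"B-ENT": 0.0, "I-ENT": 2.0, "O": 0.0},
    {"B-ENT": 0.0, "I-ENT": 2.0, "O": 0.0},
    {"B-ENT": 0.0, "I-ENT": 2.0, "O": 0.0},
]
TRANSITIONS = {("O", "I-ENT"): -10.0}  # an entity cannot continue right after O

decoded = viterbi(EMISSIONS, TRANSITIONS, TAGS)
keywords = extract_keywords(TEXT, decoded)
print(decoded, keywords)
```

In this example the second character scores highest for `I-ENT` when taken alone, but the transition penalty makes `B-ENT` the better choice in the overall sequence, which is the effect the conditional random field layer is meant to provide.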

In an extended embodiment, after the step of extracting the text feature vector from the embedded vector by using the named entity recognition model trained to the convergence state and extracting the keywords corresponding to the named entity from the text to be recognized according to the text feature vector, the method comprises the following steps:

and constructing a search expression according to the keywords, calling a commodity search engine to obtain a commodity list matched with the search expression, and pushing the commodity list to a search requester providing the text to be identified.
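One possible way to build the search expression is sketched below. The application does not define the expression syntax or the search engine interface, so the `AND` operator, the quoting of terms and the in-memory `search_commodities` stand-in are all assumptions made for illustration.

```python
from typing import List


def build_search_expression(keywords: List[str]) -> str:
    """Join de-duplicated, non-empty keywords with AND, quoting each term."""
    seen, terms = set(), []
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword and keyword not in seen:
            seen.add(keyword)
            terms.append(f'"{keyword}"')
    return " AND ".join(terms)


def search_commodities(keywords: List[str], catalog: List[str]) -> List[str]:
    """Stand-in for a commodity search engine: titles that contain every keyword."""
    return [title for title in catalog if all(k in title for k in keywords)]


# Hypothetical keywords and catalog.
KEYWORDS = ["Nike", "running shoes", "Nike"]
CATALOG = ["Nike running shoes size 42", "Adidas running shoes", "Nike cap"]

expression = build_search_expression(KEYWORDS)
results = search_commodities(["Nike", "running shoes"], CATALOG)
print(expression)
print(results)
```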

In another embodiment of the expansion, after the step of extracting the text feature vector from the embedded vector by using the named entity recognition model trained to the convergence state and extracting the keywords corresponding to the named entity from the text to be recognized according to the text feature vector, the method comprises the following steps:

and labeling the commodity object that carries the text to be recognized with the keywords, so that the keywords form a profile label of the commodity object.

A keyword recognition apparatus adapted to one of the objects of the present application includes a text acquisition module, a text encoding module and an entity recognition module. The text acquisition module is used for acquiring a text to be recognized on which named entity recognition is to be performed. The text encoding module is used for vectorizing the text to be recognized to obtain an embedded vector for each character in the text to be recognized, wherein the embedded vector comprises the character vector of the character and word vectors obtained by classifying all possible word segments containing the character according to the position at which the character occurs in each word segment and encoding each class. The entity recognition module is used for extracting a text feature vector from the embedded vectors using a named entity recognition model trained to convergence, and extracting keywords corresponding to named entities from the text to be recognized according to the text feature vector.

In a further embodiment, the text encoding module includes: a text segmentation submodule, used for matching the text to be recognized against a preset dictionary to obtain all word segments corresponding to each character in the text to be recognized; a classification construction submodule, used for dividing, for each character in the text to be recognized, all word segments of the character into a plurality of word segment subsets according to the position at which the character occurs in each word segment containing it; a classification compression submodule, used for compressing, for each character in the text to be recognized, the word vectors of the word segments in each word segment subset of the character to obtain one classification vector per word segment subset; and a vector synthesis submodule, used for concatenating the character vector of each character in the text to be recognized with all of its classification vectors to obtain the embedded vector corresponding to that character.

In a specific embodiment, the classification construction submodule includes: a first construction unit, used for constructing, for each character in the text to be recognized, a first word segment subset from all word segments whose first character is the character; a second construction unit, used for constructing, for each character in the text to be recognized, a second word segment subset from all word segments whose last character is the character; a third construction unit, used for constructing, for each character in the text to be recognized, a third word segment subset from all word segments that contain the character in a middle position; and a fourth construction unit, used for constructing, for each character in the text to be recognized, a fourth word segment subset from the word segment that consists only of the character.

In an embodiment, the classification compression submodule is implemented by either of the following units: an average compression unit, used for averaging the word vectors of all word segments in the word segment subset to obtain the classification vector corresponding to the word segment subset; or a weighted compression unit, used for weighting the word vectors of all word segments in the word segment subset and then averaging them to obtain the classification vector corresponding to the word segment subset.

In a further embodiment, the entity identification module includes: the expression learning submodule is used for extracting deep semantic information of the embedded vector by adopting a text feature extraction model in the named entity model to obtain a corresponding text feature vector; and the entity extraction submodule is used for performing part-of-speech tagging according to the text feature vector by adopting a conditional random field model in the named entity model, and extracting a plurality of keywords representing named entities from the text to be recognized according to part-of-speech tagging results.

In an expanded embodiment, the keyword recognition apparatus of the present application further includes: and the search execution module is used for constructing a search expression according to the keyword, calling a commodity search engine to obtain a commodity list matched with the search expression, and pushing the commodity list to a search requester providing the text to be identified.

In another expanded embodiment, the keyword recognition apparatus of the present application further includes: a label execution module, used for labeling the commodity object that carries the text to be recognized with the keywords, so that the keywords form a profile label of the commodity object.

A computer device adapted for one of the purposes of the present application comprises a central processing unit and a memory, the central processing unit being configured to invoke and run a computer program stored in the memory to perform the steps of the keyword recognition method described herein.

A computer-readable storage medium, which stores a computer program implemented according to the keyword recognition method in the form of computer-readable instructions, and when the computer program is called by a computer, executes the steps included in the method.

A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.

Compared with the prior art, the application has the following advantages:

First, the present application mainly improves the encoding of the text on which named entity recognition is to be performed. The text to be recognized is vectorized to obtain an embedded vector for each character, and each embedded vector contains the character vector of the character together with word vectors obtained by classifying all possible word segments containing the character according to the position at which the character occurs in each word segment and encoding each class. Classification strengthens the representation of the features common to each class and thereby guides the representation learning of the named entity recognition model, so that the deep semantic information learned by the model represents the semantic correlations among characters more accurately. The model can then perform part-of-speech tagging on the basis of this more accurate semantic information and accurately recognize each named entity from the tagging results to obtain the corresponding keywords.

Second, the encoding process takes into account that Chinese has no explicit word boundaries and that different segmentations of the same character sequence carry different meanings. The word segments are therefore classified by the specific position at which the character occurs in them, so that each class expresses the meaning carried by that position, which better matches the characteristics of Chinese. With embedded vectors encoded in this way, the named entity recognition model can recognize named entities in Chinese more easily, making Chinese named entity recognition more efficient and accurate.

Third, because the accuracy of named entity recognition is improved through the encoding process, the total number of data samples required to train the corresponding named entity recognition model can be reduced, the model can be trained to convergence more easily, training efficiency is improved, and training cost is saved.

In addition, keywords obtained from the text to be recognized with this improved accuracy can be used in scenarios such as online search, search keyword association, data profiling and recommendation search, where they yield a more accurate semantic matching effect.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of a keyword recognition method of the present application;

FIG. 2 is a schematic flow chart illustrating a process of encoding a text to be recognized according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating a process of encoding according to a character position according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating a process of identifying a named entity identification model according to an embodiment of the present disclosure;

FIG. 5 is a schematic flow chart diagram of one of the expanded embodiments of the keyword recognition method of the present application;

FIG. 6 is a schematic flowchart of another expanded embodiment of the keyword recognition method of the present application;

FIG. 7 is a functional block diagram of a keyword recognition apparatus of the present application;

FIG. 8 is a schematic structural diagram of a computer device used in the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices having only a wireless signal receiver without transmit capability, and devices having receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device, such as a personal computer or tablet, with a single-line display, with a multi-line display, or without a multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client" or "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client" or "terminal device" used herein may also be a communication terminal, a web terminal, or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, and may also be a smart TV, a set-top box, and the like.

The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the capabilities of a personal computer. It is a hardware device having the necessary components of the von Neumann architecture, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device and an output device. A computer program is stored in the memory, and the central processing unit loads the program from external memory into internal memory, executes its instructions, and interacts with the input and output devices to complete a specific function.

It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.

Unless expressly specified otherwise, one or more technical features of the present application may be deployed on a server and accessed by a client remotely invoking an online service interface provided by the server, or may be deployed and run directly on the client.

Unless expressly specified otherwise, any neural network model referred to, or possibly referred to, in the present application can be deployed on a remote server and invoked remotely by a client, or deployed on a client with sufficient device capability and invoked directly.

Unless expressly specified otherwise, the various data referred to in the present application may be stored remotely on a server or locally on a terminal device, as long as the data can be called by the technical solution of the present application.

The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.

Unless a mutually exclusive relationship between related technical features is expressly stated, the embodiments disclosed herein can be flexibly constructed by combining the related technical features of the embodiments, as long as the combination does not depart from the inventive spirit of the present application and can meet a need in the prior art or remedy a deficiency of the prior art. Those skilled in the art will appreciate such variations.

The keyword recognition method of the present application can be programmed as a computer program product and deployed to run on a client or a server. In the e-commerce platform application scenario of the present application, for example, it is generally deployed on a server, so that once the computer program product is running, the method can be executed by accessing its open interface and interacting with its process through a graphical user interface.

Referring to fig. 1, in an exemplary embodiment of the keyword recognition method of the present application, the method includes the following steps:

Step S1100, acquiring a text to be recognized on which named entity recognition is to be performed:

The text to be recognized can be obtained from a user request, retrieved from a database (for example, the commodity title of a commodity object in a commodity database), or read from local storage such as the system clipboard. Depending on the downstream task to which the present application is applied, the input text is flexibly obtained as the text to be recognized, named entity recognition is performed on it, and one or more keywords corresponding to named entities are extracted from it.

The downstream task is determined by the specific application scenario of the technical solution of the present application, such as online search, search keyword association, data profiling, or recommendation search.

In a first example, in an online search scenario, after a user submits an original search expression on a client device, the original search expression is used as the text to be recognized for named entity recognition, a plurality of keywords are extracted, the keywords are logically combined to form an optimized search expression, and the online search is then performed.

In a second example, in a search keyword association scenario, while the user is typing a search keyword, the text entered so far is used as the text to be recognized for named entity recognition. The background then performs word association based on the keywords obtained by named entity recognition, and the associated words are displayed for the user to select.

In a third example, in a data profiling scenario, taking the e-commerce field as an example, text information such as the commodity title or commodity details of a commodity object can be acquired directly from a client device or a commodity database and used as the text to be recognized. Named entity recognition is performed to obtain the corresponding keywords, the commodity object is labeled with them, and the keywords serve as data profile tags of the commodity object.

In a fourth example, in a recommendation search scenario, again taking the e-commerce field as an example, the input from a client device may be acquired as the text to be recognized and named entity recognition performed on it. Corresponding commodity objects are then retrieved from the commodity database according to the recognized keywords, and a commodity recommendation list is constructed and pushed to the user.

These examples show that the corresponding input text can be obtained as the text to be recognized according to the application scenario in order to start the named entity recognition process of the present application. Named entity recognition is in fact a fundamental function in natural language processing and can be widely applied in various fields, as those skilled in the art know. The scope of protection covered by the inventive spirit of the present application should therefore not be limited by the field of application.

Preferably, in view of the advantages of the encoding principle implemented by the present application, the text to be recognized is a Chinese text, or a text in another language in which a single character can independently express a meaning and a combination of multiple characters can independently express the meaning of a word.

Step S1200, vectorizing the text to be recognized to obtain an embedded vector corresponding to each character in the text to be recognized, where the embedded vector includes the character vector of the character and word vectors obtained by classifying and encoding all possible participles of the character according to the position at which the character occurs in each participle:

this step encodes the text to be recognized, vectorizing it to obtain the corresponding embedded vector. Specifically, taking Chinese as an example, each character in the text to be recognized is encoded separately.

When each character is encoded, a participle set is obtained for it from a preset dictionary; the set contains all possible participles of that character in the text to be recognized. For example, for the character "shan" (山, "mountain") in "Zhongshanxi Lu" (中山西路, Zhongshan West Road), the participle set determined by matching against the dictionary may include { "Zhongshan" (中山); "Zhongshanxi" (中山西); "Zhongshanxi Lu" (中山西路); "Shanxi" (山西); "Shanxi Lu" (山西路); "shan" (山) }. As this example shows, the single character "shan" occurs at different positions in the participles of its participle set: the first character position, the end character position, a middle character position, and the single character position. For ease of understanding, the first character position is labeled B, the end character position E, the middle character position M, and the single character position S. Different classes can be constructed according to these occurrence positions, accordingly:

B { "Shanxi" (山西); "Shanxi Lu" (山西路) }

E { "Zhongshan" (中山) }

M { "Zhongshanxi" (中山西); "Zhongshanxi Lu" (中山西路) }

S { "shan" (山) }

It can be seen that several participle subsets can be formed according to the position at which each character occurs in its possible participles. The word vectors of each participle subset are then synthesized, for example by taking their mean or weighted mean, and the resulting vectors are combined with the character vector of the character to form the vector representation of that character; the vector representations of all characters together form the embedded vector of the text to be recognized. In this process, the synthesized result for each participle subset independently represents the common features of the participles at the corresponding position. Combining these results with the character vector means that the representation of a single character carries the common features associated with each occurrence position and thus highlights the meaning the character takes at different positions. The embedded vector finally obtained for the text to be recognized therefore carries comprehensive semantic information, and named entity recognition performed on it naturally has richer semantics to draw on.

Step S1300, extracting a text feature vector from the embedded vector by adopting a named entity recognition model trained to a convergence state, and extracting keywords corresponding to named entities from the text to be recognized according to the text feature vector:

various named entity recognition models known in the prior art can be used to recognize named entities. Such models are usually built on a base network architecture such as an LSTM or a Transformer, including but not limited to an architecture that combines a Lattice LSTM base model with a conditional random field model, an architecture in which BERT works alone, or one in which BERT is combined with a conditional random field model. The named entity recognition model is of course trained to a convergence state before it is used in the present application, so that it has learned the corresponding capability. The model performs representation learning on the embedded vector obtained by encoding the text to be recognized as disclosed in the previous step, yielding a text feature vector that carries deep semantic information; part-of-speech tagging is performed on the basis of the text feature vector, and the keywords corresponding to each named entity are obtained from the part-of-speech tagging result.

When the named entity recognition model extracts features from the text to be recognized, the embedded vector obtained by the encoding of the present application expresses semantic features according to the position at which each character occurs in its participles. The semantic information is therefore richer, and so are the semantics contained in the resulting high-level semantic vector. For languages such as Chinese, in which meaning can be expressed both by single characters and by combinations of characters, this gives a good representation learning effect, so that named entity recognition can be performed more accurately on text in such languages.

Through the introduction of the embodiment, it can be seen that the present application has various positive effects, including but not limited to the following aspects:

Referring to fig. 2, in a further embodiment, the step S1200 of vectorizing the text to be recognized to obtain an embedded vector corresponding to each character in the text to be recognized includes the following steps:

step S1210, matching the text to be recognized with a preset dictionary to obtain all the participles corresponding to each character in the text to be recognized:

the named entity recognition model is correspondingly provided with a dictionary, and the dictionary contains various vocabularies for referring named entities. In order to realize word embedding of a text to be recognized, firstly, according to the dictionary, based on each character of the text to be recognized, all possible participles containing the character are matched from the dictionary, a participle set is constructed for each character, and the participle set contains all possible participles in which the character appears.
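The dictionary matching described in this step can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `match_segments`, the brute-force substring scan, and the six-entry lexicon are assumptions, and the "Zhongshan West road" example is written in its assumed original characters (中山西路).

```python
from collections import defaultdict

def match_segments(text, lexicon, max_len=None):
    """Map each character index to every lexicon word that covers it.

    Each entry is (start, end, word) with `end` exclusive.
    """
    n = len(text)
    max_len = max_len or n
    per_char = defaultdict(list)
    for start in range(n):
        # Try every substring beginning at `start`, up to max_len characters.
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = text[start:end]
            if word in lexicon:
                for i in range(start, end):
                    per_char[i].append((start, end, word))
    return dict(per_char)

# Hypothetical dictionary holding the participles of the running example.
lexicon = {"中山", "中山西", "中山西路", "山西", "山西路", "山"}
segments = match_segments("中山西路", lexicon)

# Participle set of the character at index 1 ("山").
words_for_shan = [word for _, _, word in segments[1]]
print(words_for_shan)
```

A production system would more likely use a trie or an Aho-Corasick automaton for the lookup; the quadratic scan is kept here only because it makes the matching rule easy to read.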

Step S1220, for each character in the text to be recognized, dividing all the participles of the character into a plurality of participle subsets according to the difference of the position where the character appears in the participle including the character:

next, for each character in the text to be recognized, dividing the corresponding word segmentation set into a plurality of word segmentation subsets, where the division is expressed by the following formula as described in the previous embodiment:

wherein $c_i$ denotes a specific character and $L$ denotes the preset dictionary corresponding to the named entity recognition model; $n$ is the total length of the text to be recognized; $i$, $j$, and $k$ each denote the serial number of a character position in the text to be recognized; $w$ denotes the participle occupying the span of the text to be recognized defined by its two subscripts; and B, M, E, S denote the first character position, the middle character position, the end character position, and the single character position respectively, as described above.
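Consistent with these symbol definitions, the four participle subsets of a character $c_i$ can be written as below. This is a reconstruction under stated assumptions, not the formula as filed: positions are taken to be 1-indexed, and $w_{j,k}$ is taken to denote the substring of the text to be recognized running from position $j$ to position $k$ inclusive.

```latex
\begin{aligned}
B(c_i) &= \{\, w_{i,k} \mid w_{i,k} \in L,\ i < k \le n \,\} \\
M(c_i) &= \{\, w_{j,k} \mid w_{j,k} \in L,\ 1 \le j < i < k \le n \,\} \\
E(c_i) &= \{\, w_{j,i} \mid w_{j,i} \in L,\ 1 \le j < i \,\} \\
S(c_i) &= \{\, c_i \mid c_i \in L \,\}
\end{aligned}
```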

Referring to fig. 3, in one embodiment, according to the formula example herein, the step can be implemented as the following specific steps:

step S1221, for each character in the text to be recognized, constructing all the participles with the first character as the character as a first participle subset:

for each character, all the participles whose first character is that character are extracted and constructed as a first participle subset, as in the previous example: B { "Shanxi" (山西); "Shanxi Lu" (山西路) }.

Step S1222, for each character in the text to be recognized, constructing all the participles whose end character is the character as a second participle subset:

for each character, all the participles whose end character is that character are extracted and constructed as a second participle subset, as in the previous example: E { "Zhongshan" (中山) }.

Step S1223, for each character in the text to be recognized, constructing all the participles whose middle positions include the character as a third participle subset:

for each character, all the participles that contain the character at a middle position are extracted and constructed as a third participle subset, as in the previous example: M { "Zhongshanxi" (中山西); "Zhongshanxi Lu" (中山西路) }.

Step S1224, for each character in the text to be recognized, constructing the participle including only the character as a fourth participle subset:

for each character, the participle that contains only the character, i.e. the character standing alone as a word, is extracted and constructed as a fourth participle subset, as in the previous example: S { "shan" (山) }.

This completes the classification of the participles of each character according to the formula. Each character now has four sets B, M, E, and S (empty sets are allowed), that is, a plurality of participle subsets, and the word vectors of the participles in each subset can then be obtained subset by subset.
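Steps S1221 to S1224 can be sketched in a single pass over the matched substrings. The function name `classify_bmes` and the small lexicon are hypothetical, and the running example is written in its assumed original characters (中山西路).

```python
def classify_bmes(text, lexicon):
    """Return, for every character position, its B/M/E/S participle subsets."""
    n = len(text)
    subsets = [{"B": [], "M": [], "E": [], "S": []} for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, n + 1):  # `end` is exclusive
            word = text[start:end]
            if word not in lexicon:
                continue
            if end - start == 1:
                # Single-character participle: S subset of that character.
                subsets[start]["S"].append(word)
                continue
            subsets[start]["B"].append(word)    # character is the first character
            subsets[end - 1]["E"].append(word)  # character is the end character
            for i in range(start + 1, end - 1):
                subsets[i]["M"].append(word)    # character is in the middle
    return subsets

lexicon = {"中山", "中山西", "中山西路", "山西", "山西路", "山"}
subsets = classify_bmes("中山西路", lexicon)
shan = subsets[1]  # subsets of the character "山"
print(shan)
```

For the character "山" this reproduces the four subsets listed in the example, and positions with no matching participle keep empty lists, which corresponds to the empty sets the text allows.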

Step S1230, for each character in the text to be recognized, performing word compression on the word vector corresponding to the word in each participle subset corresponding to the character to obtain each classification vector encoding the word vector of each participle subset:

for each character in the text to be recognized, word compression can be performed on each participle subset of the character to be recognized so as to obtain a classification vector corresponding to each participle subset. Specifically, word vectors corresponding to the participles in each participle subset are obtained, and word vector compression is performed on the participle subset based on the word vectors of the same participle subset, wherein a specific compression mode can be any one of the following modes:

in an optional compression mode, word compression is implemented by averaging the word vectors of all the participles in the participle subset to obtain the classification vector corresponding to the participle subset:

this method is mainly implemented based on the principle of averaging, please refer to the formula:

in the formula, $S$ denotes a specific participle subset, $w$ denotes a participle in that subset, and $e^w$ denotes the word vector of participle $w$.

According to the formula, the classification vector $v^s$ of a participle subset is obtained by summing the word vectors of the participles in the subset and taking the average. The classification vector thus gives a comprehensive representation of the word vectors in that subset and captures what the participles have in common for the position at which the character occurs in them.
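Read against the symbol definitions above, the averaging mode amounts to $v^s(S) = \frac{1}{|S|}\sum_{w \in S} e^w(w)$; this expression is a reconstruction from the surrounding text. A minimal sketch follows, in which the function name `mean_pool`, the toy two-dimensional word vectors, and the choice of a zero vector for an empty subset are all assumptions.

```python
def mean_pool(words, word_vectors, dim):
    """Average the word vectors of one participle subset.

    An empty subset yields a zero vector (an assumption; the text only
    says that empty subsets are allowed).
    """
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        for d, value in enumerate(word_vectors[word]):
            total[d] += value
    return [t / len(words) for t in total]

# Toy word vectors, purely illustrative.
word_vectors = {"山西": [1.0, 3.0], "山西路": [3.0, 5.0]}
v_b = mean_pool(["山西", "山西路"], word_vectors, dim=2)
v_empty = mean_pool([], word_vectors, dim=2)
print(v_b, v_empty)
```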

In another optional compression mode, word compression is implemented by weighting the word vectors of all the participles in the participle subset and then averaging them, to obtain the classification vector corresponding to the participle subset:

this method is mainly implemented based on the weighted mean principle, please refer to the formula:

comparing this formula with the previous one, the only difference is that in this compression mode each word vector is matched with a corresponding weight $z(w)$. In the preferred embodiment the weights are static: the frequency with which each participle appears in a static data set is used as its weight. Matching the weights in this way can speed up training of the named entity recognition model. When the weights are matched, they should be normalized with the whole participle set of the character taken into account, that is, across all of its participle subsets. The static data set is generally derived from the training data and the validation data. When counting, if a longer substring containing $w$ is matched, the frequency of $w$ need not be increased.
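The weighted mode can be sketched as below, taking $v^s(S) = \frac{1}{Z}\sum_{w \in S} z(w)\, e^w(w)$ with $Z$ the sum of $z(w)$ over all four subsets of the character. Both this expression and its lack of any additional constant scale factor are assumptions made for illustration; the function name `weighted_pool`, the toy vectors, and the frequency table are likewise hypothetical.

```python
def weighted_pool(subsets, word_vectors, freq, dim):
    """Frequency-weighted pooling of each participle subset of one character.

    Weights are normalized by the total frequency over all subsets of the
    character, so the four classification vectors share one normalizer.
    """
    total_weight = sum(freq.get(w, 0) for words in subsets.values() for w in words)
    pooled = {}
    for tag, words in subsets.items():
        vec = [0.0] * dim
        if total_weight > 0:
            for word in words:
                weight = freq.get(word, 0) / total_weight
                for d, value in enumerate(word_vectors[word]):
                    vec[d] += weight * value
        pooled[tag] = vec
    return pooled

# Toy data: static frequencies counted on a hypothetical static data set.
subsets = {"B": ["山西", "山西路"], "M": [], "E": ["中山"], "S": []}
word_vectors = {"山西": [4.0, 0.0], "山西路": [0.0, 4.0], "中山": [2.0, 2.0]}
freq = {"山西": 1, "山西路": 1, "中山": 2}

pooled = weighted_pool(subsets, word_vectors, freq, dim=2)
print(pooled)
```

Because the weights are fixed counts rather than learned parameters, this pooling adds no trainable parameters, which is consistent with the faster training the text attributes to static weights.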

So far, by adopting a weighted averaging manner, the classification vectors corresponding to the participle subsets of each character have been obtained, and the classification vectors of a character can be written respectively as $v^s(B)$, $v^s(E)$, $v^s(M)$, $v^s(S)$.

Step S1240, splicing each character in the text to be recognized and all the classification vectors thereof to obtain the embedded vector corresponding to each character:

finally, let:

$$e^s(B,M,E,S) = [\,v^s(B);\ v^s(E);\ v^s(M);\ v^s(S)\,]$$

and then:

$$x^c \leftarrow [\,x^c;\ e^s(B,M,E,S)\,]$$

that is, for each character in the text to be recognized, its character vector $x^c$ is simply spliced with all the classification vectors of its participle subsets to obtain the embedded vector of that character. The embedded vectors of all the characters together form the embedded vector of the text to be recognized, which is its vectorized representation and completes the encoding of the text to be recognized.
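The splicing step can be sketched as a plain concatenation. The function name `build_char_embedding`, the toy two-dimensional vectors, and the default order B, E, M, S (taken from the order in which the classification vectors are listed in this step) are assumptions for illustration.

```python
def build_char_embedding(char_vector, class_vectors, order=("B", "E", "M", "S")):
    """Concatenate a character vector with its four classification vectors."""
    embedded = list(char_vector)
    for tag in order:
        embedded.extend(class_vectors[tag])
    return embedded

# Toy vectors, purely illustrative.
char_vector = [0.1, 0.2]
class_vectors = {
    "B": [1.0, 1.0],
    "E": [2.0, 2.0],
    "M": [3.0, 3.0],
    "S": [4.0, 4.0],
}
embedded = build_char_embedding(char_vector, class_vectors)
print(len(embedded), embedded)
```

With a character vector of dimension $d_c$ and word vectors of dimension $d_w$, each character's embedded vector has dimension $d_c + 4 d_w$, and it keeps that fixed size however many participles were matched.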

This embodiment addresses the representation learning needs of the named entity recognition model and discloses the encoding of the text to be recognized in detail. The classification vectors compress the word vectors of each participle subset and extract the semantics of the features that the participles in each subset share with respect to character position. This improves the semantic representation available during representation learning and lays a reliable foundation for named entity recognition. In practice the encoding process can be varied flexibly; for example, any participle subset of a character may be an empty set, and no subset is required to contain at least one participle.

Referring to fig. 4, in a further embodiment, in the step S1300, extracting a text feature vector from the embedded vector by using a named entity recognition model trained to a convergence state, and extracting a keyword corresponding to a named entity from a text to be recognized according to the text feature vector, the method includes the following steps:

step S1310, extracting deep semantic information of the embedded vector by using the text feature extraction model in the named entity model, and obtaining a corresponding text feature vector:

the text feature extraction model is preferably implemented by using Lattice LSTM, and the model refers to context to perform representation learning on the embedded vector of the text to be recognized, which is obtained by pre-coding, so as to obtain the corresponding text feature vector.

Step S1320, performing part-of-speech tagging according to the text feature vector by using the conditional random field model in the named entity model, and extracting a plurality of keywords representing the named entity from the text to be recognized according to the part-of-speech tagging result:

the text feature vector is then input into a conditional random field (CRF) model for part-of-speech tagging. The CRF combines the probability matrix output by the Lattice LSTM with its own state transition matrix to make the prediction and complete the tagging, and the keywords corresponding to the named entities in the text to be recognized are extracted from the part-of-speech tagging result.
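The way a CRF layer combines per-character scores with a state transition matrix at prediction time is commonly realized by Viterbi decoding. The sketch below is a generic illustration, not the model of the application: the three-label scheme (O, B, I), the hand-written score matrices, and the sample text are all assumptions.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions[t][k]: score of tag k at character t (the feature extractor's output).
    transitions[i][j]: score of moving from tag i to tag j (the CRF's matrix).
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + step[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    path = [max(range(num_tags), key=lambda i: scores[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

def extract_keywords(text, tags):
    """Collect the character spans labeled B followed by I as keywords."""
    keywords, current = [], ""
    for ch, tag in zip(text, tags):
        if tag == "B":
            if current:
                keywords.append(current)
            current = ch
        elif tag == "I" and current:
            current += ch
        else:
            if current:
                keywords.append(current)
            current = ""
    if current:
        keywords.append(current)
    return keywords

labels = ["O", "B", "I"]
# Hypothetical scores for a five-character text; O -> I is penalized.
emissions = [[0, 2, 0], [0, 0, 2], [0, 0, 2], [0, 0, 2], [2, 0, 0]]
transitions = [[0, 0, -10], [0, 0, 0], [0, 0, 0]]

path = viterbi_decode(emissions, transitions)
keywords = extract_keywords("中山西路店", [labels[k] for k in path])
print(path, keywords)
```

The transition matrix is what lets the decoder reject label sequences that are locally plausible but globally invalid, such as an I label directly after O.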

Similarly, the LSTM can be replaced by a Transformer-based model such as BERT. Although such models can perform the part-of-speech tagging task on their own, combining them with a conditional random field can significantly improve the accuracy of named entity extraction, so this combination is recommended.

This embodiment gives a more specific model structure to further show the practicability of the technical solution of the present application. In practice, the model recognizes named entities more accurately when it uses the embedded vector of the text to be recognized obtained by the encoding of the present application, and using this embedded vector also speeds up model convergence during training, which saves training cost.

Referring to fig. 5, in an extended embodiment, after the step S1300 of extracting the text feature vector from the embedded vector by using the named entity recognition model trained to the convergence state, and extracting the keyword corresponding to the named entity from the text to be recognized according to the text feature vector, the method includes the following steps:

step S1400, constructing a search expression according to the keywords, calling a commodity search engine to obtain a commodity list matched with the search expression, and pushing the commodity list to a search requester providing the text to be identified:

this embodiment addresses online search in the e-commerce field. The text to be recognized is the original search expression entered by the user. The user enters it on the terminal interface and, after confirmation and submission, the server obtains it. The server encodes the text to be recognized through the relevant steps of the above embodiments to obtain the corresponding embedded vector, and then uses the named entity recognition model to predict named entities from the embedded vector and obtain the corresponding keywords. On this basis, the keywords are combined according to a preset rule, for example by joining them with a logical AND operation, to form a new search expression. The commodity search engine is then called with the new search expression to search the commodity database, and the commodity list matching the search expression is pushed to the client device of the requesting user for display.
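Combining the recognized keywords with a logical AND might look like the sketch below. The function name `build_search_expression` and the quoted `"a" AND "b"` syntax are hypothetical; the query syntax a real commodity search engine accepts may well differ.

```python
def build_search_expression(keywords, operator="AND"):
    """Join de-duplicated, non-empty keywords with a logical operator."""
    seen, unique = set(), []
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword and keyword not in seen:
            seen.add(keyword)
            unique.append(keyword)
    # Quote each keyword so that it is treated as one search term.
    return f" {operator} ".join(f'"{keyword}"' for keyword in unique)

expression = build_search_expression(["连衣裙", "红色", "连衣裙", " "])
print(expression)
```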

In this embodiment, the technical solutions of the other embodiments of the present application are applied to an online search scenario in the e-commerce field, which shows the technical advantages of the present application. Because of the beneficial effects described earlier, the keywords obtained by named entity recognition are more accurate. The commodity list retrieved from the commodity database with the new search expression, formed according to the preset logical combination rule, therefore better matches the intention the user expressed in the text to be recognized, and commodities are searched accurately.

Referring to fig. 6, in another embodiment of the expansion, after the step S1300 of extracting the text feature vector from the embedded vector by using the named entity recognition model trained to the convergence state, and extracting the keyword corresponding to the named entity from the text to be recognized according to the text feature vector, the method includes the following steps:

step S1500, labeling the commodity object carrying the text to be recognized by using the keywords, so that the keywords form the portrait label of the commodity object:

this embodiment addresses the need to build data portraits of commodity objects in a commodity database in the e-commerce field. The commodity title and/or commodity detail text of a commodity object, as the material for determining its data portrait tags, is taken as the text to be recognized. The keywords obtained from it can be used as the tags required for the data portrait of the commodity object, and labeling the commodity object with these tags directly completes its data portrait.

In this embodiment, the technical solutions of the other embodiments of the present application are applied to a commodity data portrait scenario in the e-commerce field, which shows the technical advantages of the present application. Because of the beneficial effects described earlier, the keywords obtained by named entity recognition are more accurate, so using them as the data portrait tags of a commodity object allows the data portrait tagging of the commodity object to be done quickly, efficiently, and accurately.

Referring to fig. 7, a keyword recognition apparatus adapted to one of the objectives of the present application is a functional implementation of the keyword recognition method of the present application. The apparatus includes a text acquisition module 1100, a text encoding module 1200, and an entity recognition module 1300. The text acquisition module 1100 is configured to acquire a text to be recognized on which named entity recognition is to be performed. The text encoding module 1200 is configured to vectorize the text to be recognized to obtain an embedded vector corresponding to each character in the text to be recognized, where the embedded vector includes the character vector of the character and word vectors obtained by classifying and encoding all possible participles of the character according to the position at which the character occurs in each participle. The entity recognition module 1300 is configured to extract a text feature vector from the embedded vector by using a named entity recognition model trained to a convergence state, and to extract keywords corresponding to named entities from the text to be recognized according to the text feature vector.

In a further embodiment, the text encoding module 1200 includes: the text word segmentation sub-module is used for matching the text to be recognized with a preset dictionary to obtain all word segments corresponding to each character in the text to be recognized; the classification structure sub-module is used for dividing all participles of the character into a plurality of participle subsets according to different positions of the character in the participle containing the character aiming at each character in the text to be recognized; the classification compression submodule is used for performing word compression on word vectors corresponding to the words in each word segmentation subset corresponding to each character aiming at each character in the text to be recognized to obtain each classification vector of the word vectors of each word segmentation subset; and the vector synthesis submodule is used for splicing each character in the text to be recognized and all the classification vectors thereof to obtain the embedded vector corresponding to each character.

In a further embodiment, the entity identification module 1300 includes: the expression learning submodule is used for extracting deep semantic information of the embedded vector by adopting a text feature extraction model in the named entity model to obtain a corresponding text feature vector; and the entity extraction submodule is used for performing part-of-speech tagging according to the text feature vector by adopting a conditional random field model in the named entity model, and extracting a plurality of keywords representing named entities from the text to be recognized according to part-of-speech tagging results.

In order to solve the technical problem, an embodiment of the present application further provides a computer device. As shown in fig. 8, the internal structure of the computer device is schematically illustrated. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions, when executed by the processor, can make the processor implement a keyword recognition method. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform the keyword recognition method of the present application. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 7, and the memory stores program codes and various data required for executing the modules or the sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data necessary for executing all modules/sub-modules in the keyword recognition apparatus of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.

The present application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the keyword recognition method of any of the embodiments of the present application.

The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).

In summary, in the encoding process of the text to be recognized, the method and the device classify and encode the participles containing the characters according to different positions of the characters appearing in the participles, improve the semantic representation capability of embedded vectors obtained by encoding, enable the named entity recognition model to improve the accuracy of named entity recognition, are particularly suitable for processing Chinese texts, and enable the obtained keywords to improve the execution effect of various downstream tasks.

Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.

The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims

1. A keyword recognition method is characterized by comprising the following steps:

2. The method for recognizing the keywords according to claim 1, wherein the vectorization of the text to be recognized is performed to obtain the embedded vector corresponding to each character in the text to be recognized, and the method comprises the following steps:

3. The keyword recognition method according to claim 2, wherein for each character in the text to be recognized, dividing all participles of the character into a plurality of participle subsets according to the different positions of the character appearing in the participles containing the character, comprises the steps of:

4. The keyword recognition method according to claim 2, wherein, for each character in the text to be recognized, the step of performing word compression on the word vector corresponding to the word in each participle subset corresponding to the character to obtain each classification vector encoding the word vector of each participle subset is performed in any one of the following manners:

5. The keyword recognition method according to claim 1, wherein a named entity recognition model trained to a convergence state is used to extract text feature vectors from the embedded vectors, and keywords corresponding to named entities are extracted from the text to be recognized according to the text feature vectors, comprising the following steps:

6. The keyword recognition method according to any one of claims 1 to 5, wherein after the step of extracting the text feature vector from the embedded vector by using the named entity recognition model trained to the convergence state and extracting the keywords corresponding to the named entity from the text to be recognized according to the text feature vector, the method comprises the following steps:

7. The keyword recognition method according to any one of claims 1 to 5, wherein after the step of extracting the text feature vector from the embedded vector by using the named entity recognition model trained to the convergence state and extracting the keywords corresponding to the named entity from the text to be recognized according to the text feature vector, the method comprises the following steps:

8. A keyword recognition apparatus, comprising:

the text acquisition module is used for acquiring a text to be identified for the named entity identification to be executed;

the text coding module is used for vectorizing the text to be recognized to obtain embedded vectors corresponding to all characters in the text to be recognized, wherein the embedded vectors comprise word vectors of the characters and word vectors obtained by classifying and coding all possible participles of the characters according to the occurrence positions of the characters in the participles;

and the entity recognition module is used for extracting a text characteristic vector from the embedded vector by adopting a named entity recognition model trained to be in a convergence state, and extracting a keyword corresponding to the named entity from the text to be recognized according to the text characteristic vector.

9. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 6.

10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 6, which, when invoked by a computer, performs the steps comprised by the corresponding method.

11. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 6.