CN113051919B - Method and device for identifying named entity - Google Patents
- Publication number
- CN113051919B CN201911369966.1A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for identifying a named entity, comprising: a matching step of matching fields in a text to slots, in order from high priority to low priority, based on preset priorities of the slots in the named entity; a connecting step of connecting the fields in the matched slots according to the priority and a preset logical relationship to obtain a connection result; and a query step of querying the connection result in a named entity candidate list, determining one or more named entities in the named entity candidate list that match the connection result, and taking the one or more named entities as the identified named entities.
Description
Technical Field
The present disclosure relates to a method of natural language processing, and more particularly, to a method and apparatus for identifying named entities in natural language.
Background
In research involving natural language processing (e.g., information extraction, information retrieval, machine translation, question answering systems, etc.), it is often necessary to identify entity names in natural language text, i.e., to extract named entities from the unstructured information of natural narratives. Named entities are collections of words of specific types, including item names, person names, place names, organization names, times, quantities, proper nouns, etc.; more broadly, a named entity can be any special text segment that meets a specific need. Named entity recognition is a fundamental task of natural language processing and is widely used in natural language based information extraction and retrieval.
A slot corresponds to an information element that needs to be acquired when processing natural language. For example, finding a named entity that corresponds to a particular appliance requires knowing elements such as its brand and model, each of which may be regarded as a slot.
Named entity recognition is usually performed by means of machine learning (conditional random fields, etc.), keyword matching, and the like. When the named entities in the input natural language text are common nouns, recognition is straightforward.
However, in certain domains, named entities may have more complex structures. For example, in the field of electronic product names, a named entity may have a multi-layered structure such as brand-series-model. If such named entities are identified with the traditional keyword matching method, variants of the named entity (e.g., aliases), incomplete named entities (e.g., a missing series name or model), and named entities whose fields appear out of order cannot be successfully identified. If a statistics-based machine learning method is adopted instead, a large amount of professionally hand-labeled corpora is needed, so the machine learning approach is costly and yields little return.
Therefore, there is a need for a method and system for efficiently and accurately identifying complex named entities.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
According to one aspect of the present disclosure, there is provided a method of identifying a named entity, comprising: a matching step of matching fields in a text to slots, in order from high priority to low priority, based on preset priorities of the slots in the named entity; a connecting step of connecting the fields in the matched slots according to the priority and a preset logical relationship to obtain a connection result; and a query step of querying the connection result in a named entity candidate list, determining one or more named entities in the named entity candidate list that match the connection result, and taking the one or more named entities as the identified named entities.
According to another aspect of the present disclosure, there is provided an apparatus for identifying named entities, comprising: a matching unit configured to match fields in a text to slots, in order from high priority to low priority, based on preset priorities of the slots in the named entity; a connection unit configured to connect the fields in the matched slots according to the priority and a preset logical relationship to obtain a connection result; and a query unit configured to query the connection result in a named entity candidate list, determine one or more named entities in the named entity candidate list that match the connection result, and take the one or more named entities as the identified named entities.
According to another aspect of the invention, there is provided a system for identifying named entities, the system comprising: one or more processors; and one or more memories configured to store a series of computer-executable instructions, wherein the series of computer-executable instructions, when executed by the one or more processors, cause the one or more processors to perform the method as described above.
According to another aspect of the invention, there is provided a non-transitory computer-readable medium having stored thereon computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform a method as described above.
Other features of the present disclosure and advantages thereof will become more apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is an exemplary flow diagram illustrating a method of identifying a named entity in accordance with one exemplary embodiment of the present invention.
FIG. 2 is an exemplary flowchart illustrating a method of identifying a named entity according to another exemplary embodiment of the present invention.
FIG. 3 is a detailed flow diagram illustrating a method of identifying a named entity in accordance with an exemplary embodiment of the present invention.
FIG. 4 is a detailed flowchart illustrating a similarity calculation step using a dynamic programming matrix according to an exemplary embodiment of the present invention.
FIG. 5 is a schematic diagram showing the constitution of a system according to an exemplary embodiment of the present invention.
FIG. 6 is an exemplary configuration diagram illustrating a computing device in which embodiments in accordance with the invention may be implemented.
Note that in the embodiments described below, the same reference numerals are used in common between different drawings to denote the same portions or portions having the same functions, and a repetitive description thereof will be omitted. In some cases, similar items are indicated using similar reference numbers and letters, and thus, once an item is defined in a figure, it need not be discussed further in subsequent figures.
For convenience of understanding, the positions, dimensions, ranges, and the like of the respective structures shown in the drawings and the like do not necessarily indicate actual positions, dimensions, ranges, and the like. Therefore, the present disclosure is not limited to the positions, dimensions, ranges, and the like disclosed in the drawings and the like.
Detailed Description
Various exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. That is, the structures and methods herein are shown by way of example to illustrate different embodiments of the structures and methods of the present disclosure. Those skilled in the art will understand, however, that they are merely illustrative of exemplary ways in which the disclosure may be practiced and not exhaustive. Furthermore, the figures are not necessarily to scale, some features may be exaggerated to show details of particular components.
The present disclosure provides a method for identifying a named entity. In a matching step, fields in a text are matched to slots in the named entity, in order from high priority to low priority, based on preset priorities of the slots. If the highest-priority slot matches nothing, named entity identification returns no result. If the highest-priority slot matches a field, the slots of the next priority level are matched. The fields in the matched slots are then connected according to the priority order and a preset logical relationship, and all connection results are output. Next, the connection results are queried in a named entity candidate list, one or more named entities in the list that match the connection results are determined, and those named entities are taken as the identified named entities.
Compared with the traditional method of directly matching keywords, the technical scheme of the present disclosure can accurately cover variations of named entities (for example, aliases of smartphones, different models of appliances under the same brand, names that do not follow the brand-model order, and the like).
FIG. 1 is an exemplary flow chart illustrating a method 100 of identifying a named entity in accordance with one exemplary embodiment of the present invention. As shown in FIG. 1, the method 100 of identifying a named entity may include: a matching step 110, a connecting step 120 and a querying step 130.
First, in a matching step 110, based on predetermined priorities of the slots in a named entity, fields in the input text (e.g., a natural language sentence) are matched to the slots in order of priority from high to low. If the highest-priority slot matches nothing, named entity identification returns no result. If the highest-priority slot matches a field, the slots of the next priority level are matched. Then, the process proceeds to the connecting step 120.
In some embodiments, the predetermined priorities of the slots are obtained by classifying the fields of the named entities in the candidate list into a plurality of slots and assigning a priority to each slot based on statistics over all named entities in the named entity candidate list. For example, when the named entities to be recognized belong to the field of smart terminals, the name information of all terminals is first compiled (e.g., based on information collected or crawled from the websites of the various terminal brands). After the information has been compiled, slot categories are designed for the smart terminal information.
In some embodiments, there may be multiple slots of the same priority to be matched. In this case, the slots of the same priority may be matched in parallel before the slots of the next priority are matched. Continuing with the smart terminal example, the designed slots may include the brand, series, model, alias, etc. of the smart terminal. These slots are then given priorities, e.g., brand and series have the highest priority, and model and alias have the lower priority.
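The matching step can be illustrated with a minimal Python sketch. The slot vocabulary `SLOT_TIERS`, the function name `match_slots`, and the use of case-insensitive substring search are assumptions made for illustration; the disclosure does not specify how a field is matched to a slot.

```python
from typing import Dict, List, Optional

# Hypothetical slot vocabulary grouped into priority tiers, highest first.
# Slots inside one tier share the same priority.
SLOT_TIERS: List[Dict[str, List[str]]] = [
    {"brand": ["Samsung", "Huawei"], "series": ["Galaxy", "Ascend"]},
    {"model": ["A3", "P7"], "alias": ["A3009"]},
]


def match_slots(
    text: str, tiers: List[Dict[str, List[str]]] = SLOT_TIERS
) -> Optional[List[Dict[str, List[str]]]]:
    """Match fields of `text` to slots tier by tier, from high to low priority.

    Returns one dict of matched fields per tier, or None when no slot of the
    highest-priority tier matches (identification then yields no result).
    """
    lowered = text.lower()
    matched: List[Dict[str, List[str]]] = []
    for level, tier in enumerate(tiers):
        hits: Dict[str, List[str]] = {}
        # Slots of the same priority are independent of each other, so they
        # could equally be matched in parallel.
        for slot, values in tier.items():
            found = [v for v in values if v.lower() in lowered]  # assumed matcher
            if found:
                hits[slot] = found
        if level == 0 and not hits:
            return None
        matched.append(hits)
    return matched


tiers_found = match_slots("How much is the Samsung Galaxy A3?")
no_result = match_slots("What is the weather today?")
```

Here `tiers_found` holds the brand and series fields in the first tier and the model field in the second, while `no_result` is `None` because no highest-priority slot matched.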
In the connecting step 120, the fields in the matched slots are connected according to the priority order and the predetermined logical relationship, and all connection results are output. For example, the predetermined logical relationship may include "and", "or", and the like. Depending on the logical relationship, the connection results typically include different combinations of the fields of the matched slots. Processing then proceeds to the query step 130.
In some embodiments, the predetermined logical relationship is set so that fields in slots of the same priority are related by "or", and fields in slots of different priorities are related by "and".
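One way to read this "or within a priority, and across priorities" rule is as a Cartesian product over priority tiers. The following sketch assumes the matched fields are already grouped per tier; the function name `connect` and the choice to skip tiers without any match are illustrative assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple


def connect(matched_tiers: List[Dict[str, List[str]]]) -> List[Tuple[str, ...]]:
    """Combine matched fields: 'or' inside a tier, 'and' across tiers."""
    # "or": any field matched by any slot of a tier may stand for that tier.
    pools = [[v for values in tier.values() for v in values] for tier in matched_tiers]
    pools = [pool for pool in pools if pool]  # assumption: skip tiers with no match
    if not pools:
        return []
    # "and": pick exactly one field from every remaining tier.
    return list(product(*pools))


matched = [
    {"brand": ["Samsung"], "series": ["Galaxy"]},  # highest priority
    {"model": ["A3"], "alias": ["3009"]},          # lower priority
]
connection_results = [" ".join(combo) for combo in connect(matched)]
```

For the four fields above this yields the four combinations Samsung A3, Samsung 3009, Galaxy A3 and Galaxy 3009.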
In the query step 130, all the connection results are queried in the named entity alternative list, and if one or more named entities in the named entity alternative list can be matched, the one or more named entities are all output as named entity identification results.
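A minimal sketch of the query step follows. The disclosure does not define what it means for a connection result to match a candidate; the sketch assumes that a candidate matches when every field of some connection result is one of the candidate's slot values. The `CANDIDATES` dictionary and the function name `query` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical candidate list: each named entity with its slot values.
CANDIDATES: Dict[str, Dict[str, str]] = {
    "Huawei Ascend P7": {"brand": "Huawei", "series": "Ascend", "model": "P7"},
    "Samsung Galaxy A3": {
        "brand": "Samsung", "series": "Galaxy", "model": "A3", "alias": "A3009",
    },
}


def query(
    connection_results: Sequence[Tuple[str, ...]],
    candidates: Dict[str, Dict[str, str]] = CANDIDATES,
) -> List[str]:
    """Return every candidate matched by at least one connection result."""
    found = []
    for name, slots in candidates.items():
        fields = set(slots.values())
        # Assumed matching rule: all fields of a (non-empty) connection
        # result must appear among the candidate's slot values.
        if any(result and set(result) <= fields for result in connection_results):
            found.append(name)
    return found


hits = query([("Samsung", "A3"), ("Galaxy", "A3")])
misses = query([("Huawei", "A3")])
```

All matching candidates are returned, mirroring the statement that one or more named entities may be output.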
In some embodiments, none of the connection results input to the query step 130 matches a named entity in the named entity candidate list. In this case, a similarity calculation step may be added to the method of identifying named entities in natural language in order to find similar named entities. An exemplary flow chart of a method 200 of identifying named entities in natural language that includes a similarity calculation step is described below with reference to FIG. 2. As shown in FIG. 2, the method 200 may include: a matching step 210, a connecting step 220, a querying step 230 and a similarity calculation step 240. For the sake of brevity, only the similarity calculation step 240, which differs from the method 100, is described in detail.
When none of the connection results matches any named entity in the named entity candidate list in the query step 230, processing proceeds to the similarity calculation step 240. In the similarity calculation step 240, the similarity between each connection result and the named entities in the named entity candidate list is calculated using a string similarity calculation method, and one or more named entities whose similarity is higher than a predetermined threshold are output as the named entity identification result.
In some embodiments, the string similarity calculation method may include calculating the length of the longest common subsequence of the connection result and each named entity in the named entity candidate list. In addition, the named entity in the candidate list with the greatest similarity to the connection result can be found using a dynamic programming matrix and output as the named entity identification result. A specific example of the string similarity calculation method using the dynamic programming matrix is described in detail below with reference to FIG. 4.
In some embodiments, if no named entity in the named entity candidate list is found by the string similarity calculation method, only the fields in the highest-priority slots may be retained and queried in the named entity candidate list; if one or more named entities in the candidate list are matched, they are output as the named entity identification result.
To illustrate the method flow of the present invention more clearly, a specific embodiment is described below with reference to FIG. 3. FIG. 3 is a detailed flow diagram illustrating a method of identifying named entities that are smart terminal names according to an exemplary embodiment of the present invention.
First, at step S301, a natural language input sentence containing a named entity to be identified is received. The sentence may include any number of levels of named entities to be identified, such as an electronic product name, a package name, an address, and the like. In some embodiments, for example, the natural language input may be a question asking about the name of a particular smart terminal.
As described above with reference to FIG. 1, in the present example the information collected or crawled from each terminal brand website has already been compiled, a named entity candidate list for smart terminals has been prepared, and the slot categories for the smart terminal information have been designed, so the corresponding slot priority information is available. As shown in Table 1 below, four slots are designed for each terminal name, namely brand, series, model and alias, and each is given a priority: brand and series have the highest priority, and model and alias have the lower priority.
Slot | Huawei Ascend P7 | Samsung Galaxy A3 |
Brand | Huawei | Samsung |
Series | Ascend | Galaxy |
Model | P7 | A3 |
Alias | / | A3009 |
TABLE 1
Subsequently, the process proceeds to step S302. At step S302, matching is performed on the natural language input sentence using the slot priority information. In some embodiments, the highest-priority slots (i.e., brand and series) may be matched first, in parallel.
It is determined at step S303 whether the matching was successful, and if the highest priority slot is not successfully matched, the named entity identification is ended and no result is returned. If the matching is successful, the process proceeds to step S304.
At step S304, slots with lower priority (e.g., model name and alias) are matched in parallel. If the matching is successful at step S305, the process proceeds to step S306.
At step S306, the field information in the matched slots is connected according to a preset logical relationship, for example [("brand" or "series") and ("model" or "alias")]. For example, when the matched fields are Samsung for the brand slot, Galaxy for the series slot, A3 for the model slot, and 3009 for the alias slot, the output connection results may include Samsung A3, Galaxy A3, Samsung 3009, and Galaxy 3009. Subsequently, the process proceeds to step S308. In some other embodiments, the fields in the slots may be connected in any logical relationship; for example, they may also be connected in brand-series-model-alias order.
At step S308, all connection results are queried in the previously prepared named entity candidate list for smart terminals, and if named entities in the candidate list are matched, the final named entity identification result is output. If no named entity is matched, the process proceeds to step S310. For example, in the case of smart terminal names, if (brand = Redmi) and (model = 9) are matched in the input natural language sentence, no named entity will be matched because no Redmi 9 phone exists.
At step S310, the similarity between the connection result and the named entities in the named entity candidate list is calculated by a string similarity calculation method based on the improved dynamic programming matrix. If the similarity is greater than or equal to the preset threshold, the result is output as a similar named entity recognition result at step S311 and the process proceeds to step S312. At step S312, the similar named entities are further checked, as described in detail below with reference to FIG. 4, and the identified named entities are output. If the similarity is below the preset threshold, the process proceeds to step S313.
At step S313, only the fields in the highest-priority slots (i.e., the brand and series slot information) are retained and connected according to the logical relationship [("brand" or "series")]. The process then returns to S308, where the fields of the modified connection result are queried in the previously prepared named entity candidate list for smart terminals; if named entities in the candidate list that contain those fields are matched, the final named entity identification result is output.
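The order of the query, similarity, and highest-priority-only stages (S308, S310 to S312, S313) can be summarized as a small cascade. This is only a control-flow sketch: the function `recognize`, the containment-based `contains_all_fields` query, and the placeholder similarity stage are assumptions, and the fallback is applied once rather than looping back indefinitely.

```python
from typing import Callable, List, Sequence, Tuple

Result = Tuple[str, ...]
Query = Callable[[Sequence[Result]], List[str]]


def recognize(
    full_results: Sequence[Result],
    top_priority_results: Sequence[Result],
    exact_query: Query,
    similar_query: Query,
) -> List[str]:
    """Try exact query, then similarity, then highest-priority fields only."""
    hits = exact_query(full_results)           # S308: query all connection results
    if hits:
        return hits
    hits = similar_query(full_results)         # S310-S312: similarity plus check
    if hits:
        return hits
    return exact_query(top_priority_results)   # S313, then back to S308


# Hypothetical candidate list and a containment-based query (an assumption).
CANDIDATE_NAMES = ["Redmi 8", "Redmi Note 8", "Xiaomi 9"]


def contains_all_fields(results: Sequence[Result]) -> List[str]:
    return [
        name for name in CANDIDATE_NAMES
        if any(all(field in name for field in result) for result in results)
    ]


def no_similar(results: Sequence[Result]) -> List[str]:
    return []  # placeholder: pretend the similarity stage found nothing


fallback_hits = recognize([("Redmi", "9")], [("Redmi",)], contains_all_fields, no_similar)
direct_hits = recognize([("Xiaomi", "9")], [("Xiaomi",)], contains_all_fields, no_similar)
```

With the placeholder similarity stage, the unmatched query falls through to the brand-only lookup and returns every candidate containing "Redmi".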
FIG. 3 shows a method 300 of identifying named entities using a smart terminal as an example. It should be understood by those skilled in the art that named entities are not limited thereto; the identification method for named entities of other fields is similar to the method 300, except that the slot priorities are designed differently according to the statistics of the respective field. For example, when the named entity to be identified is a mobile phone call/traffic package, information about each package provided by the operator may be collected and collated. Five slots (name, price, traffic, voice, and alias) can be designed for each package and assigned corresponding priorities. In this example, name and alias have the higher priority, and traffic, price, and voice have the lower priority. Taking the "Tianyi Changxiang 69 yuan package" as an example, during the statistical process its fields can be classified into slots as shown in Table 2 below:
Slot | Tianyi Changxiang 69 yuan package |
Name | Tianyi |
Price | 69 yuan |
Traffic | / |
Voice | 500 minutes |
Alias | Changxiang |
TABLE 2
As described above in step 240 of method 200 and step S310 of method 300, the similarity calculation step may be performed when the named entities cannot be matched in the query step. In the similarity calculation step, in addition to the traditional similarity calculation, the similarity between the connection result and the named entities in the named entity alternative list can be calculated through a character string similarity calculation method based on an improved dynamic programming matrix, and the similar named entities are checked based on the dynamic programming matrix to obtain an optimized named entity recognition result. A specific example of the method of calculating the similarity of character strings by the improved dynamic programming matrix-based method will be described in detail below with reference to fig. 4.
First, in step 410, character string A is set to a connection result output in the connecting step (i.e., each of the connection results obtained by connecting the fields in the matched slots according to the priority order and the predetermined logical relationship), and character string B is set to each named entity in the named entity candidate list. Subsequently, the process proceeds to step 420.
At step 420, the longest common subsequence of string A and string B is found and its length is obtained. It will be understood by those skilled in the art that the longest common subsequence is the longest subsequence common to two or more strings, where a subsequence need not occupy contiguous positions in the original sequence. In contrast, a substring must be contiguous.
At step 430, the similarity between string A and string B is calculated from the normalized longest common subsequence (LCS) length, as shown in equation 1:
equation 1
In some embodiments, a dynamic programming matrix may be used to compute the similarity of string A and string B iteratively. For example, in the dynamic programming matrix, the length of string A is defined as m, the length of string B is defined as n, and dp[i][j] is the length of the longest common subsequence of the first i characters of string A and the first j characters of string B, so the size of the whole matrix is (m + 1) x (n + 1). In the initial state: dp[i][0] = 0, dp[0][j] = 0.
The subsequent state transition equations are:
When A[i-1] != B[j-1], dp[i][j] = max{dp[i-1][j], dp[i][j-1]}
When A[i-1] == B[j-1], dp[i][j] = dp[i-1][j-1] + 1
Subsequently, the process proceeds to step 440. In this step, all similar results equal to or greater than a predetermined first threshold are obtained. For example, taking the intelligent terminal name "red rice 9" that cannot be matched to any named entity as an example, in the case that the predetermined threshold is set to 0.6, it can be queried that the similarity of red rice 8, millet 9, millet CC9, red pepper 9X and "red rice 9" is 0.667 (2/3). Subsequently, the process proceeds to step 450.
At step 450, it is checked whether the length of the shortest substring (i.e., run of consecutive characters) within the longest common subsequence of character strings A and B meets a predetermined second threshold, and similar results that do not meet the condition are filtered out. In some embodiments, this check may be performed directly on the dynamic programming matrix generated in step 440. For example, the check may include the following steps:
1) Find the first position where the strings coincide, set the check mark value to 1, and start checking iteratively along the diagonal of the matrix from that position;
2) If the value on the diagonal increases, add 1 to the check mark value;
3) If the value on the diagonal is unchanged or decreases,
a) when the current check mark value is not 1, reset the check mark value to 0;
b) when the check mark value is 1, skip step 4), end the check, and set the similarity to 0, i.e., the two character strings are not considered to match;
4) Continue checking the next value on the diagonal, repeating operations 2) and 3) until there are no more values on the diagonal.
Finally, each character string B whose similarity has not been set to 0 is taken as a found entity name.
Taking the above example, in which "Redmi 9" (红米9, string A) fails to match a named entity and is compared with "Redmi 8" (红米8, string B) and "Red Pepper 9X" (红辣椒9X, string B') in the candidate list, the dynamic programming matrices can be represented character by character as shown in Tables 3A and 3B below:
 | 0 | 红 | 米 | 8 |
0 | 0 | 0 | 0 | 0 |
红 | 0 | 1 | 1 | 1 |
米 | 0 | 1 | 2 | 2 |
9 | 0 | 1 | 2 | 2 |
TABLE 3A
 | 0 | 红 | 辣 | 椒 | 9 | X |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
红 | 0 | 1 | 1 | 1 | 1 | 1 |
米 | 0 | 1 | 1 | 1 | 1 | 1 |
9 | 0 | 1 | 1 | 1 | 2 | 2 |
TABLE 3B
After the check described above, since the length of the shortest substring in the common subsequence with "Red Pepper 9X" is only 1, its similarity to "Redmi 9" is set to 0 and it is excluded from the similar results. Similarly, "Xiaomi CC9" is excluded. Only Redmi 8 and Xiaomi 9 are identified as similar named entities. The string similarity comparison and the check, both based on the dynamic programming matrix, thus ensure that the named entity matching results are both similar and reasonable.
Fig. 5 is a block diagram showing a basic configuration of an apparatus 500 for identifying a named entity according to an exemplary embodiment of the present invention.
As shown in FIG. 5, the apparatus 500 for identifying a named entity includes: a matching unit 510, a connecting unit 520, a querying unit 530 and a similarity calculating unit 540. The matching unit 510 matches fields in the input natural language sentence to slots, in order of priority from high to low, according to the predetermined priorities of the slots in the named entity; the connecting unit 520 connects the fields in the matched slots according to the priority and a predetermined logical relationship, and outputs all connection results; the querying unit 530 queries the connection results in the named entity candidate list, and if one or more named entities in the named entity candidate list are matched, outputs the one or more named entities as the named entity identification result; when the connection results do not match any named entity in the named entity candidate list, the similarity calculating unit 540 calculates the similarity between the connection results and the named entities in the named entity candidate list using a string similarity calculation method, and outputs one or more named entities having a similarity higher than a predetermined threshold as the named entity identification result. It will be appreciated by those skilled in the art that the components of the apparatus 500 for identifying a named entity are not limited to the components 510-540 described above, but may include components for carrying out further steps of the aforementioned methods according to embodiments of the present invention. The various components of the apparatus 500 may be implemented by hardware, software, firmware, or any combination thereof. In addition, those skilled in the art will also appreciate that the various components of the apparatus 500 may be combined or divided into sub-components as desired.
The above-described respective components of the apparatus 500 are not limited to the above-described respective functions, but may implement the functions of the respective steps of the respective methods according to the embodiments of the present invention as described previously.
FIG. 6 illustrates an exemplary configuration of a computing device 2000, in which embodiments in accordance with the invention may be implemented. Computing device 2000 is an example of a hardware device in which the above-described aspects of the invention may be applied. Computing device 2000 may be any machine configured to perform processing and/or computing. The computing device 2000 may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a Personal Data Assistant (PDA), a smart phone, an in-vehicle computer, or a combination thereof. The aforementioned apparatus 500 may be implemented, in whole or at least in part, by the aforementioned computing device 2000 or a device or system similar thereto.
As shown in FIG. 6, computing device 2000 may include one or more elements connected to or in communication with bus 2002, possibly via one or more interfaces. For example, computing device 2000 may include a bus 2002, one or more processors 2004, one or more input devices 2006, and one or more output devices 2008. Bus 2002 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus, among others. The one or more processors 2004 may be any kind of processor and may include, but are not limited to, one or more general-purpose processors or special-purpose processors (such as special-purpose processing chips). Input device 2006 may be any type of input device capable of inputting information to a computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control. Output device 2008 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The computing device 2000 may also include or be connected to a non-transitory storage device 2010, which may be any storage device that is non-transitory and capable of storing data, and may include, but is not limited to, a disk drive, an optical storage device, a solid state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape, or any other magnetic medium, a compact disk, or any other optical medium, a ROM (read only memory), a RAM (random access memory), a cache memory, and/or any other memory chip or unit, and/or any other medium from which a computer may read data, instructions, and/or code. The non-transitory storage device 2010 may be removably connected with any interface.
The non-transitory storage device 2010 may have stored thereon data/instructions/code for implementing the aforementioned methods and/or steps for identifying named entities. Computing device 2000 may also include a communication device 2012, which may be any kind of device or system capable of enabling communication with external devices and/or networks and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication facilities, etc.).
The computing device 2000 may also include a working memory 2014. The working memory 2014 may be any type of working memory capable of storing instructions and/or data useful to the processor 2004 and may include, but is not limited to, Random Access Memory (RAM) and Read Only Memory (ROM).
The software elements located on the above-described working memory may include, but are not limited to, an operating system 2016, one or more application programs 2018, drivers, and/or other data and code. One or more of the applications 2018 may include instructions for performing the methods and steps for identifying named entities as described above. The components/units/elements of the system 300 for identifying named entities described above, such as the matching unit 310, the connection unit 320, the query unit 330, the similarity comparison unit 340, and so on, may be implemented by a processor that reads and executes one or more application programs 2018. Executable code or source code of the instructions of the software elements may be stored in a non-transitory computer-readable storage medium (such as storage device 2010 as described above) and may be read into working memory 2014 by compilation and/or installation. Executable or source code for the instructions of the software elements may also be downloaded from a remote location.
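As a rough illustration of how an application program 2018 might chain the matching, connection, and query units, the following Python sketch walks through the three steps, plus the fallback that retains only the first field of the highest-priority slot. It is a minimal sketch under stated assumptions, not the implementation of this disclosure: the slot table `SLOTS`, the function names, the substring-based matching, and the representation of the connection result as a list of OR-groups joined by AND are all hypothetical. The similarity calculation that would normally run before the fallback is omitted here.

```python
from typing import Dict, List, Tuple

# Hypothetical slot table (not from the patent): slot name -> (priority, known fields).
# A smaller number means a higher priority.
SLOTS: Dict[str, Tuple[int, List[str]]] = {
    "brand": (1, ["Acme", "Globex"]),
    "model": (2, ["X200", "Z9"]),
    "color": (2, ["red", "black"]),
}


def match_fields(text: str, slots: Dict[str, Tuple[int, List[str]]]) -> Dict[int, List[str]]:
    """Matching: visit slots from high to low priority and collect the fields found in the text."""
    matched: Dict[int, List[str]] = {}
    for _name, (priority, fields) in sorted(slots.items(), key=lambda kv: kv[1][0]):
        for field in fields:
            if field in text:  # assumption: plain substring matching
                matched.setdefault(priority, []).append(field)
    return matched


def connect(matched: Dict[int, List[str]]) -> List[List[str]]:
    """Connecting: fields of equal priority form an OR-group; groups are AND-ed in priority order."""
    return [matched[p] for p in sorted(matched)]


def query(connection: List[List[str]], candidates: List[str]) -> List[str]:
    """Querying: keep candidates that satisfy every AND-ed group through at least one OR-ed field."""
    if not connection:
        return []
    return [e for e in candidates if all(any(f in e for f in group) for group in connection)]


def fallback_query(connection: List[List[str]], candidates: List[str]) -> List[str]:
    """Fallback: retain only the first field of the highest-priority slot and query again."""
    if not connection:
        return []
    return query([connection[0][:1]], candidates)
```

With a hypothetical candidate list such as `["Acme X200 red", "Acme Z9 black", "Globex X200 black"]`, the text "show me the red Acme phone" yields the connection result `[["Acme"], ["red"]]`, which only the first candidate satisfies. Keeping the OR-groups as nested lists is one possible design choice; it makes the "and"/"or" relationship explicit without building a query string.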
It will be appreciated that variations may be made in accordance with specific requirements. For example, customized hardware might be used and/or particular elements might be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. In addition, connections to other computing devices (such as network input/output devices) may be employed. For example, some or all of the methods and apparatus of the present invention may be implemented by programming hardware (e.g., programmable logic circuits including Field Programmable Gate Arrays (FPGAs) and/or Programmable Logic Arrays (PLAs)) in assembly language or a hardware programming language (e.g., VERILOG, VHDL, C++) using logic and algorithms in accordance with the present invention.
It should be further understood that the elements of computing device 2000 may be distributed throughout a network. For example, some processes may be performed using one processor while other processes are performed using other remote processors. Other elements of the computing device 2000 may be similarly distributed. Thus, the computing device 2000 may be understood as a distributed computing system that performs processing at multiple sites.
The method and apparatus of the present invention can be implemented in a number of ways. For example, the methods and apparatus of the present invention may be implemented in software, hardware, firmware, or any combination thereof. The order of the method steps described above is merely illustrative and the method steps of the present invention are not limited to the order specifically described above unless explicitly stated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, which includes machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
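As one illustration of a software implementation, the string similarity recited in the claims (the ratio of the longest common subsequence length to the length of the connection result, computed with a dynamic programming matrix) can be sketched in Python as follows. This is a minimal sketch, not the definitive implementation: the function names, the treatment of the connection result as a plain string, and the handling of an empty connection result are assumptions made for illustration.

```python
from typing import List


def lcs_matrix(a: str, b: str) -> List[List[int]]:
    """Build the dynamic programming matrix; dp[i][j] is the LCS length of a[:i] and b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp


def similarity(connection_result: str, entity: str) -> float:
    """Ratio of the LCS length to the length of the connection result."""
    if not connection_result:
        return 0.0  # assumption: an empty connection result is similar to nothing
    return lcs_matrix(connection_result, entity)[-1][-1] / len(connection_result)


def similar_entities(connection_result: str, candidates: List[str], threshold: float) -> List[str]:
    """Keep the candidates whose similarity is higher than the threshold."""
    return [e for e in candidates if similarity(connection_result, e) > threshold]
```

For example, `similarity("abcd", "abxd")` evaluates to 0.75, because the longest common subsequence "abd" has length 3 and the connection result has length 4. The matrix costs time and memory proportional to the product of the two string lengths, which is usually acceptable for short entity names.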
While some specific embodiments of the present invention have been shown in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are intended to be illustrative only and are not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that the above-described embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.
Claims (16)
1. A method of identifying a named entity, comprising:
a matching step of matching fields in a text to slots, in order from high priority to low priority, based on a predetermined priority of the slots in the named entity;
a connecting step of connecting the fields in the matched slots according to the priority and a predetermined logical relationship to obtain a connection result;
a query step of querying the connection result in a named entity alternative list, determining one or more named entities in the named entity alternative list that match the connection result, and taking the one or more named entities as identified named entities; and
a similarity calculation step of, in a case where no named entity matching the connection result exists in the named entity alternative list, calculating the similarity between the connection result and the named entities in the named entity alternative list by using a character string similarity calculation method, taking one or more named entities whose similarity is higher than a predetermined threshold as similar named entities, and determining, among the similar named entities, the named entities for which the length of the shortest continuous substring in the longest common subsequence is greater than a second threshold as the identified named entities.
2. The method of claim 1, wherein the string similarity calculation method comprises calculating a longest common subsequence length of the connection result and the named entities in the named entity alternative list, and taking a ratio of the longest common subsequence length to a length of the connection result as the similarity.
3. The method of claim 2, wherein the string similarity calculation method further comprises calculating the longest common subsequence length using a dynamic programming matrix.
4. The method of claim 1, wherein:
the named entities for which the length of the shortest continuous substring in the longest common subsequence is greater than the second threshold are determined, among the similar named entities, as the identified named entities according to a dynamic programming matrix.
5. The method of claim 1, wherein the predetermined logical relationship comprises: the logical relationship between fields in slots having the same priority is set to "or", and the logical relationship between fields in slots of different priorities is set to "and".
6. The method of claim 1, wherein the predetermined priority of slots in the named entities is obtained by classifying fields of named entities in a list into a plurality of slots and assigning a priority to each slot based on a statistic of all named entities in the named entity alternative list.
7. The method of claim 1, wherein the querying step further comprises, in the case that there is no named entity in the named entity alternative list with a similarity higher than a predetermined threshold, retaining only the first field in the text that matches the highest-priority slot, and querying the first field in the named entity alternative list, determining one or more named entities in the named entity alternative list that contain the first field, and treating the one or more named entities as the identified named entities.
8. An apparatus to identify named entities, comprising:
a matching unit configured to match fields in a text to slots, in order from high priority to low priority, based on a predetermined priority of the slots in the named entity;
a connection unit configured to connect the fields in the matched slots according to the priority and a predetermined logical relationship to obtain a connection result;
a query unit configured to query the connection result in a named entity alternative list, determine one or more named entities in the named entity alternative list that match the connection result, and take the one or more named entities as identified named entities; and
a similarity calculation unit configured to, in a case where no named entity matching the connection result exists in the named entity alternative list, calculate the similarity between the connection result and the named entities in the named entity alternative list by using a character string similarity calculation method, take one or more named entities whose similarity is higher than a predetermined threshold as similar named entities, and determine, among the similar named entities, the named entities for which the length of the shortest continuous substring in the longest common subsequence is greater than a second threshold as the identified named entities.
9. The apparatus of claim 8, wherein the string similarity calculation method comprises calculating a longest common subsequence length of the connection result and the named entities in the named entity alternative list, and taking a ratio of the longest common subsequence length to a length of the connection result as the similarity.
10. The apparatus of claim 9, wherein the string similarity calculation method further comprises calculating the longest common subsequence length using a dynamic programming matrix.
11. The apparatus of claim 8, wherein the similarity calculation unit is configured to determine, among the similar named entities and according to a dynamic programming matrix, the named entities for which the length of the shortest continuous substring in the longest common subsequence is greater than the second threshold as the identified named entities.
12. The apparatus of claim 8, wherein the predetermined logical relationship comprises: the logical relationship between fields in slots having the same priority is set to "or", and the logical relationship between fields in slots of different priorities is set to "and".
13. The apparatus of claim 8, wherein the query unit is further configured to, when the connection result does not match any named entity in the named entity alternative list, retain only a first field in the slot with the highest priority and query the first field in the named entity alternative list, determine one or more named entities in the named entity alternative list that include the first field, and take the one or more named entities as the identified named entities.
14. The apparatus of claim 8, further comprising a priority setting unit configured to classify fields of named entities in the list into a plurality of slots based on statistics of all named entities in the named entity alternative list, and assign a priority to each slot to obtain a predetermined priority of slots in the named entities.
15. A system for identifying a named entity, comprising:
one or more processors; and
one or more memories configured to store a series of computer-executable instructions,
wherein the series of computer-executable instructions, when executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium, on which a program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1-7.
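The second-threshold test recited in the claims (the length of the shortest continuous substring in the longest common subsequence) can be sketched in Python by backtracking through the dynamic programming matrix. This is a minimal sketch under one possible reading of the claim language, not the definitive method: it assumes that a "continuous substring" is a run of matched characters that are adjacent in both strings, and all function names are hypothetical. When several longest common subsequences exist, the backtracking rule used here picks one of them, so the reported run length may depend on that choice.

```python
from typing import List, Tuple


def lcs_pairs(a: str, b: str) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with a[i] == b[j] that form one longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack from the bottom-right corner of the matrix to recover the matched positions.
    pairs: List[Tuple[int, int]] = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def shortest_run_length(a: str, b: str) -> int:
    """Length of the shortest run of matches that are adjacent in both strings (0 if no match)."""
    pairs = lcs_pairs(a, b)
    if not pairs:
        return 0
    runs = [1]
    for (pi, pj), (ci, cj) in zip(pairs, pairs[1:]):
        if ci == pi + 1 and cj == pj + 1:
            runs[-1] += 1  # the current run continues
        else:
            runs.append(1)  # a gap in either string starts a new run
    return min(runs)


def passes_second_threshold(connection_result: str, entity: str, second_threshold: int) -> bool:
    """Assumed filter: accept the entity only if its shortest matched run exceeds the threshold."""
    return shortest_run_length(connection_result, entity) > second_threshold
```

Under this reading, "abcdef" against "abcxdef" has two runs of length 3, so it passes a second threshold of 2, whereas "abcd" against "abxd" has an isolated single-character match and does not. The intent of such a filter would be to reject candidates whose common subsequence is made up of scattered characters.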
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911369966.1A CN113051919B (en) | 2019-12-26 | 2019-12-26 | Method and device for identifying named entity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911369966.1A CN113051919B (en) | 2019-12-26 | 2019-12-26 | Method and device for identifying named entity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113051919A (en) | 2021-06-29 |
CN113051919B (en) | 2023-04-04 |
Family
ID=76505629
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911369966.1A Active CN113051919B (en) | 2019-12-26 | 2019-12-26 | Method and device for identifying named entity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113051919B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115618824B (en) * | 2022-10-31 | 2023-10-27 | 上海苍阙信息科技有限公司 | Data set labeling method and device, electronic equipment and medium |
CN117592471B (en) * | 2023-11-10 | 2024-11-01 | 易方达基金管理有限公司 | News main body recognition method and system for public opinion data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858040A (en) * | 2019-03-05 | 2019-06-07 | 腾讯科技(深圳)有限公司 | Name entity recognition method, device and computer equipment |
CN109918680A (en) * | 2019-03-28 | 2019-06-21 | 腾讯科技(上海)有限公司 | Entity recognition method, device and computer equipment |
CN110516247A (en) * | 2019-08-27 | 2019-11-29 | 湖北亿咖通科技有限公司 | Name entity recognition method neural network based and computer storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10672391B2 (en) * | 2014-09-26 | 2020-06-02 | Nuance Communications, Inc. | Improving automatic speech recognition of multilingual named entities |
Also Published As
Publication number | Publication date |
---|---|
CN113051919A (en) | 2021-06-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||