CN112307134B - Entity information processing method, device, electronic equipment and storage medium - Google Patents
Entity information processing method, device, electronic equipment and storage medium
- Publication number
- CN112307134B (CN202011196563.4A)
- Authority
- CN
- China
- Prior art keywords
- entity
- candidate
- target
- names
- department
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The disclosure provides an entity information processing method, an entity information processing device, electronic equipment and a storage medium, and relates to fields such as deep learning. The specific implementation scheme is as follows: identifying N document materials of a target department to obtain candidate entity names respectively corresponding to the N document materials, where N is an integer greater than or equal to 1; generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; and determining, based on the candidate entity names respectively contained in the M candidate clusters, target entity names of M first-type entities corresponding to the target department in the relationship graph.
Description
Technical Field
The present disclosure relates to the field of computer technology. The present disclosure relates particularly to the field of deep learning.
Background
Relationship graphs are increasingly used in enterprises. A relationship graph may include entities of a first type (i.e., "things"), entities of a second type (i.e., "people"), and relationships between entities of the first type and entities of the second type. A relationship graph can support functions such as searching for the person in charge of a thing and viewing information related to a person. However, how to efficiently and accurately construct the first-type entities in the relationship graph is a problem to be solved.
Disclosure of Invention
The disclosure provides an entity information processing method, an entity information processing device, electronic equipment and a storage medium.
According to a first aspect of the present disclosure, there is provided an entity information processing method, including:
identifying N document materials of a target department to obtain candidate entity names respectively corresponding to the N document materials; N is an integer greater than or equal to 1;
generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; M is an integer greater than or equal to 1;
and determining, based on the candidate entity names respectively contained in the M candidate clusters, target entity names of M first-type entities corresponding to the target department in the relationship graph.
According to a second aspect of the present disclosure, there is provided an entity information processing apparatus including:
the identification module is used for identifying N document materials of the target department to obtain candidate entity names respectively corresponding to the N document materials; N is an integer greater than or equal to 1;
the clustering module is used for generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; M is an integer greater than or equal to 1;
and the entity name determining module is used for determining, based on the candidate entity names respectively contained in the M candidate clusters, target entity names of M first-type entities corresponding to the target department in the relationship graph.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the aforementioned method.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the aforementioned method.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
With the method and the device, the candidate entity names corresponding to the document materials can be determined based on the document materials of the target department, and the target entity names of one or more first-type entities corresponding to the target department in the relationship graph are then determined based on those candidate entity names. This avoids the low efficiency, poor timeliness and inaccurate results caused by analyzing entity names manually, ensures the efficiency and accuracy of acquiring the target entity names, and in turn ensures the efficiency and accuracy of constructing or updating the relationship graph.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of an entity information processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a process flow for constructing candidate clusters in an information processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a constituent structure of an information processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram showing a constitution of an information processing apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an electronic device used to implement the information processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the disclosure provides an entity information processing method, as shown in fig. 1, including:
S101: identifying N document materials of a target department to obtain candidate entity names respectively corresponding to the N document materials; N is an integer greater than or equal to 1;
S102: generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; M is an integer greater than or equal to 1;
S103: determining, based on the candidate entity names respectively contained in the M candidate clusters, target entity names of M first-type entities corresponding to the target department in the relationship graph.
The embodiments of the present disclosure can be applied to electronic equipment, such as a server or a terminal device.
The target department may be any one of a plurality of departments in an organization or enterprise. The scheme provided in this embodiment may be applied to each department; any one such department is referred to as the target department, and the other departments are processed in the same way, which is not described again here.
The N document materials of the target department may specifically be at least one of document materials such as weekly report, promotional material, etc. of the target department.
The N document materials of the target department may be acquired by collecting materials within the department. For example, all document materials uploaded by the employees of the target department may be collected as the N document materials of the target department; alternatively, the N document materials may be randomly sampled from the document materials uploaded by the employees of the target department.
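The two acquisition options described above (take everything, or draw a random sample) can be sketched as follows. This is a minimal illustration only; the function name, the `uploads_by_employee` mapping and the `seed` parameter are assumptions introduced here and do not appear in the disclosure.

```python
import random
from typing import Dict, List, Optional


def collect_department_documents(
    uploads_by_employee: Dict[str, List[str]],
    sample_size: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Return the document materials of one department.

    uploads_by_employee maps a (hypothetical) employee id to the documents
    that employee uploaded. If sample_size is None, every document is used;
    otherwise sample_size documents are drawn at random without replacement.
    """
    all_docs = [doc for docs in uploads_by_employee.values() for doc in docs]
    if sample_size is None or sample_size >= len(all_docs):
        return all_docs
    return random.Random(seed).sample(all_docs, sample_size)
```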
Identifying the N document materials of the target department to obtain the candidate entity names of the first-type entities respectively corresponding to the N document materials may include: inputting each of the N document materials of the target department into a preset model to obtain the candidate entity names output by the preset model.
Based on the candidate entity names respectively corresponding to the N document materials, generating M candidate clusters corresponding to the target department can comprise: clustering the candidate entity names corresponding to the N document materials of the target department respectively to obtain M candidate clusters corresponding to the target department.
Further, one candidate entity name can be selected from one or more candidate entity names contained in each candidate cluster to serve as a target entity name corresponding to each candidate cluster; and taking the target entity name as the target entity name of a first type entity of the target department.
It is to be understood that the specific number of the M candidate clusters may be different according to actual situations. Assuming that the target department can finally obtain a target entity name of a first type entity, M is equal to 1; assuming that 2 or more first-class entities corresponding to the target department respectively correspond to target entity names, M is 2 or more; all possible cases are not exhaustive here.
The first type of entity may refer to a "thing" in the relationship graph, and a "thing" entity may cover various contents, for example: projects, platforms, tools, etc. It is to be understood that the relationship graph may include one or more entities of the first type.
The target entity name or candidate entity name of a first-type entity may refer to an attribute or piece of information of the "thing" to be used in the relationship graph; for example, the entity name of a "thing" may be the name of a project, the name of a platform, the name of a tool, etc.
According to the scheme, the candidate entity names corresponding to the document materials can be determined based on the document materials of the departments, and then one or more target entity names corresponding to each department in the relation graph are determined based on the candidate entity names.
Specifically, in S101, the identifying N document materials of the target department to obtain candidate entity names corresponding to the N document materials includes:
inputting a j-th document material among the N document materials of the target department, together with the target department, into a preset model to obtain the candidate entity names corresponding to the j-th document material output by the preset model; wherein j is an integer greater than or equal to 1 and less than or equal to N.
The N document materials can be extracted from the internal documents of the enterprise, including document materials such as weekly reports, promotional materials, work reports and project proposal materials. Because these materials exist in large quantities inside the enterprise, they can be obtained at very low cost. Moreover, these materials tend to be timely; for example, weekly reports are written once a week, so the timeliness requirement can be met by collecting this kind of document material.
The jth document material is any one of the above-mentioned N document materials. The N document materials are processed in the same way to obtain corresponding candidate entity names, so that the processing of all the N document materials is not repeated one by one.
It should be understood that, the input information of the preset model may be specifically the name of the target department and the j-th document material; still further, the jth document material may be segmented in advance to obtain at least one segmented sentence, and the at least one segmented sentence and the name of the target department are used as input information of the preset model; correspondingly, the output information of the preset model can be candidate entity names.
In this way, the problems of low efficiency and poor accuracy caused by manual analysis or simple text matching can be avoided, which improves both the accuracy and the efficiency of determining the target entity names.
Further, the preset model may be obtained by training on the sample data included in a training set. The training set may be constructed by:
acquiring historical candidate entity names corresponding to a plurality of departments respectively;
matching the historical document material of each department with the historical candidate entity name of the corresponding department to obtain the historical entity name corresponding to the historical document material of each department;
and generating a training set based on the historical document materials of each department and the corresponding historical entity names.
Specifically, the historical document materials may be extracted from the internal documents of the enterprise; for example, historical document materials containing department project names may be found in weekly reports, promotional materials, and the like. Because these historical document materials exist in large quantities inside the enterprise, they can be obtained at low cost.
Generating a training set based on the historical document materials of each department and the corresponding historical entity names thereof can be to use each historical document material and the corresponding historical entity names thereof and the corresponding departments as each sample data, and add each sample data to the training set. The training set may finally comprise all of the sample data described above.
It should be noted that, in constructing the training set, the historical entity names of a department are matched only against the historical document materials of that same department in order to label those materials, which reduces noise and improves the quality of the training set. Determining the historical entity name corresponding to each historical document material amounts to labeling that material; that is, the historical entity name matched in the historical document material is used as its label. In the related art, sample data in a training set is generally labeled manually, which is costly. In this embodiment, labeling the historical document materials with their historical entity names is completed automatically, simply by matching the historical entity names and the historical document materials of the same department. This avoids the excessive cost of manual labeling and, compared with manual labeling, offers higher efficiency and higher accuracy.
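The per-department matching described above can be sketched as a simple substring match. This is an illustrative sketch under stated assumptions: the function name, the dictionary layout, and the choice to skip documents in which no name is found are all introduced here and are not specified by the disclosure.

```python
from typing import Dict, List, Tuple

# One sample: (department, document text, entity names matched in that text)
Sample = Tuple[str, str, List[str]]


def build_training_set(
    history_names: Dict[str, List[str]],
    history_docs: Dict[str, List[str]],
) -> List[Sample]:
    """Label each historical document with names from its own department only."""
    samples: List[Sample] = []
    for department, docs in history_docs.items():
        names = history_names.get(department, [])
        for doc in docs:
            # Only names belonging to the same department are considered,
            # so a name from another department never becomes a label here.
            matched = [name for name in names if name in doc]
            if matched:  # assumption: unmatched documents are skipped
                samples.append((department, doc, matched))
    return samples
```

In the test data below, a document from one department that mentions another department's project name receives no label, which is the noise reduction the paragraph describes.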
Therefore, the labeling work of the data related to the training set is automatically completed by the equipment, and the historical document materials of the same department are labeled by adopting the historical entity name of the same department when the sample data is labeled, so that the departments are used as the granularity of information or as global information, the effect of entity extraction can be improved, the quality of the sample data of the training set can be improved, and the noise can be reduced.
And then training the preset model based on the historical document materials of each department and the corresponding historical entity names contained in the training set to obtain the trained preset model.
The preset model is trained on the constructed training set, which contains sample data consisting of the historical document materials of each department and the historical entity names (such as project names) of the corresponding department. During training, the historical document material contained in each sample may be divided into one or more sentences; the resulting sentences and the name of the department are used as the input of the preset model, and the historical entity names corresponding to the historical document material in the sample are used as the output, so that the preset model is trained. For example, when the preset model is trained, the input features of the preset model include the sentences of the historical document material and the department, which can be expressed as: sentence + <SEP> + department.
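The input format above can be illustrated with a short sketch that splits a document into sentences and appends the department name after a separator. The sentence splitter is deliberately naive and the literal string "<SEP>" is used only for illustration; both are assumptions made here, and a real BERT tokenizer would typically insert its own separator token when given a sentence pair.

```python
import re
from typing import List

SEP = "<SEP>"  # illustrative separator string, not a tokenizer's special token


def split_sentences(text: str) -> List[str]:
    """Naive splitter: cut after ., !, ? or their full-width equivalents."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [part.strip() for part in parts if part.strip()]


def build_model_inputs(document: str, department: str) -> List[str]:
    """Build one 'sentence + <SEP> + department' string per sentence."""
    return [f"{sentence}{SEP}{department}" for sentence in split_sentences(document)]
```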
The convergence condition in training of the preset model may be that the number of iterations reaches a preset threshold and/or the loss function is smaller than the preset threshold. The specific convergence conditions may include more, and are not exhaustive in this embodiment.
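As a rough illustration of such a stopping rule, the sketch below stops when either condition holds. The sequence of per-iteration losses stands in for real training steps, and the function name and parameters are hypothetical.

```python
from typing import Iterable


def iterations_until_converged(
    step_losses: Iterable[float],
    max_iterations: int,
    loss_threshold: float,
) -> int:
    """Return how many iterations run before a stopping condition is met.

    Stops when the iteration count reaches max_iterations OR the loss of the
    current iteration falls below loss_threshold, whichever happens first.
    """
    iterations = 0
    for loss in step_losses:
        iterations += 1
        if iterations >= max_iterations or loss < loss_threshold:
            break
    return iterations
```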
The preset model may be constructed from a BERT (Bidirectional Encoder Representations from Transformers) model and a conditional random field (CRF) model. Using the pre-trained language model BERT to extract semantic vectors allows accurate semantic extraction from sentences and improves semantic transferability, so that good results can be obtained even with a relatively small training set.
Therefore, in the process of training the preset model, the labeling work of the data of the training set is automatically completed by the equipment, and the historical document materials of the same department are labeled by adopting the historical entity names of the same department when the sample data is labeled, so that the quality of the sample data of the training set can be improved, the noise can be reduced, and the training of the preset model based on the training set can also ensure the identification accuracy of the finally obtained preset model.
With the above processing, the currently input document materials can be analyzed based on the preset model, so that the entity name corresponding to each currently input document material is obtained and used as the candidate entity name of that document material. The processing of S102 is then executed. Determining the M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials may include:
S201: screening the N candidate entity names respectively corresponding to the N document materials to obtain L candidate entity names;
S202: clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters in the M candidate clusters comprise different candidate entity names.
Regarding S201, the following processing methods may be specifically included:
Mode 1: obtaining frequency information of the N candidate entity names, and selecting, from the N candidate entity names, L candidate entity names whose frequency information is greater than a preset frequency threshold;
or,
Mode 2: filtering the N candidate entity names of the N document materials based on a preset rule, and retaining L candidate entity names that do not satisfy the preset rule;
or,
Mode 3: combining mode 1 and mode 2, which may be:
filtering the N candidate entity names of the N document materials based on a preset rule, and retaining at least one candidate entity name that does not satisfy the preset rule; then obtaining the frequency information of the at least one candidate entity name, and selecting, from the at least one candidate entity name, L candidate entity names whose frequency information is greater than the preset frequency threshold.
In mode 1, frequency statistics are first computed over the candidate entity names of each department to obtain the frequency information corresponding to each candidate entity name, and candidate entity names with low frequency are then filtered out based on that frequency information. This improves the accuracy of the subsequent clustering.
The preset frequency threshold may be set according to actual situations, for example, may be 3 occurrences as the preset frequency threshold, or may be 4 occurrences as the preset frequency threshold.
In mode 2, the preset rule may include: the candidate entity name matches a preset keyword. The preset keywords may be set according to actual situations; for example, "commercialized" may be used as a preset keyword, and accordingly, candidate entity names that include the preset keyword "commercialized" may be deleted.
In mode 3, the two modes are combined: candidate entity names that satisfy the preset rule are deleted first, and candidate entity names with lower frequency are then filtered out. Of course, the order may be reversed: after candidate entity names whose frequency is lower than the preset frequency threshold are filtered out, candidate entity names that satisfy the preset rule are deleted from the remaining candidate entity names. In either case, the L candidate entity names corresponding to the target department are finally obtained.
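The three screening modes can be sketched with a single function: passing only a frequency threshold corresponds to mode 1, passing only keywords corresponds to mode 2, and passing both corresponds to mode 3 (rule first, then frequency). The function name, parameter names and the substring test for keywords are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def screen_candidates(
    names: Sequence[str],
    min_frequency: int = 0,
    blocked_keywords: Iterable[str] = (),
) -> Dict[str, int]:
    """Return the retained candidate names mapped to their frequencies."""
    blocked = tuple(blocked_keywords)
    # Rule-based filtering: drop names containing any preset keyword.
    kept = [name for name in names if not any(key in name for key in blocked)]
    # Frequency filtering: keep names occurring more often than the threshold.
    counts = Counter(kept)
    return {name: count for name, count in counts.items() if count > min_frequency}
```

Returning the frequencies together with the names keeps the counts available for the later step that selects a standard name per cluster.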
In S202, clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department may specifically include: performing similarity calculation on the L candidate entity names, and adding candidate entity names whose distance from one another is smaller than a preset similarity threshold (that is, names that are sufficiently similar) to the same cluster, finally obtaining the M candidate clusters corresponding to the target department.
Further, the similarity calculation may be: editing the calculation of the distance similarity and/or the calculation of the semantic similarity. Correspondingly, the preset similarity threshold may include: at least one of preset editing distance similarity threshold and preset semantic similarity threshold.
For example, in one example, the DBSCAN density-based clustering algorithm may be used to cluster the candidate entity names, which addresses the entity fusion problem. The similarity of the candidate entity names may be measured by the edit distance between the names; that is, when the edit distance between two candidate entity names is smaller than the preset edit distance similarity threshold, the two candidate entity names are placed in the same cluster.
In yet another example, a Deep Structured Semantic Model (DSSM) or another model may be used to calculate semantic similarity, and candidate entity names whose semantic distance is smaller than a preset semantic similarity threshold are treated as the same class and placed in the same cluster.
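A minimal sketch of edit-distance clustering follows. It does not call a DBSCAN library; instead it groups names into connected components of the "edit distance below threshold" relation, which is comparable to what DBSCAN produces when every name is treated as a core point. That simplification, and all function and parameter names, are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_names(names: Sequence[str], max_distance: int) -> List[List[str]]:
    """Group names whose pairwise edit distance is below max_distance."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if edit_distance(names[i], names[j]) < max_distance:
                parent[find(j)] = find(i)  # merge the two groups

    clusters: Dict[int, List[str]] = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())
```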
In still another example, any two candidate entity names may be clustered under the same cluster when the edit distance of the two candidate entity names is less than a preset edit distance similarity threshold and the semantic distance is less than a preset semantic similarity threshold.
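The combined condition can be sketched as a predicate over two distances. The semantic vectors would come from a DSSM or another model; here they are plain lists supplied by the caller, cosine distance is used as a stand-in semantic distance, and both threshold values in the test are made up for illustration.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def belongs_together(
    edit_dist: int,
    semantic_dist: float,
    max_edit: int,
    max_semantic: float,
) -> bool:
    """Both distances must be below their respective thresholds."""
    return edit_dist < max_edit and semantic_dist < max_semantic
```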
Of course, other similarity calculation may be used to determine the similarity between the candidate entity names, which may be within the protection scope of the present embodiment, and this is not exhaustive.
The candidate entity names obtained by the document materials are filtered in advance, the filtered candidate entity names are further clustered, and M candidate clusters corresponding to the target departments are obtained, so that the influence of part of candidate entity names on the final identification of the target entity names can be reduced, and the final determination of the target entity names based on the candidate clusters is more accurate.
In S103, determining, based on the candidate entity names respectively contained in the M candidate clusters, the target entity names of the M first-type entities corresponding to the target department in the relationship graph includes:
acquiring frequency information of the candidate entity names contained in an i-th candidate cluster among the M candidate clusters;
and taking the candidate entity name with the highest frequency among the candidate entity names contained in the i-th candidate cluster as the target entity standard name of the i-th first-type entity corresponding to the i-th candidate cluster, and taking the entity names in the i-th candidate cluster other than the target entity standard name as target entity aliases of the i-th first-type entity.
The ith candidate cluster may be any one of M candidate clusters, and since the corresponding target entity name is determined in the same manner for each candidate cluster, only one candidate cluster is described herein, and the processing manners of the remaining candidate clusters are the same, which is not described in detail.
By adopting the processing, based on the frequency information of each candidate entity name in the ith candidate cluster, one candidate entity name with the highest occurrence frequency is selected as the target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and the rest candidate entity names in the ith candidate cluster are all used as the entity aliases of the ith first-class entity. In this way, each candidate cluster may obtain one or more entity aliases for the corresponding first class entity, but may only obtain one target entity standard name.
Because a target department can construct and obtain a plurality of candidate clusters, each candidate cluster can be considered to correspond to a first type entity, and the target entity standard name and the target entity alias of the first type entity can be determined based on one candidate cluster; and finally, obtaining target entity standard names and target entity aliases respectively corresponding to a plurality of first type entities of the target department.
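The frequency-based selection described above can be sketched in a few lines of Python. The function name, the input shapes, and the alphabetical tie-break are illustrative assumptions; the source only states that the most frequent candidate entity name becomes the target entity standard name and the remaining names become target entity aliases.

```python
from collections import Counter


def name_entities(clusters, mentions):
    """For each candidate cluster, pick the most frequent name as the
    standard name and treat the remaining names as aliases.

    clusters: list of lists of candidate entity names
    mentions: every candidate name occurrence extracted from the documents
    """
    counts = Counter(mentions)
    entities = []
    for cluster in clusters:
        if not cluster:
            continue  # nothing to name
        # Highest frequency first; ties broken alphabetically (an assumption).
        ranked = sorted(cluster, key=lambda n: (-counts[n], n))
        entities.append({"standard_name": ranked[0], "aliases": ranked[1:]})
    return entities


mentions = ["data platform"] * 3 + ["DP"] * 2 + ["data platfrom"]
entities = name_entities([["data platfrom", "DP", "data platform"]], mentions)
```

Each cluster yields exactly one standard name and zero or more aliases, which matches the one-to-one correspondence between candidate clusters and first type entities described in the text.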
Therefore, with this scheme, the target entity standard name and one or more target entity aliases of a matter can be finally determined based on the constructed candidate cluster. This provides a more accurate expression when the entity of the matter is constructed in the relationship graph, and, because the target entity alias information is added, more reference information is available for searching in the subsequent generalization process, so that the relationship graph is more accurate and more convenient to use.
Based on the above processing, the target entity name of the first type entity in the relationship graph can be obtained, and further, the relationship between the event and the related second type entity can be obtained, so that the relationship between the target entity name of the first type entity in the relationship graph and the related second type entity is constructed. Specifically, the method can comprise the following steps:
Acquiring a second type entity associated with a kth first type entity from document materials respectively corresponding to target entity names of the kth first type entity in the M first type entities, and establishing an association relationship between the kth first type entity and the second type entity in the relationship graph based on the second type entity associated with the kth first type entity; wherein k is an integer of 1 or more and M or less.
Specifically, each of the M first type entities may include a target entity standard name and one or more target entity aliases; one or more document materials corresponding to the target entity standard name and to the one or more target entity aliases can be looked up, and one or more second type entities are extracted from those document materials. In this way, the related second type entities having a relationship with each first type entity can be obtained.
Wherein, the second kind of entity may specifically refer to a "person" entity in the relationship graph.
Further, a relationship between the first type of entity and the related second type of entity having the association relationship can be established in the relationship map. That is, one or more second-class entities having a relationship with each first-class entity are obtained first, and then the association relationship between each first-class entity and the one or more second-class entities related thereto is added to the relationship map.
Wherein the second type entity can be a person, and the person can be represented as a name of the person in the relationship graph; in addition, the second type of entity, such as a person, may also include related attribute information or referred to as entity information, e.g., may include a person's position, title, etc., which is not intended to be exhaustive.
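The association step described above can be sketched as follows. This is only an illustration under stated assumptions: document materials are plain strings, a document "corresponds" to an entity when it contains the standard name or an alias as a substring, and `roster_extractor` is a hypothetical stand-in for whatever person-recognition method is actually used.

```python
def roster_extractor(roster):
    """Hypothetical person extractor: returns roster names found in the text."""
    def extract(text):
        return {person for person in roster if person in text}
    return extract


def link_people(entities, documents, extract_people):
    """Map each first type entity (by standard name) to the person entities
    found in documents that mention its standard name or any alias."""
    graph = {}
    for entity in entities:
        names = [entity["standard_name"], *entity["aliases"]]
        people = set()
        for doc in documents:
            if any(name in doc for name in names):
                people.update(extract_people(doc))
        graph[entity["standard_name"]] = sorted(people)
    return graph


entities = [{"standard_name": "data platform", "aliases": ["DP"]}]
documents = [
    "Zhang San reported progress on the data platform.",
    "Li Si reviewed DP milestones.",
    "Wang Wu booked a meeting room.",
]
graph = link_people(entities, documents,
                    roster_extractor(["Zhang San", "Li Si", "Wang Wu"]))
```

Plain substring matching on short aliases can produce false matches, so a real system would likely need stricter matching; attribute information such as a person's position or title could be stored alongside each name in the same structure.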
Therefore, the related second type entities in the relationship graph can be determined through the names of the events, which completes the construction of the relationship graph. Moreover, because the entity names of the events are constructed from the same materials from which the related second type entities are acquired, the relationships between the events in the relationship graph and the related second type entities can be constructed once the event entities have been analyzed, which improves the efficiency of constructing the relationship graph.
The embodiment of the invention also provides an entity information processing device, as shown in fig. 3, comprising:
the identifying module 31 is configured to identify N document materials of a target department, and obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
a clustering module 32, configured to generate M candidate clusters corresponding to the target department based on candidate entity names corresponding to the N document materials respectively; m is an integer greater than or equal to 1;
the entity name determining module 33 is configured to determine target entity names of M first type entities corresponding to the target department in the relationship graph based on candidate entity names respectively included in the M candidate clusters.
The identification module 31 is configured to input a j-th document material in the N document materials of the target department and a target department corresponding to the j-th document material into a preset model, to obtain a candidate entity name corresponding to the j-th document material output by the preset model; wherein j is an integer of 1 or more and N or less.
On the basis of fig. 3, the information processing apparatus provided in this embodiment, as shown in fig. 4, further includes:
the training set construction module 34 is configured to obtain historical candidate entity names corresponding to a plurality of departments respectively; matching the historical document material of each department with the historical candidate entity name of the corresponding department to obtain the historical entity name corresponding to the historical document material of each department; and generating a training set based on the historical document materials of each department and the corresponding historical entity names.
As shown in fig. 4, the apparatus further includes:
the model training module 35 is configured to train the preset model based on the historical document materials of each department and the corresponding historical entity names contained in the training set, so as to obtain the trained preset model.
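The work of the training set construction module 34 can be sketched as follows. The data shapes and the substring-based matching rule are assumptions made for illustration; the source only states that the historical document material of each department is matched against the historical candidate entity names of that department to obtain labeled samples.

```python
def build_training_set(history_docs, history_names):
    """Label historical documents with the entity names they mention.

    history_docs:  list of (department, text) pairs
    history_names: dict mapping department -> historical candidate entity names
    Returns one sample per document that matches at least one name.
    """
    samples = []
    for department, text in history_docs:
        # Only names belonging to the same department are considered.
        matched = [name for name in history_names.get(department, [])
                   if name in text]
        if matched:
            samples.append({
                "department": department,
                "text": text,
                "entity_names": matched,
            })
    return samples


history_names = {"infra": ["data platform"], "hr": ["campus hiring"]}
history_docs = [
    ("infra", "This week the data platform finished its migration."),
    ("hr", "The data platform team joined campus hiring interviews."),
    ("hr", "No notable updates."),
]
training_set = build_training_set(history_docs, history_names)
```

Each resulting sample pairs a document and its department with the matched entity names, which is the form of input and label the preset model is described as consuming during training.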
The clustering module 32 is configured to screen L candidate entity names from N candidate entity names corresponding to the N document materials respectively; l is an integer of 1 or more and N or less; clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters in the M candidate clusters comprise different candidate entity names.
The entity name determining module 33 is configured to obtain frequency information of candidate entity names included in an i-th candidate cluster in the M candidate clusters; wherein i is an integer of 1 or more and M or less; and taking the candidate entity name with the highest frequency information among the candidate entity names contained in the ith candidate cluster as a target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and taking other entity names except the target standard entity name in the ith candidate cluster as target entity aliases of the ith first-class entity.
As shown in fig. 4, the apparatus further includes:
a relationship construction module 36, configured to obtain a second type entity associated with a kth first type entity from document materials corresponding to target entity names of the kth first type entity in the M first type entities respectively; establishing an association relationship between the kth first-class entity and the second-class entity in the relationship graph based on the second-class entity associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
According to embodiments of the present application, there is also provided an electronic device, a readable storage medium and a computer program product.
Fig. 5 is a block diagram of an electronic device for the information processing method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors 701, memory 702, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 701 is illustrated in fig. 5.
Memory 702 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the information processing method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the entity information processing method provided by the present application.
The memory 702 is used as a non-transitory computer readable storage medium, and can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the identification module, the clustering module, the entity name determination module, the training set construction module, and the model training module shown in fig. 4) corresponding to the information processing method in the embodiments of the present application. The processor 701 executes various functional applications of the server and data processing, i.e., implements the information processing method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 702.
Memory 702 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for functionality, and the storage data area may store data created through use of the electronic device for the information processing method, and the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 702 optionally includes memory remotely located relative to processor 701, which may be connected to the electronic device for the information processing method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the information processing method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 5.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, joystick, and like input devices. The output device 704 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and Virtual Private Server (VPS) service. The server may also be a server of a distributed system or a server that incorporates a blockchain.
According to the technical scheme of the embodiment of the application, the candidate entity names corresponding to the document materials are determined based on the document materials of the departments, and then one or more target entity names corresponding to each department in the relation graph are determined based on the candidate entity names, so that the target entity names of the departments contained in the relation graph can be finally determined only by collecting the document materials of the departments, the problems of low efficiency, poor timeliness, inaccurate results and the like caused by manual analysis can be avoided, the processing efficiency and accuracy of acquiring the target entity names are guaranteed, and the efficiency and accuracy of constructing or updating the relation graph are further guaranteed.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of the flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.
Claims (14)
1. An entity information processing method, comprising:
identifying N document materials of a target department to obtain candidate entity names respectively corresponding to the N document materials; n is an integer greater than or equal to 1;
generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
determining target entity names of M first-class entities corresponding to the target departments in a relation map based on candidate entity names respectively contained in the M candidate clusters, wherein the first-class entities comprise facts in the relation map;
the determining, based on the candidate entity names respectively contained in the M candidate clusters, target entity names of M first-class entities corresponding to the target departments in the relationship graph includes:
acquiring frequency information of candidate entity names contained in an ith candidate cluster in the M candidate clusters; wherein i is an integer of 1 or more and M or less;
and taking the candidate entity name with the highest frequency information among the candidate entity names contained in the ith candidate cluster as a target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and taking other entity names except the target entity standard name in the ith candidate cluster as target entity aliases of the ith first-class entity.
2. The method of claim 1, wherein the identifying N document materials of the target department to obtain candidate entity names corresponding to the N document materials respectively includes:
inputting a j-th document material in the N document materials of the target departments and the corresponding target departments into a preset model to obtain candidate entity names corresponding to the j-th document material output by the preset model; wherein j is an integer of 1 or more and N or less.
3. The method of claim 2, wherein the method further comprises:
acquiring historical candidate entity names corresponding to a plurality of departments respectively;
matching the historical document material of each department with the historical candidate entity name of the corresponding department to obtain the historical entity name corresponding to the historical document material of each department;
and generating a training set based on the historical document materials of each department and the corresponding historical entity names.
4. A method according to claim 3, wherein the method further comprises:
training the preset model based on the historical document materials of each department and the corresponding historical entity names contained in the training set to obtain the trained preset model.
5. The method of claim 1, wherein the determining M candidate clusters corresponding to the target department based on the candidate entity names corresponding to the N document materials, respectively, comprises:
screening from N candidate entity names respectively corresponding to the N document materials to obtain L candidate entity names; l is an integer of 1 or more and N or less;
clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters in the M candidate clusters comprise different candidate entity names.
6. The method of any of claims 1-5, wherein the method further comprises:
obtaining a second type entity associated with a kth first type entity from document materials respectively corresponding to target entity names of the kth first type entity in the M first type entities, wherein the second type entity comprises a person entity in the relationship graph; establishing an association relationship between the kth first-class entity and the second-class entity in the relationship graph based on the second-class entity associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
7. An entity information processing apparatus comprising:
the identification module is used for identifying N document materials of the target department to obtain candidate entity names respectively corresponding to the N document materials; n is an integer greater than or equal to 1;
the clustering module is used for generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
the entity name determining module is used for determining target entity names of M first-class entities corresponding to the target departments in a relation map based on candidate entity names respectively contained in the M candidate clusters, wherein the first-class entities comprise facts in the relation map;
the entity name determining module is used for obtaining frequency information of candidate entity names contained in an ith candidate cluster in the M candidate clusters; wherein i is an integer of 1 or more and M or less; and taking the candidate entity name with the highest frequency information among the candidate entity names contained in the ith candidate cluster as a target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and taking other entity names except the target entity standard name in the ith candidate cluster as target entity aliases of the ith first-class entity.
8. The device of claim 7, wherein the identification module is configured to input a j-th document material in the N document materials of the target department and a target department corresponding to the j-th document material into a preset model, and obtain a candidate entity name corresponding to the j-th document material output by the preset model; wherein j is an integer of 1 or more and N or less.
9. The apparatus of claim 8, wherein the apparatus further comprises:
the training set construction module is used for acquiring historical candidate entity names corresponding to a plurality of departments respectively; matching the historical document material of each department with the historical candidate entity name of the corresponding department to obtain the historical entity name corresponding to the historical document material of each department; and generating a training set based on the historical document materials of each department and the corresponding historical entity names.
10. The apparatus of claim 9, wherein the apparatus further comprises:
the model training module is used for training the preset model based on the historical document materials of each department and the corresponding historical entity names contained in the training set to obtain the trained preset model.
11. The apparatus of claim 7, wherein the clustering module is configured to screen L candidate entity names from N candidate entity names corresponding to the N document materials, respectively; l is an integer of 1 or more and N or less; clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters in the M candidate clusters comprise different candidate entity names.
12. The apparatus according to any one of claims 7-11, wherein the apparatus further comprises:
the relationship construction module is used for acquiring a second type entity associated with a kth first type entity from document materials corresponding to target entity names of the kth first type entity in the M first type entities respectively, wherein the second type entity comprises a person entity in the relationship graph; establishing an association relationship between the kth first-class entity and the second-class entity in the relationship graph based on the second-class entity associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011196563.4A CN112307134B (en) | 2020-10-30 | 2020-10-30 | Entity information processing method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112307134A (en) | 2021-02-02 |
CN112307134B (en) | 2024-02-06 |
Family
ID=74333114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011196563.4A Active CN112307134B (en) | 2020-10-30 | 2020-10-30 | Entity information processing method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112307134B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114118087A (en) * | 2021-10-18 | 2022-03-01 | 广东明创软件科技有限公司 | Entity determination method, entity determination device, electronic equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100877477B1 (en) * | 2007-06-28 | 2009-01-07 | 주식회사 케이티 | Apparatus and method for recognizing the named entity using backoff n-gram features |
CN106776711A (en) * | 2016-11-14 | 2017-05-31 | 浙江大学 | A kind of Chinese medical knowledge mapping construction method based on deep learning |
CN106909655A (en) * | 2017-02-27 | 2017-06-30 | 中国科学院电子学研究所 | Found and link method based on the knowledge mapping entity that production alias is excavated |
US9785696B1 (en) * | 2013-10-04 | 2017-10-10 | Google Inc. | Automatic discovery of new entities using graph reconciliation |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | 昆明理工大学 | A kind of domain entities disambiguation method for merging term vector and topic model |
CN110263318A (en) * | 2018-04-23 | 2019-09-20 | 腾讯科技(深圳)有限公司 | Processing method, device, computer-readable medium and the electronic equipment of entity name |
CN110277149A (en) * | 2019-06-28 | 2019-09-24 | 北京百度网讯科技有限公司 | Processing method, device and the equipment of electronic health record |
CN110334211A (en) * | 2019-06-14 | 2019-10-15 | 电子科技大学 | A kind of Chinese medicine diagnosis and treatment knowledge mapping method for auto constructing based on deep learning |
CN111723575A (en) * | 2020-06-12 | 2020-09-29 | 杭州未名信科科技有限公司 | Method, device, electronic equipment and medium for recognizing text |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9189473B2 (en) * | 2012-05-18 | 2015-11-17 | Xerox Corporation | System and method for resolving entity coreference |
US10643120B2 (en) * | 2016-11-15 | 2020-05-05 | International Business Machines Corporation | Joint learning of local and global features for entity linking via neural networks |
Non-Patent Citations (3)
Title |
---|
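Research on an entity search model based on coreference resolution; Xiong Ling; Xu Zengzhuang; Wang Xiaobin; Hong Yu; Zhu Qiaoming; Journal of Chinese Information Processing (05); 94-101 * |
A survey of entity linking research; Lu Wei; Wu Chuan; Journal of the China Society for Scientific and Technical Information (01); 107-114 * |
A joint model for entity alias extraction in tourism scenarios; Yang Yifan; Chen Wenliang; Journal of Chinese Information Processing (06); 59-67 * |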
Also Published As
Publication number | Publication date |
---|---|
CN112307134A (en) | 2021-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113807098B (en) | Model training method and device, electronic equipment and storage medium | |
CN111428049B (en) | Event thematic generation method, device, equipment and storage medium | |
CN111967262A (en) | Method and device for determining entity tag | |
CN112541359B (en) | Document content identification method, device, electronic equipment and medium | |
CN112507068A (en) | Document query method and device, electronic equipment and storage medium | |
CN110020422A (en) | Method, device and server for determining feature words | |
CN111767334B (en) | Information extraction method, device, electronic equipment and storage medium | |
CN111078878B (en) | Text processing method, device, equipment and computer readable storage medium | |
CN111522967A (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN111538815B (en) | Text query method, device, equipment and storage medium | |
CN111263943B (en) | Semantic normalization in document digitization | |
CN113220836A (en) | Training method and device of sequence labeling model, electronic equipment and storage medium | |
CN111539209B (en) | Method and apparatus for entity classification | |
CN110569370B (en) | Knowledge graph construction method and device, electronic equipment and storage medium | |
CN112380847A (en) | Interest point processing method and device, electronic equipment and storage medium | |
CN111125176A (en) | Service data searching method and device, electronic equipment and storage medium | |
US20210216713A1 (en) | Method, apparatus, device and storage medium for intelligent response | |
CN111090991A (en) | Scene error correction method and device, electronic equipment and storage medium | |
CN112084150B (en) | Model training and data retrieval method, device, equipment and storage medium | |
CN111241302B (en) | Position information map generation method, device, equipment and medium | |
CN110717025B (en) | Question answering method and device, electronic equipment and storage medium | |
CN111738015A (en) | Method and device for analyzing emotion polarity of article, electronic equipment and storage medium | |
CN112307134B (en) | Entity information processing method, device, electronic equipment and storage medium | |
CN111666417A (en) | Method and device for generating synonyms, electronic equipment and readable storage medium | |
CN113361240A (en) | Method, device, equipment and readable storage medium for generating target article |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||