CN116127097A - Structured text relation extraction method, device and equipment
- Publication number
- CN116127097A (application CN202310136023.4A)
- Authority
- CN
- China
- Prior art keywords
- entity
- data
- model
- relation
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/367 - Information retrieval of unstructured textual data; creation of semantic tools: Ontology
- G06F16/35 - Information retrieval of unstructured textual data: Clustering; Classification
- G06F40/211 - Natural language analysis: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/279 - Natural language analysis: Recognition of textual entities
- G06F40/30 - Handling natural language data: Semantic analysis
- G06N3/084 - Neural network learning methods: Backpropagation, e.g. using gradient descent
- Y02D10/00 - Climate change mitigation in ICT: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a structured text relation extraction method, device, equipment and storage medium in the technical field of artificial intelligence and natural language processing. The method comprises the following steps. Step 1: design a schema for the relation extraction data; to standardize the expression of the structured data, every relation extraction record must conform to the entity objects and types predefined by the schema; deduplicate and label the data, and construct the training, validation and test sets of the model. Step 2: construct a relation extraction model based on deep learning. Step 3: train the deep learning model with the training set data and save the model weights that perform best on the validation set. Step 4: extract relation triples from the data to be tested with the saved model. Step 5: store the extracted entity relation triples in structured form. This technique can extract knowledge triples from text information, extract high-level abstract features from the data, and provide technical support for the construction of knowledge graphs.
Description
Technical Field
The application relates to the technical field of artificial intelligence and natural language processing, and in particular to a method, device, equipment and storage medium for joint entity-relation extraction based on entity pairs with a deep learning model.
Background
Information extraction is the technique of extracting entities, relations, events and other information of specified types from natural language text and producing structured data as output. The specific tasks involved are Named Entity Recognition (NER), Relation Extraction and Event Extraction. Relation extraction is an important task in information extraction: it finds the relationship between a subject and an object in the data and represents it as an entity relation triple (head entity, relation, tail entity), i.e. (subject, predicate, object), abbreviated (s, p, o). For example, the sentence "Zhang San was born in Beijing" yields the triple (Zhang San, place of birth, Beijing). The extracted triple data can be used to construct knowledge graphs, which in turn support applications such as information retrieval and intelligent question answering.
Existing relation extraction mainly follows two schemes. 1. Pipeline method: first use a Named Entity Recognition (NER) model to extract all entities from the text, then pair up the candidate entities and use a multi-class classification model to judge which relation type holds between each entity pair. 2. Joint extraction method: the model uses the interaction information between entities and relations to perform entity recognition and relation classification at the same time; this one-pass approach effectively reduces the error propagation caused by the task ordering of the pipeline method. Joint extraction methods can generally be divided into "parameter-sharing joint models" and "structured prediction", but they still have the following problems:
(1) Under parameter sharing, entities and relations share an encoder, but in the decoding stage the extraction of subjects, objects and relations is not synchronous: the encoder output is first used to identify the head entity (subject), then the features of the subject are used to identify the corresponding tail entity (object), and finally the relation type is identified from the features of the subject and object, so true "joint" extraction is not achieved.
(2) Since the parameter-sharing approach does not truly combine entities and relations, relation extraction researchers have also proposed complex joint decoding algorithms that do not explicitly divide decoding into several steps. However, such methods require designing a rather complex decoding process and do not perform well on the triple overlap problem.
Disclosure of Invention
Aiming at the problems in existing relation extraction technology, the application provides a text relation extraction method, device and equipment based on deep learning, which adopt a deep learning algorithm built on the BERT-base pre-trained model. By constructing the data and modeling entities and the relations between entity pairs in a token-pair manner, the accuracy of relation extraction can be improved while maintaining reasonable speed, and the triple overlap problem can be effectively solved.
The technical scheme adopted for solving the technical problems is as follows:
S1, data construction and preprocessing;
S2, after preprocessing, dividing the data into a training set, a validation set and a test set, used respectively to train the deep learning model, to validate it and save the optimal trained model, and to test the model;
S3, building the deep learning relation extraction model;
S4, passing the encoded input through the entity recognition layer and the relation discrimination layer simultaneously, and obtaining the loss value of the model;
S5, updating the model parameters through back propagation and gradient descent;
S6, extracting triple information from unlabeled data with the trained model, mining the semantic information contained in the sentences;
S7, storing the obtained results in structured form.
Further, in step 1, text data are collected for training the model. For relation extraction data, a schema needs to be designed that specifies the concrete categories of the relation triples: the subject (head entity) type, the predicate (relation) type and the object (tail entity) type, together with their correspondence. The data are deduplicated through a code-defined data preprocessing class to construct the text dataset. The data are stored as JSON files; each sample exists as key-value pairs and must contain the corresponding text and a list spos of relation triples. The spos list contains one or more entity relation records, each in the form subject, predicate, object, denoting the head entity, the relation and the tail entity respectively, together with the span indices of the subject (head entity) and the object (tail entity) in the text.
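As an illustration, one stored sample might look like the following sketch; the exact key names (text, spo_list, subject_span, object_span) are assumptions for illustration, not taken from the filing:

```python
import json

# Hypothetical sample record. The field names follow the description above
# (a text plus a list of subject/predicate/object records with span indices),
# but the exact keys are illustrative assumptions.
sample = {
    "text": "Zhang San was born in Beijing.",
    "spo_list": [
        {
            "subject": "Zhang San",         # head entity
            "predicate": "place_of_birth",  # relation type defined by the schema
            "object": "Beijing",            # tail entity
            "subject_span": [0, 9],         # [start, end) of the head entity in text
            "object_span": [22, 29],        # [start, end) of the tail entity in text
        }
    ],
}

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)
```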
In the second step, the training set is used to train the deep learning model; the validation set is used to validate the model during training, and the weights with the highest evaluation score on the validation set are saved; the saved parameters with the highest validation score are then used for testing on the test set.
In the third step, the construction of the deep learning model mainly comprises the following steps:
3.1) A pre-trained language model such as BERT-base is used as the encoder of the relation extraction model, which mainly comprises the following parts: 1. a head-entity recognition network layer and a tail-entity recognition network layer, which together form the entity extraction module; 2. a layer that judges the relation type from the head entity and the tail entity; 3. an auxiliary task added on top of the main relation extraction task, used to post-process the number of extracted relation triples, forming multi-task learning and increasing the robustness of the model.
3.2) Each sample in the training data is split into characters, since splitting by word may cause entities in the data to be absent from the dictionary, i.e. OOV (out of vocabulary). Let the current sentence be x; after splitting, the sequence x = [x_0, x_1, ..., x_{n-1}, x_n] is obtained. Following the input requirements of the BERT pre-trained language model, let x_0 = [CLS] and x_n = [SEP], where the [CLS] token is located at the beginning of the sentence and the [SEP] token at the end. The resulting token sequence is passed through the BERT model to obtain word embeddings that incorporate contextual semantic information.
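A minimal sketch of this encoding step with the Hugging Face transformers library; the bert-base-chinese checkpoint is an assumption, since the filing only specifies a BERT-base model:

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Assumption: a Chinese BERT-base checkpoint; the filing only says "Bert-base".
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

text = "张三出生于北京。"
# The tokenizer splits Chinese text character by character and automatically
# adds [CLS] at position 0 and [SEP] at the end, as required above.
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# out.hidden_states holds 13 tensors (embedding layer + 12 encoder layers),
# each of shape [batch, seq_len, 768]; the last four feed the weighted average.
h = out.hidden_states[-1]
print(h.shape)  # torch.Size([1, 10, 768]) for 8 characters + [CLS]/[SEP]
```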
3.3) When extracting the relation data in the text, the subject (head) entity and the object (tail) entity need to be recognized. On the basis of entity recognition, a token-pair (entity pair) scheme can be adopted, in which the head and the tail of an entity are judged as a whole. For entity recognition, two tensors N_1 and N_2 are used to construct the head-entity and tail-entity inputs; each has dimension [n, seq_len, seq_len], where the first dimension n is the number of entity types and the second and third dimensions correspond to the sentence length. If an entity belongs to the i-th class (i < n) and its position index in the text is (s, t), then N_j[i, s, t] = 1 (j = 1 or 2). For a sentence of length l there are n * l(l+1)/2 possible combinations, and features are constructed only for the entities that actually appear in the text, which reduces the complexity of the input data. Meanwhile, to model the relation between the two entities, two tensors R_1 and R_2 of dimension [r, seq_len, seq_len] are constructed in a similar way, where r is the number of relation categories. If the head entity has position index (s_1, t_1), the tail entity has position index (s_2, t_2), and the relation between them is the k-th class (k < r), then R_1[k, s_1, s_2] = 1 and R_2[k, t_1, t_2] = 1; R_1 and R_2 together indicate the match of the predicate (relation) according to the position information of the head-tail entity pair.
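A minimal sketch of building these label tensors in PyTorch, under the [n, seq_len, seq_len] and [r, seq_len, seq_len] conventions above; the toy spans and variable names are illustrative assumptions:

```python
import torch

n_entity_types, n_relations, seq_len = 3, 2, 16

# N1 marks (type, start, end) spans of head entities, N2 of tail entities.
N1 = torch.zeros(n_entity_types, seq_len, seq_len)
N2 = torch.zeros(n_entity_types, seq_len, seq_len)
# R1 matches the start positions of a head-tail pair, R2 their end positions.
R1 = torch.zeros(n_relations, seq_len, seq_len)
R2 = torch.zeros(n_relations, seq_len, seq_len)

# Hypothetical annotation: a head entity of type 0 spanning (1, 3), a tail
# entity of type 1 spanning (6, 8), linked by relation class 0.
head_type, (s1, t1) = 0, (1, 3)
tail_type, (s2, t2) = 1, (6, 8)
k = 0

N1[head_type, s1, t1] = 1
N2[tail_type, s2, t2] = 1
R1[k, s1, s2] = 1  # start of head vs. start of tail
R2[k, t1, t2] = 1  # end of head vs. end of tail
```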
3.4) The sequence x obtained in 3.2 is fed into the BERT-base model, which has 12 encoder layers; different encoding layers learn different semantic information, so the vectors output by the last four layers are weighted-averaged to obtain a sentence vector [h_1, h_2, ..., h_n] containing contextual semantics. Through the transformations q_{i,α} = W_{q,α} h_i + b_{q,α} and k_{i,α} = W_{k,α} h_i + b_{k,α}, the vector sequences [q_{1,α}, q_{2,α}, ..., q_{n,α}] and [k_{1,α}, k_{2,α}, ..., k_{n,α}] are obtained. Using these two sequences, a scoring function for entity recognition can be constructed, s_α(i, j) = q_{i,α}^T k_{j,α}, the inner product of q_{i,α} and k_{j,α}, where [i:j] is a continuous substring of the text that can form an entity. Through this scoring function, the head-entity and tail-entity branches of the model produce two score tensors e1 and e2 of dimension [n, seq_len, seq_len]. In the relation matching layer, the same scoring function can be used with the two relation matching feature tensors R_1 and R_2 constructed in 3.3 to model the relation between the entities.
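A minimal sketch of such a token-pair scoring layer in PyTorch (a GlobalPointer-style head; the class name, fused projection and head size are implementation assumptions):

```python
import torch
import torch.nn as nn

class TokenPairScorer(nn.Module):
    """Scores every span [i:j] for each of n_types classes via q_i . k_j."""

    def __init__(self, hidden_size: int, n_types: int, head_size: int = 64):
        super().__init__()
        self.n_types, self.head_size = n_types, head_size
        # One q/k transformation per type, fused into a single linear layer.
        self.proj = nn.Linear(hidden_size, n_types * head_size * 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq_len, hidden_size] sentence vectors from the encoder
        b, l, _ = h.shape
        qk = self.proj(h).view(b, l, self.n_types, 2, self.head_size)
        q, k = qk[..., 0, :], qk[..., 1, :]           # [b, l, n_types, head]
        # Inner product q_i . k_j for every token pair (i, j) and every type.
        return torch.einsum("bmth,bnth->btmn", q, k)  # [b, n_types, l, l]

scorer = TokenPairScorer(hidden_size=768, n_types=3)
e1 = scorer(torch.randn(1, 16, 768))  # e.g. head-entity scores
print(e1.shape)  # torch.Size([1, 3, 16, 16])
```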
3.5) The representations e1 and e2 from 3.4 are each passed through a fully connected (dense) layer to produce vector representations e1' and e2' of the entity pair; these are concatenated into a vector representation e. An attention score α_i = Attention(h_i, e) is then computed with the sentence vectors h output by BERT in 3.4, and finally the weighted sentence vector S = Σ_i α_i h_i is computed, yielding an enhanced sentence vector fused with entity information, which is used in the multi-task branch to predict the number of entity relations.
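A minimal sketch of this entity-aware attention pooling; the filing does not specify the exact form of Attention(h_i, e), so the additive scoring below is an assumption:

```python
import torch
import torch.nn as nn

class EntityAwarePooling(nn.Module):
    """Pools token vectors h into one sentence vector S = sum_i alpha_i * h_i,
    with alpha_i computed against an entity-pair representation e."""

    def __init__(self, hidden_size: int, entity_dim: int):
        super().__init__()
        # Assumption: additive attention over [h_i; e]; the filing leaves
        # the exact form of Attention(h_i, e) open.
        self.score = nn.Sequential(
            nn.Linear(hidden_size + entity_dim, hidden_size), nn.Tanh(),
            nn.Linear(hidden_size, 1))

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq_len, hidden], e: [batch, entity_dim]
        e_exp = e.unsqueeze(1).expand(-1, h.size(1), -1)
        alpha = torch.softmax(
            self.score(torch.cat([h, e_exp], dim=-1)).squeeze(-1), dim=-1)
        return torch.einsum("bl,blh->bh", alpha, h)  # S = sum_i alpha_i * h_i

pool = EntityAwarePooling(hidden_size=768, entity_dim=128)
S = pool(torch.randn(2, 16, 768), torch.randn(2, 128))
print(S.shape)  # torch.Size([2, 768])
```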
The fourth step comprises the following steps:
4.1) In a text of length l there are l(l+1)/2 different continuous subsequences, i.e. l(l+1)/2 candidate entities, each with two choices: 0 or 1. Since the number of triples in a text is not fixed, this becomes a multi-label classification problem over l(l+1)/2 classes, so the multi-label classification loss L = log(1 + Σ_{(s,t)∈P_α} e^{-s_α(s,t)}) + log(1 + Σ_{(s,t)∈Q_α} e^{s_α(s,t)}) is needed, where P_α is the set of head-tail spans of entities of type α, and Q_α is the set of head-tail spans that are not entities or are not of type α. Relation matching follows the same idea as entity recognition, except that the entity type is replaced by the relation type and the entity's position index is replaced by the position indices of the head and tail entities, so the same loss function is also adopted in the relation matching task.
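A minimal sketch of this multi-label loss in PyTorch, implementing the log(1 + Σ e^{-s}) + log(1 + Σ e^{s}) form above; the masking-by-large-negative trick is an implementation detail, not from the filing:

```python
import torch

def multilabel_span_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """scores, labels: [batch, n_types, seq_len, seq_len], labels in {0, 1}.

    Returns log(1 + sum_{pos} e^{-s}) + log(1 + sum_{neg} e^{s}) per type,
    averaged over the batch; the "1 +" term is an appended zero logit.
    """
    scores = scores.flatten(2)  # [b, n_types, seq_len * seq_len]
    labels = labels.flatten(2)
    # Positive spans contribute -s, negative spans +s; entries belonging to
    # the other set are pushed to -1e12 so exp() makes them vanish.
    pos = torch.where(labels == 1, -scores, torch.full_like(scores, -1e12))
    neg = torch.where(labels == 0, scores, torch.full_like(scores, -1e12))
    zero = torch.zeros_like(pos[..., :1])
    pos_term = torch.logsumexp(torch.cat([pos, zero], dim=-1), dim=-1)
    neg_term = torch.logsumexp(torch.cat([neg, zero], dim=-1), dim=-1)
    return (pos_term + neg_term).mean()

loss = multilabel_span_loss(torch.randn(2, 3, 16, 16),
                            torch.randint(0, 2, (2, 3, 16, 16)).float())
```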
4.2) The auxiliary task of judging the number of triples is a multi-class classification; for this task the loss value can be calculated with the commonly used cross-entropy loss function.
Step five: model parameters are updated by back propagation and gradient descent.
Step six: unlabeled data are predicted with the trained model and new relation triples are extracted, which specifically comprises the following steps:
6.1) Using the scoring function of 3.4, the input data are modeled during training so that the scores of the labeled subject entities, object entities, and the relations between them are greater than 0. In the prediction process, it suffices to enumerate all candidate entities, use the scoring function to keep subjects with score greater than 0 and objects with score greater than 0, and then keep the extracted subject-object pairs whose relation matching score is greater than 0; the triples satisfying these conditions are the final output required by the user.
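A minimal sketch of this threshold-at-zero decoding; the function signature and score-tensor layout follow the sketches above and are assumptions:

```python
import torch

def decode_triples(e1, e2, r1, r2):
    """e1, e2: [n_types, l, l] subject/object span scores;
    r1, r2: [n_rel, l, l] start-pair and end-pair relation scores."""
    subjects = list(zip(*torch.where(e1 > 0)))  # (type, start, end) with score > 0
    objects = list(zip(*torch.where(e2 > 0)))
    triples = []
    for _, s1, t1 in subjects:
        for _, s2, t2 in objects:
            for k in range(r1.size(0)):
                # Relation k holds only if both the start pair and the end
                # pair of the subject/object spans match with score > 0.
                if r1[k, s1, s2] > 0 and r2[k, t1, t2] > 0:
                    triples.append(((int(s1), int(t1)), int(k), (int(s2), int(t2))))
    return triples

trips = decode_triples(torch.randn(3, 16, 16), torch.randn(3, 16, 16),
                       torch.randn(2, 16, 16), torch.randn(2, 16, 16))
```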
Step 7, the model output results are written into a MySQL database according to the corresponding relation for data storage, which facilitates practical deployment in knowledge graph construction and intelligent question answering.
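A minimal sketch of this storage step; the table layout, column names and connection parameters are assumptions, since the filing only names MySQL:

```python
import pymysql  # assumption: any standard MySQL client would do

conn = pymysql.connect(host="localhost", user="root",
                       password="***", database="kg")
with conn.cursor() as cur:
    # Hypothetical table layout for (subject, predicate, object) triples.
    cur.execute(
        """CREATE TABLE IF NOT EXISTS triples (
               id INT AUTO_INCREMENT PRIMARY KEY,
               subject VARCHAR(255), predicate VARCHAR(255),
               object VARCHAR(255), source_text TEXT)"""
    )
    cur.executemany(
        "INSERT INTO triples (subject, predicate, object, source_text) "
        "VALUES (%s, %s, %s, %s)",
        [("Zhang San", "place_of_birth", "Beijing",
          "Zhang San was born in Beijing.")],
    )
conn.commit()
conn.close()
```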
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of preprocessing a data set provided in an embodiment of the present application;
fig. 2 is a flowchart of a text relationship extraction model according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
For ease of understanding, the text relation extraction method based on a deep learning model provided in the embodiments of the present application is first described in detail.
Referring to Figs. 1 and 2, the structured text relation extraction method, device and equipment based on a deep learning model comprise the following steps:
step 1, firstly, text data needs to be collected, a text data set is constructed, and preprocessing is carried out.
The data are organized according to the designed schema, whose purpose is to normalize the expression of the structured data: every extracted relation record must satisfy the entity objects and types predefined by the schema. The data are deduplicated through a code-defined data preprocessing class to construct the text dataset. The data are stored as JSON files; each sample exists as key-value pairs and must contain the corresponding text and a list spos of relation triples. The spos list contains one or more entity relation records, each in the form subject, predicate, object, denoting the head entity, the relation and the tail entity respectively, together with the span indices of the head entity and the tail entity in the text. A sketch of the deduplication step follows.
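A minimal sketch of the deduplication such a preprocessing class might perform; the class interface and key names are assumptions:

```python
import json

class RelationDataPreprocessor:
    """Deduplicates raw samples and writes the cleaned dataset to JSON."""

    def deduplicate(self, samples):
        seen, unique = set(), []
        for s in samples:
            # Key each sample on its text plus its sorted (s, p, o) triples,
            # so exact repeats of the same annotated sentence are dropped.
            key = (s["text"], tuple(sorted(
                (spo["subject"], spo["predicate"], spo["object"])
                for spo in s["spo_list"])))
            if key not in seen:
                seen.add(key)
                unique.append(s)
        return unique

    def save(self, samples, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(samples, f, ensure_ascii=False, indent=2)
```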
Step 2, the data are divided to obtain a training set, a validation set and a test set.
Step 3, building the deep learning relation extraction model.
3.1 A pre-trained language model such as BERT-base is used as the encoder of the relation extraction model, which mainly comprises the following parts: 1. a head-entity recognition network layer and a tail-entity recognition network layer, which together form the entity extraction module; 2. a layer that judges the relation type from the subject head entity and the object tail entity; 3. an auxiliary task added on top of the main relation extraction task, used to post-process the number of extracted relation triples, forming multi-task learning and increasing the robustness of the model.
3.2 Each sample in the training data is split into characters, since splitting by word may cause entities in the data to be absent from the dictionary, i.e. OOV (out of vocabulary). Let the current sentence be x; after splitting, the sequence x = [x_0, x_1, ..., x_{n-1}, x_n] is obtained. Following the input requirements of the BERT pre-trained language model, let x_0 = [CLS] and x_n = [SEP], where the [CLS] token is located at the beginning of the sentence and the [SEP] token at the end. The resulting token sequence is passed through the BERT model to obtain word embeddings that incorporate contextual semantic information.
3.3 When extracting the relation data in the text, the subject (head) entity and the object (tail) entity need to be recognized. On the basis of entity recognition, a token-pair scheme can be adopted in which the head and the tail of an entity are judged as a whole. For entity recognition, two tensors N_1 and N_2 are used to construct the head-entity and tail-entity inputs; each has dimension [n, seq_len, seq_len], where the first dimension n is the number of entity types and the second and third dimensions correspond to the sentence length. If an entity belongs to the i-th class (i < n) and its position index in the text is (s, t), then N_j[i, s, t] = 1 (j = 1 or 2). For a sentence of length l there are n * l(l+1)/2 possible combinations, and features are constructed only for the entities that actually appear in the text, which reduces the complexity of the input data. Meanwhile, to model the relation between the two entities, two tensors R_1 and R_2 of dimension [r, seq_len, seq_len] are constructed in a similar way, where r is the number of relation categories. If the head entity has position index (s_1, t_1), the tail entity has position index (s_2, t_2), and the relation between them is the k-th class (k < r), then R_1[k, s_1, s_2] = 1 and R_2[k, t_1, t_2] = 1; R_1 and R_2 together indicate the match of the predicate (relation) according to the position information of the head-tail entity pair.
3.4 The sequence x obtained in 3.2 is fed into the BERT-base model, which has 12 encoder layers; different encoding layers learn different semantic information, so the vectors output by the last four layers are weighted-averaged to obtain a sentence vector [h_1, h_2, ..., h_n] containing contextual semantics. Through the transformations q_{i,α} = W_{q,α} h_i + b_{q,α} and k_{i,α} = W_{k,α} h_i + b_{k,α}, the vector sequences [q_{1,α}, q_{2,α}, ..., q_{n,α}] and [k_{1,α}, k_{2,α}, ..., k_{n,α}] are obtained. Using these two sequences, a scoring function for entity recognition can be constructed, s_α(i, j) = q_{i,α}^T k_{j,α}, the inner product of q_{i,α} and k_{j,α}, where [i:j] is a continuous substring of the text that can form an entity. Through this scoring function, the head-entity and tail-entity branches of the model produce two score tensors e1 and e2 of dimension [n, seq_len, seq_len]. In the relation matching layer, the same scoring function can be used with the two relation matching feature tensors R_1 and R_2 constructed in 3.3 to model the relation between the entities.
3.5 The representations e1 and e2 from 3.4 are each passed through a fully connected (dense) layer to produce vector representations e1' and e2' of the entity pair; these are concatenated into a vector representation e. An attention score α_i = Attention(h_i, e) is then computed with the sentence vectors h output by BERT in 3.4, and finally the weighted sentence vector S = Σ_i α_i h_i is computed, yielding an enhanced sentence vector fused with entity information, which is used in the multi-task branch to predict the number of entity relations.
Step 4, the encoded representations pass through the entity recognition layer and the relation discrimination layer, and the loss value of the model is obtained.
In a text of length l there are l(l+1)/2 different continuous subsequences, i.e. l(l+1)/2 candidate entities, each with two choices: 0 or 1. Since the number of triples in a text is not fixed, this becomes a multi-label classification problem over l(l+1)/2 classes, so the multi-label classification loss L = log(1 + Σ_{(s,t)∈P_α} e^{-s_α(s,t)}) + log(1 + Σ_{(s,t)∈Q_α} e^{s_α(s,t)}) is needed, where P_α is the set of head-tail spans of entities of type α, and Q_α is the set of head-tail spans that are not entities or are not of type α. Relation matching follows the same idea as entity recognition, except that the entity type is replaced by the relation type and the entity's position index is replaced by the position indices of the head and tail entities, so the same loss function is also adopted in the relation matching task.
The auxiliary task of judging the number of triples is a multi-class classification; for this task the loss value can be calculated with the commonly used cross-entropy loss function.
Step 5, updating the model parameters through back propagation and gradient descent; a sketch of this update step follows.
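A minimal sketch of the update step in PyTorch, assuming the model, the data loader and the multi-label loss sketched under step 4 are already defined; the AdamW optimizer and learning rate are typical choices, not specified in the filing:

```python
import torch

# Assumptions: `model` returns span scores, `train_loader` yields batches with
# input_ids / attention_mask / labels, and multilabel_span_loss is the loss
# sketched under step 4 above.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in train_loader:
    scores = model(batch["input_ids"], batch["attention_mask"])
    loss = multilabel_span_loss(scores, batch["labels"])
    optimizer.zero_grad()
    loss.backward()   # back propagation
    optimizer.step()  # gradient descent update
```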
Step 6, predicting unlabeled data with the trained model and extracting new relation triples, which specifically comprises the following steps:
Referring to the scoring function of 3.4, the input data are modeled during training so that the scores of the labeled subject entities, object entities, and the relations between them are greater than 0. In the prediction process, it suffices to enumerate all candidate entities, use the scoring function to keep subjects with score greater than 0 and objects with score greater than 0, and then keep the extracted subject-object pairs whose relation matching score is greater than 0; the triples satisfying these conditions are the final output required by the user.
Step 7, the results output in step 6 are stored into a MySQL database in the corresponding format, which facilitates application to tasks such as knowledge graph construction and intelligent question answering.
The embodiment of the invention provides a training method for a structured text relation extraction model which adopts a joint extraction approach and models entities and the relations between entity pairs in a token-pair manner; it can improve the accuracy of relation extraction while maintaining reasonable speed, and can effectively solve the triple overlap problem.
The embodiment of the application also provides a text relation extraction device, which comprises a processor and a memory;
the memory is used for storing the program code and transmitting the program code to the processor;
the processor is configured to execute the text relation extraction method of the foregoing method embodiments according to the instructions in the program code.
The embodiment of the application also provides a computer readable storage medium for storing program code which, when executed by a processor, implements the text relation extraction method of the foregoing method embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of" and the like means any combination of these items, including any combination of single or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to execute all or part of the steps of the methods described in the embodiments of the present application by a computer device (which may be a personal computer, a server, or a network device, etc.). And the aforementioned storage medium includes: u disk, mobile hard disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.
Claims (8)
1. A structured text relation extraction method, comprising the following steps:
S1: data construction and preprocessing;
S2: building a deep learning relation extraction model;
S3: passing the sentence vectors encoded by the BERT-base model through the entity recognition layer and the relation discrimination layer respectively, and obtaining the loss value of the model;
S4: updating model parameters through back propagation and gradient descent;
S5: selecting data from the test set, extracting relation triples with the trained model, and mining the data;
S6: storing the obtained relation triples into a MySQL database.
2. The structured text relation extraction method according to claim 1, wherein constructing and preprocessing the dataset to obtain the entities and the relations between them specifically comprises:
S2.1: for the relation extraction data, designing a schema that defines the specific information of the relation data to be stored: the subject type, the predicate type and the object type and their correspondence, so as to normalize the expression of the structured data; every extracted relation record must satisfy the entity objects and types predefined by the schema;
S2.2: deduplicating the data through a code-defined data preprocessing class and constructing the text dataset; storing the dataset as a JSON file, with samples in the form of key-value pairs;
S2.3: not using the traditional CRF data processing for entity recognition, i.e. not adopting the "BIESO" entity labeling scheme; only the position information of the entity pairs in the text needs to be known; constructing a dictionary mapping entity category labels to IDs;
S2.4: counting the relation types appearing between entities in the data, and constructing a dictionary mapping inter-entity relation types to IDs;
S2.5: after preprocessing, dividing the data into a training set, a validation set and a test set, used respectively to train the deep learning model, to validate it and save the optimal trained model, and to test the model.
3. The structured text relation extraction method according to claim 1, wherein building the deep learning relation extraction model with a pre-trained language model such as BERT-base specifically comprises:
S3.1: using BERT-base as the encoder of the model; BERT-base has 12 encoder layers, each learning knowledge about different semantic aspects of the text; to make full use of the contextual information and the semantic knowledge of different aspects, combining the last four layers of the BERT model, taking their weighted average, and using it as the sentence vectors of the whole text;
S3.2: building two entity recognition modules for recognizing the head entity and the tail entity respectively;
S3.3: building two relation matching models that match the relations of the entities according to the start position information and the end position information in the span positions of the head entity and the tail entity respectively;
S3.4: on top of the main relation extraction task, adding a downstream task layer with an attention mechanism to form an auxiliary task, used to post-process the number of extracted relation triples, forming multi-task learning and increasing the robustness of the model;
S3.5: calculating the loss function value and updating the model parameters through back propagation and gradient descent.
4. The structured text relation extraction method according to claim 3, wherein step S3.1 specifically comprises:
S3.1.1: splitting each sample in the training data into characters, since splitting by word may cause entities in the data to be absent from the dictionary, i.e. OOV (out of vocabulary);
S3.1.2: letting the current sentence be x, obtaining after splitting the sequence x = [x_0, x_1, ..., x_{n-1}, x_n]; following the input requirements of the BERT pre-trained language model, letting x_0 = [CLS] and x_n = [SEP], where the [CLS] token is located at the beginning of the sentence and the [SEP] token at the end;
S3.1.3: passing the resulting text sequence x through the BERT model and combining the last four layers, h = concatenate([layer_9, layer_10, layer_11, layer_12]), where layer_i denotes the vector output by layer i; their weighted average yields word embeddings incorporating contextual semantic information.
5. The structured text relation extraction method according to claim 3, wherein step S3.5 specifically comprises:
S3.5.1: from the sentence vector [h_1, h_2, ..., h_n] containing contextual semantics obtained in S3.1.3, obtaining the vector sequences [q_{1,α}, q_{2,α}, ..., q_{n,α}] and [k_{1,α}, k_{2,α}, ..., k_{n,α}] through the transformations q_{i,α} = W_{q,α} h_i + b_{q,α} and k_{i,α} = W_{k,α} h_i + b_{k,α};
S3.5.2: using these two vector sequences to construct a scoring function for entity recognition, s_α(i, j) = q_{i,α}^T k_{j,α}, the inner product of q_{i,α} and k_{j,α}, where [i:j] is a continuous substring of the text that can form an entity;
S3.5.3: obtaining through the scoring function two score tensors e1 and e2 from the head-entity and tail-entity layers of the model (described in S3.2), with dimensions [n, seq_len, seq_len];
S3.5.4: likewise using the scoring function in the relation matching layer; in entity recognition the scoring is done according to the position information of the entity, while in the relation matching layer the model scores according to the start and end position information of the head and tail entities;
S3.5.5: adopting the multi-label classification loss L = log(1 + Σ_{(s,t)∈P_α} e^{-s_α(s,t)}) + log(1 + Σ_{(s,t)∈Q_α} e^{s_α(s,t)}), where P_α is the set of head-tail spans of entities of type α, and Q_α is the set of head-tail spans that are not entities or are not of type α;
S3.5.6: for the loss function of the auxiliary task, calculating the loss value loss_2 with the commonly used cross-entropy loss function; the final model loss value is the sum of the two, loss = loss_1 + loss_2;
S3.5.7: finally, updating the model parameters through back propagation.
6. A structured text relation extraction apparatus, comprising the following modules:
a data labeling module, used to manually label the collected unsupervised data, where the labeled content must conform to the designed schema: it includes the entities and their types, the semantic relations between entity pairs, and the start-end position indices of the entities in the text; the labeled data are used to train the deep learning relation extraction model;
a named entity recognition module, used to train the entity recognition model and to extract the head entity and the tail entity in the relation extraction process;
a relation matching module, used to extract the relations between entity pairs;
a triple information storage module, used to store the extracted triple data into the MySQL database.
7. A structured text relation extraction device, characterized in that the device comprises a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-5 according to instructions in the program code.
8. A computer readable storage medium for storing program code which, when executed by a processor, implements the method of any of claims 1-5.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310136023.4A | 2023-02-20 | 2023-02-20 | Structured text relation extraction method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116127097A true CN116127097A (en) | 2023-05-16 |
Family
ID=86295422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310136023.4A Pending CN116127097A (en) | 2023-02-20 | 2023-02-20 | Structured text relation extraction method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116127097A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402055A (en) * | 2023-05-25 | 2023-07-07 | 武汉大学 | Extraction method, device, equipment and medium for patent text entity |
CN116402055B (en) * | 2023-05-25 | 2023-08-25 | 武汉大学 | Extraction method, device, equipment and medium for patent text entity |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |