CN118428471B

CN118428471B - Atlas relation extraction method based on pre-training model enhancement

Info

Publication number: CN118428471B
Application number: CN202410876214.9A
Authority: CN
Inventors: 关相承; 修保新
Original assignee: Hunan Dongyin Information Technology Co ltd
Current assignee: Hunan Dongyin Information Technology Co ltd
Priority date: 2024-07-02
Filing date: 2024-07-02
Publication date: 2024-09-24
Anticipated expiration: 2044-07-02
Also published as: CN118428471A

Abstract

The invention discloses a pre-training model-based enhanced graph relation extraction method, which comprises the following steps: constructing a relation extraction model, wherein the relation extraction model comprises a grouping suspension mark, a pre-training language model and a relation prediction layer; preprocessing text data and initializing grouping suspension marks to obtain characteristic sequences of the text and the suspension marks; calculating an attention mask; controlling the feature propagation direction of the pre-training language model by using the attention mask, and extracting the features of the suspension mark pairs; inputting the features of the suspension mark pairs into a relation prediction layer to obtain a relation probability vector; and calculating a loss function for the relation probability vector, optimizing the loss function, training a relation extraction model, and carrying out relation extraction by using the relation extraction model. The invention provides an entity pair representation method for grouping suspension marks, which is characterized in that the suspension marks are grouped, each group multiplexes the characteristics of head entities, a specific attention mask is designed, the high-efficiency aggregation of the entities on the characteristics is realized, and the high-precision relation extraction is realized under the condition of less calculation amount.

Description

Atlas relation extraction method based on pre-training model enhancement

Technical Field

The invention relates to the field of deep learning and natural language processing, in particular to a pre-training model-based enhanced atlas relation extraction method.

Background

Relationship extraction is a task in natural language processing that aims to identify and extract relationships between entities from text. Given a piece of text and a labeled pair of entities, the goal of a task is to determine the type of relationship or class of relationship between these entities. Relationship extraction has important applications and values in the fields of natural language processing and information extraction, including but not limited to the following: knowledge graph construction, information retrieval and recommendation, event extraction and intelligence analysis, social network analysis, automatic question and answer, intelligent assistant and the like.

Most of the current relation extraction methods of medical knowledge graphs need to design complex relation extraction modules, carry out complex processing on text features output by a language model, and have large calculated amount and low calculation efficiency. A small part of methods can reduce the calculation amount to a certain extent by designing the suspension mark, however, the existing suspension mark methods have the problem of low expression efficiency, which hinders the research and landing of the algorithm. Therefore, how to design an entity extraction method, by improving the entity representation method, the entity extraction method can efficiently represent the entity characteristics, and has academic research significance and industrial application significance.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the invention discloses a pre-training model-based enhanced graph relation extraction method. Compared with the existing method, the method creatively provides an entity pair representation method of grouping suspension marks, the method groups suspension marks, multiplexes the characteristics of one head entity for each group, designs a specific attention mask for the grouping suspension marks, achieves efficient aggregation of the characteristics of the entities, and achieves high-precision relation extraction under the condition of less calculation amount.

The invention aims at realizing a graph relation extraction method based on pre-training model enhancement, which comprises the following steps:

step 1, constructing a relation extraction model, wherein the relation extraction model comprises a grouping suspension mark, a pre-training language model and a relation prediction layer;

Step 2, preprocessing text data and initializing grouping suspension marks to obtain characteristic sequences of the text and the suspension marks;

step 3, calculating an attention mask;

step 4, using the attention mask to control the feature propagation direction of the pre-training language model, and extracting the features of the suspension mark pairs;

step 5, inputting the characteristics of the suspension mark pairs into a relation prediction layer to obtain a relation probability vector;

and 6, calculating a loss function for the relation probability vector, optimizing the loss function, training a relation extraction model, and carrying out relation extraction by using the relation extraction model.

The text data preprocessing and grouping suspension mark initializing are carried out to obtain a characteristic sequence of the text and the suspension mark, and the method comprises the following steps:

Step 201, word segmentation is carried out on an input text to obtain a word segmentation sequence;

Step 202, inserting a "< e >" mark before each entity of the word segmentation sequence, inserting a "</e >" mark after each entity for marking the position of the entity, inserting a start mark "< CLS >" in the head of the word segmentation sequence, and inserting a stop mark "< SEP >" in the tail of the word segmentation sequence;

Step 203, mapping the word segmentation sequence into a word vector sequence by using a word embedding model of the pre-training language model Roberta-large, wherein the total word segmentation number is The total entity number isThe word vector sequence mathematical expression obtained by mapping the word sequence is as follows:

wherein, A word vector representing the start tag "< CLS >",A word vector representing the end tag "< SEP >",A word vector representing the i-th word,A word vector representing an i < e > "th tag, each < e >" tag content being fixed, so each < e > "tag word vector is identical;

step 204, obtaining a position embedded sequence of the word segmentation sequence by using a position embedded model of the pre-training language model Roberta-large, and for the word segmentation sequence in step 203, obtaining a mathematical expression of the position embedded sequence as follows:

wherein, The position of the start tag "< CLS >" is embedded,The location of the end mark "< SEP >" is embedded,The position of the i-th word is indicated to be embedded,The position embedding representing the i < e > "mark, each < e >" mark position being different, so the position embedding of each < e > "mark is different;

Step 205, mapping the word vector sequence obtained by word segmentation sequence And position embedded sequence of word segmentation sequenceAdding according to elements to obtain feature embedded sequence of word segmentation sequenceThe mathematical expression is:

Step 206, generating a suspension mark feature; the ith suspension mark is characterized by the word vector of the ith "< e >" mark The position of the i < e > "mark is embeddedThe mathematical expression is:

wherein, Features representing the ith suspension mark;

Step 207, generating a suspension mark feature sequence; the entity number is m, and m suspension marks are provided, so that a suspension mark characteristic sequence containing m groups of suspension marks is generated, and the generation mode of the ith group of suspension marks is as follows: features of the ith suspension mark Placed at the beginning of the i-th set of floating-point sequences, other floating-point sequences are arranged in the order of appearance in the text, from small to large, behind the i-th set of floating-point sequences, where i = 1,2,3, …, m; sequentially splicing m groups of suspension mark characteristic sequences to obtain a suspension mark characteristic sequence with the length ofSuspension marker signature sequences of (2)；

Step 208, embedding the features of the word segmentation sequence into the sequenceAnd a suspension marker feature sequenceSpliced together, the mathematical expression is:

wherein, Characteristic sequences representing text and hover marks.

The calculating the attention mask comprises the following steps:

the characteristic embedded sequence of the word segmentation sequence Sequence length isSuspension marker feature sequencesSequence length isNumber of entitiesGenerates a size ofIs a matrix of (a)The mathematical expression of the element assignment in the matrix is:

wherein, Is the attention mask of the person,Representation ofElements of row i and column j.

The method for extracting the feature of the suspension mark pair by using the attention mask to control the feature propagation direction of the pre-training language model comprises the following steps:

Step 401, feature sequences of the text and the suspension mark Input into pre-trained language model Roberta-large and mask with the attentionAs a mask for Roberta-large forward propagation, the mathematical expression is:

wherein, Is thatThe last hidden layer of the output is characterized, d isIs used to determine the hidden layer dimension of the (c),Embedding sequences for features of said word segmentation sequencesThe length of the sequence is set to be,For said suspended tag feature sequencesSequence length;

step 402, slave Features of the last hidden layer of the outputThe characteristics of each entity pair are selected, and the mathematical expression is as follows:

wherein, Features representing pairs of floating marks for the i-th entity and the j-th entity,Representing an operation of indexing from the 0 th dimension of the target tensor.

The method inputs the characteristics of the suspension mark pairs into a relation prediction layer to obtain a relation probability vector, and comprises the following steps:

Features of pairs of floating marks for the i-th and j-th entities Inputting the full connection layer to obtain a relation prediction vector of the ith entity and the jth entity, wherein the mathematical expression is as follows:

wherein, Representing the relationship prediction vector of the i-th entity and the j-th entity,Representing the weight matrix of the fully connected layer,Representing the bias vector of the fully connected layer, C representing the number of relationship categories, d representing the dimension of the floating token pair feature,Is an activation function for normalizing the vector into a probability distribution.

The method comprises the following steps of calculating a loss function for the relation probability vector, optimizing the loss function, training a relation extraction model, and carrying out relation extraction by using the relation extraction model:

calculating a relationship prediction vector for an ith entity and a jth entity Tag associated with realityCross entropy loss between the two, the mathematical expression is:

wherein, A true relationship label representing an ith entity and a jth entity, if the ith entity and the jth entity have a kth type relationshipOtherwise，The probability that the ith entity and the jth entity representing model prediction have the kth class is the relation prediction vectorIndex value of (2);

The cross entropy loss of all entity pairs is calculated and the mathematical expression is as follows:

wherein, Representing the total cross entropy loss;

pairs using Adam optimization algorithm And optimizing, and training a relation extraction model.

Compared with the prior art, the method has the advantages that: the technology provides a pre-training model-based enhanced graph relation extraction method. The method innovatively provides an entity pair representation method of grouping suspension marks, by grouping the suspension marks, multiplexing the characteristics of one head entity for each group, and designing a specific attention mask for the grouping suspension marks, the efficient aggregation of the entity pair characteristics is realized, and the high-precision relation extraction is realized under the condition of less calculation amount.

Drawings

Fig. 1 shows a schematic flow chart of an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Knowledge graph is a collection of knowledge and its links described in a structured form, which is a knowledge representation that organizes entities, attributes and relationships into a graphical structure, and aims to better describe and understand the knowledge and concepts of the world, and can be used to store, query, infer and analyze knowledge. The knowledge graph is designed not only by representing the knowledge in the form of a graph structure, but also by considering how to define the attribute and the relation, how to establish the relation between the attribute and the relation, how to perform the operations of inquiry, reasoning, analysis and the like. Such a design makes knowledge graph a powerful tool for storing and manipulating a large amount of complex knowledge.

In this embodiment, there is massive knowledge in the medical field in the internet that can be used for medical condition consultation and health care, but the traditional search engine cannot make reasonable judgment according to the actual condition of the patient, and cannot meet the use requirement. Assuming that a large-scale medical knowledge graph is built, a huge amount of text data needs to be crawled on the Internet and structured, and relationship extraction is used as a key link in text structuring, and in the process of carrying out relationship extraction on texts, a graph relationship extraction method based on pre-training model enhancement is used for the medical field and extracting medical related relationships. A reliable Chinese medical knowledge system is built, and the system can help to meet the demands of people on the knowledge related to daily diseases and has high application value.

The medical knowledge graph (Medical Knowledge Graph) serves as the core of medical artificial intelligence, is essentially a semantic network for revealing relationships between medical entities, and can formally describe things and correlations of things in the real world. In general, a medical knowledge graph is constructed by continuously expanding entities and relationships based on manually constructed expert knowledge through algorithms and expert auditing, and comprises medical concepts and various medical relationships such as diseases, symptoms, medicines, operations and the like. In a wide range of medical scenarios, medical knowledge maps have proven to be effective in providing medical knowledge support for algorithms and medical interpretation of predicted outcomes of algorithms. In the foreseeable future, knowledge maps will play a vital role in the field of medical treatment, which is a strong knowledge attribute. Therefore, the method for extracting the relationship between the patterns based on the enhancement of the pre-training model can provide very important support for extracting the relationship between the medical knowledge patterns.

Thus, as shown in fig. 1, a method for extracting a graph relationship based on pre-training model enhancement, the method comprising:

step 3, calculating an attention mask;

The map is a medical knowledge map, the entities of the map comprise diseases, symptoms, medicines and operations, and the relationship of the map comprises disease-symptom relationship, disease-medicine relationship, disease-disease relationship, symptom-symptom relationship and disease-operation relationship.

wherein, Features representing the ith suspension mark;

wherein, Characteristic sequences representing text and hover marks.

RoBERTa-large is one of the variants based on the BERT (Bidirectional Encoder Representations from Transformers) model, developed by Facebook AI (now referred to as Meta AI). RoBERTa, collectively referred to as "A Robustly Optimized BERT Pretraining Approach", was modified and optimized based on BERT. The following are some key features of RoBERTa-large: (1) model scale: roBERTa-large is larger than BERT-large, having 24 layers of Transformer encoder, with 1024 hidden units per layer, for a total of 355M parameters. In contrast, BERT-large has 24 layers, each with 1024 hidden units, for a total of 340M parameters. (2) amount of pre-training data: roBERTa a larger pre-training dataset was used, about 160GB of data, much more than 16GB of BERT. This includes datasets BookCorpus, ENGLISH WIKIPEDIA, CC-News, openWebText, stories, and the like. (3) pretraining strategy: roBERTa are more optimized during the pre-training process. For example, the Next Sentence Prediction (NSP) task in BERT is eliminated and a longer training sequence (longer sentence) is used. (4) training time: roBERTa perform a longer pre-training to ensure that the model captures language patterns and context better. (5) effect improvement: due to the optimization described above, roBERTa performs better than BERT across multiple natural language processing tasks, including tasks such as text classification, question-answering, text generation, and the like.

The calculating the attention mask comprises the following steps:

One of the roles of the attention mask in the transducer model is to control the propagation of information, i.e. information that determines which locations can affect each other.

The attention mechanism interacts each location with other locations in calculating the attention weight and assigns weights according to their relevance. By marking certain locations in the attention mask, we can control whether the model takes these locations into account when calculating the attention weight.

wherein, Representing the total cross entropy loss;

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims

1. The method for extracting the map relation based on the pre-training model enhancement is characterized by comprising the following steps:

step 3, calculating an attention mask;

step 6, calculating a loss function for the relation probability vector, optimizing the loss function, training a relation extraction model, and carrying out relation extraction by using the relation extraction model;

The map is a medical knowledge map, the entities of the map comprise diseases, symptoms, medicines and operations, and the relationship of the map comprises disease-symptom relationship, disease-medicine relationship, disease-disease relationship, symptom-symptom relationship and disease-operation relationship;

；

Step 204, obtaining a position embedded sequence of a word segmentation sequence by using a position embedded model of a pre-training language model Roberta-large, wherein the mathematical expression of the position embedded sequence obtained for the word segmentation sequence is as follows:

；

wherein, Features representing the ith suspension mark;

；

wherein, A feature sequence representing text and a hover mark;

The calculating the attention mask comprises the following steps:

；

wherein, Is the attention mask of the person,Representation ofElements of the ith row and the jth column;

；

2. The method for extracting the relationship between the atlases based on the enhancement of the pre-training model according to claim 1, wherein the step of inputting the features of the suspension mark pairs into the relationship prediction layer to obtain the relationship probability vector comprises the following steps:

；

wherein, Representing the relationship prediction vector of the i-th entity and the j-th entity,Representing the weight matrix of the fully connected layer,Representing bias vectors of the fully connected layers, C representing the number of relationship categories, d representing the dimension of the floating mark pair feature, the dimension of the floating mark pair feature andIs equal in the dimension of the hidden layer of (c),Is an activation function for normalizing the vector into a probability distribution.

3. The method for extracting the relationship between the patterns based on the enhancement of the pre-training model according to claim 2, wherein the steps of calculating the loss function for the relationship probability vector, optimizing the loss function, training the relationship extraction model, and extracting the relationship by using the relationship extraction model comprise the following steps:

；

wherein, Representing the total cross entropy loss;