CN111428502A - Named entity labeling method for military corpus

Info

Publication number: CN111428502A
Application number: CN202010102664.4A
Authority: CN
Inventors: 黄宇; 冯洋
Original assignee: Zhongke Shitong Hengqi Beijing Technology Co ltd
Current assignee: Zhongke Shitong Hengqi Beijing Technology Co ltd
Priority date: 2020-02-19
Filing date: 2020-02-19
Publication date: 2020-07-17

Abstract

The invention discloses a named entity labeling method for military corpora. The method comprises: automatically labeling named entities by machine using three deep neural networks, namely a neural network model combining bidirectional LSTM with CRF, a Lattice LSTM neural network model, and a BERT pre-trained neural network model; applying the XGBoost method to perform ensemble learning on the results of the three algorithms, obtaining successfully labeled samples and failed samples, where a successful sample is one for which any two of the three machine entity recognition results agree and a failed sample is one for which the three results all differ; labeling the failed samples manually; and storing all sample labeling results in a database in JSON format for management.

Description

Named entity labeling method for military corpus

Technical Field

The invention relates to the field of natural language data processing, and in particular to a method that labels military entities in a military corpus using ensemble learning, so as to provide training corpora for named entity recognition in the military domain and improve the recognition accuracy of military entities.

Background

Named entity recognition (NER) is an important task in information extraction and information retrieval. It aims to recognize and classify the components of a text that represent named entities, and is sometimes referred to as named entity recognition and classification. With the advent of the big data era, the Internet has become an important source of military intelligence. A large amount of military text can be obtained from newswires, news magazines, military reports, operation plans, exercise reports, military magazines, dictionaries, government documents, military commentary and similar sources. To support semantic understanding, semantic representation and knowledge management of this text, military-domain entities such as military posts, military equipment names, military facility names and military organization names need to be extracted. For a computer to identify military entities automatically, a large amount of high-quality corpus labeled with military entities is needed. However, labor costs are currently very high: on the one hand, labeling a large corpus consumes a large amount of manpower, material and financial resources; on the other hand, labels produced by non-professionals may be of lower quality than those produced by experts, and the resulting low-quality corpus cannot guarantee the accuracy of named entity recognition. Establishing an efficient named entity labeling method for military corpora is therefore of significant value for mining the potential of such corpora.

At present there are three common modes of corpus labeling: the traditional mode, the crowdsourcing mode and the group mode. All three label the corpus manually. In the traditional mode, annotators label the corpus under the guidance of a labeling standard. In the crowdsourcing mode, several annotators label the same corpus online, and a high-quality labeled corpus is obtained through vote arbitration. In the group mode, a large-scale group of annotators labels the corpus. All of these methods depend on the labeling work of human annotators. Even social annotation and group-intelligence corpus annotation methods, which offer efficient ways of indexing, organizing and retrieving information resources, cannot overcome this drawback. Even when software platforms or networks are used, annotators must still, besides unifying the labeling specification, spend a large amount of time on arbitration and comparison before the best corpus is finally adopted.

XGBoost, used in the invention, is currently one of the most popular ensemble learning methods. Ensemble learning combines several weakly supervised models to obtain a better and more comprehensive strongly supervised model; the underlying idea is that even if one weak classifier makes a wrong prediction, the other weak classifiers can correct the error. XGBoost is an efficient algorithm proposed in 2016 by Tianqi Chen of the University of Washington, and supports both a linear solver and tree learning. It improves on the traditional ensemble learning algorithm GBDT and is more efficient. Traditional GBDT uses only first-order derivative information, whereas XGBoost applies a second-order Taylor expansion to the loss function and adds a regularization term to the objective function, so that the optimal solution balances the reduction of the objective against model complexity; this avoids overfitting and improves solving efficiency. The method proceeds as follows:

(1) Given a dataset $D = \{(x_i, y_i) : i = 1, 2, \dots, n,\ x_i \in \mathbb{R}^p,\ y_i \in \mathbb{R}\}$, where $n$ is the number of samples and each sample has $p$ features.

(2) The objective function is defined as follows:

In the formula, $\hat{y}_i$ is the predicted value and $y_i$ is the true value. To prevent overfitting, a regularization term is defined, in which $T$ and $\omega$ are the number of leaf nodes of the tree and the leaf weights respectively, $\gamma$ is the penalty coefficient on the number of leaves, and $\lambda$ is the penalty coefficient on the leaf weights.

(3) XGBoost uses a gradient boosting strategy: the existing model is retained and one new regression tree is added to it at each iteration. Assuming that the prediction for the $i$-th sample at iteration $t$ is $\hat{y}_i^{(t)}$ and that $f_t(x_i)$ is the newly added regression tree, the following derivation is obtained:

(4) Substituting the result of formula (8) into formula (7) gives:

(5) Applying a second-order Taylor expansion to the objective function and introducing the regularization term:

in the formula:

XGBoost ensemble learning performs well on data sets of various scales, and is currently one of the most stable and effective methods for improving algorithm accuracy.
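For reference, steps (2) to (5) can be written out in the usual XGBoost notation. This is a sketch of the standard formulation, consistent with the symbols $T$, $\omega$, $\gamma$ and $\lambda$ defined above. The loss $l$, the tree index $K$, the per-tree regularizer $\Omega$ and the derivative terms $g_i$ and $h_i$ are assumptions taken from the common presentation of the algorithm, and the equation numbering of the patent is not reproduced.

```latex
\begin{align}
% (2) regularized objective over K regression trees
\mathrm{Obj} &= \sum_{i=1}^{n} l\big(y_i, \hat{y}_i\big) + \sum_{k=1}^{K} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert \omega \rVert^{2} \\
% (3) additive update: keep the existing model, add one tree per iteration
\hat{y}_i^{(t)} &= \hat{y}_i^{(t-1)} + f_t(x_i) \\
% (4) objective at iteration t after substituting the update
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{const} \\
% (5) second-order Taylor expansion with the regularization term written out
\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big]
+ \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^{2} + \text{const} \\
% first- and second-order derivatives of the loss at the previous prediction
g_i &= \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}
\end{align}
```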

Disclosure of Invention

The invention aims to provide a named entity labeling method for military corpora, in order to solve the problem of identifying the military entities contained in massive Internet text and to provide a basis for the discovery and extraction of open-source information.

To achieve this purpose, the following technical solution is adopted. The method comprises the following steps:

S1, automatically labeling named entities by machine using three deep neural networks, namely a neural network model combining bidirectional LSTM with CRF, a Lattice LSTM neural network model, and a BERT pre-trained neural network model;

S2, applying the XGBoost method to perform ensemble learning on the results of the three algorithms of S1, obtaining successfully labeled samples and failed samples, where a successful sample is one for which any two of the three machine entity recognition results agree, and a failed sample is one for which the three machine entity recognition results all differ;

S3, labeling the failed samples manually;

S4, storing all sample labeling results in a database in JSON format for management.
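The success and failure definition in S2 can be illustrated with a small sketch. It implements only the agreement rule (any two of the three recognition results agree); the XGBoost ensemble stage itself is not shown. The function name `triage` and the record layout are hypothetical and chosen for illustration.

```python
from collections import Counter


def triage(sample_id, predictions):
    """Classify one sample from the tag sequences of three recognizers.

    predictions: a list of exactly three tag sequences, one per model.
    Returns a dict (hypothetical layout) with status "success" when at
    least two sequences are identical, otherwise "failed".
    """
    if len(predictions) != 3:
        raise ValueError("expected exactly three model outputs")
    # Tag sequences are converted to tuples so they can be counted.
    counts = Counter(tuple(p) for p in predictions)
    tags, votes = counts.most_common(1)[0]
    if votes >= 2:
        return {"id": sample_id, "status": "success", "tags": list(tags)}
    # All three results differ: the sample goes to manual labeling (S3).
    return {"id": sample_id, "status": "failed", "tags": None}
```

Comparing whole tag sequences is a design choice of this sketch; agreement could also be checked per entity span.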

Further, the military entity labels are divided into 7 types: person name entities, time entities, place name entities, military post and rank entities, military equipment entities, military facility entities, and military institution entities, tagged respectively as person_entry, time_entry, location_entry, position_entry, weather_entry, facility_entry, and location_org_entry. Each element is labeled "X-B", "X-I", or "O", where "X-B" indicates that the fragment containing the element belongs to type X and the element is at the beginning of the fragment, "X-I" indicates that the fragment containing the element belongs to type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type. For example, "the F-16 aircraft landed at Andersen Air Force Base at 15:23 on the 4th" is labeled "Webon_entry_B_Webon_entry_I_Webon_entry_Iwebon_entry_I_Webon_entry_I_time_entry_I_entry_I_O location_entry_location_I_location_entry_I_location_I_entry_I".
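The "X-B" / "X-I" / "O" scheme described above can be sketched as a function that turns entity spans into per-element tags. The function name `tag_elements`, the span format and the type name used in the usage line are assumptions for illustration, not part of the patent text.

```python
def tag_elements(elements, spans):
    """Assign "X-B", "X-I" or "O" to every element of a sequence.

    elements: list of tokens or characters.
    spans: list of (start, end, entity_type) with end exclusive
           (a hypothetical span format).
    """
    tags = ["O"] * len(elements)  # elements outside any entity stay "O"
    for start, end, entity_type in spans:
        if not 0 <= start < end <= len(elements):
            raise ValueError(f"invalid span: {(start, end, entity_type)}")
        tags[start] = f"{entity_type}-B"  # first element of the fragment
        for k in range(start + 1, end):
            tags[k] = f"{entity_type}-I"  # remaining elements of the fragment
    return tags
```

For instance, `tag_elements(["F", "-", "1", "6", "landed"], [(0, 4, "equipment")])` tags the first four elements as one equipment fragment and the last element as "O".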

Further, the long short-term memory module in the LSTM model is calculated as follows:

(1) At time $t$, the input word $X_t$ enters the network through the input gate. The gate receives the input at time $t$ together with the hidden-layer output and the cell state at time $t-1$, and its activation function is computed from them;

(2) Information is forgotten through the forget gate, whose activation function is obtained in the same way as in (1);

(3) The cell activation function takes the input at time $t$ and the hidden-layer output at time $t-1$;

(4) The final output of the unit combines the vector output by the output gate $O_t$ with the cell output, and is the result of the forward estimation.

In theory, backward estimation is the reverse derivation of forward estimation and follows a similar process. A bidirectional LSTM runs the LSTM over the known training sequence twice, once forward and once backward, thereby ensuring the globality and completeness of feature extraction.
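The forward and backward passes can be sketched with a single-unit (scalar) LSTM in plain Python. This is a minimal illustration under stated assumptions: the parameter names (`wxi`, `whi`, `wci` and so on) are hypothetical, the gates use the standard peephole form matching steps (1) to (4), and real models use weight matrices and vectors rather than scalars.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM; p is a dict of scalar weights."""
    # (1) input gate: input at t, hidden output and cell state at t-1
    i = sigmoid(p["wxi"] * x + p["whi"] * h_prev + p["wci"] * c_prev + p["bi"])
    # (2) forget gate: same inputs as the input gate
    f = sigmoid(p["wxf"] * x + p["whf"] * h_prev + p["wcf"] * c_prev + p["bf"])
    # (3) cell update: input at t and hidden output at t-1
    c = f * c_prev + i * math.tanh(p["wxc"] * x + p["whc"] * h_prev + p["bc"])
    # (4) output gate and final hidden output
    o = sigmoid(p["wxo"] * x + p["who"] * h_prev + p["wco"] * c + p["bo"])
    h = o * math.tanh(c)
    return h, c


def run_lstm(xs, p):
    """Run the LSTM over a sequence, returning the hidden output per step."""
    h = c = 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(xs, p_fwd, p_bwd):
    """Pair forward outputs with backward outputs at each position."""
    fwd = run_lstm(xs, p_fwd)
    bwd = run_lstm(xs[::-1], p_bwd)[::-1]  # run reversed, then re-align
    return list(zip(fwd, bwd))
```

Each position thus carries one value summarizing the sequence to its left and one summarizing the sequence to its right.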

Compared with the prior art, the invention has the following advantages: it significantly improves the labeling accuracy of military entities in military corpora, while achieving the best labeling effect at minimal labor cost.

Drawings

Table 1 shows the military entity labeling specifications provided by the present invention.

FIG. 1 is a basic flow diagram of the present invention.

FIG. 2 is a structural diagram of the bidirectional LSTM neural network model.

FIG. 3 is a structural diagram of the Lattice LSTM neural network model.

FIG. 4 is a structural diagram of the BERT-based pre-trained neural network model.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

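With reference to FIGS. 1-4, the method of the present invention comprises the following steps:

S1, automatically labeling named entities by machine using three deep neural networks, namely a neural network model combining bidirectional LSTM with CRF, a Lattice LSTM neural network model, and a BERT pre-trained neural network model;

S2, applying the XGBoost method to perform ensemble learning on the results of the three algorithms of S1, obtaining successfully labeled samples and failed samples, where a successful sample is one for which any two of the three machine entity recognition results agree, and a failed sample is one for which the three machine entity recognition results all differ;

S3, labeling the failed samples manually;

S4, storing all sample labeling results in a database in JSON format for management.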

Table 1. Military entity labeling specification of the present invention

As shown in Table 1, the military entity labels are divided into 7 types: person name entities, time entities, place name entities, military post and rank entities, military equipment entities, military facility entities, and military institution entities, tagged respectively as person_entry, time_entry, location_entry, position_entry, weather_entry, facility_entry, and location_org_entry. Each element is labeled "X-B", "X-I", or "O", where "X-B" indicates that the fragment containing the element belongs to type X and the element is at the beginning of the fragment, "X-I" indicates that the fragment containing the element belongs to type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type. For example, "the F-16 aircraft landed at Andersen Air Force Base at 15:23 on the 4th" is labeled "Webon_entry_B Webon_entry_I Webon_entry_Iwebon_entry_I webon_entry_I time_entry_I O location_entry_I location_entry_I location_entry_I location_entry_I location_entry_I location.

Further explanation is as follows:

1. Formulation of the part-of-speech tagging specification for military named entities

2. Military text import and preprocessing

For a corpus annotation platform, a large amount of raw corpus must be annotated to form a fully annotated corpus. The raw corpus comes from existing text data or from web crawlers, so the most basic functional requirements of the text loading component are text data import and web crawling. A manual input function is added alongside these loading modes, so that content from files that cannot be imported does not go unlabeled. Beyond this, if the platform is to be improved and strengthened, text loading could be extended with functions such as recognizing text in pictures. With the development of networks and devices, text is no longer recorded only in text files; a large amount of textual information also exists in pictures, audio and video. Corpus labeling does not require analyzing audio and video, but some text may be recorded in pictures, so the platform can consider adding this function as a later enhancement.

3. Military text named entity recognition

The neural network model combining bidirectional LSTM (Bi-LSTM) with CRF is a commonly used extraction algorithm in named entity recognition. LSTM is a subclass of recurrent neural networks, first proposed by Hochreiter et al., and is in essence a complex nonlinear unit; its notable characteristics are strong memory capacity and the ability to fit nonlinear relationships. The long short-term memory module in the LSTM model is calculated as follows:

(1) At time $t$, the input word $X_t$ enters the network through the input gate. The gate receives the input at time $t$ together with the hidden-layer output and the cell state at time $t-1$, and its activation function is computed as follows:

(2) Information is forgotten through the forget gate, whose activation function is obtained in the same way as in (1):

(3) The cell activation function takes the input at time $t$ and the hidden-layer output at time $t-1$:

(4) The final output of the unit combines the vector output by the output gate $O_t$ with the cell output, and is the result of the forward estimation:
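Steps (1) to (4) correspond to the commonly used LSTM equations with peephole connections, sketched here for reference. The weight matrices $W$, biases $b$, logistic function $\sigma$ and element-wise product $\odot$ are assumptions taken from the standard formulation, with $x_t$ denoting the input written $X_t$ above and $o_t$ the output gate written $O_t$.

```latex
\begin{align}
% (1) input gate: input at t, hidden output and cell state at t-1
i_t &= \sigma\big(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\big) \\
% (2) forget gate: same form as the input gate
f_t &= \sigma\big(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\big) \\
% (3) cell update: input at t and hidden output at t-1
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\big(W_{xc} x_t + W_{hc} h_{t-1} + b_c\big) \\
% (4) output gate and final hidden output (forward estimation)
o_t &= \sigma\big(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o\big) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```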

The conditional random field (CRF) is essentially a discriminative undirected graphical model whose theoretical basis is the hidden Markov model and the maximum entropy model. It is mainly used for part-of-speech tagging and segmentation of ordered data, conditioning on the whole observation sequence X. In its application and development, the conditional random field retains some characteristics of the hidden Markov model: in practice the variables obey the Markov assumption, and the transition probability of each state depends on the state of the adjacent variables. Taking the linear-chain conditional random field as an example, let X and Y be random variable sequences; if Y given X satisfies the Markov property, then P(Y|X) is called a linear-chain conditional random field, where X is the input observation sequence and Y is the corresponding output tag sequence (or state sequence). The feature functions of a conditional random field comprise transition features and state features: the transition feature function relates the tags of the preceding and following words, and the state feature function computes the probability of each state for each word.
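How state features and transition features combine when a linear-chain CRF picks a tag sequence can be sketched with Viterbi decoding. This is an illustration only: the scores are assumed to be given (in a Bi-LSTM with CRF model the state scores would come from the network), and the function name `viterbi` and the score layout are hypothetical.

```python
def viterbi(state_scores, transition_scores, tags):
    """Return the highest-scoring tag sequence of a linear-chain model.

    state_scores: one dict per position, mapping tag -> state feature score.
    transition_scores: dict mapping (previous_tag, tag) -> transition score;
                       missing pairs count as 0.0.
    tags: list of all possible tags.
    """
    best = [{t: state_scores[0][t] for t in tags}]  # best score ending in t
    backpointers = []
    for scores_here in state_scores[1:]:
        current, pointers = {}, {}
        for tag in tags:
            # Best previous tag given the transition into this tag.
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transition_scores.get((p, tag), 0.0),
            )
            current[tag] = (
                best[-1][prev]
                + transition_scores.get((prev, tag), 0.0)
                + scores_here[tag]
            )
            pointers[tag] = prev
        best.append(current)
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

A large negative transition score for an invalid pair such as ("O", "X-I") lets the transition features override a locally preferred but inconsistent tag.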

4. Standard labeled corpus warehousing

After a worker has used the corpus labeling platform to identify entities and add attributes to a text, the corpus can be generated through the software's corpus generation function. A corpus generation scheme designed to comply with the corpus specification generates the corpus automatically and presents it in an XML-view editing frame, where it can be adjusted and modified. Once the corpus is confirmed to be correct, clicking to generate XML produces the corpus, which is finally incorporated into the fully labeled corpus database.
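Step S4 of the method stores the labeling results in a database in JSON format. A minimal sketch of that step follows, assuming SQLite as the database; the table name `labeled_corpus`, its columns and the record fields are hypothetical, since the patent does not specify a schema.

```python
import json
import sqlite3


def store_samples(conn, samples):
    """Save each labeled sample as one JSON document per row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labeled_corpus ("
        "sample_id TEXT PRIMARY KEY, status TEXT, record TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO labeled_corpus (sample_id, status, record) "
        "VALUES (?, ?, ?)",
        # ensure_ascii=False keeps non-ASCII text readable in storage
        [(s["id"], s["status"], json.dumps(s, ensure_ascii=False)) for s in samples],
    )
    conn.commit()


def load_samples(conn, status=None):
    """Load stored samples, optionally filtered by labeling status."""
    if status is None:
        rows = conn.execute("SELECT record FROM labeled_corpus ORDER BY rowid")
    else:
        rows = conn.execute(
            "SELECT record FROM labeled_corpus WHERE status = ? ORDER BY rowid",
            (status,),
        )
    return [json.loads(record) for (record,) in rows]
```

Keeping the status in its own column makes it easy to pull out the failed samples that still need manual labeling.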

The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solution of the present invention by those skilled in the art should fall within the protection scope defined by the claims of the present invention without departing from the spirit of the present invention.

Claims

1. A named entity labeling method for military corpora is characterized by comprising the following steps:

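S1, automatically labeling named entities by machine using three deep neural networks, namely a neural network model combining bidirectional LSTM with CRF, a Lattice LSTM neural network model, and a BERT pre-trained neural network model;

S2, applying the XGBoost method to perform ensemble learning on the results of the three algorithms of S1, obtaining successfully labeled samples and failed samples, where a successful sample is one for which any two of the three machine entity recognition results agree, and a failed sample is one for which the three machine entity recognition results all differ;

S3, labeling the failed samples manually;

S4, storing all sample labeling results in a database in JSON format for management.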

2. The named entity labeling method for military corpora according to claim 1, characterized in that: the military entity labels are divided into 7 types, namely person name entities, time entities, place name entities, military post and rank entities, military equipment entities, military facility entities and military institution entities, tagged respectively as person_entry, time_entry, location_entry, position_entry, weather_entry, facility_entry and location_org_entry, and each element is labeled 'B-X', 'I-X' or 'O', where 'B-X' indicates that the fragment containing the element belongs to type X and the element is at the beginning of the fragment, 'I-X' indicates that the fragment containing the element belongs to type X and the element is in the middle position of the fragment, and 'O' indicates that the element does not belong to any type.

3. The named entity labeling method for military corpora according to claim 1, wherein the long short-term memory module in the LSTM model is calculated as follows: