CN106886516A - Method and device for automatically identifying sentence relations and entities

Info

Publication number: CN106886516A
Application number: CN201710108288.8A
Authority: CN
Inventors: 简仁贤; 王海波
Original assignee: Intelligent Technology (Shanghai) Co Ltd
Current assignee: Intelligent Technology (Shanghai) Co Ltd
Priority date: 2017-02-27
Filing date: 2017-02-27
Publication date: 2017-06-23

Abstract

The invention belongs to the technical field of intelligent recognition and provides a method and device for automatically identifying the relation and entity in a sentence. The method includes: projecting the user's input sentence into a space of fixed dimension to obtain the sentence vector of the input sentence in that space; feeding the sentence vector into a pre-trained deep learning classifier to obtain the relation category of the input sentence; and, if a relation category is identified, recognizing the entity in the input sentence. The method and device use deep learning to judge the user's input semantically, so relations can be recognized accurately. Entity recognition is modeled as a sequence labeling problem and the optimal labeling is solved with a conditional random field, so entities can be recognized accurately. Combining deep learning with a conditional random field achieves automatic extraction of relations and entities.

Description

Method and device for automatically identifying sentence relations and entities

Technical field

The present invention relates to the technical field of intelligent recognition, and in particular to a method and device for automatically identifying the relation and entity in a sentence.

Background art

In a dialogue system, it is often necessary to recognize whether the user is expressing information in some specific domain, such as preferences or nicknames. If so, it is often also necessary to accurately extract the specific object that the information refers to. In general, such information can be represented by a relation and an entity. The relation describes what kind of information the user is expressing, for example whether it is a preference or a nickname; the entity is the specific object that the relation refers to. For example, if the user says "I like eating spicy hot pot", the relation is "like" and the entity is "spicy hot pot". In a dialogue system, automatically identifying such domain-specific relations and entities is a highly challenging problem.

The two most common approaches to recognizing relations and entities are the keyword-based method and the regular-expression-based method.

The keyword-based method recognizes relations mainly through keywords. Taking preferences as an example, if the user's input sentence contains the word "like", the sentence is taken to express liking; if it contains "don't like", it is taken to express disliking. The entity of the relation is then extracted using syntactic dependency parsing or semantic role labeling (SRL). For example, "I like Zhou Jielun" contains "like", so the keyword-based method concludes that the sentence expresses "like"; dependency parsing shows that "Zhou Jielun" depends on the head word "like", so the object of liking, that is, the identified entity, is "Zhou Jielun". The drawback of the keyword-based method is a large number of misjudgments: a sentence that contains a keyword does not necessarily express the corresponding relation. Continuing the preference example, the input "I can't say for sure yet whether I like Zhou Jielun" contains the keyword "like", but what it expresses is uncertainty. Treating it as a "like" relation merely because it contains "like" is inevitably wrong. The example shows that a relation cannot be judged from keywords alone, because a keyword carries limited information. Whenever judging the relation requires more information than the keyword itself carries (here, "can't say for sure whether I like" carries more information than the single word "like"), the keyword-based method is helpless.
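The misjudgment described above can be reproduced with a minimal sketch of keyword matching. The keyword table, the function name `keyword_relation`, and the English example strings are illustrative assumptions, not part of the original disclosure.

```python
# Hypothetical keyword table. The longer keyword is checked first so that
# "don't like" is not swallowed by "like".
KEYWORDS = [("don't like", "dislike"), ("like", "like")]

def keyword_relation(sentence):
    """Return the relation of the first keyword found in the sentence, else None."""
    for keyword, relation in KEYWORDS:
        if keyword in sentence:
            return relation
    return None

print(keyword_relation("I like Zhou Jielun"))        # like
print(keyword_relation("I don't like Zhou Jielun"))  # dislike
# The sentence below expresses uncertainty, yet it is still classified as "like".
print(keyword_relation("I can't say for sure yet whether I like Zhou Jielun"))  # like
```

The third call shows the failure mode: the keyword is present, but the sentence does not express the relation.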

To address these problems, regular expressions are commonly used to add further constraints for relation judgment and entity extraction. For example, the "like" relation can be recognized with the regular expression "I like (.*)", which means that only a sentence containing "I like" counts as expressing the "like" relation; the trailing "(.*)" captures all the text that follows "I like" and treats it as the object of liking, that is, the entity. For "I like Zhou Jielun", the recognized relation is "like" and the entity is "Zhou Jielun".

The regular-expression-based method shares the drawback of the keyword-based method: it produces many misjudgments, identifying sentences that do not express the relation as expressing it. A further drawback is that its entity extraction is fragile and often extracts the wrong entity. For example, the sarcastic "I like Zhou Jielun, as if" matches the "I like (.*)" pattern above, yet its meaning is the exact opposite: the user is expressing dislike. Under the rule above, the system identifies a "like" relation whose object is "Zhou Jielun, as if"; in this case both the relation and the entity are wrong.
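A short sketch of the regular-expression approach makes both failure modes concrete. The pattern is the one quoted in the text; the function name `regex_like` and the English rendering of the sarcastic example are assumptions made for illustration.

```python
import re

# Pattern from the text: everything after "I like " is taken as the entity.
LIKE_PATTERN = re.compile(r"I like (.*)")

def regex_like(sentence):
    """Return ("like", entity) if the sentence matches the pattern, else None."""
    match = LIKE_PATTERN.search(sentence)
    if match is None:
        return None
    return ("like", match.group(1))

print(regex_like("I like Zhou Jielun"))         # ('like', 'Zhou Jielun')
# Sarcastic negation: both the relation and the extracted entity are wrong.
print(regex_like("I like Zhou Jielun, as if"))  # ('like', 'Zhou Jielun, as if')
```

Adding more patterns to catch such cases is possible, but each new rule can interact with existing ones, which is the maintenance problem the background goes on to describe.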

Another drawback of the keyword-based and regular-expression-based methods is that they are hard to maintain. Because natural language expressions are so diverse, a large number of keywords and regular expressions are needed to cover the various cases, and as their number grows the system becomes very complex. A newly added keyword or regular expression may conflict with existing ones. Worse, such conflicts are usually well hidden, and it is generally hard to tell in advance whether one exists. In many cases it is only after something goes wrong, and the root cause has been traced, that the problem is found to stem from a conflict between rules.

Extracting entities based on SRL or dependency parsing is also far from perfect. Because of the complexity of Chinese, the accuracy of SRL and dependency parsing is itself not high. Applying further rules on top of such results to recognize entities degrades precision further, so entity extraction becomes inaccurate.

In summary, the prior art has the following defects:

1. Inaccurate relation judgment. Judgments rely only on keywords or regular expressions and do not consider the semantics of the sentence itself, which leads to misjudged relations.

2. Inaccurate entity extraction. Entities extracted with regular expressions, SRL, syntactic parsing or dependency parsing are easily affected by the limited precision of those methods themselves, which leads to extraction errors.

3. Poor maintainability. As the number of rules grows, system complexity rises, and it is hard to judge in advance whether a new rule is compatible with the existing ones.

Summary of the invention

In view of the above defects of the prior art, the present invention provides a method and device for automatically identifying the relation and entity in a sentence. Deep learning is used to judge the user's input semantically, so relations can be recognized accurately. Entity recognition is modeled as a sequence labeling problem and the optimal labeling is solved with a conditional random field, so entities can be recognized accurately. Combining deep learning with a conditional random field achieves automatic extraction of relations and entities.

In a first aspect, the present invention provides a method for automatically identifying the relation and entity in a sentence, including: projecting the user's input sentence into a space of fixed dimension to obtain the sentence vector of the input sentence in the space of fixed dimension; feeding the sentence vector into a pre-trained deep learning classifier to obtain the relation category of the input sentence; and, if a relation category is identified, recognizing the entity in the input sentence.

The method provided by the present invention uses deep learning to judge the user's input sentence semantically, so relations can be recognized accurately, which in turn helps improve the accuracy of entity recognition.

Preferably, projecting the user's input sentence into a space of fixed dimension to obtain the sentence vector of the input sentence in that space includes: segmenting the user's input sentence into words; converting each segmented word into its corresponding word vector by looking up word2vec word vectors; and obtaining, from the word vectors of the segmented words, the sentence vector of the input sentence in a space of fixed dimension.

Preferably, feeding the sentence vector into the pre-trained deep learning classifier to obtain the relation category of the input sentence includes: feeding the sentence vector into a CNN layer for convolution to obtain the local features of the input sentence; feeding the local features into an LSTM layer to obtain an encoding of the relations between preceding and following words in the input sentence; feeding the relation encoding into a ReLU layer for a nonlinear transformation; and passing the result of the nonlinear transformation to an output layer to obtain the relation category of the input sentence.

Preferably, the deep learning classifier includes multiple CNN layers.

Preferably, the deep learning classifier includes multiple LSTM layers.

Preferably, the output layer of the deep learning classifier uses a Softmax function or a Sigmoid function.

Preferably, recognizing the entity in the input sentence includes: feeding the input sentence into a CRF model to obtain the optimal label sequence of the input sentence, and obtaining the entity in the input sentence from the optimal label sequence.

Preferably, the training of the deep learning classifier includes: feeding the sentence vector of a training sample into a pre-built deep learning classifier and obtaining the predicted relation category LP of the training sample through a feedforward pass; obtaining a loss value from a loss function F(LP, L), where L is the relation category with which the sample is actually labeled and the loss value measures the difference between LP and L; according to the loss value, performing gradient backpropagation using stochastic gradient descent to update the parameters of the deep learning classifier; and training the deep learning classifier iteratively until the loss value between the predicted relation category output by the classifier and the actually labeled relation category falls below a preset threshold, or the number of iterations exceeds a preset limit.

Preferably, the loss function may be cross entropy or mean squared error.

In a second aspect, the present invention provides a device for automatically identifying the relation and entity in a sentence, including: a preprocessing module, configured to project the user's input sentence into a space of fixed dimension and obtain the sentence vector of the input sentence in the space of fixed dimension; a relation recognition module, configured to feed the sentence vector into a pre-trained deep learning classifier and obtain the relation category of the input sentence; and an entity recognition module, configured to recognize the entity in the input sentence if a relation category is identified.

The device provided by the present invention uses deep learning to judge the user's input sentence semantically, so relations can be recognized accurately, which in turn helps improve the accuracy of entity recognition.

Brief description of the drawings

Fig. 1 is a flowchart of a method for automatically identifying the relation and entity in a sentence, provided by an embodiment of the present invention;

Fig. 2 is a structural block diagram of a device for automatically identifying the relation and entity in a sentence, provided by an embodiment of the present invention;

Fig. 3 shows the deep learning architecture used by the deep learning classifier provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the technical solution of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution more clearly; they are examples only and do not limit the scope of protection of the present invention.

It should be noted that, unless otherwise stated, the technical and scientific terms used in this application have the ordinary meaning understood by a person of ordinary skill in the art to which the present invention belongs.

As shown in Fig. 1, a method for automatically identifying the relation and entity in a sentence, provided by an embodiment of the present invention, includes:

Step S1: project the user's input sentence into a space of fixed dimension to obtain the sentence vector of the input sentence in the space of fixed dimension.

Step S2: feed the sentence vector into a pre-trained deep learning classifier to obtain the relation category of the input sentence.

Step S3: if a relation category is identified, recognize the entity in the input sentence.

Here, an entity must first of all be a noun. An entity refers to an independently existing object, such as the name of a person or a thing, but does not include pronouns such as "I", "you" or "he". For example, if the input sentence is "I like Zhou Jielun", the entity in it is "Zhou Jielun".

The method provided by this embodiment uses deep learning to judge the user's input sentence semantically, so relations can be recognized accurately, which in turn helps improve the accuracy of entity recognition.
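The control flow of steps S1 to S3 can be sketched as follows. The function `identify` and the three callables passed to it are hypothetical stand-ins, and the convention that the classifier returns `None` when no relation of interest is found is an assumption; the toy stubs only exercise the flow and are not the models described in this document.

```python
def identify(sentence, to_sentence_vector, classify_relation, recognize_entity):
    """Run steps S1 to S3; entity recognition runs only if a relation is found."""
    vector = to_sentence_vector(sentence)          # S1: sentence vector
    relation = classify_relation(vector)           # S2: relation category
    if relation is None:                           # no relation, so skip S3
        return None, None
    return relation, recognize_entity(sentence)    # S3: entity recognition

# Toy stand-ins, used only to demonstrate the control flow.
def toy_vector(sentence):
    return sentence.split()

def toy_classifier(vector):
    return "like" if "like" in vector else None

def toy_entity(sentence):
    return sentence.split("like ", 1)[1]

print(identify("I like Zhou Jielun", toy_vector, toy_classifier, toy_entity))
print(identify("Hello there", toy_vector, toy_classifier, toy_entity))
```

Gating step S3 on the outcome of step S2 means the entity recognizer is never run on sentences that express no relation of interest.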

A preferred implementation of step S1 includes:

Step S11: segment the user's input sentence into words.

Step S12: convert each segmented word into its corresponding word vector by looking up word2vec word vectors.

Step S13: obtain, from the word vectors of the segmented words, the sentence vector of the input sentence in a space of fixed dimension.

Steps S11 to S13 are implemented as follows.

The input sentence is segmented into words. If the number of words exceeds N, the excess words are discarded. N is a preset maximum number of words in an input sentence, for example N = 25. Because users type in a chat setting, N does not need to be very large; statistics show that in chat most user inputs contain no more than 10 words.

Each segmented word is converted into its corresponding word vector by looking up word2vec word vectors. Assume the dimension of each word vector is M, for example M = 300. The word2vec word vectors are trained offline in advance, so converting a segmented word into its word vector only requires calling the relevant public interface to look it up.

The word vectors are then concatenated. If there are fewer than N words, zeros are appended until an N × M-dimensional vector is formed. For example, with N = 25 and M = 300, if the user input has only 23 words, then in addition to concatenating the 23 300-dimensional word vectors, 2 more M-dimensional zero vectors must be appended, that is, 2 × 300 = 600 zeros. Filling with M-dimensional zero vectors in this way is called padding.

Through the above steps, the input sentence is projected into a space of fixed dimension. In the example above it is projected into an N × M-dimensional space; with N = 25 and M = 300, this is a space of 25 × 300 dimensions.

The vector representation of the input sentence in the N × M-dimensional space is the sentence vector of the input sentence.
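The truncation, lookup, concatenation and padding steps can be sketched in a few lines. This is a minimal illustration with a toy lookup table in place of real word2vec vectors; the function name `sentence_vector`, the tiny dimensions, and the choice to map unknown words to a zero vector are assumptions not stated in the text.

```python
def sentence_vector(tokens, word_vectors, n_max, dim):
    """Concatenate word vectors, keep at most n_max tokens, zero-pad to n_max * dim."""
    flat = []
    for token in tokens[:n_max]:  # words beyond n_max are discarded
        # Assumption: a word missing from the table maps to a zero vector.
        flat.extend(word_vectors.get(token, [0.0] * dim))
    flat.extend([0.0] * (n_max * dim - len(flat)))  # padding at the end
    return flat

# Toy table with M = 3 instead of 300, and N = 5 instead of 25.
toy_vectors = {
    "I": [0.1, 0.2, 0.3],
    "like": [0.4, 0.5, 0.6],
    "music": [0.7, 0.8, 0.9],
}
vec = sentence_vector(["I", "like", "music"], toy_vectors, n_max=5, dim=3)
print(len(vec))  # 15
print(vec[9:])   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Whatever the length of the input, the result always has N × M components, which is what lets a fixed-size network consume it.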

The deep learning architecture used by the deep learning classifier in step S2 is shown in Fig. 3. The bottom layer is a convolutional neural network (CNN), which performs convolution on the sentence vector obtained from the input sentence to extract the local features of the input sentence; preferably two CNN layers are stacked, which yields more abstract local features. The local features are fed into a long short-term memory (LSTM) recurrent network; passing through two LSTM layers, the dependencies between preceding and following words in the sentence are encoded. The resulting relation encoding is passed to an activation layer of rectified linear units (ReLU) for a nonlinear transformation. The result of the nonlinear transformation is passed to the output layer, which finally gives the relation category of the input sentence.

The output layer may use a Softmax function or a Sigmoid function. With Softmax, the output of the deep learning classifier is multi-valued; for example, a preference classifier can be modeled as a multi-class classifier with the classes "like", "dislike" and "other". With Sigmoid, the output of the deep learning classifier is binary; for example, a nickname classifier can be modeled as a binary classifier with the classes "nickname" and "other".
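The difference between the two output layers can be illustrated with a small stand-alone sketch. The scores below are hypothetical values standing in for what the preceding layers would produce, and the 0.5 decision threshold for the Sigmoid case is an assumption.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Multi-class output: preference classifier with three classes.
PREFERENCE_LABELS = ["like", "dislike", "other"]
probs = softmax([2.0, 0.5, -1.0])  # hypothetical scores
print(PREFERENCE_LABELS[probs.index(max(probs))])  # like

# Binary output: nickname classifier.
p_nickname = sigmoid(-1.2)  # hypothetical score
print("nickname" if p_nickname >= 0.5 else "other")  # other
```

Softmax suits mutually exclusive classes because the probabilities compete with each other, while Sigmoid yields a single probability for a yes/no decision.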

Based on the above deep learning architecture, supervised training is carried out with labeled data from the specific domain, so that the deep learning classifier can accurately and efficiently recognize the relation category expressed in a sentence. The training of the deep learning classifier includes:

Step S21: feed the sentence vector of a training sample into the pre-built deep learning classifier and obtain the predicted relation category LP of the training sample through a feedforward pass (forward pass).

Step S22: obtain a loss value from the loss function F(LP, L), where LP is the predicted relation category and L is the relation category with which the sample is actually labeled. The loss value measures the difference between the predicted relation category and the actually labeled relation category. F may be cross entropy or mean squared error (MSE).

Step S23: according to the loss value, perform the backward pass (also called backpropagation of gradients) using stochastic gradient descent (SGD), updating the parameters of the deep learning classifier so that the predicted relation category output by the updated classifier is closer to the actually labeled relation category.

Step S24: train the deep learning classifier iteratively until the loss value between the predicted relation category output by the classifier and the actually labeled relation category falls below a preset threshold, or the number of iterations exceeds a preset limit.
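Steps S21 to S24 can be sketched as a training loop. To stay self-contained, the sketch below replaces the CNN/LSTM network with a toy linear Softmax classifier, uses cross entropy as F, and checks the stopping criterion on the mean loss over the training set; the model, the hyperparameters and the function names are all assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, x):
    """S21: feedforward pass, returning predicted class probabilities (LP)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def cross_entropy(predicted, label):
    """S22: loss F(LP, L) for one sample whose true class index is `label`."""
    return -math.log(predicted[label])

def train(samples, n_classes, lr=0.5, loss_threshold=0.05, max_iters=1000, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0][0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    mean_loss = float("inf")
    for iteration in range(1, max_iters + 1):
        x, label = rng.choice(samples)  # SGD: one random sample per step
        predicted = forward(weights, x)
        # S23: for softmax with cross entropy, d(loss)/d(score_c) = predicted_c - onehot_c
        for c in range(n_classes):
            grad = predicted[c] - (1.0 if c == label else 0.0)
            for j in range(dim):
                weights[c][j] -= lr * grad * x[j]
        # S24: stop once the mean training loss falls below the threshold,
        # or when the loop runs out of iterations.
        mean_loss = sum(cross_entropy(forward(weights, xs), ys)
                        for xs, ys in samples) / len(samples)
        if mean_loss < loss_threshold:
            break
    return weights, mean_loss, iteration

samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]  # toy "sentence vectors" with labels
weights, final_loss, iterations = train(samples, n_classes=2)
print(final_loss < 0.05, iterations <= 1000)
```

The two exit conditions mirror step S24: the loop ends either because the loss is small enough or because the iteration budget is exhausted.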

The architecture used by the above deep learning classifier models the sequential relationships between the words in a sentence well. For this reason it is quite sensitive to negation: it can distinguish "I like Zhou Jielun" from the sarcastic "I like Zhou Jielun, as if", and it can also recognize plain negation such as "I don't like Zhou Jielun" and multiple negation such as "It's not that I don't like Zhou Jielun".

Entity recognition can be modeled as a sequence labeling problem. Specifically, each character in the sentence is labeled with one of the tags B, M, E, S and O, where B (Begin) marks the first character of an entity, M (Middle) marks a character in the middle of an entity, E (End) marks the last character of an entity, and S (Single) marks an entity consisting of a single character. Characters that are not part of an entity are labeled O (Other). For example, "我/喜/欢/周/杰/伦" ("I like Zhou Jielun") can be labeled "我O/喜O/欢O/周B/杰M/伦E"; the B, M and E characters together give "周杰伦" (Zhou Jielun), which is the entity that is liked. Likewise, "我/喜/欢/歌" ("I like songs") can be labeled "我O/喜O/欢O/歌S", where S marks a single-character entity; the liked entity here is "歌" (song).
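Reading entities off a B/M/E/S/O label sequence can be sketched as follows. The function name `extract_entities` and the policy of silently discarding ill-formed partial entities (for example a B that is never closed by an E) are assumptions for illustration.

```python
def extract_entities(chars, tags):
    """Collect entities from per-character B/M/E/S/O tags."""
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "S":                    # single-character entity
            entities.append(ch)
            current = []
        elif tag == "B":                  # start a new entity
            current = [ch]
        elif tag == "M" and current:      # continue the open entity
            current.append(ch)
        elif tag == "E" and current:      # close the open entity
            current.append(ch)
            entities.append("".join(current))
            current = []
        else:                             # "O" or an ill-formed sequence
            current = []
    return entities

print(extract_entities("我喜欢周杰伦", "OOOBME"))  # ['周杰伦']
print(extract_entities("我喜欢歌", "OOOS"))        # ['歌']
```

Both calls use the two labeled examples from the text: the B, M and E characters join into one entity, and the S character forms an entity by itself.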

For the entity recognition problem, the optimal labeling can be solved with a conditional random field (CRF), so that the entity in the sentence is extracted accurately. A preferred implementation of step S3 is therefore as follows: feed the input sentence into a CRF model to obtain the optimal label sequence of the input sentence, and obtain the entity in the input sentence from the optimal label sequence.

The optimal label sequence of the input sentence is obtained through the CRF model as follows.

The sequence labeling problem can be solved with a conditional random field. Formally, for a given input sentence x (a sequence of characters) and a label sequence y over that sequence, the conditional random field models the conditional probability:

$$p(y \mid x, w) = \frac{\exp\left(w^{T} F(x, y)\right)}{\sum_{y'} \exp\left(w^{T} F(x, y')\right)}$$

Here exp(z) denotes e^z, where e is the natural constant; w is a trainable weight vector and w^T is its transpose; y' ranges over all possible labelings of the sequence x; and F(x, y) is the feature vector of the label sequence y on x. The conditional probability p(y | x, w) represents how likely it is, given the weights w, that the character sequence x is labeled with the label sequence y.
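For a very short sentence, the conditional probability described above can be computed by brute force, which makes the role of the normalizing sum over all labelings y' explicit. The feature templates (per-character emission counts and tag-transition counts), the sparse dictionary representation of w, and the hand-set weights are all illustrative assumptions.

```python
import math
from itertools import product

TAGS = ["B", "M", "E", "S", "O"]

def features(x, y):
    """Toy feature vector F(x, y) as a dict of counts (hypothetical templates)."""
    counts = {}
    for i, (ch, tag) in enumerate(zip(x, y)):
        emit_key = ("emit", ch, tag)
        counts[emit_key] = counts.get(emit_key, 0) + 1
        if i > 0:
            trans_key = ("trans", y[i - 1], tag)
            counts[trans_key] = counts.get(trans_key, 0) + 1
    return counts

def score(w, x, y):
    """w^T F(x, y), with w stored sparsely as a dict."""
    return sum(w.get(key, 0.0) * value for key, value in features(x, y).items())

def conditional_probability(w, x, y):
    """p(y | x, w), enumerating every labeling y' (feasible only for short x)."""
    z = sum(math.exp(score(w, x, y_prime))
            for y_prime in product(TAGS, repeat=len(x)))
    return math.exp(score(w, x, y)) / z

# Hand-set weights, chosen only for illustration.
w = {("emit", "喜", "O"): 1.0, ("emit", "欢", "O"): 1.0, ("emit", "歌", "S"): 2.0}
x = "喜欢歌"
print(conditional_probability(w, x, ("O", "O", "S")))
```

The enumeration visits 5^3 = 125 labelings here and grows exponentially with sentence length, which is why dynamic programming is used in practice.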

Given n pairs of training data {x_i, y_i}, the following objective function is solved:
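The objective function itself is not reproduced in this text. A common choice that is consistent with the definitions above, stated here as an assumption rather than as the original formula, is to maximize the conditional log-likelihood of the training data:

```latex
w^{*} = \arg\max_{w} \sum_{i=1}^{n} \log p(y_i \mid x_i, w)
      = \arg\max_{w} \sum_{i=1}^{n} \left[ w^{T} F(x_i, y_i)
        - \log \sum_{y'} \exp\left( w^{T} F(x_i, y') \right) \right]
```

In practice a regularization term on w is often added to such an objective; whether the original formula includes one cannot be determined from the text.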

The optimal w can be found by stochastic gradient descent (SGD).

Once the optimal w has been found, the value p(y' | x, w) can be computed for every possible labeling y'. The optimal labeling is the label sequence y that maximizes p(y | x, w). To improve computational performance, the optimal label sequence can be found with the Viterbi algorithm.
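A minimal Viterbi sketch shows how the best label sequence can be found without enumerating every labeling. It assumes that the score w^T F(x, y) decomposes into per-character emission scores and tag-to-tag transition scores; the callables `emit` and `trans` and the hand-set score tables are hypothetical, and the input sentence is assumed to be non-empty.

```python
def viterbi(x, tags, emit, trans):
    """Return the highest-scoring tag sequence for a non-empty sequence x.

    emit(ch, tag) and trans(prev_tag, tag) are hypothetical scoring callables
    standing in for the corresponding parts of w^T F(x, y).
    """
    # best[t] = (score, path) of the best labeling so far that ends in tag t
    best = {t: (emit(x[0], t), [t]) for t in tags}
    for ch in x[1:]:
        new_best = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p][0] + trans(p, t))
            total = best[prev][0] + trans(prev, t) + emit(ch, t)
            new_best[t] = (total, best[prev][1] + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

TAGS = ["B", "M", "E", "S", "O"]
# Hand-set scores, chosen only for illustration.
EMIT = {("我", "O"): 1.0, ("喜", "O"): 1.0, ("欢", "O"): 1.0,
        ("周", "B"): 2.0, ("杰", "M"): 2.0, ("伦", "E"): 2.0}
TRANS = {("B", "M"): 1.0, ("M", "E"): 1.0, ("O", "O"): 0.5, ("O", "B"): 0.5}

path = viterbi("我喜欢周杰伦", TAGS,
               lambda ch, t: EMIT.get((ch, t), 0.0),
               lambda p, t: TRANS.get((p, t), 0.0))
print("".join(path))  # OOOBME
```

The work grows linearly with sentence length and quadratically with the number of tags, instead of exponentially as in exhaustive enumeration. Maximizing the score also maximizes p(y | x, w), because the normalizing sum does not depend on y.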

Once the optimal label sequence has been found, the entity in the sentence is extracted accurately from the B, M, E or S labels in it.

Based on the same inventive concept as the above method, an embodiment of the present invention further provides a device for automatically identifying the relation and entity in a sentence, including: a preprocessing module 101, configured to project the user's input sentence into a space of fixed dimension and obtain the sentence vector of the input sentence in the space of fixed dimension; a relation recognition module 102, configured to feed the sentence vector into a pre-trained deep learning classifier and obtain the relation category of the input sentence; and an entity recognition module 103, configured to recognize the entity in the input sentence if a relation category is identified.

The method and device provided by the embodiments of the present invention use deep learning to judge the user's input sentence semantically, so relations can be recognized accurately. Entity recognition is modeled as a sequence labeling problem and the optimal labeling is solved with a conditional random field, so entities can be recognized accurately. Combining deep learning with a conditional random field achieves automatic extraction of relations and entities. Because machine learning is used to judge relations and entities semantically, the effects of the diversity of natural language expression are overcome. For example, "I like Zhou Jielun's songs", "Zhou Jielun's songs are my favorite" and "I love Zhou Jielun's songs to death" can all be identified as expressing the "like" relation, with "Zhou Jielun's songs" as the object that is liked. In addition, the method and device provided by the embodiments of the present invention are easier to maintain than traditional methods: to increase coverage, it is only necessary to add new data and train a new model.

Finally, it should be noted that the above embodiments serve only to illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the scope of the claims and the description of the present invention.

Claims

1. A method for automatically identifying the relation and entity in a sentence, characterized by including:

projecting the user's input sentence into a space of fixed dimension to obtain the sentence vector of the input sentence in the space of fixed dimension;

feeding the sentence vector into a pre-trained deep learning classifier to obtain the relation category of the input sentence;

if a relation category is identified, recognizing the entity in the input sentence.

2. The method according to claim 1, characterized in that projecting the user's input sentence into a space of fixed dimension to obtain the sentence vector of the input sentence in the space of fixed dimension includes:

segmenting the user's input sentence into words;

converting each segmented word into its corresponding word vector by looking up word2vec word vectors;

obtaining, from the word vectors of the segmented words, the sentence vector of the input sentence in a space of fixed dimension.

3. The method according to claim 2, characterized in that feeding the sentence vector into the pre-trained deep learning classifier to obtain the relation category of the input sentence includes:

feeding the sentence vector into a CNN layer for convolution to obtain the local features of the input sentence;

feeding the local features into an LSTM layer to obtain an encoding of the relations between preceding and following words in the input sentence;

feeding the relation encoding into a ReLU layer for a nonlinear transformation;

passing the result of the nonlinear transformation to an output layer to obtain the relation category of the input sentence.

4. The method according to claim 3, characterized in that the deep learning classifier includes multiple CNN layers.

5. The method according to claim 3, characterized in that the deep learning classifier includes multiple LSTM layers.

6. The method according to claim 3, characterized in that the output layer of the deep learning classifier uses a Softmax function or a Sigmoid function.

7. The method according to claim 1, characterized in that recognizing the entity in the input sentence includes:

feeding the input sentence into a CRF model to obtain the optimal label sequence of the input sentence, and obtaining the entity in the input sentence from the optimal label sequence.

8. The method according to claim 1, characterized in that the training of the deep learning classifier includes:

feeding the sentence vector of a training sample into a pre-built deep learning classifier and obtaining the predicted relation category LP of the training sample through a feedforward pass;

obtaining a loss value from a loss function F(LP, L), where L is the relation category with which the sample is actually labeled and the loss value measures the difference between LP and L;

according to the loss value, performing gradient backpropagation using stochastic gradient descent to update the parameters of the deep learning classifier;

training the deep learning classifier iteratively until the loss value between the predicted relation category output by the deep learning classifier and the actually labeled relation category falls below a preset threshold, or the number of iterations exceeds a preset limit.

9. The method according to claim 8, characterized in that the loss function is cross entropy or mean squared error.

10. A device for automatically identifying the relation and entity in a sentence, characterized by including:

a preprocessing module, configured to project the user's input sentence into a space of fixed dimension and obtain the sentence vector of the input sentence in the space of fixed dimension;

a relation recognition module, configured to feed the sentence vector into a pre-trained deep learning classifier and obtain the relation category of the input sentence;

an entity recognition module, configured to recognize the entity in the input sentence if a relation category is identified.