CN113779264B

CN113779264B - Transaction recommendation method based on patent supply and demand knowledge graph

Info

Publication number: CN113779264B
Application number: CN202111023408.7A
Authority: CN
Inventors: 何喜军; 孟雪; 武玉英; 张佑
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-08-29
Filing date: 2021-08-29
Publication date: 2024-08-16
Anticipated expiration: 2041-08-29
Also published as: CN113779264A

Abstract

The transaction recommendation method based on a patent supply and demand knowledge graph is applied to the field of patent transaction recommendation. The method comprises the following steps: (1) attribute system construction: patent data is collected from the IncoPat patent database, and an index system of attributes affecting patent transactions is constructed based on literature research; (2) patent transaction AHN construction: a patent transaction AHN comprising three types of nodes, three types of relations, and a plurality of attributes is constructed; (3) representation learning of the patent transaction AHN: a multidimensional Gaussian distribution is obtained for each node with a neural network, node sequences are generated by meta-path-based random walks, and a low-dimensional vector representation of each node is obtained with skip-gram, using KL divergence to measure the difference between the Gaussian distributions of different nodes; (4) Top-k recommendation for the target organization. Because the method takes the attributes of multiple node types into account, the interpretability of the recommendation results is improved; using multidimensional Gaussian distributions and KL divergence addresses embedding uncertainty and the asymmetry of distances between nodes of different types, so that accurate patent transaction recommendation can be realized.

Description

Transaction recommendation method based on patent supply and demand knowledge graph

Technical field:

The invention is applicable to the field of patent transaction recommendation, including recommending patents and transaction partners to organizations.

Background art:

The knowledge graph (KG) is an emerging technology for large-scale knowledge management and intelligent services in the big data era. It can capture and present the intricate relationships among multiple types of entities, provides an ideal technical means for solving the "knowledge island" problem and for knowledge recommendation, and can compensate for the reliance on manually designed meta-paths in recommendation based on heterogeneous information networks (HIN). At present there are many KG-based patent recommendation results: a semantic network is established with patent entities as nodes and the relationships among entities as edges, patent knowledge and the interactions among knowledge items are analyzed and mined, and patent similarity is calculated in combination with the graph structure to make recommendations. KG-based patent recommendation can alleviate information overload, assist patent retrieval, and extend patent knowledge services. However, transaction information is not integrated into these knowledge graphs, so transaction recommendation between buyers and sellers is difficult to realize. Meanwhile, constructing a patent knowledge graph is a labor-intensive process that depends on rules written by domain experts or on manual labeling of domain data.

Therefore, how to automatically or semi-automatically extract reliable and consistent knowledge from large-scale patent supply and demand information and construct a patent supply and demand knowledge graph (Patent Supply and Demand KG, PSD-KG) remains a challenge.

(1) Patent transaction recommendation research based on heterogeneous information network

A heterogeneous information network (HIN) can fuse multiple types of objects and the multidimensional relations among them, and realizes transaction recommendation by calculating, from node information and link relations, the probability that links will form between nodes in the future. A bipartite network consists of two disjoint node sets with edges appearing only between nodes of different sets; unknown links between nodes are predicted from the similarity between nodes mapped through the network. Bipartite-network-based recommendation is the primary stage of HIN recommendation, but bipartite networks have difficulty fusing complex patent knowledge.

Constructing a complex HIN that fuses multidimensional patent information and using it for patent transaction recommendation is a direction that has emerged in recent years. Prior studies have proposed recommendation methods based on HIN embedding: first, meta-paths/meta-structures are designed to capture the complex semantic relations among nodes and construct the HIN; transaction recommendation is then achieved through a similarity measure over the meta-paths/meta-structures between nodes. Studies have shown that HIN-based recommendation outperforms traditional recommendation based on homogeneous networks. However, on the one hand, the design of meta-paths in this method depends on subjective human judgment, and potential semantic relations among entities are often ignored, which affects recommendation accuracy; on the other hand, the cold start problem remains because technology transaction relations among organizations are sparse.

(2) Patent recommendation research based on knowledge graph

The knowledge graph (KG) is a structured network that stores the relationships between knowledge entities. Compared with the HIN, the KG can not only represent and model entities and their complex semantic relationships, but also represent massive unstructured or semi-structured knowledge and its relationships as structured entities and relations, thereby providing a knowledge base for patent recommendation and transaction recommendation. In addition, a KG is usually larger than a HIN in scale, so it can cover more entities and relations and improve knowledge coverage. At present, research based on knowledge graphs mainly focuses on patent recommendation: technical features mined from patent texts are expressed as entities, the interactions among organizations and patents are expressed as relations among the entities, and recommendation is performed by calculating the similarity among patents. Because patent supply and demand and transaction information have not been fused into the knowledge graph, no relevant results have yet been reported for knowledge-graph-based patent transaction recommendation.

(3) Recommendation method research based on knowledge graph

At present, the recommendation method based on the knowledge graph mainly comprises 3 types:

① Recommendation based on knowledge graph embedding: this method embeds the knowledge graph, expressed as (head entity, relation, tail entity) triples, to obtain vector representations of the entities and relations, and calculates entity similarity using the Pearson correlation coefficient, Euclidean distance, cosine similarity, and the like to complete recommendation. Classical knowledge graph embedding methods include the TransE, TransH, and TransR models.

The method is widely applied to link prediction and triplet classification, but has the following limitations: first, when a large-scale knowledge graph is used as an embedding object, the calculation complexity is high. Secondly, only the direct relation among the entities is concerned, the existence of the path relation in the knowledge graph is ignored, and high-quality entity vectors are difficult to train.

② Path-based recommendation: this method focuses on connectivity-based similarity in the graph. It treats the graph as a heterogeneous information network and acquires the high-order neighbor information of an entity by manually designing meta-paths/meta-structures and combining them with a traversal strategy. Vector representations of paths are often obtained with One-Hot, TF-IDF, Word2Vec, and similar methods, and recommendation is completed in combination with PathSim, factorization machines, and the like.

The method focuses on the path relation among the entities in the knowledge graph, and does not need to carry out embedded representation on the whole graph. However, a large number of meta paths/meta structures need to be constructed manually, the dependency on the domain knowledge is strong, and the recommendation effect also depends on the quality and the quantity of the meta paths/meta structures.

③ Propagation-based recommendation: this method combines the embedding-based and path-based recommendation methods. The basic idea is as follows: based on the connectivity of the knowledge graph, multi-hop relations are defined as high-order connections; by iteratively propagating information over the whole knowledge graph, the neighbor information and potential interests of an entity are captured from the high-order connections, so that paths are acquired automatically. A graph neural network aggregates neighbor entity information toward the central entity from the outside in to update the vector representation of the central entity, and recommendation is completed by calculating the vector similarity of central entities.

The graph sample and aggregate network (Graph Sample and Aggregate, GraphSAGE) and the graph convolutional network (Graph Convolutional Network, GCN) are two representative graph neural networks. When aggregating neighbor information, however, GraphSAGE assumes that all neighbors are equally important. Because the entities on each path differ in their importance for identifying user preferences, neighbor weights need to be distinguished. GCN can differentiate weights according to the degrees of neighbors, but the way it aggregates neighbor features depends on the graph structure, so the trained model does not generalize to other graph structures and its portability is weak.

The method focuses on the multi-hop neighbor information of the central entity, does not need to embed and express the whole knowledge graph, and reduces the computational complexity; in addition, potential paths can be captured, automatic acquisition of the paths can be realized, and the limitation of manually designed meta-paths is avoided. But the key is how to realize and optimize the weighted aggregation of the neighbor entity information in the recommendation model, so as to obtain the vector representation of the center entity and improve the recommendation precision.

The development of the graph attention network (Graph Attention Network, GAT) provides a new idea for the vector representation of central entities and for innovation in propagation-based recommendation. GAT introduces an attention mechanism into GCN, replacing the static normalized convolution operation: it assigns higher weights to the neighbor entities more relevant to the central entity and obtains the vector representation of the central entity as the weighted sum of neighbor features. In contrast to GCN, GAT does not depend on the graph structure and has strong representation capability.

In summary, the main problems faced by patent transaction recommendation methods include: how to automatically or semi-automatically extract reliable and consistent knowledge from large-scale structured and unstructured patent supply and demand information while ensuring the efficiency of graph construction; how to fuse transaction and supply and demand information into the patent knowledge graph and construct a patent supply and demand knowledge graph, so as to overcome the difficulty of graph-based patent transaction recommendation caused by the lack of transaction information; how to capture potential paths in the graph and acquire paths automatically, so as to remove the dependence of heterogeneous information networks on manually designed meta-paths; how to distinguish and optimize neighbor weights during neighbor information aggregation; and how to improve the accuracy of the recommendation results on the basis of solving the above problems.
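The weighted aggregation described above can be illustrated with a minimal sketch. It is not the GAT layer itself: GAT learns its attention scores from projected features, whereas this sketch substitutes a plain dot product as a hypothetical scoring function and only shows the softmax-weighted sum over neighbors. All names and vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_aggregate(center, neighbors, score_fn=dot):
    """Return (weights, aggregated vector) for one central entity.

    score_fn is a stand-in for a learned attention score (assumption).
    Weights are a softmax over the scores, so they sum to 1.
    """
    scores = [score_fn(center, nb) for nb in neighbors]
    max_s = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(center)
    aggregated = [
        sum(w * nb[d] for w, nb in zip(weights, neighbors)) for d in range(dim)
    ]
    return weights, aggregated

# Toy vectors (hypothetical): the first neighbor is more related to the center.
center = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
weights, aggregated = attention_aggregate(center, neighbors)
```

In this toy case the neighbor aligned with the central entity receives the larger weight, which is the behavior the attention mechanism is meant to provide.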

Summary of the invention:

1. Technical problems to be solved by the invention.

The transaction recommendation method based on the patent supply and demand knowledge graph provided by the invention can solve the following problems to a certain extent: first, a patent domain dictionary is established to label the corpus automatically, and patent semantic entities are recognized with a BERT-BiLSTM-CRF model, which addresses the incomplete rule coverage and excessive dependence on expert knowledge in semantic entity recognition during traditional patent knowledge graph construction and improves graph construction efficiency; second, transaction and supply and demand information are fused into the patent knowledge graph so that transaction recommendation can be carried out; third, paths are acquired automatically using multi-hop relations and an attention mechanism is introduced into the recommendation model, which overcomes the limitation that meta-paths in heterogeneous information networks rely heavily on manual design and can hardly capture the potential semantic relations among nodes comprehensively, and at the same time solves and optimizes the distinction and calculation of neighbor weights during neighbor information aggregation.

2. Specific technical scheme adopted by the invention.

The method mainly comprises the following steps: ① entity and relation planning for the patent supply and demand knowledge graph, covering semantic and non-semantic entities and semantic and non-semantic relations; ② constructing a patent domain dictionary, labeling the corpus semi-automatically, and identifying semantic entities based on a BERT-BiLSTM-CRF model; ③ extracting non-semantic entities, semantic relations, and non-semantic relations using crawler technology, word embedding technology, and co-occurrence relations, and storing them in a Neo4j graph database to complete the construction and storage of the patent supply and demand knowledge graph; ④ in the patent supply and demand knowledge graph, automatically acquiring paths using multi-hop relations, and completing the initial embedding of the paths with a TransR model to obtain initial embedding vectors of the entities and relations; ⑤ inputting the initial embedding vectors into a graph attention network, and aggregating neighbor information with weights to update the vector representation of the central entity and obtain its final vector representation; ⑥ calculating the vector similarity among central entities, and performing transaction recommendation with a Top-K method, recommending the Top-K results for each central entity.

3. Effects achievable by the invention.

First, the conventional patent knowledge graph is not fused with transaction information, so that important applications such as technology supply and demand mining, technology transaction recommendation and the like are difficult to develop. Based on the existing patent knowledge graph, the invention merges 6 types of entities and 6 types of relations in the patent transaction information and constructs a patent supply and demand knowledge graph (PSD-KG) consisting of 12 types of entities and 14 types of relations.

Secondly, traditional patent knowledge graph construction methods suffer from incomplete rule coverage, reliance on manually annotated corpora, and similar limitations. The invention builds a patent domain dictionary to label the corpus semi-automatically and, on this basis, provides a semantic entity identification method based on a BERT-BiLSTM-CRF model, realizing semi-automatic construction of PSD-KG and improving the efficiency of graph construction. Precision P, recall R, and the F1 score are used to evaluate the model, and the recognition precision of the BERT-BiLSTM-CRF model for technical point and technical efficacy semantic entities is higher than 85%.
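For reference, the three evaluation measures named above can be computed at entity level as in the following sketch. The entity representation (start index, end index, type) and the sample data are hypothetical and are not taken from the patent's experiments.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level precision, recall and F1 over two collections of entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Hypothetical entities as (start, end, type) tuples.
gold = {(0, 1, "Technology"), (4, 5, "Effect"), (7, 7, "Technology")}
pred = {(0, 1, "Technology"), (4, 5, "Effect"), (9, 9, "Effect"), (11, 12, "Technology")}
p, r, f1 = precision_recall_f1(pred, gold)
```

An entity counts as correct here only if its span and type both match, which is one common convention and an assumption about how the reported figures were scored.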

Thirdly, the invention combines a recommendation method based on propagation with an attention mechanism, and provides a transaction recommendation model (Transaction Recommendation Model) based on PSD-KG, which is called PSD-KG-TRM for short, wherein: the path is automatically acquired through multi-hop, so that the defect that potential semantic relations among entities are difficult to capture based on manually planning meta-paths in HIN recommendation is overcome.

Description of the drawings:

FIG. 1 is a block diagram of a method and system for recommending transactions based on a patent supply and demand knowledge graph according to an embodiment of the present invention;

FIG. 2 is a process flow diagram of a method and system for recommending transactions based on a patent supply and demand knowledge graph according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a process for constructing a patent supply and demand knowledge graph, according to an embodiment of the invention;

FIG. 4 is a block diagram of a BERT-BiLSTM-CRF model for semantic entity identification according to an embodiment of the present invention;

Detailed description of the embodiments:

FIG. 1 shows a framework diagram of the invention. The embodiment proceeds as follows:

① S101, data acquisition and preprocessing are completed. Fuel cell reference materials are consulted, including GB/T 20042.1-2017 "Proton exchange membrane fuel cell, Part 1: Terms", GB/T 28816-2012 "Fuel cell terms", and GB/T 24548-2009 "Fuel cell electric vehicle terms", and domain terms are manually screened to construct the patent retrieval expression.

Patent information is retrieved from the Derwent Innovations Index (DII) database, and supply and demand information such as patent transfers and licenses is matched and collected through the IncoPat database. Because IncoPat only contains transfer information for Chinese and US patents, granted and valid invention patents published in China and the US are screened, giving 16040 patents in total for constructing the patent supply and demand knowledge graph. Because the graph involves many organizations, the 235 organizations that participated in patent transactions more than 5 times are selected as the sample data of the recommendation model, namely the central entities.

② S102, patent supply and demand knowledge graph planning is completed. The entities and relations contained in traditional patent knowledge graphs are reviewed and summarized, 6 types of entities and 6 types of relations from patent supply and demand and transaction information are fused in, and a patent supply and demand knowledge graph (PSD-KG) consisting of 12 types of entities and 14 types of relations is constructed, further expanding the patent knowledge base. Table 1 shows the entities and relations contained in PSD-KG.

Table 1 PSD-KG entity and relationship planning

Remarks: "Extra" represents newly added entity and relationship of the present invention

③ S103, completing technical point and technical efficacy semantic entity identification based on a BERT-BiLSTM-CRF model, wherein the steps comprise: patent field dictionary construction, corpus semiautomatic labeling and semantic entity identification.

First, patent domain dictionary construction. First, the terms in the national standard GB/T 28816-2012 "Fuel cell terms" are summarized, and 105 technical terms are obtained to form a technical point seed dictionary. Then, the technical efficacy words contained in the "technical efficacy TRIZ parameters" and "technical efficacy level 1" fields of the IncoPat database are manually screened and classified, and 224 efficacy terms are obtained to form a technical efficacy seed dictionary. The vocabulary in the seed dictionaries consists mostly of basic terms, such as catalyst and cost, and can hardly cover compound technical point and technical efficacy words, such as allowances catalysts, graphene catalyst, processing costs, and manufacturing cost. Therefore, dependency parsing is performed on the patent abstracts with the StanfordNLP tool, and words whose dependency relation is "component" are collected to construct a compound noun list. Finally, the compound nouns containing technical point or technical efficacy seed words are screened from the compound noun list and merged with the seed dictionaries, yielding 18155 technical point words and 149331 technical efficacy words.

Second, semi-automatic corpus labeling. At present, manually labeling data for entity identification in a specific field consumes a great deal of manpower and time. The invention provides a method for semi-automatic labeling based on the patent domain dictionary. The BIESO labeling scheme is selected, and the two types of entities, technical points and technical efficacy, are distinguished by "Technology" and "Effect". The procedure is as follows: the patent abstract is segmented into words using Python, the technical point and technical efficacy dictionaries are traversed, matched words are regarded as entities, words that do not belong to any entity are marked as O, and the words are labeled in the manner shown in Table 2.
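The dictionary-based labeling step can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, case-insensitive longest-match-first lookup, and tag strings of the form `B-Technology` are choices made here for the example; the patent's exact tag format is given in Table 2 and may differ. The dictionary entries are hypothetical.

```python
def bieso_tag(tokens, dictionary):
    """Label tokens with BIESO tags by matching against a domain dictionary.

    dictionary maps a tuple of lowercase tokens to an entity type.
    Longest match first is an assumption of this sketch.
    """
    tags = ["O"] * len(tokens)
    max_len = max((len(key) for key in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + span])
            if key in dictionary:
                etype = dictionary[key]
                if span == 1:
                    tags[i] = "S-" + etype  # single-token entity
                else:
                    tags[i] = "B-" + etype  # beginning
                    for j in range(i + 1, i + span - 1):
                        tags[j] = "I-" + etype  # inside
                    tags[i + span - 1] = "E-" + etype  # end
                i += span
                matched = True
                break
        if not matched:
            i += 1  # token stays "O"
    return tags

# Hypothetical dictionary entries and sentence.
dictionary = {
    ("graphene", "catalyst"): "Technology",
    ("catalyst",): "Technology",
    ("manufacturing", "cost"): "Effect",
}
tokens = "The graphene catalyst reduces manufacturing cost".split()
tags = bieso_tag(tokens, dictionary)
```

Preferring the longest match keeps a compound term such as "graphene catalyst" as one entity instead of tagging only its seed word.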

Table 2 data annotation modes and examples

Third, semantic entity identification. Semantic entity identification based on the BERT-BiLSTM-CRF model is a key step in the semi-automatic construction of PSD-KG. The model has 3 layers: first, the labeled corpus is passed through the BERT layer to obtain the corresponding sequence vectors; then, the sequence vectors are input into the BiLSTM layer to model the semantic features of the context; finally, the CRF layer decodes the output of the BiLSTM layer to obtain the predicted label sequence, and entity identification is completed by extracting and classifying each entity in the sequence. The specific processing flow of the BERT-BiLSTM-CRF model is as follows (see FIG. 4 for details):

(1) BERT layer: on the basis of the domain dictionary, each sentence in the patent abstracts is automatically labeled in the BIESO scheme, and the [CLS] and [SEP] tokens are inserted at the beginning and end of the sentence, respectively. The processed sentences are converted into a word sequence $W = (w_1, w_2, \dots, w_{n-1}, w_n)$ ($n$ is the total number of words in all patent texts); the words, sentences, and positions are then embedded through Token Embedding, Segment Embedding, and Position Embedding, and after Transformer feature extraction a sequence vector $X = (x_1, x_2, \dots, x_{n-1}, x_n)$ rich in semantic features is obtained.

(2) BiLSTM layer: the Transformer in the BERT layer can directly model the relationship between any two characters in the sequence, but its position information is encoded only through Position Embedding, so it is insensitive to the positional features of the sequence. BiLSTM applies a forward LSTM and a backward LSTM to each word sequence and combines their outputs at the same time step, which makes full use of the context of the text, uncovers hidden features in the text, alleviates the problem of recognizing out-of-vocabulary words, and efficiently abstracts and models the contextual semantic features of the sequence. The BiLSTM layer takes the sequence vector obtained from the BERT layer as the input of each time step; the hidden states are concatenated by position to obtain the complete hidden state sequence, recorded as $H = (h_1, h_2, \dots, h_n) \in \mathbb{R}^{n \times hd}$, where $hd$ is the hidden state dimension. The label score matrix corresponding to the sequence is then calculated as $L = (l_1, l_2, \dots, l_n) \in \mathbb{R}^{n \times sn}$, where $sn$ is the number of labels. For the label set $TAG = (tag_1, tag_2, \dots, tag_{sn})$, each row of the label score matrix is $l_i = (l_{i1}, l_{i2}, \dots, l_{i\,sn})$, where $l_{ij}$ is the score of labeling the semantic vector $x_i$ with label $tag_j$. The layer is trained with the LSTM module provided in the TensorFlow library.

(3) CRF layer: the BiLSTM layer classifies each label in the sequence independently and cannot handle the dependencies between adjacent labels, which can lead to confused entity labels. The CRF considers both the state features of the current input and the transition features between label categories, and obtains the optimal predicted sequence from the relations between adjacent labels, compensating for this weakness of BiLSTM and reaching a globally optimal solution. Therefore, a CRF is introduced after BERT-BiLSTM to model the label dependencies across the sequence.

The CRF layer introduces a label transition probability matrix $A$ to constrain the output labels, where $A_{y_i, y_{i+1}}$ represents the probability of transitioning from tag $y_i$ to tag $y_{i+1}$. The label score matrix $L$ is used as the state probability matrix, where $L_{i, y_i}$ represents the probability that the semantic vector $x_i$ receives tag $y_i$. For the word sequence $W = (w_1, w_2, \dots, w_n)$, the score of a predicted tag sequence $Y = (y_1, y_2, \dots, y_n)$ is the sum of the transition probabilities and the state probabilities, with the following formula:

$$s(W, Y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} L_{i, y_i} \tag{1}$$

Normalizing over all possible sequence paths with the softmax function gives the probability of the tag sequence $Y$:

$$P(Y \mid W) = \frac{e^{s(W, Y)}}{\sum_{\tilde{Y} \in Y_W} e^{s(W, \tilde{Y})}} \tag{2}$$

Taking the logarithm of both sides gives the likelihood function of the tag sequence:

$$\ln P(Y^{*} \mid W) = s(W, Y^{*}) - \ln \sum_{\tilde{Y} \in Y_W} e^{s(W, \tilde{Y})} \tag{3}$$

In formula (3), $Y^{*}$ represents the actual tag sequence, $s(W, Y^{*})$ represents the score of predicting the actual tag sequence for the word sequence $W$, and $Y_W$ represents the set of all possible tag sequences. Decoding with the Viterbi algorithm yields the output sequence with the maximum score, namely the optimal tag sequence:

$$\hat{Y} = \arg\max_{\tilde{Y} \in Y_W} s(W, \tilde{Y})$$

Finally, the entities are assembled according to the labels to complete entity identification.
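Viterbi decoding over a state score matrix and a transition matrix can be sketched as below. This is a generic dynamic-programming illustration, not the patent's trained model: the scores are hand-set toy values, the tag set is reduced to three hypothetical tags, and no start or end transitions are modeled.

```python
def viterbi(emissions, transitions):
    """Return (best_score, best_path) for a linear-chain model.

    emissions[i][j]: state score of tag j at position i (label score matrix).
    transitions[p][j]: score of moving from tag p to tag j (transition matrix).
    """
    n, k = len(emissions), len(transitions)
    score = list(emissions[0])
    backpointers = []
    for i in range(1, n):
        new_score, pointers = [], []
        for j in range(k):
            # Best previous tag for ending at tag j in position i.
            best_prev = max(range(k), key=lambda p: score[p] + transitions[p][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emissions[i][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

# Toy setup (hypothetical): tag indices 0 = O, 1 = B, 2 = E.
emissions = [
    [2.0, 1.0, 0.0],
    [0.0, 1.4, 1.5],  # position 1 slightly prefers E when taken alone
    [0.5, 0.0, 2.0],
]
NEG = -100.0  # strongly penalized transitions such as O -> E or B -> O
transitions = [
    [0.0, 0.0, NEG],  # from O
    [NEG, NEG, 0.0],  # from B
    [0.0, 0.0, NEG],  # from E
]
best_score, best_path = viterbi(emissions, transitions)
```

At the second position the state scores alone favor E, but O followed by E is penalized, so the decoded path is O, B, E. This is the adjacent-label constraint that the CRF layer adds on top of BiLSTM.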

④ S104, constructing and storing the patent supply and demand knowledge graph. Comprises 4 steps: non-semantic entity identification, semantic relation extraction, non-semantic relation extraction and Neo4j graph database-based storage.

First, non-semantic entity identification. (1) Organization type identification: a keyword table for organization classification is constructed, organizations are divided into 6 classes (enterprises, universities, scientific research institutions, government institutions, individuals, and financial institutions), and the type mapping is implemented in Python; (2) organization city identification: the Baidu Maps API and Google Maps API are called using JavaScript, and the city of each organization is found through fuzzy query combined with manual lookup. Other non-semantic entities are obtained from the structured data through regular expressions and crawler technology.

Second, semantic relation extraction. Relation extraction comprises semantic relation extraction and non-semantic relation extraction; the semantic relations comprise 2 types: semantic similarity between technical points and semantic similarity between technical efficacies.
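A keyword-table type mapping of the kind described in (1) might look like the following sketch. The keywords, their order, and the fallback to "individual" are all illustrative assumptions; the patent does not disclose its actual keyword table, and plain substring matching like this is crude.

```python
# Hypothetical keyword table; earlier rows take priority over later ones.
ORG_TYPE_KEYWORDS = [
    ("university", ("university", "college")),
    ("scientific institution", ("institute", "academy", "laboratory")),
    ("government institution", ("ministry", "bureau", "government")),
    ("financial institution", ("bank", "capital", "fund")),
    ("enterprise", ("co.", "ltd", "inc", "corp", "company")),
]

def classify_org(name):
    """Map an organization name to one of the 6 types by keyword lookup."""
    lowered = name.lower()
    for org_type, keywords in ORG_TYPE_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return org_type
    return "individual"  # assumption: names with no keyword are treated as persons

examples = ["Beijing University of Technology", "Toyota Motor Corp", "John Smith"]
types = [classify_org(name) for name in examples]
```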

The word embedding method is used for calculating the semantic relation of technical points, and the method comprises the following steps:

(1) A technical point set is constructed and denoted $Tech = (tech_1, tech_2, \dots, tech_m)$, where $tech_i$ represents the $i$-th technical point word and $m$ is the number of technical point words.

(2) The technical point words in the set are embedded with the BERT model to obtain the technical point vector set $Tech\_vector = (T_1, T_2, \dots, T_m)$.

(3) The technical point semantic similarity matrix is calculated. Similarity is computed over the Cartesian product of the technical point vector set to obtain the $m \times m$ semantic similarity matrix $M_1$:

$$M_1 = \begin{pmatrix} Tsim(T_1, T_1) & \cdots & Tsim(T_1, T_m) \\ \vdots & \ddots & \vdots \\ Tsim(T_m, T_1) & \cdots & Tsim(T_m, T_m) \end{pmatrix}$$

In $M_1$, $Tsim(T_i, T_j)$ ($i = 1, 2, \dots, m$; $j = 1, 2, \dots, m$) represents the semantic similarity of the technical point vectors $T_i$ and $T_j$. It is calculated as the cosine similarity of the vectors:

$$Tsim(T_i, T_j) = \frac{T_i \cdot T_j}{\lVert T_i \rVert \, \lVert T_j \rVert}$$

(4) Technical point similarity relations are established. For each technical point, the 20 technical points with the highest similarity values are selected, and similarity relations are constructed with them.

Third, non-semantic relation extraction. There are 12 classes: the relations between organizations and patents (application, possession, transfer-out, assigned, licensing, licensed), organization-city membership, organization-type membership, patent-technical field membership, the patent-technical point relation, the patent-technical efficacy relation, and the citation relation between patents. The extraction steps are as follows:
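Steps (3) and (4) can be sketched in a few lines. The vectors below are small hand-written stand-ins for BERT embeddings, and excluding a technical point from its own neighbor list is an assumption of this sketch; the embodiment keeps the top 20 neighbors, while the toy example keeps 2.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_similar(vectors, k):
    """For each term, return its k most similar other terms with their scores."""
    names = list(vectors)
    result = {}
    for a in names:
        sims = [(b, cosine(vectors[a], vectors[b])) for b in names if b != a]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        result[a] = sims[:k]
    return result

# Hypothetical 3-dimensional vectors standing in for BERT embeddings.
vectors = {
    "catalyst": [1.0, 0.0, 0.0],
    "graphene catalyst": [0.9, 0.1, 0.0],
    "membrane": [0.0, 1.0, 0.0],
    "bipolar plate": [0.0, 0.0, 1.0],
}
neighbors = top_k_similar(vectors, k=2)
```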

(1) Extraction of transfer-out/assigned and licensing/licensed relations: when a patent has been transferred multiple times, IncoPat merges the multiple transferors and transferees into single fields, so the organizations participating in each transfer cannot be determined accurately from the downloaded transfer records, and the records of patents transferred multiple times must be split. The treatment is as follows: the "legal status" field of US patent data is parsed with regular expressions to split multiple transfers; the "legal status" field of Chinese patents contains more noise, so the transfer and license records are crawled one by one using Python.

After the single-transfer records are obtained, the transfer-out and assigned relations among the transferor, the assignee, and the patent publication number are extracted using Python, and the transfer time is stored as an attribute of the transfer-out and assigned relations to distinguish multiple transfers. The licensing and licensed relations are extracted in the same way.

(2) Extraction of other relations: apart from the transfer-out/assigned and licensing/licensed relations and the similarity relations between technical points and between technical efficacies, the other 8 types of relations are completed from the original data using a co-occurrence method based on patent bibliographic records. The specific method is as follows:

Application relation and possession relation between organization and patent: with the publication number as the intermediary, these are established between the patent publication number and the applicant and the current patentee, respectively. Organization-type and organization-city membership: established between the organization and its type and between the organization and its region, respectively. Patent-field membership: established between the patent and the first 4 characters of its IPC (International Patent Classification) code, that is, the subclass level. Patent-technical point and patent-technical efficacy relations: established between the patent publication number and the technical point and the technical efficacy, respectively. Citation relation between patents: established between the publication numbers of the citing and cited patents.

Fourth, storage based on the Neo4j graph database. After entity identification and relation extraction are completed, 12 types of entities and 14 types of relations are obtained for constructing the patent supply and demand knowledge graph; the results are shown in Table 3. A connection between Python and the Neo4j graph database is established with the py2neo library, and the entities and relations are stored in the Neo4j graph database.
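The co-occurrence extraction from a bibliographic record can be sketched as follows. The record layout, field names, relation names, and identifiers are all hypothetical; only the rules themselves (publication number as intermediary, first 4 characters of the IPC code as the field) follow the description above.

```python
def relations_from_record(record):
    """Derive (head, relation, tail) triples from one bibliographic record."""
    pub = record["publication_number"]
    triples = []
    for applicant in record.get("applicants", []):
        triples.append((applicant, "APPLIES_FOR", pub))
    for owner in record.get("current_patentees", []):
        triples.append((owner, "OWNS", pub))
    ipc = record.get("ipc_main", "")
    if len(ipc) >= 4:
        # First 4 characters of the IPC code identify the subclass.
        triples.append((pub, "BELONGS_TO_FIELD", ipc[:4]))
    for cited in record.get("cited_publications", []):
        triples.append((pub, "CITES", cited))
    return triples

# Hypothetical record with placeholder identifiers.
record = {
    "publication_number": "PUB-1",
    "applicants": ["Org A"],
    "current_patentees": ["Org B"],
    "ipc_main": "H01M8/04",
    "cited_publications": ["PUB-2", "PUB-3"],
}
triples = relations_from_record(record)
```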
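One way to prepare triples for storage is to generate parameterized Cypher `MERGE` statements, as in the sketch below. It only builds the query text and parameters and does not connect to a database; the label names, the `name` property, and the `rel_` parameter prefix are assumptions made for illustration. Relation properties are placed inside the `MERGE` pattern so that, for example, two transfers with different times remain separate relations.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def merge_triple_cypher(head, head_label, rel_type, tail, tail_label, rel_props=None):
    """Build a parameterized Cypher MERGE query for one (head, relation, tail) triple.

    Labels, relation types and property keys cannot be passed as Cypher
    parameters, so they are validated and inlined; all values are parameters.
    """
    rel_props = rel_props or {}
    for identifier in (head_label, rel_type, tail_label, *rel_props):
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    prop_pattern = ", ".join(f"{key}: $rel_{key}" for key in rel_props)
    rel_body = f":{rel_type} {{{prop_pattern}}}" if rel_props else f":{rel_type}"
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[{rel_body}]->(t)"
    )
    params = {"head": head, "tail": tail}
    params.update({f"rel_{key}": value for key, value in rel_props.items()})
    return query, params

# Hypothetical transfer-out relation with the transfer time as an attribute.
query, params = merge_triple_cypher(
    "Org A", "Organization", "TRANSFER_OUT", "PUB-1", "Patent", {"time": "2020-01-01"}
)
```

The resulting query and parameter dictionary would then be handed to whichever Neo4j client is in use, such as the py2neo connection mentioned above.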

TABLE 3 entity and relationship quantity

⑤ S105, completing multi-hop path acquisition for the central entity and initial embedding of the paths. First, the paths of the central entity are acquired automatically from the patent supply and demand knowledge graph by using multi-hop relations; then the TransR model completes the initial embedding of the paths, yielding initial embedding vectors of the entities and relations.

First, multi-hop path acquisition for the central entity. In PSD-KG, an entity directly adjacent to the central entity is called a 1-hop entity, an entity directly adjacent to a 1-hop entity is called a 2-hop entity, and so on, giving the q-hop entities (q ≥ 2) of the central entity. The set of entities within the q-hop range is called the multi-hop neighbors, and a path between the central entity and a multi-hop neighbor is called a multi-hop path. Starting from the central entity, a connection to the Neo4j graph database is established with the py2neo library for Python, a query statement is constructed in the Cypher language, and the multi-hop paths are obtained by applying a breadth-first search strategy.
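The breadth-first acquisition of multi-hop paths can be illustrated without a database. The sketch below is an in-memory stand-in for the Cypher query described above, not the query itself: the adjacency dictionary, the function name `multi_hop_paths` and the rule that an entity may not repeat within a path are all assumptions made for the example.

```python
from collections import deque


def multi_hop_paths(adjacency, center, max_hops):
    """Breadth-first enumeration of simple paths of 1..max_hops hops from center.

    adjacency maps an entity to a list of (relation, neighbor) pairs.
    Each returned path alternates entity, relation, entity, ...
    """
    paths = []
    queue = deque([[center]])
    while queue:
        path = queue.popleft()
        hops = len(path) // 2
        if hops >= 1:
            paths.append(tuple(path))
        if hops == max_hops:
            continue  # do not extend beyond the hop limit
        visited = set(path[::2])  # entities already on this path
        for relation, neighbor in adjacency.get(path[-1], []):
            if neighbor not in visited:
                queue.append(path + [relation, neighbor])
    return paths
```

Because the queue is first-in, first-out, all 1-hop paths are emitted before any 2-hop path, which matches the outward layer-by-layer definition of q-hop entities.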

Second, initial embedding based on the TransR model, which encodes the structural information of the paths and generates an initial embedding vector for each entity and relation. The present invention uses the TransR model to generate the embedding vectors because TransR holds that the same entity has different semantics under different relations and therefore introduces an additional relation space, so that the distance between two entities is calculated within the space of a specific relation. For example, organization 1 and organization 2 are both of the enterprise type, but organization 1 is located in New York and organization 2 in Beijing. Organization type and city are two different relations, and by introducing an organization type space and a city space, organization 1 and organization 2 are similar with respect to organization type but dissimilar with respect to city; that is, the comparison is made within a specific relation space.

The basic method is as follows. For each triplet (h, r, t), consisting of a head entity h, an inter-entity relation r and a tail entity t, TransR represents each type of relation with two components: a vector $e_r \in \mathbb{R}^{rd}$ representing the relation itself, and a second component used to construct the projection matrix $M_r \in \mathbb{R}^{rd \times ed}$, which represents the relation vector space in which this relation lies. Here ed and rd are the entity embedding dimension and the relation embedding dimension, respectively, and $e_h, e_t \in \mathbb{R}^{ed}$ denote the vectors of the head and tail entities in the entity space. First, for a specific relation r, the projection matrix $M_r$ is used to obtain the projection vectors of the head entity and the tail entity in the relation space, $e_h^r$ and $e_t^r$, where:

$$e_h^r = M_r e_h, \qquad e_t^r = M_r e_t$$

The embeddings of the entities and relations in the triplets (h, r, t) are then learned iteratively. For any given triplet (h, r, t), the loss function is defined as follows:

$$g(h, r, t) = \left\| M_r e_h + e_r - M_r e_t \right\|_2^2$$

where $M_r$ is the projection matrix of relation r, which projects entity vectors from the ed-dimensional entity space into the rd-dimensional relation space, and $\|\cdot\|_2$ denotes the L₂ norm, applied as L₂ regularization to prevent overfitting.

Based on this loss function, a margin-based objective function is defined as follows:

$$L = \sum_{(h,r,t)\in I}\ \sum_{(h',r,t')\in I'} \max\bigl(0,\ g(h,r,t) + \gamma - g(h',r,t')\bigr) \tag{7}$$

where I is the set of correctly predicted triplets and I′ is the set of wrongly predicted triplets. The margin γ is used to distinguish positive from negative samples: a correct triplet must score better than an incorrect one by at least the margin, which under formula (7) means g(h, r, t) + γ ≤ g(h′, r, t′). The present invention sets γ = 0.1.

For all relations contained in all paths, TransR constructs a relation vector r and the relation vector space M_r in which the relation lies, and learns the vector representations of the entities and relations. The solution is obtained by minimizing the margin-based objective function (7), each term of which reaches zero only when a correct triplet is separated from an incorrect one by the margin; the resulting optimal solution comprises all entity vectors, relation vectors and relation vector spaces.
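A minimal NumPy sketch of the TransR score and the margin objective (7) may make the roles of the relation vector, the projection matrix and γ concrete. It assumes the standard TransR score with the projection matrix of shape (rd, ed) acting on column vectors; the function names are illustrative, and no training loop or negative sampling is shown.

```python
import numpy as np


def transr_score(e_h, e_r, e_t, M_r):
    """g(h, r, t) = ||M_r e_h + e_r - M_r e_t||_2^2 (lower means more plausible)."""
    diff = M_r @ e_h + e_r - M_r @ e_t
    return float(diff @ diff)


def margin_loss(pos_scores, neg_scores, gamma=0.1):
    """Sum of max(0, g(pos) + gamma - g(neg)) over all positive/negative pairs."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.maximum(0.0, pos + gamma - neg).sum())
```

With γ = 0.1, a positive triplet scoring 0.0 against a negative scoring 0.04 contributes 0.06 to the loss, while a negative scoring 1.0 contributes nothing because the margin is already satisfied.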

⑥ S106, completing neighbor information aggregation and central entity representation using the graph attention network (GAT). First, a weighted aggregate representation of the neighbor information is obtained; then the neighbor information and the central entity information are aggregated to obtain a central entity vector that contains the neighbor information, which is the final vector representation of the organization. The step comprises: neighbor information representation based on information propagation, neighbor weight calculation based on the attention mechanism, and aggregation of neighbor information with central entity information.

First, neighbor information representation based on information propagation. In PSD-KG, with the organization as the central entity, the neighbor information on the multi-hop paths is propagated and aggregated along the paths from outside to inside, giving a vector representation of the multi-dimensional neighbor information, denoted $e_{N(h)}^{(ln-1)}$. The calculation formula is as follows:

$$e_{N(h)}^{(ln-1)} = \sum_{(h,r,t)\in N(h)} \pi(h,r,t)\, e_t^{(ln-1)}$$

where N(h) is the set of triplets within the multi-hop range of the central entity, $e_t^{(ln-1)}$ is the embedding vector of the tail node t in the triplet (h, r, t), and ln is the number of information propagation layers. The layers are numbered 1, 2, …, ln from outside to inside; layer ln connects the central entity with its 1-hop neighbors, so the neighbor information of layer ln-1 is what must be aggregated at this step. π(h, r, t) is the neighbor weight, which controls the amount of information that neighbor entity t propagates to entity h over relation r.

Second, neighbor weight calculation based on the attention mechanism. The neighbor weights distinguish how much different neighbors contribute to the vector representation of the central entity. The calculation formula is as follows:

$$\pi(h,r,t) = (M_r e_t)^{\top} \tanh\bigl(M_r e_h + e_r\bigr)$$

where tanh is the nonlinear activation function, and $e_r$ and $M_r$ are the vector of relation r generated in the initial embedding and the projection matrix of r, respectively. The projection matrix gives the embedding vectors of the head entity h and the tail entity t in the space of relation r, namely $M_r e_h$ and $M_r e_t$. The magnitude of π(h, r, t) depends on the semantic distance between the head and tail entities under relation r: entities that are closer propagate more information. To simplify the computation, the vector inner product is used. π(h, r, t) is then normalized with the softmax function to obtain the neighbor weights:

$$\pi(h,r,t) = \frac{\exp\bigl(\pi(h,r,t)\bigr)}{\sum_{(h,r',t')\in N(h)} \exp\bigl(\pi(h,r',t')\bigr)}$$

Third, entity information aggregation. The embedding propagation process iteratively aggregates the neighbor information on the paths, from outside to inside, into the central entity in order to update the vector representation of the central entity, denoted $e_h^{(ln)}$. The calculation formula is as follows:

$$e_h^{(ln)} = f\bigl(e_h^{(ln-1)},\ e_{N(h)}^{(ln-1)}\bigr)$$

where f(·) denotes the aggregator. The present invention adopts the GraphSage aggregator, which aggregates the central entity vector $e_h^{(ln-1)}$ and the neighbor information vector $e_{N(h)}^{(ln-1)}$ to update the vector representation of the central entity.
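The three sub-steps (attention score, softmax normalization, aggregation) can be sketched for a single propagation layer as follows. The attention score follows the description above; the concrete form of the GraphSage aggregator, a LeakyReLU applied to a weight matrix times the concatenation of the two vectors, is an assumption borrowed from common practice and is not stated in the text, as are the function names and the 0.01 negative slope.

```python
import numpy as np


def attention_weights(e_h, neighbors):
    """Softmax-normalized pi(h, r, t) for each neighbor.

    neighbors is a list of (e_r, M_r, e_t) with M_r of shape (rd, ed).
    """
    raw = np.array([(M_r @ e_t) @ np.tanh(M_r @ e_h + e_r)
                    for e_r, M_r, e_t in neighbors])
    raw = raw - raw.max()  # shift for numerical stability; softmax is unchanged
    w = np.exp(raw)
    return w / w.sum()


def aggregate(e_h, neighbors, W):
    """One propagation layer: weighted neighbor sum, then an assumed aggregator.

    W has shape (out_dim, 2 * ed).
    """
    pi = attention_weights(e_h, neighbors)
    e_N = sum(p * e_t for p, (_, _, e_t) in zip(pi, neighbors))
    z = W @ np.concatenate([e_h, e_N])
    return np.where(z > 0, z, 0.01 * z)  # LeakyReLU (assumed activation)
```

Replacing `attention_weights` with a uniform vector would give the average-aggregation baseline that the comparison experiments call AVE-TRM.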

⑦ S107, calculating, evaluating and comparing the recommendation results.

(1) Recommendation result calculation

Transaction recommendation is carried out with the Top-K method by calculating the cosine similarity between entity vectors. The similarity of organizations o_i and o_j is calculated as follows:

$$\mathrm{sim}(o_i, o_j) = \frac{\vec{o_i}\cdot\vec{o_j}}{\|\vec{o_i}\|\,\|\vec{o_j}\|}$$

where $\vec{o_i}$ and $\vec{o_j}$ are the vector representations of organizations o_i and o_j, and $\|\vec{o_i}\|$ and $\|\vec{o_j}\|$ are the norms of these vectors. The similarity values between organization o_i and the other organizations are ranked, and the first K organizations are returned according to the Top-K approach, giving the transaction partner recommendation result.

(2) Recommended result evaluation index and comparison experiment design

The effectiveness of the recommendation model matters for matching patent transaction parties accurately and reducing transaction search costs. To measure the recommendation performance of the PSD-KG-based model, the present invention follows existing research and adopts the Precision@K index for model evaluation, with the following formula:

$$\mathrm{Precision@}K = \frac{\sum_{o_i \in org} \left| R(o_i) \cap T(o_i) \right|}{\sum_{o_i \in org} \left| R(o_i) \right|}$$

where org is the set of organizations in the test set, R(o_i) is the set of the first K recommendation results for organization o_i, and T(o_i) is the set of organizations that have a transaction relation with o_i in the test set, that is, the actual transaction set.
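Top-K recommendation and its evaluation can be sketched together. The example ranks the other organizations by cosine similarity and then computes Precision@K as total hits over total recommendations across the test organizations, which is one common form and an assumption of this sketch. Function names are illustrative, vectors are assumed to be nonzero, and ties are broken by organization name only to make the output deterministic.

```python
import numpy as np


def top_k_partners(vectors, org, k):
    """Return the k organizations most cosine-similar to org.

    vectors maps an organization name to a nonzero 1-D array.
    """
    v = vectors[org]
    sims = []
    for other, u in vectors.items():
        if other == org:
            continue
        cos = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        sims.append((cos, other))
    sims.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
    return [other for _, other in sims[:k]]


def precision_at_k(recommended, actual, k):
    """Hits among the first k recommendations, pooled over all organizations."""
    hits = sum(len(set(recommended[o][:k]) & set(actual.get(o, ())))
               for o in recommended)
    total = sum(len(recommended[o][:k]) for o in recommended)
    return hits / total
```

When every organization receives exactly K recommendations, this pooled ratio equals the mean of the per-organization precision values.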

To select the optimal length of the multi-hop path in the model and to compare the model with other models, thereby testing the advantages of the proposed model, the following comparison experiments are designed.

Comparison experiment 1: different path lengths. In the model, the more hops there are, the longer the path and the more neighbor information it contains, but an overly long path can introduce noise. Therefore, several neighbor path lengths, pathnum = {3, 4, 5, 6, 7}, are set, and the optimal path length is determined by comparison.

Comparison experiment 2: average aggregation model (AVE-TRM). The core idea of the proposed recommendation model is to use the attention mechanism in GAT to aggregate neighbor information with weights. To verify that distinguishing neighbor weights is necessary, an average-aggregation experiment is set up in which all neighbors are treated as equally important; this is called the average aggregation model (AVE-TRM).

Comparison experiment 3: graph convolution aggregation model (GCN-TRM). A graph convolutional network contains no attention mechanism and computes weights from the degrees of the neighbors when aggregating neighbor information. To compare the effect of different weight calculation methods on the recommendation results, the model of the present invention is compared with the graph convolution aggregation model (GCN-TRM).

(3) Evaluation and comparison of recommended results

Based on multi-hop relations, the high-order paths of the central entities are acquired automatically and used as the input corpus of the PSD-KG-TRM model. Python is used to split the corpus randomly into a training set and a test set at a ratio of 8:2. The recommendation model parameters are set as follows: the entity embedding dimension and the relation embedding dimension are both 64, the initial learning rate is 0.001, the L2 regularization parameter is 1e-5, and the maximum number of iterations is 100. The calculated evaluation indices and the results of the comparison experiments are shown in Table 4.

Table 4 evaluation of recommended results (Top-K)

Note: P in P@K denotes precision.

Table 4 shows the following. With the PSD-KG-TRM model, recommending with different path lengths shows that precision is best when pathnum is 5, and that the recommendation precision (P@K, K = 5, 10, 15, 20) follows an inverted U shape as the path length increases. This indicates that a path that is too short cannot obtain enough neighbor information, while a path that is too long easily introduces noise and degrades the precision of the recommendation results.

Considering how the method of calculating neighbor weights in the central entity vector representation affects recommendation precision, it is found that the PSD-KG-TRM model proposed by the present invention achieves higher recommendation precision than the GCN-TRM model, and that the AVE-TRM model, which uses average aggregation, has the lowest recommendation precision. This shows, on the one hand, that distinguishing neighbor weights is necessary and, on the other hand, that introducing an attention mechanism into the recommendation model can improve the precision of the recommendation results.

Claims

1. A transaction recommendation method based on a patent supply and demand knowledge graph, characterized by comprising the following steps: (1) planning the entities and relations of the patent supply and demand knowledge graph, which comprise semantic entities and non-semantic entities, and semantic relations and non-semantic relations; (2) constructing a patent-domain dictionary, realizing semi-automatic corpus labeling, and identifying semantic entities based on a BERT-BiLSTM-CRF model; (3) extracting non-semantic entities, semantic relations and non-semantic relations by using crawler technology, word embedding technology and co-occurrence relations, and storing them in a Neo4j graph database to complete the construction and storage of the patent supply and demand knowledge graph; (4) in the patent supply and demand knowledge graph, automatically acquiring paths by using multi-hop relations, and completing the initial embedding of the paths with a TransR model to obtain initial embedding vectors of the entities and relations; (5) inputting the initial embedding vectors into a graph attention network, and aggregating neighbor information with weights to update the vector representation of the central entity, thereby obtaining the final vector representation of the central entity; (6) calculating the vector similarity between central entities, carrying out transaction recommendation based on the Top-K method, and returning the first K recommended organizations as the Top-K result for the central entity;

the detailed steps are as follows:

① S101, completing data acquisition and preprocessing;

Screening domain vocabulary and constructing a patent retrieval expression;

Retrieving patent information from a database, and mapping and collecting patent transfer and license information through the database; because the graph involves a large number of organizations, organizations that have participated in patent transactions more than 5 times are screened out as the sample data of the recommendation model, namely the central entities;

② S102, completing patent supply and demand knowledge graph planning; the entities and the relations contained in the traditional patent knowledge graph are combined and summarized, 6 types of entities and 6 types of relations in the patent supply and demand and transaction information are fused, a patent supply and demand knowledge graph (PSD-KG) consisting of 12 types of entities and 14 types of relations is constructed, and a patent knowledge base is further expanded; table 1 shows the entities and relationships contained in PSD-KG;

table 1 PSD-KG entity and relationship planning

"Extra" represents newly added entity and relationship

③ S103, completing automatic identification of technical points and technical efficacy based on a BERT-BiLSTM-CRF model, wherein the steps comprise: extracting a dictionary in the patent field, automatically labeling corpus, and automatically identifying semantic entities;

First, patent-domain dictionary extraction; first, domain terms are extracted automatically from the national standards of the patent domain to obtain a technical point seed dictionary; then, the technical efficacy words contained in the "technical efficacy TRIZ parameters" and "technical efficacy level 1" fields of the IncoPat database are screened automatically to obtain efficacy terms, which form a technical efficacy seed dictionary; dependency parsing is performed on the patent abstracts with the StanfordNLP tool, and words in a "compound" dependency relation are collected to construct a compound term table; finally, compound nouns that contain technical point or technical efficacy seed words are selected from the compound term table and merged with the seed dictionaries to obtain the technical point words and the technical efficacy words;

Second, automatic corpus labeling; automatic labeling is realized on the basis of the patent-domain dictionary; the BIESO labeling scheme is selected, and the two entity types, technical point and technical efficacy, are distinguished with the labels Technology and Effect; specifically: the patent abstracts are segmented into words with Python, the technical point and technical efficacy dictionaries are traversed, matched words are taken as entities, words that do not belong to any entity are labeled O, and labeling follows the scheme of Table 2;

TABLE 2 data tagging modes
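Dictionary-based BIESO labeling can be sketched as a longest-match scan over a tokenized abstract. The tag strings such as `B-Technology` are an assumed format, since the labeling table itself is not reproduced here; the input is assumed to be already segmented into tokens, and the dictionary entries in the usage note are invented examples.

```python
def bieso_tags(tokens, dictionary):
    """Label tokens with BIESO tags by longest dictionary match.

    dictionary maps a term (tuple of tokens) to "Technology" or "Effect".
    The "B-Technology" style of tag string is an assumption.
    """
    max_len = max((len(term) for term in dictionary), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first so longer terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = dictionary.get(tuple(tokens[i:i + n]))
            if label is not None:
                match = (n, label)
                break
        if match is None:
            i += 1
            continue
        n, label = match
        if n == 1:
            tags[i] = f"S-{label}"  # single-token entity
        else:
            tags[i] = f"B-{label}"  # begin
            for j in range(i + 1, i + n - 1):
                tags[j] = f"I-{label}"  # inside
            tags[i + n - 1] = f"E-{label}"  # end
        i += n
    return tags
```

With a dictionary containing both `("lithium", "ion", "battery")` and `("battery",)`, the three-token term is matched as a whole rather than tagging `battery` alone.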

Third, semantic entity identification; semantic entity identification based on the BERT-BiLSTM-CRF model is the key step in the semi-automatic construction of PSD-KG; the model has 3 layers: first, the labeled corpus is passed through the BERT layer to obtain the corresponding sequence vectors; then the sequence vectors are input into the BiLSTM layer to model the semantic features of the context; finally, the CRF layer decodes the output of the BiLSTM layer to obtain the predicted label sequence, and entity identification is completed by extracting and classifying each entity in the sequence;

(1) BERT layer:

On the basis of the domain dictionary, each sentence in the patent abstracts is labeled automatically in BIESO mode, and the [CLS] and [SEP] tokens are inserted at the beginning and the end of the sentence, respectively, to mark where it starts and ends; the sentences processed in this way are converted into the word sequence $W = (w_1, w_2, \ldots, w_{n-1}, w_n)$, where n is the total number of words in all patent texts; words, sentences and positions are then embedded through Token Embedding, Segment Embedding and Position Embedding, and after feature extraction by the Transformer, the sequence vectors $X = (x_1, x_2, \ldots, x_{n-1}, x_n)$ containing rich semantic features are obtained;

(2) BiLSTM layers:

The BiLSTM layer takes the sequence vectors obtained by the BERT layer as the input of each time step; for each time step, the hidden state sequences are concatenated by position to obtain the complete sequence, which lies in $\mathbb{R}^{n \times hd}$, where hd is the dimension of the hidden state sequence; the label score matrix is $L = (l_1, l_2, \ldots, l_n) \in \mathbb{R}^{n \times sn}$, where sn is the number of labels; with the label set $TAG = (tag_1, tag_2, \ldots, tag_{sn})$, each row $l_i = (l_{i1}, l_{i2}, \ldots, l_{isn})$ of the label score matrix L is obtained by training with the LSTM module provided in the TensorFlow library, where $l_{ij}$ is the score for labeling the semantic vector $x_i$ with label $tag_j$;

(3) CRF layer:

The CRF layer introduces a label transition probability matrix A to constrain the output labels, where $A_{y_i, y_j}$ is the probability of transitioning from label $y_i$ to label $y_j$; the label score matrix L is used as the state probability matrix, where $L_{i, y_i}$ is the probability that the semantic vector $x_i$ receives label $y_i$; for the word sequence $W = (w_1, w_2, \ldots, w_n)$, the score of a predicted label sequence $Y = (y_1, y_2, \ldots, y_n)$ is the sum of the transition probabilities and the state probabilities, with the following formulas:

$$s(W, Y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} L_{i, y_i} \tag{1}$$

$$p(\tilde{Y} \mid W) = \frac{\exp\bigl(s(W, \tilde{Y})\bigr)}{\sum_{Y' \in Y_W} \exp\bigl(s(W, Y')\bigr)} \tag{2}$$

In formula (2), $\tilde{Y}$ is the actual label sequence, $s(W, \tilde{Y})$ is the score of predicting the actual label sequence for the word sequence W, and $Y_W$ is the set of all possible label sequences;

The output sequence with the highest score, obtained by decoding with the Viterbi algorithm, is the optimal label sequence: $Y^{*} = \arg\max_{Y' \in Y_W} s(W, Y')$; finally, the entities are assembled according to the labels to complete entity identification;
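Viterbi decoding over a label score matrix and a transition matrix can be sketched in a few lines of NumPy. The sketch assumes the sequence score is the sum of the transition scores between consecutive labels and the per-position label scores, with no special start or end labels; the function name and argument layout are illustrative.

```python
import numpy as np


def viterbi(L, A):
    """Find the highest-scoring label sequence.

    L: (n, sn) per-position label scores; A: (sn, sn) with A[i, j] the
    score of moving from label i to label j.
    Returns (best_score, list_of_label_indices).
    """
    n, sn = L.shape
    score = L[0].copy()  # best score of any path ending in each label
    back = np.zeros((n, sn), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + A  # cand[prev, cur]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + L[i]
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):  # follow back-pointers to the start
        path.append(int(back[i, path[-1]]))
    return float(score.max()), path[::-1]
```

The dynamic program visits each position once and compares sn × sn transitions, so decoding costs O(n · sn²) instead of enumerating all sn^n label sequences.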

④ S104, constructing and storing a patent supply and demand knowledge graph; comprises 4 steps: non-semantic entity identification, semantic relation extraction, non-semantic relation extraction and Neo4j graph database-based storage;

First, non-semantic entity identification; (1) organization type entity identification: a keyword table for organization classification is constructed, dividing organizations into 6 classes: enterprises, universities, scientific research institutions, government agencies, individuals and financial institutions; Python is used to implement the type mapping; (2) organization city identification: the Baidu Maps API and the Google Maps API are called using JavaScript, and the city of each organization is found through fuzzy queries combined with manual searching; the other non-semantic entities are obtained from the structured data through regular expressions and crawler technology;

Second, semantic relation extraction; relation extraction comprises semantic relation extraction and non-semantic relation extraction, and the semantic relations comprise 2 types: the semantic similarity relation between technical points and the semantic similarity relation between technical efficacies;

(1) Constructing the technical point set $Tech = (tech_1, tech_2, \ldots, tech_m)$, where $tech_i$ is the i-th technical point word and m is the number of technical point words;

(2) Embedding the technical point words in the technical point set with the BERT model to obtain the technical point vector set $Tech\_vector = (T_1, T_2, \ldots, T_m)$;

(3) Calculating the technical point semantic similarity matrix; the similarities over the technical point vector set are calculated in Cartesian-product form to obtain the m × m semantic similarity matrix $M_1$:

$$M_1 = \begin{pmatrix} Tsim(T_1, T_1) & \cdots & Tsim(T_1, T_m) \\ \vdots & \ddots & \vdots \\ Tsim(T_m, T_1) & \cdots & Tsim(T_m, T_m) \end{pmatrix}$$

$Tsim(T_i, T_j)$ in $M_1$ is the semantic similarity of the technical point vectors $T_i$ and $T_j$, calculated as the cosine similarity of the vectors;

(4) Establishing the technical point similarity relations; for each technical point, the 20 technical points with the highest similarity values are selected, and similarity relations are constructed between that technical point and each of them;
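Steps (3) and (4) can be sketched as a matrix computation followed by a per-row selection. The embedding vectors are assumed to be given (for instance, produced beforehand by a BERT model) and nonzero; the function name and the edge format are illustrative, and self-similarity is excluded so that a technical point is never linked to itself.

```python
import numpy as np


def similarity_edges(terms, vectors, top_n=20):
    """Return (term_i, term_j, similarity) for each term's top_n nearest terms."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    M1 = X @ X.T  # m x m cosine similarity matrix
    np.fill_diagonal(M1, -np.inf)  # exclude self-similarity
    k = min(top_n, len(terms) - 1)
    edges = []
    for i, term in enumerate(terms):
        for j in np.argsort(-M1[i])[:k]:  # indices of the k largest entries
            edges.append((term, terms[j], float(M1[i, j])))
    return edges
```

The resulting edge list is directional: a term may appear in another term's top 20 without the reverse holding, so whether to store one or two relations per pair is a separate design decision.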

Third, non-semantic relation extraction; comprising 12 types: between organizations and patents, the application, possession, transfer-out, transfer-in, license-out and license-in relations; the membership of organization in city; the membership of organization in type; the membership of patent in technical field; the relation between patent and technical point; the relation between patent and technical efficacy; and the citation relation between patents;

The extraction steps are as follows:

(1) Extraction of transfer and license relations: when a patent has been transferred several times, the patent database merges the several transferors and transferees into single fields, so the organizations participating in each individual transfer cannot be determined accurately from the record; patent records with multiple transfers are therefore split; the processing method is as follows: for patents of other countries, the legal status field of the patent data is parsed with regular expressions to split the multiple transfers; for Chinese patents, the transfer and license records in the legal status are crawled one by one with Python;

After a single transfer record is obtained, Python is used to extract the transfer-out and transfer-in relations among the transferor, the transferee and the patent publication number, and the transfer time is stored as an attribute of these relations to distinguish multiple transfers; the license-out and license-in relations are extracted in the same way;

Application relation and possession relation between organization and patent: with the publication number as the intermediary, these are established between the patent publication number and the applicant, and between the patent publication number and the current patentee, respectively; membership of organization in type and of organization in city: established between the organization and its type, and between the organization and its region, respectively; membership of patent in field: the field of a patent is given by the first 4 characters of its IPC (International Patent Classification) code, that is, the IPC subclass; relation between patent and technical point and between patent and technical efficacy: established between the patent publication number and the technical point, and between the patent publication number and the technical efficacy, respectively; citation relation between patents: established between the publication numbers of the citing patent and the cited patent;

Fourth, storage based on the Neo4j graph database; after entity identification and relation extraction are completed, 12 types of entities and 14 types of relations are obtained and used to construct the patent supply and demand knowledge graph, and the results are shown in Table 3; a connection between Python and the Neo4j graph database is established with the py2neo library, and the entities and relations are stored in the Neo4j graph database;

TABLE 3 entity and relationship quantity

⑤ S105, completing multi-hop path acquisition for the central entity and initial embedding of the paths; first, the paths of the central entity are acquired automatically from the patent supply and demand knowledge graph by using multi-hop relations, and then the TransR model completes the initial embedding of the paths, yielding initial embedding vectors of the entities and relations;

First, multi-hop path acquisition for the central entity; in PSD-KG, an entity directly adjacent to the central entity is called a 1-hop entity, an entity directly adjacent to a 1-hop entity is called a 2-hop entity, and so on, giving the q-hop entities of the central entity, where q ≥ 2; the set of entities within the q-hop range is called the multi-hop neighbors, and a path between the central entity and a multi-hop neighbor is called a multi-hop path; starting from the central entity, a connection to the Neo4j graph database is established with the py2neo library for Python, a query statement is constructed in the Cypher language, and the multi-hop paths are obtained by applying a breadth-first search strategy;

Second, initial embedding based on the TransR model, which encodes the structural information of the paths and generates an initial embedding vector for each entity and relation; the embedding vectors are generated with the TransR model; for example, organization 1 and organization 2 are both of the enterprise type but are located in different cities; organization type and city are two different relations, and by introducing an organization type space and a city space, organization 1 and organization 2 are similar with respect to organization type but dissimilar with respect to city, that is, the comparison is made within a specific relation space;

The basic method is as follows: for each triplet (h, r, t), consisting of a head entity h, an inter-entity relation r and a tail entity t, TransR represents each type of relation with two components: a vector $e_r \in \mathbb{R}^{rd}$ representing the relation itself, and a second component used to construct the projection matrix $M_r \in \mathbb{R}^{rd \times ed}$, which represents the relation vector space in which the relation lies; ed and rd are the entity embedding dimension and the relation embedding dimension, respectively; $e_h, e_t \in \mathbb{R}^{ed}$ denote the vectors of the head entity and the tail entity in the entity space; first, for a specific relation r, the projection matrix $M_r$ is used to obtain the projection vectors of the head entity and the tail entity in the relation space, $e_h^r$ and $e_t^r$, where:

$$e_h^r = M_r e_h, \qquad e_t^r = M_r e_t$$

then, the embeddings of the entities and relations in the triplets (h, r, t) are learned iteratively; the learning process is as follows:

(1) For any given triplet (h, r, t), the loss function is defined as follows:

$$g(h, r, t) = \left\| M_r e_h + e_r - M_r e_t \right\|_2^2$$

where $\|\cdot\|_2$ denotes the L₂ norm, that is, the square root of the sum of the squares of the elements, applied as L₂ regularization to prevent overfitting;

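(2) Based on the loss function, the margin-based objective function is defined as follows:

$$L = \sum_{(h,r,t)\in I}\ \sum_{(h',r,t')\in I'} \max\bigl(0,\ g(h,r,t) + \gamma - g(h',r,t')\bigr) \tag{7}$$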

Wherein I is the set of correctly predicted triplets, I′ is the set of wrongly predicted triplets, (h′, r, t′) is a wrongly predicted triplet under relation r, h′ is the head entity of the wrong triplet and t′ is its tail entity; the margin γ is used to distinguish positive from negative samples: a correct triplet must score better than an incorrect one by at least the margin, which under formula (7) means g(h, r, t) + γ ≤ g(h′, r, t′); the invention takes γ = 0.1;

For all relations contained in all paths, TransR constructs a relation vector r and the relation vector space M_r in which the relation lies, and learns the vector representations of the entities and relations; the solution is obtained by minimizing the margin-based objective function (7), each term of which reaches zero only when a correct triplet is separated from an incorrect one by the margin, and the resulting optimal solution comprises: all entity vectors, relation vectors and relation vector spaces;

⑥ S106, completing neighbor information aggregation and representation of a central entity by using a graph attention network (GAT); firstly, obtaining weighted aggregation representation of neighbor information, and then aggregating the neighbor information and the central entity information to obtain a central entity vector containing the neighbor information, namely, final vector representation of organization; the method comprises the following steps: neighbor information representation based on information propagation, neighbor weight calculation based on an attention mechanism, neighbor information and central entity information aggregation;

First, neighbor information representation based on information propagation; in PSD-KG, with the organization as the central entity, the neighbor information on the multi-hop paths is propagated and aggregated along the paths from outside to inside, giving a vector representation of the multi-dimensional neighbor information, denoted $e_{N(h)}^{(ln-1)}$; the calculation formula is as follows:

$$e_{N(h)}^{(ln-1)} = \sum_{(h,r,t)\in N(h)} \pi(h,r,t)\, e_t^{(ln-1)}$$

Wherein N(h) is the set of triplets within the multi-hop range of the central entity, $e_t^{(ln-1)}$ is the embedding vector of the tail node t in the triplet (h, r, t), and ln is the number of information propagation layers; the layers are numbered 1, 2, …, ln from outside to inside, and layer ln connects the central entity with its 1-hop neighbors, so the neighbor information of layer ln-1 is what must be aggregated at this step; π(h, r, t) is the neighbor weight, which controls the amount of information that neighbor entity t propagates to entity h over relation r;

Second, neighbor weight calculation based on the attention mechanism; the neighbor weights distinguish how much different neighbors contribute to the vector representation of the central entity, and the calculation formula is as follows:

$$\pi(h,r,t) = (M_r e_t)^{\top} \tanh\bigl(M_r e_h + e_r\bigr)$$

Wherein tanh is the nonlinear activation function, and $e_r$ and $M_r$ are the vector of relation r generated in the initial embedding and the projection matrix of r, respectively; the projection matrix gives the embedding vectors of the head entity h and the tail entity t in the space of relation r, namely $M_r e_h$ and $M_r e_t$; the magnitude of π(h, r, t) depends on the semantic distance between the head and tail entities under relation r, and entities that are closer propagate more information; the vector inner product is used for the calculation; π(h, r, t) is normalized with the softmax function to obtain the neighbor weights;

Third, entity information aggregation; the embedding propagation process iteratively aggregates the neighbor information on the paths, from outside to inside, into the central entity in order to update the vector representation of the central entity, denoted $e_h^{(ln)}$; the calculation formula is as follows:

$$e_h^{(ln)} = f\bigl(e_h^{(ln-1)},\ e_{N(h)}^{(ln-1)}\bigr)$$

Wherein f(·) denotes the aggregator; the GraphSage aggregator is used to aggregate the central entity vector $e_h^{(ln-1)}$ and the neighbor information vector $e_{N(h)}^{(ln-1)}$ to update the vector representation of the central entity;

⑦ S107, calculating a recommendation result;

Calculating the cosine similarity between entity vectors, and carrying out transaction recommendation based on the Top-K method; the similarity of organizations o_i and o_j is calculated as follows:

$$\mathrm{sim}(o_i, o_j) = \frac{\vec{o_i}\cdot\vec{o_j}}{\|\vec{o_i}\|\,\|\vec{o_j}\|}$$

wherein $\vec{o_i}$ and $\vec{o_j}$ are the vector representations of organizations o_i and o_j, and $\|\vec{o_i}\|$ and $\|\vec{o_j}\|$ are the norms of these vectors; the similarity values between organization o_i and the other organizations are ranked, and the first K organizations are returned according to the Top-K approach, giving the transaction partner recommendation result.