CN113254659A

CN113254659A - File studying and judging method and system based on knowledge graph technology

Info

Publication number: CN113254659A
Application number: CN202110153678.3A
Authority: CN
Inventors: 衣秀; 张�成; 苏卫卫; 黄瑞; 杨文起
Original assignee: Tianjin Delta Technology Co ltd
Current assignee: Tianjin Delta Technology Co ltd
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2021-08-13

Abstract

The invention provides a file studying and judging method and system based on knowledge graph technology. The method comprises the following steps: file uploading; file entity extraction, in which person names, place names, organizations, colleges and universities, works, events, buildings and data entities are extracted; file relation extraction, in which the association between two entities is extracted to construct the edges of the archive knowledge graph; archive knowledge fusion, in which file entities extracted from different sources are fused into a unique file entity with comprehensive attributes; archive knowledge storage, in which the extracted archive entities and relation data are stored, for example in a graph database; file study and judgment analysis; and intelligent file retrieval. Supported by knowledge graph and machine learning technologies, the method makes full use of the constructed archive knowledge graph to provide associated recommendation of related file entities and accurate recommendation of related files, helping file studying and judging personnel to extend their reading and expanding the depth and breadth of file study and judgment.

Description

File studying and judging method and system based on knowledge graph technology

Technical Field

The invention belongs to the technical field of file management and study and judgment, and particularly relates to a file study and judgment method and system based on a knowledge graph technology.

Background

Archives are where the history of national development is preserved. They record past work and historical circumstances and serve as authentic evidence of history. With the development of informatization, digital archives have become a necessary construction target for archives at all levels. However, with national, social and economic development, traditional archive informatization services can no longer meet the growing practical demands placed on archives, and vigorously promoting intelligent archive construction has become an inevitable requirement of current social development.

In the past, analysis at the data statistics layer was performed mainly with multidimensional data analysis tools, data mining tools and the like. Intelligent archive construction, however, needs to mine the deeper meaning of archives, such as the relationship between one person and other persons, or between a relevant organization and certain events, so as to support correlation analysis and recommendation of the persons, cities, events, works and the like related to the object of file study and judgment.

Therefore, a knowledge graph technology-based file study and judgment method and system are needed, which combine with a deep learning technology to construct a file knowledge graph, assist a user in performing file association analysis and intelligent retrieval, and provide data support for file study and judgment decisions.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides a method and a system for studying and judging files based on the knowledge graph technology, wherein the method for studying and judging files comprises the following steps:

step 1: file uploading;

step 2: file entity extraction: extracting person names, place names, organizations, colleges and universities, works, events, buildings and data entities;

step 3: file relation extraction: extracting the association between two entities and constructing the edges of the archive knowledge graph;

step 4: archive knowledge fusion: fusing the file entities extracted from different sources to generate a unique file entity with comprehensive attributes;

step 5: archive knowledge storage: storing the extracted archive entities and relation data, which can be stored in a graph database;

step 6: file study and judgment analysis: this comprises file hot word analysis, file sensitive word analysis, file special word identification, automatic file classification and file clustering, and uses natural language processing and deep learning technologies to help file analysis and study personnel study and analyze file data efficiently;

step 7: intelligent file retrieval: this comprises search semantic keyword recommendation, related person recommendation, related organization recommendation, related event recommendation, related work recommendation and related policy recommendation; semantic association recommendation helps the user to extend reading and improves the depth and breadth of information search and analysis; the similarity is weighted-averaged according to the weight of each chapter to obtain the overall similarity of the document.

Preferably, the step 2 comprises the following steps:

step 21: vectorizing the corpus and inputting it into the network;

step 22: automatically extracting input features using the bidirectional LSTM;

step 23: performing sentence-level sequence labeling on the result of the previous layer using a CRF layer.

Preferably, the step 3 comprises the following steps:

step 31: analyzing the sentence content of the input file text, vectorizing the file text, and inputting it into the next layer of the network;

step 32: using the bidirectional LSTM to learn context information in the forward and backward directions;

step 33: using an Attention mechanism to select a group of abnormal candidate sets for analysis, generating a weight vector, and combining the vocabulary-level features of each iteration into sentence-level features by multiplying them by the weight vector;

step 34: sending the vector generated by the Attention layer into a softmax classifier to predict a label value, and selecting the label with the maximum probability as the predicted label.

Preferably, the step 6 comprises the following steps:

step 61: analyzing the file hot words;

step 62: extracting file keywords: extracting a group of words identifying the input file;

step 63: automatic file classification: automatically classifying the compiled and researched files;

step 64: association analysis of file persons: generating the events and organizations associated with a person.

Preferably, the file studying and judging system includes: a file uploading module, a file entity extraction module, a file relation extraction module, a file knowledge fusion module, a file knowledge storage module and a file studying and judging analysis module.

Preferably, the archive uploading module is used for uploading and collecting archive documents, and automatically indexing metadata such as titles, authors, keywords, classifications and abstracts of the archive documents by using a natural language processing technology; the archive entity extraction module is used for automatically extracting archive entities such as names of people, place names, organization names, events, buildings and data from archives; the archive relation extraction module is used for extracting the association relation between archive entities from the archive so as to form an archive knowledge graph; the archive knowledge fusion module is used for performing ambiguity elimination and reference resolution on archive entity attributes extracted from all archive documents to realize the combination and fusion of the archive entity attributes; the archive knowledge storage module is used for storing the extracted and constructed archive knowledge graph and providing data query of the archive knowledge graph; the file studying and judging analysis module is used for providing intelligent tools for file studying and judging analysis personnel, such as file hot word analysis, file sensitive word analysis, file special word identification, file automatic classification, file clustering, file character relation extraction and the like, and simultaneously providing file intelligent retrieval, integrating file knowledge maps and file full-text retrieval, and realizing accurate recommendation of file retrieval keywords, file entities and related files.

Compared with the prior art, the invention has the beneficial effects that:

1. Supported by knowledge graph and machine learning technologies, the constructed archive knowledge graph is fully utilized in archive search to provide associated recommendation of related archive entities and accurate recommendation of related archives, which helps archive studying and judging personnel to extend their reading and expands the depth and breadth of archive study and judgment.

2. The invention provides an intelligent file compiling and researching method which comprises a plurality of automatic methods of file hot word analysis, file sensitive word analysis, file special word identification, automatic file classification, file clustering and file entity association analysis, and assists a compiling and researching staff to extract high-quality file information, thereby greatly improving the working efficiency of file compiling and researching.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a BILSTM-CRF architecture diagram of the present invention;

FIG. 3 is a diagram of the bidirectional LSTM architecture based on the Attention mechanism of the present invention;

FIG. 4 is a flow chart of the knowledge-graph-based intelligent retrieval of the present invention;

FIG. 5 is a schematic diagram of an overall structure of the archive compiling and researching system according to the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

Example:

As shown in FIG. 1, a method for compiling and researching archives based on knowledge graph technology comprises the following steps:

step 1, uploading files:

the archive data mainly comprises archive catalogue data, archive full-text data, archive photo data, archive multimedia data, archive professional archive data, archive historical literature data and archive management data.

The user can upload archives with an upload button, by dragging and dropping, or by similar means. Archive metadata such as the archive name, author, time, classification, keywords, abstract and security level is generated automatically using technologies such as natural language processing, and the upload, query, modification and deletion operations on the data are logged, which greatly improves the efficiency of archive metadata entry.

Step 2, file entity extraction:

the archive entity mainly comprises entities of names of people, places, organizations, colleges and universities, works, events, buildings and data.

Archive entity extraction mainly uses pattern- or rule-based extraction methods, supervised learning methods based on sequence labeling, and supervised learning methods based on text classification. The supervised learning method based on text classification is taken as an example:

In this method, entity labeling is first performed on part of the archive data using the BIO labeling method, and a model is then built using machine learning and deep learning methods such as BILSTM-CRF, SVM and Bayes.

Taking the BILSTM-CRF method as an example, the input of archive named entity recognition is the word sequence $s = \langle w_1, w_2, \ldots, w_n \rangle$ corresponding to one sentence in the archive. With a supervised learning method, NER is a sequence labeling problem, and the archive corpus is annotated using the BIO labeling method, where B marks the start position of an archive entity, I marks the middle or end position of an archive entity, and O marks a token that is not part of any archive entity. For example, when person names are labeled, B-PERSON marks the start of a person name and I-PERSON marks the middle or end position of the person-name entity; likewise, B-LOC marks the start of a place name and I-LOC marks the middle or end position of the place-name entity.
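The BIO scheme described above can be illustrated with a minimal Python sketch. The function name `bio_tags`, the span format (token index ranges with an exclusive end) and the sample sentence are assumptions made for illustration; they do not come from the patent.

```python
def bio_tags(tokens, entities):
    """Return one BIO tag per token.

    entities: list of (start, end, label) token spans, end exclusive.
    This span format is an assumption made for the sketch.
    """
    tags = ["O"] * len(tokens)          # default: not part of any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # middle or end of the entity
    return tags


# Hypothetical sample sentence with one person name and one place name.
tokens = ["Alice", "Smith", "visited", "Tianjin"]
spans = [(0, 2, "PERSON"), (3, 4, "LOC")]
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```

Annotated sentences in this form are what a sequence labeling model such as BILSTM-CRF would be trained on.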

In the NER task, common deep networks include the convolutional neural network (CNN) and the recurrent neural network (RNN). The CNN is used for vector feature learning, while the RNN learns vector features and sequence labels at the same time. The long short-term memory network (LSTM), a variant of the RNN, is used as the example for modeling:

the BILSTM-CRF is a NER identification method based on deep learning, and as shown in figure 2, mainly comprises an input layer, a BILSTM layer, a CRF layer and an output layer.

The function of each layer is as follows:

step 21: an input layer:

The first layer is the input layer. After the archive text is input and segmented, a group of words is produced; the vector of each word is looked up in a pre-trained word2vec model, and the vectors are passed to the next layer.

Step 22: BILSTM layer:

The second layer is a bidirectional LSTM layer, which extracts input features automatically. It contains two LSTMs, one reading the sequence forward and one reading it in reverse, so that the model considers both the features extracted in the forward pass and those extracted in the backward pass, that is, past and future features. The input word vector sequence (w0, w1, w2, …, wn) serves as the input at each time step of the bidirectional LSTM, and the hidden state sequence (h1, h2, h3, …, hn) output by the forward LSTM and the hidden state sequence (r1, r2, r3, …, rn) output by the reverse LSTM are concatenated position by position to obtain the complete hidden state sequence;

A linear layer is then attached to reduce the hidden state vectors to dimension k, where k is the number of labels in the archive corpus, giving the automatically extracted sentence features, denoted C = (c1, c2, c3, …, ck). Each item cj is the score of classifying the word under the j-th label. If softmax were applied directly at this point, each position would be classified into k classes independently, so the labels already assigned to other positions could not be used; the result is therefore passed to a CRF layer for labeling;

step 23: CRF layer

The third layer is a CRF layer, which performs sentence-level sequence labeling on the result of the previous layer. Context information is learned by the BILSTM layer and output through the hidden layer, the score of each word for each class is fed into the CRF layer, and sequence labeling selects the sequence with the highest predicted score as the best answer;

The parameter of the CRF layer is a $(k+2) \times (k+2)$ matrix $A$, in which each entry $A_{ij}$ represents the transition score from the i-th tag to the j-th tag; the two extra dimensions arise because a start state is added at the beginning of the sentence and an end state at the end. If the tag sequence of a sentence $x$ is $y = (y_1, y_2, \ldots, y_n)$, the score the model assigns to this labeling is

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i}$$

where $P_{i, y_i}$ is the score output by the BILSTM layer for the i-th word under tag $y_i$, and $y_0$ and $y_{n+1}$ denote the added start and end states.

The score of the whole sequence is the sum of the scores at all positions, and the probability obtained by normalizing with softmax is:

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$

In model prediction, the optimal path is solved using the dynamic-programming Viterbi algorithm:

$$y^{*} = \arg\max_{y'} \mathrm{score}(x, y')$$
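As an illustration of the decoding step, the following pure-Python sketch runs Viterbi decoding over per-position tag scores and tag-to-tag transition scores. It is a simplified sketch under stated assumptions, not the patent's implementation: the start and end states are represented by two separate score vectors (`start`, `end`) instead of extra rows and columns of the transition matrix, and all numeric scores in the example are invented.

```python
def viterbi(emissions, transitions, start, end):
    """Find the highest-scoring tag path.

    emissions[i][t]   score of tag t at position i (e.g. from the BiLSTM layer)
    transitions[s][t] score of moving from tag s to tag t
    start[t], end[t]  score of starting / ending the sentence with tag t
    Returns (best_score, best_path) with tags given as integer indices.
    """
    n, k = len(emissions), len(emissions[0])
    score = [start[t] + emissions[0][t] for t in range(k)]
    back = []                                   # back-pointers per position
    for i in range(1, n):
        new_score, pointers = [], []
        for t in range(k):
            # best previous tag for ending at tag t in position i
            prev = max(range(k), key=lambda s: score[s] + transitions[s][t])
            pointers.append(prev)
            new_score.append(score[prev] + transitions[prev][t] + emissions[i][t])
        score = new_score
        back.append(pointers)
    final = [score[t] + end[t] for t in range(k)]
    last = max(range(k), key=lambda t: final[t])
    path = [last]
    for pointers in reversed(back):             # follow back-pointers
        path.append(pointers[path[-1]])
    path.reverse()
    return final[last], path


# Invented scores for three tags: 0 = O, 1 = B-PERSON, 2 = I-PERSON.
emissions = [[0.1, 2.0, 0.0], [0.0, 0.2, 1.5], [1.0, 0.0, 0.3]]
transitions = [[0.0, 0.0, -10.0],   # O -> I-PERSON is heavily penalised
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.5]]
start = [0.0, 0.0, -10.0]           # a sentence should not start with I-PERSON
end = [0.0, 0.0, 0.0]
best_score, best_path = viterbi(emissions, transitions, start, end)
print(best_path)
```

The transition penalties show why a CRF layer helps: an I- tag that does not follow a B- tag receives a low score even when its per-position score is high.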

After the optimal path is found, the model outputs its prediction as <word, label> pairs, for example:

King B-PERSON

Greeting I-PERSON

Bio O

Industry O

In O

Same as B-ORG

Ji I-ORG

Large I-ORG

Study I-ORG
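A predicted sequence of <word, label> pairs still has to be grouped into entity mentions before it can be used as knowledge graph nodes. The sketch below shows one way this could be done; the function name `extract_entities`, the choice to join tokens with a space, and the sample pairs are assumptions for illustration only.

```python
def extract_entities(pairs):
    """Group (word, BIO tag) pairs into (entity text, entity type) tuples."""
    entities, current, label = [], [], None
    for word, tag in pairs:
        if tag.startswith("B-"):
            if current:                       # close the previous entity
                entities.append((" ".join(current), label))
            current, label = [word], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(word)              # continue the open entity
        else:
            if current:                       # O tag or inconsistent I- tag
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


# Hypothetical model output for an English sentence.
pairs = [("Alice", "B-PERSON"), ("Smith", "I-PERSON"), ("graduated", "O"),
         ("from", "O"), ("Tongji", "B-ORG"), ("University", "I-ORG")]
print(extract_entities(pairs))
```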

Step 3: file relation extraction;

Archive relation extraction extracts the association between two entities and constructs the edges of the archive knowledge graph, generally as a triple <subject, predicate, object>, that is, an SPO structure;

Archive relation extraction mainly uses pattern- or rule-based extraction methods, supervised learning methods based on sequence labeling, and supervised learning methods based on text classification; the supervised learning method based on text classification is taken as an example;

A bidirectional LSTM neural network model based on the Attention mechanism is used to extract archive relations. The Attention mechanism automatically discovers the words that are important for classifying the archive relation, so that the model can capture the most important semantic information in a sentence.

As shown in FIG. 3, the model mainly includes four parts, namely an input layer, a BILSTM layer, an Attention layer and an output layer, and the function of each layer is as follows:

step 31: input layer

The first layer is an input layer, sentence content analysis is carried out on input archive texts, a group of words are generated, words are embedded, vectors are generated, and the vectors are transmitted to the next layer of the model.

Step 32: BILSTM layer:

The bidirectional LSTM comprises two LSTM networks, so context information can be learned in both the forward and reverse directions. The hidden state finally output at each step is determined by the current cell state and the output gate, and the output for the i-th word is:

$$h_i = \overrightarrow{h_i} \oplus \overleftarrow{h_i}$$

that is, the forward and reverse results are summed element-wise to produce the output of this layer;

step 33: attention layer

A weight vector is generated and the vocabulary-level features in each iteration are combined into sentence-level features by multiplication with this weight vector.

Denote the vectors that the LSTM layer passes to the Attention layer as $H = [h_1, h_2, \ldots, h_T]$, where $T$ is the sentence length.

The weight matrix of the Attention layer is obtained by the following formulas:

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}(w^{\top} M)$$

$$r = H \alpha^{\top}$$

where $H \in \mathbb{R}^{d^w \times T}$, $d^w$ is the dimension of the word vectors, $w$ is a parameter vector learned during training, and $w^{\top}$ is its transpose. The final sentence representation used for classification is calculated by the following formula:

$$h^{*} = \tanh(r)$$
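The attention computation (tanh of the hidden states, softmax over word positions, weighted sum, final tanh) can be traced with a small pure-Python sketch. The layout of `H` as a list of d rows with one column per word, the function name `attention_pool` and the sample numbers are assumptions for illustration; a real model would use a tensor library and a trained `w`.

```python
import math


def attention_pool(H, w):
    """Attention pooling over word positions.

    H: d x T matrix as a list of d rows (one column per word position).
    w: parameter vector of length d (would be learned during training).
    Returns (alpha, h_star): attention weights and sentence representation.
    """
    d, T = len(H), len(H[0])
    M = [[math.tanh(v) for v in row] for row in H]            # M = tanh(H)
    scores = [sum(w[j] * M[j][t] for j in range(d)) for t in range(T)]
    peak = max(scores)                                        # numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                         # softmax(w^T M)
    r = [sum(H[j][t] * alpha[t] for t in range(T)) for j in range(d)]
    h_star = [math.tanh(v) for v in r]                        # h* = tanh(r)
    return alpha, h_star


# Invented hidden states for a three-word sentence with d = 2.
H = [[0.5, 2.0, -1.0],
     [1.0, 0.0, 0.3]]
w = [1.0, 0.5]
alpha, h_star = attention_pool(H, w)
print([round(a, 3) for a in alpha])
```

The position with the largest weight in `alpha` is the word the model treats as most important for the relation classification.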

With the attention mechanism, the output of each step of the encoder layer is calculated together with the current output, and a softmax function is then used to generate a probability value. The intermediate outputs of the LSTM encoder over the input sequence are retained, and the model is trained to learn selectively from these inputs and to associate them with the output sequence as the output is produced.

Step 34: output layer

The vector generated by the Attention layer is sent to a softmax classifier to predict the label value, the label with the maximum probability is selected as the predicted label, and a triple is finally generated;

To avoid overfitting, dropout is used in the forward computation of the network. Dropout means that, during forward propagation, the activation of a neuron is disabled with a certain probability p, so that any two neurons do not necessarily appear together in the network each time. Weight updates therefore no longer depend on the joint action of hidden nodes with fixed relationships, which forces the network to learn more robust features and gives the model stronger generalization capability.
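The dropout behaviour can be sketched in a few lines of Python. The rescaling of surviving activations by 1 / (1 - p), often called inverted dropout, is an assumption added here so that the expected activation is unchanged; the patent text does not state whether rescaling is used. The function name and the sample values are likewise illustrative.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training.

    Survivors are scaled by 1 / (1 - p) (inverted dropout, an assumption
    of this sketch). At inference time the input is returned unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)                      # fixed seed for repeatability
hidden = [1.0] * 1000
dropped = dropout(hidden, 0.5, rng)
print(sum(1 for v in dropped if v == 0.0), "of", len(hidden), "units dropped")
```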

Step 4: archive knowledge fusion

The extracted archive entities from different sources are fused to generate a unique archive entity with comprehensive attributes;

Archive knowledge fusion is mainly divided into two types. The first is fusion of multiple archive knowledge graphs: each group of archive data sources independently builds its own archive knowledge graph, and the graphs are then fused. The second is fusion of multiple different data sources: archive knowledge is obtained by analyzing each data source, and all of the archive knowledge is then merged into one knowledge graph.

Taking the fusion of multiple different data sources as an example, the archive knowledge fusion process is divided into two parts: archive entity linking and archive knowledge merging. Archive entity linking links the extracted archive entities to existing archive entity objects, and archive knowledge merging merges the newly linked archive entities into the existing archive entities.

Step 41: file physical linking

The archive entity link is to extract an archive entity object from an archive text and link the archive entity object to an existing archive entity object in an archive knowledge base, and the processing flow is as follows:

extracting a file entity object from a file text;

performing the operations of reference resolution and entity disambiguation, and judging whether the meanings of the file entities with the same name in the file knowledge base and the currently extracted file entities are consistent;

if the selected archive entity is consistent with the archive entity in the archive repository, the extracted archive entity is linked to the archive entity in the archive repository.

And the index resolution is used for solving the problem that a plurality of names correspond to the same archive entity object, and the index resolution can be related to the correct archive entity.

Entity disambiguation solves the ambiguity problem of archive entities with the same name. Suppose the attributes of two archive entities are denoted x and y, with values $x_i$ and $y_i$ on the i-th attribute; the similarity of the archive entities can then be obtained by accumulating the similarities of the individual attributes:

$$[\mathrm{sim}(x_1, y_1), \mathrm{sim}(x_2, y_2), \ldots, \mathrm{sim}(x_n, y_n)]$$

Cosine similarity is used to calculate attribute similarity:

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

The larger the cosine of the included angle, the higher the similarity of the two archive entities;

For example, if an archive person entity is extracted that contains a name, ID card number, mobile phone number and age, the archive knowledge base is first searched for the list of archive entities with the same name, the similarity to each is calculated using the cosine of the included angle, and the archive entity with the largest similarity value above the threshold is selected and linked to; otherwise the extracted entity is treated as a new archive entity.
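The linking procedure in this example can be sketched as follows. Several details are assumptions made only so that the sketch runs: each attribute value is turned into a vector of character-bigram counts, the per-attribute cosine similarities are averaged into one entity similarity, the threshold is 0.8, and all records are invented. The patent does not specify how attribute values are vectorised or combined.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def attr_vector(value):
    """Character-bigram counts of an attribute value (illustrative choice)."""
    text = str(value)
    return Counter(text[i:i + 2] for i in range(max(len(text) - 1, 1)))


def entity_similarity(x, y):
    """Average cosine similarity over the attributes both entities have."""
    shared = [k for k in x if k in y]
    if not shared:
        return 0.0
    sims = [cosine(attr_vector(x[k]), attr_vector(y[k])) for k in shared]
    return sum(sims) / len(sims)


def link_entity(extracted, knowledge_base, threshold=0.8):
    """Return the best same-name match above the threshold, else None."""
    best, best_sim = None, 0.0
    for candidate in knowledge_base:
        if candidate["name"] != extracted["name"]:
            continue                      # only same-name entities are candidates
        sim = entity_similarity(extracted, candidate)
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best if best_sim > threshold else None


# Invented knowledge-base records with the same name.
kb = [
    {"name": "Li Wei", "id_number": "ID-0001-1990", "phone": "555-0101", "age": 34},
    {"name": "Li Wei", "id_number": "ID-7342-1985", "phone": "555-0199", "age": 40},
]
mention = {"name": "Li Wei", "id_number": "ID-0001-1990", "phone": "555-0101", "age": 34}
print(link_entity(mention, kb))
```

A return value of `None` corresponds to the case in which the extracted entity is treated as a new archive entity.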

Step 42: archival knowledge consolidation

The newly extracted archive entity is merged with the archive entity in the archive knowledge base to which it has been linked, so that the archive knowledge graph becomes more complete. For example, the attributes of linked archive person entities are fused, and newly extracted attributes, as well as attributes with a more complete description, are added to the archive entity attributes in the archive knowledge base.
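A minimal sketch of the merging step is shown below. It assumes that entities are plain dictionaries and that a longer string value counts as the "more complete description"; both are simplifications chosen for illustration, and a real system would need attribute-specific rules.

```python
def merge_entity(existing, extracted):
    """Merge attributes of a newly linked entity into the stored entity.

    New attributes are added; an existing attribute is replaced only when
    the extracted value is longer (assumed here to mean more complete).
    """
    merged = dict(existing)
    for key, value in extracted.items():
        if value in (None, ""):
            continue                       # nothing useful to merge
        old = merged.get(key)
        if old in (None, "") or len(str(value)) > len(str(old)):
            merged[key] = value
    return merged


# Invented records for illustration.
stored = {"name": "Li Wei", "employer": "Tongji"}
new = {"name": "Li Wei", "employer": "Tongji University", "birthplace": "Tianjin"}
print(merge_entity(stored, new))
```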

Step 5: archive knowledge storage;

The extracted archive entities and relation data are stored. Archive knowledge graph data can be stored in two modes; taking the graph-model-based storage mode as an example, the knowledge data is stored in a graph database.

Taking the Neo4j graph database as an example for storing the extracted archive entities and relation data: Neo4j is a native graph database engine with its own storage structure, index-free adjacency storage of neighboring nodes, and corresponding graph traversal algorithms, which give it very high query performance; the graph data structure and its unstructured data format give Neo4j database designs great flexibility.
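As a hedged illustration of how an extracted triple could be written to Neo4j, the sketch below builds a parameterised Cypher `MERGE` statement in Python. The node labels (`Person`, `Organization`), the relationship type `GRADUATED_FROM`, the `name` property and the helper name `triple_to_cypher` are hypothetical; executing the statement would additionally require a running Neo4j instance and a driver session, which are not shown.

```python
import re

# Labels and relationship types cannot be passed as Cypher parameters,
# so they are validated before being placed into the query text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def triple_to_cypher(subject, predicate, obj):
    """Build a MERGE statement for one <subject, predicate, object> triple.

    subject / obj: (label, name) tuples; predicate: relationship type.
    Returns (query, parameters).
    """
    (s_label, s_name), (o_label, o_name) = subject, obj
    for ident in (s_label, predicate, o_label):
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    query = (
        f"MERGE (s:{s_label} {{name: $s_name}}) "
        f"MERGE (o:{o_label} {{name: $o_name}}) "
        f"MERGE (s)-[:{predicate}]->(o)"
    )
    return query, {"s_name": s_name, "o_name": o_name}


query, params = triple_to_cypher(
    ("Person", "Li Wei"), "GRADUATED_FROM", ("Organization", "Tongji University")
)
print(query)
print(params)
```

`MERGE` is used instead of `CREATE` so that storing the same entity or relation twice does not produce duplicate nodes or edges.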

Step 6: analyzing the file;

Massive archive data is mined by means of machine learning, natural language processing and the constructed archive knowledge graph. To meet the information needs of different research users, intelligent analysis technologies such as text information extraction, text classification, text clustering, automatic abstract extraction, archive person relation analysis and intelligent retrieval are used to extract the key information of archive texts automatically and generate high-value archive information. This effectively improves the working efficiency of research personnel and provides data and tool support for compiling high-quality research reports.

Step 61: archival hotword analysis

A method for analyzing file hot words is based on the text document data of the file, and by mining and analyzing the full-text data of the file, a hot word list corresponding to the currently input file data is generated and displayed in a list or word cloud form.

Archival hotword analysis is implemented using the tf-idf method.

The tf formula is as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where the numerator $n_{i,j}$ is the number of occurrences of word $i$ in the input archive text $j$, and the denominator is the total number of words in that text.

The idf formula is as follows:

$$idf_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

where the numerator $|D|$ is the total number of documents in the input archive corpus and the denominator is the number of documents containing the word $t_i$. If the word appears in none of the documents this count would be zero, so 1 is added to the denominator.

The product of tf and idf is calculated as follows:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$

If a word has a high frequency in one document and rarely appears in other documents, it is a highly discriminative word. Each word is scored in turn, and a top-N hot word list is finally generated, which can be presented to the user as a word cloud;
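The hot word calculation can be sketched in pure Python as follows. The sketch assumes that documents are already segmented into word lists and uses the natural logarithm with 1 added to the document count in the denominator, as described above; the function name `hot_words` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def hot_words(documents, doc_index, top_n=3):
    """Return the top_n words of one document ranked by tf-idf."""
    doc = documents[doc_index]
    counts = Counter(doc)
    total = len(doc)
    scores = {}
    for word, n in counts.items():
        tf = n / total                                   # term frequency
        df = sum(1 for d in documents if word in d)      # document frequency
        idf = math.log(len(documents) / (1 + df))        # smoothed idf
        scores[word] = tf * idf
    # highest score first; ties broken alphabetically for a stable result
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Toy corpus of pre-segmented documents.
docs = [
    ["archive", "graph", "graph", "entity"],
    ["archive", "record"],
    ["archive", "file"],
    ["archive", "index"],
]
print(hot_words(docs, 0, top_n=2))
```

In this corpus the word "archive" occurs in every document, so its idf is negative and it ranks last, while "graph" is frequent in the first document and absent elsewhere.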

step 62: archive keyword extraction

For an input file text, a group of words capable of identifying the file text is extracted and widely used in file keyword indexing and file information retrieval.

tf-idf method

tf-idf is a commonly used method of scoring words in a text. The tf-idf value of a word depends on two factors: the word's frequency and the word's importance. The top-ranked words obtained by this calculation form the keyword set.

TextRank method

The TextRank method is derived from PageRank. It considers that adjacent words in a document or sentence influence one another's importance, and thereby introduces word-order information:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

where $V_i$ denotes the word whose weight is to be calculated, $S(V_j)$ is the weight of word $V_j$, $d$ is the damping coefficient, $In(V_i)$ is the set of words in the same window as $V_i$, $Out(V_j)$ is the set of words in the same window as $V_j$, and $|\cdot|$ denotes the number of words in a set;

TextRank initializes the weight of each word and then updates the weights according to the formula until convergence; the words finally selected are those that best reflect the whole archive or sentence.
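The TextRank update can be sketched as below. The sketch makes several simplifying assumptions: words in the same window are linked by undirected edges, a fixed number of iterations replaces an explicit convergence test, and the damping coefficient defaults to 0.85, a commonly used value that the patent does not specify.

```python
def textrank(words, window=2, d=0.85, iterations=50):
    """Score words by iterating the TextRank update on a co-occurrence graph."""
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        # link each word to the words that follow it within the window
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    scores = {w: 1.0 for w in neighbors}                 # initial weights
    for _ in range(iterations):
        scores = {
            w: (1 - d) + d * sum(scores[v] / len(neighbors[v])
                                 for v in neighbors[w])
            for w in neighbors
        }
    return scores


# Toy word sequence in which "archive" co-occurs with every other word.
sequence = ["archive", "graph", "archive", "entity", "archive", "index"]
scores = textrank(sequence, window=1)
print(max(scores, key=scores.get))
```

Sorting the returned dictionary by score and keeping the first few entries gives the keyword set.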

Step 63: automatic file classification

An automatic archive classification function is provided, which classifies compiled and researched archives of unknown category automatically. Two methods are provided: rule-based automatic classification and automatic classification based on machine learning.

1. Rule-based automatic classification

According to the actual analysis scenario, the user supplies rules in the form of a rule file, and the system automatically classifies archives of unknown category according to these rules. The rule file contains the classification categories together with the word list and weights corresponding to each category. After word segmentation and stop-word removal are applied to the input archive data, position matching and cumulative weighted calculation are performed against the word list of each category in the rule file in turn. Finally, the categories and probabilities for the whole input text are given, and the user can specify that the top-N categories with the highest probability be returned together with their probabilities.
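A simplified sketch of the rule-based classifier follows. It assumes the rule file has already been loaded into a dictionary mapping each category to a word-weight table, and it converts the accumulated weights into probabilities by dividing by their sum; the position matching mentioned above is not modelled. The categories, words and weights are invented.

```python
def classify_by_rules(tokens, rules, stop_words=frozenset(), top_n=1):
    """Rank categories by the accumulated weight of matching rule words.

    rules: {category: {word: weight}} (assumed in-memory form of the rule file).
    Returns a list of (category, probability) for the top_n categories.
    """
    words = [t for t in tokens if t not in stop_words]   # stop-word removal
    raw = {cat: sum(weights.get(w, 0.0) for w in words)
           for cat, weights in rules.items()}
    total = sum(raw.values())
    if total == 0:
        return []                                        # no rule matched
    ranked = sorted(raw.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(cat, score / total) for cat, score in ranked[:top_n]]


# Invented rule table and segmented input text.
rules = {
    "personnel": {"appointment": 2.0, "transfer": 1.0},
    "finance": {"budget": 2.0, "audit": 1.0},
}
tokens = ["the", "appointment", "and", "transfer", "budget"]
print(classify_by_rules(tokens, rules, stop_words={"the", "and"}, top_n=2))
```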

2. Automatic classification based on machine learning

The user provides a training corpus, and the system uses machine learning algorithms to learn the classification rules in the corpus automatically and build an automatic archive classification model, which then classifies archives of unknown category. A series of machine learning classification methods is provided. Bayesian classification, for example, computes the probability with the two events exchanged (the posterior probability) from the probability under a known condition; it is a generative-model classifier that adopts the assumption of attribute conditional independence. A support vector machine uses a kernel function to map samples that are linearly inseparable in a low-dimensional space into a high-dimensional space where they are linearly separable.

A machine learning classification model training function is provided: a specified archive topic data set is input, classification labels are assigned, and the classification model is trained.

A machine learning classification model evaluation function is provided: a specified topic data set (with classification labels) is input and a model evaluation result is generated.

The classification model is evaluated using precision, recall and $F_1$-score, as follows:

$$precision = \frac{tp}{tp + fp}, \quad recall = \frac{tp}{tp + fn}, \quad F_1 = \frac{2 \times precision \times recall}{precision + recall}$$

where tp is the number of samples correctly predicted as positive, fp is the number of samples wrongly predicted as positive, and fn is the number of samples wrongly predicted as negative;

After the evaluation of the classification model meets the required standard, automatic classification labels can be applied to archive texts, and label distribution statistics are supported;
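The three evaluation measures can be computed directly from the tp, fp and fn counts. The sketch below returns 0.0 when a denominator is zero, which is a convention chosen for this example; the counts are invented.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented evaluation counts for one category.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```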

Step 64: archive person relation analysis

Archive entity information such as persons, organizations and events is extracted during archive entity extraction. In archive person relation analysis, the target person to be analyzed is input, and the events and organizations related to that person are queried from the archive knowledge graph and shown to the user as a semantic network.
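The person relation query can be illustrated on an in-memory list of triples. In practice the query would run against the graph database; the triples, the entity type table and the function name `related_to` below are hypothetical and only show the shape of the lookup.

```python
def related_to(person, triples, entity_types, wanted=("Event", "Organization")):
    """Collect events and organizations connected to a person.

    triples: list of (subject, predicate, object).
    entity_types: {entity name: entity type}.
    Returns {type: [(predicate, entity), ...]} for the wanted types.
    """
    result = {kind: [] for kind in wanted}
    for subj, pred, obj in triples:
        if subj == person:
            other = obj
        elif obj == person:
            other = subj
        else:
            continue                              # triple does not involve the person
        kind = entity_types.get(other)
        if kind in result and (pred, other) not in result[kind]:
            result[kind].append((pred, other))
    return result


# Invented knowledge graph fragment.
triples = [
    ("Li Wei", "GRADUATED_FROM", "Tongji University"),
    ("Li Wei", "ATTENDED", "1998 Archive Conference"),
    ("Zhang San", "ATTENDED", "1998 Archive Conference"),
]
types = {"Tongji University": "Organization",
         "1998 Archive Conference": "Event",
         "Li Wei": "Person", "Zhang San": "Person"}
print(related_to("Li Wei", triples, types))
```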

Step 7: intelligent file retrieval.

Retrieval is an information search initiated by the user: the user submits a query request, the system searches for matching content after receiving the request, and the query results are sorted and returned to the user. On top of traditional full-text retrieval, knowledge-graph-based intelligent archive retrieval uses semantic association recommendation to help the user extend reading and to improve the depth and breadth of information search and analysis.

As shown in fig. 4, the intelligent search based on the archival knowledge graph mainly comprises the following steps:

1. user intention analysis: identifying the user's target entity from the query submitted by the user, and generating a query condition for the target entity;

2. target query: using a query statement built from the query condition to search the knowledge graph for the target entity and its related content;

3. result display: if the target entity is not unique, the results need to be sorted;

4. target entity exploration: after the target entity is returned, displaying the entities that have an association relation with the target entity as expanded search results.
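The four steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the knowledge graph is a plain dictionary rather than a graph database, the target entity is identified by substring matching rather than by a trained model, and results are ordered by the number of linked documents. None of these choices, nor the names `SAMPLE_GRAPH`, `identify_targets` and `search`, are specified by the invention.

```python
from typing import Any, Dict, List

# Hypothetical knowledge graph: entity name -> linked documents and neighbours.
SAMPLE_GRAPH: Dict[str, Dict[str, Any]] = {
    "Zhang San": {
        "docs": ["archive-001"],
        "neighbors": [("works_at", "Example University")],
    },
    "Example University": {
        "docs": ["archive-001", "archive-002"],
        "neighbors": [("employs", "Zhang San")],
    },
}


def identify_targets(query: str, graph: Dict[str, Dict[str, Any]]) -> List[str]:
    """Step 1: treat every known entity whose name occurs in the query as a target."""
    return [name for name in graph if name in query]


def search(query: str, graph: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run the four retrieval steps and return one result record per target."""
    targets = identify_targets(query, graph)
    # Step 3: when several targets match, sort them (assumed criterion:
    # more linked documents first, ties broken by name).
    targets.sort(key=lambda name: (-len(graph[name]["docs"]), name))
    results = []
    for name in targets:
        node = graph[name]
        results.append(
            {
                "entity": name,
                "documents": list(node["docs"]),     # Step 2: related content
                "related": list(node["neighbors"]),  # Step 4: exploration
            }
        )
    return results


if __name__ == "__main__":
    for hit in search("Zhang San at Example University", SAMPLE_GRAPH):
        print(hit)
```

In a deployed system the dictionary lookups would be replaced by query statements against the graph database, and the sort key by whatever relevance score the retrieval service provides.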

Specifically, as shown in fig. 5, the file studying and judging system includes: an archive uploading module, an archive entity extraction module, an archive relation extraction module, an archive knowledge fusion module, an archive knowledge storage module and an archive studying and judging analysis module. The archive uploading module is used for uploading and collecting archive documents and for automatically indexing metadata such as document titles, authors, keywords, classifications and abstracts by using natural language processing technology; the archive entity extraction module is used for automatically extracting archive entities such as person names, place names, organization names, events, buildings and data from the archives; the archive relation extraction module is used for extracting the association relations between archive entities from the archives so as to form the archive knowledge graph; the archive knowledge fusion module is used for performing disambiguation and coreference resolution on the archive entity attributes extracted from all archive documents, so as to merge and fuse the archive entity attributes; the archive knowledge storage module is used for storing the extracted and constructed archive knowledge graph and for providing data query over the archive knowledge graph; the archive studying and judging analysis module is used for providing analysts with intelligent tools such as archive hot word analysis, archive sensitive word analysis, archive special word identification, automatic archive classification, archive clustering and archive person relationship extraction, and also provides intelligent archive retrieval, which integrates the archive knowledge graph with full-text archive retrieval to achieve accurate recommendation of retrieval keywords, archive entities and related archives.

The technical solutions of the present invention, and similar technical solutions designed by those skilled in the art based on the teachings of the present invention, all fall within the protection scope of the present invention.

Claims

1. A file studying and judging method and system based on knowledge graph technology, characterized in that the file studying and judging method comprises the following steps:

step 1, uploading files;

step 6: archive studying and judging analysis: including archive hot word analysis, archive sensitive word analysis, archive special word identification, automatic archive classification and archive clustering, which, based on natural language processing technology and deep learning technology, assist the studying and judging personnel in analyzing archive data efficiently;

2. The method and system for studying and judging archives based on knowledge-graph technology as claimed in claim 1, wherein said step 2 comprises the steps of:

step 21, vectorizing the corpus and inputting the corpus into a network;

3. The method and system for studying and judging archives based on knowledge-graph technology as claimed in claim 1, wherein said step 3 comprises the steps of:

4. The method and system for studying and judging archives based on the knowledge-graph technology as claimed in claim 1, wherein said step 6 comprises the steps of:

step 41: analyzing the file hot words;

step 42: extracting file keywords: extracting a group of words identifying the input file;

step 43: automatic file classification: automatically classifying the position compiling and researching files;

step 44: archive person association analysis: outputting the events and organizations associated with the person.

5. The system and method of claim 1, wherein the system comprises: the system comprises a file uploading module, a file entity extraction module, a file relation extraction module, a file knowledge fusion module, a file knowledge storage module and a file studying and judging analysis module.

6. The method and system for studying and judging archives based on the knowledge-graph technology as claimed in claim 5, wherein the archive uploading module is used for uploading and collecting archive documents and for automatically indexing metadata such as document titles, authors, keywords, classifications and abstracts by using natural language processing technology; the archive entity extraction module is used for automatically extracting archive entities such as person names, place names, organization names, events, buildings and data from the archives; the archive relation extraction module is used for extracting the association relations between archive entities from the archives so as to form the archive knowledge graph; the archive knowledge fusion module is used for performing disambiguation and coreference resolution on the archive entity attributes extracted from all archive documents, so as to merge and fuse the archive entity attributes; the archive knowledge storage module is used for storing the extracted and constructed archive knowledge graph and for providing data query over the archive knowledge graph; the archive studying and judging analysis module is used for providing analysts with intelligent tools such as archive hot word analysis, archive sensitive word analysis, archive special word identification, automatic archive classification, archive clustering and archive person relationship extraction, and also provides intelligent archive retrieval, which integrates the archive knowledge graph with full-text archive retrieval to achieve accurate recommendation of retrieval keywords, archive entities and related archives.