CN114896404A - Document classification method and device

Info

Publication number
CN114896404A
Authority
CN
China
Prior art keywords
text
document
category
word
processed
Prior art date
Legal status
Pending
Application number
CN202210576341.8A
Other languages
Chinese (zh)
Inventor
王得贤 (Wang Dexian)
李长亮 (Li Changliang)
Current Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co., Ltd.
Beijing Kingsoft Digital Entertainment Co., Ltd.
Original Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co., Ltd.
Beijing Kingsoft Digital Entertainment Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Chengdu Kingsoft Interactive Entertainment Technology Co., Ltd. and Beijing Kingsoft Digital Entertainment Co., Ltd.
Priority: CN202210576341.8A
Publication: CN114896404A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data; database structures therefor; file system structures therefor
    • G06F 16/35 Clustering; Classification
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/19 Recognition using electronic means
    • G06V 30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V 30/19173 Classification techniques


Abstract

The application provides a document classification method and apparatus. The document classification method comprises: segmenting a document to be processed to obtain a plurality of texts; inputting each text into a feature extraction model to determine its category feature; combining the category features of the texts to obtain a category feature vector of the document to be processed; and inputting the category feature vector into a classification model to determine the category of the document to be processed. Because the document is first divided into shorter texts, the method is suitable for long documents. The category feature vector fuses the category information of the document's full text: it reflects both the category features of each part of the document's content and the associations among those parts. Inputting it into the classification model therefore provides the model with more information, makes the classification result more accurate, and improves the accuracy of document classification.

Description

Document classification method and device
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for classifying documents.
Background
Document classification intelligently identifies a document to determine its category, or to judge whether it belongs to a target category. In the prior art, a deep learning method based on text interception is generally adopted: for a long document, for example one with more than 3000 words, a portion of text is intercepted from the front or middle of the document, and the intercepted portion is classified by a neural network model such as an LSTM (Long Short-Term Memory network) or a CNN (Convolutional Neural Network) to determine the category of the input document.
However, because the document is long, it cannot be input into the neural network model in its entirety, and intercepting only part of its text loses text information, which affects the accuracy of document classification. A document classification method that solves these problems is therefore needed.
Disclosure of Invention
In view of this, the present application provides a document classification method to overcome the technical defects in the prior art. Embodiments of the application also provide a document classification apparatus, a computing device, and a computer-readable storage medium.
According to a first aspect of embodiments of the present application, there is provided a document classification method, including:
segmenting a document to be processed to obtain a plurality of texts;
respectively inputting the texts into a feature extraction model, and determining the category feature of each text;
combining the category characteristics of the texts to obtain a category characteristic vector of the document to be processed;
inputting the category feature vector into a classification model, and determining the category of the document to be processed.
According to a second aspect of embodiments of the present application, there is provided a document classification apparatus including:
the segmentation module is configured to segment the document to be processed to obtain a plurality of texts;
the first determination module is configured to input the texts into a feature extraction model respectively and determine the category feature of each text;
the combination module is configured to combine the category characteristics of the texts to obtain a category characteristic vector of the document to be processed;
and the second determination module is configured to input the category feature vector into a classification model and determine the category of the document to be processed.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions which, when executed by the processor, implement the steps of the document classification method.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the document classification method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the document classification method.
The document classification method provided by the application comprises: segmenting a document to be processed to obtain a plurality of texts; inputting each text into a feature extraction model to determine its category feature; combining the category features of the texts to obtain a category feature vector of the document to be processed; and inputting the category feature vector into a classification model to determine the category of the document to be processed. Because the document is first divided into shorter texts, the method is suitable for long documents. The category feature vector can be regarded as fusing the category information of the document's full text: it reflects both the category features of each part of the document's content and the associations among those parts. Inputting it into the classification model for classification therefore provides the model with more information, makes the classification result more accurate, and improves the accuracy of document classification.
Drawings
FIG. 1 is a system architecture diagram of a system for performing a document classification method according to an embodiment of the present application;
FIG. 2 is a flowchart of a document classification method provided by an embodiment of the present application;
FIG. 3 is a flowchart of a method for training a classification model according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for determining category characteristics of a text according to an embodiment of the present application;
FIG. 5 is a flow chart of another method for determining category characteristics of text provided by an embodiment of the present application;
FIG. 6 is a flowchart of yet another method for determining category characteristics of text according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for determining category feature vectors of documents to be processed according to an embodiment of the present application;
FIG. 8 is a flowchart of a method for segmenting a document to be processed according to an embodiment of the present application;
FIG. 9 is a process flow diagram of a document classification method applied to identification of contract documents according to an embodiment of the present application;
FIG. 10 is a process diagram of a document classification method according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a document classification apparatus according to an embodiment of the present application;
FIG. 12 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the noun terms to which one or more embodiments of the present application relate are explained.
Feature extraction model: used for extracting features of an input text to obtain the category features of the input text.
Category feature: used for characterizing the category to which a text belongs.
Classification model: used for classifying an input document and determining the category of the document.
Category feature vector: a feature vector from which the category of a document can be determined; it represents both the category features of each part of the document's content and the associations among those parts.
Word unit: before any actual processing of an input text, the text needs to be segmented into language units such as words, punctuation marks, numbers, or letters; these units are called word units. For English text, a word unit can be a word, a punctuation mark, a number, etc.; for Chinese text, the smallest word unit can be a character, a punctuation mark, a number, etc.
Word Embedding Layer: a layer that embeds and encodes the input text; through a mapping or function, it generates a representation of the text in a new space, namely the word embedding vector of the text.
Word embedding: the process of embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, where each word or phrase is mapped to a vector over the real numbers.
Word embedding vector: the vector obtained by performing word embedding processing on a word unit.
Word Attention Layer (word-level attention layer): may include an attention mechanism that performs attention calculation in units of word units.
word2vec: a method of word embedding; an efficient word-vector training method constructed by Mikolov on the basis of Bengio's NNLM (Neural Network Language Model). It can be used to perform word embedding on a text to obtain the word embedding vectors of the text.
Attention mechanism: in cognitive science, owing to bottlenecks in information processing, humans selectively focus on part of all available information while ignoring the rest; this mechanism is commonly called attention. In neural network models, an attention mechanism can improve the efficiency of processing a task by allowing the model to dynamically focus on the portions of the input that contribute to the current task.
Attention calculation: for an output y at a certain moment, computing its attention over the input x, i.e., the weight that each part of the input x contributes to the output y at that moment.
Feature vector: a vector obtained by fusing the word embedding vectors of the word units in a text. The feature vector of a first word unit fuses the relations between that word unit and the other word units in the text, i.e., the semantic information of the full text.
Enhanced feature vector: a vector obtained by fusing the feature vector of a text with the feature vectors of the other texts. The enhanced feature vector of a text fuses the relations between the text and the other texts, i.e., the semantic information of the full document.
BERT (Bidirectional Encoder Representations from Transformers) model: a dynamic word-vector technique that trains a bidirectional Transformer on unlabeled data; it comprehensively considers the feature information of preceding and following words and can better handle problems such as polysemy.
LightGBM model: a gradient boosting framework that uses decision trees as base learners; it supports efficient parallel training and offers faster training speed, lower memory consumption, higher accuracy, distributed support, and the ability to process massive data quickly.
Log loss: i.e., log-likelihood loss, also known as logistic loss or cross-entropy loss, defined on probability estimates. It is commonly used in multinomial logistic regression and neural networks, as well as in some variants of the expectation-maximization algorithm, to evaluate the probabilistic output of a classifier.
TF-IDF (Term Frequency-Inverse Document Frequency): a common weighting technique in information retrieval and text mining, used to evaluate the importance of a word unit to a particular text within a text set or corpus. The importance of a word unit increases in proportion to the number of times it appears in the text, but decreases in inverse proportion to the frequency with which it appears in the corpus.
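For reference, one common formulation of the TF-IDF weight of a word unit t in a text d, given a corpus of N texts of which df(t) contain t, is the following (smoothing variants differ across implementations):

$$ \operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log \frac{N}{\operatorname{df}(t)} $$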
N-gram: a statistical language model used to predict the n-th item from the preceding (n-1) items.
Next, an application scenario of the document classification method provided in the present application will be described.
Document classification intelligently identifies documents and determines their preset categories. In the scenario of contract document classification, the task is to intelligently identify a document and judge whether it belongs to the contract category. There are currently three main approaches to contract document classification. First, rule-based methods: classification rules are designed manually, and whether a document is a contract is determined by rule matching. Second, traditional machine learning methods: word stocks and document features, such as TF-IDF, N-grams, and keywords, are constructed manually, and contract documents are identified by a machine learning model such as an SVM (Support Vector Machine), XGBoost (extreme gradient boosting), or LR (Logistic Regression). Third, deep learning methods based on text interception: for documents with much character content, such as long documents of 3000 words, a portion of text is intercepted from the front or middle of the document, and the intercepted portion is processed by a neural network model such as an LSTM or a CNN to determine the category of the document.
However, rule-based methods require users to design a large number of rules, which is costly to build and labor-intensive. Traditional machine learning methods require manually constructed word stocks and features; the feature engineering is complex and hard to make complete, which affects identification accuracy. In deep learning methods, because the text is long, it cannot be input into the neural network model in its entirety, and intercepting only part of the text loses text information, again affecting identification accuracy.
On this basis, the present application provides a document classification method that is convenient and fast, requires neither complex rule design nor complex feature extraction, and can effectively improve the accuracy of document identification. Specific implementations of the method are described in the following embodiments.
Referring to fig. 1, fig. 1 is a system architecture diagram of a system for performing a document classification method according to an embodiment of the present application.
The system may include a server side 101 that performs the document classification method, a first training side 102 that trains the feature extraction model, and a second training side 103 that trains the classification model. The server side, the first training side, and the second training side may be integrated in the same computing device or reside in mutually independent computing devices. For example, the three may be three mutually independent computing devices; or the first and second training sides may be integrated in one computing device while the server side runs on another; or the server side and the first training side may be integrated in one computing device while the second training side runs on another; or all three may be integrated in the same computing device. The embodiments of the present application do not limit this.
Moreover, each computing device may be a terminal or a server. The terminal may be any electronic product capable of human-computer interaction with a user, and the server may be a single server, a server cluster composed of multiple servers, or a cloud computing service center. The embodiments of the present application do not limit this either.
Taking the example that the server, the first training terminal and the second training terminal are integrated in the same computing device, the document classification method provided by the embodiment of the application is briefly introduced.
The first training side trains the feature extraction model on sample documents; the trained model can output the category feature vector of a sample document, which is sent to the second training side, where the classification model is trained on these vectors.
The server side segments a document to be processed into a plurality of texts and sends them to the first training side, whose feature extraction model determines the category feature of each text. The category features are returned to the server side, which combines them into the category feature vector of the document and sends that vector to the second training side, whose classification model determines the category of the document.
In the document classification method provided by the embodiments of the present application, the document to be processed is first divided into shorter texts, so the method is suitable for long documents. The category feature of each text is determined first, and the category features of the texts are then combined into the category feature vector of the document, which can be regarded as fusing the category information of the document's full text: it reflects both the category features of each part of the document's content and the associations among those parts. Inputting this vector into the classification model therefore provides the model with more information, makes the classification result more accurate, and improves the accuracy of document classification.
The present application provides a document classification method, and further relates to a document classification apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Fig. 2 is a flowchart illustrating a document classification method according to an embodiment of the present application, which specifically includes the following steps:
step 202: and segmenting the document to be processed to obtain a plurality of texts.
The documents to be processed are documents which need to be classified to determine the category, or documents which need to be identified to determine whether the documents belong to the target category. The target category is a category to which a document that the user wants to acquire belongs, and for example, the target category may be a contract, a patent document, a resume, or the like.
In some embodiments, if the document to be processed is a long document, that is, it contains a large amount of character content and its data size is large, it cannot be directly input into the feature extraction model for processing; it therefore needs to be divided into a plurality of texts, each containing a smaller amount of character content.
As an example, the document to be processed may be in a picture format, for example PDF (Portable Document Format), or in an editable format such as doc, docx, or txt; the embodiments of the present application do not limit the format of the document to be processed.
In some embodiments, before segmenting the document to be processed, the character content of the document to be processed may be obtained, and then the character content may be segmented according to the segmentation policy, so as to obtain a plurality of texts. For documents to be processed with different formats, character content may be recognized and obtained by using corresponding character recognition methods, for example, for documents in picture formats, character content in the documents to be processed may be recognized by using an OCR (optical character recognition) technology, and the character recognition method is not limited in the embodiment of the present application.
As an example, the segmentation policy may include segmentation by paragraph, by sentence, by chapter, by page, and the like; the embodiments of the present application do not limit the segmentation policy. In practice, page-wise segmentation may split a complete sentence across two texts, so it can be combined with other segmentation methods to ensure that the content of each resulting text is complete. That is, the segmentation policy is applied in a way that keeps the content of every segmented text complete.
Illustratively, in the case of page-wise segmentation, whether to cut at a page boundary may be decided by checking whether the last character of each page is an end symbol, such as a period or an exclamation point. If the last character of the current page is a period, the character content of the current page is determined as one text. If the last character of the current page is not an end symbol, an end symbol is searched for from the next page, and the character content of the next page before its first end symbol is merged into the current page; that is, the character content of the current page together with the next page's content before its first end symbol is determined as one text. Alternatively, the first paragraph of the next page may be merged into the current page, i.e., the character content of the current page together with the first paragraph of the next page is determined as one text.
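As an illustrative sketch only (not code from the original disclosure), the page-wise rule above may be implemented as follows in Python; the end-symbol set, function name, and variable names are assumptions:

    END_SYMBOLS = ("。", "！", "？", ".", "!", "?")  # assumed end-symbol set

    def split_by_pages(pages):
        """Split page strings into texts that each end on a sentence boundary."""
        pages = list(pages)
        texts = []
        i = 0
        while i < len(pages):
            content = pages[i]
            while (i + 1 < len(pages) and content.rstrip()
                   and not content.rstrip().endswith(END_SYMBOLS)):
                nxt = pages[i + 1]
                cut = min((nxt.find(s) for s in END_SYMBOLS if s in nxt),
                          default=-1)
                if cut == -1:
                    content += nxt  # no end symbol on the next page: absorb it whole
                    i += 1
                else:
                    # Merge the next page's content before its first end symbol
                    # into the current text, as described above.
                    content += nxt[:cut + 1]
                    pages[i + 1] = nxt[cut + 1:]
                    break
            if content.strip():
                texts.append(content)
            i += 1
        return texts

A page whose tail carries no end symbol keeps absorbing following content until a sentence boundary is found, which matches the rule described above.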
In the embodiments of the present application, the document to be processed is not processed as a whole but divided into a plurality of short texts, which facilitates model processing and solves the difficulty of handling long documents.
Step 204: and respectively inputting the texts into the feature extraction model, and determining the category feature of each text.
The feature extraction model is used for extracting features of an input text, and the category features are used for representing categories of the text.
In some embodiments, the feature extraction model may include an input layer, an embedding layer, and an output layer, where the output layer further includes a word-level attention layer and a fully connected layer. The input layer performs word segmentation on a text to obtain word units; the embedding layer performs word embedding on the input word units to obtain their word embedding vectors; the word-level attention layer performs attention calculation on the word embedding vectors of the word units within the same text to obtain a feature vector fusing the contextual semantic information of the text; and the fully connected layer determines the category feature of each text, i.e., the category to which each text belongs, based on the feature vector or enhanced feature vector of the text.
As an example, the texts obtained by segmenting the document to be processed are input into the feature extraction model one by one. For any text, the input layer first performs word segmentation to obtain the word units of the text; the word units are then input into the embedding layer to obtain a word embedding vector for each word unit; the word embedding vectors are next input into the word-level attention layer to obtain a feature vector for each word unit that fuses the semantic information of the other word units in the text; and the feature vectors of the word units are spliced to obtain the feature vector of the text. After the word-level attention layer, the feature vector of each text is input into the fully connected layer, which determines the classification result of each text; this result may be referred to as the category feature of the text.
In other embodiments, in order to strengthen the associations among the texts of the whole document to be processed, the output layer of the feature extraction model may further include a text-level attention layer, which performs attention calculation on the feature vectors of the texts to obtain enhanced feature vectors fusing the contextual semantic information of the document.
As an example, the texts obtained by segmenting the document are input into the feature extraction model one by one. For any text, the input layer first performs word segmentation to obtain the word units of the text; the embedding layer produces a word embedding vector for each word unit; the word-level attention layer turns these into the feature vector of the text, carrying the semantic features of its word units; the feature vectors of the texts are then input into the text-level attention layer to obtain an enhanced feature vector for each text, carrying the semantic information of itself and the other texts; and the enhanced feature vector of each text is processed by the fully connected layer to obtain the category feature of each text.
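Purely as an illustration of this per-text step (the patent does not prescribe a specific library), a sketch using the Hugging Face transformers package is shown below; the checkpoint name is a placeholder, the model is assumed to have been fine-tuned as described later, and texts is the list produced by segmentation:

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-chinese", num_labels=2)  # 2 classes: contract / non-contract

    def category_feature(text):
        inputs = tokenizer(text, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits      # shape (1, 2)
        return torch.softmax(logits, dim=-1)[0]  # per-category probabilities

    category_features = [category_feature(t) for t in texts]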
As an example, the feature extraction model may include a BERT model; since the BERT model extracts feature vectors in which a text is fused with full-text semantic information, extracting the category features of texts based on BERT yields more accurate results.
As another example, the feature extraction model may also be a variant of the BERT model, such as RoBERTa, TinyBERT, ALBERT, or ERNIE (Enhanced Language Representation with Informative Entities); these models differ in structure and training and suit different tasks to different degrees, but all of them can be used to extract features from texts.
In addition, the feature extraction model used in the embodiment of the present application may be obtained by training in the following manner:
A sample document set is obtained, where each sample document carries a category label: label 1 represents the target category and label 2 represents a non-target category. Each document is divided into a plurality of texts, and each text, together with the category label of the document it belongs to, is taken as one piece of training data; processing the documents of the sample set in the same way yields multiple pieces of training data, which are input into the feature extraction model for training. For each piece of training data, the feature extraction model predicts a classification result representing the predicted category of the text; the category is converted into vector form to obtain the category feature, the category label is likewise converted into vector form, and the loss value between the predicted category and the category label is determined by a loss function. If the loss value is smaller than a loss threshold, training stops and the feature extraction model is considered trained; if the loss value is greater than or equal to the threshold, the parameters of the feature extraction model are adjusted based on the loss value and training continues until the loss value falls below the threshold.
In the embodiments of the present application, the texts are input into the feature extraction model one by one to obtain the category feature of each text, so each part of the document's content is preliminarily classified; a BERT model can be used to obtain enhanced feature vectors fusing the semantic information of the individual texts, which improves the accuracy of the determined category features.
Step 206: and combining the category characteristics of the texts to obtain a category characteristic vector of the document to be processed.
In some embodiments, combining the category features of the texts may mean splicing them to obtain the category feature vector of the document to be processed; alternatively, attention calculation may first be performed on the category features to obtain an enhanced category feature for each text, and the enhanced category features are then spliced to obtain the category feature vector.
As an example, if the category feature of each text is a one-dimensional vector, the dimension of the category feature vector equals the number of texts obtained by segmenting the document; if the category feature of each text is a multi-dimensional vector, the category features may first be adjusted to the same dimension and then spliced to obtain the category feature vector.
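A minimal sketch of the splicing option, assuming per-text features of equal dimension and a fixed padded document length (the padding length and all names are assumptions; the attention variant is omitted):

    import numpy as np

    def combine(category_features, max_texts=32):
        """Splice per-text category features into one document-level vector."""
        feats = [np.asarray(f, dtype=np.float32).ravel()
                 for f in category_features[:max_texts]]
        dim = feats[0].shape[0]
        mat = np.zeros((max_texts, dim), dtype=np.float32)  # pad short documents
        for i, f in enumerate(feats):
            mat[i] = f
        return mat.ravel()  # the document's category feature vector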
In the embodiments of the present application, the category feature vector of the document to be processed is obtained from the category features of its texts by attention calculation or splicing. The category features reflect the category information of the texts in the document, and the attention calculation or splicing also reflects the associations among those texts; the category feature vector therefore provides more classification bases for subsequent document classification and further improves its accuracy.
Step 208: and inputting the category feature vector into a classification model, and determining the category of the document to be processed.
In some embodiments, the category feature vector of the document to be processed is input into the trained classification model, which determines the category of the document through its constructed decision trees.
As an example, the classification model may include a LightGBM model, and its loss function may be a logarithmic loss function; illustratively, a binary logarithmic loss function used to optimize the parameters of the LightGBM model.
As an example, the loss function of the classification model may also be a cross-entropy loss; the embodiments of the present application do not limit the loss function of the classification model.
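For reference, the binary logarithmic loss mentioned above has the standard form, where y is the true label and p is the predicted probability of the target category:

$$ L(y, p) = -\big( y \log p + (1 - y) \log (1 - p) \big) $$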
For example, the classification model may include a plurality of decision trees. The category feature vector of the document to be processed is input into each decision tree, each tree yields a prediction probability, and the probabilities are summed and normalized to obtain the category probability of the document, from which its category is determined. For example, in the classification model, the closer the probability is to 1, the more likely the document is a contract; the closer the probability is to 0, the more likely it is a non-contract. If the category probability of the document to be processed is determined to be 0.9, the document can be determined to be a contract.
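How such a trained model is queried can be sketched as follows (illustrative only: the model file name, variable names, and the 0.5 decision threshold are assumptions; doc_vector is the category feature vector obtained in step 206):

    import lightgbm as lgb

    booster = lgb.Booster(model_file="doc_classifier.txt")  # hypothetical file
    prob = booster.predict([doc_vector])[0]  # combined output of all trees
    category = "contract" if prob >= 0.5 else "non-contract"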
In summary, the document classification method provided by the application segments a document to be processed into a plurality of texts; inputs each text into a feature extraction model to determine its category feature; combines the category features into the category feature vector of the document; and inputs that vector into a classification model to determine the category of the document. Because the document is first divided into shorter texts, the method is suitable for long documents, and the category feature vector fuses full-text category information: it reflects both the category features of each part of the document's content and the associations among those parts, so the classification model receives more information and produces a more accurate result. In addition, the method avoids laborious manual construction of word stocks and feature engineering, and since the category is not determined from only part of the document's text, the problem of text information loss is avoided and its influence on classification accuracy is reduced.
Fig. 3 is a flowchart of a method for training a classification model according to an embodiment of the present application. The method specifically comprises the following steps:
step 302: obtaining a plurality of sample documents, wherein each sample document corresponds to a category feature vector.
Step 304: a first decision tree is constructed based on the plurality of category feature vectors, and a prediction probability for each sample document is determined based on the first decision tree.
Step 306: and constructing a second decision tree based on the prediction probability of each sample document and the plurality of category feature vectors, determining the prediction probability of each sample document based on the second decision tree, and so on until a stopping condition is reached, and determining the plurality of constructed decision trees as the classification models.
That is, the classification model may be regarded as a model composed of multiple decision trees; each decision tree may be regarded as a calculation formula, and the combination of these formulas serves as the parameters of the classification model for classifying its input. In practice, the decision trees are constructed by adjusting their parameters so that the final predicted category determined from all the trees approaches, or even equals, the category labels of the sample documents.
As an example, the category feature vector corresponding to each sample document is a feature for representing the category of the sample document, and each dimension in the category feature vector represents a category feature.
In addition, before the decision trees are constructed, a preset number of decision trees to be constructed may be set, along with a preset depth for each decision tree, i.e., the number of layers the tree contains. The preset number and the preset depth may be set by the user according to actual requirements, set by device default, or adjusted according to actual conditions.
As an example, assume there are M sample documents and that the category feature vector of each sample document is N-dimensional, each dimension representing one category feature taking the two values 0 and 1, where 0 represents non-contract and 1 represents contract. The initial prediction probability of each sample document may be set to 0.5, indicating that each sample document is equally likely to be a contract or a non-contract. When constructing the first decision tree, for the feature X1 of each category feature vector, the following candidate splits can be obtained: X1<2, X1<1, and X1<0. The gain of each split is determined through a loss function based on the initial prediction probabilities of the M sample documents. The same steps are repeated for the feature X2 of each category feature vector, and so on, until all category features have been traversed. The split with the maximum gain among all computed gains is then selected as the splitting point, the M sample documents are divided according to that split, and the process repeats until the preset depth is reached. In addition, during construction of the decision tree, if a leaf node contains only one sample document, the node value of that leaf node can be calculated.
Here X1<2, X1<1, and X1<0 are merely examples indicating that the feature X1 of the category feature vector may be split in several ways; 0, 1, and 2 may carry different meanings and may be set according to user requirements or by device default. They do not limit how the feature X1 is split in the embodiments of the present application.
In fact, features may be split in different ways according to their types. For example, suppose the feature X1 represents the layout of a sample document, which may be single-column, double-column, or mixed. The splits of X1 may then include: X1<2, testing whether the layout of the sample document is single-column; X1<1, testing whether the layout is double-column; and X1<0, testing whether the layout is mixed. In each case, the sample documents are divided according to whether their layout is of the corresponding type. The correspondence between 0, 1, 2 and the layouts of the sample documents may be set according to actual requirements, which the embodiments of the present application do not limit.
Taking the gain of the split X1<1 as an example: after splitting by X1<1, the sample document set A falling into the branch "X1<1" and the sample document set B falling into the branch "X1≥1" are determined. The loss value of set A is determined through the loss function from the initial prediction probabilities of the sample documents in A, the loss value of set B likewise, and the loss value of all M sample documents from their initial prediction probabilities; the gain of splitting by X1<1 is then determined from these three loss values.
As an example, after the first decision tree is constructed, each sample document falls into one leaf node, and each leaf node has a node value from which the prediction probability of the sample documents at that leaf node can be determined. The second decision tree is then constructed in the same manner as the first, based on the prediction probability of each sample document, and new prediction probabilities are determined from the second tree. This repeats until the number of constructed decision trees is greater than or equal to the preset number; or, since each sample document carries a category label, until the predicted category of every sample document, determined from the prediction probabilities given by the currently constructed trees, matches its category label. Construction of decision trees then stops, and the currently constructed decision trees are determined as the trained classification model.
It should be noted that the classification models mentioned in the embodiments of the present application are trained in the manner of steps 302 to 306 described above.
In the present application, multiple decision trees are constructed from the category feature vectors of the sample documents, and the prediction probability of each sample document is determined from the trees; when the prediction probabilities and the labels indicate that the stopping condition for tree construction is met, construction stops and all constructed decision trees are determined as the classification model. A classification model capable of predicting document categories is thereby obtained, facilitating classification of documents to be processed.
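The training procedure of steps 302 to 306 corresponds to the gradient-boosting loop that LightGBM runs internally, where each boosting round grows one decision tree. A minimal sketch is shown below; the data shapes, parameter values, and the 0/1 label encoding are illustrative assumptions, and category_vectors and labels are assumed to come from the sample documents:

    import lightgbm as lgb
    import numpy as np

    X = np.stack(category_vectors)  # (num_sample_docs, feature_dim)
    y = np.array(labels)            # 1 = target category, 0 = non-target

    params = {
        "objective": "binary",       # binary log loss, as described above
        "metric": "binary_logloss",
        "num_leaves": 31,
        "max_depth": 6,              # the "preset depth" of each tree
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y),
                        num_boost_round=100)  # the "preset number" of trees
    booster.save_model("doc_classifier.txt")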
Fig. 4 is a flowchart illustrating a method for determining a category feature of a text according to an embodiment of the present application, which specifically includes the following steps:
step 402: and performing word segmentation processing on each text through an input layer to obtain a word unit of each text.
In the embodiment of the application, the feature extraction model comprises an input layer, an embedding layer and an output layer, wherein the input layer is used for performing word segmentation processing on input.
In some embodiments, during word segmentation of a text: if the text is a Chinese text, a word may be divided into a word unit, a phrase may be divided into a word unit, and a punctuation mark may be divided into a word unit; if the text is a foreign-language text, a word, a phrase, or a foreign-language character may be divided into a word unit; and if there are numbers in the text, each number may be divided into a word unit on its own.
In the embodiments of the present application, each text may be segmented by any word segmentation method based on a dictionary, on word-frequency statistics, on rules, and so on. In some embodiments, the dictionary-based methods may include forward maximum matching, reverse maximum matching, minimum word count segmentation, and bidirectional matching. The rule-based methods may include word segmentation based on an HMM (Hidden Markov Model). Alternatively, in the embodiments of the present application, if the text is a Chinese text, each character may further be divided into a word unit.
Taking forward maximum matching as an example: for any text, m characters of the text are taken forward, in reading order, as the matching field, and the matching field is matched against the words in a dictionary. If the dictionary contains a word identical to the matching field, the match is considered successful and the matching field is segmented off as a word unit. If the dictionary contains no word identical to the matching field, the match is considered failed: the last character of the matching field is removed, and the remaining characters are matched again as a new matching field, until the length of the remaining string is zero, which completes one round of matching. The next group of m characters is then taken from the text as the matching field, and matching continues until all characters of the text have been segmented.
Here m may be the number of characters contained in the longest word of the dictionary, or may be preset according to experience; the embodiments of the present application do not limit it.
Taking reverse maximum matching as an example: for any text, m characters of the text are taken in reverse, against the reading order, as the matching field, and the matching field is matched against the words in a reverse-order dictionary. If the reverse-order dictionary contains a word identical to the matching field, the match succeeds and the matching field is segmented off as a word unit. If it does not, the match fails: the first character of the matching field is removed, and the remaining characters continue to be matched as a new matching field, until the length of the remaining string is zero, which completes one round of matching. The next group of m characters is then taken from the text as the matching field, until all characters of the text have been segmented. Each word in the reverse-order dictionary is stored in reverse order.
As an example, the text may first be inverted to generate a reverse-order text, and the reverse-order text may then be processed with the forward-maximum-matching method against the reverse-order dictionary, achieving the same segmentation effect.
For example, taking the matching field as "fulfillment obligation" as an example, the word unit "fulfillment", "obligation" can be obtained by the above-described word segmentation method.
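The forward-maximum-matching procedure can be sketched as follows (illustrative only: the toy dictionary and names are assumptions, and a single unmatched character is simply emitted as its own word unit):

    DICTIONARY = {"fulfillment", "obligation", "contract"}  # toy dictionary
    MAX_LEN = max(len(w) for w in DICTIONARY)  # the value m described above

    def forward_max_match(text):
        units, i = [], 0
        while i < len(text):
            for m in range(min(MAX_LEN, len(text) - i), 0, -1):
                field = text[i:i + m]
                if field in DICTIONARY or m == 1:
                    # Match found, or a single leftover character.
                    units.append(field)
                    i += m
                    break
        return units

    # forward_max_match("fulfillmentobligation") -> ["fulfillment", "obligation"]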
In the embodiments of the present application, word segmentation of the text yields word units that are convenient for the feature extraction model to process, facilitating the model's subsequent processing.
Step 404: and respectively carrying out word embedding processing on the word units of each text through an embedding layer to obtain word embedding vectors of the word units in each text.
In some embodiments, word embedding may be performed on the word units through the embedding layer of the feature extraction model; that is, the word units of the texts are input into the word embedding layer to obtain the word embedding vectors of the word units in each text.
As an example, the word units of each text may be randomly initialized to obtain a word embedding vector for each word unit; or word embedding may be performed on the word units of each text by one-hot encoding to obtain a word embedding vector for each word unit; or word embedding may be performed on the word units of each text by word2vec encoding to obtain a word embedding vector for each word unit.
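The word2vec option can be sketched with the gensim library (illustrative only; the toy corpus and parameter values are assumptions):

    from gensim.models import Word2Vec

    corpus = [["both sides", "must", "according", "contract",
               "fulfilling", "obligation"]]            # toy tokenized corpus
    w2v = Word2Vec(sentences=corpus, vector_size=100, window=5,
                   min_count=1, sg=1)                  # sg=1: skip-gram
    vec = w2v.wv["contract"]  # word embedding vector of a word unit, shape (100,)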
Step 406: for any text, determining the category characteristics of the text through an output layer based on the word embedding vectors of word units in the text.
In the embodiment of the present application, after determining the word embedding vector of the word unit in each text, the word embedding vectors of multiple word units in the same text may be combined to obtain the word embedding vector of the text, and the category feature of the text may be determined based on the word embedding vector of the text.
In some embodiments, for any text, the word embedding vector of the text may be obtained by concatenating the word embedding vectors in the order of the word units in the text; alternatively, the word embedding vectors of the word units may be added to obtain the word embedding vector of the text.
For example, take the text "both sides must fulfill obligations according to contracts". Suppose the word embedding vectors of the word units "both sides", "must", "according", "contract", "fulfilling", and "obligation" determined through the two steps above are 001, 000, 001, 010, 100, and 110 respectively. The word embedding vector of the text is then obtained by splicing these vectors in the order of the word units in the text. Depending on the splicing mode, two forms of the text's word embedding vector can be obtained: one is the flat vector 001000001010100110, and the other is the 6 x 3 matrix

[0 0 1]
[0 0 0]
[0 0 1]
[0 1 0]
[1 0 0]
[1 1 0]
In some embodiments, the output layer may include a fully connected layer, which contains an activation function. Illustratively, the activation function may be a Sigmoid function, which normalizes its input by mapping the input variable into the interval (0, 1).
As an example, for any text, the word embedding vector of the text may be input into the fully connected layer, which transforms it through its parameters into a relevance score between the text and each category. The relevance scores are processed by the activation function to determine the probability that the text belongs to each category; the category with the maximum probability is determined as the category to which the text belongs, and that category is converted into a vector representing the category feature of the text.
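A minimal numeric sketch of this output step (the weights are untrained placeholders; the 18-dimensional input matches the 18-digit spliced vector in the example above):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W = rng.standard_normal((2, 18))  # 2 categories x 18-dim text vector
    b = np.zeros(2)

    def classify(text_vec):
        scores = W @ text_vec + b     # relevance score per category
        probs = sigmoid(scores)       # mapped into (0, 1)
        return int(np.argmax(probs))  # index of the most probable category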
Exemplarily, assuming that the categories of the text include contract and non-contract, with contract represented by 1 and non-contract by 0, the category feature of the text may be a one-dimensional vector 0 or 1. Alternatively, the category feature of the text may be an n-dimensional vector (n ≥ 2); for example, if the text is a contract, its category feature may be 00.
For example, assuming that the category of the text includes both contracts and non-contracts, the contract being represented by 1 and the non-contract being represented by 2, the category characteristic of the text may be 1 or 2.
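A minimal sketch of such a fully connected layer follows; the weight shapes, the two-category setup, and returning the category as an index are illustrative assumptions rather than the patent's fixed design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def category_feature(text_vector, W, b):
    """Convert a text's vector into a category: relevance scores via the
    layer parameters, probabilities via the activation, then the index
    of the most probable category."""
    scores = W @ text_vector + b   # one relevance score per category
    probs = sigmoid(scores)        # probability of the text per category
    return int(np.argmax(probs))   # category with the maximum probability

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 18)), np.zeros(2)        # 2 categories, 18-dim text vector
feature = category_feature(rng.normal(size=18), W, b)   # e.g. 0 or 1
```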
It should be noted that, in the embodiment of the present application, the operations performed on each text are the same, and for convenience of description, only any text is taken as an example to describe the process of determining the category feature of the text.
It should be noted that steps 402 to 406 are a specific implementation of step 204.
In the embodiment of the application, word embedding processing is performed on the word units in a text through the feature extraction model, and the word embedding vector of the text is determined from the word embedding vectors of its word units. A word embedding vector that accurately reflects the semantic features of the text can thus be obtained, and using this more accurate word embedding vector to determine the category of the text improves the accuracy with which the model determines the text category.
Fig. 5 is a flowchart illustrating another method for determining a category feature of a text according to an embodiment of the present application, which specifically includes the following steps:
Step 502: for any text, perform attention calculation on the word embedding vector of a first word unit of the text and the word embedding vector of each word unit in the text through a word-level attention layer to determine a feature vector of the text.
The first word unit is any word unit in the text. The output layer of the feature extraction model comprises a word-level attention layer and a fully connected layer.
In the embodiment of the application, the word embedding vector of a word unit describes only that single word unit. Although a feature vector representing the text semantics can be obtained through simple vector concatenation, such a feature vector ignores the relations between the word units in the text. Therefore, the multiple word embedding vectors can be further processed through a word-level attention layer to obtain a feature vector that takes the association between the word units in the text into account, which serves as the feature vector of the text.
In some embodiments, the word embedding vectors of the plurality of word units in a text are input into the word-level attention layer, and attention calculation is performed on the word embedding vector of each first word unit and the word embedding vector of each word unit in the text (including the first word unit itself) to obtain an attention matrix, where an element in the attention matrix is the correlation value between the first word unit and a word unit in the text. A plurality of weight values corresponding to each word unit are then determined based on the attention matrix, and the feature vector of the text is determined according to the plurality of weight values corresponding to each word unit and the word embedding vector of each word unit.
As an example, the calculation of attention for the first word unit and each word unit in the text may be a determination of a similarity between the first word unit and each word unit in the text.
Exemplarily, assume the text includes 4 word units a, b, c, and d. Attention calculation may be performed on the word embedding vector of word unit a and itself (i.e., the word embedding vector of word unit a) to obtain the correlation value A11 as the element in the first row and first column of the attention matrix; attention calculation may be performed on the word embedding vector of word unit a and the word embedding vector of word unit b to obtain the correlation value A12 of word units a and b as the element in the first row and second column; attention calculation may be performed on the word embedding vector of word unit a and the word embedding vector of word unit c to obtain the correlation value A13 of word units a and c as the element in the first row and third column; and attention calculation may be performed on the word embedding vector of word unit a and the word embedding vector of word unit d to obtain the correlation value A14 of word units a and d as the element in the first row and fourth column. By analogy, the other word units are processed in the same way to obtain the attention matrix. For example, assume the attention matrix is

    [A11 A12 A13 A14]
    [A21 A22 A23 A24]
    [A31 A32 A33 A34]
    [A41 A42 A43 A44]
The attention matrix has equal numbers of rows and columns, both equal to the number of word units in the text. The element Aij in the ith row and jth column of the attention matrix represents the correlation value between the ith word unit and the jth word unit in the text, where i and j are positive integers.
As an example, the specific implementation of determining the plurality of weight values corresponding to each word unit based on the attention matrix may include:
normalizing the correlation values in the attention matrix by rows to obtain normalized correlation values, where the normalized value in the ith row and jth column is the weight value of the ith word unit relative to the jth word unit, so that a plurality of weight values corresponding to each word unit in the text can be obtained;
or, normalizing the correlation values in the attention matrix by columns to obtain normalized correlation values, where the normalized value in the ith column and jth row is the weight value of the ith word unit relative to the jth word unit, so that a plurality of weight values corresponding to each word unit in the text can be obtained.
Continuing with the above example, the correlation values in the attention matrix are normalized by row. For example, assuming that a11, a12, a13 and a14 can be obtained after normalization processing is performed on the correlation value of the first row, it can be determined that a11 is a weighted value of the word unit "a" relative to itself, a12 is a weighted value of the word unit "a" relative to the word unit "b", a13 is a weighted value of the word unit "a" relative to the word unit "c", a14 is a weighted value of the word unit "a" relative to the word unit "d", and so on, a weighted value of each word unit relative to itself and other word units in the text, that is, a plurality of weighted values corresponding to each word unit can be determined.
Similarly, normalization processing is performed according to columns, and a plurality of weight values corresponding to each word unit can also be obtained.
As an example, a specific implementation of determining the feature vector of the text according to the plurality of weight values corresponding to each word unit and the word embedding vector of each word unit may include: determining the feature vector of each word unit according to the plurality of weight values corresponding to the word unit and the word embedding vectors of the word units; and determining the feature vector of the text based on the feature vector of each word unit and a preset weight matrix.
The preset weight matrix is a general matrix existing in the word-level attention layer and can be determined by training the feature extraction model.
In one implementation, a word embedding vector matrix may be composed based on word embedding vectors of a plurality of word units; and forming a first weight matrix corresponding to each word unit based on a plurality of weighted values corresponding to each word unit, and determining the feature vector of each word unit based on the word embedding vector matrix and the first weight matrix corresponding to each word unit.
Continuing with the above example, for word units a, b, c, and d, each word unit corresponds to 4 weight values, and each weight value corresponds to one of the 4 word units; that is, of the 4 weight values, 1 corresponds to the word unit itself and the other 3 correspond to the remaining word units. For example, the word unit "a" corresponds to 4 weight values, namely a11, a12, a13, and a14, where a11 corresponds to word unit "a", a12 to word unit "b", a13 to word unit "c", and a14 to word unit "d". Assuming that the word embedding vector of each word unit in the text is an M-dimensional vector, a 4 × M word embedding vector matrix can be obtained based on the word embedding vectors of the 4 word units. For the word unit "a", its 4 weight values may form a 4 × 1 first weight matrix, and the transpose of the 4 × 1 first weight matrix may be multiplied by the 4 × M word embedding vector matrix to obtain a 1 × M matrix, which is the feature vector of the word unit "a". Similarly, the feature vectors of the word units "b", "c", and "d" can be determined respectively.
In another implementation manner, for a first word unit, each weight value corresponding to the first word unit and a word embedding vector of the word unit corresponding to the weight value may be subjected to weighted fusion to obtain a feature vector of the first word unit.
Continuing with the above example, for word units a, b, c, and d, each word unit corresponds to 4 weight values, and each weight value corresponds to one word unit of the 4 word units, that is, of the 4 weight values, 1 weight value corresponds to the word unit itself, and the other 3 weight values correspond to the remaining word units except the word unit. For the word unit "a", which corresponds to 4 weight values of a11, a12, a13 and a14, respectively, a11 may be multiplied by the word embedding vector of "a", a12 may be multiplied by the word embedding vector of "b", a13 may be multiplied by the word embedding vector of "c", a14 may be multiplied by the word embedding vector of "d", and 4 products may be added as the feature vector of the word unit "a", so that the feature vectors of the word units "b", "c" and "d" may be determined, respectively.
Illustratively, after determining the feature vector of each word unit, a feature vector matrix is formed based on the feature vectors of a plurality of word units, and the feature vector of the text is determined based on the feature vector matrix and a preset weight matrix.
Continuing with the above example, assuming that the feature vectors of the 4 word units have been determined, the feature vectors of the 4 word units are combined into a 4 × M feature vector matrix, and the transpose of the 4 × 1 preset weight matrix is multiplied by the 4 × M feature vector matrix to obtain a 1 × M matrix; this matrix is the feature vector of the text and fuses the semantic features of all word units in the text.
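Putting step 502 together, a minimal numpy sketch of the word-level attention layer might look as follows; dot-product similarity and row-wise softmax are assumptions, since the present application fixes neither the attention function nor the normalization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_level_attention(E, u):
    """E: (n, M) matrix of word embedding vectors, one row per word unit.
    u: (n,) preset weight vector learned during training.
    Returns the (M,) feature vector of the text."""
    A = E @ E.T              # attention matrix: A[i, j] relates word units i and j
    W = softmax(A, axis=1)   # normalize by rows -> weight values per word unit
    F = W @ E                # (n, M): feature vector of each word unit
    return u @ F             # pool word features with the preset weight matrix

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))        # 4 word units a, b, c, d with M = 8
u = softmax(rng.normal(size=4))    # stand-in for the trained preset weights
text_feature = word_level_attention(E, u)   # shape (8,)
```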
Step 504: and determining the category characteristics of the text based on the characteristic vector of the text through the full connection layer.
In some embodiments, the feature vector of the text may be input into a Fully Connected Layer, which may be referred to as a full Connected Layer, and which includes an activation function. Illustratively, the activation function may be a Sigmoid function. The Sigmoid function may normalize the input to map the variable of the input between 0, 1.
As an example, for any text, inputting a feature vector of the text into a full-link layer, converting the feature vector by using parameters of the full-link layer to obtain a relevance score between the text and each category, processing the relevance score by using an activation function, determining a probability that the text belongs to each category, determining a category corresponding to the maximum probability as the category to which the text belongs, and converting the category into a vector to obtain a category feature of the text.
It should be noted that steps 502 to 504 are a specific implementation of step 406.
In the embodiment of the application, attention calculation is performed on the first word unit and each word unit in the text through the word-level attention layer, so that the relations between the word units are taken into account and a feature vector that accurately reflects the contextual semantics of the text and the semantics of its word units can be obtained. Representing the text with this more accurate feature vector before determining its category improves the accuracy with which the model determines the text category.
Fig. 6 is a flowchart illustrating a further method for determining a category feature of a text according to an embodiment of the present application, which specifically includes the following steps:
step 602: and performing attention calculation on the feature vector of the text and the feature vector of each text in the plurality of texts through a text-level attention layer to determine an enhanced feature vector of the text.
In an embodiment of the present application, the output layer of the feature extraction model further comprises a text-level attention layer.
In the embodiment of the application, the feature vector of a text describes only that single text. Although a feature vector representing the semantics of the document can be obtained through simple splicing, such a feature vector ignores the relations between the texts in the document. Therefore, the plurality of feature vectors can be further processed through a text-level attention layer to obtain, for each text, a feature vector that takes the association between the texts in the document into account, which serves as the enhanced feature vector of the text.
In some embodiments, feature vectors of a plurality of texts are input into a text-level attention layer, and the feature vector of each text is subjected to attention calculation with the feature vector of each text including the text itself in the plurality of texts, so as to obtain an attention matrix, where an element in the attention matrix is a correlation value between the text and each text in the plurality of texts. And then determining a plurality of weight values corresponding to each text based on the attention matrix, and determining an enhanced feature vector of each text according to the plurality of weight values corresponding to each text and the feature vector of each text.
As an example, the attention calculation for the text and each of the plurality of texts including the text may be determining a similarity between the text and each of the plurality of texts.
Exemplarily, assume the document to be processed is segmented into 4 texts X, Y, Z, and W. Attention calculation may be performed on the feature vector of the text X and itself (i.e., the feature vector of the text X) to obtain the correlation value B11 as the element in the first row and first column of the attention matrix; attention calculation may be performed on the feature vector of the text X and the feature vector of the text Y to obtain the correlation value B12 of the texts X and Y as the element in the first row and second column; attention calculation may be performed on the feature vector of the text X and the feature vector of the text Z to obtain the correlation value B13 of the texts X and Z as the element in the first row and third column; and attention calculation may be performed on the feature vector of the text X and the feature vector of the text W to obtain the correlation value B14 of the texts X and W as the element in the first row and fourth column. By analogy, the other texts are processed in the same way to obtain the attention matrix. For example, assume the attention matrix is

    [B11 B12 B13 B14]
    [B21 B22 B23 B24]
    [B31 B32 B33 B34]
    [B41 B42 B43 B44]
The attention matrix has equal numbers of rows and columns, both equal to the number of texts obtained by segmenting the document to be processed. The element Bij in the ith row and jth column of the attention matrix represents the correlation value between the ith text and the jth text in the document, where i and j are positive integers.
As an example, a specific implementation of determining the plurality of weight values corresponding to each text based on the attention matrix may include:
normalizing the correlation values in the attention matrix by rows to obtain normalized correlation values, where the normalized value in the ith row and jth column is the weight value of the ith text relative to the jth text, so that a plurality of weight values corresponding to each text can be obtained;
or, normalizing the correlation values in the attention matrix by columns to obtain normalized correlation values, where the normalized value in the ith column and jth row is the weight value of the ith text relative to the jth text, so that a plurality of weight values corresponding to each text can be obtained.
Continuing with the above example, the correlation values in the attention matrix are normalized by row. For example, assuming that b11, b12, b13, and b14 can be obtained after normalization processing is performed on the correlation value of the first line, it can be determined that b11 is a weight value of text X relative to itself, b12 is a weight value of text X relative to text Y, b13 is a weight value of text X relative to text Z, b14 is a weight value of text X relative to text W, and so on, a plurality of weight values corresponding to each text can be determined.
In one implementation, determining a specific implementation of the enhanced feature vector of each text according to a plurality of weight values corresponding to each text and the feature vector of each text may include: forming a feature vector matrix based on the feature vectors of the plurality of texts; and forming a second weight matrix corresponding to each text based on a plurality of weight values corresponding to each text, and determining an enhanced feature vector of each text based on the feature vector matrix and the second weight matrix corresponding to each text.
Continuing with the above example, for the texts X, Y, Z, and W, each text corresponds to 4 weight values, and each weight value corresponds to one of the 4 texts; that is, of the 4 weight values, 1 corresponds to the text itself and the other 3 correspond to the remaining texts. For example, the text X corresponds to 4 weight values, namely b11, b12, b13, and b14, where b11 corresponds to the text X, b12 to the text Y, b13 to the text Z, and b14 to the text W. Assuming that the feature vector of each text is an M-dimensional vector, a 4 × M feature vector matrix can be obtained based on the feature vectors of the 4 texts. For the text X, its 4 weight values may form a 4 × 1 second weight matrix, and the transpose of the 4 × 1 second weight matrix may be multiplied by the 4 × M feature vector matrix to obtain a 1 × M matrix, which is the enhanced feature vector of the text X. Similarly, the enhanced feature vectors of the texts Y, Z, and W can be determined respectively.
In another implementation manner, determining a specific implementation of the enhanced feature vector of each text according to the plurality of weight values corresponding to each text and the feature vector of each text may include: and for the first text, performing weighted fusion on each weight value corresponding to the first text and the feature vector of the text corresponding to the weight value to obtain an enhanced feature vector of the first text.
Continuing with the above example, for text X, Y, Z, W, there are 4 weight values corresponding to each text, and each weight value corresponds to one text of the 4 texts, that is, of the 4 weight values, 1 weight value corresponds to the text itself, and the other 3 weight values correspond to the rest of texts except the text. For the text X, the 4 corresponding weighted values are b11, b12, b13 and b14, respectively, b11 may be multiplied by the feature vector of the text X, b12 may be multiplied by the feature vector of the text Y, b13 may be multiplied by the feature vector of the text Z, b14 may be multiplied by the feature vector of the text W, and the 4 products may be added to form the enhanced feature vector of the text X. Thus, enhanced feature vectors of the text Y, the text Z and the text W can be determined respectively.
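The text-level attention layer repeats the same computation over text feature vectors; under the same dot-product and row-softmax assumptions as the earlier sketch, the enhanced feature vectors might be computed as:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enhance_text_features(F):
    """F: (n, M) matrix of text feature vectors (texts X, Y, Z, W, ...).
    Returns (n, M) enhanced feature vectors, each row fusing the features
    of its own text with those of every other text in the document."""
    B = F @ F.T                    # B[i, j]: correlation of text i with text j
    weights = softmax(B, axis=1)   # weight values per text, normalized by rows
    return weights @ F             # weighted fusion of the text feature vectors

rng = np.random.default_rng(0)
enhanced = enhance_text_features(rng.normal(size=(4, 8)))   # 4 texts, M = 8
```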
Step 604: and determining the category characteristics of the text based on the enhanced characteristic vector of the text through the full connection layer.
In some embodiments, the enhanced feature vector of the text may be input into a fully-connected layer that includes an activation function. As an example, for any text, inputting an enhanced feature vector of the text into a full-link layer, converting the enhanced feature vector through parameters of the full-link layer to obtain a relevance score between the text and each category, processing the relevance score through an activation function to determine a probability that the text belongs to each category, determining a category corresponding to the maximum probability as the category to which the text belongs, and converting the category into a vector to obtain a category feature of the text.
It should be noted that steps 602 to 604 are a specific implementation manner of step 504 described above.
In the embodiment of the application, after the feature vector of each text is determined, which accurately reflects the contextual semantics of the text and the semantics of the word units in it, the feature vectors of the texts can be processed through the text-level attention layer to obtain, for each text, an enhanced feature vector that fuses the features of that text with those of the other texts. Because the enhanced feature vector takes the association between the texts into account, it can accurately represent both the text itself and the relations between texts. Determining the category of each text based on the enhanced feature vector therefore takes the association across the whole document content into account, which improves the accuracy of determining the text category.
Fig. 7 is a flowchart illustrating a method for determining a category feature vector of a document to be processed according to an embodiment of the present application, which specifically includes the following steps:
step 702: and splicing the category characteristics of the plurality of texts according to the sequence of the plurality of texts in the document to be processed to obtain a category characteristic vector of the document to be processed.
In a possible implementation manner, a sequence of a plurality of texts obtained by segmenting a document to be processed inevitably exists in the document to be processed, and in order to enable the obtained category feature vector to more accurately represent the document to be processed, the category features of the plurality of texts can be spliced according to the sequence of the plurality of texts in the document to be processed, so that the category feature vector of the document to be processed is obtained.
As an example, the category features may be spliced by increasing the feature dimension. For example, assuming that the document to be processed is segmented into 3 texts, where the category feature of text 1 is 00, that of text 2 is 01, and that of text 3 is 01, the spliced category feature vector of the document to be processed is 000101. Alternatively, assuming that the category feature of text 1 is 1, that of text 2 is 2, and that of text 3 is 1, the spliced category feature vector of the document to be processed is 121.
As another example, the category features may be spliced in the form of a matrix. For example, assuming that the document to be processed is segmented into 3 texts, where the category feature of text 1 is 00, that of text 2 is 01, and that of text 3 is 01, the spliced category feature vector of the document to be processed is the 3 × 2 matrix

    [0 0]
    [0 1]
    [0 1]
It should be noted that the above is only an example of obtaining the category feature vector of the document to be processed by splicing the category features. In practical application, a standard dimension may be set for the category feature vector, and when the dimension of the spliced category feature vector is less than the standard dimension, zeros are appended so that it reaches the standard dimension. For example, assume the standard dimension of the category feature vector is set to 20, the category feature of a text of one category is represented by 1 and that of a text of the other category by 2, and the document to be processed is segmented into 18 texts whose category features are, in order, 111111111122222111; then the spliced category feature vector of the document to be processed is 11111111112222211100.
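A sketch of this splicing-and-padding rule follows; the standard dimension of 20 and the 1/2 encoding are taken from the example above:

```python
def build_category_feature_vector(category_features, standard_dim=20):
    """Concatenate per-text category features in document order and pad
    with zeros until the vector reaches the standard dimension."""
    flat = "".join(category_features)
    flat = flat + "0" * max(0, standard_dim - len(flat))
    return [int(c) for c in flat]

# 18 texts: ten of one category (1), five of the other (2), three more of the first
features = ["1"] * 10 + ["2"] * 5 + ["1"] * 3
vector = build_category_feature_vector(features)
# -> [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,1,1,1,0,0]
```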
In another possible implementation manner, the category feature of each text may be subjected to attention calculation with the category feature of itself and the category features of other texts except itself, and an attention matrix may be obtained, where an element in the attention matrix is a correlation value of the text with itself and other texts. And then determining a plurality of weight values corresponding to each text based on the attention matrix, and determining a category feature vector of the document to be processed according to the plurality of weight values corresponding to each text and the category feature of each text.
Exemplarily, assume the document to be processed is segmented into 4 texts X, Y, Z, and W. Attention calculation may be performed on the category feature of the text X and itself (i.e., the category feature of the text X) to obtain the correlation value C11 as the element in the first row and first column of the attention matrix; attention calculation may be performed on the category feature of the text X and the category feature of the text Y to obtain the correlation value C12 of the texts X and Y as the element in the first row and second column; attention calculation may be performed on the category feature of the text X and the category feature of the text Z to obtain the correlation value C13 of the texts X and Z as the element in the first row and third column; and attention calculation may be performed on the category feature of the text X and the category feature of the text W to obtain the correlation value C14 of the texts X and W as the element in the first row and fourth column. By analogy, the other texts are processed in the same way to obtain the attention matrix. For example, assume the attention matrix is

    [C11 C12 C13 C14]
    [C21 C22 C23 C24]
    [C31 C32 C33 C34]
    [C41 C42 C43 C44]
The attention matrix has equal numbers of rows and columns, both equal to the number of texts obtained by segmenting the document to be processed. The element Cij in the ith row and jth column of the attention matrix represents the correlation value between the ith text and the jth text in the document, where i and j are positive integers.
As an example, a specific implementation of determining the plurality of weight values corresponding to each text based on the attention matrix may include:
normalizing the correlation values in the attention matrix by rows to obtain normalized correlation values, where the normalized value in the ith row and jth column is the weight value of the ith text relative to the jth text, so that a plurality of weight values corresponding to each text can be obtained;
or, normalizing the correlation values in the attention matrix by columns to obtain normalized correlation values, where the normalized value in the ith column and jth row is the weight value of the ith text relative to the jth text, so that a plurality of weight values corresponding to each text can be obtained.
Continuing with the above example, the correlation values in the attention matrix are normalized by row. For example, assuming that c11, c12, c13, and c14 can be obtained after normalization processing is performed on the correlation value of the first line, it can be determined that c11 is a weight value of text X relative to itself, c12 is a weight value of text X relative to text Y, c13 is a weight value of text X relative to text Z, c14 is a weight value of text X relative to text W, and so on, a weight value of each text relative to itself and other texts, that is, a plurality of weight values corresponding to each text can be determined.
Similarly, normalization processing is performed according to columns, and a plurality of weight values corresponding to each text can also be obtained.
As an example, a specific implementation of determining the category feature vector of the document to be processed according to the plurality of weight values corresponding to each text and the category feature of each text may include: determining an enhanced category feature of each text according to the plurality of weight values corresponding to the text and the category features of the texts; and determining the category feature vector of the document to be processed based on the enhanced category feature of each text and a preset weight matrix.
In one implementation, a category feature matrix may be composed based on the category features of the plurality of texts; a third weight matrix corresponding to each text is formed based on the plurality of weight values corresponding to the text, and the enhanced category feature of each text is determined based on the category feature matrix and the third weight matrix corresponding to the text.
Continuing with the above example, for the texts X, Y, Z, and W, each text corresponds to 4 weight values, and each weight value corresponds to one of the 4 texts. Assuming the category feature of each text is an M-dimensional vector, a 4 × M category feature matrix can be obtained based on the category features of the 4 texts. For the text X, its 4 weight values may form a 4 × 1 third weight matrix, and the transpose of the 4 × 1 third weight matrix may be multiplied by the 4 × M category feature matrix to obtain a 1 × M matrix, which is the enhanced category feature of the text X. Similarly, the enhanced category features of the texts Y, Z, and W can be determined respectively.
In another implementation manner, for a first text, each weight value corresponding to the first text and the category feature of the text corresponding to the weight value are subjected to weighted fusion to obtain an enhanced category feature of the first text.
Continuing with the above example, for text X, Y, Z, W, there are 4 weight values corresponding to each text, and each weight value corresponds to one text of the 4 texts, that is, of the 4 weight values, 1 weight value corresponds to the text itself, and the other 3 weight values correspond to the rest of texts except the text. For example, the text X corresponds to 4 weight values, which are a weight value c11, a weight value c12, a weight value c13 and a weight value c14, and a weight value c11 corresponds to the text X, a weight value c12 corresponds to the text Y, a weight value c13 corresponds to the text Z, and a weight value c14 corresponds to the text W. For the text X, the 4 corresponding weighted values are c11, c12, c13 and c14, respectively, c11 may be multiplied by the category feature of the text X, c12 may be multiplied by the category feature of the text Y, c13 may be multiplied by the category feature of the text Z, c14 may be multiplied by the category feature of the text W, and the 4 products may be added as the enhanced category feature of the text X. Thus, the enhanced category characteristics of the text Y, the text Z and the text W can be respectively determined.
Illustratively, after the enhanced category features of each text are determined, an enhanced category feature matrix is formed based on the enhanced category features of the texts, and a category feature vector of the document to be processed is determined based on the enhanced category feature matrix and a preset weight matrix.
Continuing with the above example, assuming that the enhanced category features of the 4 texts have been determined, the enhanced category features of the 4 texts are combined into a 4 × M enhanced category feature matrix, and the transpose of the 4 × 1 preset weight matrix is multiplied by the 4 × M enhanced category feature matrix to obtain a 1 × M matrix, which is the category feature vector of the document to be processed and fuses the category features of all the texts in the document to be processed.
It should be noted that step 702 is a specific implementation manner of step 206.
Step 704: and inputting the category feature vector into a classification model, and determining the category of the document to be processed.
It should be noted that, for specific implementation of step 704, reference may be made to the related description of step 208, and this embodiment is not described herein again.
According to the method and device provided by the embodiments of the present application, the category features of the plurality of texts are spliced according to the sequence of the texts in the document to be processed, which yields a category feature vector that conforms to the writing logic of the document and represents the association between the texts in it. Determining the category of the document to be processed based on this category feature vector takes into account both the overall semantics and the contextual relations of the document, so the determined category is more accurate.
Fig. 8 is a flowchart illustrating a method for segmenting a document to be processed according to an embodiment of the present application, which specifically includes the following steps:
step 802: and identifying the content of the document to be processed based on a character identification algorithm to obtain the character content of the document to be processed.
Wherein the character recognition algorithm is used to recognize character content in the document. For example, the character recognition algorithm may be an OCR algorithm or a PDF parsing tool.
In some embodiments, the document to be processed may be parsed by a PDF parsing tool; or character recognition can be carried out on the document to be processed through an OCR algorithm; or, the PDF analysis tool and the OCR algorithm may be fused to perform character recognition on the document to be processed, and the character content of the document to be processed can be determined by these several ways.
As an example, although character recognition based on the OCR algorithm is generally effective, the OCR algorithm may suffer from problems such as misrecognizing special characters or recognizing certain symbols as words. The PDF analysis tool, in turn, identifies characters accurately, but on its own cannot restore the layout information of the document to be processed. Fusing the two therefore compensates for the OCR algorithm's errors on special characters and for the PDF analysis tool's inability to restore layout information, improving the character content recognition effect.
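This fusion of the two recognition paths could be prototyped as below; pdfplumber, pdf2image, and pytesseract are assumed tool choices (the present application names no specific libraries), and the page-level fallback rule is likewise an assumption:

```python
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

def extract_character_content(pdf_path):
    """Prefer the PDF parser's accurate character output; fall back to OCR
    for pages where the parser yields nothing (e.g. scanned pages)."""
    page_images = convert_from_path(pdf_path)   # rendered page images for OCR
    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if not text.strip():                # parsing failed: run OCR instead
                text = pytesseract.image_to_string(page_images[i])
            texts.append(text)
    return "\n".join(texts)
```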
Step 804: and segmenting the character content according to a preset segmentation strategy to obtain a plurality of texts.
The preset segmentation strategy may be a strategy that is manually and empirically set and is used for dividing the character content into a plurality of texts.
In some embodiments, the preset segmentation strategy may be to segment the document to be processed according to its chapters; or according to its paragraphs; or to divide the character content into pieces of a specific number of characters, in which case the integrity of the text content needs to be ensured during the division; or the document to be processed may be divided by combining the three ways, i.e., by chapter, by paragraph, and by a specific number of characters.
As an example, the first chapter may be divided into one text, the second chapter into another text, and so on by chapter; or the first paragraph may be divided into one text, the second paragraph into another text, and so on by paragraph; or H texts are obtained by dividing by chapters, and each of them is further divided by paragraphs; or K texts are obtained by dividing by paragraphs, and each of them is further divided by a specific number of characters; or S texts are obtained by dividing by chapters, sub-texts are obtained by dividing each text by paragraphs, and each sub-text is further divided by a specific number of characters.
It should be noted that steps 802-804 are a specific implementation of step 202.
In the embodiment of the application, before the document to be processed is classified, character recognition is performed on it through the OCR algorithm and the PDF analysis tool to obtain its character content, and the character content is then segmented according to the preset segmentation strategy to obtain a plurality of texts. This solves the problem that a long document cannot be directly input into a model for classification.
The document classification method provided by the present application is further described below with reference to fig. 9 by taking the application of the document classification method to the identification problem of contract documents as an example. Fig. 9 shows a processing flow chart of a document classification method applied to identification of contract documents according to an embodiment of the present application, which specifically includes the following steps:
step 902: and identifying the content of the document to be processed based on a character identification algorithm to obtain the character content of the document to be processed.
Wherein the character recognition algorithm is used for recognizing character contents in the document.
Taking a PDF document as the document to be processed as an example, the content of the document may be identified by combining an OCR algorithm and a PDF analysis tool, that is, the character content in the document to be processed is extracted.
Step 904: and segmenting the character content according to a preset segmentation strategy to obtain a plurality of texts.
Continuing with the above example, the segmentation may be performed according to the maximum text length that the BERT model can handle, and the completeness of the sentence is guaranteed. For example, if the maximum length is 510, each 510 characters may be divided into one text, but if the 510 th character is reached, the text is divided into half sentences, and the sentence is divided from the end of less than 510 characters.
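This length-plus-sentence-boundary rule can be sketched as follows; the sentence-ending punctuation set is an assumption, and a single sentence longer than 510 characters would need extra handling:

```python
import re

def segment(character_content, max_len=510):
    """Split the recognized character content into texts of at most
    max_len characters, cutting only at sentence boundaries."""
    sentences = re.split(r"(?<=[。！？.!?])", character_content)
    texts, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_len:
            texts.append(current)   # close the text before the limit is exceeded
            current = ""
        current += sentence
    if current:
        texts.append(current)
    return texts
```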
For example, referring to fig. 10, fig. 10 is a schematic processing procedure diagram of a document classification method according to an embodiment of the present application. In fig. 10, N texts are input to the BERT model.
Step 906: and respectively inputting the plurality of texts into the feature extraction model, and performing word segmentation processing on each text to obtain a word unit of each text.
For example, the feature extraction model may be a BERT model.
Step 908: and respectively carrying out word embedding processing on the word units of each text to obtain word embedding vectors of the word units in each text.
Step 910: aiming at any text, performing attention calculation on a word embedding vector of a first word unit of the text and a word embedding vector of each word unit in the text through a word level attention layer of a feature extraction model, and determining a feature vector of the text.
Step 912: and performing attention calculation on the feature vector of the text and the feature vector of each text in the plurality of texts to determine an enhanced feature vector of the text.
Step 914: and inputting the enhanced feature vector of the text into a full connection layer of a feature extraction model, and determining the category feature of the text.
Step 916: and splicing the category characteristics of the plurality of texts according to the sequence of the plurality of texts in the document to be processed to obtain a category characteristic vector of the document to be processed.
For example, referring to fig. 10, after the processing is performed by the BERT model, N category features can be obtained, and the N category features are sequentially spliced to obtain a category feature vector of the document to be processed.
Step 918: and inputting the category characteristic vector of the document to be processed into the classification model, and determining the category of the document to be processed.
For example, the classification model may be Lightgbm, and referring to fig. 10, a category feature vector is input into the Lightgbm model, so that a probability that the to-be-processed document is a contract and a probability that the to-be-processed document is not a contract can be obtained, if the probability of the contract is greater than the probability of the non-contract, it is determined that the category of the to-be-processed document is a contract, if the probability of the contract is less than the probability of the non-contract, it is determined that the category of the to-be-processed document is not a contract, and if the probability of the contract is the same as the probability of the non-contract, it is necessary to re-determine the category of the to-be-processed document.
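With a trained Lightgbm classifier, the decision rule above could look like this sketch; the model variable and the class ordering are illustrative assumptions:

```python
import numpy as np

def decide_category(clf, category_feature_vector):
    """clf: a trained lightgbm.LGBMClassifier whose classes are assumed to
    be 0 = non-contract and 1 = contract. Returns the decided category,
    or None on a tie, in which case the category must be re-determined."""
    probs = clf.predict_proba(np.array([category_feature_vector]))[0]
    non_contract, contract = probs[0], probs[1]
    if contract > non_contract:
        return "contract"
    if contract < non_contract:
        return "non-contract"
    return None   # equal probabilities: re-determine the category
```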
The document classification method provided by the application segments a document to be processed to obtain a plurality of texts; inputs the texts into a feature extraction model respectively to determine the category feature of each text; combines the category features of the texts to obtain a category feature vector of the document to be processed; and inputs the category feature vector into a classification model to determine whether the document to be processed is a contract. Because the document to be processed is first divided into shorter texts, the method is suitable for processing long documents. The category feature of each text is determined first, and the category features of the plurality of texts are then combined into the category feature vector of the document to be processed, which can be regarded as fusing the category information of the full text of the document: it reflects both the category features of each part of the content and the association between the parts. Inputting this category feature vector into the classification model therefore provides the model with more information, makes its classification result more accurate, and improves the accuracy of contract document identification. In addition, the improved accuracy improves the user experience, which can increase how often users adopt the method and raise the conversion rate of follow-up tasks such as contract auditing.
Corresponding to the above method embodiment, the present application further provides a document sorting apparatus embodiment, and fig. 11 shows a schematic structural diagram of a document sorting apparatus provided in an embodiment of the present application. As shown in fig. 11, the apparatus includes:
a segmentation module 1102 configured to segment the document to be processed to obtain a plurality of texts;
a first determining module 1104, configured to input the texts into the feature extraction model respectively, and determine category features of each text;
a combination module 1106, configured to combine the category features of the multiple texts to obtain a category feature vector of the document to be processed;
a second determining module 1108 configured to input the category feature vector into a classification model, and determine a category of the document to be processed.
In one possible implementation manner of the present application, the feature extraction model includes an input layer, an embedding layer, and an output layer, and the first determining module 1104 is further configured to:
performing word segmentation processing on each text through the input layer to obtain a word unit of each text;
respectively carrying out word embedding processing on the word units of each text through the embedding layer to obtain word embedding vectors of the word units in each text;
for any text, determining the category characteristics of the text based on the word embedding vectors of word units in the text.
In one possible implementation manner of the present application, the output layer includes a word-level attention layer and a fully connected layer, and the first determining module 1104 is further configured to:
for any text, performing attention calculation on a word embedding vector of a first word unit of the text and a word embedding vector of each word unit in the text through the word level attention layer to determine a feature vector of the text, wherein the first word unit is any word unit in the text;
and determining the category characteristics of the text based on the characteristic vector of the text through the full connection layer.
In one possible implementation manner of the present application, the output layer further includes a text-level attention layer, and the first determining module 1104 is further configured to:
performing attention calculation on the feature vector of the text and the feature vector of each text in a plurality of texts through the text-level attention layer to determine an enhanced feature vector of the text;
determining, by the fully-connected layer, a category feature of the text based on the feature vector of the text, including:
and determining the category characteristics of the text based on the enhanced characteristic vector of the text through the full connection layer.
In one possible implementation manner of the present application, the combining module 1106 is further configured to:
and splicing the category characteristics of the texts according to the sequence of the texts in the document to be processed to obtain a category characteristic vector of the document to be processed.
In one possible implementation of the present application, the feature extraction model includes a BERT model.
In one possible implementation manner of the present application, the apparatus further includes a classification model training module, where the classification model training module is configured to:
obtaining a plurality of sample documents, wherein each sample document corresponds to a category feature vector;
constructing a first decision tree based on a plurality of category feature vectors, and determining a prediction probability of each sample document based on the first decision tree;
and constructing a second decision tree based on the prediction probabilities of the sample documents and the plurality of category feature vectors, determining the prediction probability of each sample document based on the second decision tree, and so on until a stopping condition is reached, at which point the plurality of constructed decision trees are determined as the classification model.
In one possible implementation manner of the present application, the classification model includes a Lightgbm model, and the loss function of the classification model is a logarithmic loss function.
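This iterative construction of decision trees on the previous round's predictions is what gradient boosting frameworks perform internally; a hedged training sketch with Lightgbm and a logarithmic (binary) loss, on made-up data:

```python
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20))   # 200 sample documents, one 20-dim
                                         # category feature vector each (made up)
y = rng.integers(0, 2, size=200)         # 1 = contract, 0 = non-contract

# the "binary" objective trains with the logarithmic loss; each boosting
# round fits one more decision tree to the errors of the trees built so far
clf = LGBMClassifier(objective="binary", n_estimators=100)
clf.fit(X, y)
probabilities = clf.predict_proba(X[:1])
```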
In one possible implementation manner of the present application, the segmentation module 1102 is further configured to:
identifying the content of the document to be processed based on a character identification algorithm to obtain the character content of the document to be processed, wherein the character identification algorithm is used for identifying the character content in the document;
and segmenting the character content according to a preset segmentation strategy to obtain the plurality of texts.
The document classification device provided by the application segments a document to be processed to obtain a plurality of texts; inputs the texts into a feature extraction model respectively to determine the category feature of each text; combines the category features of the texts to obtain a category feature vector of the document to be processed; and inputs the category feature vector into a classification model to determine the category of the document to be processed. Because the document to be processed is first divided into shorter texts, the device is suitable for processing long documents. The category feature of each text is determined first, and the category features of the plurality of texts are then combined into the category feature vector of the document to be processed, which can be regarded as fusing the category information of the full text of the document: it reflects both the category features of each part of the content and the association between the parts. Inputting this category feature vector into the classification model therefore provides the model with more information, makes its classification result more accurate, and improves the accuracy of document classification.
The above is an illustrative scheme of the document classification device of this embodiment. It should be noted that the technical solution of the document classification device and the technical solution of the document classification method described above belong to the same concept; for details not described in the technical solution of the device, refer to the description of the document classification method above. Further, the components in the device embodiment should be understood as functional modules established to implement the steps of the program flow or of the method, rather than modules that are necessarily divided or separately defined in practice. A device claim defined by such a set of functional modules should be understood as a functional-module framework that implements the solution mainly by means of the computer program described in the specification, not as a physical device that implements the solution mainly by means of hardware.
Fig. 12 shows a block diagram of a computing device 1200 according to an embodiment of the present application. The components of the computing device 1200 include, but are not limited to, memory 1210 and processor 1220. Processor 1220 is coupled to memory 1210 via bus 1230, and database 1250 is used to store data.
The computing device 1200 also includes an access device 1240 that enables the computing device 1200 to communicate via one or more networks 1260. Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 1240 may include one or more of any type of network interface, wired or wireless (e.g., a network interface controller (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a worldwide interoperability for microwave access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so forth.
In one embodiment of the application, the above components of the computing device 1200 and other components not shown in fig. 12 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 12 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 1200 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smart glasses, etc.) or other type of mobile device, or a stationary computing device such as a desktop computer or PC (personal computer). Computing device 1200 may also be a mobile or stationary server.
Processor 1220 is used, among other things, for executing the computer-executable instructions of the document classification method.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned document classification method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above-mentioned document classification method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the document classification method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned document classification method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned document classification method.
An embodiment of the present application further provides a chip, in which a computer program is stored, and the computer program implements the steps of the document classification method when executed by the chip.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present application. Furthermore, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in explaining the application. The optional embodiments are not exhaustive and do not limit the application to the precise forms described; obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, thereby enabling others skilled in the art to understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (12)

1. A method of classifying a document, the method comprising:
segmenting a document to be processed to obtain a plurality of texts;
respectively inputting the plurality of texts into a feature extraction model, and determining a category feature of each text;
combining the category features of the plurality of texts to obtain a category feature vector of the document to be processed;
and inputting the category feature vector into a classification model, and determining a category of the document to be processed.
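As a non-authoritative illustration of the method of claim 1, the Python sketch below traces the four claimed steps end to end. The helper names (segment_document, classify_document) and the fixed-length segmentation strategy are assumptions introduced for illustration, not terms used by the claim.

```python
from typing import Callable, List, Sequence

def segment_document(document: str, max_len: int = 512) -> List[str]:
    # Step 1: segment the document to be processed into a plurality of texts
    # (fixed-length chunks are only one possible segmentation strategy).
    return [document[i:i + max_len] for i in range(0, len(document), max_len)]

def classify_document(document: str,
                      feature_model: Callable[[str], Sequence[float]],
                      classifier: Callable[[List[float]], int]) -> int:
    texts = segment_document(document)
    # Step 2: input each text into the feature extraction model.
    features = [feature_model(text) for text in texts]
    # Step 3: combine the per-text category features into one vector.
    doc_vector = [value for feature in features for value in feature]
    # Step 4: input the category feature vector into the classification model.
    return classifier(doc_vector)
```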
2. The method of claim 1, wherein the feature extraction model comprises an input layer, an embedding layer, and an output layer, and wherein respectively inputting the plurality of texts into the feature extraction model and determining the category feature of each text comprises:
performing word segmentation processing on each text through the input layer to obtain word units of each text;
respectively performing word embedding processing on the word units of each text through the embedding layer to obtain word embedding vectors of the word units in each text;
and for any text, determining, through the output layer, the category feature of the text based on the word embedding vectors of the word units in the text.
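A minimal sketch of claim 2's input and embedding layers, assuming a toy whitespace tokenizer and a random embedding table; a real implementation would use the tokenizer and learned embeddings of the chosen feature extraction model.

```python
import numpy as np

VOCAB = {"[CLS]": 0, "[UNK]": 1}            # toy vocabulary (assumption)
EMBED = np.random.rand(len(VOCAB), 128)     # toy embedding table: vocab x dim

def input_layer(text: str) -> list:
    # Word segmentation: a naive whitespace split stands in for a real segmenter.
    return ["[CLS]"] + text.split()

def embedding_layer(word_units: list) -> np.ndarray:
    # Map each word unit to its word embedding vector.
    ids = [VOCAB.get(w, VOCAB["[UNK]"]) for w in word_units]
    return EMBED[ids]                        # shape: (num_word_units, 128)
```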
3. The method of claim 2, wherein the output layer comprises a word-level attention layer and a fully-connected layer, and wherein, for any text, determining, through the output layer, the category feature of the text based on the word embedding vectors of the word units in the text comprises:
for any text, performing, through the word-level attention layer, attention calculation on a word embedding vector of a first word unit of the text and the word embedding vector of each word unit in the text to determine a feature vector of the text, wherein the first word unit is any word unit in the text;
and determining, through the fully-connected layer, the category feature of the text based on the feature vector of the text.
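One reading of claim 3 is plain scaled dot-product self-attention over the word embedding vectors, since the "first word unit" may be any word unit; the sketch below follows that reading and mean-pools the attended vectors into the text's feature vector, a pooling choice assumed here rather than specified by the claim.

```python
import numpy as np

def word_level_attention(E: np.ndarray) -> np.ndarray:
    # E: (num_word_units, dim). Each word unit attends over every word unit.
    scores = E @ E.T / np.sqrt(E.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    attended = weights @ E                          # (num_word_units, dim)
    return attended.mean(axis=0)                    # feature vector of the text

def fully_connected(feature: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Map the text's feature vector to its category feature.
    return W @ feature + b
```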
4. The method of claim 3, wherein the output layer further comprises a text-level attention layer, and wherein, before determining, through the fully-connected layer, the category feature of the text based on the feature vector of the text, the method further comprises:
performing, through the text-level attention layer, attention calculation on the feature vector of the text and the feature vector of each text in the plurality of texts to determine an enhanced feature vector of the text;
and wherein determining, through the fully-connected layer, the category feature of the text based on the feature vector of the text comprises:
determining, through the fully-connected layer, the category feature of the text based on the enhanced feature vector of the text.
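Claim 4's text-level attention can be sketched the same way, one level up: each text's feature vector attends over the feature vectors of all texts in the document, yielding an enhanced feature vector that fuses cross-text context (again a scaled dot-product formulation assumed for illustration).

```python
import numpy as np

def text_level_attention(F: np.ndarray) -> np.ndarray:
    # F: (num_texts, dim) feature vectors of all texts in the document.
    scores = F @ F.T / np.sqrt(F.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ F   # enhanced feature vector for each text
```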
5. The method of claim 1, wherein combining the category features of the plurality of texts to obtain a category feature vector of the document to be processed comprises:
and concatenating the category features of the plurality of texts according to the order of the texts in the document to be processed, to obtain the category feature vector of the document to be processed.
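Claim 5's combining step then reduces to ordered concatenation, assuming each category feature is a NumPy vector and the list is already sorted by each text's position in the document:

```python
import numpy as np

def combine(category_features: list) -> np.ndarray:
    # Concatenation preserves the order of the texts in the document,
    # so positional information survives in the category feature vector.
    return np.concatenate(category_features)
```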
6. The method of any one of claims 1-5, wherein the feature extraction model comprises a BERT model.
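As one way to realize claim 6, the sketch below extracts a per-text feature with a pretrained BERT model through the Hugging Face transformers library; the library, the checkpoint name, and the use of the [CLS] hidden state are assumptions, since the claim only requires that the feature extraction model comprises a BERT model.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def bert_text_feature(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token's hidden state as the text's feature vector.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)
```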
7. The method of claim 1, wherein the classification model is trained by:
obtaining a plurality of sample documents, wherein each sample document corresponds to a category feature vector;
constructing a first decision tree based on the plurality of category feature vectors, and determining a prediction probability of each sample document based on the first decision tree;
and constructing a second decision tree based on the prediction probability of each sample document and the plurality of category feature vectors, determining the prediction probability of each sample document based on the second decision tree, and so on, until a stopping condition is reached, and determining the plurality of constructed decision trees as the classification model.
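The training procedure of claim 7 is the familiar gradient-boosting loop: each new tree is built using the previous trees' predictions, until a stopping condition is met. Below is a simplified residual-fitting variant using scikit-learn's DecisionTreeRegressor as a stand-in for the claimed decision trees; the fixed tree count, depth, and learning rate are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_boosted_trees(X: np.ndarray, y: np.ndarray,
                        n_trees: int = 100, lr: float = 0.1) -> list:
    trees = []
    pred = np.full(len(y), y.mean())        # initial prediction per sample document
    for _ in range(n_trees):                # stopping condition: fixed number of trees
        residual = y - pred                 # what the trees built so far got wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * tree.predict(X)        # update the per-sample prediction
        trees.append(tree)
    return trees                            # the ensemble acts as the classification model
```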
8. The method of claim 1 or 7, wherein the classification model comprises a light gradient boosting machine (LightGBM) model, and a loss function of the classification model is a logarithmic loss function.
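A minimal LightGBM training call matching claim 8, with a multiclass logarithmic loss; the feature matrix, labels, and number of categories are placeholders invented for the example.

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((200, 32))             # placeholder category feature vectors
y_train = rng.integers(0, 4, 200)           # placeholder category labels

params = {
    "objective": "multiclass",              # trained with a logarithmic loss
    "num_class": 4,                         # assumed number of document categories
    "metric": "multi_logloss",
}
model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=100)
```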
9. The method of claim 1, wherein before segmenting the document to be processed to obtain the plurality of texts, the method comprises:
recognizing content of the document to be processed based on a character recognition algorithm to obtain character content of the document to be processed, wherein the character recognition algorithm is used for recognizing the character content in the document;
and wherein segmenting the document to be processed to obtain the plurality of texts comprises:
segmenting the character content according to a preset segmentation strategy to obtain the plurality of texts.
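Claim 9's preprocessing can be sketched as OCR followed by a preset segmentation strategy. pytesseract is one possible character recognition backend, chosen here purely for illustration; the claim does not name an algorithm, and the fixed-length strategy is likewise only one option.

```python
from typing import List

import pytesseract
from PIL import Image

def recognize(path: str) -> str:
    # Character recognition: extract the character content of the document image.
    return pytesseract.image_to_string(Image.open(path), lang="chi_sim")

def segment(character_content: str, max_len: int = 512) -> List[str]:
    # Preset segmentation strategy: fixed-length chunks (sentence- or
    # page-based strategies would also satisfy the claim).
    return [character_content[i:i + max_len]
            for i in range(0, len(character_content), max_len)]
```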
10. A document classification apparatus, comprising:
a segmentation module configured to segment a document to be processed to obtain a plurality of texts;
a first determination module configured to respectively input the plurality of texts into a feature extraction model and determine a category feature of each text;
a combination module configured to combine the category features of the plurality of texts to obtain a category feature vector of the document to be processed;
and a second determination module configured to input the category feature vector into a classification model and determine a category of the document to be processed.
11. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the document classification method of any one of claims 1 to 9.
12. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the document classification method of any one of claims 1 to 9.
CN202210576341.8A 2022-05-25 2022-05-25 Document classification method and device Pending CN114896404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210576341.8A CN114896404A (en) 2022-05-25 2022-05-25 Document classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210576341.8A CN114896404A (en) 2022-05-25 2022-05-25 Document classification method and device

Publications (1)

Publication Number Publication Date
CN114896404A (en) 2022-08-12

Family

ID=82725892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210576341.8A Pending CN114896404A (en) 2022-05-25 2022-05-25 Document classification method and device

Country Status (1)

Country Link
CN (1) CN114896404A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824609A (en) * 2023-06-29 2023-09-29 北京百度网讯科技有限公司 Document format detection method and device and electronic equipment
CN116824609B (en) * 2023-06-29 2024-05-24 北京百度网讯科技有限公司 Document format detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US10789415B2 (en) Information processing method and related device
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN113961685A (en) Information extraction method and device
Yan et al. ConvMath: a convolutional sequence network for mathematical expression recognition
Suleiman et al. Deep learning based extractive text summarization: approaches, datasets and evaluation measures
Ma et al. Tagging the web: Building a robust web tagger with neural network
Çakır et al. Multi-task regularization based on infrequent classes for audio captioning
CN111274829A (en) Sequence labeling method using cross-language information
Moeng et al. Canonical and surface morphological segmentation for nguni languages
Fu et al. RepSum: Unsupervised dialogue summarization based on replacement strategy
Chen et al. A Writing Style Embedding Based on Contrastive Learning for Multi-Author Writing Style Analysis.
CN113065349A (en) Named entity recognition method based on conditional random field
CN114925702A (en) Text similarity recognition method and device, electronic equipment and storage medium
CN114896404A (en) Document classification method and device
Gong et al. Improving extractive document summarization with sentence centrality
Goswami et al. ULD@ NUIG at SemEval-2020 Task 9: Generative morphemes with an attention model for sentiment analysis in code-mixed text
US11379534B2 (en) Document feature repository management
CN112528657A (en) Text intention recognition method and device based on bidirectional LSTM, server and medium
CN113362026A (en) Text processing method and device
CN116798417B (en) Voice intention recognition method, device, electronic equipment and storage medium
CN118093689A (en) Multi-mode document analysis and structuring processing system based on RPA
CN116562286A (en) Intelligent configuration event extraction method based on mixed graph attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination