CN115659969A - Document labeling method and device, electronic equipment and storage medium

Info

Publication number
CN115659969A
Authority
CN
China
Prior art keywords
document
labeled
keyword
label
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211592980.XA
Other languages
Chinese (zh)
Other versions
CN115659969B (en)
Inventor
郑玉玲
王凌云
王梓凝
刘兆蓬
宋丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengfang Financial Technology Co ltd
Original Assignee
Chengfang Financial Technology Co ltd
Application filed by Chengfang Financial Technology Co ltd
Priority to CN202211592980.XA
Publication of CN115659969A
Application granted
Publication of CN115659969B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of document labeling, and provides a document labeling method, a document labeling device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a document to be labeled and a label list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled. By combining the similarity between each keyword and each label in the label list with the word frequency of each keyword in the document to be labeled, the method, the device, the electronic equipment and the storage medium provided by the invention ensure the reliability and accuracy of the determined target label, are not limited by the number of labeled samples that can be acquired, and are easy to implement.

Description

Document labeling method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of document labeling technologies, and in particular, to a document labeling method and apparatus, an electronic device, and a storage medium.
Background
Automatic labeling of documents aims to label a given document with one or more labels, which facilitates subsequent processing of documents such as classification, searching, summarization, and the like.
In the prior art, a traditional machine learning document labeling method and a deep learning document labeling method are both supervised learning methods, and the training of a model thereof depends on a large amount of labeling data. However, in practical applications, in some scenarios, only a part of unlabeled documents and a label list can be obtained, and in other scenarios, due to problems such as data privacy, only the label list can be obtained, and the absence of labeled samples directly affects the reliability of automatic labeling of documents.
Disclosure of Invention
The invention provides a document labeling method, a document labeling device, electronic equipment and a storage medium, which are used for solving the defect that the document labeling method for supervised learning in the prior art depends on a large amount of labeling data for training.
The invention provides a document labeling method, which comprises the following steps:
acquiring a document to be annotated and a tag list;
extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
According to the document labeling method provided by the invention, the determining of the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the label scores of the plurality of labels.
According to a document labeling method provided by the present invention, determining label scores of a plurality of labels of a document to be labeled based on similarity between each keyword and each label in the label list and word frequency of each keyword in the document to be labeled, comprises:
determining label scores of a plurality of labels of the document to be labeled based on the following formula:
$$s_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j)\,\hat{f}_i$$
wherein $s_j$ represents the label score of the $j$-th label of the document to be labeled, $k_i$ denotes the $i$-th keyword, $t_j$ denotes the $j$-th label, $n$ denotes the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is the similarity between the $i$-th keyword and the $j$-th label, $f_i$ is the word frequency of the $i$-th keyword in the document to be labeled, and $\hat{f}_i$ is the word frequency obtained by normalizing $f_i$.
According to the document labeling method provided by the invention, the step of determining the target label of the document to be labeled based on the label scores of the plurality of labels comprises the following steps:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold score and/or the preset label number of the document to be labeled, and determining the labels obtained by screening as the target labels of the document to be labeled.
According to the document labeling method provided by the invention, the extracting of the keywords from the document to be labeled to obtain a plurality of keywords comprises the following steps:
applying a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
According to the document labeling method provided by the invention, the step of obtaining the sample text and the sample keywords corresponding to the sample text comprises the following steps:
acquiring paper documents related to each label in the label list, wherein each paper document carries paper keywords;
and determining the sample text based on the paper document, and determining the sample keywords corresponding to the sample text based on the paper keywords.
According to a document labeling method provided by the present invention, the determining the sample text based on the paper document comprises:
determining the sample text based on the title and abstract in the paper document.
The invention also provides a document labeling device, comprising:
the acquisition unit is used for acquiring a document to be annotated and a label list;
the keyword extraction unit is used for extracting keywords from the document to be labeled to obtain a plurality of keywords and counting the word frequency of each keyword in the document to be labeled;
and the tag determining unit is used for determining a target tag of the document to be labeled based on the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be labeled.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the document marking method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the document labeling method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the document labeling method as described in any one of the above.
The document labeling method, the device, the electronic equipment and the storage medium provided by the invention determine the target label of the document to be labeled by combining the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled, and the combination of the similarity and the word frequency ensures the reliability and the accuracy of the determination of the target label, is not limited by the acquisition quantity of labeled samples, is easy to realize and has strong reliability of the target label.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a document labeling method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of step 130 in the document labeling method provided in the present invention;
FIG. 3 is a schematic flow chart illustrating steps of obtaining a sample text and a sample keyword corresponding to the sample text according to the present invention;
FIG. 4 is a second flowchart illustrating a document labeling method according to the present invention;
FIG. 5 is a schematic structural diagram of a document labeling apparatus provided in the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the related art, automatic labeling of documents aims to mark one or more labels on a given document, so that subsequent processing of classifying, searching, abstracting and the like on the document is facilitated. In a document management scenario, such as scenarios of artificial intelligence, big data, block chains, etc., a tag library is usually existed, and when a new document is put in storage, tags in the existing tag library need to be marked on the new document.
A common document labeling method is text classification, which treats document labeling as a multi-class classification task. Traditional text classification first obtains text features using methods such as BoW (Bag of Words) and TF-IDF (Term Frequency-Inverse Document Frequency), and then builds a text classification model using machine learning algorithms such as Naive Bayes, SVM (Support Vector Machine) and random forest. Since the BERT model was proposed in 2019, deep learning text classification models based on BERT (Bidirectional Encoder Representations from Transformers) have become the mainstream text classification method.
For English text labeling, a text classification method that uses only label names and no labeled data has been proposed; however, that method relies on the BERT model to predict synonyms of the labels. To obtain semantically correct synonyms, each label must be a minimal, indivisible word unit, such as the common words good, bad, commerce and economy.
However, in Chinese text labeling scenarios, a label is usually two or more characters long, for example "artificial intelligence", which the BERT model splits into 4 tokens (one per Chinese character). It is therefore difficult for the BERT model to produce a phrase with the correct semantics, so that method cannot be applied directly to Chinese text labeling.
In view of the above problem, the present invention provides a document labeling method, and fig. 1 is a schematic flow chart of the document labeling method provided by the present invention, as shown in fig. 1, the method includes:
Step 110, acquiring a document to be annotated and a tag list.
Specifically, a to-be-annotated document and a tag list may be obtained, where the to-be-annotated document is a document that needs to be subsequently annotated, the to-be-annotated document may be a document formed by a text directly input by a user, or a document formed by a text obtained by performing voice transcription on an acquired audio, or a document formed by a text obtained by acquiring an image through an image acquisition device such as a scanner, a mobile phone, or a camera, and performing Optical Character Recognition (OCR) on the image, which is not specifically limited in the embodiment of the present invention.
The tag list refers to a set of tags, and the tag list may be preset or crawled on a web page, which is not specifically limited in this embodiment of the present invention.
Step 120, extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled.
Specifically, after the document to be annotated is obtained, keyword extraction may be performed on the document to be annotated to obtain a plurality of keywords. The keyword extraction may use a keyword extraction model, which may be a BERT (Bidirectional Encoder Representations from Transformers) model, an LSTM-CRF (Long Short-Term Memory - Conditional Random Field) model, or a BERT-CRF model, which is not specifically limited in this embodiment of the present invention.
The keywords of the document to be labeled reflect its key points. They may be, for example, "artificial intelligence" and "blockchain", or "big data" and "natural language processing", or all of "artificial intelligence", "big data", "natural language processing" and "blockchain", which is not specifically limited in this embodiment of the present invention.
After obtaining the keywords, the word frequency of each keyword in the document to be labeled can be counted, where the word frequency refers to the number of times that each keyword appears in the document to be labeled, for example, the word frequency of each keyword in the document to be labeled can be [ ("artificial intelligence", 5), ("big data", 2), ("natural language processing", 1) ] or the like.
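The counting step can be illustrated with a minimal Python sketch. The function name `count_keyword_frequencies` and the use of plain substring counting are assumptions made for illustration; the source does not specify how occurrences are matched.

```python
from collections import Counter


def count_keyword_frequencies(document: str, keywords: list[str]) -> list[tuple[str, int]]:
    """Count how many times each distinct keyword occurs in the document.

    Assumption: an occurrence is a non-overlapping substring match.
    """
    counts = Counter()
    for keyword in dict.fromkeys(keywords):  # drop duplicates, keep order
        counts[keyword] = document.count(keyword)
    # Most frequent keywords first, mirroring the example list in the text.
    return counts.most_common()
```

The result has the same shape as the example above, a list of (keyword, word frequency) pairs.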
Step 130, determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
Specifically, after the word frequency of each keyword in the document to be labeled is obtained through statistics, the target label of the document to be labeled can be determined based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled. The target tag here refers to a final tag of the document to be labeled, and the target tag may be one, multiple, or empty, which is not specifically limited in this embodiment of the present invention.
The similarity between each keyword and each tag in the tag list may be obtained by calculating using methods such as cosine similarity and Pearson Correlation Coefficient (Pearson Correlation Coefficient), and before the similarity is calculated, word encoding may be performed on each keyword and each tag in the tag list using word2vec embedded representation (Embedding), and then the similarity is calculated based on a vector after the word encoding, which is not specifically limited in the embodiment of the present invention.
Here, the similarity between each keyword and each tag in the tag list reflects the matching degree between each keyword and each tag in the tag list. It can be understood that the higher the similarity between each keyword and each tag in the tag list, the more matched each keyword and each tag in the tag list; the lower the similarity between each keyword and each tag in the tag list, the more mismatched each keyword and each tag in the tag list.
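As a rough illustration of the cosine-similarity option, the sketch below compares keyword vectors with tag vectors. The vectors are assumed to come from some embedding step such as word2vec; the helper names `cosine_similarity` and `similarity_matrix` are hypothetical and not part of the source.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(keyword_vecs: dict, tag_vecs: dict) -> dict:
    """Map each keyword to a dict of {tag: similarity} over all tags in the tag list."""
    return {
        keyword: {tag: cosine_similarity(kv, tv) for tag, tv in tag_vecs.items()}
        for keyword, kv in keyword_vecs.items()
    }
```

A higher value indicates a closer match between a keyword and a tag, which is the property the scoring step relies on.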
The word frequency of each keyword in the document to be labeled reflects the occurrence frequency of each keyword in the document to be labeled, and the occurrence frequency of a certain keyword in the document to be labeled can reflect the importance degree of the keyword in the document to be labeled.
For example, the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be annotated can be used as the criterion for evaluating the target tag of the document to be annotated, so as to obtain the target tag of the document to be annotated.
The method provided by the embodiment of the invention determines the target label of the document to be labeled by combining the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled, and the combination of the similarity and the word frequency ensures the reliability and the accuracy of the determination of the target label, is not limited by the acquisition quantity of the labeled samples, is easy to realize and has strong reliability of the target label.
Based on the above embodiment, fig. 2 is a schematic flow chart of step 130 in the document annotation method provided by the present invention, and as shown in fig. 2, step 130 includes:
step 131, determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
step 132, determining a target label of the document to be labeled based on the label scores of the plurality of labels.
Specifically, after the word frequency of each keyword in the document to be labeled is obtained, the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be labeled may be weighted to obtain tag scores of multiple tags of the document to be labeled, where the tag score reflects the score of each tag as the target tag, or reflects the probability of each tag as the target tag, and may be 0.5, or 0.8, or 0.7, and the like, which is not specifically limited in this embodiment of the present invention.
The word frequency of each keyword in the document to be labeled reflects the occurrence frequency of each keyword in the document to be labeled, and the occurrence frequency of a certain keyword in the document to be labeled can reflect the importance degree of the keyword in the document to be labeled. It can be understood that the greater the word frequency of the keyword in the document to be labeled, the more the keyword can affect the label score of the label similar to the keyword in the document to be labeled; the smaller the word frequency of the keyword in the document to be labeled is, the less the keyword affects the label score of the label of the document to be labeled similar to the keyword, so that the word frequency of each keyword in the document to be labeled can be used as the judgment basis of the label scores of a plurality of labels of the document to be labeled.
After the tag scores of the multiple tags of the document to be annotated are obtained, the target tag of the document to be annotated can be determined based on the tag scores of the multiple tags. The target label is the final label of the document to be labeled.
For example, the plurality of tags may be filtered based on the tag scores of the plurality of tags, and those tags with higher scores in the tag scores of the plurality of tags may be determined as the target tags of the document to be labeled.
The method provided by the embodiment of the invention determines the target label of the document to be labeled based on the label scores of the plurality of labels, wherein the label score reflects the score of each label as the target label or reflects the probability of each label as the target label, thereby ensuring the reliability and the accuracy of the target label of the document to be labeled.
Based on the above embodiment, step 131 includes:
determining label scores of a plurality of labels of the document to be labeled based on the following formula:
$$s_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j)\,\hat{f}_i$$
wherein $s_j$ represents the label score of the $j$-th label of the document to be labeled, $k_i$ denotes the $i$-th keyword, $t_j$ denotes the $j$-th label, $n$ denotes the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is the similarity between the $i$-th keyword and the $j$-th label, $f_i$ is the word frequency of the $i$-th keyword in the document to be labeled, and $\hat{f}_i$ is the word frequency obtained by normalizing $f_i$.
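The weighting in step 131 can be sketched as follows, assuming the label score is a sum of keyword-to-label similarities weighted by normalized word frequency. Both the summation form and the normalization (dividing each word frequency by the total) are assumptions for illustration; the source only states that a normalized word frequency is used. The function name `label_scores` is hypothetical.

```python
def label_scores(similarity: dict, frequencies: dict) -> dict:
    """Score each label as the frequency-weighted sum of its keyword similarities.

    similarity:  {keyword: {label: similarity value}}
    frequencies: {keyword: raw word frequency in the document}
    """
    total = sum(frequencies.values())
    if total == 0:
        return {}
    # Assumed normalization: each frequency divided by the sum of all frequencies.
    normalized = {kw: freq / total for kw, freq in frequencies.items()}
    labels = {label for sims in similarity.values() for label in sims}
    return {
        label: sum(
            similarity.get(kw, {}).get(label, 0.0) * weight
            for kw, weight in normalized.items()
        )
        for label in labels
    }
```

With this choice of normalization, a label score stays within the range of the similarity values, so a keyword that appears often pulls the score toward its own similarity to the label.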
Based on the above embodiment, step 132 includes:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold score and/or the preset label number of the document to be labeled, and determining the labels obtained by screening as the target labels of the document to be labeled.
Specifically, after the tag scores of the multiple tags are obtained, the multiple tags may be screened in one of three ways, and the tags obtained by screening are determined as the target tags of the document to be labeled: screening based on the tag scores and the threshold score; screening based on the tag scores and the preset number of tags of the document to be labeled; or screening based on the tag scores together with both the threshold score and the preset number of tags of the document to be labeled.
The threshold score is a threshold label score, and may be set in advance or may be set according to actual conditions. The preset number of tags of the document to be annotated refers to the number of tags required by the document to be annotated, and may be preset or set according to an actual situation, which is not specifically limited in the embodiment of the present invention.
For example, suppose the threshold score is 0.5, the preset number of tags of the document to be labeled is 5, and the tag scores are 0.6 for "artificial intelligence", 0.7 for "support vector machine" and 0.8 for "natural language processing". All three scores exceed the threshold score and the number of tags does not exceed the preset number, so after screening, "artificial intelligence", "support vector machine" and "natural language processing" may be determined as the target tags of the document to be labeled.
In addition, before the plurality of tags are screened based on the tag scores of the plurality of tags, the threshold score and/or the preset number of tags of the document to be labeled, the tag scores of the plurality of tags may be sorted, and the plurality of tags may be screened based on the sorted tag scores of the plurality of tags. Here, the sorting of the label scores of the plurality of labels may be performed by sorting the label scores of the plurality of labels from high to low, or by sorting the label scores of the plurality of labels from low to high, which is not specifically limited in the embodiment of the present invention.
The method provided by the embodiment of the invention screens the plurality of labels based on the label scores of the plurality of labels and in combination with the conditions of the threshold score and/or the preset number of labels of the document to be marked, and determines the screened labels as the target labels of the document to be marked, thereby ensuring the accuracy of determining the target labels of the document to be marked.
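The screening step can be sketched as below. The function name `select_target_labels`, the choice of "greater than or equal to" for the threshold comparison, and treating the preset number as an upper limit are assumptions for illustration.

```python
from typing import Optional


def select_target_labels(
    scores: dict,
    threshold: Optional[float] = None,
    max_labels: Optional[int] = None,
) -> list:
    """Sort labels by score (high to low), then apply the threshold and/or the label limit."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(label, score) for label, score in ranked if score >= threshold]
    if max_labels is not None:
        ranked = ranked[:max_labels]
    return [label for label, _ in ranked]
```

Passing only `threshold`, only `max_labels`, or both corresponds to the three screening variants described above; an empty result corresponds to the case where the target label is empty.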
Based on the above embodiment, step 120 includes:
step 121, applying a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
Specifically, in order to extract the keywords of the document to be labeled, before step 121, the keyword extraction model needs to be obtained through the following steps:
Sample texts and the sample keywords corresponding to the sample texts can be collected in advance, and an initial keyword extraction model can be constructed, which is the starting model from which the keyword extraction model is trained. Here, the initial keyword extraction model may include a BERT model and a classification layer, where the classification layer may be a softmax layer or a CRF (Conditional Random Field) layer, which is not specifically limited in this embodiment of the present invention.
After the initial keyword extraction model is obtained, the sample texts collected in advance and the sample keywords corresponding to the sample texts can be applied to train the initial keyword extraction model:
The sample text can be input into the initial keyword extraction model, which extracts keywords from the sample text and outputs the predicted keywords of the sample text.
After the predicted keywords are obtained from the initial keyword extraction model, they can be compared with the sample keywords corresponding to the sample text collected in advance. A loss function value is calculated from the degree of difference between the predicted keywords and the sample keywords, parameter iteration is carried out on the initial keyword extraction model based on the loss function value, and the initial keyword extraction model after parameter iteration is completed is recorded as the keyword extraction model.
It can be understood that the greater the degree of difference between the predicted keywords and the sample keywords corresponding to the sample text collected in advance, the greater the loss function value; the smaller the degree of difference, the smaller the loss function value.
In other words, during training the initial keyword extraction model learns how to extract keywords from a document, so that the trained model can extract keywords usable for determining the target label of the document to be labeled.
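One common way to prepare such training data for a token classification layer (softmax or CRF) is to convert each sample text and its sample keywords into per-token B/I/O tags. The source does not state the tagging scheme, so the BIO scheme and the function name `bio_tags` below are assumptions for illustration only.

```python
def bio_tags(tokens: list, keywords: list) -> list:
    """Tag each token as B (keyword start), I (inside keyword) or O (outside).

    tokens:   the sample text as a list of tokens
    keywords: each sample keyword as a list of tokens
    """
    tags = ["O"] * len(tokens)
    for keyword in keywords:
        length = len(keyword)
        if length == 0:
            continue
        for start in range(len(tokens) - length + 1):
            window_free = all(tag == "O" for tag in tags[start:start + length])
            if tokens[start:start + length] == keyword and window_free:
                tags[start] = "B"
                for k in range(start + 1, start + length):
                    tags[k] = "I"
    return tags
```

Under this assumption, the loss compares the model's predicted tag sequence with the tag sequence derived from the sample keywords, so a larger mismatch yields a larger loss.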
In the related art, the sample keywords corresponding to a sample text are usually difficult to obtain when training a keyword extraction model. To address this problem, in the embodiment of the invention the sample text is determined based on paper documents related to each label in the label list, and the sample keywords corresponding to the sample text are the paper keywords carried in those paper documents.
Based on the above embodiment, fig. 3 is a schematic flow chart of the steps of obtaining the sample text and the sample keywords corresponding to the sample text, and as shown in fig. 3, the steps of obtaining the sample text and the sample keywords corresponding to the sample text include:
step 310, acquiring paper documents related to each label in the label list, wherein each paper document carries paper keywords;
step 320, determining the sample text based on the paper document, and determining the sample keywords corresponding to the sample text based on the paper keywords.
Specifically, the paper documents related to each tag in the tag list can be obtained, and the paper documents carry the paper keywords, that is, the paper keywords do not need to be manually labeled, so that a large amount of time cost is saved, and the obtaining efficiency of the subsequent sample text and the sample keywords corresponding to the sample text is improved.
It can be understood that after each tag in the tag list is obtained, paper documents related to each tag can be matched from an open source data set, and the open source data set can be obtained by crawling websites from which paper documents can be downloaded.
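A minimal sketch of the matching step is shown below. The matching rule (a tag counts as related when it appears, case-insensitively, in a paper's title or keywords), the record layout with `title` and `keywords` fields, and the function name `papers_for_tags` are all assumptions; the source does not specify how relatedness is decided.

```python
def papers_for_tags(papers: list, tags: list) -> dict:
    """Group paper records under every tag that appears in their title or keywords."""
    related = {tag: [] for tag in tags}
    for paper in papers:
        searchable = " ".join([paper["title"], *paper["keywords"]]).lower()
        for tag in tags:
            if tag.lower() in searchable:
                related[tag].append(paper)
    return related
```

A tag with no matching papers simply maps to an empty list, which signals that more data would need to be collected for that tag.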
After obtaining the paper documents related to each tag in the tag list, the sample text may be determined based on a paper document. For example, the paper document may be used directly as the sample text, or the text that represents the core idea of the paper document may be used as the sample text.
Then, a sample keyword corresponding to the sample text can be determined based on the paper keyword. For example, a paper keyword carried by the paper document itself may be used as a sample keyword corresponding to the sample text.
For example, the sample texts and their corresponding sample keywords may be organized as pairs: (sample text 1, [sample keyword 1 corresponding to sample text 1, ...]), (sample text 2, [sample keyword 1 corresponding to sample text 2, ...]), ..., (sample text N, [sample keyword 1 corresponding to sample text N, ...]).
In the method provided by the embodiment of the invention, the sample text is determined based on the paper document, the paper document carries the paper keywords, and the sample keywords corresponding to the sample text are determined based on the paper keywords, namely the sample keywords corresponding to the sample text do not need to be labeled manually, so that a large amount of time cost is saved.
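A minimal sketch of steps 310 and 320, assuming each paper document is available as a record with a title, a body, and author-supplied keywords. The `PaperDocument` type, its field names, and the case-insensitive substring match used to decide whether a paper relates to a tag are illustrative assumptions; the embodiment does not fix how relatedness is determined.

```python
from dataclasses import dataclass


@dataclass
class PaperDocument:
    # Hypothetical record layout for a crawled paper document.
    title: str
    body: str
    keywords: list[str]


def papers_for_tags(papers, tags):
    """Step 310 (assumed matching rule): keep papers whose title or
    keywords mention at least one tag, compared case-insensitively."""
    lowered_tags = [t.lower() for t in tags]
    related = []
    for paper in papers:
        haystack = " ".join([paper.title, *paper.keywords]).lower()
        if any(tag in haystack for tag in lowered_tags):
            related.append(paper)
    return related


def build_samples(papers):
    """Step 320: pair each sample text with the paper's own keywords,
    so no manual keyword labeling is required."""
    return [(p.title + "\n" + p.body, list(p.keywords)) for p in papers]


papers = [
    PaperDocument("Payment clearing with blockchain", "Full text ...",
                  ["blockchain", "clearing"]),
    PaperDocument("Soil erosion in river deltas", "Full text ...",
                  ["erosion", "sediment"]),
]
samples = build_samples(papers_for_tags(papers, ["Blockchain", "Risk control"]))
```

Here the whole document is used as the sample text, which corresponds to the first example given for step 320.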
In the related art, when a keyword extraction model is trained on sample texts and their corresponding sample keywords, the whole document is usually used as the sample text, which increases the training cost of the keyword extraction model and reduces its training efficiency.
Based on the above embodiment, step 320 includes:
sample text is determined based on the title and abstract in the paper document.
Specifically, after obtaining a paper document related to each tag in the tag list, the sample text may be determined based on the title and the abstract in the paper document. For example, the title and abstract of a paper document may be taken directly as sample text.
The method provided by the embodiment of the invention determines the sample text based on the title and the abstract in the paper document. Compared with the traditional approach of determining the sample text from the whole document, this shortens the sample text and thereby improves the efficiency of training the keyword extraction model.
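As an illustration, the sketch below builds the sample text from the title and abstract only. The dictionary keys and the newline separator are assumptions made for the example.

```python
def sample_text_from_paper(paper: dict) -> str:
    """Build the sample text from the title and abstract only,
    ignoring the (much longer) body of the paper."""
    title = paper.get("title", "").strip()
    abstract = paper.get("abstract", "").strip()
    # Skip empty fields so a missing abstract does not leave a stray newline.
    return "\n".join(part for part in (title, abstract) if part)


paper = {
    "title": "Document tagging with keyword similarity",
    "abstract": "We score candidate tags by keyword similarity and word frequency.",
    "body": "Section 1. " + "Long body text. " * 500,
}
text = sample_text_from_paper(paper)
```

The resulting `text` is far shorter than `paper["body"]`, which is the source of the training-cost saving described above.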
Based on any of the above embodiments, the present invention provides a document annotation method, and fig. 4 is a second flowchart of the document annotation method provided by the present invention, as shown in fig. 4, the method includes:
Step 410, a document to be annotated and a tag list can be obtained.
Step 420, a keyword extraction model can be applied to extract keywords from the document to be labeled to obtain a plurality of keywords, and the word frequency of each keyword in the document to be labeled is counted. The keyword extraction model is obtained by training based on the sample text and sample keywords corresponding to the sample text.
Here, the step of obtaining the sample text and the sample keywords corresponding to the sample text includes:
paper documents related to each label in the label list can be obtained, wherein the paper documents carry paper keywords;
the sample text can be determined based on the title and abstract in the paper document, and the sample keywords corresponding to the sample text can be determined based on the paper keywords.
Step 430, tag scores of a plurality of tags of the document to be tagged can be determined based on the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be tagged.
The label scores of a plurality of labels of the document to be labeled can be determined based on the following formula:
$$S_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j) \cdot \widetilde{tf}_i$$
wherein $S_j$ represents the label score of the $j$-th label of the document to be labeled, $k_i$ denotes the $i$-th keyword, $t_j$ denotes the $j$-th label, $n$ indicates the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be labeled, and $\widetilde{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
Step 440, the multiple tags may be filtered based on the tag scores of the multiple tags, the threshold score and/or the preset number of tags of the document to be labeled, and the filtered tags are determined as the target tags of the document to be labeled.
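Steps 420 and 430 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the score of a tag is taken as the sum, over keywords, of keyword-tag similarity multiplied by normalized word frequency; word frequency is normalized by dividing by the total count; and `char_bigram_similarity` is a toy stand-in for whatever similarity measure is actually used. All function names are hypothetical, and the keywords are passed in directly rather than produced by a trained keyword extraction model.

```python
import math
from collections import Counter


def count_word_frequency(document: str, keywords: list[str]) -> dict[str, int]:
    """Count occurrences of each keyword in the document.
    Uses case-insensitive substring counting, which is a simplification."""
    text = document.lower()
    return {kw: text.count(kw.lower()) for kw in keywords}


def normalize(freqs: dict[str, int]) -> dict[str, float]:
    """Assumed normalization: divide each frequency by the total."""
    total = sum(freqs.values())
    if total == 0:
        return {kw: 0.0 for kw in freqs}
    return {kw: f / total for kw, f in freqs.items()}


def char_bigram_similarity(a: str, b: str) -> float:
    """Toy similarity: cosine similarity of character-bigram counts."""
    def bigrams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def tag_scores(document, keywords, tags, similarity=char_bigram_similarity):
    """Score each tag as the sum over keywords of
    similarity(keyword, tag) * normalized word frequency."""
    weights = normalize(count_word_frequency(document, keywords))
    return {
        tag: sum(similarity(kw, tag) * weights[kw] for kw in keywords)
        for tag in tags
    }


doc = "Blockchain settlement reduces clearing delay. Blockchain ledgers are shared."
scores = tag_scores(doc, ["blockchain", "clearing"], ["blockchain", "weather"])
```

Passing the similarity function as a parameter keeps the scoring logic independent of the similarity measure, so a word-vector-based measure could be substituted without changing `tag_scores`.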
The document labeling device provided by the present invention is described below. The document labeling device described below and the document labeling method described above may be read with reference to each other.
Based on any of the above embodiments, the present invention provides a document annotation device, and fig. 5 is a schematic structural diagram of the document annotation device provided by the present invention, as shown in fig. 5, the device includes:
an obtaining unit 510, configured to obtain a to-be-annotated document and a tag list;
a keyword extraction unit 520, configured to perform keyword extraction on the document to be labeled to obtain a plurality of keywords, and count word frequencies of the keywords in the document to be labeled;
a tag determining unit 530, configured to determine a target tag of the document to be labeled based on similarities between the keywords and the tags in the tag list and word frequencies of the keywords in the document to be labeled.
The device provided by the embodiment of the invention determines the target label of the document to be labeled by combining the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled, and the combination of the similarity and the word frequency ensures the reliability and accuracy of the determination of the target label, is not limited by the acquisition quantity of labeled samples, is easy to realize and has strong reliability of the target label.
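One way to picture units 510, 520 and 530 is as components wired together in sequence. The class name `DocumentAnnotationDevice` and the callable signatures are hypothetical; the stub callables in the usage lines only demonstrate the data flow and do not implement real keyword extraction or scoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DocumentAnnotationDevice:
    # Unit 510: returns the document to be annotated and the tag list.
    obtain: Callable[[], tuple[str, list[str]]]
    # Unit 520: returns each extracted keyword with its word frequency.
    extract_keywords: Callable[[str], dict[str, int]]
    # Unit 530: maps keyword frequencies and the tag list to target tags.
    determine_tags: Callable[[dict[str, int], list[str]], list[str]]

    def run(self) -> list[str]:
        document, tags = self.obtain()
        keyword_freqs = self.extract_keywords(document)
        return self.determine_tags(keyword_freqs, tags)


# Stub units for illustration only.
device = DocumentAnnotationDevice(
    obtain=lambda: ("credit risk and credit scoring", ["credit", "weather"]),
    extract_keywords=lambda doc: {"credit": doc.count("credit")},
    determine_tags=lambda freqs, tags: [t for t in tags if freqs.get(t, 0) > 0],
)
target_tags = device.run()
```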
Based on any of the above embodiments, the tag determining unit specifically includes:
a tag score determining unit, configured to determine tag scores of multiple tags of the document to be tagged based on similarities between the keywords and the tags in the tag list and word frequencies of the keywords in the document to be tagged;
and the target label determining unit is used for determining a target label of the document to be labeled based on the label scores of the plurality of labels.
Based on any of the above embodiments, the tag score determining unit is specifically configured to:
determining label scores of a plurality of labels of the document to be labeled based on the following formula:
$$S_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j) \cdot \widetilde{tf}_i$$
wherein $S_j$ represents the label score of the $j$-th label of the document to be labeled, $k_i$ denotes the $i$-th keyword, $t_j$ denotes the $j$-th label, $n$ indicates the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be labeled, and $\widetilde{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
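As a worked illustration with made-up numbers, suppose the score of a tag is the similarity-weighted sum of normalized word frequencies (an assumed reading of the formula), that normalization divides each word frequency by the total, that two keywords were extracted with word frequencies 3 and 1, and that their similarities to the $j$-th tag are 0.8 and 0.2. The symbols $S_j$, $\mathrm{sim}$, $tf_i$ and $\widetilde{tf}_i$ are notation chosen for this example.

```latex
\begin{aligned}
\widetilde{tf}_1 &= \frac{3}{3+1} = 0.75, \qquad
\widetilde{tf}_2 = \frac{1}{3+1} = 0.25,\\
S_j &= \sum_{i=1}^{2} \mathrm{sim}(k_i, t_j)\,\widetilde{tf}_i
     = 0.8 \times 0.75 + 0.2 \times 0.25 = 0.65 .
\end{aligned}
```

A keyword that is both frequent and close to the tag dominates the score, while a rare or dissimilar keyword contributes little.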
Based on any of the embodiments described above, the target tag determining unit is specifically configured to:
screen the plurality of labels based on the label scores of the plurality of labels of the document to be labeled, together with a threshold score and/or a preset number of labels, and determine the labels obtained by screening as the target labels of the document to be labeled.
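The screening rule can be read in three ways: keep tags at or above a threshold score, keep a preset number of top-scoring tags, or apply both. The sketch below supports all three; the function name `screen_tags`, the parameter names, and the choice to treat the threshold as inclusive are assumptions.

```python
from typing import Optional


def screen_tags(scores: dict[str, float],
                threshold: Optional[float] = None,
                max_tags: Optional[int] = None) -> list[str]:
    """Return tags ordered by descending score, filtered by an optional
    score threshold and/or truncated to an optional preset number."""
    # Sort by score (highest first); ties are broken alphabetically
    # so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if threshold is not None:
        ranked = [(tag, s) for tag, s in ranked if s >= threshold]
    if max_tags is not None:
        ranked = ranked[:max_tags]
    return [tag for tag, _ in ranked]


scores = {"blockchain": 0.71, "payments": 0.42, "weather": 0.05}
by_threshold = screen_tags(scores, threshold=0.4)
top_one = screen_tags(scores, max_tags=1)
both = screen_tags(scores, threshold=0.4, max_tags=1)
```

Applying the threshold before the count limit means a document can receive fewer tags than the preset number when few tags score highly enough.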
Based on any of the above embodiments, the keyword extraction unit is specifically configured for:
applying a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
Based on any one of the above embodiments, the device further includes the following units for obtaining the sample text and the sample keywords corresponding to the sample text:
a document obtaining unit, configured to obtain a thesis document related to each tag in the tag list, where the thesis document carries a thesis keyword;
and the text and keyword determining unit is used for determining the sample text based on the paper document and determining a sample keyword corresponding to the sample text based on the paper keyword.
Based on any of the embodiments described above, the text and keyword determining unit is specifically configured to:
determine the sample text based on the title and abstract in the paper document.
Fig. 6 illustrates a physical structure diagram of an electronic device. As shown in fig. 6, the electronic device may include a processor 610, a communication interface 620, a memory 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other through the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a document annotation method comprising: acquiring a document to be annotated and a tag list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
In addition, when the logic instructions in the memory 630 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product including a computer program stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer can perform the document annotation method provided above, the method including: acquiring a document to be annotated and a tag list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the document annotation method provided above, the method including: acquiring a document to be annotated and a tag list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware. Based on this understanding, the above technical solutions in essence, or the parts that contribute over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A document labeling method is characterized by comprising the following steps:
acquiring a document to be annotated and a tag list;
extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
2. The method according to claim 1, wherein the determining the target tag of the to-be-annotated document based on the similarity between the keyword and the tag in the tag list and the word frequency of the keyword in the to-be-annotated document comprises:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the label scores of the plurality of labels.
3. The method of claim 2, wherein the determining the tag scores of the tags of the document to be tagged based on the similarity between the keywords and the tags in the tag list and the word frequency of the keywords in the document to be tagged comprises:
determining label scores of a plurality of labels of the document to be labeled based on the following formula:
$$S_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j) \cdot \widetilde{tf}_i$$
wherein $S_j$ represents the label score of the $j$-th label of the document to be labeled, $k_i$ denotes the $i$-th keyword, $t_j$ denotes the $j$-th label, $n$ indicates the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be labeled, and $\widetilde{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
4. The document annotation method of claim 2, wherein the determining the target tag of the document to be annotated based on the tag scores of the plurality of tags comprises:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold score and/or the preset label number of the document to be labeled, and determining the labels obtained by screening as the target labels of the document to be labeled.
5. The method according to any one of claims 1 to 4, wherein the extracting keywords from the document to be labeled to obtain a plurality of keywords comprises:
applying a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
6. The document labeling method of claim 5, wherein the obtaining step of the sample text and the sample keywords corresponding to the sample text comprises:
acquiring a paper document related to each label in the label list, wherein the paper document carries a paper keyword;
and determining the sample text based on the paper document, and determining a sample keyword corresponding to the sample text based on the paper keyword.
7. The method of claim 6, wherein said determining the sample text based on the paper document comprises:
sample text is determined based on the title and abstract in the paper document.
8. A document labeling apparatus, comprising:
the acquisition unit is used for acquiring a document to be annotated and a label list;
the keyword extraction unit is used for extracting keywords from the document to be labeled to obtain a plurality of keywords and counting the word frequency of each keyword in the document to be labeled;
and the tag determining unit is used for determining a target tag of the document to be labeled based on the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be labeled.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the document annotation method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the document annotation method according to any one of claims 1 to 7.
CN202211592980.XA 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium Active CN115659969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592980.XA CN115659969B (en) 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211592980.XA CN115659969B (en) 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115659969A true CN115659969A (en) 2023-01-31
CN115659969B CN115659969B (en) 2023-04-28

Family

ID=85017459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592980.XA Active CN115659969B (en) 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115659969B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971780A (en) * 2023-12-29 2024-05-03 青矩技术股份有限公司 Document storage method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103235774A (en) * 2013-04-27 2013-08-07 杭州电子科技大学 Extraction method of feature words of science and technology project application form
CN110489649A (en) * 2019-08-19 2019-11-22 北京创鑫旅程网络技术有限公司 The method and device of label association content
CN110717092A (en) * 2018-06-27 2020-01-21 北京京东尚科信息技术有限公司 Method, system, device and storage medium for matching objects for articles
CN110781297A (en) * 2019-09-18 2020-02-11 国家计算机网络与信息安全管理中心 Classification method of multi-label scientific research papers based on hierarchical discriminant trees
CN111967262A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Method and device for determining entity tag
US20220019741A1 (en) * 2020-07-16 2022-01-20 Optum Technology, Inc. An unsupervised approach to assignment of pre-defined labels to text documents

Also Published As

Publication number Publication date
CN115659969B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
CN107229668B (en) Text extraction method based on keyword matching
CN107463605B (en) Method and device for identifying low-quality news resource, computer equipment and readable medium
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN107437038B (en) Webpage tampering detection method and device
WO2022095374A1 (en) Keyword extraction method and apparatus, and terminal device and storage medium
CN109241277B (en) Text vector weighting method and system based on news keywords
CN111160019B (en) Public opinion monitoring method, device and system
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
CN109145180B (en) Enterprise hot event mining method based on incremental clustering
US20230074771A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
CN113722492A (en) Intention identification method and device
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN113255319B (en) Model training method, text segmentation method, abstract extraction method and device
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN115659969B (en) Document labeling method, device, electronic equipment and storage medium
TWI681304B (en) System and method for adaptively adjusting related search words
CN115062621A (en) Label extraction method and device, electronic equipment and storage medium
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN113011174B (en) Method for identifying purse string based on text analysis
CN112949299A (en) Method and device for generating news manuscript, storage medium and electronic device
CN118134422A (en) File content auditing method, device, equipment, storage medium and product
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
CN112699949B (en) Potential user identification method and device based on social platform data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant