CN115659969A - Document labeling method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN115659969A (application number CN202211592980.XA)
- Authority
- CN
- China
- Prior art keywords
- document
- labeled
- keyword
- label
- tag
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of document labeling and provides a document labeling method and device, electronic equipment, and a storage medium. The method comprises: acquiring a document to be labeled and a label list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled. Because the target label is determined by combining keyword-label similarity with keyword word frequency, the reliability and accuracy of the target label are ensured, and the method is not limited by the number of labeled samples that can be acquired and is easy to implement.
Description
Technical Field
The present invention relates to the field of document labeling technologies, and in particular, to a document labeling method and apparatus, an electronic device, and a storage medium.
Background
Automatic labeling of documents aims to label a given document with one or more labels, which facilitates subsequent processing of documents such as classification, searching, summarization, and the like.
In the prior art, both traditional machine learning and deep learning document labeling methods are supervised learning methods, and training their models depends on a large amount of labeled data. In practice, however, some scenarios provide only a set of unlabeled documents and a label list, and other scenarios, owing to constraints such as data privacy, provide only the label list. The absence of labeled samples directly affects the reliability of automatic document labeling.
Disclosure of Invention
The invention provides a document labeling method, a document labeling device, electronic equipment and a storage medium, which are used to overcome the drawback that supervised-learning document labeling methods in the prior art depend on a large amount of labeled data for training.
The invention provides a document labeling method, which comprises the following steps:
acquiring a document to be annotated and a tag list;
extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
According to the document labeling method provided by the invention, the determining of the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the label scores of the plurality of labels.
According to a document labeling method provided by the present invention, determining label scores of a plurality of labels of a document to be labeled based on similarity between each keyword and each label in the label list and word frequency of each keyword in the document to be labeled, comprises:
determining label scores of a plurality of labels of the document to be labeled based on the following formula (the formula is not legible in this text and is written here as the similarity-weighted sum implied by the symbol definitions):
$$S_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j)\,\hat{f}_i$$
where $S_j$ denotes the label score of the $j$-th label of the document to be labeled, $k_i$ denotes the $i$-th keyword, $t_j$ denotes the $j$-th label, $n$ denotes the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is the similarity between the $i$-th keyword and the $j$-th label, $f_i$ is the word frequency of the $i$-th keyword in the document to be labeled, and $\hat{f}_i$ is the word frequency obtained by normalizing $f_i$.
According to the document labeling method provided by the invention, the step of determining the target label of the document to be labeled based on the label scores of the plurality of labels comprises the following steps:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold score and/or the preset label number of the document to be labeled, and determining the labels obtained by screening as the target labels of the document to be labeled.
According to the document labeling method provided by the invention, the extracting of the keywords from the document to be labeled to obtain a plurality of keywords comprises the following steps:
applying a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
According to the document labeling method provided by the invention, the step of obtaining the sample text and the sample keywords corresponding to the sample text comprises the following steps:
acquiring a paper document related to each label in the label list, wherein the paper document carries paper keywords;
and determining the sample text based on the paper document, and determining the sample keywords corresponding to the sample text based on the paper keywords.
According to the document labeling method provided by the present invention, the determining the sample text based on the paper document comprises:
determining the sample text based on the title and abstract in the paper document.
The invention also provides a document labeling device, comprising:
the acquisition unit is used for acquiring a document to be annotated and a label list;
the keyword extraction unit is used for extracting keywords from the document to be labeled to obtain a plurality of keywords and counting the word frequency of each keyword in the document to be labeled;
and the tag determining unit is used for determining a target tag of the document to be labeled based on the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be labeled.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the document labeling method described above.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the document labeling method described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the document labeling method described in any one of the above.
The document labeling method and device, the electronic equipment, and the storage medium provided by the invention determine the target label of the document to be labeled by combining the similarity between each keyword and each label in the label list with the word frequency of each keyword in the document to be labeled. Combining similarity with word frequency ensures the reliability and accuracy of the target label, and the approach is not limited by the number of labeled samples that can be acquired and is easy to implement.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a document labeling method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of step 130 in the document labeling method provided in the present invention;
FIG. 3 is a schematic flow chart illustrating steps of obtaining a sample text and a sample keyword corresponding to the sample text according to the present invention;
FIG. 4 is a second flowchart illustrating a document labeling method according to the present invention;
FIG. 5 is a schematic structural diagram of a document labeling apparatus provided in the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the related art, automatic document labeling aims to attach one or more labels to a given document, which facilitates subsequent processing such as classification, search, and summarization. In a document management scenario, for example one covering fields such as artificial intelligence, big data, and blockchain, a label library usually already exists, and when a new document is put into storage it needs to be marked with labels from the existing label library.
A common approach to document labeling is text classification, which treats labeling as a multi-class classification task. Traditional text classification methods first obtain text features using methods such as BoW (Bag of Words) and TF-IDF (Term Frequency-Inverse Document Frequency), and then build a text classification model using machine learning algorithms such as Naive Bayes, SVM (Support Vector Machine), and random forest. Since the Bert (Bidirectional Encoder Representations from Transformers) model was proposed in 2019, deep learning text classification models based on Bert have become the mainstream text classification method.
For English text labeling, a text classification method that uses only label names, without labeled data, has been proposed; however, that method relies on predicting synonyms of the labels with the Bert model. To obtain semantically correct synonyms, each label must be a minimal, indivisible word unit, such as the common words "good", "bad", "commerce", and "economy".
In Chinese text labeling, however, a label is usually two or more characters long. For example, the Chinese label for "artificial intelligence" (人工智能) is split into 4 tokens by the Bert model, which makes it difficult for the model to produce semantically correct phrases, so the method cannot be applied directly to Chinese text labeling.
In view of the above problem, the present invention provides a document labeling method. FIG. 1 is a schematic flowchart of the document labeling method provided by the present invention. As shown in FIG. 1, the method includes:
Step 110, acquiring a document to be labeled and a label list.
Specifically, a to-be-annotated document and a tag list may be obtained, where the to-be-annotated document is a document that needs to be subsequently annotated, the to-be-annotated document may be a document formed by a text directly input by a user, or a document formed by a text obtained by performing voice transcription on an acquired audio, or a document formed by a text obtained by acquiring an image through an image acquisition device such as a scanner, a mobile phone, or a camera, and performing Optical Character Recognition (OCR) on the image, which is not specifically limited in the embodiment of the present invention.
The label list is a set of labels; it may be preset or crawled from web pages, which is not specifically limited in this embodiment of the present invention.
Step 120, extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled.
Specifically, after the document to be labeled is obtained, keyword extraction may be performed on it to obtain a plurality of keywords. The keyword extraction may use a keyword extraction model, which may be a Bert (Bidirectional Encoder Representations from Transformers) model, an LSTM-CRF (Long Short-Term Memory with Conditional Random Field) model, or a Bert-CRF model, which is not specifically limited in this embodiment of the present invention.
The keywords reflect the key points of the document to be labeled. For example, they may be "artificial intelligence" and "blockchain"; or "big data" and "natural language processing"; or all four of these, which is not specifically limited in this embodiment of the present invention.
After obtaining the keywords, the word frequency of each keyword in the document to be labeled can be counted, where the word frequency refers to the number of times that each keyword appears in the document to be labeled, for example, the word frequency of each keyword in the document to be labeled can be [ ("artificial intelligence", 5), ("big data", 2), ("natural language processing", 1) ] or the like.
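The counting step can be illustrated with a minimal Python sketch. The function name `count_keyword_frequencies` and the toy document are assumptions made for illustration, and plain substring counting is only one simple way to count occurrences; it is not stated in the source.

```python
from collections import Counter


def count_keyword_frequencies(document: str, keywords: list[str]) -> list[tuple[str, int]]:
    """Count non-overlapping occurrences of each extracted keyword in the document."""
    counts = Counter({kw: document.count(kw) for kw in keywords})
    # most_common() returns (keyword, count) pairs sorted by descending count.
    return counts.most_common()


# Hypothetical document text built so that the counts match the example above.
doc = ("artificial intelligence " * 5) + ("big data " * 2) + "natural language processing"
keywords = ["artificial intelligence", "big data", "natural language processing"]
print(count_keyword_frequencies(doc, keywords))
```

For Chinese text the same substring counting applies directly, since keywords are contiguous character sequences.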
Step 130, determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
Specifically, after the word frequency of each keyword in the document to be labeled is obtained through statistics, the target label of the document to be labeled can be determined based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled. The target label here refers to the final label of the document to be labeled; there may be one target label, several, or none, which is not specifically limited in this embodiment of the present invention.
The similarity between each keyword and each label in the label list may be calculated using measures such as cosine similarity or the Pearson correlation coefficient. Before the similarity is calculated, each keyword and each label in the label list may be encoded as a vector using word2vec embeddings, and the similarity is then calculated on the encoded vectors, which is not specifically limited in the embodiment of the present invention.
Here, the similarity between a keyword and a label reflects how well the two match: the higher the similarity, the better the keyword matches the label; the lower the similarity, the worse the match.
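As a minimal sketch of the cosine-similarity option, the Python code below compares keyword vectors with label vectors. The three-dimensional vectors are made-up stand-ins for word2vec embeddings, and the function names `cosine_similarity` and `similarity_matrix` are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(
    keyword_vecs: dict[str, list[float]], label_vecs: dict[str, list[float]]
) -> dict[str, dict[str, float]]:
    """Similarity between every keyword and every label, keyed as [keyword][label]."""
    return {
        kw: {lb: cosine_similarity(kv, lv) for lb, lv in label_vecs.items()}
        for kw, kv in keyword_vecs.items()
    }


# Toy vectors (assumed values, not real embeddings).
keyword_vecs = {"machine learning": [1.0, 0.0, 0.0], "hadoop": [0.0, 1.0, 0.0]}
label_vecs = {"artificial intelligence": [1.0, 0.0, 0.0], "big data": [0.0, 2.0, 0.0]}
print(similarity_matrix(keyword_vecs, label_vecs))
```

In practice the vectors would come from a trained embedding model, and a keyword or label made of several words would need some pooling rule, which the source does not specify.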
The word frequency of each keyword in the document to be labeled reflects the occurrence frequency of each keyword in the document to be labeled, and the occurrence frequency of a certain keyword in the document to be labeled can reflect the importance degree of the keyword in the document to be labeled.
For example, the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be annotated can be used as the criterion for evaluating the target tag of the document to be annotated, so as to obtain the target tag of the document to be annotated.
The method provided by the embodiment of the invention determines the target label of the document to be labeled by combining the similarity between each keyword and each label in the label list with the word frequency of each keyword in the document to be labeled. Combining similarity with word frequency ensures the reliability and accuracy of the target label, and the method is not limited by the number of labeled samples that can be acquired and is easy to implement.
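Based on the above embodiment, FIG. 2 is a schematic flowchart of step 130 in the document labeling method provided by the present invention. As shown in FIG. 2, step 130 includes:
Step 131, determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
Step 132, determining a target label of the document to be labeled based on the label scores of the plurality of labels.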
Specifically, after the word frequency of each keyword in the document to be labeled is obtained, the similarity between each keyword and each label in the label list may be weighted by the word frequency of each keyword in the document to be labeled to obtain label scores of a plurality of labels of the document to be labeled. The label score reflects how strongly each label qualifies as the target label, or the probability that each label is the target label, and may be, for example, 0.5, 0.8, or 0.7, which is not specifically limited in this embodiment of the present invention.
The word frequency of a keyword in the document to be labeled reflects how often the keyword occurs, and therefore how important the keyword is in that document. The greater the word frequency of a keyword, the more that keyword influences the label scores of the labels similar to it; the smaller the word frequency, the less it influences them. The word frequency of each keyword in the document to be labeled can therefore serve as a basis for judging the label scores of the plurality of labels of the document to be labeled.
After the tag scores of the multiple tags of the document to be annotated are obtained, the target tag of the document to be annotated can be determined based on the tag scores of the multiple tags. The target label is the final label of the document to be labeled.
For example, the plurality of tags may be filtered based on the tag scores of the plurality of tags, and those tags with higher scores in the tag scores of the plurality of tags may be determined as the target tags of the document to be labeled.
The method provided by the embodiment of the invention determines the target label of the document to be labeled based on the label scores of the plurality of labels, wherein the label score reflects the score of each label as the target label or reflects the probability of each label as the target label, thereby ensuring the reliability and the accuracy of the target label of the document to be labeled.
Based on the above embodiment, step 131 includes:
determining label scores of a plurality of labels of the document to be labeled based on the following formula (the formula is not legible in this text and is written here as the similarity-weighted sum implied by the symbol definitions):
$$S_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j)\,\hat{f}_i$$
where $S_j$ denotes the label score of the $j$-th label of the document to be labeled, $k_i$ denotes the $i$-th keyword, $t_j$ denotes the $j$-th label, $n$ denotes the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is the similarity between the $i$-th keyword and the $j$-th label, $f_i$ is the word frequency of the $i$-th keyword in the document to be labeled, and $\hat{f}_i$ is the word frequency obtained by normalizing $f_i$.
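The label scoring can be sketched in Python under two stated assumptions that the text does not confirm: the score of a label is the sum over keywords of similarity multiplied by normalized word frequency, and word frequencies are normalized by dividing by their total. The function name `label_scores` and the sample numbers are hypothetical.

```python
def label_scores(
    similarity: dict[str, dict[str, float]], frequencies: dict[str, int]
) -> dict[str, float]:
    """Score each label as the frequency-weighted sum of keyword-label similarities.

    similarity[keyword][label] is the keyword-label similarity;
    frequencies[keyword] is the raw word frequency of the keyword in the document.
    """
    total = sum(frequencies.values())
    if total == 0:
        return {}
    # Assumed normalization: each frequency divided by the sum of all frequencies.
    normalized = {kw: f / total for kw, f in frequencies.items()}
    labels = {lb for sims in similarity.values() for lb in sims}
    return {
        lb: sum(similarity.get(kw, {}).get(lb, 0.0) * w for kw, w in normalized.items())
        for lb in labels
    }


# Made-up similarities and frequencies for two keywords and two labels.
similarity = {
    "machine learning": {"artificial intelligence": 1.0, "big data": 0.0},
    "hadoop": {"artificial intelligence": 0.0, "big data": 1.0},
}
frequencies = {"machine learning": 3, "hadoop": 1}
print(label_scores(similarity, frequencies))
```

With this normalization the weights sum to 1, so when similarities lie in [0, 1] the scores do as well, which makes a fixed threshold score easier to interpret.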
Based on the above embodiment, step 132 includes:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold score and/or the preset label number of the document to be labeled, and determining the labels obtained by screening as the target labels of the document to be labeled.
Specifically, after the label scores of the plurality of labels are obtained, the labels may be screened based on the label scores and the threshold score; or based on the label scores and the preset label number of the document to be labeled; or based on the label scores together with both the threshold score and the preset label number. In each case, the labels obtained by screening are determined as the target labels of the document to be labeled.
The threshold score is a threshold label score, and may be set in advance or may be set according to actual conditions. The preset number of tags of the document to be annotated refers to the number of tags required by the document to be annotated, and may be preset or set according to an actual situation, which is not specifically limited in the embodiment of the present invention.
For example, suppose the threshold score is 0.5, the preset label number of the document to be labeled is 5, and the label scores are 0.6 for "artificial intelligence", 0.7 for "support vector machine", and 0.8 for "natural language processing". All three scores exceed the threshold score and the number of labels does not exceed the preset label number, so screening retains "artificial intelligence", "support vector machine", and "natural language processing" as the target labels of the document to be labeled.
In addition, before the plurality of tags are screened based on the tag scores of the plurality of tags, the threshold score and/or the preset number of tags of the document to be labeled, the tag scores of the plurality of tags may be sorted, and the plurality of tags may be screened based on the sorted tag scores of the plurality of tags. Here, the sorting of the label scores of the plurality of labels may be performed by sorting the label scores of the plurality of labels from high to low, or by sorting the label scores of the plurality of labels from low to high, which is not specifically limited in the embodiment of the present invention.
The method provided by the embodiment of the invention screens the plurality of labels based on the label scores of the plurality of labels and in combination with the conditions of the threshold score and/or the preset number of labels of the document to be marked, and determines the screened labels as the target labels of the document to be marked, thereby ensuring the accuracy of determining the target labels of the document to be marked.
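The screening step can be sketched as follows. The function name `select_target_labels` is hypothetical, and two details are assumptions rather than statements from the text: a label passes when its score is greater than or equal to the threshold score, and the preset label number acts as an upper limit applied after sorting from high to low.

```python
from typing import Optional


def select_target_labels(
    scores: dict[str, float],
    threshold: Optional[float] = None,
    max_labels: Optional[int] = None,
) -> list[str]:
    """Filter labels by a threshold score and/or keep at most max_labels of them."""
    # Sort from the highest score to the lowest before screening.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(label, s) for label, s in ranked if s >= threshold]
    if max_labels is not None:
        ranked = ranked[:max_labels]
    return [label for label, _ in ranked]


# Values taken from the example above.
scores = {
    "artificial intelligence": 0.6,
    "support vector machine": 0.7,
    "natural language processing": 0.8,
}
print(select_target_labels(scores, threshold=0.5, max_labels=5))
```

Passing only `threshold` or only `max_labels` covers the two single-condition variants; an empty result corresponds to the case in which the document receives no target label.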
Based on the above embodiment, step 120 includes:
step 121, applying a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
Specifically, in order to extract the keywords of the document to be labeled, before step 121, the keyword extraction model needs to be obtained through the following steps:
the sample texts and the sample keywords corresponding to the sample texts can be collected in advance, and an initial keyword extraction model can be constructed, where the initial keyword extraction model is the starting model from which the keyword extraction model is trained. Here, the initial keyword extraction model may include a Bert model and a classification layer, where the classification layer may be a softmax layer or a CRF (Conditional Random Field) layer, which is not specifically limited in this embodiment of the present invention.
After the initial keyword extraction model is obtained, the sample texts collected in advance and the sample keywords corresponding to the sample texts can be applied to train the initial keyword extraction model:
the sample text can be input into the initial keyword extraction model, and the initial keyword extraction model is used for extracting keywords from the sample text to obtain and output predicted keywords of the sample text.
After the prediction keywords are obtained based on the initial keyword extraction model, the prediction keywords can be compared with sample keywords corresponding to a sample text collected in advance, a loss function value is obtained through calculation according to the difference degree between the prediction keywords and the sample keywords, parameter iteration is carried out on the initial keyword extraction model based on the loss function value, and the initial keyword extraction model after the parameter iteration is completed is recorded as a keyword extraction model.
It can be understood that the greater the difference degree between the prediction keywords and the sample keywords corresponding to the sample texts collected in advance, the greater the loss function value; the smaller the difference between the prediction keyword and the sample keyword corresponding to the sample text collected in advance, the smaller the loss function value.
In other words, during training the initial keyword extraction model learns to extract, from a document to be labeled, the keywords that can be used to determine its target label.
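When the classification layer is a softmax or CRF layer, one common way to compare predicted keywords with sample keywords is to convert the sample keywords into per-token targets. The sketch below uses the BIO tagging scheme and whitespace tokenization; both are assumptions made for illustration and are not stated in the text, and the function name `bio_tags` is hypothetical.

```python
def bio_tags(tokens: list[str], keywords: list[str]) -> list[str]:
    """Mark each token as B (keyword start), I (inside a keyword) or O (outside)."""
    tags = ["O"] * len(tokens)
    for keyword in keywords:
        kw_tokens = keyword.split()
        n = len(kw_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_is_free = all(tag == "O" for tag in tags[start:start + n])
            if tokens[start:start + n] == kw_tokens and span_is_free:
                tags[start] = "B"
                for k in range(start + 1, start + n):
                    tags[k] = "I"
    return tags


# Hypothetical sample text and sample keywords.
tokens = "we study big data and artificial intelligence".split()
print(bio_tags(tokens, ["big data", "artificial intelligence"]))
```

The loss function value would then be computed between the model's predicted tag sequence and these target tags, for example with a per-token cross-entropy for a softmax layer.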
In the related art, when a keyword extraction model is trained on sample texts and their corresponding sample keywords, the sample keywords are usually difficult to obtain. To address this problem, in the embodiment of the invention the sample text is determined based on paper documents related to each label in the label list, and the sample keywords corresponding to the sample text are the paper keywords carried in those paper documents.
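Based on the above embodiment, FIG. 3 is a schematic flowchart of the steps of obtaining the sample text and the sample keywords corresponding to the sample text. As shown in FIG. 3, these steps include:
Step 310, acquiring a paper document related to each label in the label list, wherein the paper document carries paper keywords;
Step 320, determining the sample text based on the paper document, and determining the sample keywords corresponding to the sample text based on the paper keywords.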
Specifically, the paper documents related to each label in the label list can be obtained. The paper documents carry paper keywords, so the keywords do not need to be labeled manually, which saves a large amount of time and improves the efficiency of obtaining the sample text and the sample keywords corresponding to the sample text.
It can be understood that after each label in the label list is obtained, paper documents related to each label can be matched from an open-source data set, and the open-source data set can be obtained by crawling the download websites of paper documents.
After the paper documents related to each label in the label list are obtained, the sample text may be determined based on the paper documents. For example, a paper document may be used directly as the sample text, or the text that expresses the core idea of the paper document may be used as the sample text.
Then, the sample keywords corresponding to the sample text can be determined based on the paper keywords. For example, the paper keywords carried by the paper document itself may be used as the sample keywords corresponding to the sample text.
For example, the sample texts and their corresponding sample keywords may be organized as (sample text 1, [sample keyword 1 of sample text 1, ...]), (sample text 2, [sample keyword 1 of sample text 2, ...]), ..., (sample text N, [sample keyword 1 of sample text N, ...]).
In the method provided by the embodiment of the invention, the sample text is determined based on the paper document, the paper document carries the paper keywords, and the sample keywords corresponding to the sample text are determined based on the paper keywords, namely the sample keywords corresponding to the sample text do not need to be labeled manually, so that a large amount of time cost is saved.
In the related art, when the sample text and the sample keywords corresponding to the sample text are applied to the training of the keyword extraction model, the sample text usually uses the whole document, so that the training cost of the keyword extraction model is increased, and the training efficiency of the keyword extraction model is reduced.
Based on the above embodiment, step 320 includes:
determining the sample text based on the title and abstract in the paper document.
Specifically, after the paper documents related to each tag in the tag list are obtained, the sample text may be determined based on the title and the abstract in the paper document. For example, the title and abstract of the paper document may be used directly as the sample text.
The method provided by the embodiment of the invention determines the sample text based on the title and abstract in the paper document rather than on the whole document, as in the conventional approach, thereby improving the training efficiency of the keyword extraction model.
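Building the training pairs from the title and abstract can be sketched as follows. This is a hedged illustration only: the dictionary schema (`title`, `abstract`, `keywords`), the function name `build_training_pairs`, and the choice to join title and abstract with a newline are assumptions that do not come from the text.

```python
def build_training_pairs(papers):
    """Turn paper records into (sample_text, sample_keywords) pairs.

    `papers` is an iterable of dicts with 'title', 'abstract' and 'keywords'
    entries (a hypothetical schema). The sample text is the title followed by
    the abstract, and the sample keywords are the keywords the paper carries.
    """
    pairs = []
    for paper in papers:
        keywords = [kw.strip() for kw in paper.get("keywords", []) if kw.strip()]
        if not keywords:
            continue  # a paper without keywords provides no supervision signal
        sample_text = "\n".join([paper["title"].strip(), paper["abstract"].strip()])
        pairs.append((sample_text, keywords))
    return pairs


pairs = build_training_pairs([
    {"title": "Title A", "abstract": "Abstract A", "keywords": ["k1", " k2 "]},
    {"title": "Title B", "abstract": "Abstract B", "keywords": []},
])
```

The second record is skipped because it carries no keywords, so only one pair is produced.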
Based on any of the above embodiments, the present invention provides a document labeling method. Fig. 4 is a second schematic flowchart of the document labeling method provided by the present invention. As shown in Fig. 4, the method includes:
Step 410: obtaining a document to be labeled and a label list.
Here, the steps of obtaining the sample text and the sample keywords corresponding to the sample text include:
obtaining the paper documents related to each label in the label list, where the paper documents carry paper keywords;
determining the sample text based on the title and abstract in the paper document, and determining the sample keywords corresponding to the sample text based on the paper keywords.
The label scores of a plurality of labels of the document to be labeled can be determined based on the following formula:
where the terms of the formula denote: the label score of the j-th label of the document to be labeled; the i-th keyword; the j-th label; the total number of keywords; the similarity between the i-th keyword and the j-th label; the word frequency of the i-th keyword in the document to be labeled; and the normalized word frequency of the i-th keyword.
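The formula itself is not reproduced in this text. One form that is consistent with the term definitions above sums, over all keywords, the product of the keyword-label similarity and the normalized word frequency. It is offered as an assumption, not as the patent's exact expression, and the symbols $S_j$, $k_i$, $t_j$, $n$, $\mathrm{sim}$, $f_i$ and $\hat{f}_i$ are names chosen here for illustration:

```latex
S_j = \sum_{i=1}^{n} \mathrm{sim}(k_i, t_j)\,\hat{f}_i
```

Here $S_j$ is the label score of the $j$-th label, $k_i$ is the $i$-th keyword, $t_j$ is the $j$-th label, $n$ is the total number of keywords, $\mathrm{sim}(k_i, t_j)$ is their similarity, and $\hat{f}_i$ is the normalized form of the word frequency $f_i$. The normalization method is not specified in the text.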
The document labeling device provided by the present invention is described below. The device described below and the document labeling method described above may be read with reference to each other.
Based on any of the above embodiments, the present invention provides a document annotation device, and fig. 5 is a schematic structural diagram of the document annotation device provided by the present invention, as shown in fig. 5, the device includes:
an obtaining unit 510, configured to obtain a to-be-annotated document and a tag list;
a keyword extraction unit 520, configured to perform keyword extraction on the document to be labeled to obtain a plurality of keywords, and count word frequencies of the keywords in the document to be labeled;
a tag determining unit 530, configured to determine a target tag of the document to be labeled based on similarities between the keywords and the tags in the tag list and word frequencies of the keywords in the document to be labeled.
The device provided by the embodiment of the invention determines the target label of the document to be labeled by combining the similarity between each keyword and each label in the label list with the word frequency of each keyword in the document to be labeled. Combining similarity and word frequency ensures that the target label is determined reliably and accurately. The approach is not limited by the number of labeled samples that can be obtained and is easy to implement.
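The word-frequency counting performed by the keyword extraction unit can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, keywords are counted as case-insensitive non-overlapping substrings, and frequencies are normalized by dividing by their sum. The text does not specify either the counting rule or the normalization method.

```python
import re


def keyword_frequencies(document: str, keywords: list[str]) -> dict[str, int]:
    """Count how often each keyword occurs in the document.

    Assumption: case-insensitive, non-overlapping substring matches.
    """
    text = document.lower()
    return {kw: len(re.findall(re.escape(kw.lower()), text)) for kw in keywords}


def normalize(freqs: dict[str, int]) -> dict[str, float]:
    """Scale the counts so that they sum to 1 (one possible normalization)."""
    total = sum(freqs.values())
    if total == 0:
        return {kw: 0.0 for kw in freqs}
    return {kw: count / total for kw, count in freqs.items()}


document = ("Deep learning improves speech recognition. "
            "Speech recognition needs deep learning and data.")
freqs = keyword_frequencies(document, ["deep learning", "speech recognition", "data"])
norm_freqs = normalize(freqs)
```

Substring counting also matches a keyword inside a longer word (for example "data" inside "database"), so a tokenizer-based count may be preferable in practice.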
Based on any of the above embodiments, the tag determining unit specifically includes:
a tag score determining unit, configured to determine tag scores of multiple tags of the document to be tagged based on similarities between the keywords and the tags in the tag list and word frequencies of the keywords in the document to be tagged;
and the target label determining unit is used for determining a target label of the document to be labeled based on the label scores of the plurality of labels.
Based on any of the above embodiments, the tag score determining unit is specifically configured to:
determining label scores of a plurality of labels of the document to be labeled based on the following formula:
where the terms of the formula denote: the label score of the j-th label of the document to be labeled; the i-th keyword; the j-th label; the total number of keywords; the similarity between the i-th keyword and the j-th label; the word frequency of the i-th keyword in the document to be labeled; and the normalized word frequency of the i-th keyword.
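A label-score computation built from the quantities defined above can be sketched as follows. Both the scoring rule (a sum over keywords of similarity multiplied by normalized word frequency) and the similarity measure (Jaccard overlap of word tokens) are assumptions made for illustration, since the text gives neither the formula nor the similarity function. The names `token_jaccard` and `label_scores` are hypothetical.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of word tokens; a stand-in for the unspecified similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def label_scores(norm_freqs, labels, similarity=token_jaccard):
    """Score each label as the sum of similarity * normalized frequency.

    `norm_freqs` maps each keyword to its normalized word frequency.
    The summation form is an assumed reading of the term definitions.
    """
    return {
        label: sum(similarity(kw, label) * weight for kw, weight in norm_freqs.items())
        for label in labels
    }


scores = label_scores(
    {"speech recognition": 0.5, "deep learning": 0.25, "speech synthesis": 0.25},
    ["speech recognition", "machine learning"],
)
```

An embedding-based cosine similarity could be passed in through the `similarity` parameter in place of the token overlap used here.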
Based on any of the embodiments described above, the target label determining unit is specifically configured to:
screen the plurality of labels based on the label scores of the plurality of labels and on the threshold score and/or the preset label number of the document to be labeled, and determine the labels obtained by screening as the target labels of the document to be labeled.
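The screening step can be sketched as follows. This is a minimal sketch: the function name `select_target_labels`, the use of "greater than or equal to" for the threshold comparison, and applying the threshold before the label-count limit are assumptions, since the text only states that a threshold score and/or a preset label number are used.

```python
def select_target_labels(scores, threshold=None, max_labels=None):
    """Screen labels by score threshold and/or a preset number of labels.

    `scores` maps each label to its label score. Either criterion may be
    omitted by leaving it as None.
    """
    # Rank labels from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(label, score) for label, score in ranked if score >= threshold]
    if max_labels is not None:
        ranked = ranked[:max_labels]
    return [label for label, _ in ranked]


example_scores = {"a": 0.9, "b": 0.4, "c": 0.6}
by_threshold = select_target_labels(example_scores, threshold=0.5)
by_count = select_target_labels(example_scores, max_labels=1)
by_both = select_target_labels(example_scores, threshold=0.5, max_labels=1)
```

With only a threshold, a document can end up with no target label; combining the threshold with a label count caps the result without forcing low-scoring labels in.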
Based on any of the above embodiments, the keyword extraction unit is specifically configured to:
apply a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
Based on any one of the above embodiments, the device further includes the following units for obtaining the sample text and the sample keywords corresponding to the sample text:
a document obtaining unit, configured to obtain a paper document related to each tag in the tag list, where the paper document carries paper keywords;
and a text and keyword determining unit, configured to determine the sample text based on the paper document and to determine the sample keywords corresponding to the sample text based on the paper keywords.
Based on any of the embodiments described above, the text and keyword determining unit is specifically configured to:
determine the sample text based on the title and abstract in the paper document.
Fig. 6 illustrates the physical structure of an electronic device. As shown in Fig. 6, the electronic device may include a processor 610, a communication interface 620, a memory 630 and a communication bus 640, where the processor 610, the communication interface 620 and the memory 630 communicate with one another through the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a document labeling method comprising: acquiring a document to be labeled and a label list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, a computer is able to execute the document labeling method provided by the above methods, the method including: acquiring a document to be labeled and a label list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the document labeling method provided by the above methods, the method including: acquiring a document to be labeled and a label list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware. Based on this understanding, the above technical solutions in essence, or the parts of them that contribute over the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A document labeling method is characterized by comprising the following steps:
acquiring a document to be annotated and a tag list;
extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled.
2. The method according to claim 1, wherein the determining the target tag of the to-be-annotated document based on the similarity between the keyword and the tag in the tag list and the word frequency of the keyword in the to-be-annotated document comprises:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
and determining a target label of the document to be labeled based on the label scores of the plurality of labels.
3. The method of claim 2, wherein the determining the tag scores of the tags of the document to be tagged based on the similarity between the keywords and the tags in the tag list and the word frequency of the keywords in the document to be tagged comprises:
determining label scores of a plurality of labels of the document to be labeled based on the following formula:
where the terms of the formula denote: the label score of the j-th label of the document to be labeled; the i-th keyword; the j-th label; the total number of keywords; the similarity between the i-th keyword and the j-th label; the word frequency of the i-th keyword in the document to be labeled; and the normalized word frequency of the i-th keyword.
4. The document annotation method of claim 2, wherein the determining the target tag of the document to be annotated based on the tag scores of the plurality of tags comprises:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold score and/or the preset label number of the document to be labeled, and determining the labels obtained by screening as the target labels of the document to be labeled.
5. The method according to any one of claims 1 to 4, wherein the extracting keywords from the document to be labeled to obtain a plurality of keywords comprises:
applying a keyword extraction model to extract keywords from the document to be labeled to obtain a plurality of keywords;
the keyword extraction model is obtained by training based on a sample text and sample keywords corresponding to the sample text.
6. The document labeling method of claim 5, wherein the obtaining step of the sample text and the sample keywords corresponding to the sample text comprises:
acquiring a paper document related to each label in the label list, wherein the paper document carries a paper keyword;
and determining the sample text based on the paper document, and determining a sample keyword corresponding to the sample text based on the paper keyword.
7. The method of claim 6, wherein said determining the sample text based on the paper document comprises:
determining the sample text based on the title and abstract in the paper document.
8. A document labeling apparatus, comprising:
the acquisition unit is used for acquiring a document to be annotated and a label list;
the keyword extraction unit is used for extracting keywords from the document to be labeled to obtain a plurality of keywords and counting the word frequency of each keyword in the document to be labeled;
and the tag determining unit is used for determining a target tag of the document to be labeled based on the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be labeled.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the document annotation method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the document annotation method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211592980.XA CN115659969B (en) | 2022-12-13 | 2022-12-13 | Document labeling method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115659969A (en) | 2023-01-31 |
CN115659969B (en) | 2023-04-28 |
Family
ID=85017459
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211592980.XA Active CN115659969B (en) | 2022-12-13 | 2022-12-13 | Document labeling method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115659969B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117971780A (en) * | 2023-12-29 | 2024-05-03 | 青矩技术股份有限公司 | Document storage method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103235774A (en) * | 2013-04-27 | 2013-08-07 | 杭州电子科技大学 | Extraction method of feature words of science and technology project application form |
CN110489649A (en) * | 2019-08-19 | 2019-11-22 | 北京创鑫旅程网络技术有限公司 | The method and device of label association content |
CN110717092A (en) * | 2018-06-27 | 2020-01-21 | 北京京东尚科信息技术有限公司 | Method, system, device and storage medium for matching objects for articles |
CN110781297A (en) * | 2019-09-18 | 2020-02-11 | 国家计算机网络与信息安全管理中心 | Classification method of multi-label scientific research papers based on hierarchical discriminant trees |
CN111967262A (en) * | 2020-06-30 | 2020-11-20 | 北京百度网讯科技有限公司 | Method and device for determining entity tag |
US20220019741A1 (en) * | 2020-07-16 | 2022-01-20 | Optum Technology, Inc. | An unsupervised approach to assignment of pre-defined labels to text documents |
Also Published As
Publication number | Publication date |
---|---|
CN115659969B (en) | 2023-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019218514A1 (en) | Method for extracting webpage target information, device, and storage medium | |
CN107229668B (en) | Text extraction method based on keyword matching | |
CN107463605B (en) | Method and device for identifying low-quality news resource, computer equipment and readable medium | |
CN113011533A (en) | Text classification method and device, computer equipment and storage medium | |
CN107437038B (en) | Webpage tampering detection method and device | |
WO2022095374A1 (en) | Keyword extraction method and apparatus, and terminal device and storage medium | |
CN109241277B (en) | Text vector weighting method and system based on news keywords | |
CN111160019B (en) | Public opinion monitoring method, device and system | |
CN113569050B (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning | |
CN109472022B (en) | New word recognition method based on machine learning and terminal equipment | |
CN109145180B (en) | Enterprise hot event mining method based on incremental clustering | |
US20230074771A1 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
CN113722492A (en) | Intention identification method and device | |
CN111985212A (en) | Text keyword recognition method and device, computer equipment and readable storage medium | |
CN113255319B (en) | Model training method, text segmentation method, abstract extraction method and device | |
CN113204956B (en) | Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device | |
CN115659969B (en) | Document labeling method, device, electronic equipment and storage medium | |
TWI681304B (en) | System and method for adaptively adjusting related search words | |
CN115062621A (en) | Label extraction method and device, electronic equipment and storage medium | |
CN114676346A (en) | News event processing method and device, computer equipment and storage medium | |
CN113011174B (en) | Method for identifying purse string based on text analysis | |
CN112949299A (en) | Method and device for generating news manuscript, storage medium and electronic device | |
CN118134422A (en) | File content auditing method, device, equipment, storage medium and product | |
CN110941713A (en) | Self-optimization financial information plate classification method based on topic model | |
CN112699949B (en) | Potential user identification method and device based on social platform data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||