CN112926297A - Method, apparatus, device and storage medium for processing information - Google Patents

Method, apparatus, device and storage medium for processing information

Info

Publication number
CN112926297A
Authority
CN
China
Prior art keywords
topic
target
information
target topic
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110222722.1A
Other languages
Chinese (zh)
Other versions
CN112926297B (en)
Inventor
徐海东
刘继辉
邢卓然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110222722.1A
Publication of CN112926297A
Application granted
Publication of CN112926297B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method, an apparatus, a device and a storage medium for processing information, which are applied to the field of computer technology, in particular to the fields of natural language processing and deep learning. The specific implementation scheme is as follows: acquiring a target topic and text information for the target topic; extracting keywords from the target topic to obtain keywords for the target topic; determining the similarity between the keywords for the target topic and the text information; and determining, according to the similarity, information in the text information that is irrelevant to the target topic.

Description

Method, apparatus, device and storage medium for processing information
Technical Field
The present disclosure relates to the field of computer technologies, particularly to the field of natural language processing and deep learning, and more particularly, to a method, an apparatus, a device, and a storage medium for processing information.
Background
The timeliness of information updates and the coverage of hot spots are important indicators that influence user experience. Topics are a novel form of presenting hot spots: resources carrying a topic identifier are aggregated and ranked, so a topic offers good timeliness and resource diversity and can meet users' needs.
Disclosure of Invention
A method, apparatus, device, medium, and program product for processing information are provided to improve the accuracy of information for each topic.
According to a first aspect, there is provided a method of processing information, comprising: acquiring a target topic and text information for the target topic; extracting keywords from the target topic to obtain keywords for the target topic; determining the similarity between the keywords for the target topic and the text information; and determining, according to the similarity, information in the text information that is irrelevant to the target topic.
According to a second aspect, there is provided an apparatus for processing information, comprising: an information acquisition module for acquiring a target topic and text information for the target topic; a keyword extraction module for extracting keywords from the target topic to obtain keywords for the target topic; a similarity determining module for determining the similarity between the keywords for the target topic and the text information; and an information determining module for determining, according to the similarity, information in the text information that is irrelevant to the target topic.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of processing information provided by the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method of processing information provided by the present disclosure.
According to a fifth aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of processing information provided by the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of a method, an apparatus, a device and a storage medium for processing information according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a method of processing information according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a schematic diagram of obtaining a target topic according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of obtaining textual information pertaining to a target topic in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of deriving keywords for a target topic according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a schematic diagram of determining information in textual information that is not relevant to a target topic, in accordance with an embodiment of the disclosure;
FIG. 7 is a block diagram of an apparatus for processing information according to an embodiment of the present disclosure; and
FIG. 8 is a block diagram of an electronic device used to implement a method of processing information of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a method of processing information, the method comprising an information acquisition stage, a keyword extraction stage, a similarity determination stage and an information determination stage. In the information acquisition stage, a target topic and text information for the target topic are acquired. In the keyword extraction stage, keywords are extracted from the target topic to obtain keywords for the target topic. In the similarity determination stage, the similarity between the keywords for the target topic and the text information is determined. In the information determination stage, information in the text information that is irrelevant to the target topic is determined according to the similarity.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 is an application scenario diagram of a method, apparatus, device, medium, and program product for processing information according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include a terminal device 110 and a server 120. The terminal device 110 and the server 120 may communicate with each other via a network, which may include, for example, a wired or wireless communication network.
Terminal device 110 may have installed thereon various client applications such as, for example only, a shopping-type application, a web browser application, a search-type application, a web-disk-type application, a mailbox client, a social-type application, and the like. Terminal device 110 may be a variety of electronic devices having a display screen and having processing functionality including, but not limited to, smart phones, tablets, laptop and desktop computers, and the like.
Illustratively, a user may use terminal device 110 to interact with server 120 via a network. Server 120 may be, for example, an application server that provides support for client applications run by terminal device 110. In this embodiment, the terminal device 110 may send a topic information check request to the server 120 in response to a user operation. In response to the request, the server 120 may check the information for the topic, for example, by checking the correlation between that information and the topic. Information that is irrelevant to the topic can then be removed from the information for the topic, which improves the quality of the information that the client application maintains for each topic.
In one embodiment, server 120 may be, for example, a server incorporating a blockchain. Alternatively, the server 120 may also be a virtual server, a cloud server, or the like. The server 120 may, for example, feed back the information 130 found by the check to be unrelated to the topic to the terminal device 110 for further manual review by the user.
In an embodiment, the application scenario 100 may also include a database 140, and the database 140 may be maintained with information for all topics in the client application, for example. The server 120 may obtain information 150 for a topic from the database 140 according to the topic identification, for example, when verifying the correlation between the information and the topic.
It should be noted that the method for processing information provided by the present disclosure may be executed by the server 120. Accordingly, the apparatus for processing information provided by the present disclosure may be provided in the server 120.
It should be understood that the terminal devices, servers and databases in fig. 1 are merely illustrative. Any type of terminal device, presentation page, and server may be provided, as desired for implementation.
The method for processing information provided by the present disclosure will be described in detail with fig. 2 to 6 in conjunction with the application scenario described in fig. 1.
FIG. 2 is a flow diagram of a method of processing information in accordance with an embodiment of the present disclosure. As shown in fig. 2, the method 200 of processing information of this embodiment includes operation S210, operation S230, operation S250, and operation S270.
In operation S210, a target topic and text information for the target topic are acquired.
According to embodiments of the present disclosure, the target topic may be, for example, a topic determined in response to a request. The request may be generated by the terminal device in response to a user operation, for example, and the request may include a topic name, a topic identifier, and other attribute information capable of uniquely identifying a topic.
According to the embodiment of the disclosure, non-generalized topics can be screened out from all the maintained topics as target topics. This is because the information on a generalized topic is broad, and there is generally no requirement for correlation between the information and the topic. When topics are screened, proper noun recognition may be performed on the topic name, for example. When the topic name includes a proper noun, the topic is determined to be a non-generalized topic; otherwise, it is determined to be a generalized topic. Proper nouns may include, for example, names of persons, places, companies, organizations, etc., that represent specific or unique persons or things. It is to be understood that a non-generalized topic generally refers to a topic generated by aggregating content about a particular event, which may be, for example, an event that is not ubiquitous in society.
According to an embodiment of the present disclosure, the text information for the target topic may be, for example, information that includes the topic name of the target topic. When the text information for the target topic is acquired, characters such as the topic name may be removed from the information that includes the topic name, so that the topic name does not influence the relevance determination. Illustratively, the information including the topic name of the target topic is, for example, information posted by a user in a social-type application. For example, if the information is "XXXX #YYYY#" and YYYY is the topic name, the acquired text information for the target topic is XXXX.
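The topic-name removal described above can be sketched as follows. This is a minimal illustration, assuming that topic names are written between a pair of "#" characters as in the example; the function name `strip_topic_tags` is hypothetical and does not come from the disclosure.

```python
import re


def strip_topic_tags(post: str) -> str:
    """Remove every '#...#' topic tag from a post (assumed tag syntax)."""
    # Drop each run of non-'#' characters enclosed by a pair of '#'.
    without_tags = re.sub(r"#[^#]*#", "", post)
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(without_tags.split())


text_information = strip_topic_tags("XXXX #YYYY#")
```

Only the remaining text is passed on to the similarity stage, so the topic name itself cannot inflate the similarity score.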
In operation S230, keywords are extracted from the target topic, resulting in keywords for the target topic.
According to the embodiment of the present disclosure, keyword extraction may be performed on the topic name of the target topic, and the extracted keyword is taken as a keyword for the target topic.
For example, when the keywords are extracted, the topic name may first be segmented into words. The words obtained by word segmentation are then compared with the words in a keyword lexicon, and the words that belong to the keyword lexicon are taken as the keywords for the target topic. The keyword lexicon may be pre-constructed according to actual requirements, and may be composed of, for example, encyclopedic entries, proper nouns, an input method cell lexicon, and the like, which is not limited in this disclosure.
Illustratively, the keywords may also be extracted using the term frequency-inverse document frequency (TF-IDF) technique, a method based on a sequence labeling model, or the like.
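The lexicon-based variant can be sketched as follows, assuming the topic name has already been segmented into words by some external tool. The function name `match_keywords` and the sample lexicon are hypothetical.

```python
def match_keywords(tokens, lexicon):
    """Return the tokens that appear in the keyword lexicon, in order, without duplicates."""
    seen = set()
    keywords = []
    for token in tokens:
        if token in lexicon and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


# Hypothetical segmented topic name and lexicon, for illustration only.
keywords = match_keywords(
    ["Beijing", "hosts", "marathon", "Beijing"],
    {"Beijing", "marathon"},
)
```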
In operation S250, a similarity between the keyword for the target topic and the text information is determined.
According to the embodiment of the disclosure, the keywords for the target topic and the text information can each be converted into word vectors using the word2vec method. The Pearson correlation coefficient, the Spearman correlation coefficient, the Jaccard similarity coefficient, or the like between the two converted word vectors is then taken as the similarity between the keywords and the text information. Alternatively, the TF-IDF technique, a probabilistic topic model such as the Latent Dirichlet Allocation (LDA) model, and the like may be employed to calculate the similarity between the keywords and the text information. It is to be understood that the above methods for calculating the similarity are only examples to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
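Two of the coefficients named above can be sketched in plain Python. This is an illustration only: the vectors are assumed to come from some embedding step (such as word2vec) that is not shown, the Jaccard coefficient is computed here on word sets, and the function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Assumption: a constant vector is treated as uncorrelated.
    return cov / (norm_u * norm_v)


def jaccard(words_a, words_b):
    """Jaccard similarity coefficient between two collections of words."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Pearson and Spearman operate on numeric vectors, whereas Jaccard compares sets, so the choice of coefficient also determines which representation of the keyword and the text is needed.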
In operation S270, information irrelevant to the target topic in the text information is determined according to the similarity.
According to the embodiment of the present disclosure, text information whose similarity to the keywords for the target topic is lower than a similarity threshold may be taken as information unrelated to the target topic.
According to the embodiment of the disclosure, if there are a plurality of keywords for the target topic, the text information may be determined to be information unrelated to the target topic when the similarity between each keyword in the plurality of keywords and the text information is less than the similarity threshold. Alternatively, in a case where an average value of the similarities between the plurality of keywords and the text information is smaller than a similarity threshold value, the text information may be determined to be information unrelated to the target topic. It is understood that the similarity threshold may be set according to actual requirements, which is not limited by the present disclosure.
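The two decision rules for multiple keywords can be sketched as follows. The function name `is_irrelevant`, the `rule` parameter, and the threshold values are hypothetical; the disclosure leaves the threshold to actual requirements.

```python
def is_irrelevant(similarities, threshold, rule="all"):
    """Decide whether a text is unrelated to the topic.

    similarities: one similarity value per keyword.
    rule="all":  unrelated only if every keyword is below the threshold.
    rule="mean": unrelated if the average similarity is below the threshold.
    """
    if not similarities:
        raise ValueError("at least one similarity value is required")
    if rule == "all":
        return all(s < threshold for s in similarities)
    if rule == "mean":
        return sum(similarities) / len(similarities) < threshold
    raise ValueError(f"unknown rule: {rule}")
```

With similarities `[0.1, 0.6]` and a threshold of 0.5, the "all" rule keeps the text (one keyword matches well), while the "mean" rule flags it, so the "all" rule is the more lenient of the two.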
According to the embodiment, the keywords are extracted from the topics, and the information irrelevant to the target topic is screened from the text information according to the similarity between the extracted keywords and the text information, so that the relevance between the topic and the topic content can be automatically judged. Therefore, low-quality information irrelevant to the topic can be conveniently mined from the information aiming at the topic, the quality of the information under the topic is improved, and the user experience is improved.
Fig. 3 schematically illustrates a schematic diagram of obtaining a target topic according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, when the target topic is obtained, non-generalized topics can be obtained from a topic library as target topics, which improves the accuracy and effectiveness of the topic content review. As shown in fig. 3, operation S311 may be performed to obtain a plurality of topics from a predetermined topic library, and non-generalized topics are then picked from the plurality of topics.
According to the embodiment of the present disclosure, when a non-generalized topic is selected from a plurality of topics, for example, lexical analysis may be performed on the topic name of each of the plurality of topics, so as to obtain a target word for each topic. And determining whether each topic is a non-generalization topic according to the obtained target words, thereby picking out the target topic from the plurality of topics.
Illustratively, the lexical analysis of the topic name of each topic may be implemented by performing operation S312, that is, segmenting the topic name into words and performing part-of-speech tagging. The embodiment may adopt a lexical analysis tool to analyze the topic name, perform word segmentation and part-of-speech tagging, obtain the nouns and verbs included in the topic name from the part-of-speech tagging results, and use those nouns and verbs as the target words for the topic.
Illustratively, the lexical analysis tool may be, for example, Baidu's lexical analysis tool (Lexical Analysis of Chinese 2.0, LAC 2.0), the Language Technology Platform (LTP), the Tsinghua University lexical analyzer (THU Lexical Analyzer for Chinese), and the like. Such a lexical analysis tool can perform Chinese word segmentation, part-of-speech tagging, proper noun recognition and the like. The types of lexical analysis tool listed here are merely examples to facilitate understanding of the present disclosure, and any type of lexical analysis tool may be used according to actual needs, which is not limited by the present disclosure.
After the target words of each topic are obtained, as shown in fig. 3, operation S313 may be performed to determine whether the target words of the topic include a proper noun. If they do, operation S315 is performed to determine that the topic is a target topic. If they do not, operation S314 is performed to determine whether the number of nouns included in the target words of the topic is greater than or equal to a third predetermined value, or whether the number of verbs included in the target words is greater than or equal to the third predetermined value. If either the number of nouns or the number of verbs is greater than or equal to the third predetermined value, operation S315 is performed to determine that the topic is a target topic. If the number of nouns and the number of verbs are both smaller than the third predetermined value and the target words include no proper noun, the process returns to analyze another topic obtained from the predetermined topic library and determine whether it is a target topic, until all target topics in the predetermined topic library have been selected. The third predetermined value may be set according to actual requirements, for example, to 3, which is not limited by the present disclosure.
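The screening logic of operations S313 to S315 can be sketched as follows, assuming a lexical analysis tool has already produced (word, tag) pairs for the topic name. The tag strings "proper_noun", "noun" and "verb" and the function name are hypothetical; real tools use their own tag sets.

```python
def is_non_generalized(tagged_words, third_value=3):
    """Return True if the topic should be kept as a target topic.

    tagged_words: list of (word, tag) pairs for one topic name.
    third_value:  the "third predetermined value" (3 in the example above).
    """
    tags = [tag for _, tag in tagged_words]
    # S313: any proper noun makes the topic a target topic.
    if "proper_noun" in tags:
        return True
    # S314: otherwise, enough nouns or enough verbs is sufficient.
    return tags.count("noun") >= third_value or tags.count("verb") >= third_value


# Hypothetical topic library with pre-tagged names, for illustration only.
topic_library = {
    "t1": [("Beijing", "proper_noun"), ("marathon", "noun")],
    "t2": [("food", "noun"), ("share", "verb")],
}
target_topics = [tid for tid, words in topic_library.items() if is_non_generalized(words)]
```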
Fig. 4 schematically shows a schematic diagram of acquiring text information belonging to a target topic according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, after the target topic is obtained, information for the target topic can be obtained, and text information can be extracted from the obtained information.
Illustratively, the embodiment 400 may obtain information for a target topic from the information repository 420 indexed by the topic identification 410 of the target topic. The topic identification 410 may be information uniquely indicating a target topic, such as a topic name. Information for all topics is maintained in the information repository 420.
Illustratively, the information for the target topic may include, for example, various types of information such as text, pictures, videos, and so on. When the text information for the target topic is acquired, information that includes only pictures and/or videos (i.e. information that includes no text) can be filtered out of the acquired information, and the remaining information is taken as the image-text information 430 for the target topic. When information that includes no text is filtered out, for example, the storage format of the information may be identified, and information stored in a picture format or a video format may be filtered out.
According to the embodiment of the disclosure, after the image-text information for the target topic is acquired, a text field can be extracted from the image-text information. When the text field is extracted, as shown in fig. 4, a downstream service 440 may be queried according to the identification of the image-text information to obtain the detail content 450 of the image-text information from the downstream service 440. The downstream service may be, for example, a database that stores the detail content 450 indexed by the identification of the image-text information. After the detail content is obtained, a text field 460 in the detail content, which may be, for example, a content field, may be identified and extracted. After the text field 460 is extracted, special fields in the text field 460 may be removed, and the text field with the special fields removed is used as the text information 470 for the target topic.
Illustratively, the special fields may include, for example, special characters, that is, characters other than Chinese characters and the twenty-six English letters, as well as fields between a pair of special characters. The special fields may be set according to actual requirements, which is not limited by the present disclosure. When the special fields are removed, special character recognition may first be performed on the text field to obtain all the special characters included in the text field. Two identical special characters that are adjacent in position among all the special characters are then taken as a pair of special characters.
According to the embodiment, the text information is extracted from the image-text information, and the relevance check against the target topic is carried out only on the text information, so that the accuracy of the check can be improved.
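One possible reading of the special-field removal is sketched below. It assumes that a special character is any character that is not a CJK ideograph in the range U+4E00 to U+9FFF, an ASCII letter, or whitespace (so digits and punctuation count as special), and that a paired field contains no other special character. Both assumptions, and the function name, are illustrative rather than taken from the disclosure.

```python
import re

# Assumption: anything outside CJK ideographs, ASCII letters and whitespace is "special".
ORDINARY = r"\u4e00-\u9fffA-Za-z\s"
SPECIAL_CHAR = re.compile(rf"[^{ORDINARY}]")
# A field enclosed by two identical special characters with only ordinary text between them.
PAIRED_FIELD = re.compile(rf"([^{ORDINARY}])[{ORDINARY}]*\1")


def remove_special_fields(text_field: str) -> str:
    """Strip paired special fields first, then any remaining special characters."""
    text = PAIRED_FIELD.sub("", text_field)
    text = SPECIAL_CHAR.sub("", text)
    return " ".join(text.split())


cleaned = remove_special_fields("好天气 #话题# hello!")
```

Removing the paired fields before the stray characters matters: if the single characters were stripped first, the delimiters would disappear and the enclosed field (such as a topic tag) would stay in the text.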
FIG. 5 schematically illustrates a schematic diagram of deriving keywords for a target topic according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the keywords for a target topic may be extracted from the topic name of the target topic. As shown in fig. 5, for example, operation S531 may be performed to extract proper nouns from the topic name as keywords, since proper nouns generally represent unique persons or objects and can therefore characterize the target topic well.
Illustratively, the lexical analysis tool described above may be employed to analyze the topic names of the target topics to obtain proper nouns in the topic names, for example.
After the proper nouns are obtained, for example, operation S532 may be further performed to determine whether the number of proper nouns is less than a first predetermined value. If the number of proper nouns is less than the first predetermined value, words other than the proper nouns may be extracted from the topic name, and both the proper nouns and the other words are used as the keywords. If the number is greater than or equal to the first predetermined value, no other words need to be extracted, and the extracted proper nouns are used as the finally determined keywords. The first predetermined value may be set according to actual requirements, which is not limited by the present disclosure.
According to an embodiment of the present disclosure, when extracting other words than proper nouns, for example, operation S533 may be performed to implement. In operation S533, words whose weights satisfy a predetermined condition are extracted from the topic names using a word weight calculation algorithm as keywords for the target topic.
Illustratively, the word weight calculation algorithm may include, for example, the TF-IDF algorithm, the TextRank text ranking algorithm, an information gain algorithm, a Conditional Random Field (CRF) model, or the term importance operator of the Baidu natural language processing cloud platform, etc. The words satisfying the predetermined condition may be words whose calculated weight is greater than a predetermined weight. Alternatively, the words satisfying the predetermined condition may be a predetermined number of words with the highest calculated weights. The predetermined number may be determined, for example, according to the number of proper nouns, so that the sum of the number of proper nouns and the predetermined number is a fixed value, which may be, for example, the aforementioned first predetermined value.
According to the embodiment, words whose weights satisfy the predetermined condition are extracted when the number of proper nouns is small, so that the keywords extracted from the topic name express the topic more fully, which improves the accuracy of the finally determined similarity between the text information and the target topic. Therefore, the accuracy of the finally determined irrelevant information can be improved, and relevant information is less likely to be mistakenly flagged.
According to the embodiment of the present disclosure, after extracting keywords from the topic names, for example, the keywords may also be expanded, and the expanded words may be used as keywords for the target topic. Therefore, the situation that the similarity is low due to the existence of different words representing the same meaning when the similarity is determined is avoided, and therefore the accuracy of the determined similarity between the keywords and the text information can be further improved.
Illustratively, the expansion of the keywords may be achieved through operation S534. In operation S534, a proximity algorithm is used to determine neighboring words of each word whose weight satisfies the predetermined condition, and the neighboring words are taken as keywords for the target topic. The proximity algorithm may include, for example, the neighboring word expansion operator provided by the Baidu natural language processing cloud platform, the K-nearest neighbor (KNN) classification algorithm, and so on.
Illustratively, the present disclosure may maintain, for example, a word bank, and when determining neighboring words, a word whose weight satisfies a predetermined condition may be used as a center, and a predetermined number of words in the word bank that are closest to the word whose weight satisfies the predetermined condition may be determined as neighboring words using a proximity algorithm. The predetermined number may be set according to actual requirements, which is not limited by this disclosure.
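The neighbor lookup can be sketched as a brute-force nearest-neighbor search. This assumes the word bank stores one vector per word and that "closest" means highest cosine similarity; both are illustrative assumptions, and the names `cosine` and `nearest_words` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_words(center, word_bank, k):
    """Return the k words in word_bank closest to the center word."""
    center_vec = word_bank[center]
    candidates = [w for w in word_bank if w != center]
    candidates.sort(key=lambda w: cosine(center_vec, word_bank[w]), reverse=True)
    return candidates[:k]


# Hypothetical two-dimensional word vectors, for illustration only.
bank = {"marathon": [1.0, 0.0], "race": [0.9, 0.1], "banana": [0.0, 1.0]}
neighbors = nearest_words("marathon", bank, 1)
```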
With the above method, when the number of proper nouns is greater than or equal to the first predetermined value, the extracted proper nouns may be used as the keyword for the target topic, thereby completing operation S535. When the number of proper nouns is less than the first predetermined value, the extracted proper nouns and words whose weights satisfy the predetermined condition may be taken as the keywords for the target topic, or the extracted proper nouns, words whose weights satisfy the predetermined condition, and neighboring words to words whose weights satisfy the predetermined condition may be taken as the keywords for the target topic, thereby completing operation S535.
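Operations S531 to S535 can be combined into one sketch. It assumes that proper nouns, word weights and neighboring words have already been computed by the tools mentioned above, and it uses the variant in which the highest-weighted words fill the list up to the first predetermined value. The function and parameter names are hypothetical.

```python
def select_keywords(proper_nouns, word_weights, first_value, neighbors=None):
    """Assemble the keywords for a target topic.

    proper_nouns: proper nouns found in the topic name (S531).
    word_weights: mapping of word -> weight from a word weight algorithm (S533).
    first_value:  the "first predetermined value" (S532).
    neighbors:    optional mapping of word -> neighboring words (S534).
    """
    keywords = list(proper_nouns)
    if len(keywords) >= first_value:
        return keywords  # Enough proper nouns; nothing else is extracted.
    needed = first_value - len(keywords)
    candidates = sorted(
        (w for w in word_weights if w not in proper_nouns),
        key=lambda w: word_weights[w],
        reverse=True,
    )
    weighted = candidates[:needed]
    keywords.extend(weighted)
    # Optionally expand each weighted word with its neighboring words.
    for word in weighted:
        for neighbor in (neighbors or {}).get(word, []):
            if neighbor not in keywords:
                keywords.append(neighbor)
    return keywords
```

For instance, `select_keywords(["Beijing"], {"marathon": 0.9, "runs": 0.4}, 2, {"marathon": ["race"]})` yields the proper noun, the highest-weighted remaining word, and that word's neighbor.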
Fig. 6 schematically illustrates a schematic diagram of determining information in text information that is not relevant to a target topic, according to an embodiment of the disclosure.
According to the embodiment of the disclosure, when determining information irrelevant to the target topic in the text information, a matching similarity model may be selected, for example, according to the length of the text information. The reason is that different similarity models are suited to texts of different lengths. Selecting the similarity model according to the length of the text information, and calculating the similarity between the text information and the keywords with the selected model, can improve the accuracy of the determined information irrelevant to the target topic.
Illustratively, as shown in fig. 6, when determining the similarity between the keywords and the text information and determining whether the text information is related to the target topic according to the similarity, the text information may first be divided into sentences through operation S601, and the number of sentences included in the text information may be determined. When the text information is divided into sentences, for example, punctuation marks in the text information may be recognized first, and the positions of specific punctuation marks are used as sentence division points. The specific punctuation marks may include, for example, the sentence-end symbols ".", "!", "?", and "-", and may also include the sentence separators "," and ";", and the like. It is to be understood that this method of dividing the text information into sentences is only an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
After the number of sentences is obtained, a similarity model having a mapping relationship with the number of sentences can be determined, so that the similarity between the keywords for the target topic and the text information can be determined. The embodiment of the disclosure may maintain the mapping relationship between the number of sentences and the similarity model in advance. The type and number of similarity models in the mapping relationship can be set according to actual requirements, which is not limited by the disclosure. For example, three similarity models may be set according to the number of sentences, to determine the similarity between a long sentence and a keyword, the similarity between a medium sentence and a keyword, and the similarity between a short sentence and a keyword, respectively.
For example, two similarity models may be set for determining the similarity between the keywords and a longer sentence and a shorter sentence, respectively. In this embodiment, after the number of sentences is obtained, operation S602 may be performed first to determine whether the number of sentences is smaller than a second predetermined value. If the number is smaller than the second predetermined value, the text information is determined to be a short sentence; otherwise, the text information is determined to be a long sentence. The second predetermined value may be set according to actual requirements; for example, the second predetermined value may be 4, which is not limited by the present disclosure.
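Operation S601 can be sketched with a regular expression that splits at punctuation marks. This is only one possible illustration: the exact punctuation set in `SPLIT_PUNCTUATION` (ASCII and full-width sentence-end symbols and separators) and the function names are assumptions for the example.

```python
import re

# Hypothetical punctuation set: sentence-end symbols and sentence separators,
# in both full-width and ASCII forms.
SPLIT_PUNCTUATION = "。！？，；.!?,;"
_SPLIT_RE = re.compile("[" + re.escape(SPLIT_PUNCTUATION) + "]+")


def split_sentences(text):
    """Split text at the chosen punctuation marks and drop empty pieces."""
    return [part.strip() for part in _SPLIT_RE.split(text) if part.strip()]


def count_sentences(text):
    """Number of sentences contained in the text (operation S601)."""
    return len(split_sentences(text))


print(split_sentences("Great match! Who scored? I missed it, sadly."))
```

Because separators such as the comma are also treated as division points, a single grammatical sentence may be counted as several sentences under this scheme.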
When it is determined that the text information is a long sentence, operation S603 is performed, and the similarity model is determined to be a similarity model based on a General Regression Neural Network (GRNN).
When the text information is determined to be a short sentence, operation S604 is performed to determine that the similarity model is a similarity model based on a Latent Dirichlet Allocation algorithm (LDA).
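The two-model selection of operations S602 to S604 reduces to a single comparison. The sketch below returns only a label for the selected model; the models themselves are not implemented, and the function name `select_similarity_model` is hypothetical. The default threshold of 4 is the example value given in the text.

```python
SECOND_PREDETERMINED_VALUE = 4  # example value mentioned in the description


def select_similarity_model(sentence_count, threshold=SECOND_PREDETERMINED_VALUE):
    """Return the label of the similarity model mapped to the sentence count.

    Fewer sentences than the threshold -> short text -> LDA-based model (S604).
    Otherwise                          -> long text  -> GRNN-based model (S603).
    """
    return "LDA" if sentence_count < threshold else "GRNN"


for n in (1, 3, 4, 10):
    print(n, select_similarity_model(n))
```

A mapping with three or more models, as also contemplated in the description, could replace the single comparison with a sorted list of thresholds.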
After the similarity model is determined, the similarity model can be used to determine the similarity between the text information and each keyword in the keywords for the target topic. According to the embodiment of the present disclosure, when there are a plurality of text information for the target topic, for each of the plurality of text information, a similarity model may be determined through operations S601 to S604, and a similarity with each of the keywords for the target topic is calculated using the determined similarity model.
After calculating the similarity using the similarity model based on the generalized regression neural network, operation S605 may be performed to determine whether the similarity between the text information and each keyword is less than a first similarity threshold, and if the similarities with all the keywords on the target topic are less than the first similarity threshold, operation S607 is performed to determine that the text information is not related to the target topic. If the similarity between the text information and a keyword for the target topic is greater than or equal to the first similarity threshold, operation S608 is performed to determine that the text information is related to the target topic.
Similarly, after the similarity is calculated by using the similarity model based on the latent dirichlet allocation algorithm, operation S606 may be performed to determine whether the similarity between the text information and each keyword is smaller than a second similarity threshold, and if the similarities with all the keywords on the target topic are smaller than the second similarity threshold, operation S607 is performed to determine that the text information is not related to the target topic. If the similarity between the text information and a keyword for the target topic is greater than or equal to the second similarity threshold, operation S608 is performed to determine that the text information is related to the target topic.
It is to be understood that the first similarity threshold and the second similarity threshold may be set according to actual requirements. For example, in one embodiment, the first similarity threshold and the second similarity threshold may be any values close to 0; e.g., the first similarity threshold is 10^-5 and the second similarity threshold is 10^-2. The values of the first similarity threshold and the second similarity threshold are not limited in this disclosure.
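The decision made in operations S605 to S608 can be sketched as follows, reading the example thresholds as 10^-5 for the GRNN-based model and 10^-2 for the LDA-based model. The similarity values are assumed to have been computed already; the names `THRESHOLDS` and `is_irrelevant` are hypothetical.

```python
# Example thresholds from the description, keyed by a hypothetical model label.
THRESHOLDS = {"GRNN": 1e-5, "LDA": 1e-2}


def is_irrelevant(similarities, model_name, thresholds=THRESHOLDS):
    """Return True if the text is unrelated to the target topic.

    similarities -- similarity between the text and each keyword
    The text is unrelated only if every similarity is below the threshold;
    a single similarity at or above the threshold makes the text related.
    Note that an empty list of similarities yields True.
    """
    threshold = thresholds[model_name]
    return all(s < threshold for s in similarities)


print(is_irrelevant([1e-6, 0.0], "GRNN"))  # every value is below 1e-5
print(is_irrelevant([1e-6, 0.3], "GRNN"))  # one value reaches the threshold
```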
Based on the method for processing information described above, the present disclosure also provides an apparatus for processing information. The apparatus will be described in detail below with reference to fig. 7.
Fig. 7 is a block diagram of a structure of an apparatus for processing information according to an embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 for processing information of this embodiment may include an information acquisition module 710, a keyword extraction module 730, a similarity determination module 750, and an information determination module 770.
The information obtaining module 710 is configured to obtain a target topic and text information for the target topic. In an embodiment, the information obtaining module 710 may be configured to perform the operation S210 described above, for example, and is not described herein again.
The keyword extraction module 730 is configured to extract keywords from the target topic to obtain keywords for the target topic. In an embodiment, the keyword extraction module 730 may be configured to perform the operation S230 described above, for example, and is not described herein again.
The similarity determination module 750 is used for determining the similarity between the keywords and the text information for the target topic. In an embodiment, the similarity determining module 750 may be configured to perform the operation S250 described above, for example, and is not described herein again.
The information determining module 770 is configured to determine information, which is irrelevant to the target topic, in the text information according to the similarity. In an embodiment, the information determining module 770 may be configured to perform the operation S270 described above, for example, and will not be described herein again.
According to an embodiment of the present disclosure, the keyword extraction module 730 may be configured to analyze a topic name of a target topic by using a lexical analysis tool, to obtain a proper noun in the topic name, as a keyword for the target topic; and extracting words with weights meeting a predetermined condition from the topic names of the target topic by adopting a word weight calculation algorithm as keywords for the target topic in the case that the number of proper nouns in the target topic is less than a first predetermined value.
According to an embodiment of the present disclosure, the keyword extraction module 730 may be further configured to determine neighboring words for words whose weights satisfy predetermined conditions by using a neighboring algorithm, for example, as the keywords for the target topic.
According to an embodiment of the present disclosure, the similarity determining module 750 may include, for example, a sentence number determining sub-module, a model determining sub-module, and a similarity determining sub-module. The sentence number determining submodule is used for determining the number of sentences included in the text information. And the model determining submodule is used for determining a similarity model with a mapping relation with the number of the sentences. The similarity determining submodule is used for determining the similarity between the keywords aiming at the target topic and the text information by adopting a similarity model.
According to an embodiment of the present disclosure, the model determining submodule is specifically configured to determine that the similarity model is a similarity model based on a generalized regression neural network when the number of sentences is greater than or equal to a second predetermined value; and to determine that the similarity model is a similarity model based on the latent Dirichlet allocation algorithm when the number of sentences is less than the second predetermined value.
According to an embodiment of the present disclosure, the information obtaining module 710 may include, for example, a topic obtaining sub-module, a target word determining sub-module, and a target topic determining sub-module. The topic obtaining sub-module is used for obtaining a plurality of topics in a predetermined topic library. The target word determining sub-module is used for performing lexical analysis on the topic name of each topic in the plurality of topics to obtain a target word for each topic. The target topic determining sub-module is used for determining a target topic in the plurality of topics according to the target words for each topic.
According to an embodiment of the present disclosure, the target word determination sub-module may be configured to analyze the topic name of each topic by using a lexical analysis tool, and obtain a noun and a verb included in the topic name as a target word for each topic. The target topic determination sub-module may determine each topic as a target topic, for example, when a proper noun is included in a target word for each topic; and in the case that the target words for each topic do not include proper nouns, and the number of included nouns or the number of included verbs is greater than or equal to a third predetermined value, determining each topic as the target topic.
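The rule applied by the target topic determination sub-module can be sketched as below. The input is assumed to be the output of a lexical analysis tool, represented here as (word, tag) pairs; the tag names `proper_noun`, `noun`, and `verb` and the function name `is_target_topic` are hypothetical and do not correspond to any particular tool.

```python
def is_target_topic(tagged_words, third_predetermined_value):
    """Decide whether a topic qualifies as a target topic.

    tagged_words -- list of (word, tag) pairs for the topic name, where tag is
                    one of "proper_noun", "noun", "verb" (hypothetical tags)
    """
    tags = [tag for _, tag in tagged_words]
    if "proper_noun" in tags:
        # Any proper noun is sufficient.
        return True
    # Otherwise require enough nouns or enough verbs.
    return (tags.count("noun") >= third_predetermined_value
            or tags.count("verb") >= third_predetermined_value)


print(is_target_topic([("Beijing", "proper_noun"), ("visit", "verb")], 2))
```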
According to an embodiment of the present disclosure, the information obtaining module 710 may include, for example, an image-text information obtaining sub-module, a text field extracting sub-module, and a character eliminating sub-module. The image-text information acquisition submodule is used for acquiring image-text information aiming at the target topic. The text field extraction submodule is used for extracting the text field from the image-text information. The character eliminating submodule is used for eliminating special characters in the text field to obtain text information aiming at the target topic.
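The text field extraction and special character removal can be sketched as follows. The record layout (a dictionary with a `text` field) and the definition of "special character" (anything other than word characters, whitespace, and common punctuation) are assumptions for the example; the disclosure does not fix either.

```python
import re

# Hypothetical definition: keep word characters, whitespace and common
# punctuation; everything else counts as a special character.
_SPECIAL = re.compile(r"[^\w\s.,;!?。，；！？]")


def extract_text_field(image_text_info, field="text"):
    """Return the text field of an image-text record (empty string if absent)."""
    return image_text_info.get(field, "")


def remove_special_characters(text):
    """Strip special characters, then collapse runs of whitespace."""
    cleaned = _SPECIAL.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()


record = {"images": ["a.jpg"], "text": "Great goal!!! ★★ #win @fan"}
print(remove_special_characters(extract_text_field(record)))
```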
According to an embodiment of the present disclosure, the number of keywords for the target topic is multiple, the number of text messages is multiple, and the information determining module 770 may be specifically configured to, for each text message in the multiple text messages, determine that each text message is information unrelated to the target topic when the similarity between each keyword in the multiple keywords and each text message is less than the similarity threshold.
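The filtering over multiple text messages and multiple keywords can be sketched with a pluggable similarity function. The substring-based `toy_similarity` below is only a stand-in so that the example runs; it is not the GRNN-based or LDA-based model of the disclosure, and all names are hypothetical.

```python
def filter_irrelevant(texts, keywords, similarity, threshold):
    """Return the texts whose similarity to every keyword is below threshold."""
    irrelevant = []
    for text in texts:
        if all(similarity(keyword, text) < threshold for keyword in keywords):
            irrelevant.append(text)
    return irrelevant


def toy_similarity(keyword, text):
    """Stand-in similarity: 1.0 if the keyword occurs in the text, else 0.0."""
    return 1.0 if keyword.lower() in text.lower() else 0.0


texts = ["The World Cup final was thrilling", "Buy cheap watches now"]
keywords = ["world cup", "final"]
print(filter_irrelevant(texts, keywords, toy_similarity, 0.5))
```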
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 shows a schematic block diagram of an electronic device 800 that may be used to implement the method of processing information of an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above, such as the method of processing information. For example, in some embodiments, the method of processing information may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When loaded into the RAM 803 and executed by the computing unit 801, the computer program may perform one or more of the steps of the method of processing information described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method of processing information in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A method of processing information, comprising:
acquiring a target topic and text information aiming at the target topic;
extracting keywords from the target topic to obtain keywords aiming at the target topic;
determining similarity between keywords for the target topic and the text information; and
determining information irrelevant to the target topic in the text information according to the similarity.
2. The method of claim 1, wherein extracting keywords from the target topic, and obtaining the keywords for the target topic comprises:
analyzing the topic name of the target topic by adopting a lexical analysis tool to obtain a proper noun in the topic name as a keyword aiming at the target topic;
and under the condition that the number of proper nouns in the topic names is less than a first preset value, extracting words with weights meeting preset conditions from the topic names of the target topics by adopting a word weight calculation algorithm to serve as the keywords aiming at the target topics.
3. The method of claim 1, wherein extracting keywords from the target topic, obtaining keywords for the target topic further comprises:
determining, by adopting a proximity algorithm, neighboring words of the words whose weights satisfy the preset condition, as keywords for the target topic.
4. The method of claim 1, wherein determining a similarity between the keywords for the target topic and the textual information comprises:
determining the number of sentences included in the text information;
determining a similarity model having a mapping relation with the number of sentences; and
determining the similarity between the keywords for the target topic and the text information by adopting the similarity model.
5. The method of claim 4, wherein determining a similarity model having a mapping relationship with the number of sentences comprises:
determining the similarity model to be a similarity model based on a generalized regression neural network under the condition that the number of the sentences is greater than or equal to a second preset value;
determining the similarity model to be a similarity model based on a latent Dirichlet distribution algorithm if the number of sentences is less than the second predetermined value.
6. The method of claim 1, wherein obtaining a target topic comprises:
acquiring a plurality of topics in a preset topic library;
performing lexical analysis on the topic name of each topic in the plurality of topics to obtain a target word for each topic; and
determining a target topic in the plurality of topics according to the target words for each topic.
7. The method of claim 6, wherein:
lexical analysis of the topic name of each of the plurality of topics includes: analyzing the topic name of each topic by adopting a lexical analysis tool to obtain a noun and a verb included in the topic name as a target word aiming at each topic;
determining a target topic of the plurality of topics comprises:
determining each topic as a target topic in the case that a proper noun is included in the target word for each topic;
and determining each topic as a target topic when the target word for each topic does not include a proper noun and the number of included nouns or the number of included verbs is greater than or equal to a third preset value.
8. The method of claim 1, wherein obtaining textual information for the target topic comprises:
acquiring image-text information aiming at the target topic;
extracting text fields from the image-text information; and
removing special fields in the text fields to obtain the text information for the target topic.
9. The method of claim 1, wherein the number of keywords for the target topic is multiple, the number of text messages is multiple, and determining information in the text messages that is not related to the target topic comprises:
for each text information in a plurality of text information, determining each text information as information irrelevant to the target topic when the similarity between each keyword in the plurality of keywords and each text information is less than a similarity threshold value.
10. An apparatus for processing information, comprising:
the information acquisition module is used for acquiring a target topic and text information aiming at the target topic;
the keyword extraction module is used for extracting keywords from the target topic to obtain keywords aiming at the target topic;
the similarity determining module is used for determining the similarity between the keywords aiming at the target topic and the text information; and
the information determining module is used for determining information irrelevant to the target topic in the text information according to the similarity.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 9.
CN202110222722.1A 2021-02-26 2021-02-26 Method, apparatus, device and storage medium for processing information Active CN112926297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110222722.1A CN112926297B (en) 2021-02-26 2021-02-26 Method, apparatus, device and storage medium for processing information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110222722.1A CN112926297B (en) 2021-02-26 2021-02-26 Method, apparatus, device and storage medium for processing information

Publications (2)

Publication Number Publication Date
CN112926297A true CN112926297A (en) 2021-06-08
CN112926297B CN112926297B (en) 2023-06-30

Family

ID=76172599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110222722.1A Active CN112926297B (en) 2021-02-26 2021-02-26 Method, apparatus, device and storage medium for processing information

Country Status (1)

Country Link
CN (1) CN112926297B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193508A1 (en) * 2009-10-02 2015-07-09 Flipboard, Inc. Topical Search System
CN106980870A (en) * 2016-12-30 2017-07-25 中国银联股份有限公司 Text matches degree computational methods between short text
CN109615001A (en) * 2018-12-05 2019-04-12 上海恺英网络科技有限公司 A kind of method and apparatus identifying similar article
CN109871433A (en) * 2019-02-21 2019-06-11 北京奇艺世纪科技有限公司 Calculation method, device, equipment and the medium of document and the topic degree of correlation
CN109918653A (en) * 2019-02-21 2019-06-21 腾讯科技(深圳)有限公司 Determine the association topic of text data and training method, device and the equipment of model
CN110134787A (en) * 2019-05-15 2019-08-16 北京信息科技大学 A kind of news topic detection method
WO2020258662A1 (en) * 2019-06-25 2020-12-30 平安科技(深圳)有限公司 Keyword determination method and apparatus, electronic device, and storage medium
CN111061837A (en) * 2019-12-18 2020-04-24 国网浙江省电力有限公司电力科学研究院 Topic identification method, device, equipment and medium
CN111694958A (en) * 2020-06-05 2020-09-22 深兰人工智能芯片研究院(江苏)有限公司 Microblog topic clustering method based on word vector and single-pass fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王立平;赵晖;: "融合词向量与关键词提取的微博话题发现", 现代计算机, no. 23 *
郭蓝天;李扬;慕德俊;杨涛;李哲;: "一种基于LDA主题模型的话题发现方法", 西北工业大学学报, no. 04 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887217A (en) * 2021-10-20 2022-01-04 美的集团(上海)有限公司 Word vector increment method, electronic device and computer storage medium
CN116578673A (en) * 2023-07-03 2023-08-11 北京凌霄文苑教育科技有限公司 Text feature retrieval method based on linguistic logics in digital economy field
CN116578673B (en) * 2023-07-03 2024-02-09 北京凌霄文苑教育科技有限公司 Text feature retrieval method based on linguistic logics in digital economy field

Also Published As

Publication number Publication date
CN112926297B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
US11093854B2 (en) Emoji recommendation method and device thereof
US11544459B2 (en) Method and apparatus for determining feature words and server
WO2019091026A1 (en) Knowledge base document rapid search method, application server, and computer readable storage medium
US20160299955A1 (en) Text mining system and tool
CN111460083A (en) Document title tree construction method and device, electronic equipment and storage medium
CN112633000B (en) Method and device for associating entities in text, electronic equipment and storage medium
CN113204621B (en) Document warehouse-in and document retrieval method, device, equipment and storage medium
CN107885717B (en) Keyword extraction method and device
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN113660541B (en) Method and device for generating abstract of news video
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN113408660A (en) Book clustering method, device, equipment and storage medium
CN112989235A (en) Knowledge base-based internal link construction method, device, equipment and storage medium
CN112506864A (en) File retrieval method and device, electronic equipment and readable storage medium
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN112528644B (en) Entity mounting method, device, equipment and storage medium
CN112560425B (en) Template generation method and device, electronic equipment and storage medium
CN112307183B (en) Search data identification method, apparatus, electronic device and computer storage medium
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
CN113378015A (en) Search method, search apparatus, electronic device, storage medium, and program product
CN113191145A (en) Keyword processing method and device, electronic equipment and medium
CN112784046B (en) Text clustering method, device, equipment and storage medium
CN115563242A (en) Automobile information screening method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant