CN116521858B - Context semantic sequence comparison method based on dynamic clustering and visualization - Google Patents
- Publication number
- CN116521858B (application CN202310445169.7A)
- Authority
- CN
- China
- Prior art keywords
- context
- word
- sequence
- keywords
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a context semantic sequence comparison method based on dynamic clustering and visualization, and provides the ContextWing system to support pairwise comparison of continuously evolving context sequence patterns between two data streams. The computational model generates dynamic topics and sequence patterns and computes public attention and pairwise relevance. The system also includes a novel multi-layer double-wing metaphor design that intuitively displays sequence patterns merged by shared context, revealing how two sequences differ in time and in semantics. Interactive tools support selecting the center word and its context keywords to iteratively generate patterns for focused exploration. In addition, the system supports analysis in both static and streaming settings and can be applied to a wider range of scenarios.
Description
Technical Field
The invention relates to the technical field of data analysis, in particular to a context semantic sequence comparison method based on dynamic clustering and visualization.
Background
With the rapid development of social media, many people express their views by posting messages and spreading important news. These messages arrive as a data stream, and the collection of tweets containing the same keywords forms a social media data stream. To help social science researchers and public opinion analysts quickly understand large volumes of social media data, it is important to provide opinion summaries of that data. A visual summary of the tweets lets the user grasp the text quickly.
Word clouds are a common way of providing visual summaries of text data. However, a word cloud provides limited contextual information and cannot show the links between keywords that convey the meaning of sentences. We therefore extract the keyword sequences that appear in order in sentences as summaries of the tweets. Since many tweets contain the same sequence, we define such a sequence as a "pattern". Consider, for example, "the election debate starts at 9 on Monday evening" and "the election debate will start on Monday". People phrase things differently, but they all mention the same keywords in the same order, "election-debate-Monday-start", so this frequently occurring semantic sequence is a pattern. Patterns are very diverse, and the differences between them must be compared to understand opinions. Furthermore, since the patterns belong to different time periods, they must also be compared over time. Finally, to help analyze public attitudes, the relationships between patterns and different data streams must be compared. Visualization techniques can be used to support these complex analyses.
Visual comparison of text is a widely studied topic, but there is currently no analysis method that supports comparing the time-varying features and the semantic features of sequences simultaneously across different data streams. First, it is difficult to combine semantic comparison with dynamic comparison in sequence analysis. Some scholars use tree structures to address the challenges of sequence comparison and help people quickly grasp basic concepts and ideas; however, this approach is limited to static text sequence data and does not support temporal comparison. Work that supports time-trend comparison between multiple tag clouds cannot, in turn, support sequence comparison, because the keywords are not connected. It is thus difficult to visualize the temporal and semantic comparison of sequences at the same time. Second, it is challenging to compare semantics and dynamics across different data streams. Some work addresses pairwise visual comparison of multiple items between two data streams, but it still cannot be applied to sequences to display more context and connections. Third, beyond historical social media data, real-time analysis of real-world streaming data is both more challenging and more important, since it requires fast modeling methods and dynamic visualization to reveal features within a short time. In summary, a visualization technique is lacking that supports comparing both the temporal and the semantic sequence patterns of two data streams and supports the analysis of real-time patterns.
Disclosure of Invention
The invention aims to realize simultaneous visualization of time and semantic comparison of text sequences and realize semantic and dynamic comparison among different data streams, and provides a context semantic sequence comparison method based on dynamic clustering and visualization.
To achieve the purpose, the invention adopts the following technical scheme:
providing a context semantic sequence comparison method based on dynamic clustering and visualization, wherein for real-time stream data, continuously updated tweets are dynamically clustered using a dynamic clustering method based on BERTopic and KMeans++, and visual analysis is then performed on the dynamic stream, the visual analysis specifically comprising the following steps:
S1, according to the center word selected by the user, extracting the context keywords of the center word by calculating the similarity between each word in the tweets and the center word; and calculating the public attention between the context keywords and the center word;
S2, calculating and visualizing the degree of association between the context keywords and the two key entities;
S3, according to the center word and its context keyword set, generating semantic sequence patterns by an iterative search method and visualizing them.
Preferably, the method for dynamically clustering continuously updated tweets based on the BERTopic and KMeans++ dynamic clustering method comprises the following steps:
A1, according to the center word given by the user, identifying the context keywords in the continuously updated tweets with the BERTopic model to obtain the context keywords to be clustered at the initial time t;
A2, initializing the clustering at time t with the KMeans++ algorithm; after the first clustering is completed, the cluster centers are passed on to the clustering at time t+1;
A3, judging at each clustering whether the top m context keywords of a cluster at the current time are also present in a cluster from the previous time; if yes, merging the two clusters, sorting the context keywords in the merged cluster by class-based TF-IDF score, and taking the set formed by the top x context keywords as the updated data of the cluster at the current time;
A4, completing the clustering of the context keywords identified at all times by the method of steps A2-A3, and taking the contexts containing the top y context keywords of the finally merged cluster as the object of visual analysis.
Preferably, in step S1, the similarity between the center word and each word in the tweets is calculated by the cosine similarity method, and the top n ranked words are taken as the context keyword set.
Preferably, in step S1, the method of calculating the public attention of the context keyword of the center word includes the steps of:
S11, calculating the public attention degree, the calculation method being expressed by the following formula (1):
in formula (1), $k$ represents the center word selected by the user or the system;
$c$ represents the context keyword;
$n$ represents the total number of tweets in the dataset;
$u_i(c, k)$ is an inclusion condition indicating whether the $i$-th tweet contains both $c$ and $k$; if so, $u_i(c, k) = 1$, otherwise 0;
$u_i(c, \neg k)$ indicates whether the $i$-th tweet contains $c$ but not $k$; if so, $u_i(c, \neg k) = 1$, otherwise 0;
$\eta_i$ indicates whether the $i$-th tweet has been forwarded; if so, $\eta_i = 1$, otherwise 0;
$r_i$ represents the number of times the $i$-th tweet has been forwarded;
S12, visualizing according to the calculated public attention degree.
Preferably, in step S2, the degree of association is expressed by the following formula (2):
In formula (2), the two frequency terms represent the co-occurrence frequency of context keyword $i$ with key entity A and with key entity B, respectively, at time $t$;
rank represents the rank of the co-occurrence frequency difference of context keyword $i$ among all $i \in W_t$;
$n_t$ represents the total number of context keywords of the center word at time $t$;
$W_t$ represents the set of all context keywords of the center word at time $t$.
Preferably, in step S3, the method for generating the semantic sequence pattern includes the steps of:
S31, forming an initial sequence, the initial sequence comprising the center word and a context keyword selected by the user, kept in the order in which they appear in the tweets;
S32, traversing each context keyword in the keyword set and, for each, forming a new semantic sequence by adding that keyword to the initial sequence; finding the keyword whose new sequence has the largest co-occurrence frequency in the tweets; adding the found context keyword to the initial sequence to expand it; and removing that keyword from the keyword set;
S33, taking the new semantic sequence obtained by the expansion in step S32 as the initial sequence, returning to step S31, and continuing to expand the initial sequence from the remaining filtered keyword set until the expanded sequence reaches a preset sequence length, the finally obtained new semantic sequence being the generated semantic sequence pattern.
The ContextWing system provided by the invention supports pairwise comparison of continuously evolving context sequence patterns between two data streams. The computational model generates dynamic topics and sequence patterns and computes public attention and pairwise relevance. The system also includes a novel multi-layer double-wing metaphor design that intuitively displays sequence patterns merged by shared context, revealing how two sequences differ in time and in semantics. Interactive tools support selecting the center word and its context keywords to iteratively generate patterns for focused exploration. In addition, the system supports analysis in both static and streaming settings and can be applied to a wider range of scenarios.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a system interface diagram of social media context text visualization provided by an embodiment of the present invention;
FIG. 2 is an enlarged view of the interface of the topic view shown in area A of FIG. 1;
FIG. 3 is an enlarged view of an interface of a control view shown in area B of FIG. 1;
FIG. 4 is a partial enlarged view of the pattern view shown in area C of FIG. 1;
FIG. 5 is a histogram of the number of tweets displayed in the area a1 of FIG. 2;
FIG. 6 is an interface schematic of the dynamic word cloud shown in area a2 of FIG. 2;
FIG. 7 is an enlarged interface view of a detail view of the original tweet shown in area D of FIG. 1;
FIG. 8 is a schematic diagram of the visual metaphor design provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of a semantic merging method of a visual metaphor;
FIG. 10 is a system architecture flow diagram of a visual analysis interface;
FIG. 11 is an example diagram of a topic view.
Detailed Description
The technical scheme of the invention is further described below by the specific embodiments with reference to the accompanying drawings.
The drawings are for illustrative purposes only; they are schematic representations rather than physical drawings and are not to be construed as limiting the present patent. To better illustrate the embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced, and they do not represent the size of the actual product. Those skilled in the art will appreciate that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if the terms "upper", "lower", "left", "right", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, only for convenience in describing the present invention and simplifying the description, rather than indicating or implying that the apparatus or elements being referred to must have a specific orientation, be constructed and operated in a specific orientation, so that the terms describing the positional relationships in the drawings are merely for exemplary illustration and should not be construed as limiting the present patent, and that the specific meaning of the terms described above may be understood by those of ordinary skill in the art according to specific circumstances.
In the description of the present invention, unless explicitly stated and limited otherwise, the term "coupled" or the like should be interpreted broadly, as it may be fixedly coupled, detachably coupled, or integrally formed, as indicating the relationship of components; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between the two parts or interaction relationship between the two parts. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The embodiment of the invention provides a context semantic sequence comparison method based on dynamic clustering and visualization, which comprises the following analysis processes:
The present invention proposes a social media context visualization system named ContextWing, shown in fig. 1, which combines an integrated computational model with a novel visual design built on a symmetrical wing structure. It connects sequences that share the same center word (e.g., "china" in fig. 1), merges the sequences on shared context keywords (e.g., "ukraine", "trademark", etc. in fig. 1), and distinguishes the merged sequences by color and paired layers to show their semantic differences and similarities more clearly. The sequences are arranged vertically from top to bottom, corresponding to different time periods. Keywords in a pattern are connected from left to right in the order in which they appear in the text. Meanwhile, the positions and colors of the different layers encode the correlation between the semantic information and the key entities (such as "person A" and "person B" in a social media event), so that people's leanings toward the key entities can be read by comparing the relation between the semantics and the paired entities. The visual design therefore lets the user make paired visual comparisons of the temporal and semantic characteristics of the context at the same time, overcoming the limitations of word clouds and word trees.
The system interface shown in fig. 1 includes four regions, A, B, C and D, which correspond respectively to the four views shown in figs. 2-4 and fig. 7: the topic view, the control view, the pattern view and the detail view of the original tweets. The topic view further includes the histogram of the number of tweets shown in fig. 5 and the dynamic word cloud shown in fig. 6. The user may select different center words and context keywords in the topic view displayed in region A (fig. 2) to generate semantic sequence patterns, may inspect the selected center words and context keywords in the control view displayed in region B (fig. 3), and may reset the selected center words and/or context keywords, or return to the topic view of fig. 2, through the reset and return functions embedded in the control view. The pattern view (i.e., the visual wing metaphor design) displayed in region C (fig. 4) is used to visualize the generated semantic sequence patterns.
How to generate semantic sequence patterns, how to analyze patterns in different data streams in real time, and how to visualize the analysis results are the key technical content of the invention. The principles behind these key techniques are explained in the following three major parts:
1. Building a computing model and data flow pattern analysis
The built calculation model mainly bears the following calculation functions: keyword classification computation, pairwise correlation computation, public attention computation, context semantic sequence pattern generation, and data analysis of different data streams according to the generated patterns.
1. Keyword classification calculation
In a static setting (meaning the data are historical and not updated in real time), Word2Vec is used to obtain a vector for each word in the original tweets. (Word2Vec is a neural network model that converts words into vectors; the conversion reduces the processing of text content to vector operations in a vector space, where similarity between vectors represents semantic similarity between texts.) The cosine similarity of the vectors is calculated to find keywords similar to a center word given by the user; a higher similarity indicates a higher semantic correlation between the two words. Because the given tweets are historical data, the center words can be assigned using prior knowledge, so that the clustering results better match experts' expectations. Since each cluster typically contains a large number of words, the invention retains the top n most frequent words as the keywords for visualization. Similarly, the context keywords of each center word are extracted based on the cosine similarity of the vectors, keeping the top n words with the highest similarity. To keep the topic view clear when visualized, the top 20-30 context keywords are generally selected.
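The top-n selection by cosine similarity described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `cosine` and `top_n_context_keywords` and the three-dimensional toy vectors are hypothetical stand-ins for real Word2Vec output.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_context_keywords(center, vectors, n):
    """Return the n words most similar to `center`, most similar first."""
    center_vec = vectors[center]
    scored = [(cosine(center_vec, vec), word)
              for word, vec in vectors.items() if word != center]
    # Sort by descending similarity; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:n]]

# Toy vectors standing in for Word2Vec output (illustrative values only).
toy_vectors = {
    "china":   [1.0, 0.0, 0.0],
    "trade":   [0.9, 0.1, 0.0],
    "ukraine": [0.6, 0.8, 0.0],
    "weather": [0.0, 0.0, 1.0],
}
keywords = top_n_context_keywords("china", toy_vectors, n=2)
print(keywords)
```

In practice n would be in the 20-30 range mentioned above, and the vectors would come from a trained embedding model.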
The most straightforward way to quantify the relationship between a center word and its context keywords is to calculate their co-occurrence frequency in the text. In practice, however, we find that the information presented by co-occurrence frequency alone is limited. To improve the subsequent pairwise comparison of sequences, the invention proposes using public attention to represent how closely the center word and a context keyword are related. The proposed public attention is used to calculate the distance between the center word and a context keyword, and this distance can accurately reflect the popularity of forwarded tweets. The public attention is expressed by the following formula (2):
In formula (2), $k$ represents the center word selected by the user or the system;
$c$ represents a context keyword;
$n$ represents the total number of tweets in the dataset;
$u_i(c, k)$ is an inclusion condition indicating whether the $i$-th tweet contains both $c$ and $k$; if so, $u_i(c, k) = 1$, otherwise 0;
$u_i(c, \neg k)$ indicates whether the $i$-th tweet contains $c$ but not $k$; if so, $u_i(c, \neg k) = 1$, otherwise 0;
$\eta_i$ indicates whether the $i$-th tweet has been forwarded; if so, $\eta_i = 1$, otherwise 0;
$r_i$ represents the number of times the $i$-th tweet has been forwarded.
In formula (2), both the numerator and the denominator are empirical estimates of the number of forwards under the respective inclusion conditions. This helps describe the distance between the center word and each context keyword: if the value is positive, c and k are more closely related and the public attention is higher; if it is negative, they are less closely related and the public attention is lower.
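The formula itself is not reproduced in this text, so the sketch below should be read only as one form that is consistent with the symbol definitions and the sign behavior described above. It assumes a log-ratio of smoothed average forwarding counts between tweets that contain c and k and tweets that contain c but not k. The log-ratio form, the `smoothing` parameter and the function name `public_attention` are all assumptions.

```python
import math

def public_attention(tweets, c, k, smoothing=1.0):
    """ASSUMED log-ratio form; not the patent's formula, which is not given here.

    tweets: list of (set_of_words, forward_count) pairs.
    """
    with_k_sum = with_k_cnt = without_k_sum = without_k_cnt = 0
    for words, forwards in tweets:
        if c not in words:
            continue
        eta = 1 if forwards > 0 else 0      # eta_i: was the tweet forwarded at all?
        if k in words:                      # u_i(c, k) = 1
            with_k_sum += eta * forwards
            with_k_cnt += 1
        else:                               # u_i(c, not k) = 1
            without_k_sum += eta * forwards
            without_k_cnt += 1
    # ASSUMPTION: additive smoothing keeps both averages positive and finite.
    avg_with = (with_k_sum + smoothing) / (with_k_cnt + smoothing)
    avg_without = (without_k_sum + smoothing) / (without_k_cnt + smoothing)
    return math.log(avg_with / avg_without)

# Toy data (illustrative only): word sets with forwarding counts.
toy_tweets = [
    ({"china", "trade"}, 10),
    ({"china", "trade"}, 4),
    ({"trade", "tariff"}, 1),
    ({"trade"}, 0),
]
close = public_attention(toy_tweets, c="trade", k="china")
distant = public_attention(toy_tweets, c="trade", k="tariff")
print(close > 0, distant < 0)
```

With this toy data, tweets mentioning "trade" together with "china" are forwarded more than those mentioning "trade" alone, so the value is positive; for "tariff" the opposite holds and the value is negative.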
2. Paired correlation computation
Each event has two key entities, which are the focus of the discussion and strongly influence the trend of public opinion. The invention quantifies the relatedness of the two data streams to a keyword according to their co-occurrence frequency (i.e., the "degree of association"), with a separate value for each data stream.
The invention proposes a calculation method for this degree of association, which is given by the following formula (1):
In formula (1), the two frequency terms represent the co-occurrence frequency of context keyword $i$ with key entity A and with key entity B, respectively, at time $t$;
Rank represents the rank of the co-occurrence frequency difference of context keyword $i$ among all $i \in W_t$;
$N_t$ represents the total number of context keywords of the center word at time $t$;
$W_t$ represents the set of context keywords of the center word at time $t$.
If the value for key entity A is close to 1, the context keyword i is more relevant, at time t, to key entity A or to the data stream in which key entity A is located.
The calculation is exemplified as follows:
For example, suppose the context keyword i is "apple", its co-occurrence frequency with key entity A (e.g., name A) is 10, and its co-occurrence frequency with key entity B (e.g., name B) is 5, so the co-occurrence frequency difference is 5. Assuming there are 4 words other than "apple" with such co-occurrence frequency differences, and that the difference for "apple" ranks second when the differences are sorted from large to small, the degree of association of "apple" with key entity A follows from formula (1).
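The difference-and-rank part of the example above can be sketched as below. The frequency differences and their ranking follow the text; the final score, however, uses a linear rank normalization that is purely an assumption, because formula (1) is not reproduced here. The function name `rank_relevance` and the toy frequencies for words other than "apple" are hypothetical.

```python
def rank_relevance(freq_a, freq_b):
    """Map each context keyword to (difference, rank, assumed_score).

    freq_a / freq_b: co-occurrence frequency of each keyword with entity A / B.
    """
    diffs = {w: freq_a[w] - freq_b.get(w, 0) for w in freq_a}
    # Rank 1 is the largest difference; ties are broken alphabetically.
    ordered = sorted(diffs, key=lambda w: (-diffs[w], w))
    n_t = len(ordered)
    result = {}
    for position, word in enumerate(ordered, start=1):
        # ASSUMPTION: linear normalization so that rank 1 maps to 1.0.
        assumed_score = (n_t - position + 1) / n_t
        result[word] = (diffs[word], position, assumed_score)
    return result

# "apple" uses the frequencies from the text; the other four words are made up.
freq_a = {"apple": 10, "tax": 12, "trade": 6, "jobs": 3, "border": 1}
freq_b = {"apple": 5, "tax": 2, "trade": 5, "jobs": 4, "border": 6}
relevance = rank_relevance(freq_a, freq_b)
print(relevance["apple"][:2])
```

Here "apple" has difference 5 and rank 2 among the five keywords, matching the setup of the example; the third tuple element depends entirely on the assumed normalization.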
3. Generating context semantic sequence patterns
To summarize the information of the original tweets more concisely, the invention specifies that a semantic sequence consists of verbs, nouns and adjectives, and that the sequence length may be 4 (4 words) or adjusted further. A sequence that occurs repeatedly is a sequence pattern. Generating a pattern is a search process, which proceeds as follows:
Assume that the user-selected center word and context keyword are denoted centralkeyword and w, respectively. After the relevance is calculated by formula (1), the context keywords whose relevance to the center word ranks in the top n form a keyword set. First, an initial sequence is formed that includes the center word centralkeyword and the context keyword w, forming a two-tuple. The order of centralkeyword and w in the initial sequence follows their order of appearance in the tweets, and is either w-centralkeyword or centralkeyword-w.
Then, each context keyword in the keyword set is traversed: each is tentatively added to the initial sequence, and the keyword whose resulting new semantic sequence has the largest co-occurrence frequency in the tweets is selected. That keyword is taken out of the keyword set and added to the initial sequence, which changes from a two-tuple to a triple, thereby expanding the initial sequence. To set the coverage of the semantic sequence over the tweet text more flexibly, the invention also adds a skip value: the position of a newly added context keyword relative to the keywords in the current tuple is allowed to vary within the skip value range. Further, the skip value is set according to the sequence length l. For example, assuming a sequence length l of 20 and a skip value of 11, if the position of the word most recently added to the current tuple is 5, the position of the newly added context keyword is allowed to range from position 1 to position 16. By setting the skip value, the relative distance between keywords is adjusted to the length of the text, so that the coverage of the semantic sequence over the tweet text is set more flexibly.
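The greedy expansion described above can be sketched as follows. This is a simplified illustration under stated assumptions: co-occurrence frequency is approximated by the number of tweets that contain the candidate sequence as an in-order subsequence, and the skip value is ignored (any gap is allowed). The names `is_subsequence`, `support` and `generate_pattern` are hypothetical.

```python
def is_subsequence(seq, tokens):
    """True if the words of `seq` appear in `tokens` in the same order (gaps allowed)."""
    it = iter(tokens)
    return all(word in it for word in seq)

def support(seq, tweets):
    """Number of tweets containing `seq` as an in-order subsequence."""
    return sum(1 for tokens in tweets if is_subsequence(seq, tokens))

def generate_pattern(tweets, center, keyword, keyword_set, length=4):
    # Initial two-word sequence, ordered as the pair most often appears in the tweets.
    forward, backward = [center, keyword], [keyword, center]
    seq = forward if support(forward, tweets) >= support(backward, tweets) else backward
    remaining = [w for w in keyword_set if w not in seq]
    while len(seq) < length and remaining:
        best = None  # (support, candidate sequence, added word)
        for word in remaining:
            for pos in range(len(seq) + 1):
                candidate = seq[:pos] + [word] + seq[pos:]
                count = support(candidate, tweets)
                if best is None or count > best[0]:
                    best = (count, candidate, word)
        if best[0] == 0:
            break  # no tweet supports any further expansion
        seq = best[1]
        remaining.remove(best[2])  # filter the added keyword out of the set
    return seq

# Toy tokenized tweets (illustrative only).
toy_corpus = [
    ["election", "debate", "starts", "monday", "evening"],
    ["the", "election", "debate", "will", "start", "monday"],
    ["election", "debate", "monday"],
    ["weather", "monday"],
]
pattern = generate_pattern(
    toy_corpus, center="debate", keyword="election",
    keyword_set=["election", "monday", "starts", "weather"], length=4)
print(pattern)
```

A fuller version would restrict the allowed gap between matched words to the skip value and would limit candidates to verbs, nouns and adjectives, as the text specifies.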
4. Data analysis of different data streams based on generated semantic sequence patterns
Compared with the static setting, streaming data analysis faces many difficulties: on the one hand, it requires faster and more accurate computation; on the other hand, it requires flexible visualization support. Moreover, because event topics continuously change, are inherited, and disappear, clustering is more complex. To solve this problem, the invention adopts a dynamic clustering method based on BERTopic and KMeans++ to process continuously updated tweets in real time, as follows:
First, a BERTopic model is applied to generate semantic vectors for the documents in a high-dimensional space. (BERTopic is a topic modeling technique that uses transformers and c-TF-IDF to create dense clusters, allowing topics to be interpreted easily while keeping important words in the topic descriptions.) The vectors are then reduced in dimension with UMAP to facilitate subsequent computation. (Uniform Manifold Approximation and Projection is a dimensionality-reduction manifold learning technique built on a theoretical framework from Riemannian geometry and algebraic topology; it assumes that the available data samples are uniformly distributed in a topological space, which can be approximated from these finite samples and mapped to a low-dimensional space.) Because the BERTopic model does not support dynamic clustering on streaming datasets, the invention combines it with the KMeans++ algorithm, one of the fastest clustering algorithms applicable to streaming data. After the KMeans++ algorithm yields the topic clusters of each event, topic representations are generated using class-based TF-IDF vectors.
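The class-based TF-IDF weighting mentioned above can be sketched in plain Python. The sketch follows the weighting described in the BERTopic documentation, W(x, c) = tf(x, c) · log(1 + A / f(x)), where tf(x, c) is the frequency of word x in cluster c, f(x) its frequency across all clusters, and A the average number of words per cluster. It is a simplified form that skips the library's normalization details, and the names `class_tfidf` and `top_words` are hypothetical.

```python
import math
from collections import Counter

def class_tfidf(clusters):
    """clusters: dict cluster_id -> list of tokens (all documents of a cluster joined).

    Returns dict cluster_id -> {word: score}.
    """
    tf = {cid: Counter(tokens) for cid, tokens in clusters.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(clusters)  # A: average words per cluster
    return {
        cid: {w: f * math.log(1 + avg_words / total[w]) for w, f in counts.items()}
        for cid, counts in tf.items()
    }

def top_words(scores, cid, x):
    """The x highest-scoring words of one cluster (ties broken alphabetically)."""
    ranked = sorted(scores[cid].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:x]]

# Toy clusters (illustrative only).
toy_clusters = {
    0: ["trade", "tariff", "trade", "china"],
    1: ["vote", "debate", "china", "vote"],
}
ctfidf = class_tfidf(toy_clusters)
print(top_words(ctfidf, 0, 2), top_words(ctfidf, 1, 2))
```

Words shared across clusters, such as "china" here, receive a lower weight than words concentrated in a single cluster, which is what makes the top-ranked words usable as a topic description.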
There are two reasons why a Word2Vec-based method is not used in the stream data mode. First, word vectors would be generated from the corpus of each minute, so the vector of the same word would change from one minute of data to the next. The cluster centers therefore could not be passed on to the next round unless a large sliding window were set and the entire window treated as a bag of words, and that approach introduces a lag behind real time. The invention instead uses a Transformer-based pre-trained model, which produces the same word vector in every minute. The cluster centers can therefore be passed on to the next minute, which enables real-time clustering and yields coherent topics. Second, Word2Vec-based methods need initial keywords from which to extract highly similar words, and thus require prior knowledge of the event's topics. An automatic clustering method is therefore needed to help users learn about upcoming topics. For these reasons, the method adopts BERTopic + KMeans++ to dynamically cluster the continuously updated tweets.
The process of dynamically clustering continuously updated tweets with BERTopic + KMeans++ is described in detail below:
The basic principle of the dynamic KMeans algorithm is to initialize the cluster centers with the previous clustering result. When the first minute of data arrives, KMeans++ is used to initialize the clusters. After the first clustering is completed, the cluster centers are passed on to the clustering of the next minute, so that the information obtained in the previous step is retained and clustering efficiency is improved. Considering the limits on how much real-time changing information a user can take in, each clustering generates at most 6 topics, with at most 20 context keywords under each topic.
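The warm-start idea can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the implementation used in the invention: documents are assumed to be already embedded as low-dimensional points, and `kmeanspp_init` and `lloyd` are hypothetical helper names. The first minute is seeded with k-means++ style sampling; every later minute starts from the centers of the previous minute.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points: List[Point], k: int, rng: random.Random) -> List[Point]:
    """k-means++ seeding: sample each new center with probability
    proportional to its squared distance from the nearest chosen center.
    Assumes the points are not all identical."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(dist2(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def lloyd(points: List[Point], centers: List[Point], iters: int = 10) -> List[Point]:
    """Standard Lloyd iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


# Hypothetical 2-D embeddings for two consecutive minutes.
minute_1 = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
minute_2 = [(1.0, 0.0), (1.0, 1.0), (11.0, 10.0), (11.0, 11.0)]

seeds = kmeanspp_init(minute_1, k=2, rng=random.Random(0))
centers_1 = lloyd(minute_1, seeds)
# Warm start: the centers of minute 1 initialize the clustering of minute 2.
centers_2 = lloyd(minute_2, centers_1)
```

Because the second call starts near the previous solution, it typically converges in very few iterations, which is the efficiency gain the paragraph above describes.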
To obtain coherent topics, the following check is made after each per-minute clustering. If the top 25% of the keywords of a cluster at the current time are also present in a cluster of the previous time, the two clusters are merged, the keywords in the merged cluster are sorted by class-based TF-IDF score, and the set formed by the top x context keywords is taken as the updated cluster. If the top 25% of the keywords are absent from the cluster of the previous time, no data update is performed.
With this method, the context keywords identified by the BERTopic model at all times are clustered, and the contexts in which the top y context keywords of the finally merged clusters appear are taken as the objects of visual analysis.
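The merge rule can be sketched as follows. The function name `merge_if_coherent`, the representation of a cluster as a keyword list ranked best first, and the externally supplied `scores` dictionary standing in for class-based TF-IDF scores are assumptions for illustration. The text does not say whether all or only some of the top 25% must reappear; this sketch assumes all of them must.

```python
import math
from typing import Dict, List, Tuple


def merge_if_coherent(current: List[str], previous: List[str],
                      scores: Dict[str, float],
                      top_frac: float = 0.25,
                      top_x: int = 20) -> Tuple[List[str], bool]:
    """Merge the current cluster into the previous one when the leading
    keywords of the current cluster already appear in the previous cluster.

    Returns the resulting keyword list and a flag telling whether a merge
    took place. Without a merge the current cluster is returned unchanged.
    """
    head = current[:max(1, math.ceil(len(current) * top_frac))]
    if not all(word in previous for word in head):
        return list(current), False
    merged = list(dict.fromkeys(previous + current))  # union, duplicates dropped
    merged.sort(key=lambda w: scores.get(w, 0.0), reverse=True)
    return merged[:top_x], True


# Hypothetical clusters and scores.
previous = ["flu", "pandemic", "mask"]
current = ["flu", "vaccine", "mask", "dose"]
scores = {"flu": 0.9, "pandemic": 0.8, "vaccine": 0.7, "mask": 0.4, "dose": 0.1}

updated, was_merged = merge_if_coherent(current, previous, scores, top_x=3)
unrelated, unrelated_merged = merge_if_coherent(
    ["election", "vote", "flu", "poll"], previous, scores, top_x=3)
```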
To assess the performance of the BERTopic + KMeans++ clustering method described above, the invention evaluates BERTopic + KMeans++ and BoW + KMeans++ on a standard dataset (the 20 Newsgroups dataset: a collection of about 20,000 newsgroup documents divided almost evenly across 20 different newsgroups, which has become a popular dataset for experiments in text applications of machine learning techniques, such as text classification and text clustering). The bag-of-words (BoW) model puts all words into a bag irrespective of their morphology and word order, i.e., each word is treated as independent. BERTopic relies on word vectors from a neural network model that takes the positional relationships of words into account. The invention tests NMI (normalized mutual information, a measure of the similarity between two labelings of the same data). Where |U_i| is the number of samples in cluster U_i, |V_j| is the number of samples in cluster V_j, and N is the total number of samples, the mutual information between the clusterings U and V is

$$\mathrm{MI}(U,V)=\sum_{i=1}^{|U|}\sum_{j=1}^{|V|}\frac{|U_i\cap V_j|}{N}\log\frac{N\,|U_i\cap V_j|}{|U_i|\,|V_j|}$$

Normalized mutual information (NMI) is a normalization of the mutual information (MI) score that scales the result between 0 (no mutual information) and 1 (perfect correlation).
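For readers who want to check the metric by hand, the following standard-library sketch computes mutual information and a normalized score from two label lists. The text does not state which normalization was used; dividing by the arithmetic mean of the two entropies is an assumption here, and the function names are illustrative.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_info(u: Sequence[Hashable], v: Sequence[Hashable]) -> float:
    """Mutual information between two labelings of the same samples."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (n_ij / n) * log(n * n_ij / (count_u[a] * count_v[b]))
        for (a, b), n_ij in joint.items()
    )


def nmi(u: Sequence[Hashable], v: Sequence[Hashable]) -> float:
    """MI divided by the arithmetic mean of the two entropies (assumed)."""
    h_u, h_v = entropy(u), entropy(v)
    if h_u == 0.0 and h_v == 0.0:
        return 1.0  # both labelings put everything in a single cluster
    return mutual_info(u, v) / ((h_u + h_v) / 2)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # labels renamed only
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # no shared structure
```

The first call shows that NMI ignores how clusters are named, and the second shows the score dropping to zero when the two labelings share no information.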
The quality of the resulting clustering is judged against the class labels. Each method was run 5 times to calculate the median. BERTopic + KMeans++ had a median NMI of 0.61 [0.60-0.62] and BoW + KMeans++ had a median NMI of 0.42 [0.39-0.43], indicating that BERTopic + KMeans++ outperforms the BoW method in terms of clustering results. Regarding computational efficiency, the BERTopic + KMeans++ method was tested on the dataset of the case study, which contains an average of 80 tweets per minute. BERTopic + KMeans++ was found to process 1 minute of data in 6-7 seconds. Thus, for many social media event datasets with around 800 tweets per minute, the BERTopic + KMeans++ approach is feasible in terms of both clustering quality and time efficiency.
2. Visualizing analysis results
The design principles, visual coding and concrete construction process of the wing metaphors in the semantic sequence pattern view (as shown in fig. 4) will be described below:
The present invention proposes a new design that can be used to visualize the changing sequential patterns of contexts and allow interactions to compare them. In ContextWing, the main metaphors are wings and feathers, as shown in FIG. 8.
Wing metaphor: wings visualize the connection between the sequence patterns of the center word. In the horizontal direction, the wings are divided into a left-right symmetrical structure. The words on the left wing represent words that appear in the text before the center word and vice versa.
Feather metaphor: each pair of horizontally symmetric feathers (also referred to as each layer) in the wing exhibits a sequence pattern that merges according to the same context keyword. The color and vertical position of the feathers represent the correlation between each context keyword and two key entities. The horizontal position represents a degree of public concern.
Next, how the semantic sequence schema view is constructed will be described.
1. The feather layers described above are built for the selected context keywords. The present invention assigns each selected context keyword to a layer; all layers have the same length, and their width is automatically fine-tuned according to the number of selected words. The color and vertical position of a layer encode the pairwise correlations generated by the computational model, so that the layers express a pairwise comparison. As shown in fig. 8, the lower a layer's position, the more closely it is associated with key entity A; conversely, the higher its position, the closer it is to key entity B. The horizontal position of a layer is a quantification of public attention, indicating how much attention the public pays to the center word together with the selected context keyword. Layers that are horizontally closer to the center keyword receive more attention. Furthermore, the width of the link to a layer represents the total frequency of the patterns on that layer.
2. Context keywords are laid out on each layer. As shown in a of fig. 8, the present invention places words on the feathers around the center keyword in the order of appearance and arranges patterns from the top to the bottom of the layers in time order. The present invention vertically aligns the context keywords with the center word. Keywords that are on the same horizontal line as the center word form a pattern. The time scale of the layer side indicates the corresponding time period of the pattern in the same row. The size of the keyword encodes the pattern frequency after the word is included. Thus, the last keyword frequency represents the pattern frequency.
3. The selected context keywords are merged. During the placement process, many repeated keywords are found in the same column, so that it is not easy to compare different semantic information in the sequence. For example, a selected context keyword such as "flu" will not be apparent because "pandemic" will also appear repeatedly and more nearly centered (a in FIG. 9). Therefore, it is necessary to avoid the influence of other context keywords and emphasize the context keywords of the selected layer. As shown in b of fig. 9, the present invention combines these keywords in the same column, maintaining the overall structure and avoiding misunderstanding. After merging, the frequency evolution information of the context keywords is lost. Therefore, the invention also adds a mini trend graph to visualize the change of word frequency with time so as to enhance the information display.
4. Connecting lines are added between words. Following the idea of a tree structure, the present invention connects the words of the same pattern with lines for better understanding. As indicated by (a) in fig. 9, there are cases where the context keyword is the final word of the pattern ("flu" in (a) of fig. 9). If repeated words are merged, the position of that word may become blank, so that it looks as if no word were present on the corresponding level, which leads to misunderstanding. Thus, the present invention adds a line that connects the context keyword with the next blank position on the horizontal line to indicate the presence of the context keyword, as shown in fig. 9.
The ContextWing system provided by the present invention includes the topic view, control view, mode view, and detail view shown in fig. 10.
1. Topic view
The present invention provides a topic view for selecting keywords as input to the mode view. As shown in fig. 11, the top of the view is a histogram showing the percentage change in the tweets of the two data streams. The symbols of the two streams are placed at the top and bottom of the view, respectively. Topics are marked with different colors on the buttons at the lower left. In the bubble chart, keywords are aggregated and divided into several time periods. Since each keyword can itself generate a wing-shaped pattern structure, the invention designs the keyword bubble as a wing-shaped glyph. Its size and opacity represent the frequency of occurrence of the word, and its color represents the topic to which the word belongs. The vertical position of the bubble represents its correlation with the two key entities. Some important indicators, such as frequency and sentiment distribution, are displayed in the tooltip. To make the consistency of topics easy to observe, the invention adds connecting lines between keywords that appear frequently in different periods. The user can hover over a keyword to observe its frequency and relevance. A time period can be selected by brushing the histogram, and the data for the selected period are re-aggregated and displayed in multiple bubble pools. The design of the topic view also extends to the stream setting: the histogram, bubble pools, and topic buttons are updated synchronously at preset intervals (e.g., 1 minute). According to the modeling result, if a new topic appears, the old topic is replaced and the new one is highlighted with a new color. The color and name of the topic buttons always correspond to the categories of the updated bubbles, which helps the user perceive dynamic topic changes more intuitively. Under dynamic change, it is difficult for the user to keep a mental map of previous information. Therefore, the invention combines the histogram with the bubble chart to help the user view real-time and historical data. The user can also click a "pause" button to pause or resume the updates.
2. Control view
The invention provides dataset options and analysis modes, and the user can switch between the static and stream analysis modes. Furthermore, starting from the topic view (shown in FIG. 11), there are two ways to explore a center word and its context keywords. The user can click "change mode" to open the iterative control panel (shown in figure 3), which offers: (1) Exploration mode: the user can continuously drill down into the context keywords of a center word by clicking. Because every context keyword can itself serve as a center word with its own context keywords, each keyword can be clicked to drill down further. (2) Analysis mode: the user can click on a keyword as the center word and then select its context keywords. To maintain consistency of information, the color of the selected keywords in the control view still represents their topic. Clicking "Go Pattern" then shows the derived patterns in the pattern view on the right. For example, FIG. 1 shows a selection operation in analysis mode, which supports selecting the context keywords of a center word. In this process, the user can click a bubble again to update the selection, click "back" or "restart" to return to the previous or initial state, and carry out an iterative exploration.
3. Mode view
Once the wing structure is constructed, the user can compare patterns from different aspects. The present invention provides the following four interactions for detailed comparison. First, to support temporal comparison of the patterns of the same context keyword, the user can hover over any keyword; the corresponding pattern is highlighted and the other patterns are hidden. The user can thus better observe a single time-stamped pattern on a single layer. Second, to compare patterns from the perspective of the selected context keywords, the user can click on any time scale to highlight the patterns of that time period across the different layers. Third, when the user hovers the mouse over the side of a layer, a mini sparkline (a small data plot drawn as a line without axes) is displayed, showing how the frequency of the selected context keyword evolves over the whole period. Finally, the invention provides a tooltip: the user can click on any keyword to view the frequency and sentiment distribution of each pattern. The mode view also supports real-time updates, displaying the patterns at the same time intervals as in the topic view, vertically aligned to correspond to the several time points up to the current time.
4. Detail view
To assist the user in understanding the patterns, the invention provides a detail view (fig. 7) that displays information about the original tweets, such as time and sentiment score. In the mode view, the user can select a pattern, and the original tweets will be displayed in the detail view. In addition, the user can select a time period and type in words of interest.
It should be understood that the above description is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be apparent to those skilled in the art that various modifications, equivalents, variations, and the like can be made to the present application. Such variations are intended to be within the scope of the application without departing from the spirit thereof. In addition, some terms used in the description and claims of the present application are not limiting, but are merely for convenience of description.
Claims (5)
1. A context semantic sequence comparison method based on dynamic clustering and visualization, characterized in that, for real-time stream data, continuously updated tweets are dynamically clustered by a dynamic clustering method based on BERTopic and KMeans++, and visual analysis is then performed on the dynamic stream, the visual analysis specifically comprising the steps of:
S1, according to the center word selected by the user, extracting context keywords of the center word by calculating the similarity between each word in the tweets and the center word; and calculating the public attention between the context keywords and the center word, which is used to calculate the distance between the center word and a context keyword, so that the public attention quantifies the horizontal position of the feather layer, in the subsequently constructed semantic sequence pattern view, of each context keyword in the subsequently selected context keyword set;
S2, calculating and visualizing the association degree between the context keywords and two key entities, wherein an entity whose co-occurrence frequency with the context keywords is calculated is defined as a key entity; after the association degree is calculated, the context keywords that have an association degree with the center word and rank in the top n form the context keyword set;
S3, according to the public attention, generating a semantic sequence pattern from the center word and its context keyword set by an iterative search method, and visualizing the pattern;
in step S1, the method for calculating the public attention between the center word and its context keywords comprises the steps of:
S11, calculating the public attention, the calculation method being expressed by the following formula (1):
in formula (1), k represents the center word selected by the user or the system;
c represents the context keyword;
n represents the total number of tweets in the dataset;
u_i(c, k) is an inclusion condition indicating whether the i-th tweet contains both c and k: if so, u_i(c, k) = 1, otherwise 0;
u_i(c, -k) indicates whether the i-th tweet contains c but not k: if so, u_i(c, -k) = 1, otherwise 0;
η_i indicates whether the i-th tweet is forwarded: if so, η_i = 1, otherwise 0;
r_i represents the number of times the i-th tweet is forwarded;
S12, visualizing according to the calculated public attention.
2. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein the method for dynamically clustering continuously updated tweets by the dynamic clustering method based on BERTopic and KMeans++ comprises the steps of:
A1, according to the center word given by the user, using the BERTopic model to recognize the context keywords in the continuously updated tweets, so as to obtain the context keywords to be clustered at the initial time t;
A2, initializing the clustering at time t by using the KMeans++ algorithm, and, after the first clustering is completed, passing the cluster centers on to the clustering at time t+1;
A3, judging at each clustering time whether the top m context keywords of a cluster at the current time are also present in a cluster of the previous time; if so, merging the two clusters, sorting the context keywords in the merged cluster by class-based TF-IDF score, and taking the set formed by the top x context keywords as the updated cluster;
A4, completing the clustering of the context keywords identified at all times by the method of steps A2-A3, and taking the contexts in which the top y context keywords of the finally merged clusters appear as the objects of visual analysis.
3. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein in step S1, the similarity between the center word and each word in the tweets is calculated by a cosine similarity calculation method, and the words ranking in the top n are used as the context keyword set.
4. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein in step S2, the association degree is expressed by the following formula (2):
in formula (2), the two frequency terms respectively represent the co-occurrence frequency of the context keyword i with key entity A and with key entity B at time t;
rank represents the rank of the difference between the co-occurrence frequencies of the context keyword i among all i ∈ W_t;
n_t represents the total number of context keywords of the center word at time t;
W_t represents the set of all context keywords of the center word at time t.
5. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein in step S3, the method of generating the semantic sequence pattern comprises the steps of:
S31, forming an initial sequence, wherein the initial sequence comprises the center word and a context keyword selected by the user, kept in the order in which they appear in the tweets;
S32, traversing each context keyword in the keyword set, searching for the word whose new semantic sequence, formed by adding that word to the initial sequence, has the highest co-occurrence frequency in the tweets, adding the context keyword thus found to the initial sequence to expand the sequence, and filtering out of the keyword set the context keyword newly added to the initial sequence;
S33, taking the new semantic sequence obtained by the expansion in step S32 as the initial sequence, returning to step S31, and continuing to expand the initial sequence from the remaining filtered keyword set until the expanded sequence reaches a preset sequence length, the new semantic sequence finally obtained being taken as the generated semantic sequence pattern.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310445169.7A CN116521858B (en) | 2023-04-20 | 2023-04-20 | Context semantic sequence comparison method based on dynamic clustering and visualization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116521858A CN116521858A (en) | 2023-08-01 |
CN116521858B true CN116521858B (en) | 2024-04-30 |
Family
ID=87407620
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310445169.7A Active CN116521858B (en) | 2023-04-20 | 2023-04-20 | Context semantic sequence comparison method based on dynamic clustering and visualization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116521858B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116796754A (en) * | 2023-04-20 | 2023-09-22 | 浙江浙里信征信有限公司 | Visual analysis method and system based on time-varying context semantic sequence pair comparison |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544255A (en) * | 2013-10-15 | 2014-01-29 | 常州大学 | Text semantic relativity based network public opinion information analysis method |
CN110543559A (en) * | 2019-06-28 | 2019-12-06 | 谭浩 | Method for generating interview report, computer-readable storage medium and terminal device |
CN110909153A (en) * | 2019-10-22 | 2020-03-24 | 中国船舶重工集团公司第七0九研究所 | Knowledge graph visualization method based on semantic attention model |
CN115470344A (en) * | 2022-08-24 | 2022-12-13 | 西南财经大学 | Video barrage and comment theme fusion method based on text clustering |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10229193B2 (en) * | 2016-10-03 | 2019-03-12 | Sap Se | Collecting event related tweets |
Also Published As
Publication number | Publication date |
---|---|
CN116521858A (en) | 2023-08-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||