CN109635107A - Method and device for semantic intelligent analysis and event scenario reconstruction over multiple data sources - Google Patents
Method and device for semantic intelligent analysis and event scenario reconstruction over multiple data sources
- Publication number
- CN109635107A (application CN201811378557.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- information
- content
- event
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
A device for semantic intelligent analysis and event scenario reconstruction over multiple data sources, relating to the field of information technology, in particular big data analysis and semantic recognition for scenario reconstruction. The invention consists of a data source acquisition module, a data integration module, an event clustering module, an entity extraction module, and an event reconstruction module. It addresses problems such as the heavy manual-analysis workload that regulatory agencies face after collecting data on large-scale information security events or trending public opinion information. By approximating scenario reconstruction it effectively reduces manual effort, solving the problem that current scenario reconstruction requires a large amount of manual maintenance.
Description
Technical field
This patent relates to the field of information technology, in particular to big data analysis and semantic language analysis, applied to event reconstruction.
Background technology
In today's highly networked era, every network regulatory agency pays close attention to the various large-scale information security events and trending public opinion events on the network. After a large-scale information security event or trending public opinion event occurs, the personnel involved want to understand it fully and comprehensively. However, every regulatory agency currently lacks the means to perform comprehensive scenario reconstruction for all types of events. At present, the agencies concerned can usually only find sets of related events through keyword association; they lack a way to aggregate correlated events and cannot compute the corresponding statistics to carry out scenario reconstruction.
Defects of relying only on crawlers and a traditional public opinion system:
First, a trending public opinion event consists of the attitudes that large numbers of people express toward a social phenomenon or problem they care about. Studying and judging public opinion trends on the Internet therefore requires a large amount of public opinion source data as the object of analysis: the more extensive the acquisition sources, the more comprehensive the collected data and the more information it contains (for example, page content protected by anti-crawling measures), and the more accurate the resulting hotspot or public opinion analysis. At present the usual source of Internet content data is web crawling alone; because anti-crawling techniques (and barriers such as mandatory registration) are widespread, crawlers collect little valuable public opinion data, and accurate, effective data analysis is impossible.
Second, under the original keyword-search approach, the public opinion analysis results obtained cannot be turned intuitively into a scenario reconstruction. After preliminary sentiment analysis, what is displayed is only the public opinion picture under the currently acquired data sources; the scenarios people actually care about, such as the development of the public opinion, its severity, and the geographic distribution of its disaster area, are never effectively reconstructed.
Defects of relying only on information security acquisition and a conventional information security system:
First, the data sources are problematic. Because acquisition works on traffic flows, under weak filtering the volume of collected information security data is enormous, while finely tuned filter conditions leave too little acquired information. In addition, the information acquired from traffic contains large amounts of junk data, and the interference of this junk information with normal semantics makes it impossible, or very difficult, to cluster or semantically analyze the collected data directly.
Finally, regulatory agencies lack effective means of classifying and screening information and comprehensively reconstructing the scene of each large-scale security event. At present most regulatory agencies can only describe a security event through event notifications, which falls far short of promptly reconstructing the scene of a large-scale security event directly from Internet information; the outbreak and development of large-scale events cannot be restored comprehensively.
In summary, no network regulatory agency can currently perform effective scenario reconstruction for the large-scale information security events or trending public opinion events that break out on the network. Tracing the problem to its root, the defect lies in the means of acquiring the data sources and of finally analyzing and presenting them. This patent therefore focuses on studying these defects and provides a method by which a regulatory agency can effectively reconstruct the scenario of the events it cares about.
Known technology
Deep web page crawling of data: a comparatively mature technology that collects text from news websites, blogs, forums, microblogs, WeChat public accounts, social networking sites, and comments on audio/video websites, yielding the corresponding event data sources. Deep web page crawling targets known forum-class websites and crawls specific content; compared with a general-purpose spider, the crawled content contains little noise and a large amount of usable data.
Conventional web crawling of data: the common data acquisition method of crawling the page content under a top-level domain.
Result data returned for the monitoring directives issued by the information security system: the information security system issues daily monitoring directives for possible events; the directive content comprises keywords, times of occurrence, and other conditions describing the potential event, and the returned results are the web content data matching the monitoring directive together with information security attribute data. The directive monitoring data of the various types of information security system comprises:
1. IDC/ISP information security monitoring data; the data table contains machine room ID, source IP, destination IP, source port, destination port, domain name, cumulative access count, agent type, agent IP, agent port, title, content, URL, attachment, and acquisition time; raw format XML, acquisition period immediate;
2. IRCS information security monitoring data; the data table contains source IP, destination IP, source port, destination port, domain name, cumulative access count, agent type, agent IP, agent port, title, content, URL, and first trigger time; raw format XML, acquisition period immediate;
3. CDN information security monitoring data; the data table contains source IP, destination IP, source port, destination port, domain name, cumulative access count, agent type, agent IP, agent port, title, content, URL, and first trigger time; raw format XML, acquisition period immediate.
word2vec
Word2vec is a family of related models used to produce word vectors. These models are shallow two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a vocabulary as input and predicts the words at adjacent positions; under the bag-of-words assumption used in word2vec, the order of words is unimportant. After training, the word2vec model can be used to map each word to a vector representing the relationships between words; this vector is the hidden layer of the neural network.
With the continuous expansion of computer applications, natural language processing has received great attention. Application demands such as machine translation, speech recognition, and information retrieval place ever higher requirements on a computer's natural language processing ability. For a computer to process natural language, the language must first be modeled. Natural language modeling has evolved from rule-based methods to statistics-based methods; the natural language model obtained from statistics-based modeling is called a statistical language model. There are many statistical language modeling techniques, including n-gram, neural network, and log-linear models. In modeling natural language, problems such as the curse of dimensionality, word similarity, model generalization ability, and model performance arise; the search for solutions to these problems is the internal driving force behind the continued development of statistical language models.
Against this background of statistical language model research, Google released word2vec, a software tool for training word vectors, in 2013. According to a given corpus, word2vec's optimized training models can quickly and efficiently express a word in vector form, providing a new tool for applied research in the field of natural language processing.
Word2vec builds neural word embeddings using either skip-grams or continuous bag of words (CBOW). Word2vec was created by a research team at Google led by Tomas Mikolov; the algorithm has since been analyzed and explained by others.
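For illustration only (not part of the patented device), a minimal word2vec training sketch using the open-source gensim library; the toy corpus and parameter values are assumptions:

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of already-segmented tokens.
sentences = [
    ["network", "security", "event", "monitoring"],
    ["public", "opinion", "event", "analysis"],
    ["security", "event", "scenario", "reconstruction"],
]

# sg=1 selects the skip-gram architecture; vector_size is the fixed
# dimension of the learned word vectors (the hidden layer width).
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

vec = model.wv["event"]                        # 100-dimensional vector for "event"
print(model.wv.most_similar("event", topn=2))  # nearest words by cosine similarity
```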
Bag-of-words model
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). Under this model, a text such as a sentence or a document can be represented as if a bag were filled with its words; this representation disregards grammar and word order. Recently the bag-of-words model has also been applied in the field of computer vision. The model is widely used in document classification, where the frequency of occurrence of each word can be used as a feature for training a classifier. The origin of the term "bag of words" in this sense can be traced to Zellig Harris's 1954 article Distributional Structure.
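A minimal sketch of the bag-of-words representation, shown here with scikit-learn's CountVectorizer purely as an illustration; the sample documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the event was reported", "the event was analyzed and reported"]

# Each document becomes a vector of word counts; word order is discarded.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # one count vector per document
```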
Skip-gram model
The skip-gram model is a simple but very useful model. In natural language processing, the choice of corpus is a considerable problem. First, the corpus must be rich: on the one hand the dictionary's vocabulary must be sufficiently large, and on the other hand the corpus should contain as many sentences reflecting the relationships between words as possible. For example, the more often a pattern such as "the fish swims in the water" occurs in the corpus, the better the model can learn the semantic and grammatical relationships in it; this matches one principle of human language learning, where frequent repetition leads to imitation. Second, the corpus must be accurate: the selected corpus must correctly reflect the semantic and grammatical relationships of the language. This point seems not difficult to achieve; in Chinese, for example, the corpus of the People's Daily is quite accurate. More often, however, the worry about accuracy is caused not by the choice of corpus but by the processing method. In an n-gram model, the limitation of the window size means that relationships between the current word and words beyond the window range cannot be correctly reflected in the model, while simply enlarging the window increases training complexity. The proposal of the skip-gram model solves these problems well. As its name suggests, skip-gram means "skipping certain tokens". For example, the (Chinese) sentence "Chinese football is played really too badly" segments into six tokens and yields four 3-gram phrases: "Chinese / football / is-played", "football / is-played / really", "is-played / really / too-bad", and "really / too-bad / le"; yet the gist of the sentence, "Chinese football too-bad", is not reflected by any of these 3-grams. The skip-gram model, however, allows certain words to be skipped and can therefore form the 3-gram "Chinese / football / too-bad". If up to 2 words may be skipped, we speak of a 2-skip-gram.
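For illustration, a small Python sketch enumerating k-skip-n-grams under one common definition (at most k tokens skipped in total per gram); this is a generic reconstruction of the idea, not code from the patent:

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """All n-grams in which at most k tokens in total are skipped."""
    grams = set()
    for start in range(len(tokens)):
        window = tokens[start:start + n + k]  # wide enough for k skips
        if len(window) < n:
            continue
        # Always take the window's first token, then choose the remaining n-1.
        for idx in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[i] for i in idx))
    return grams

tokens = ["Chinese", "football", "is-played", "really", "too-bad", "le"]
print(("Chinese", "football", "too-bad") in skip_grams(tokens, n=3, k=2))  # True
print(("Chinese", "football", "too-bad") in skip_grams(tokens, n=3, k=1))  # False
```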
Application of word2vec
An extension of word2vec that constructs embeddings for whole documents (rather than individual words) has been proposed; it is known as paragraph2vec or doc2vec and has been implemented as tools in C, Python, and Java/Scala. The Java and Python versions also support inferring embeddings for previously unseen documents. Why word embedding under the word2vec framework is so successful is little understood; Yoav Goldberg and Omer Levy point out that word2vec's objective function causes texts occurring in similar contexts to have similar embeddings (computed with cosine similarity), which is related to J. R. Firth's distributional hypothesis. Word embedding is the collective name for a set of language modeling and representation learning techniques in natural language processing (NLP). Conceptually, it embeds a high-dimensional space whose dimensionality is the number of all words into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers. Methods of word embedding include artificial neural networks, dimensionality reduction of the word co-occurrence matrix, probabilistic models, and explicit representation of the contexts in which a word occurs. Used in the low-level input layer, representing phrases by word embeddings has greatly improved the performance of syntactic parsers, text sentiment analysis, and other NLP tasks. Word embedding technology dates back to 2000. Yoshua Bengio et al., in a series of papers, used neural probabilistic language models to have machines "learn a distributed representation for words", thereby achieving dimensionality reduction of the word space. Roweis and Saul published in Science the locally linear embedding (LLE) method for learning low-dimensional representations of high-dimensional data structures. The field developed steadily at first and advanced by leaps and bounds after 2010, in part because the quality of the vectors and the training speed of the models made major breakthroughs in this period. There are many branches of word embedding, and many scholars are devoted to its study. In 2013 a Google team led by Tomas Mikolov invented the word2vec toolkit for word embedding, which trains vector space models faster than previous methods. Most emerging word embeddings are based on artificial neural networks rather than the n-gram models and unsupervised learning of the past.
Word vectors
Word vectors have good semantic properties and are a common way of representing word features. Each dimension of a word vector carries a value that represents a feature with a certain semantic and grammatical interpretation, so each dimension of the word vector may be regarded as a word feature. Word vectors take many forms, and the distributed representation is one of them. A distributed representation is a dense, low-dimensional real-valued vector. Each dimension of the distributed representation expresses a latent feature of the word, a feature that captures useful syntactic and semantic properties. The word "distributed" in "distributed representation" thus reflects exactly this characteristic of word vectors: the different syntactic and semantic features of a word are distributed over its dimensions.
K-means
The k-means algorithm is a hard clustering algorithm and the classic representative of prototype-based objective-function clustering methods: it takes the distance from the data points to the prototypes as the objective function to be optimized, and obtains the adjustment rules of the iterative computation by seeking the extrema of the function. K-means uses Euclidean distance as its similarity measure, seeking the optimal classification corresponding to an initial set of cluster center vectors V such that the evaluation index J is minimized; the algorithm uses the sum-of-squared-errors criterion as its clustering criterion function.
K-means is the classic distance-based clustering algorithm. It uses distance as the evaluation index of similarity: the closer two objects are, the greater their similarity. The algorithm regards a cluster as composed of objects that are close to one another, and takes obtaining compact, well-separated clusters as its final goal.
The choice of the k initial cluster centers has a large influence on the clustering result, because in the first step the algorithm randomly selects k arbitrary objects as the centers of the initial clusters, each initially representing one cluster. In each iteration, every remaining object in the data set is assigned to the nearest cluster according to its distance from each cluster center. After all data objects have been examined, one iteration is complete and the new cluster centers are computed. If the value of J does not change between two consecutive iterations, the algorithm has converged.
The algorithm proceeds as follows:
1) Randomly select K documents from the N documents as centroids;
2) For each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid;
3) Recompute the centroid of each resulting class;
4) Iterate steps 2 and 3 until the new centroids equal the previous centroids or move less than a specified threshold; the algorithm then terminates.
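A compact NumPy sketch of the k-means loop just described (an illustration; a production system would more likely use an off-the-shelf implementation such as scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids move less than the threshold.
        if np.linalg.norm(new - centroids) < tol:
            break
        centroids = new
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centers = kmeans(X, k=2)
print(centers.round(1))  # two centroids, near (0, 0) and (5, 5)
```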
Jieba segmentation component: the jieba ("stutterer") segmentation component is among the best open-source Chinese word segmentation components written in Python.
It supports three segmentation modes:
Accurate mode, which aims to cut the sentence most precisely and is suited to text analysis;
Full mode, which scans out all the tokens in the sentence that can form words; it is very fast, but cannot resolve ambiguity;
Search engine mode, which, on the basis of accurate mode, re-segments long words to improve recall, and is suited to segmentation for search engines.
It supports traditional-character segmentation and custom dictionaries, and can be freely downloaded and installed.
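A short example of jieba's three segmentation modes (the sample sentence is arbitrary):

```python
import jieba

text = "网络安全事件场景还原"  # "network security event scenario reconstruction"

print("/".join(jieba.cut(text, cut_all=False)))  # accurate mode
print("/".join(jieba.cut(text, cut_all=True)))   # full mode: every word-forming token
print("/".join(jieba.cut_for_search(text)))      # search engine mode: re-cuts long words

# A custom dictionary (one word per line, optionally with frequency and
# part-of-speech tag) can be loaded with jieba.load_userdict("userdict.txt").
```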
Distributional representation
A distributional representation is based on the distributional hypothesis and uses a co-occurrence matrix to obtain the semantic representation of a word; it can be regarded as one method of obtaining word representations.
TF-IDF (Term Frequency - Inverse Document Frequency)
This algorithm is used to evaluate how important a word (term) is to an entire document. It considers only two factors: (1) whether the term occurs a large number of times in this document, and (2) whether the term occurs a large number of times across all documents. The idea of the algorithm is easy to understand: a word that occurs frequently in a document is naturally important in that document, but words belonging to the common vocabulary, i.e. those occurring frequently in all documents, must be penalized. TF-IDF is frequently used in search engines to compute the degree of relevance between a query and a document. For the formula, see Wikipedia: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
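A small self-contained computation of TF-IDF matching the description above, using the common log(N / n_t) idf variant (illustrative only; the documents are made up):

```python
import math
from collections import Counter

docs = [
    ["security", "event", "event", "report"],
    ["public", "opinion", "event"],
    ["security", "monitoring", "report"],
]

N = len(docs)
df = Counter(t for d in docs for t in set(d))  # documents containing each term

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)  # term frequency within this document
    idf = math.log(N / df[term])     # penalizes terms common to many documents
    return tf * idf

print(round(tfidf("event", docs[0]), 3))   # frequent here, in 2 of 3 docs
print(round(tfidf("report", docs[0]), 3))  # less frequent here, same df
```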
Viterbi algorithm
The Viterbi algorithm is a dynamic programming algorithm for finding the hidden state sequence, the Viterbi path, most likely to have produced an observed event sequence, especially in the context of Markov information sources and hidden Markov models. The terms "Viterbi path" and "Viterbi algorithm" are also used for related dynamic programming algorithms that find the most likely explanation of an observation. For example, in statistical parsing a dynamic programming algorithm can find the most likely context-free derivation (parse) of a string, which is sometimes called "Viterbi parsing".
The Viterbi algorithm was proposed by Andrew Viterbi in 1967 for removing noise in the decoding of convolutional codes over digital communication links. The algorithm is widely used in CDMA and GSM digital cellular networks, in dial-up modems, in satellite and deep-space communications, and in 802.11 wireless networks, and is now also frequently used in speech recognition, keyword spotting, computational linguistics, and bioinformatics. In speech recognition, for example, the acoustic signal is treated as the observed event sequence and the text string is regarded as the hidden cause of the signal, so the Viterbi algorithm can be applied to the signal to find the most likely text string.
The basis of the Viterbi algorithm can be summarized in the following three points:
1. If the path P of maximum probability (the shortest path, so to speak) passes through some point, say X22, then the sub-path Q from the starting point S to X22 along P must be the shortest path from S to X22. Otherwise, replacing Q by the shortest path R from S to X22 would form a path shorter than P, which is clearly a contradiction. This verifies that the principle of optimality is satisfied.
2. A path from S to E must pass through some state at the i-th moment. Suppose there are k states at moment i; if the shortest paths from S to all k nodes of the i-th state are recorded, the final shortest path must pass through one of them. Thus, at any moment, only a very limited number of shortest paths need be considered.
3. Combining the two points above: suppose that when we go from state i to state i+1, the shortest path from S to each node of state i has already been found and recorded on those nodes; then to compute the shortest path from the starting point S to some node X_{i+1,j} of state i+1, we only need to consider the shortest paths from S to all k nodes of the preceding state i, plus the distances from those nodes to X_{i+1,j}.
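A compact Viterbi implementation for a discrete hidden Markov model, following the dynamic programming recurrence above (illustrative; the toy model parameters are made up):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Only the best predecessor matters (point 3 above).
            prev, p = max(((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda x: x[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path, V[-1][last]

states = ("Rainy", "Sunny")
obs = ("walk", "shop", "clean")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(obs, states, start_p, trans_p, emit_p))
```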
Corpus
A corpus is a large-scale electronic text library built through scientific sampling and processing. With computer analysis tools, researchers can conduct related research in language theory and its applications. China has, among others, the State Language Commission's balanced corpus of modern Chinese and its corpus of ancient books.
Summary of the invention
The device for semantic intelligent analysis and event scenario reconstruction over multiple data sources consists of a data source acquisition module, a data integration module, an event clustering module, an entity extraction module, and an event reconstruction module. The data source acquisition module consists of a deep crawler, a conventional crawler, a security information receiver, a deep buffer, a conventional buffer, a security information buffer, a deduplicator, and a source data store, where the source data store consists of a deep data store, a conventional data store, and a security information data store. The data integration module consists of a data content extractor, a data attribute extractor, a content database, and an attribute database. The event clustering module consists of a scenario topic definer, an event content extractor, a content cleaner, a vector space model builder, a text modeler, and a text clusterer. The entity extraction module consists of a person-name extractor, a legal-person extractor, an occupation extractor, a place-name extractor, a keyword extractor, a sensitive-word extractor, a keyword list, and a sensitive-word list.
The main steps by which the device realizes semantic intelligent analysis and event scenario reconstruction over multiple data sources comprise:
1) Data acquisition by the data source acquisition module
1. Deep web page crawling of data: the deep crawler collects text from the news websites, blogs, forums, microblogs, WeChat public accounts, social networking sites, and audio/video website comments already on the supervision list, and records the collected text into the deep buffer;
2. Conventional web crawling of data: the conventional crawler crawls the page content under the top-level domains of non-forum-class websites, generates text, and records it into the conventional buffer; forum-class websites comprise: news websites, blogs, forums, microblogs, WeChat public accounts, social networking sites, and audio/video website comments;
3. Receiving security information data: the security information receiver, serving as the interface with the information security system, backs up from the information security system into the security information buffer the result data returned for the monitoring directives issued by the information security system;
4. Removing duplicate data: the deduplicator removes duplicate data from the deep buffer and stores the result in the deep data store; removes duplicate data from the conventional buffer and stores the result in the conventional data store; and removes duplicate data from the security information buffer and stores the result in the security information data store (a sketch of one possible deduplication operation follows this step);
5. The deep data store, the conventional data store, and the security information data store form the source data store; the source data store generates a source data identifier for the stored data according to its data source, and the data bearing the source data identifier in the deep data store, in the conventional data store, and in the security information data store are stored in the source data store as source data;
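The patent does not prescribe a particular deduplication algorithm; a minimal content-hash sketch of what the deduplicator could do, under that assumption (the record fields are hypothetical):

```python
import hashlib

def dedupe(records):
    """Drop records whose (title, body) content hash has already been seen."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(
            (rec.get("title", "") + rec.get("body", "")).encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

buffer = [{"title": "Event A", "body": "..."},
          {"title": "Event A", "body": "..."}]  # exact duplicate
print(len(dedupe(buffer)))  # 1
```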
2) Data integration by the data integration module
1. Content extraction of data: the data content extractor reads the source data in the source data store, generates content data bearing the source data identifier, and stores the identified content data in the content database; the content data bearing the source data identifier comprises: the source data identifier, title, author, body text, audio, video, and pictures;
2. Attribute extraction of data: the data attribute extractor reads the source data in the source data store, generates attribute data bearing the source data identifier, and stores the identified attribute data in the attribute database; the attribute data bearing the source data identifier comprises: the source URL, posting time, view count, comment count, repost count, domain name, source IP, destination IP, port number, machine room, and the information security monitoring information including first discovery time, last discovery time, and 24-hour cumulative access count;
3) Topic confirmation and event clustering by the event clustering module
1. The scenario topic definer completes the definition and confirmation of the event topic to be reconstructed, i.e. the content of the event topic's keyword list is entered;
2. The event content extractor extracts, from all the identified content data stored in the content database, the data matching the event topic's keyword list, generating the topic-extracted content data; the topic-extracted content data is the identified content data containing at least one keyword, a keyword being an entry in the event topic's keyword list;
3. The content cleaner performs data cleaning on the topic-extracted content data to generate the cleaned content data; the cleaning process first removes invalid links and duplicate, irrelevant data by testing, and then uses the jieba segmentation component for word segmentation and feature extraction, rejecting stop words, words contributing minimal semantics, and meaningless words (a minimal sketch of this cleaning step follows);
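A minimal sketch of the cleaning sub-step, assuming jieba for segmentation; the stop-word set, length filter, and sample text are hypothetical choices:

```python
import re
import jieba

STOP_WORDS = {"的", "了", "是", "在", "和"}  # a real system loads a full list

def clean(text):
    text = re.sub(r"https?://\S+", " ", text)  # strip (possibly invalid) links
    tokens = jieba.cut(text, cut_all=False)    # accurate-mode segmentation
    # Reject stop words, whitespace, and single-character low-content tokens.
    return [t for t in tokens
            if t.strip() and t not in STOP_WORDS and len(t) > 1]

print(clean("网络安全事件的场景还原 http://example.com/page"))
```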
4. The vector space model builder establishes a vector space model over the cleaned content data. The logic of the construction is: a text is regarded as a sequence of feature words and can therefore be treated as a multidimensional vector, whose dimensionality is the number of feature terms and whose component sizes correspond to the occurrence frequencies and weights of the terms. Abstracted into formulas: the text collection D consists of n documents, D = {d_1, d_2, ..., d_n}, containing M feature terms t_1, ..., t_M, and each document can be abstracted by vectorization as d_i = (w_{i1}, w_{i2}, ..., w_{iM}), where w_{ik} is the weight of feature term t_k in the i-th document d_i;
5. The text modeler vectorizes the cleaned content data to generate the vectorized document set, by the following specific method:
a) Each feature word is vectorized: using the Word2vec model and its contextual information, each feature word is converted into a real-valued vector of fixed dimension, such that similar words also lie close together in the vector space; the word vector v(w) defined under the skip-gram framework of the Word2vec model is trained by the update v(w) ← v(w) + η · Σ_{c ∈ Context(w)} v(c), where η is the learning rate and Σ_{c ∈ Context(w)} v(c) is the cumulative sum of the word vectors in the content window;
b) Text feature weights are assigned using the currently most mature TF-IDF technique, laying the foundation for the subsequent text clustering: suppose the feature word is t and the text in which it appears is d_i; the frequency of t within the text is expressed by the TF factor, and the rarity of t across the whole set of event texts is expressed by the IDF factor. TF*IDF is the document's own feature, and based on TF-IDF the weight can be expressed as w(t, d_i) = tf(t, d_i) × log(N / n_t), where w(t, d_i) is the weight of feature word t in text d_i, tf(t, d_i) is the word frequency of t in d_i, N is the total number of training texts, and n_t is the number of the N texts in which feature word t occurs. The TF-IDF method gives a higher weight to features occurring frequently in the current document but rarely in other documents, which enhances the discrimination between documents. For two corresponding documents d_i and d_j, their degree of association can be expressed by the cosine sim(d_i, d_j) = (Σ_{k=1}^{M} w_{ik} · w_{jk}) / (√(Σ_{k=1}^{M} w_{ik}²) · √(Σ_{k=1}^{M} w_{jk}²)), where M is the dimensionality and w_{ik} is the k-th dimension weight of d_i;
c) The word vectors and the feature-word weights obtained above are combined to obtain the vectorization of the entire document: the feature term t_k obtained through TF-IDF has weight w(t_k, d) in document d, and t_k has the fixed-dimension word vector v(t_k) obtained with the word2vec skip-gram framework; with the parameters obtained by the above methods, the current text can be converted into a sequence of feature words and feature weights, and the document vector is finally computed as v(d) = Σ_k w(t_k, d) · v(t_k). All the cleaned content data is vectorized with this formula, generating one vectorized document per distinct source data identifier, and together these form the vectorized document set (a sketch of this document-vectorization step follows);
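A minimal sketch of this TF-IDF-weighted word-vector document embedding, assuming the gensim word2vec implementation; the corpus, dimensions, and parameters are illustrative:

```python
import math
import numpy as np
from collections import Counter
from gensim.models import Word2Vec

docs = [["security", "event", "report"],
        ["public", "opinion", "event"],
        ["security", "monitoring", "report"]]

w2v = Word2Vec(docs, vector_size=50, window=2, min_count=1, sg=1)
N = len(docs)
df = Counter(t for d in docs for t in set(d))

def tfidf(term, doc):
    return (doc.count(term) / len(doc)) * math.log(N / df[term])

def doc_vector(doc):
    # v(d) = sum_k w(t_k, d) * v(t_k): TF-IDF-weighted sum of word vectors.
    vec = np.zeros(w2v.vector_size)
    for term in set(doc):
        vec += tfidf(term, doc) * w2v.wv[term]
    return vec

doc_vectors = np.array([doc_vector(d) for d in docs])
print(doc_vectors.shape)  # one fixed-dimension vector per document: (3, 50)
```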
6. The text clusterer clusters the vectorized documents in the vectorized document set, using the k-means algorithm with an approximate document-count parameter K set, so that the vectorized documents of similar content converge, generating the converged vectorized document sets; each converged vectorized document set corresponds to an event topic defined in the scenario topic definer;
4) Feature-entity extraction by the entity extraction module, comprising:
1. The person-name extractor extracts person-name entities from the converged vectorized document sets using the role-tagging-based Chinese person-name extraction method: role information is first extracted automatically using a corpus, the Viterbi algorithm is then adopted to perform role tagging on the segmented-word results, and finally maximum matching is carried out on the basis of the role sequence, realizing the extraction of person names and generating the person-name information; the person-name extractor sends the extracted person-name information and the corresponding event topic to the event reconstruction module;
2. The legal-person extractor extracts legal-person information from the converged vectorized document sets by comparison against a third-party industrial and commercial information database; the legal-person extractor sends the extracted legal-person information and the corresponding event topic to the event reconstruction module;
3. The occupation extractor extracts occupation information from the converged vectorized document sets by comparison against a library of common occupations; the occupation extractor sends the extracted occupation information and the corresponding event topic to the event reconstruction module;
4. The place-name extractor extracts place-name information from the converged vectorized document sets by comparison against country, province, city, and county names; the place-name extractor sends the extracted place-name information and the corresponding event topic to the event reconstruction module;
5. The keyword extractor extracts keyword information from the converged vectorized document sets by comparison against the keyword list; the keyword extractor sends the extracted keyword information and the corresponding event topic to the event reconstruction module; the keyword list is generated by the scenario topic definer when the event topic is defined and is sent to the keyword extractor;
6. The sensitive-word extractor extracts sensitive-word information from the converged vectorized document sets by comparison against the sensitive-word list; the sensitive-word extractor sends the extracted sensitive-word information and the corresponding event topic to the event reconstruction module; the sensitive-word list is generated by the entity extraction module according to the sensitive-word content uniformly required by the Internet administration department; the sensitive-word extractor also extracts temporal information from the converged vectorized document sets by matching common date and time formats, and sends the temporal information and the corresponding event topic to the event reconstruction module (a sketch of this dictionary and format matching follows this list);
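For the dictionary comparison and date/time-format matching in the steps above, a minimal sketch; the keyword entries, sensitive-word entries, and date pattern are hypothetical:

```python
import re

KEYWORDS = {"漏洞", "攻击"}          # hypothetical keyword-list entries
SENSITIVE = {"敏感词A", "敏感词B"}    # hypothetical sensitive-word-list entries
DATE_RE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?")  # common date formats

def extract(text):
    return {
        "keywords":  [w for w in KEYWORDS if w in text],
        "sensitive": [w for w in SENSITIVE if w in text],
        "times":     DATE_RE.findall(text),
    }

print(extract("2018年11月19日发现漏洞攻击事件"))
```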
5) Event reconstruction is completed by the event reconstruction module, generating the association map:
1. The event reconstruction module determines the corresponding converged vectorized document set according to the received event topic, extracts the source data identifiers corresponding to the converged vectorized document set, and extracts the attribute data from the data integration module's attribute database according to the source data identifiers;
2. The event reconstruction module combines, according to the event topic, the received person-name information, legal-person information, occupation information, place-name information, keyword information, sensitive-word information, temporal information, and attribute data to generate the association map (one possible realization is sketched below).
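The patent does not fix a concrete data structure for the association map; one plausible realization is a graph linking the event topic to each extracted entity, sketched here with the networkx library (names and values are made up):

```python
import networkx as nx

def build_association_map(topic, entities):
    """entities: dict mapping an entity type to the list of extracted values."""
    g = nx.Graph()
    g.add_node(topic, kind="event_topic")
    for kind, values in entities.items():
        for v in values:
            g.add_node(v, kind=kind)
            g.add_edge(topic, v)  # associate each entity with the event topic
    return g

g = build_association_map(
    "example security event",
    {"person": ["Zhang San"], "place": ["Beijing"], "time": ["2018-11-19"]},
)
print(g.number_of_nodes(), g.number_of_edges())  # 4 3
```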
Beneficial effects
The invention solves problems such as the heavy manual-analysis workload faced by regulatory agencies after collecting data on large-scale information security events or trending public opinion information. By approximating scenario reconstruction, it effectively reduces the amount of manual work and solves the problem that current scenario reconstruction involves a large amount of manual maintenance.
Brief description of the drawings
Fig. 1 is the overall structural diagram of the invention.
Specific embodiment
Referring to Fig. 1, the device for semantic intelligent analysis and event scenario reconstruction over multiple data sources according to the invention comprises: data source acquisition module A, data integration module B, event clustering module C, entity extraction module D, and event reconstruction module E. Data source acquisition module A consists of deep crawler 11, conventional crawler 12, security information receiver 13, deep buffer 110, conventional buffer 120, security information buffer 130, deduplicator 14, and source data store 15, where source data store 15 consists of deep data store 151, conventional data store 152, and security information data store 153. Data integration module B consists of data content extractor 21, data attribute extractor 22, content database 23, and attribute database 24. Event clustering module C consists of scenario topic definer 31, event content extractor 32, content cleaner 33, vector space model builder 34, text modeler 35, and text clusterer 36. Entity extraction module D consists of person-name extractor 41, legal-person extractor 42, occupation extractor 43, place-name extractor 44, keyword extractor 45, sensitive-word extractor 46, keyword list 47, and sensitive-word list 48.
The main steps by which the device realizes semantic intelligent analysis and event scenario reconstruction over multiple data sources comprise:
1) Data acquisition by data source acquisition module A
1. Deep web page crawling of data: deep crawler 11 collects text from the news websites, blogs, forums, microblogs, WeChat public accounts, social networking sites, and audio/video website comments already on the supervision list, and records the collected text into deep buffer 110;
2. Conventional web crawling of data: conventional crawler 12 crawls the page content under the top-level domains of non-forum-class websites, generates text, and records it into conventional buffer 120; forum-class websites comprise: news websites, blogs, forums, microblogs, WeChat public accounts, social networking sites, and audio/video website comments;
3. Receiving security information data: security information receiver 13, serving as the interface with the information security system, backs up from the information security system into security information buffer 130 the result data returned for the monitoring directives issued by the information security system;
4. Removing duplicate data: deduplicator 14 removes duplicate data from deep buffer 110 and stores the result in deep data store 151; removes duplicate data from conventional buffer 120 and stores the result in conventional data store 152; and removes duplicate data from security information buffer 130 and stores the result in security information data store 153;
5. Deep data store 151, conventional data store 152, and security information data store 153 form source data store 15; source data store 15 generates source data identifier 25 for the stored data according to its data source, and the data bearing source data identifier 25 in deep data store 151, in conventional data store 152, and in security information data store 153 are stored in source data store 15 as source data;
2) Data integration by data integration module B
1. Content extraction of data: data content extractor 21 reads the source data in source data store 15, generates content data bearing the source data identifier, and stores the identified content data in content database 23; the content data bearing the source data identifier comprises: the source data identifier, title, author, body text, audio, video, and pictures;
2. Attribute extraction of data: data attribute extractor 22 reads the source data in source data store 15, generates attribute data bearing the source data identifier, and stores the identified attribute data in attribute database 24; the attribute data bearing the source data identifier comprises: the source URL, posting time, view count, comment count, repost count, domain name, source IP, destination IP, port number, machine room, and the information security monitoring information including first discovery time, last discovery time, and 24-hour cumulative access count;
3) Topic confirmation and event clustering by event clustering module C
1. Scenario topic definer 31 completes the definition and confirmation of the event topic to be reconstructed, i.e. the content of the event topic's keyword list 47 is entered;
2. Event content extractor 32 extracts, from all the identified content data stored in content database 23, the data matching the event topic's keyword list 47, generating the topic-extracted content data; the topic-extracted content data is the identified content data containing at least one keyword, a keyword being an entry in the event topic's keyword list 47;
3. Content cleaner 33 performs data cleaning on the topic-extracted content data to generate the cleaned content data; the cleaning process first removes invalid links and duplicate, irrelevant data by testing, and then uses the jieba segmentation component for word segmentation and feature extraction, rejecting stop words, words contributing minimal semantics, and meaningless words;
4. Vector space model builder 34 establishes a vector space model over the cleaned content data. The logic of the construction is: a text is regarded as a sequence of feature words and can therefore be treated as a multidimensional vector, whose dimensionality is the number of feature terms and whose component sizes correspond to the occurrence frequencies and weights of the terms. Abstracted into formulas: the text collection D consists of n documents, D = {d_1, d_2, ..., d_n}, containing M feature terms t_1, ..., t_M, and each document can be abstracted by vectorization as d_i = (w_{i1}, w_{i2}, ..., w_{iM}), where w_{ik} is the weight of feature term t_k in the i-th document d_i;
5. Text modeler 35 vectorizes the cleaned content data to generate the vectorized document set, by the following specific method:
a) Each feature word is vectorized: using the Word2vec model and its contextual information, each feature word is converted into a real-valued vector of fixed dimension, such that similar words also lie close together in the vector space; the word vector v(w) defined under the skip-gram framework of the Word2vec model is trained by the update v(w) ← v(w) + η · Σ_{c ∈ Context(w)} v(c), where η is the learning rate and Σ_{c ∈ Context(w)} v(c) is the cumulative sum of the word vectors in the content window;
b) Text feature weights are assigned using the currently most mature TF-IDF technique, laying the foundation for the subsequent text clustering: suppose the feature word is t and the text in which it appears is d_i; the frequency of t within the text is expressed by the TF factor, and the rarity of t across the whole set of event texts is expressed by the IDF factor. TF*IDF is the document's own feature, and based on TF-IDF the weight can be expressed as w(t, d_i) = tf(t, d_i) × log(N / n_t), where w(t, d_i) is the weight of feature word t in text d_i, tf(t, d_i) is the word frequency of t in d_i, N is the total number of training texts, and n_t is the number of the N texts in which feature word t occurs. The TF-IDF method gives a higher weight to features occurring frequently in the current document but rarely in other documents, which enhances the discrimination between documents. For two corresponding documents d_i and d_j, their degree of association can be expressed by the cosine sim(d_i, d_j) = (Σ_{k=1}^{M} w_{ik} · w_{jk}) / (√(Σ_{k=1}^{M} w_{ik}²) · √(Σ_{k=1}^{M} w_{jk}²)), where M is the dimensionality and w_{ik} is the k-th dimension weight of d_i;
c) The word vectors and the feature-word weights obtained above are combined to obtain the vectorization of the entire document: the feature term t_k obtained through TF-IDF has weight w(t_k, d) in document d, and t_k has the fixed-dimension word vector v(t_k) obtained with the word2vec skip-gram framework; with the parameters obtained by the above methods, the current text can be converted into a sequence of feature words and feature weights, and the document vector is finally computed as v(d) = Σ_k w(t_k, d) · v(t_k). All the cleaned content data is vectorized with this formula, generating one vectorized document per distinct source data identifier 25, and together these form the vectorized document set;
6. Text clusterer 36 clusters the vectorized documents in the vectorized document set, using the k-means algorithm with an approximate document-count parameter K set, so that the vectorized documents of similar content converge, generating the converged vectorized document sets; each converged vectorized document set corresponds to an event topic defined in scenario topic definer 31;
4) Feature-entity extraction by entity extraction module D, comprising:
1. Person-name extractor 41 extracts person-name entities from the converged vectorized document sets using the role-tagging-based Chinese person-name extraction method: role information is first extracted automatically using a corpus, the Viterbi algorithm is then adopted to perform role tagging on the segmented-word results, and finally maximum matching is carried out on the basis of the role sequence, realizing the extraction of person names and generating the person-name information; person-name extractor 41 sends the extracted person-name information and the corresponding event topic to event reconstruction module E;
2. Legal-person extractor 42 extracts legal-person information from the converged vectorized document sets by comparison against a third-party industrial and commercial information database; legal-person extractor 42 sends the extracted legal-person information and the corresponding event topic to event reconstruction module E;
3. Occupation extractor 43 extracts occupation information from the converged vectorized document sets by comparison against a library of common occupations; occupation extractor 43 sends the extracted occupation information and the corresponding event topic to event reconstruction module E;
4. Place-name extractor 44 extracts place-name information from the converged vectorized document sets by comparison against country, province, city, and county names; place-name extractor 44 sends the extracted place-name information and the corresponding event topic to event reconstruction module E;
5. Keyword extractor 45 extracts keyword information from the converged vectorized document sets by comparison against keyword list 47; keyword extractor 45 sends the extracted keyword information and the corresponding event topic to event reconstruction module E; keyword list 47 is generated by scenario topic definer 31 when the event topic is defined and is sent to keyword extractor 45;
6. Sensitive-word extractor 46 extracts sensitive-word information from the converged vectorized document sets by comparison against sensitive-word list 48; sensitive-word extractor 46 sends the extracted sensitive-word information and the corresponding event topic to event reconstruction module E; sensitive-word list 48 is generated by entity extraction module D according to the sensitive-word content uniformly required by the Internet administration department; sensitive-word extractor 46 also extracts temporal information from the converged vectorized document sets by matching common date and time formats, and sends the temporal information and the corresponding event topic to event reconstruction module E;
5) Event reconstruction is completed by event reconstruction module E, generating association map 51:
1. Event reconstruction module E determines the corresponding converged vectorized document set according to the received event topic, extracts source data identifiers 25 corresponding to the converged vectorized document set, and extracts the attribute data from data integration module B's attribute database 24 according to source data identifiers 25;
2. Event reconstruction module E combines, according to the event topic, the received person-name information, legal-person information, occupation information, place-name information, keyword information, sensitive-word information, temporal information, and attribute data to generate association map 51.
Claims (1)
1. A device for semantic intelligent analysis and event scenario reconstruction over multiple data sources, consisting of a data source acquisition module, a data integration module, an event clustering module, an entity extraction module, and an event reconstruction module; the data source acquisition module consists of a deep crawler, a conventional crawler, a security information receiver, a deep buffer, a conventional buffer, a security information buffer, a deduplicator, and a source data store, where the source data store consists of a deep data store, a conventional data store, and a security information data store; the data integration module consists of a data content extractor, a data attribute extractor, a content database, and an attribute database; the event clustering module consists of a scenario topic definer, an event content extractor, a content cleaner, a vector space model builder, a text modeler, and a text clusterer; the entity extraction module consists of a person-name extractor, a legal-person extractor, an occupation extractor, a place-name extractor, a keyword extractor, a sensitive-word extractor, a keyword list, and a sensitive-word list;
The key step of device of the semantic intellectual analysis and event scenarios reduction of realizing multi-data source includes:
1) data acquisition is carried out by data source acquisition module
1. depth web page crawl data: by depth crawler to having included the supervision news website of list, blog, forum, micro-
Rich, wechat public platform, social network sites, audio-video website comment progress text collection is simultaneously temporary to depth by collected text entry
Storage;
2. Conventional web-page crawling: the conventional crawler crawls the web-page content under the top-level domains of non-forum-class websites, generates text, and records it into the conventional buffer; forum-class websites comprise: comments on news websites, blogs, forums, microblogs, WeChat public accounts, social networking sites, and audio-video websites;
3. Receiving security information data: the security information receiver, acting as the interface to the information security system, backs up from the information security system into the security information buffer the result data returned for the monitoring instructions issued to that system;
4. Removing duplicate data: the deduplicator removes duplicates from the data in the depth buffer and stores the result in the depth data memory; it removes duplicates from the data in the conventional buffer and stores the result in the conventional data memory; and it removes duplicates from the data in the security information buffer and stores the result in the security information data memory;
5. The depth data memory, conventional data memory, and security information data memory form the source data storage; the source data storage generates a source-data identifier for each stored datum according to its data source, and the identifier-bearing data in the depth data memory, the conventional data memory, and the security information data memory are stored in the source data storage as source data (a sketch of deduplication and identifier generation follows this step);
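The deduplication of item 4 and the identifier generation of item 5 can be approximated by content hashing over each buffer. A minimal sketch; the SHA-256 digest and the `make_source_id` format are assumptions, since the claim does not specify how duplicates are detected or how identifiers are formed:

```python
import hashlib

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep only the first record for each distinct text content (item 4)."""
    seen, unique = set(), []
    for record in records:
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

def make_source_id(source: str, seq: int) -> str:
    """Generate a source-data identifier according to the data source (item 5)."""
    return f"{source}-{seq:08d}"

# One buffer per source; deduplicated, identified data become source data.
buffers = {
    "depth": [{"text": "post A"}, {"text": "post A"}],  # duplicate dropped
    "conventional": [{"text": "page B"}],
    "security": [{"text": "alert C"}],
}
source_data_storage = []
for source, records in buffers.items():
    for seq, record in enumerate(deduplicate(records)):
        record["source_data_id"] = make_source_id(source, seq)
        source_data_storage.append(record)
```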
2) Data integration by the data integration module:
1. Content extraction: the data content extractor reads the source data in the source data storage, generates content data bearing the source-data identifier, and stores that content data in the content database; the content data with source-data identifier comprises: the source-data identifier, title, author, body text, audio, video, and pictures;
2. Attribute extraction: the data attribute extractor reads the source data in the source data storage, generates attribute data bearing the source-data identifier, and stores that attribute data in the attribute database; the attribute data with source-data identifier comprises: the data-source URL, content publication time, content page-view count, content comment count, content forwarding count, domain name, source IP, destination IP, port number, server room, and the information-security monitoring information including first discovery time, last discovery time, and cumulative 24-hour visit count (a record sketch follows this step);
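Items 1 and 2 amount to splitting each source datum into a content record and an attribute record that share the same source-data identifier. A minimal sketch with dataclasses; the field names follow the claim, but the schema and the `integrate` helper are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ContentData:          # one row in the content database
    source_data_id: str
    title: str = ""
    author: str = ""
    body: str = ""
    media: list = field(default_factory=list)  # audio / video / pictures

@dataclass
class AttributeData:        # one row in the attribute database
    source_data_id: str
    url: str = ""
    published_at: str = ""
    views: int = 0
    comments: int = 0
    forwards: int = 0
    domain: str = ""
    source_ip: str = ""
    dest_ip: str = ""
    port: int = 0
    first_seen: str = ""
    last_seen: str = ""
    visits_24h: int = 0

def integrate(source_datum: dict) -> tuple[ContentData, AttributeData]:
    """Split one source datum into its content record and attribute record."""
    sid = source_datum["source_data_id"]
    content = ContentData(sid, title=source_datum.get("title", ""),
                          body=source_datum.get("text", ""))
    attrs = AttributeData(sid, url=source_datum.get("url", ""))
    return content, attrs
```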
3) Topic confirmation and event clustering by the event clustering module:
1. The scenario event topic definer completes the definition and confirmation of the event topic to be restored, and completes the content input of the keyword table for that event topic;
2. The event content extractor filters all content data with source-data identifiers stored in the content database according to the keyword table of the event topic, and generates the topic-extracted content data; the topic-extracted content data is the content data with source-data identifier that contains at least one keyword, a keyword being an entry in the keyword table of the event topic;
3. The content cleaner performs data cleansing on the topic-extracted content data to generate the cleaned content data; the cleaning process first checks and removes invalid links and duplicate, irrelevant data, then performs word segmentation and feature extraction with the jieba word-segmentation component, rejecting stop words, words that contribute minimal semantics, and meaningless words;
4. The vector space model builder establishes a vector space model over the cleaned content data. The modeling logic is: a text is treated as a sequence of feature words, so the text can be viewed as a multi-dimensional vector whose dimensionality is the number of feature terms and whose component sizes correspond to each term's occurrence frequency and weight. Abstracted into a formula: the text collection D consists of n documents, D = {d_1, d_2, …, d_n}, containing M feature terms, and each document can be abstracted by the vectorization method as d_i = (w_{i1}, w_{i2}, …, w_{iM}), where w_{ik} is the weight of the k-th feature term in the i-th document;
5. The text modeler vectorizes the cleaned content data to generate the vectorized document set, as follows:
a) Each feature word is vectorized with the Word2vec model, which uses contextual information to convert each feature word into a real-valued vector of fixed dimension, such that similar words lie close together in the vector space; the term vector v_w defined by the skip-gram framework of the Word2vec model is trained by maximizing the log-probability of each word's context, L = Σ_t Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t), the vectors being updated with learning rate η from the accumulated word vectors of the surrounding words in the content;
b) Text feature weights are assigned with the currently most mature TF-IDF technique, which lays the foundation for the subsequent text clustering: assume a feature word t appears in text d_i; how often t occurs in d_i is expressed by the TF factor, and how rarely t appears across the texts of the whole event corpus is expressed by the IDF factor;
TF*IDF is the document's own feature and can be expressed as follows: w(t, d_i) = tf(t, d_i) × log(N / n_t), where w(t, d_i) is the weight of feature word t in text d_i, tf(t, d_i) is the word frequency of t in d_i, N is the total number of training texts, and n_t is the number of texts in N in which feature word t appears; the TF-IDF method gives higher weights to features that occur often in the current document but rarely in other documents, which enhances the discrimination between documents; for two documents d_i and d_j, their degree of association can be expressed by the cosine sim(d_i, d_j) = (Σ_{k=1}^{M} w_{ik} · w_{jk}) / (√(Σ_{k=1}^{M} w_{ik}²) · √(Σ_{k=1}^{M} w_{jk}²)), where M is the dimensionality and w_{ik} is the weight of the k-th dimension of d_i;
c) The acquired term vectors and the feature-word weights are combined to obtain the vectorization of the entire document: the feature term t_k obtained through TF-IDF has weight w_{ik} in document d_i, and t_k has a fixed-dimension term vector v(t_k) obtained with the skip-gram framework of the word2vec model; with the parameters obtained as above, the current text can be converted into a sequence of feature words and feature weights, and the document vector is formed as the weighted combination doc(d_i) = Σ_k w_{ik} · v(t_k); finally, all cleaned content data are vectorized with this formula, and vectorized documents are generated according to the distinct source-data identifiers, one source-data identifier yielding one vectorized document, thereby forming the vectorized document set;
6. The text clusterer clusters the vectorized documents in the vectorized document set using the k-means algorithm with the approximate document-count parameter K, so that vectorized documents with similar content converge, generating the converged vectorized document sets; each converged vectorized document set corresponds to an event topic defined in the scenario event topic definer (a pipeline sketch covering cleaning, vectorization, and clustering follows this step);
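The whole of step 3) — jieba cleaning, skip-gram term vectors, TF-IDF weights, weighted document vectors, and k-means — can be strung together in a few dozen lines. A minimal sketch using jieba, gensim, and scikit-learn; the toy corpus, stop-word list, vector size, and the value of K are all illustrative assumptions:

```python
import jieba
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"的", "了", "是"}  # assumed stop-word list

def clean(text: str) -> list[str]:
    """jieba segmentation with stop words rejected (step 3, item 3)."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]

# Cleaned content data, one entry per source-data identifier.
docs = {
    "SRC-001": clean("天津某事件的相关报道文本"),
    "SRC-002": clean("天津某事件的后续评论文本"),
    "SRC-003": clean("另一主题的无关文本"),
}

# a) fixed-dimension term vectors from the skip-gram framework (sg=1).
w2v = Word2Vec(sentences=list(docs.values()), vector_size=100, sg=1,
               window=5, min_count=1)

# b) TF-IDF weight of each feature word in each document.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
matrix = tfidf.fit_transform(docs.values())
vocab = tfidf.get_feature_names_out()

# c) document vector = TF-IDF-weighted sum of the term vectors.
doc_vectors = []
for row in matrix:
    vec = np.zeros(100)
    for idx, weight in zip(row.indices, row.data):
        word = vocab[idx]
        if word in w2v.wv:
            vec += weight * w2v.wv[word]
    doc_vectors.append(vec)

# k-means with the approximate document-count parameter K (step 3, item 6).
K = 2
labels = KMeans(n_clusters=K, n_init=10).fit_predict(np.array(doc_vectors))
converged_sets = {k: [sid for sid, lab in zip(docs, labels) if lab == k]
                  for k in range(K)}
```

Each resulting cluster plays the role of one converged vectorized document set; in the claim, the mapping from cluster to event topic comes from the scenario event topic definer rather than from the data.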
4) Feature-entity extraction by the entity extraction module, comprising:
1. The name extractor extracts person-name entities from the converged vectorized document sets using a role-tagging-based Chinese person-name extraction method: role information is first extracted automatically from a corpus, the Viterbi algorithm is then applied to tag the segmented words with roles, and finally maximum matching is performed on the role sequence to realize the extraction of names and generate the name information; the name extractor sends the extracted name information and the corresponding event topic to the event recovery module;
2. The legal-person information extractor compares the converged vectorized document sets against a third-party business registration database and extracts legal-person information; the legal-person information extractor sends the extracted legal-person information and the corresponding event topic to the event recovery module;
3. The occupation extractor compares the converged vectorized document sets against a common occupation lexicon and extracts occupation information; the occupation extractor sends the extracted occupation information and the corresponding event topic to the event recovery module;
4. The place-name extractor compares the converged vectorized document sets against country, province, city, and county names and extracts place-name information; the place-name extractor sends the extracted place-name information and the corresponding event topic to the event recovery module;
5. The keyword extractor compares the converged vectorized document sets against the keyword table and extracts keyword information; the keyword extractor sends the extracted keyword information and the corresponding event topic to the event recovery module; the keyword table is generated by the scenario event topic definer when the event topic is defined and is sent to the keyword extractor;
6. The sensitive-word extractor compares the converged vectorized document sets against the sensitive-word table and extracts sensitive-word information; the sensitive-word extractor sends the extracted sensitive-word information and the corresponding event topic to the event recovery module; the sensitive-word table is generated by the entity extraction module according to the sensitive-word content uniformly specified by the Internet regulatory authority; the sensitive-word extractor also extracts temporal information from the converged vectorized document sets by matching common date and time formats, and sends the temporal information and the corresponding event topic to the event recovery module (a dictionary-matching sketch follows this step);
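Items 2 to 6 are variants of one operation: compare the tokens of a converged document set against a reference list and keep the hits. A minimal sketch; the tiny lexicons are stand-ins for the third-party business registration database, the occupation lexicon, the place-name list, the keyword table, and the sensitive-word table:

```python
# Illustrative stand-ins for the reference lists named in items 2-6.
LEXICONS = {
    "legal_person":   {"某某科技有限公司"},
    "occupation":     {"工程师", "教师"},
    "place_name":     {"天津", "河北省"},
    "keyword":        {"事件", "舆情"},
    "sensitive_word": {"敏感词A"},
}

def match_entities(tokens: list[str]) -> dict[str, list[str]]:
    """Compare one document's tokens against every lexicon and collect
    the extracted entity information per entity kind."""
    hits: dict[str, list[str]] = {kind: [] for kind in LEXICONS}
    for token in tokens:
        for kind, lexicon in LEXICONS.items():
            if token in lexicon:
                hits[kind].append(token)
    return hits

# Each hit set is sent to the event recovery module with its event topic.
doc_tokens = ["天津", "工程师", "事件", "普通词"]
message = {"event_topic": "example-topic", **match_entities(doc_tokens)}
```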
5) Event restoration is completed by the event recovery module, generating the association map:
1. The event recovery module determines the corresponding converged vectorized document set according to the received event topic, extracts the source-data identifiers corresponding to that document set, and retrieves the attribute data from the attribute database of the data integration module according to the source-data identifiers (a lookup sketch follows this step);
2. The event recovery module combines the received name information, legal-person information, occupation information, place-name information, keyword information, sensitive-word information, temporal information, and attribute data according to the event topic to generate the association map.
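Item 1 of step 5) is essentially a join on the source-data identifier: from the received event topic to its converged document set, to the identifiers, to the attribute records. A minimal sketch over plain dicts; the in-memory layout is an assumption:

```python
# Event topic -> source-data identifiers of its converged document set.
topic_to_cluster = {"example-topic": ["SRC-001", "SRC-002"]}

# Attribute database of the data integration module, keyed by identifier.
attribute_db = {
    "SRC-001": {"url": "http://example.com/post/1", "views": 1024},
    "SRC-002": {"url": "http://example.com/post/2", "views": 7},
}

def attributes_for_topic(event_topic: str) -> dict[str, dict]:
    """Resolve a received event topic to the attribute data of its
    converged vectorized document set (step 5, item 1)."""
    source_ids = topic_to_cluster.get(event_topic, [])
    return {sid: attribute_db[sid] for sid in source_ids if sid in attribute_db}

attrs = attributes_for_topic("example-topic")
```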
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811378557.3A CN109635107A (en) | 2018-11-19 | 2018-11-19 | The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109635107A true CN109635107A (en) | 2019-04-16 |
Family
ID=66068291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811378557.3A Pending CN109635107A (en) | 2018-11-19 | 2018-11-19 | The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109635107A (en) |
2018-11-19: CN application CN201811378557.3A filed; published as CN109635107A (en); status: Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060277465A1 (en) * | 2005-06-07 | 2006-12-07 | Textual Analytics Solutions Pvt. Ltd. | System and method of textual information analytics |
CN102110140A (en) * | 2011-01-26 | 2011-06-29 | 桂林电子科技大学 | Network-based method for analyzing opinion information in discrete text |
CN103544255A (en) * | 2013-10-15 | 2014-01-29 | 常州大学 | Text semantic relativity based network public opinion information analysis method |
CN106951438A (en) * | 2017-02-13 | 2017-07-14 | 北京航空航天大学 | A kind of event extraction system and method towards open field |
CN107315778A (en) * | 2017-05-31 | 2017-11-03 | 温州市鹿城区中津先进科技研究院 | A kind of natural language the analysis of public opinion method based on big data sentiment analysis |
CN108763333A (en) * | 2018-05-11 | 2018-11-06 | 北京航空航天大学 | A kind of event collection of illustrative plates construction method based on Social Media |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110389932A (en) * | 2019-07-02 | 2019-10-29 | 华北电力科学研究院有限责任公司 | Electric power automatic document classifying method and device |
CN110515926A (en) * | 2019-08-28 | 2019-11-29 | 国网天津市电力公司 | Heterogeneous data source mass data carding method based on participle and semantic dependency analysis |
WO2021057133A1 (en) * | 2019-09-24 | 2021-04-01 | 北京国双科技有限公司 | Method for training document classification model, and related apparatus |
CN110765233A (en) * | 2019-11-11 | 2020-02-07 | 中国人民解放军军事科学院评估论证研究中心 | Intelligent information retrieval service system based on deep mining and knowledge management technology |
CN113449101A (en) * | 2020-03-26 | 2021-09-28 | 北京中科闻歌科技股份有限公司 | Public health safety event detection and event set construction method and system |
CN111881330A (en) * | 2020-08-05 | 2020-11-03 | 上海奥珩企业管理有限公司 | Automatic restoration method and system for home service scene |
CN111881330B (en) * | 2020-08-05 | 2023-10-27 | 颐家(上海)医疗养老服务有限公司 | Automatic home service scene restoration method and system |
CN114090700A (en) * | 2021-11-22 | 2022-02-25 | 广州华森建筑与工程设计顾问有限公司 | Method, system and equipment for generating feature data |
CN114090700B (en) * | 2021-11-22 | 2022-05-17 | 广州华森建筑与工程设计顾问有限公司 | Method, system and equipment for generating feature data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635107A (en) | The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source | |
CN104933164B (en) | In internet mass data name entity between relationship extracting method and its system | |
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
CN107122413A (en) | A kind of keyword extracting method and device based on graph model | |
WO2018097091A1 (en) | Model creation device, text search device, model creation method, text search method, data structure, and program | |
CN111950273A (en) | Network public opinion emergency automatic identification method based on emotion information extraction analysis | |
CN106776562A (en) | A kind of keyword extracting method and extraction system | |
CN106126619A (en) | A kind of video retrieval method based on video content and system | |
CN113268606B (en) | Knowledge graph construction method and device | |
CN105045852A (en) | Full-text search engine system for teaching resources | |
CN112307364B (en) | Character representation-oriented news text place extraction method | |
TWI656450B (en) | Method and system for extracting knowledge from Chinese corpus | |
Asgari-Chenaghlu et al. | Topic detection and tracking techniques on Twitter: a systematic review | |
CN112256939A (en) | Text entity relation extraction method for chemical field | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
CN107102976A (en) | Entertainment newses autocreating technology and system based on microblogging | |
Saju et al. | A survey on efficient extraction of named entities from new domains using big data analytics | |
CN109446399A (en) | A kind of video display entity search method | |
Campbell et al. | Content+ context networks for user classification in twitter | |
CN108595466B (en) | Internet information filtering and internet user information and network card structure analysis method | |
CN115017302A (en) | Public opinion monitoring method and public opinion monitoring system | |
CN107908749A (en) | A kind of personage's searching system and method based on search engine | |
CN112507097B (en) | Method for improving generalization capability of question-answering system | |
Pasca et al. | Answer mining from on-line documents | |
Hajjem et al. | Building comparable corpora from social networks |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190416 |