CN106709006A - Associated data compressing method friendly to query - Google Patents

Associated data compressing method friendly to query

Info

Publication number
CN106709006A
Authority
CN
China
Prior art keywords
vector
subject
data
predicate
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611209081.1A
Other languages
Chinese (zh)
Other versions
CN106709006B (en)
Inventor
顾进广
彭燊
黄智生
符海东
梅琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Chu Tianyun Polytron Technologies Inc
Wuhan University of Science and Engineering WUSE
Wuhan University of Science and Technology WHUST
Original Assignee
Wuhan Chu Tianyun Polytron Technologies Inc
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Chu Tianyun Polytron Technologies Inc and Wuhan University of Science and Engineering WUSE
Priority to CN201611209081.1A priority Critical patent/CN106709006B/en
Publication of CN106709006A publication Critical patent/CN106709006A/en
Application granted
Publication of CN106709006B publication Critical patent/CN106709006B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a query-friendly associated data compression method. The method comprises the following steps: defining relation-mining rules and mining the potential association relations among triples; defining a compressed query memory model that consists of a subject vector, a predicate vector and an object matrix; defining a serialization scheme for the compressed query memory model and implementing serialization and deserialization with three auxiliary symbols; defining a query mode for executing SPARQL on the compressed query memory model, querying subjects and predicates by binary search and querying objects by linear traversal; and defining a scheme for resolving the slow queries caused by an oversized object matrix by dividing a large data block into several small data blocks. Compared with most existing compression schemes, an associated data set processed by the method achieves a higher compression ratio and supports SPARQL query operations directly in the compressed state.

Description

A query-friendly associated data compression method
Technical field
The present invention relates to the field of big data, in particular to the storage, transmission and querying of massive RDF, LOD (linked open data) and knowledge graph data, and more particularly to a query-friendly associated data compression method.
Background technology
Many associated data compression schemes exist, but most of them are unfriendly to querying. The most widely accepted compression scheme is HDT; it achieves a high compression ratio, but the data must first be decompressed before querying, so it is query-unfriendly. Inspired by HDT, a number of HDT-based compression techniques have also been proposed, such as HDT-FoQ, WaterFowl and HDT++. These techniques share one common characteristic: a high compression ratio, but poor query friendliness.
There are also some query-friendly schemes, for example BitMat. This compression scheme represents triple relations with a three-dimensional matrix and therefore reserves storage space even for the many triples that do not exist. When the associated data set reaches a certain scale, this three-dimensional matrix becomes an extremely sparse matrix, and because a large amount of redundancy is stored, the compression ratio is far from ideal. To reduce the storage redundancy, the K2-triples scheme was proposed: it splits the three-dimensional matrix into multiple two-dimensional matrices by predicate and stores each two-dimensional matrix as a K2-tree. This method improves the compression ratio to some extent, but it also breaks the original intuitive matrix structure, so the matrix has to be restored before querying, an operation that reduces RDF query efficiency.
More and more associated data floods the entire data web; when these data must be managed and queried, query performance and data scalability have become focal issues. Although ever larger associated data sets can be stored simply by adding storage media, a huge data set not only reduces query efficiency but also aggravates the performance problems of other common processes, such as RDF publication and exchange. As remote SPARQL endpoints that transmit query results over the network become increasingly popular, RDF publication and exchange are used more and more frequently in associated data querying. Finding a query-friendly associated data compression method is therefore of great significance.
Summary of the invention
In view of the above problems, the purpose of the present invention is to find a query-friendly compression scheme with which SPARQL queries can be executed directly, without decompressing the compressed associated data, while the compression ratio is improved as far as possible.
The objective of the present invention is achieved by mining the potential relation matrices in an associated data set. The method includes:
A query-friendly associated data compression method, characterized in that it comprises:
a structural-model construction step, specifically including:
Step 1: parsing N-Triples-format associated data based on the triple memory model to obtain a set of triples, then building a dictionary and converting the triples to IDs, wherein the parsing process includes:
Step 1.1: filtering out lines that start with "#" or are empty;
Step 1.2: reading each data line and splitting the string on whitespace;
Step 1.3: mapping the split fields to the subject, predicate and object of a triple and constructing the triple;
Step 2: mining potential associations among the triples based on relation-mining constraints;
Step 3: defining the compressed query memory model, which consists of header information, a dictionary and a set of data blocks, each data block consisting of a subject vector, a predicate vector and an object matrix; the compressed query memory model represents triple relations by means of the subject vector, the predicate vector and the object matrix: the subject vector is defined as a column vector of length m, the predicate vector as a row vector of length n, and the object matrix as an m*n matrix; multiplying the subject vector by the predicate vector yields a subject-predicate matrix of the same size as the object matrix, whose entries are then mapped one-to-one to the entries of the object matrix, each mapped pair representing one triple relation;
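As an illustration only (the class and function names below are not from the patent), a minimal Python sketch of such a data block and of how the triples it encodes could be enumerated, assuming every cell of the object matrix holds a single object ID:

    from dataclasses import dataclass
    from typing import Iterator, List, Tuple

    @dataclass
    class DataBlock:
        subjects: List[int]        # subject vector, length m, internally sorted IDs
        predicates: List[int]      # predicate vector, length n, internally sorted IDs
        objects: List[List[int]]   # object matrix, m rows by n columns of object IDs

    def iter_triples(block: DataBlock) -> Iterator[Tuple[int, int, int]]:
        """Enumerate the (subject, predicate, object) ID triples encoded by one block.

        Conceptually, the m x 1 subject vector times the 1 x n predicate vector gives
        an m x n subject-predicate grid; pairing each grid cell with the matching
        object matrix entry yields one triple per cell.
        """
        for i, s in enumerate(block.subjects):
            for j, p in enumerate(block.predicates):
                yield (s, p, block.objects[i][j])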
a step of high-compression-ratio data storage based on the structural model, specifically including:
Step 4: defining the storage length of every ID to be identical, and defining the serialization scheme of the compressed query memory model: auxiliary symbols are used to perform the serialization and deserialization operations;
Step 4.1: serialization: for each data block, flattening the object matrix, concatenating the subject vector, the predicate vector and the flattened object matrix data into a linear data structure with leading auxiliary identifiers, and then linking the processed linear data structures together with data-block auxiliary identifiers;
Step 4.2: deserialization: splitting the serialized data into individual data blocks according to the data-block auxiliary identifier; for each data block, splitting it into the subject vector, the predicate vector and the flattened object matrix according to the leading auxiliary identifiers, and then restoring the flattened object matrix to the original matrix according to the length of the subject vector;
a step of data querying based on the structural model, specifically including:
Step 5: SPARQL querying based on the compressed query memory model: queries on subjects and predicates use binary search, and queries on objects use linear scanning, specifically including:
Step 5.1: subject querying: traversing the subject vectors of all data blocks; because each subject vector is internally sorted, binary search is used, so the subject query time complexity is O(log₂ n);
Step 5.2: predicate querying: traversing the predicate vectors of all data blocks; because each predicate vector is internally sorted, binary search is used, so the predicate query time complexity is O(log₂ n);
Step 5.3: object querying: traversing the object matrices of all data blocks; because the object matrices are not internally sorted, only sequential search is possible, with time complexity O(n).
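A minimal sketch of how steps 5.1 and 5.3 could be realized, assuming the illustrative DataBlock structure from the earlier sketch (the predicate case of step 5.2 is symmetric to the subject case):

    from bisect import bisect_left
    from typing import Iterator, List, Tuple

    def find_by_subject(blocks: List[DataBlock], s: int) -> Iterator[Tuple[int, int, int]]:
        """Step 5.1: binary-search each block's sorted subject vector for subject ID s
        (O(log2 n) per block); the blocks may be searched concurrently."""
        for block in blocks:
            i = bisect_left(block.subjects, s)
            if i < len(block.subjects) and block.subjects[i] == s:
                for j, p in enumerate(block.predicates):
                    yield (s, p, block.objects[i][j])

    def find_by_object(blocks: List[DataBlock], o: int) -> Iterator[Tuple[int, int, int]]:
        """Step 5.3: object matrices are unsorted, so fall back to a linear scan (O(n))."""
        for block in blocks:
            for i, s in enumerate(block.subjects):
                for j, p in enumerate(block.predicates):
                    if block.objects[i][j] == o:
                        yield (s, p, o)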
In the above query-friendly associated data compression method, parsing the N-Triples-format associated data to obtain the set of triples specifically includes:
Step 2.1: filtering out lines that start with "#" or are empty;
Step 2.2: reading each data line and splitting the string on whitespace;
Step 2.3: mapping the split fields to the subject, predicate and object of a triple and constructing the triple.
In the above query-friendly associated data compression method, building the dictionary and converting the triples to IDs specifically includes:
Step 3.1: flattening the triples obtained in the previous step and removing duplicate data;
Step 3.2: assigning a unique ID to each data item to obtain the Dictionary;
Step 3.3: extracting the header information common to every data item in the Dictionary to obtain the Header; replacing the original triple data with IDs to obtain the ID-converted set of triples.
In the above query-friendly associated data compression method, the relation-mining constraints include:
Constraint 1: merging triples that have the same subject and predicate;
Constraint 2: classifying all triples by subject, merging all predicates and objects of the same subject to form a predicate vector and an object vector, and extracting the predicate vector of each subject;
Constraint 3: merging triples that have the same predicate (predicate vector) and object;
Constraint 4: classifying all triples by predicate vector, and merging the subjects and objects of the same predicate vector to form a subject vector and an object matrix.
In the above query-friendly associated data compression method, the compressed query memory model represents triple relations by means of the subject vector, the predicate vector and the object matrix: assuming the subject vector is a column vector of length m, the predicate vector a row vector of length n, and the object matrix an m*n matrix, multiplying the subject vector by the predicate vector yields a subject-predicate matrix of the same size as the object matrix, whose entries are then mapped one-to-one to the entries of the object matrix, each mapped pair representing one triple relation.
In the above query-friendly associated data compression method, the SPARQL querying further includes:
Step 5.4: complex querying: any complex query can be decomposed into the three kinds of simple queries above, and the results of the simple queries are then merged; steps 5.1 to 5.4 can be executed concurrently.
The above query-friendly associated data compression method further includes an object-matrix splitting step, in which a data block whose object matrix is too large is split into multiple data blocks; specifically, as a remedy for the slow object queries caused by an oversized object matrix, the predicate vector is kept unchanged while the subject vector and the object matrix are split correspondingly, yielding multiple small data blocks; this splitting method preserves the structure of the compressed query memory model and ensures that the compressed query memory model after splitting can still perform concurrent query operations.
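A minimal sketch of this splitting step under the same illustrative DataBlock structure (max_rows is a hypothetical tuning parameter, not defined in the patent):

    from typing import List

    def split_block(block: DataBlock, max_rows: int) -> List[DataBlock]:
        """Split an oversized data block row-wise: the predicate vector is kept
        unchanged, while the subject vector and the object matrix are cut into
        chunks of at most max_rows rows, so every resulting small block keeps the
        same subject-vector / predicate-vector / object-matrix structure."""
        if len(block.subjects) <= max_rows:
            return [block]
        return [
            DataBlock(
                subjects=block.subjects[i:i + max_rows],
                predicates=list(block.predicates),      # shared, unchanged predicate vector
                objects=block.objects[i:i + max_rows],
            )
            for i in range(0, len(block.subjects), max_rows)
        ]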
The invention therefore has the following advantages: compared with most existing compression schemes, an associated data set processed by the present invention achieves a higher compression ratio, and SPARQL query operations can be performed directly in the compressed state.
Brief description of the drawings
Fig. 1 is the compression principle diagram of the embodiment of the present invention.
Fig. 2 is the compressed query memory model diagram of the embodiment of the present invention.
Fig. 3 is a diagram describing the four relation-mining rules of the embodiment of the present invention.
Fig. 4 is a diagram describing the long-data-block splitting rule of the embodiment of the present invention.
Fig. 5 is a schematic flowchart of the method of the present invention.
Specific embodiment
The technical solution of the present invention is described in detail below with reference to the drawings and embodiments.
The technical scheme provided by the present invention is an associated data set compression algorithm based on relation matrices, specifically comprising the following steps:
1. Define the triple memory model, comprising three data segments: subject S, predicate P and object O.
2. Input N-Triples-format associated data and parse it to obtain a set of triples.
The detailed process is as follows (a small parsing sketch follows these sub-steps):
2.1. Filter out lines that start with "#" and empty lines.
2.2. Read each data line and split the string on whitespace.
2.3. Map the split fields to the subject, predicate and object of a triple and construct the triple.
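As a rough illustration only, a small Python sketch of steps 2.1-2.3, assuming plain N-Triples lines of the form "<s> <p> <o> ." (the function name parse_ntriples is not from the patent):

    from typing import Iterator, Tuple

    def parse_ntriples(lines: Iterator[str]) -> Iterator[Tuple[str, str, str]]:
        """Parse N-Triples lines into (subject, predicate, object) string triples."""
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):   # step 2.1: drop comments and empty lines
                continue
            parts = line.split()                   # step 2.2: split on whitespace
            # step 2.3: map the fields to subject, predicate and object; literal
            # objects that contain spaces are rejoined and the trailing "." is dropped
            subject, predicate, obj = parts[0], parts[1], " ".join(parts[2:-1])
            yield (subject, predicate, obj)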
3. Build the dictionary and convert the triples to IDs.
The detailed process is as follows (a dictionary-building sketch follows these sub-steps):
3.1. Flatten the triples obtained in the previous step and remove duplicate data.
3.2. Assign a unique ID to each data item to obtain the Dictionary.
3.3. Extract the header information common to every data item in the Dictionary to obtain the Header.
3.4. Replace the original triple data with IDs to obtain the ID-converted set of triples.
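A simplified sketch of steps 3.1, 3.2 and 3.4 (the Header extraction of step 3.3 is omitted for brevity; build_dictionary is an illustrative name):

    from typing import Dict, Iterable, List, Tuple

    def build_dictionary(triples: Iterable[Tuple[str, str, str]]
                         ) -> Tuple[Dict[str, int], List[Tuple[int, int, int]]]:
        """Assign a unique integer ID to every distinct term and rewrite the
        de-duplicated triples as ID triples."""
        dictionary: Dict[str, int] = {}
        id_triples: List[Tuple[int, int, int]] = []
        seen = set()
        for s, p, o in triples:
            if (s, p, o) in seen:                  # step 3.1: remove duplicate triples
                continue
            seen.add((s, p, o))
            ids = []
            for term in (s, p, o):
                if term not in dictionary:         # step 3.2: one unique ID per term
                    dictionary[term] = len(dictionary)
                ids.append(dictionary[term])
            id_triples.append((ids[0], ids[1], ids[2]))  # step 3.4: ID-converted triple
        return dictionary, id_triples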
4. Relation mining, first step: merge triples that have the same subject and predicate; see Step 1 in Fig. 1 and the derivation formula Rule 1 in Fig. 3.
5. Relation mining, second step: classify all triples by subject, merge all predicates and objects of the same subject to form a predicate vector and an object vector, extract the predicate vector of each subject and sort each predicate vector internally; see Step 2 in Fig. 1 and the derivation formula Rule 2 in Fig. 3.
6. Relation mining, third step: merge triples that have the same predicate (predicate vector) and object; see Step 3 in Fig. 1 and the derivation formula Rule 3 in Fig. 3.
7. Relation mining, fourth step: classify all triples by predicate (predicate vector), and merge the subjects and objects of the same predicate (predicate vector) to form an internally sorted subject vector and an object matrix; such a structure composed of a subject vector, a predicate vector and an object matrix is called a data block; see Step 4 in Fig. 1 and the derivation formula Rule 4 in Fig. 3. A grouping sketch for steps 4-7 follows.
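A simplified grouping sketch for steps 4-7, assuming the illustrative DataBlock structure from the earlier sketch and at most one object per (subject, predicate) pair (multi-object cells and the object merge of Rule 3 are omitted; build_blocks is an illustrative name):

    from collections import defaultdict
    from typing import Dict, List, Tuple

    def build_blocks(id_triples: List[Tuple[int, int, int]]) -> List[DataBlock]:
        """Group ID triples into data blocks (simplified Rules 1-4).

        First gather each subject's predicate/object pairs, then group together the
        subjects that share exactly the same sorted predicate vector; every such
        group becomes one data block (subject vector, predicate vector, object matrix).
        """
        per_subject: Dict[int, Dict[int, int]] = defaultdict(dict)
        for s, p, o in id_triples:
            per_subject[s][p] = o                        # assumes one object per (s, p)

        by_pred_vector: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
        for s, po in per_subject.items():
            by_pred_vector[tuple(sorted(po))].append(s)  # Rule 2: sorted predicate vector

        blocks: List[DataBlock] = []
        for pred_vector, subjects in by_pred_vector.items():
            subjects.sort()                              # Rule 4: subject vector is sorted
            objects = [[per_subject[s][p] for p in pred_vector] for s in subjects]
            blocks.append(DataBlock(subjects, list(pred_vector), objects))
        return blocks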
8. Extract the subject vector, predicate vector and object matrix of each data block and build the compressed query memory model; see Fig. 2 for the compressed query memory model.
9. SPARQL querying in the compressed state: query operations can be performed concurrently over all data blocks.
The detailed process is as follows:
9.1. Subject query: traverse the subject vectors of all data blocks; because each subject vector is internally sorted, binary search is used, so the subject query time complexity is O(log₂ n).
9.2. Predicate query: traverse the predicate vectors of all data blocks; because each predicate vector is internally sorted, binary search is used, so the predicate query time complexity is O(log₂ n).
9.3. Object query: traverse the object matrices of all data blocks; because the object matrices are not internally sorted, only sequential search is possible, with time complexity O(n); for data blocks with extremely large object matrices, the block can be split to reduce the time overhead of linear traversal; see Fig. 4.
9.4. Complex query: any complex query can be decomposed into the three kinds of simple queries above, and the results of the simple queries are then merged.
10. Serialize and write to file: set the storage length of every ID to be identical, and use the auxiliary symbols "|" (identifier), "," (subject-predicate-object separator) and "/" (data block separator) to implement serialization and deserialization.
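One possible reading of step 10, sketched below: IDs are written with a fixed decimal width (ID_WIDTH is an assumed constant), "|" marks the start of each section, "," separates the subject-vector, predicate-vector and flattened-object sections of a block, and "/" separates data blocks; the exact byte layout is not specified here, so this is an assumption:

    from typing import List

    ID_WIDTH = 8  # assumption: every ID is stored with the same fixed width

    def _encode(ids: List[int]) -> str:
        return "".join(str(i).zfill(ID_WIDTH) for i in ids)

    def serialize(blocks: List[DataBlock]) -> str:
        """Serialize blocks: flatten each object matrix and join the three sections."""
        chunks = []
        for b in blocks:
            flat = [o for row in b.objects for o in row]          # flatten the object matrix
            chunks.append("|" + _encode(b.subjects) + ",|" +
                          _encode(b.predicates) + ",|" + _encode(flat))
        return "/".join(chunks)

    def deserialize(data: str) -> List[DataBlock]:
        """Invert serialize(): split on "/" and ",", strip the "|" markers, cut
        fixed-width IDs, and restore each object matrix from the subject-vector length."""
        def decode(text: str) -> List[int]:
            return [int(text[i:i + ID_WIDTH]) for i in range(0, len(text), ID_WIDTH)]
        blocks = []
        for chunk in data.split("/"):
            s_part, p_part, o_part = (sec.lstrip("|") for sec in chunk.split(","))
            subjects, predicates, flat = decode(s_part), decode(p_part), decode(o_part)
            m = len(subjects)
            n = len(flat) // m                                    # columns per row
            objects = [flat[i * n:(i + 1) * n] for i in range(m)]
            blocks.append(DataBlock(subjects, predicates, objects))
        return blocks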
The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art to which the present invention belongs may make various modifications or supplements to the described specific embodiments, or substitute them in similar ways, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims (7)

1. A query-friendly associated data compression method, characterized in that it comprises:
a structural-model construction step, specifically including:
Step 1: parsing N-Triples-format associated data based on the triple memory model to obtain a set of triples, then building a dictionary and converting the triples to IDs, wherein the parsing process includes:
Step 1.1: filtering out lines that start with "#" or are empty;
Step 1.2: reading each data line and splitting the string on whitespace;
Step 1.3: mapping the split fields to the subject, predicate and object of a triple and constructing the triple;
Step 2: mining potential associations among the triples based on relation-mining constraints;
Step 3: defining the compressed query memory model, which consists of header information, a dictionary and a set of data blocks, each data block consisting of a subject vector, a predicate vector and an object matrix; the compressed query memory model represents triple relations by means of the subject vector, the predicate vector and the object matrix: the subject vector is defined as a column vector of length m, the predicate vector as a row vector of length n, and the object matrix as an m*n matrix; multiplying the subject vector by the predicate vector yields a subject-predicate matrix of the same size as the object matrix, whose entries are then mapped one-to-one to the entries of the object matrix, each mapped pair representing one triple relation;
a step of high-compression-ratio data storage based on the structural model, specifically including:
Step 4: defining the storage length of every ID to be identical, and defining the serialization scheme of the compressed query memory model: auxiliary symbols are used to perform the serialization and deserialization operations;
Step 4.1: serialization: for each data block, flattening the object matrix, concatenating the subject vector, the predicate vector and the flattened object matrix data into a linear data structure with leading auxiliary identifiers, and then linking the processed linear data structures together with data-block auxiliary identifiers;
Step 4.2: deserialization: splitting the serialized data into individual data blocks according to the data-block auxiliary identifier; for each data block, splitting it into the subject vector, the predicate vector and the flattened object matrix according to the leading auxiliary identifiers, and then restoring the flattened object matrix to the original matrix according to the length of the subject vector;
a step of data querying based on the structural model, specifically including:
Step 5: SPARQL querying based on the compressed query memory model: queries on subjects and predicates use binary search, and queries on objects use linear scanning, specifically including:
Step 5.1: subject querying: traversing the subject vectors of all data blocks; because each subject vector is internally sorted, binary search is used, so the subject query time complexity is O(log₂ n);
Step 5.2: predicate querying: traversing the predicate vectors of all data blocks; because each predicate vector is internally sorted, binary search is used, so the predicate query time complexity is O(log₂ n);
Step 5.3: object querying: traversing the object matrices of all data blocks; because the object matrices are not internally sorted, only sequential search is possible, with time complexity O(n).
2. The query-friendly associated data compression method according to claim 1, characterized in that parsing the N-Triples-format associated data to obtain the set of triples specifically includes:
Step 2.1: filtering out lines that start with "#" or are empty;
Step 2.2: reading each data line and splitting the string on whitespace;
Step 2.3: mapping the split fields to the subject, predicate and object of a triple and constructing the triple.
3. The query-friendly associated data compression method according to claim 1, characterized in that building the dictionary and converting the triples to IDs specifically includes:
Step 3.1: flattening the triples obtained in the previous step and removing duplicate data;
Step 3.2: assigning a unique ID to each data item to obtain the Dictionary;
Step 3.3: extracting the header information common to every data item in the Dictionary to obtain the Header; replacing the original triple data with IDs to obtain the ID-converted set of triples.
4. The query-friendly associated data compression method according to claim 1, characterized in that the relation-mining constraints include:
Constraint 1: merging triples that have the same subject and predicate;
Constraint 2: classifying all triples by subject, merging all predicates and objects of the same subject to form a predicate vector and an object vector, and extracting the predicate vector of each subject;
Constraint 3: merging triples that have the same predicate vector and object;
Constraint 4: classifying all triples by predicate vector, and merging the subjects and objects of the same predicate vector to form a subject vector and an object matrix.
5. The query-friendly associated data compression method according to claim 1, characterized in that the compressed query memory model represents triple relations by means of the subject vector, the predicate vector and the object matrix: assuming the subject vector is a column vector of length m, the predicate vector a row vector of length n, and the object matrix an m*n matrix, multiplying the subject vector by the predicate vector yields a subject-predicate matrix of the same size as the object matrix, whose entries are then mapped one-to-one to the entries of the object matrix, each mapped pair representing one triple relation.
6. The query-friendly associated data compression method according to claim 1, characterized in that the SPARQL querying further includes:
Step 5.4: complex querying: any complex query can be decomposed into the three kinds of simple queries above, and the results of the simple queries are then merged; steps 5.1 to 5.4 can be executed concurrently.
7. The query-friendly associated data compression method according to claim 1, characterized in that it further includes an object-matrix splitting step, in which a data block whose object matrix is too large is split into multiple data blocks; specifically, as a remedy for the slow object queries caused by an oversized object matrix, the predicate vector is kept unchanged while the subject vector and the object matrix are split correspondingly, yielding multiple small data blocks; this splitting method preserves the structure of the compressed query memory model and ensures that the compressed query memory model after splitting can still perform concurrent query operations.
CN201611209081.1A 2016-12-23 2016-12-23 Query-friendly associated data compression method Active CN106709006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611209081.1A CN106709006B (en) 2016-12-23 2016-12-23 Query-friendly associated data compression method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611209081.1A CN106709006B (en) 2016-12-23 2016-12-23 Query-friendly associated data compression method

Publications (2)

Publication Number Publication Date
CN106709006A true CN106709006A (en) 2017-05-24
CN106709006B CN106709006B (en) 2020-10-30

Family

ID=58895698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611209081.1A Active CN106709006B (en) 2016-12-23 2016-12-23 Query-friendly associated data compression method

Country Status (1)

Country Link
CN (1) CN106709006B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4199361A1 (en) * 2021-12-17 2023-06-21 Dassault Systèmes Compressed graph notation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521299A (en) * 2011-11-30 2012-06-27 华中科技大学 Method for processing data of resource description framework
CN102968804A (en) * 2012-11-23 2013-03-13 西安工程大学 Method for carrying out compression storage on adjacent matrixes of sparse directed graph
CN103326730A (en) * 2013-06-06 2013-09-25 清华大学 Data parallelism compression method
CN104809168A (en) * 2015-04-06 2015-07-29 华中科技大学 Partitioning and parallel distribution processing method of super-large scale RDF graph data
CN105955999A (en) * 2016-04-20 2016-09-21 华中科技大学 Large scale RDF graph Thetajoin query processing method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JAVIER D. FERNÁNDEZ ET AL.: "Binary RDF Representation for Publication and Exchange (HDT)", Web Semantics: Science, Services and Agents on the World Wide Web *
PINGPENG YUAN ET AL.: "TripleBit: a Fast and Compact System for Large Scale RDF Data", Proceedings of the VLDB Endowment *
SANDRA ÁLVAREZ-GARCÍA ET AL.: "Compressed Vertical Partitioning for Efficient RDF Management", Knowledge and Information Systems *
"BitMat: A Main Memory RDF Triple Store", Tetherless World Constellation, Troy, NY *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704617A (en) * 2017-10-25 2018-02-16 武汉科技大学 A kind of compression method of the associated data based on classification tree index
CN110457697A (en) * 2019-08-01 2019-11-15 南京邮电大学 A kind of RDF data compression and decompression method based on anonymous predicate index
CN110457697B (en) * 2019-08-01 2023-01-31 南京邮电大学 RDF data compression and decompression method based on anonymous predicate index
CN111026747A (en) * 2019-10-25 2020-04-17 广东数果科技有限公司 Distributed graph data management system, method and storage medium
CN111291185A (en) * 2020-01-21 2020-06-16 京东方科技集团股份有限公司 Information extraction method and device, electronic equipment and storage medium
WO2021147726A1 (en) * 2020-01-21 2021-07-29 京东方科技集团股份有限公司 Information extraction method and apparatus, electronic device and storage medium
CN111291185B (en) * 2020-01-21 2023-09-22 京东方科技集团股份有限公司 Information extraction method, device, electronic equipment and storage medium
US11922121B2 (en) 2020-01-21 2024-03-05 Boe Technology Group Co., Ltd. Method and apparatus for information extraction, electronic device, and storage medium

Also Published As

Publication number Publication date
CN106709006B (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN106709006A (en) Associated data compressing method friendly to query
US10216794B2 (en) Techniques for evaluating query predicates during in-memory table scans
CN110990638B (en) Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment
KR102407510B1 (en) Method, apparatus, device and medium for storing and querying data
US9535939B2 (en) Intra-block partitioning for database management
CN107368527B (en) Multi-attribute index method based on data stream
CN104408192B (en) The compression processing method and device of character string type row
CN105677683A (en) Batch data query method and device
CN102402617A (en) Easily-compressed database index storage system utilizing fragments and sparse bitmap and corresponding construction, scheduling and query processing methods thereof
CN107357843B (en) Massive network data searching method based on data stream structure
CN106874425B (en) Storm-based real-time keyword approximate search algorithm
CN107704617A (en) A kind of compression method of the associated data based on classification tree index
CN107291964A (en) A kind of method that fuzzy query is realized based on HBase
US11657051B2 (en) Methods and apparatus for efficiently scaling result caching
CN106203171A (en) Big data platform Security Index system and method
CN106649286B (en) One kind carrying out the matched method of term based on even numbers group dictionary tree
CN108287985A (en) A kind of the DNA sequence dna compression method and system of GPU acceleration
Soransso et al. Data modeling for analytical queries on document-oriented DBMS
CN106503040A (en) It is suitable for KV data bases and its creation method of SQL query method
CN106484684B (en) Data in a kind of pair of database carry out the matched method of term
CN102214216B (en) Aggregation summarization method for keyword search result of hierarchical relation data
US20150012563A1 (en) Data mining using associative matrices
CN112052240A (en) HBase secondary memory index construction method based on coprocessor
Stockinger et al. Using bitmap index for joint queries on structured and text data
CN103049506A (en) Data caching method and system of mobile device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 430081 No. 947 Heping Avenue, Qingshan District, Hubei, Wuhan

Applicant after: WUHAN University OF SCIENCE AND TECHNOLOGY

Applicant after: Wuhan Chutianyun Technology Co.,Ltd.

Address before: 430081 No. 947 Heping Avenue, Qingshan District, Hubei, Wuhan

Applicant before: WUHAN University OF SCIENCE AND TECHNOLOGY

Applicant before: WUHAN CHUCLOUD TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20170524

Assignee: Wuhan Bilin Software Co.,Ltd.

Assignor: WUHAN University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2022420000026

Denomination of invention: A query-friendly associated data compression method

Granted publication date: 20201030

License type: Common License

Record date: 20220330