CN110166279B - Dynamic layout method of unstructured cloud data management system - Google Patents

Dynamic layout method of unstructured cloud data management system

Info

Publication number
CN110166279B
Authority
CN
China
Prior art keywords
data
data partition
cloud
partition
unstable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910282178.2A
Other languages
Chinese (zh)
Other versions
CN110166279A (en
Inventor
郑美光
杨姣
常成龙
杨柳
胡志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN201910282178.2A
Publication of CN110166279A
Application granted
Publication of CN110166279B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of big data management and discloses a dynamic layout method of an unstructured cloud data management system, which reduces the data transmission overhead caused by frequent data movement in the unstructured data management system. The method establishes a cloud model of each data partition in a first data partition set, divides all data partitions in the first data partition set into a stable data partition group and an unstable data partition group according to the cloud models, and places each data partition in the unstable data partition group onto a Slave storage node by means of a data layout algorithm.

Description

Dynamic layout method of unstructured cloud data management system
Technical Field
The invention relates to the field of big data management, in particular to a dynamic layout method for identifying unstable partitions of an unstructured cloud data management system.
Background
With the development of cloud computing, Internet and mobile device technologies, human society has entered the big data era. Traditional relational databases struggle to meet the flexible management requirements of massive data. In recent years, unstructured data management systems represented by NoSQL systems have developed rapidly and, as typical cloud storage systems, effectively support big data management. In a big data environment the way data is organized and managed has changed greatly: data is stored in a distributed manner. Distributed storage requires that the data first be partitioned; the partitions are then distributed across the nodes by a suitable strategy.
Some NoSQL databases directly use traditional distribution techniques such as hash distribution, random distribution and round-robin distribution to distribute data partitions. For example, HBase distributes range-partitioned data across system nodes by round-robin, and Dynamo uses consistent hashing, a partitioning method whose implied distribution strategy is hash distribution. Other work considers the dependencies among data: one proposal is a data layout scheme combined with the K-means algorithm, which effectively reduces the number of data movements; another places data in a suitable data center through quantitative calculation on the relationships among data, reducing the amount of data moved between data centers; a third mines data associations from log information and places strongly associated data together on the same rack, reducing data migration across racks. All of these are static distribution techniques: the distribution is fixed at system design time and does not take the operating behavior of the system into account.
However, users' requirements on big data change over time, and large amounts of data must be transmitted over the network, which incurs a high network transmission cost; the storage nodes of data partitions therefore need to be adjusted dynamically. When an unstructured data management system is accessed, highly concurrent user queries and modifications require frequent I/O operations and data synchronization across multiple data nodes, and the resulting data transmission overhead increases the various overheads of the underlying storage system.
Therefore, it is important to solve the data transmission management problem of unstructured data management systems.
Disclosure of Invention
The invention aims to provide a dynamic layout method of an unstructured cloud data management system, so as to reduce data transmission overhead caused by frequent data movement in the unstructured data management system, reduce system energy consumption and improve the energy efficiency level of the unstructured data management system.
In order to achieve the above object, the present invention provides a dynamic layout method for identifying unstable partitions of an unstructured cloud data management system, comprising the following steps:
S1: determining a data set required by a task submitted by a client, and partitioning the data set by a metadata manager in a Master node to obtain a first data partition set;
S2: establishing a cloud model of each data partition in the first data partition set, and dividing all data partitions in the first data partition set into a stable data partition group and an unstable data partition group according to the cloud models;
S3: placing each data partition in the unstable data partition group onto a Slave storage node by means of a data layout algorithm.
Preferably, the S3 specifically includes the following steps:
S31: for a data partition $dp_j$ stored on node $s_i$, denoted $dp_j|s_i$, each data set in the partition together with its transmission-time cost $Cost(D_j[x], dp_j, s_i)$ forms one cloud drop of the cloud model of $dp_j$;
S32: inputting the cloud drops into a reverse cloud generator to construct the cloud model of the data partition, the cloud model being calculated as follows:
Figure BDA0002022019780000021
where Ex denotes the expectation, En the entropy, He the hyper-entropy, N the number of cloud drops in a cloud model, and u each cloud drop in the cloud model;
S33: repeating steps S31 to S32 to calculate the cloud models of all data partitions in the first data partition set, selecting the data partition whose cloud model has the smallest expectation as a stable data partition, and selecting the data partition whose cloud model has the largest expectation as an unstable data partition;
S34: calculating the cloud model similarity between the data partition of each cloud model and the stable data partition and the unstable data partition respectively, and dividing the data partitions according to a similarity threshold to obtain a stable data partition group and an unstable data partition group.
Preferably, the calculation formula of the cloud model similarity is as follows:
Figure BDA0002022019780000022
where
Figure BDA0002022019780000023
denotes the area between the expectation curve of the normal cloud model $DPC_1$ and the abscissa,
Figure BDA0002022019780000024
denotes the area between the expectation curve of the normal cloud model $DPC_2$ and the abscissa, and $S$ denotes the area of the overlapping region where the expectation curves of the two normal cloud models intersect.
Preferably, the Master node is provided with a query analyzer and a data partition manager, and the Master node analyzes the data processing request of the task through the query analyzer to obtain a data set required by the task; and the Master node calls the data layout algorithm through the data partition manager to realize dynamic data partition management.
Preferably, the method further comprises the step of: a monitor is additionally arranged for each node; in S2, the collecting information about each data partition in real time is implemented by the monitor.
Preferably, in S4, the invoking, by using a data layout algorithm, each data partition in the unstable data partition group into a Slave storage node specifically includes the steps of:
taking the local access amount of the data sets in the unstable data partition group as the basis for deciding between replication and migration; when the local access amount of the data sets in the unstable data partition group is greater than or equal to a set threshold, keeping the unstable data partition on the current node and executing a replication operation; and when the local access amount of the data sets in the unstable data partition group is smaller than the set threshold, executing a migration operation that moves the partition to another, lower-cost Slave node for storage.
Preferably, in S3, when the data partitions in the unstable data partition group are placed onto Slave storage nodes, the method further includes the step of: making at least two copies of each data partition and placing each copy onto a different Slave storage node.
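The requirement that every copy land on a different Slave storage node can be illustrated with a minimal sketch. The function name `place_replicas`, the node labels and the use of a uniformly random choice are illustrative assumptions, not the layout algorithm of the invention.

```python
import random


def place_replicas(partition_id, slave_nodes, copies=2, rng=None):
    """Pick `copies` distinct Slave nodes for one data partition.

    Hypothetical helper: a random choice stands in for whatever data
    layout algorithm (Random, K-Means, ...) is actually used.
    """
    if copies < 2:
        raise ValueError("the text calls for at least two copies")
    if copies > len(slave_nodes):
        raise ValueError("not enough Slave nodes for distinct placement")
    rng = rng or random.Random()
    # random.sample draws without replacement, so no node is chosen twice.
    return {partition_id: rng.sample(list(slave_nodes), copies)}


nodes = ["s1", "s2", "s3", "s4"]
placement = place_replicas("dp1", nodes, copies=3, rng=random.Random(0))
```

Sampling without replacement is what guarantees that no two copies share a node; any deterministic layout rule would need to enforce the same constraint.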
Preferably, the data layout algorithm is a Random algorithm or a K-Means algorithm.
Preferably, the related information of each data partition includes the size of the data partition moving data amount and the requested number of times.
The invention has the following beneficial effects:
According to the dynamic layout method of the unstructured cloud data management system, cloud models are built for the data partitions in the storage system of the unstructured data management system, the association relationships among the data partitions are analyzed, and unstable data partitions are identified and laid out again. This reduces the data transmission overhead caused by frequent data movement in the unstructured data management system, reduces system energy consumption, and further improves the energy efficiency of the unstructured data management system.
The present invention will be described in further detail below with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a dynamic layout method for unstable partition identification of an unstructured cloud data management system according to a preferred embodiment of the present invention;
FIG. 2 is a data layout framework of the unstructured data management system of the preferred embodiment of the present invention;
FIG. 3 shows how the number of data transmissions changes as the number of tasks increases, with 10 nodes in the first group of experiments of the preferred embodiment of the present invention;
FIG. 4 shows how the data transmission amount changes as the number of tasks increases, with 10 nodes in the first group of experiments of the preferred embodiment of the present invention;
FIG. 5 shows how the data transmission time changes as the number of tasks increases, with 10 nodes in the first group of experiments of the preferred embodiment of the present invention;
FIG. 6 shows how the number of data transmissions changes as the number of tasks increases, with 2-10 nodes in the first group of experiments of the preferred embodiment of the present invention;
FIG. 7 shows how the data transmission amount changes as the number of tasks increases, with 2-10 nodes in the first group of experiments of the preferred embodiment of the present invention;
FIG. 8 shows how the data transmission time changes as the number of tasks increases, with 2-10 nodes in the first group of experiments of the preferred embodiment of the present invention;
FIG. 9 shows how the number of data transmissions changes as the number of tasks increases, with 10 nodes in the second group of experiments of the preferred embodiment of the present invention;
FIG. 10 shows how the data transmission amount changes as the number of tasks increases, with 10 nodes in the second group of experiments of the preferred embodiment of the present invention;
FIG. 11 shows how the data transmission time changes as the number of tasks increases, with 10 nodes in the second group of experiments of the preferred embodiment of the present invention;
FIG. 12 shows how the number of data transmissions changes as the number of tasks increases, with 2-10 nodes in the second group of experiments of the preferred embodiment of the present invention;
FIG. 13 shows how the data transmission amount changes as the number of tasks increases, with 2-10 nodes in the second group of experiments of the preferred embodiment of the present invention;
FIG. 14 shows how the data transmission time changes as the number of tasks increases, with 2-10 nodes in the second group of experiments of the preferred embodiment of the present invention.
Detailed Description
The embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
Unless otherwise defined, all terms of art used hereinafter have the same meaning as commonly understood by one of ordinary skill in the art. The use of "first," "second," and similar terms in the description and in the claims of the present application do not denote any order, quantity, or importance, but rather the intention is to distinguish one element from another. Also, the use of the terms "a" or "an" and the like do not denote a limitation of quantity, but rather denote the presence of at least one.
Example 1
Referring to fig. 1, the present embodiment provides a dynamic layout method of an unstructured cloud data management system, including the following steps:
S1: determining a data set required by a task submitted by a client, and partitioning the data set by a metadata manager in a Master node to obtain a first data partition set;
S2: establishing a cloud model of each data partition in the first data partition set, and dividing all data partitions in the first data partition set into a stable data partition group and an unstable data partition group according to the cloud models;
S3: placing each data partition in the unstable data partition group onto a Slave storage node by means of a data layout algorithm.
According to the dynamic layout method for identifying the unstable partitions of the unstructured cloud data management system, cloud modeling is performed on the data partitions in the storage system by aiming at the unstructured data management system, the incidence relation among the data partitions is analyzed, the unstable data partitions are identified and are rearranged, data transmission overhead caused by frequent movement of data in the unstructured data management system can be reduced, system energy consumption is reduced, and the energy efficiency level of the unstructured data management system can be further improved.
It should be noted that, in the framework of the unstructured data management system, the system distributes the data partitions accessed by operations across multiple nodes and tends to assign each task to the node that requires the least data transmission. Reading data that a node does not hold from other nodes inevitably generates a large amount of data transmission. Therefore, identifying the data partitions on nodes where data is transmitted frequently, and laying out such data partitions again, reduces the data transmission overhead. In this embodiment, for NoSQL-oriented data management systems, a data partition whose data sets on a Slave node are frequently and passively called, causing large data transmission overhead, is referred to as an unstable data partition.
Specifically, the unstructured data management system in this embodiment is based on the Master-Slave structure adopted by NoSQL data management systems represented by BigTable and HBase. The data layout framework of the unstructured data management system is shown in FIG. 2. In the Master-Slave structure, the metadata manager in the Master node partitions a data set and distributes the data partitions to different data storage Slave nodes. Multiple copies are usually made during partitioning and distributed to different Slave nodes, which improves the parallelism of the system and avoids single points of failure. In addition, each node in the system is equipped with a monitor (Monitor) that collects, in real time, information such as the data volume (storage space occupied) of each data partition on the node and the number of times it is requested.
The location of the data determines where a task is executed: computation is carried out on the node where the data resides. The content of the dashed box in FIG. 2 is the workflow of the data management system, specifically: the client transmits the task submitted by the user to the task scheduling management server; the task scheduling management server obtains the data location from the metadata management server in the Master node according to the data required by the task, and schedules the task to a Slave node according to the relevant task scheduling strategy; the Slave node obtains the data and executes the task according to the analysis result of the task scheduling manager; and the result of the task execution is returned through the task scheduling management server and presented to the client.
First, the target system $S$ includes $n$ Slave nodes, $S = \{s_1, s_2, \ldots, s_n\}$. The bandwidth matrix between nodes is defined as follows:
Figure BDA0002022019780000061
The set of data sets in the system is $D = \{d_1, d_2, \ldots, d_r\}$. Each data set is denoted $d_k = \{size_k\}$, where $size_k$ is the total size of all data in data set $d_k$. The task set is $T = \{t_1, t_2, \ldots, t_l\}$, where $t_c = \{D_c\}$ and $D_c$ denotes the data sets required by task $t_c$. The parameters and definitions of the system are shown in Table 1 below:
Table 1 Parameter description of the system
Figure BDA0002022019780000062
The Master node divides the data sets to be stored into data partitions. A data partition and its copies are stored on different Slave nodes, and each Slave node holds several data partitions. The set of data partitions is denoted $DP = \{dp_1, dp_2, \ldots, dp_m\}$. Each data partition is represented by a two-tuple $dp_j = \{D_j, S_j\}$, where $D_j$ and $S_j$ are sets stored as arrays. $D_j$ is the set of data sets in $dp_j$, and accessing $D_j[x]$ yields the corresponding data set $d_k$. $S_j$ is the set of storage nodes of the data partition and its copies, and accessing $S_j[y]$ yields the storage node $s_i$ of the data partition or one of its copies. For example, $dp_2 = \{D_2, S_2\}$ with $S_2 = \{s_1, s_2, s_3, s_7\}$ means that data partition $dp_2$ is stored on the four nodes $s_1, s_2, s_3, s_7$, and $S_2[3] = s_7$ means that one of the storage nodes of $dp_2$ is $s_7$. $|D_j|$ is the number of data sets contained in $dp_j$, and $|S_j|$ is the total number of copies of the data partition. In the following,
Figure BDA0002022019780000063
denotes the data partition $dp_j$ whose storage node is $s_i$.
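The two-tuple representation of a data partition can be sketched as a small data structure. The class and field names are illustrative, and the data set names and sizes in the example are made up; only the node list of $dp_2$ is taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSet:
    name: str        # d_k
    size_mb: float   # size_k, total size of all data in the data set


@dataclass
class DataPartition:
    name: str                 # dp_j
    datasets: List[DataSet]   # D_j, data sets held by the partition
    nodes: List[str]          # S_j, nodes storing the partition and its copies

    def num_datasets(self) -> int:
        return len(self.datasets)   # |D_j|

    def num_copies(self) -> int:
        return len(self.nodes)      # |S_j|


# dp_2 from the text: stored on s1, s2, s3 and s7.
# The two data sets and their sizes are hypothetical.
dp2 = DataPartition(
    name="dp2",
    datasets=[DataSet("d4", 55.0), DataSet("d5", 62.0)],
    nodes=["s1", "s2", "s3", "s7"],
)
fourth_node = dp2.nodes[3]   # S_2[3] with zero-based indexing
```

Note that $S_2[3] = s_7$ in the text only works out if the arrays are indexed from zero, which is what the Python list does here.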
The data set $d_k$ in the data partition $dp_j$ stored on $s_i$ is denoted
Figure BDA0002022019780000064
The number of requests made to it differs from node to node and is denoted
Figure BDA0002022019780000065
In this embodiment, the transmission time of the passively called data sets of a data partition is used as the metric for identifying unstable data partitions. The total transmission-time overhead of data set $d_k$ in data partition $dp_j$ stored on node $s_i$ is calculated as follows:
Figure BDA0002022019780000071
First, a cloud model is obtained for each data partition and copy. The cloud model is an uncertainty conversion model between quantitative descriptions and qualitative concepts; it represents a concept as a whole through three numerical characteristics: the expectation Ex, the entropy En and the hyper-entropy He. This embodiment uses the reverse cloud generator from cloud model theory, which derives the three characteristic values of the qualitative concept from quantitative cloud drops. For data partition $dp_j$ stored on node $s_i$, denoted
Figure BDA0002022019780000072
each data set in the partition together with its transmission-time cost $Cost(D_j[x], dp_j, s_i)$ forms one cloud drop of the cloud model of $dp_j$. The cloud drops are input into the reverse cloud generator to construct the cloud model DPC{Ex, En, He} of the data partition, as follows:
Figure BDA0002022019780000073
wherein, N represents the number of cloud drops in a cloud model, and u represents each cloud drop in the cloud model.
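The formula image is not reproduced in this text, so the following is only a sketch of a reverse cloud generator. It assumes the commonly cited form without certainty degrees: Ex is the sample mean, En is the mean absolute deviation scaled by sqrt(pi/2), and He is derived from the sample variance and En. The function name and the sample cost values are hypothetical.

```python
import math


def backward_cloud(drops):
    """Estimate (Ex, En, He) from a list of cloud drops (cost values).

    Assumed formulas (not confirmed by the source text):
      Ex = mean(u)
      En = sqrt(pi / 2) * mean(|u - Ex|)
      He = sqrt(max(S^2 - En^2, 0)), S^2 being the sample variance
    """
    n = len(drops)
    if n < 2:
        raise ValueError("at least two cloud drops are needed")
    ex = sum(drops) / n
    en = math.sqrt(math.pi / 2) * sum(abs(u - ex) for u in drops) / n
    variance = sum((u - ex) ** 2 for u in drops) / (n - 1)
    # Clamp at zero: with few drops S^2 can fall below En^2.
    he = math.sqrt(max(variance - en ** 2, 0.0))
    return ex, en, he


# Hypothetical transmission-time costs of the data sets in one partition.
ex, en, he = backward_cloud([2.0, 4.0, 6.0, 8.0])
```

Clamping the term under the square root is a practical choice for small samples and is likewise an assumption, not something the text prescribes.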
After the cloud model of each data partition is obtained, the data partitions are divided into a Stable data partition Group (SG) and an Unstable data partition Group (UG) based on their cloud models. SG and UG are first initialized: the data partition with the smallest Ex value is placed in SG and the data partition with the largest Ex value is placed in UG. Then, using a cloud comparison algorithm based on cloud model similarity, the similarity of each data partition to SG and to UG is calculated to determine the members of SG and UG, thereby identifying the unstable data partitions in the system.
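The grouping procedure can be sketched as follows. The similarity measure is passed in as a function so the sketch stays independent of how similarity is computed. Several points are assumptions: partitions are processed in input order, ties go to SG, and a partition joins whichever group it is more similar to. The toy similarity and all cloud values except the two Ex values 3.233 and 17.167 are invented for the example.

```python
def group_partitions(clouds, similarity):
    """Split partitions into a stable group (SG) and an unstable group (UG).

    clouds:     dict mapping partition name -> (Ex, En, He)
    similarity: function taking two (Ex, En, He) tuples, returning a float
    """
    if len(clouds) < 2:
        raise ValueError("need at least two partitions")
    stable_seed = min(clouds, key=lambda name: clouds[name][0])
    unstable_seed = max(clouds, key=lambda name: clouds[name][0])
    if stable_seed == unstable_seed:
        # All Ex values are equal: nothing stands out as unstable.
        return list(clouds), []
    sg, ug = [stable_seed], [unstable_seed]
    for name, model in clouds.items():
        if name in (stable_seed, unstable_seed):
            continue
        # Similarity to a group = best match among its current members.
        sim_sg = max(similarity(model, clouds[m]) for m in sg)
        sim_ug = max(similarity(model, clouds[m]) for m in ug)
        (ug if sim_ug > sim_sg else sg).append(name)
    return sg, ug


def toy_similarity(c1, c2):
    """Placeholder measure based on Ex distance only (not ECM)."""
    return 1.0 / (1.0 + abs(c1[0] - c2[0]))


clouds = {
    "a": (3.233, 1.0, 0.1),
    "b": (4.0, 1.0, 0.1),
    "c": (15.0, 2.0, 0.2),
    "d": (17.167, 2.0, 0.2),
}
sg, ug = group_partitions(clouds, toy_similarity)
```

Because each group grows as partitions are assigned, the result can depend on processing order; a threshold-based split is another possible reading of the text.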
For a given data partition
Figure BDA0002022019780000074
its similarity to UG or SG is taken as the maximum of the similarities between its cloud model and the cloud models of the data partitions already in the group; this maximum is the similarity of data partition
Figure BDA0002022019780000075
to the group. In this embodiment, the similarity of two cloud models $DPC_1(Ex_1, En_1, He_1)$ and $DPC_2(Ex_2, En_2, He_2)$ is calculated with the Expectation-curve-based Cloud Model similarity (ECM) method. Specifically, the similarity of the two cloud models is obtained from the area $S$ of the overlapping region where their expectation curves intersect. According to the 3En rule of the normal cloud, only the cloud drops distributed in the interval $[Ex - 3En, Ex + 3En]$ are considered. Once the overlapping area $S$ has been obtained, the cloud model similarity ECM is calculated as follows:
Figure BDA0002022019780000076
where
Figure BDA0002022019780000077
denotes the area between the expectation curve of the normal cloud model $DPC_1$ and the abscissa,
Figure BDA0002022019780000078
denotes the area between the expectation curve of the normal cloud model $DPC_2$ and the abscissa, and $S$ denotes the area of the overlapping region where the expectation curves of the two normal cloud models intersect.
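Since the ECM formula image is not reproduced here, the sketch below assumes the form usually given for ECM, $2S / (S_1 + S_2)$, with the expectation curve $y = \exp(-(x - Ex)^2 / (2 En^2))$ truncated to $[Ex - 3En, Ex + 3En]$. The areas are approximated numerically with a midpoint rule; the function names and step count are illustrative choices.

```python
import math


def truncated_curve(x, ex, en):
    """Expectation curve of a normal cloud, zero outside [Ex-3En, Ex+3En]."""
    if abs(x - ex) > 3 * en:
        return 0.0
    return math.exp(-((x - ex) ** 2) / (2 * en ** 2))


def integrate(f, lo, hi, steps):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


def ecm_similarity(c1, c2, steps=4000):
    """Assumed ECM form: 2 * S / (S1 + S2), S being the overlap area.

    He is ignored because the expectation curve depends on Ex and En only.
    """
    (ex1, en1, _), (ex2, en2, _) = c1, c2
    if en1 <= 0 or en2 <= 0:
        raise ValueError("En must be positive")
    lo = min(ex1 - 3 * en1, ex2 - 3 * en2)
    hi = max(ex1 + 3 * en1, ex2 + 3 * en2)
    s1 = integrate(lambda x: truncated_curve(x, ex1, en1), lo, hi, steps)
    s2 = integrate(lambda x: truncated_curve(x, ex2, en2), lo, hi, steps)
    overlap = integrate(
        lambda x: min(truncated_curve(x, ex1, en1),
                      truncated_curve(x, ex2, en2)),
        lo, hi, steps,
    )
    return 2 * overlap / (s1 + s2)


same = ecm_similarity((5.0, 1.0, 0.1), (5.0, 1.0, 0.1))
far = ecm_similarity((0.0, 1.0, 0.1), (100.0, 1.0, 0.1))
partial = ecm_similarity((0.0, 1.0, 0.1), (1.0, 1.0, 0.1))
```

Under this form identical clouds score 1 and clouds whose 3En intervals do not overlap score 0, so the value can be compared directly across groups.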
In this embodiment, the data partitions in UG are the identified unstable data partitions. The time complexity of the algorithm is $O(m \cdot e \cdot \sigma)$, where $m$ is the number of data partitions in the system, $e$ is the number of data sets $|D_j|$ in a data partition, and $\sigma$ is the number of copies $|S_j|$ of a data partition; the number of copies is typically 3 [16].
Taking 3 nodes, 15 data sets and 5 data partitions as an example, the initial distribution is shown in table 2 below:
TABLE 2 initial data distribution
Figure BDA0002022019780000081
The bandwidth matrix between nodes is as follows:
Figure BDA0002022019780000082
As shown in Table 2, $dp_2$ and its copy are stored on $s_2$ and $s_3$, so 2 cloud models are generated for it. Table 2 contains 7 data partitions and copies in total, so 7 cloud models are obtained, as shown in Table 3:
TABLE 3 data distribution cloud model
Figure BDA0002022019780000083
As can be seen from Table 3, among the seven cloud models,
Figure BDA0002022019780000084
has the largest Ex value (Ex = 17.167) and is taken as the initial data partition of UG, while
Figure BDA0002022019780000091
has the smallest Ex value (Ex = 3.233) and is taken as the initial data partition of SG. The similarity of each remaining data partition to SG and UG is then calculated; the final identification result in this embodiment is
Figure BDA0002022019780000092
Figure BDA0002022019780000093
Further, for each unstable data partition in the identification result UG, it is first determined whether the operation to perform is replication or migration, using the local access to the data sets in the unstable data partition as the basis for the decision. When the local access amount is greater than or equal to the set threshold, the partition stays on the current node and a replication operation is performed, which improves task parallelism; when the local access amount is smaller than the set threshold, a migration operation is performed and the partition is moved to another, lower-cost node. It should be noted that the threshold differs from one cloud data management system to another, and the present invention does not limit its specific value; in this embodiment only the relative magnitude of the local access amount is compared, and whether to replicate or migrate is decided from the result of that comparison. Distinguishing replication from migration takes the local access to the data partition into account, so that unstable data can be laid out more flexibly.
According to the current task queue, the access information of the data sets of each data partition in UG
Figure BDA0002022019780000094
is collected. When local data is used, the bandwidth factor can be ignored, so the local access amount is calculated from the data size and the number of local accesses. The local access amount $V(d_k, dp_j, s_i)$ of data set $d_k$ in data partition $dp_j$ stored on node $s_i$ is:
Figure BDA0002022019780000095
The local access amount of data partition
Figure BDA0002022019780000096
is then:
Figure BDA0002022019780000097
The local access amount of UG equals the sum of the local access amounts of all unstable data partitions, and is calculated as follows:
Figure BDA0002022019780000098
The number of data partitions in UG is denoted $m_{UG}$, and the average access amount of the unstable data partitions is:
$V(UG)_{avg} = V(UG) / m_{UG}$
For a data partition in UG
Figure BDA0002022019780000099
when
Figure BDA00020220197800000910
holds,
Figure BDA00020220197800000911
executes the migration operation; otherwise, it executes the copy operation.
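The choice between replication and migration can be sketched as below. The formula images are not reproduced in this text, so two things are assumptions: the local access amount of a data set is taken as size multiplied by the number of local accesses, and a partition is migrated when its local access amount is below $V(UG)_{avg}$. The function names, partition labels and numbers are hypothetical.

```python
def partition_volume(datasets):
    """Local access amount of one partition.

    datasets: list of (size_mb, local_access_count) pairs.
    Assumed per-data-set volume: size * number of local accesses.
    """
    return sum(size * count for size, count in datasets)


def plan_operations(unstable_group):
    """Map each unstable partition to 'replicate' or 'migrate'."""
    if not unstable_group:
        return {}
    volumes = {name: partition_volume(ds) for name, ds in unstable_group.items()}
    # V(UG)avg = V(UG) / m_UG
    average = sum(volumes.values()) / len(volumes)
    # Assumption: below-average local access means the partition is rarely
    # used where it sits, so it is migrated; otherwise it is replicated.
    return {
        name: ("migrate" if volume < average else "replicate")
        for name, volume in volumes.items()
    }


# Hypothetical unstable group: (size in MB, local accesses) per data set.
ug = {
    "dp1|s1": [(60.0, 10), (50.0, 4)],   # volume 800
    "dp3|s2": [(70.0, 1), (40.0, 2)],    # volume 150
}
plan = plan_operations(ug)
```

Using the group average as the threshold matches the remark that only the relative magnitude of the local access amount is compared, with no fixed numeric threshold.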
For each unstable data partition in UG, once the operation it requires (replication or migration) has been determined, the data partition is placed onto a suitable storage node using an existing data layout algorithm.
Experimental verification:
Further, a simulation system is built to evaluate the efficiency and performance of the dynamic layout strategy based on unstable data partition identification. Two groups of experiments are performed in this embodiment. The first group uses the random algorithm (Random) as the data layout algorithm. Starting from the same initial data allocation, it compares the data transmission overhead saved by random layout after unstable data partitions have been identified with the method of the present invention (the CM-UPI algorithm), written CM-UPI + Random, against the overhead saved by laying out the data partitions directly with the random algorithm. It should be noted that the method itself generates new overhead: if the transmission overhead saved by CM-UPI + Random does not exceed the overhead introduced by the layout, identifying unstable partitions does not pay off. For this reason, the experiments also record the transmission overhead generated during layout for each method: CM-UPI + Random-Layout (and, in the second group of experiments, CM-UPI + K-means-Layout). In the drawings of this embodiment, Random is the data transmission overhead saved relative to the initial data allocation after random data layout; Random-Layout is the transmission overhead generated during random data layout; CM-UPI + Random is the data transmission overhead saved relative to the initial data allocation by random layout after unstable data partitions have been identified with the CM-UPI algorithm; and CM-UPI + Random-Layout is the transmission overhead generated during that random layout.
The second group of experiments uses the K-means algorithm as the layout algorithm. 3000 data sets are stored in the experiments, and the identification effect of the CM-UPI algorithm is verified by dynamically adjusting the number of tasks and the number of nodes.
In this embodiment, the physical platform for the simulation experiments is a PC with an Intel Core i5-6200U CPU and 4 GB of memory. The simulation parameters for nodes, tasks, and data sets are as follows: the bandwidth between nodes is 100-300 MB/s; the number of data sets required by a task follows a normal distribution with a mean of 70 and a variance of 15; and the size of a data set follows a normal distribution with a mean of 60 MB and a variance of 20.
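A workload with these parameters could be generated along the following lines. This is only a sketch under stated assumptions: the text gives a bandwidth range but no distribution, so a uniform draw per node pair is assumed; the function names, the seed, and the clamping of out-of-range samples are hypothetical; the default of 3000 data sets is the figure quoted for the experiments.

```python
import math
import random


def sample_workload(num_tasks, num_nodes, num_datasets=3000, seed=0):
    """Draw one synthetic workload from the stated parameter ranges."""
    rng = random.Random(seed)
    # Assumption: bandwidth is uniform on [100, 300] MB/s per unordered node pair.
    bandwidth = {
        (i, j): rng.uniform(100.0, 300.0)
        for i in range(num_nodes)
        for j in range(i + 1, num_nodes)
    }
    # random.gauss expects a standard deviation, so take the square root
    # of the stated variances (15 and 20).
    datasets_per_task = [
        max(1, round(rng.gauss(70, math.sqrt(15)))) for _ in range(num_tasks)
    ]
    dataset_size_mb = [
        max(0.0, rng.gauss(60, math.sqrt(20))) for _ in range(num_datasets)
    ]
    return bandwidth, datasets_per_task, dataset_size_mb


def transfer_time_s(size_mb, bandwidth_mb_s):
    """Time in seconds to move one data set over one link, ignoring latency."""
    return size_mb / bandwidth_mb_s
```

For example, `sample_workload(400, 10)` corresponds to the smallest task count used with 10 nodes, and `transfer_time_s` gives the per-transfer time that the transmission-time metric would accumulate.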
Specifically, the first group of experiments sets 10 nodes and varies the number of tasks from 400 to 2000. As the number of tasks increases, the corresponding number of data transmissions is shown in Fig. 3, the transmission amount in Fig. 4, and the transmission time in Fig. 5.
As can be seen from Figs. 3-5, as the number of tasks increases, the reductions in the number of transmissions, transmission amount, and transmission time achieved by the CM-UPI + Random algorithm keep growing and far exceed its layout overhead. The data transmission overhead reduced by CM-UPI + Random is also far greater than that reduced by laying out the data partitions directly with Random; direct Random layout not only incurs a higher layout overhead but also has no optimizing effect on data transmission. In addition, the layout overheads of the two algorithms do not grow noticeably as the number of tasks increases, and the layout overhead of Random far exceeds that of CM-UPI + Random.
In these experiments, the demand for data transmission grows as the number of tasks increases. The Random algorithm lays out data by randomly generating a storage node for each data partition. When Random alone is used to lay out the current data partitions, any data partition in the system may be moved, which generates a high layout overhead; at the same time, data partitions whose storage nodes were already good may be placed on nodes that require frequent transmission, so data transmission is not optimized. The data partitions identified by the CM-UPI algorithm, by contrast, are those stored on nodes that require frequent transmission, so even a random layout tends to place them on nodes with a lower transmission cost. Moreover, CM-UPI + Random lays out only the unstable data partitions, so its layout overhead is low.
Further, the number of tasks is fixed at 2000 and the number of nodes is varied from 2 to 10. As the number of nodes increases, the corresponding number of data transmissions is shown in Fig. 6, the transmission amount in Fig. 7, and the transmission time in Fig. 8.
As can be seen from Figs. 6-8, as the number of nodes increases, the layout overhead of the Random algorithm keeps increasing, whereas the layout overhead of the CM-UPI + Random algorithm remains stable and is smaller than that of the Random algorithm. At the same time, the reductions in the number of transmissions, transmission amount, and transmission time achieved by CM-UPI + Random far exceed its layout overhead and are better than those of Random. Laying out the data partitions with Random alone incurs a high layout overhead and has no optimizing effect on data transmission.
The behavior in this experiment is similar to that observed when the number of tasks increases. The difference is that, as the number of nodes increases, the reductions in the number of transmissions, transmission amount, and transmission time achieved by the CM-UPI + Random algorithm first increase and then decrease. This is because, when the number of nodes is small, the Random algorithm has few nodes to choose from and the probability that a data partition is placed on a better node is high; as nodes are added, that probability gradually falls, so the optimization effect weakens. Even so, the reductions achieved by CM-UPI + Random far exceed its layout overhead, and its optimization effect far exceeds that of the Random algorithm, which again verifies the effectiveness of the CM-UPI algorithm.
The second group of experiments is as follows: 10 nodes are set and the number of tasks is varied from 400 to 2000. As the number of tasks increases, the corresponding number of data transmissions is shown in Fig. 9, the transmission amount in Fig. 10, and the transmission time in Fig. 11.
As can be seen from Figs. 9-11, as the number of tasks increases, the reductions in the number of transmissions, transmission amount, and transmission time achieved by the CM-UPI + K-means algorithm increase continuously and far exceed its layout overhead. Meanwhile, the layout overheads of both algorithms tend to stabilize as the number of tasks increases, and the layout overhead of K-means is about twice that of CM-UPI + K-means.
In these experiments, the demand for data transmission grows as the number of tasks increases. The K-means algorithm uses the number of transmissions as its criterion: it clusters data partitions that support similar tasks onto the same node and schedules the tasks with the greatest support degree onto that node, which effectively reduces the number of data transmissions and, correspondingly, the transmission amount and transmission time. When K-means alone is used for layout, every data partition is clustered onto the node on which it has the greatest dependence degree, so data partitions whose storage nodes were already good are also migrated, and for some data partitions the layout overhead is larger than the overhead saved. The data partitions identified by the CM-UPI algorithm are those stored on nodes that require frequent transmission; applying the K-means layout only to them places them on nodes with low overhead, and the resulting layout overhead is smaller than that of the plain K-means layout. In addition, the CM-UPI algorithm replicates some of the unstable data partitions, which improves data reuse and gives the algorithm a better optimization effect.
Further, the number of tasks is fixed at 2000 and the number of nodes is varied from 2 to 10. As the number of nodes increases, the corresponding number of data transmissions is shown in Fig. 12, the transmission amount in Fig. 13, and the transmission time in Fig. 14.
As can be seen from Figs. 12-14, similarly to the case of an increasing number of tasks, the reductions in the number of transmissions, transmission amount, and transmission time achieved by the CM-UPI + K-means algorithm increase continuously as the number of nodes increases and far exceed its layout overhead. Its optimization effect is better than that of K-means, while its layout overhead becomes progressively smaller than that of K-means. Meanwhile, as the number of nodes increases, the layout overhead of the K-means algorithm keeps increasing, whereas the layout overhead of the CM-UPI + K-means algorithm remains stable and is smaller than that of the K-means algorithm.
Therefore, identifying the unstable data partitions with the method of the present invention (the CM-UPI algorithm) and laying them out again effectively reduces the data transmission overhead, which verifies the effectiveness of the unstable data partition identification algorithm.
Example 2
In correspondence with the foregoing method embodiments, the present embodiment provides an unstable partition identification system for an unstructured cloud data management system, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the foregoing method when executing the computer program.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A dynamic layout method of an unstructured cloud data management system is characterized by comprising the following steps:
S1: determining a data set required by a task submitted by a client, and partitioning the data set by using a metadata manager in a Master node to obtain a first data partition set;
S2: establishing a cloud model of each data partition in the first data partition set, and dividing all data partitions in the first data partition set into a stable data partition group and an unstable data partition group according to the cloud models;
S3: placing each data partition in the unstable data partition group onto a Slave storage node by using a data layout algorithm.
2. The dynamic layout method of the unstructured cloud data management system according to claim 1, wherein the S3 specifically includes the following steps:
S31: assuming that data partition dp_j is stored on node s_i, taking the
Figure FDA0002944565950000011
data sets, together with the time transmission cost Cost(D_j[x], dp_j, s_i) corresponding to each data set, to form one cloud drop of the cloud model of dp_j;
S32: inputting the cloud drops into a backward cloud generator to construct the cloud model of the data partition, wherein the cloud model is calculated as follows:
Figure FDA0002944565950000012
where Ex denotes the expectation, En denotes the entropy, He denotes the hyper-entropy, N denotes the number of cloud drops in the cloud model, and u denotes each cloud drop in the cloud model;
S33: repeating steps S31-S32 to calculate the cloud models of all data partitions in the first data partition set, selecting the data partition whose cloud model has the minimum expectation as the stable data partition, and selecting the data partition whose cloud model has the maximum expectation as the unstable data partition;
S34: calculating the cloud model similarity between the data partition corresponding to each cloud model and, respectively, the stable data partition and the unstable data partition, and dividing the data partitions according to a similarity threshold to obtain the stable data partition group and the unstable data partition group.
3. The dynamic layout method of the unstructured cloud data management system according to claim 2, wherein the calculation formula of the cloud model similarity is as follows:
Figure FDA0002944565950000013
where
Figure FDA0002944565950000014
denotes the area enclosed between the expectation curve of the normal cloud model DPC1 and the abscissa,
Figure FDA0002944565950000015
denotes the area enclosed between the expectation curve of the normal cloud model DPC2 and the abscissa, and S denotes the area of the overlapping part where the expectation curves of the two normal cloud models intersect.
4. The dynamic layout method of the unstructured cloud data management system according to claim 1, wherein the Master node is provided with a query analyzer and a data partition manager, and the Master node analyzes the data processing request of the task through the query analyzer to obtain a data set required by the task; and the Master node calls the data layout algorithm through the data partition manager to realize dynamic data partition management.
5. The dynamic layout method of the unstructured cloud data management system according to claim 1, characterized in that the method further comprises the step of: providing a monitor for each node; wherein in S2, the relevant information of each data partition is collected in real time by the monitor.
6. The dynamic layout method of the unstructured cloud data management system according to claim 1, wherein in S3, placing each data partition in the unstable data partition group onto a Slave storage node by using a data layout algorithm specifically includes:
taking the local access amount of the data set in the unstable data partition group as the basis for deciding whether to perform replication or migration; when the local access amount of the data set in the unstable data partition group is greater than or equal to a set threshold, keeping the unstable data partition stored on the current node and performing a replication operation; and when the local access amount of the data set in the unstable data partition group is smaller than the set threshold, performing a migration operation that moves the unstable data partition to another, lower-cost Slave node for storage.
7. The dynamic layout method of the unstructured cloud data management system according to claim 1, wherein in S3, when each data partition in the unstable data partition group is placed onto a Slave storage node, the method further includes the step of: making at least two copies of the data partition, and placing each copy onto a different Slave storage node.
8. The dynamic layout method of the unstructured cloud data management system of claim 1, wherein the data layout algorithm is a Random algorithm or a K-Means algorithm.
9. The dynamic layout method of the unstructured cloud data management system according to claim 1 or 5, wherein the relevant information of each data partition comprises the amount of data moved for the data partition and the number of times the data partition is requested.
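The cloud-model steps recited in claims 2 and 3 can be illustrated with a short sketch. The formulas in the claims are available only as figure placeholders, so everything below is an assumption: the estimator is the commonly used backward cloud generator without certainty degrees, each cloud drop is treated as a single transmission-cost value, and the similarity is taken to be 2S/(S1+S2), one common way of combining the three areas that claim 3 names. All function names are hypothetical.

```python
import math


def backward_cloud(drops):
    """Estimate (Ex, En, He) from scalar cloud drops (assumed standard estimator)."""
    n = len(drops)
    if n < 2:
        raise ValueError("need at least two cloud drops")
    ex = sum(drops) / n
    en = math.sqrt(math.pi / 2) * sum(abs(u - ex) for u in drops) / n
    var = sum((u - ex) ** 2 for u in drops) / (n - 1)
    # Sampling noise can make var - en^2 slightly negative, so clamp at zero.
    he = math.sqrt(max(var - en ** 2, 0.0))
    return ex, en, he


def expectation_curve(x, ex, en):
    """Expectation curve of a normal cloud; requires en > 0."""
    return math.exp(-((x - ex) ** 2) / (2 * en ** 2))


def similarity(c1, c2, steps=20000):
    """Assumed form 2S/(S1+S2), with S the overlap area of the two curves."""
    (ex1, en1, _), (ex2, en2, _) = c1, c2
    lo = min(ex1 - 6 * en1, ex2 - 6 * en2)
    hi = max(ex1 + 6 * en1, ex2 + 6 * en2)
    h = (hi - lo) / steps
    # Midpoint rule over the pointwise minimum of the two curves.
    overlap = h * sum(
        min(expectation_curve(lo + (k + 0.5) * h, ex1, en1),
            expectation_curve(lo + (k + 0.5) * h, ex2, en2))
        for k in range(steps)
    )
    s1 = math.sqrt(2 * math.pi) * en1  # closed-form area under curve 1
    s2 = math.sqrt(2 * math.pi) * en2  # closed-form area under curve 2
    return 2 * overlap / (s1 + s2)


def pick_references(clouds):
    """Step S33: minimum-Ex partition is stable, maximum-Ex is unstable."""
    stable = min(clouds, key=lambda name: clouds[name][0])
    unstable = max(clouds, key=lambda name: clouds[name][0])
    return stable, unstable
```

With the two reference partitions chosen, step S34 would compare every other partition's cloud model against both references using `similarity` and split on a threshold; the threshold value and the tie-breaking rule are not given in the text and are therefore left out of the sketch.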
CN201910282178.2A 2019-04-09 2019-04-09 Dynamic layout method of unstructured cloud data management system Active CN110166279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910282178.2A CN110166279B (en) 2019-04-09 2019-04-09 Dynamic layout method of unstructured cloud data management system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910282178.2A CN110166279B (en) 2019-04-09 2019-04-09 Dynamic layout method of unstructured cloud data management system

Publications (2)

Publication Number Publication Date
CN110166279A CN110166279A (en) 2019-08-23
CN110166279B true CN110166279B (en) 2021-05-18

Family

ID=67639339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910282178.2A Active CN110166279B (en) 2019-04-09 2019-04-09 Dynamic layout method of unstructured cloud data management system

Country Status (1)

Country Link
CN (1) CN110166279B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant