CN111949652A - Data fingerprint detection method and device and storage medium - Google Patents

Data fingerprint detection method and device and storage medium

Info

Publication number
CN111949652A
CN111949652A CN202010576124.XA CN202010576124A CN111949652A CN 111949652 A CN111949652 A CN 111949652A CN 202010576124 A CN202010576124 A CN 202010576124A CN 111949652 A CN111949652 A CN 111949652A
Authority
CN
China
Prior art keywords
data
change
data block
factor
entry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010576124.XA
Other languages
Chinese (zh)
Inventor
娄婷
段净化
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Application filed by Lenovo Beijing Ltd
Priority to CN202010576124.XA
Publication of CN111949652A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/244 Grouping and aggregation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a data fingerprint detection method, a data fingerprint detection device, and a storage medium. The method uses a change record table to record history information about how each data block changes within a period of time. When a certain condition is met, the change record table is subjected to wear processing according to a preset wear factor. The method then detects whether the change record table after wear processing contains a mergeable table entry that corresponds to a preset aggregation section factor and meets an aggregation condition and, if so, merges at least two data blocks corresponding to the mergeable table entry into one data block. In this way, data blocks that do not change frequently can be dynamically merged according to information such as the number of times or the frequency with which each data block changes. This reduces the number of data blocks and, correspondingly, the number of data fingerprints to be processed, which further reduces resource consumption and greatly improves the processing capacity and throughput of the system.

Description

Data fingerprint detection method and device and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for detecting a data fingerprint, and a storage medium.
Background
With the continuous development of data processing technology and network transmission capability, the storage and transmission of large files, even very large files, is becoming more and more common. For large files, and especially for very large files, it is very important to perform differential backup and transmission by comparing data fingerprints. Doing so not only greatly reduces the required transmission bandwidth but also improves the processing capacity of the file storage system.
In existing schemes, regardless of file size, large files are usually chunked using a fixed size, and a data fingerprint is established for each data chunk.
The inventors have found that for some large files, although the file is large, the frequently changing parts are concentrated. At this time, if a large file is still partitioned into blocks with a fixed size, every time backup or transmission is performed, even if most of the data blocks are not changed, the data fingerprints of the data blocks still need to be compared one by one, which not only takes a long time, but also consumes a large amount of computer resources.
Therefore, how to improve the blocking strategy for large files and the processing efficiency of data fingerprints is a technical problem to be solved.
Disclosure of Invention
In view of the above problems, the present inventors have creatively conceived the following: in such a case, if the frequently changing data can be accurately located and the data blocks that do not change frequently are merged, the number of blocks can be greatly reduced, which shortens the time needed to compare data fingerprints and saves the resources that the comparison requires.
Based on the above inventive concept, the present inventors provide a data fingerprint detection method, apparatus, and storage medium.
According to a first aspect of embodiments of the present invention, a data fingerprint detection method includes performing the following operations when a first condition is met: acquiring a first change record table of all data blocks, wherein each table entry of the first change record table records history information about how a data block changes within a period of time, and each data block corresponds to a data fingerprint; performing wear processing on the first change record table according to a preset wear factor to obtain a second change record table; and detecting whether the second change record table contains a mergeable table entry that corresponds to a preset aggregation section factor and meets the aggregation condition and, if so, merging at least two data blocks corresponding to the mergeable table entry into one data block.
According to an embodiment of the present invention, before meeting the first condition, the method further includes: acquiring a file to be differentially processed; partitioning a file to obtain data partitions; and creating a first change recording table, wherein each table entry of the first change recording table corresponds to each data block and is used for recording historical information of each data block which changes within a period of time.
According to an embodiment of the present invention, after creating the first change record table, the method further includes performing the following operations when performing differential processing on the file: acquiring first data fingerprints of all data blocks, wherein the first data fingerprints are the latest data fingerprints corresponding to each data block; acquiring second data fingerprints of all data blocks, wherein the second data fingerprints are the most recently stored data fingerprints corresponding to each data block; and detecting whether the first data fingerprint and the second data fingerprint of each data block are the same and, if they differ, updating the table entry corresponding to that data block in the first change record table.
According to an embodiment of the present invention, updating the table entry corresponding to the corresponding data chunk in the first change record table includes: acquiring all table entries corresponding to the corresponding data blocks; and updating each table entry in all the table entries in sequence.
According to an embodiment of the present invention, the recording, by each table entry of the first change record table, of history information about how a data block changes within a period of time includes: each table entry of the first change record table recording the number of times a data block changes within a period of time. Correspondingly, updating the table entry corresponding to a data block includes: increasing the number of times recorded by the table entry by one.
According to an embodiment of the present invention, the first condition includes reaching a predetermined time.
According to an embodiment of the present invention, the first change recording table includes a change recording table using a bitmap as a storage structure.
According to an embodiment of the present invention, before performing wear processing on the first change record table according to a preset wear factor to obtain the second change record table, the method further includes: determining the wear factor and the aggregation section factor.
According to a second aspect of the embodiments of the present invention, a data fingerprint detection apparatus comprises: a first condition detection module for detecting whether a first condition is met; a change record table acquisition module for acquiring a first change record table of all data blocks; a wear module for performing wear processing on the first change record table according to a preset wear factor to obtain a second change record table; an aggregation section factor detection module for detecting whether the second change record table contains a mergeable table entry that corresponds to a preset aggregation section factor and meets the aggregation condition; and an aggregation module for merging at least two data blocks corresponding to the mergeable table entry into one data block.
According to a third aspect of embodiments of the present invention, there is provided a storage medium having stored thereon program instructions, wherein the program instructions are operable when executed to perform any of the above-mentioned data fingerprint detection methods.
The embodiment of the invention provides a data fingerprint detection method, a device, and a storage medium. The method uses a change record table to record history information about how each data block changes within a period of time. When a certain condition is met, the change record table is subjected to wear processing according to a preset wear factor. The method then detects whether the change record table after wear processing contains a mergeable table entry that corresponds to a preset aggregation section factor and meets an aggregation condition and, if so, merges at least two data blocks corresponding to the mergeable table entry into one data block. In this way, data blocks that do not change frequently can be dynamically merged according to information such as the number of times or the frequency with which each data block changes. This reduces the number of data blocks and, correspondingly, the number of data fingerprints to be processed, which further reduces resource consumption and greatly improves the processing capacity and throughput of the system.
It is to be understood that the teachings of the embodiments of the present invention need not achieve all of the above-described advantages, but rather that certain features may achieve certain technical effects, and that other implementations of the embodiments of the present invention may achieve other advantages not mentioned above.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which the same or corresponding reference numerals indicate the same or corresponding parts:
FIG. 1 is a schematic flowchart of a data fingerprint detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a specific implementation of an application of the data fingerprint detection method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data fingerprint detection device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art where they do not contradict each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
According to a first aspect of the embodiments of the present invention, a data fingerprint detection method, as shown in FIG. 1, includes performing the following operations when a first condition is met: operation 110, acquiring a first change record table of all data blocks, where each entry of the first change record table records history information about how a data block changes within a period of time, and each data block corresponds to a data fingerprint; operation 120, performing wear processing on the first change record table according to a preset wear factor to obtain a second change record table; and operation 130, detecting whether the second change record table contains a mergeable entry that corresponds to the preset aggregation section factor and meets the aggregation condition and, if so, merging at least two data blocks corresponding to the mergeable entry into one data block.
In operation 110, to facilitate differential processing of a large file, such as a differential backup, the large file is typically divided into a plurality of data blocks, and a data fingerprint is then created for each data block. In subsequent differential processing, whether the data of a data block has changed can then be determined by comparing the data fingerprints of that data block.
In order to record history information about the data blocks that change within a period of time, the data fingerprint detection method of the embodiment of the invention uses the first change record table to record the changes of each data block. Each table entry of the first change record table corresponds to one data block and records history information about how that data block changes within a period of time. With this history information, the pattern and trend of each data block's changes can be understood, and data blocks that do not change for a long time or that change infrequently can be dynamically merged. This reduces the number of data blocks and of the data fingerprints corresponding to them, which further improves the processing capacity and throughput of the system.
In the present embodiment, the specific form and data structure of the first change recording table are not limited, and the implementer may select any suitable specific form and data structure according to specific implementation conditions.
The period of time refers to a period in which two or more rounds of differential processing can occur; otherwise, no pattern in the number or frequency of data changes can be obtained, and the beneficial effect of the data fingerprint detection method of the embodiment of the invention is difficult to realize. In theory, the more data accumulated during this period, the easier it is to accurately locate the parts of the file that do not change frequently, and the better the merging effect. However, the period cannot be too long: because data merging is performed only after the period has elapsed, an overly long period means that data blocks that do not change frequently cannot be merged in time, and the benefit of the data fingerprint detection method of the embodiment of the invention cannot be fully realized. Therefore, the period can be set using experimental data that is close to the actual data and then adjusted according to the experimental results to obtain a better practical effect.
The history information of the change mainly refers to the number of times or the frequency with which each data block changes within a period of time.
The data fingerprint refers to a sequence of values that can be used to determine data integrity and identity, for example, a digest value with identifying properties obtained by applying a certain algorithm to the data content. A commonly used algorithm is the message-digest algorithm MD5.
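As a minimal sketch of per-block fingerprinting, the following Python computes one MD5 digest per fixed-size data block. The function name `block_fingerprints` and the tiny block size are illustrative assumptions, not part of the described method.

```python
import hashlib


def block_fingerprints(data: bytes, block_size: int) -> list[str]:
    """Return one MD5 hex digest per fixed-size block of `data`."""
    return [
        hashlib.md5(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


# Two 4-byte blocks; only the second block differs between the versions.
old_prints = block_fingerprints(b"abcdefgh", block_size=4)
new_prints = block_fingerprints(b"abcdEFGH", block_size=4)
changed = [i for i, (a, b) in enumerate(zip(old_prints, new_prints)) if a != b]
```

Comparing the two fingerprint lists position by position identifies the changed block without comparing the block contents themselves.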
In operation 120, the wear factor refers to the value according to which every entry in the first change record table is uniformly subjected to wear processing. Wear processing mainly refers to subtracting the wear factor from the value of each table entry, dividing the value of each table entry by the wear factor, or computing a new value for each table entry with a specific function that takes the wear factor as a parameter. The second change record table is the table composed of the new values obtained after wear processing. Wear processing highlights the differences in the data on the one hand and, on the other hand, makes data blocks with few changes easier to find, which reduces the complexity of subsequent calculation.
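The wear variants described above can be sketched as follows. The function name `apply_wear`, the `mode` parameter, the clamp at zero for subtraction, and the use of integer division are all assumptions made for illustration; the text does not specify them.

```python
from typing import Callable, Optional


def apply_wear(
    table: list[int],
    wear_factor: int,
    mode: str = "subtract",
    func: Optional[Callable[[int, int], int]] = None,
) -> list[int]:
    """Return a new (second) change record table; the input is not modified."""
    if mode == "subtract":
        # Assumption: values are clamped so they never drop below zero.
        return [max(value - wear_factor, 0) for value in table]
    if mode == "divide":
        # Assumption: integer division keeps the entries as whole counts.
        return [value // wear_factor for value in table]
    if mode == "function" and func is not None:
        return [func(value, wear_factor) for value in table]
    raise ValueError(f"unsupported wear mode: {mode}")


first_table = [1, 2, 3, 1]
by_subtraction = apply_wear(first_table, 1)
by_division = apply_wear(first_table, 2, mode="divide")
```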
In operation 130, the aggregation section factor mainly refers to the number of data blocks that are compared and merged at a time when data blocks are merged. For example, if the aggregation section factor is 2, two data blocks are compared each time, and if the two data blocks meet the aggregation condition, they are merged into one data block; if the aggregation section factor is 3, three data blocks are compared each time, and if the three data blocks meet the aggregation condition, they are merged into one data block; and so on. The aggregation section factor is set to a value of 2 or more, since a smaller value would be meaningless. The aggregation condition is the condition that the data blocks grouped by the aggregation section factor must satisfy in order to be merged. Generally, it requires that the change history information recorded for those data blocks be the same or similar and be lower than a certain value, so that adjacent, mergeable data blocks that do not change frequently can be found. The aggregation condition used in a specific implementation can also be determined by the implementation conditions and by the goals or effects that the implementer wants to achieve.
This operation is the substantive operation that realizes data block merging. It merges data blocks according to their change history information, in particular data blocks that have not changed frequently within a period of time, which improves the efficiency of differential processing and reduces the calculation and comparison of data fingerprints. Resource consumption can thus be greatly reduced and the throughput of the system improved.
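One possible reading of the aggregation step is sketched below. It assumes non-overlapping groups of `section_factor` adjacent entries and an aggregation condition of "all values equal and not above `threshold`"; the names and the exact condition are illustrative assumptions, since the text leaves the condition to the implementer.

```python
def mergeable_groups(
    table: list[int], section_factor: int, threshold: int = 0
) -> list[tuple[int, int]]:
    """Return (start, end) index ranges of adjacent entries that may be merged."""
    if section_factor < 2:
        raise ValueError("the aggregation section factor must be 2 or more")
    groups = []
    # Walk the table in non-overlapping windows of section_factor entries.
    for start in range(0, len(table) - section_factor + 1, section_factor):
        window = table[start:start + section_factor]
        # Assumed condition: identical values that are all at or below threshold.
        if len(set(window)) == 1 and window[0] <= threshold:
            groups.append((start, start + section_factor))
    return groups


pairs = mergeable_groups([0, 0, 1, 1, 0, 0], section_factor=2)
triples = mergeable_groups([0, 0, 0, 5], section_factor=3)
```

A looser condition, such as "values differ by at most one", could be substituted in the `if` test without changing the windowing.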
According to an embodiment of the present invention, before meeting the first condition, the method further includes: acquiring a file to be differentially processed; partitioning a file to obtain data partitions; and creating a first change recording table, wherein each table entry of the first change recording table corresponds to each data block and is used for recording historical information of each data block which changes within a period of time.
Wherein, the differential processing refers to only processing newly added and modified data after the last full processing each time, such as differential backup and the like. The files to be differentially processed are typically large files or even very large files. To perform the difference processing, the file needs to be first partitioned. When there is no history information of data change, the file may be divided into data blocks with equal size according to an empirical value, and a change record table is created, and each entry in the change record table is used to record the change condition of each data block in a period of time. The change record table provides historical information of changes of the data blocks in a period of time for the data fingerprint detection method in the embodiment of the invention, and is a data basis on which the data blocks are combined.
According to an embodiment of the present invention, after creating the first change record table, the method further includes performing the following operations when performing differential processing on the file: acquiring first data fingerprints of all data blocks, wherein the first data fingerprints are the latest data fingerprints corresponding to each data block; acquiring second data fingerprints of all data blocks, wherein the second data fingerprints are the most recently stored data fingerprints corresponding to each data block; and detecting whether the first data fingerprint and the second data fingerprint of each data block are the same and, if they differ, updating the table entry corresponding to that data block in the first change record table.
In this embodiment, each time the differential processing is performed on the file, the changed data blocks are recorded at the same time, and the corresponding entries of the changed data blocks in the first change recording table are updated. Thus, history information such as whether each data block changes in a period of time, the number of times of changes, the frequency of changes and the like can be recorded.
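The bookkeeping described here might look like the following sketch, in which a block's entry is incremented whenever its latest fingerprint differs from the stored one. The function name `record_changes` and the placeholder fingerprint strings are hypothetical.

```python
def record_changes(
    change_table: list[int],
    latest_fingerprints: list[str],
    stored_fingerprints: list[str],
) -> list[int]:
    """Increment the entry of every changed block and return the changed indices."""
    changed = []
    for index, (latest, stored) in enumerate(
        zip(latest_fingerprints, stored_fingerprints)
    ):
        if latest != stored:  # a differing fingerprint means the block changed
            change_table[index] += 1
            changed.append(index)
    return changed


table = [1, 1, 1, 1]
changed_blocks = record_changes(table, ["a", "x", "c", "y"], ["a", "b", "c", "d"])
```

The returned indices are also the blocks whose data would need to be transmitted in a differential backup.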
According to an embodiment of the present invention, updating the table entry corresponding to the corresponding data chunk in the first change record table includes: acquiring all table entries corresponding to the corresponding data blocks; and updating each table entry in all the table entries in sequence.
After wear processing, when several data blocks are merged according to the aggregation condition corresponding to the aggregation section factor, the merged data block corresponds to several table entries. In that case, all table entries corresponding to the merged data block need to be updated in sequence.
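A small sketch of this update, assuming a hypothetical mapping `block_entries` from each current data block to the indices of the table entries it covers (the text does not describe how this mapping is stored):

```python
def update_block_entries(
    change_table: list[int], block_entries: dict[int, list[int]], block_id: int
) -> None:
    """Add one to every table entry that belongs to the given (merged) block."""
    for entry_index in block_entries[block_id]:
        change_table[entry_index] += 1


# Block 0 is unmerged; block 1 was merged from the entries at indices 1 and 2.
entries_of_block = {0: [0], 1: [1, 2]}
counts = [0, 0, 0]
update_block_entries(counts, entries_of_block, block_id=1)
```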
According to an embodiment of the present invention, the recording, by each table entry of the first change record table, of history information about how a data block changes within a period of time includes: each table entry of the first change record table recording the number of times a data block changes within a period of time. Correspondingly, updating the table entry corresponding to a data block includes: increasing the number of times recorded by the table entry by one.
In this embodiment, each entry of the first change record table records the number of times a data block changes within a period of time. This way of recording is simple and easy to implement, and it directly reflects the number or frequency of changes of each data block within a period of time as well as the different change frequencies of different parts of the whole file. Data blocks that do not change frequently can therefore be dynamically merged according to their change frequencies, which greatly reduces the number of data blocks and, with it, the calculation, comparison, and other operations of data fingerprint detection.
According to an embodiment of the present invention, the first condition includes reaching a predetermined time.
In the present embodiment, wear processing and the merging of data blocks are performed on a timer. That is, wear processing and data block merging are performed once every preset period. This implementation is simple, and practice has shown that it works well. As described above, whether the preset period is reasonable affects the effect of the data fingerprint detection method of the embodiment of the present invention, so experiments can be run on data similar to the actual data, and a period with a better effect can be determined from the experimental results.
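A timed first condition could be checked as in this sketch. The class name `MergeTimer` and the use of plain numeric timestamps in seconds are assumptions for illustration.

```python
class MergeTimer:
    """Reports when the preset wear-and-merge interval has elapsed."""

    def __init__(self, interval_seconds: float, start: float) -> None:
        self.interval = interval_seconds
        self.last_run = start

    def due(self, now: float) -> bool:
        """Return True, and restart the interval, once enough time has passed."""
        if now - self.last_run >= self.interval:
            self.last_run = now
            return True
        return False


timer = MergeTimer(interval_seconds=10, start=0)
checks = [timer.due(5), timer.due(10), timer.due(15), timer.due(20)]
```

In a running system, `now` would typically come from a monotonic clock such as `time.monotonic()`.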
According to an embodiment of the present invention, the first change recording table includes a change recording table using a bitmap (bitmap) as a storage structure.
In the present embodiment, a bitmap is used as the storage structure of the first change record table. A bitmap uses a bit to mark the value corresponding to an element, and the element itself serves as the key. Using a bitmap can also reduce each unit of the counting array used in counting sort to one element of a bit-level Boolean array. Because data is stored in units of bits, storage space can be greatly saved and the complexity of calculation reduced.
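To make the key/value idea concrete, here is a minimal bitmap sketch in which the key is a block index and the value is a single bit. It records only a changed/unchanged flag; how multi-valued change counts would be packed into a bitmap is not specified in this section, so this is an assumption-laden illustration rather than the table layout of the embodiment.

```python
class Bitmap:
    """Fixed-size bitmap: the key is a block index, the value is one bit."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.bits = bytearray((size + 7) // 8)  # eight flags per byte

    def set(self, index: int) -> None:
        self.bits[index // 8] |= 1 << (index % 8)

    def test(self, index: int) -> bool:
        return bool(self.bits[index // 8] & (1 << (index % 8)))


changed_flags = Bitmap(16)
for block_index in (1, 5, 6):
    changed_flags.set(block_index)
```

Sixteen flags fit in two bytes here, which illustrates the storage saving that bit-level storage provides.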
According to an embodiment of the present invention, before performing wear processing on the first change record table according to a preset wear factor to obtain the second change record table, the method further includes: determining the wear factor and the aggregation section factor.
In the present embodiment, both the wear factor and the aggregation section factor are preset values. To determine suitable values, several candidate values can be specified based on experience, experiments can then be run on data similar to the actual data, and the candidate value with the best result can be selected.
The following describes, with reference to FIG. 2, a specific implementation flow of an application of the data fingerprint detection method according to an embodiment of the present invention. In this application, the method is mainly applied to differential backup of large files. For convenience of description, the file size is assumed to be 64M, although in practice the file may be much larger than 64M. It is also assumed that a full backup of the file was performed before the method is applied, so that only the changed data blocks need to be transmitted when the file is subsequently differentially backed up. The process of performing differential backup and transmitting differential data mainly comprises the following steps:
Step 2010, setting the unit of the minimum data block and creating the corresponding change record table;
Assume that at the start of differential backup the unit of the minimum data block is set to 4M and the file is divided equally to obtain 16 data blocks. A change record table using a bitmap as its storage structure is created for the 16 data blocks, and the initial value of the bit corresponding to each data block is set to 1. At this time, the data state of the change record table is as shown in Table 1:
1(1) 1(2) 1(3) 1(4)
1(5) 1(6) 1(7) 1(8)
1(9) 1(10) 1(11) 1(12)
1(13) 1(14) 1(15) 1(16)
TABLE 1
In Table 1, the value in each cell is the value stored in the table entry of the corresponding data block, and the number in parentheses after the value is the number of the data block. The data state shown in Table 1 is that the change record table entry of every data block, from the 1st to the 16th, has the value 1.
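Step 2010 can be sketched as follows. The function name `create_change_table` is hypothetical, "M" is interpreted as mebibytes as an assumption, and a plain Python list stands in for the bitmap-backed table.

```python
MIB = 1024 * 1024  # assumption: "M" in the example means 2**20 bytes


def create_change_table(
    file_size: int, min_block_size: int, initial_value: int = 1
) -> list[int]:
    """Create one table entry per minimum-size block, all set to initial_value."""
    block_count = -(-file_size // min_block_size)  # ceiling division
    return [initial_value] * block_count


table_1 = create_change_table(64 * MIB, 4 * MIB)
```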
Step 2020, reading the file data block by data block;
Step 2030, reading the data fingerprints of the data blocks;
Step 2040, transmitting differential data according to the comparison result of the data fingerprints, and updating the change record table;
If the comparison result of the data fingerprints shows that the 2nd, 6th, 7th, 10th, and 13th data blocks have changed, 1 is added to the entries of the 2nd, 6th, 7th, 10th, and 13th data blocks, and a change record table with the data state shown in Table 2 is obtained:
1(1) 2(2) 1(3) 1(4)
1(5) 2(6) 2(7) 1(8)
1(9) 2(10) 1(11) 1(12)
2(13) 1(14) 1(15) 1(16)
TABLE 2
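The transition from Table 1 to Table 2 can be replayed with a short sketch. A plain list stands in for the change record table, and the variable names are illustrative.

```python
table = [1] * 16  # Table 1: every entry starts at 1

# Blocks reported as changed by the first differential backup (1-based numbers).
changed_block_numbers = [2, 6, 7, 10, 13]
for block_number in changed_block_numbers:
    table[block_number - 1] += 1

# Arrange the entries four per row, as in the tables of the description.
rows = [table[i:i + 4] for i in range(0, 16, 4)]
```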
Step 2050, judging whether the preset wear and merge time has been reached; if so, continuing to step 2060; otherwise, returning to step 2020 to wait for the next differential processing;
The preset wear and merge time has not been reached when this differential backup is performed, so the process returns to step 2020 and waits for the second differential backup.
Assume that, according to the fingerprint comparison results of steps 2020 to 2040, the 3rd, 7th, 8th, 11th and 14th data blocks are found to have changed during the second differential backup. 1 is then added to the entries of the 3rd, 7th, 8th, 11th and 14th data blocks, giving a change record table with the data state shown in Table 3:
1(1) 2(2) 2(3) 1(4)
1(5) 2(6) 3(7) 1(8)
1(9) 2(10) 2(11) 1(12)
2(13) 2(14) 1(15) 1(16)
TABLE 3
Assuming that the predetermined wear and merge time has now been reached, the process continues to step 2060.
Step 2060: perform wear and aggregation on the change record table and update the segment information. The process then returns to step 2020 and waits for the next differential processing; when it is performed, the file is read using the data blocks specified by the new segment information.
Assume that the wear factor in this application is 1, and that wear processing is applied to the change record table shown in Table 3. In this application, wear processing is performed by subtracting the wear factor from each entry, which gives the change record table shown in Table 4:
0(1) 1(2) 1(3) 0(4)
0(5) 1(6) 2(7) 0(8)
0(9) 1(10) 1(11) 0(12)
1(13) 1(14) 0(15) 0(16)
TABLE 4
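The wear processing that turns Table 3 into Table 4 can be sketched as below. The function name `apply_wear` is hypothetical, and the values of `table3` are copied from Table 3 as printed. The sketch does not clamp results at zero, on the assumption that negative values are allowed, since the aggregation condition is stated as "less than or equal to zero".

```python
def apply_wear(table, wear_factor):
    """Subtract the wear factor from every entry and return the new table."""
    return [count - wear_factor for count in table]

# Entries for data blocks 1..16, copied from Table 3.
table3 = [1, 2, 2, 1,
          1, 2, 3, 1,
          1, 2, 2, 1,
          2, 2, 1, 1]

table4 = apply_wear(table3, wear_factor=1)
```

Returning a new list keeps the first change record table intact, which mirrors the distinction between the first and second change record tables.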
Assume that the aggregation section factor in this application is 2. When merging aggregation sections, the values of the data blocks grouped by the aggregation section factor are read first: (0, 1) for the 1st and 2nd data blocks; (1, 0) for the 3rd and 4th data blocks; (0, 1) for the 5th and 6th data blocks; (2, 0) for the 7th and 8th data blocks; (0, 1) for the 9th and 10th data blocks; (1, 0) for the 11th and 12th data blocks; (1, 1) for the 13th and 14th data blocks; and (0, 0) for the 15th and 16th data blocks.
Assume that the aggregation condition in this application is that the values of all entries in the aggregation section determined by the aggregation section factor are less than or equal to zero. In practical terms, this condition means that, after wear processing, the number of changes recorded over a period of time for every data block in the aggregation section is less than or equal to zero; in other words, the value recorded for every data block in the aggregation section before wear processing is less than or equal to the wear factor. Under this condition, only the 15th and 16th data blocks satisfy the aggregation condition, so the 15th and 16th data blocks can be merged into one data block.
After the data blocks in Table 4 are merged according to the aggregation section factor and the aggregation condition, a change record table with the data state shown in Table 5 is obtained:
0(1) 1(2) 1(3) 0(4)
0(5) 1(6) 2(7) 0(8)
0(9) 1(10) 1(11) 0(12)
1(13) 1(14) 0(15) 0(15)
TABLE 5
The original 15th and 16th data blocks are merged into data block 15, and the total number of data blocks is reduced from 16 to 15. Correspondingly, only 15 data fingerprints need to be calculated, maintained and compared in subsequent differential backups.
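The detection of mergeable entries can be sketched as a scan over the worn table. The function name `find_mergeable_segments` is hypothetical, and the use of aligned, non-overlapping groups (blocks 1 and 2, blocks 3 and 4, and so on) is an assumption that follows the pairing used in the example.

```python
def find_mergeable_segments(table, segment_factor):
    """Return the 1-based block numbers of every aligned group of
    `segment_factor` entries whose values are all <= 0."""
    segments = []
    for start in range(0, len(table) - segment_factor + 1, segment_factor):
        group = table[start:start + segment_factor]
        if all(value <= 0 for value in group):
            segments.append(tuple(range(start + 1, start + segment_factor + 1)))
    return segments

# Entries for data blocks 1..16 after wear processing, copied from Table 4.
table4 = [0, 1, 1, 0,
          0, 1, 2, 0,
          0, 1, 1, 0,
          1, 1, 0, 0]

mergeable = find_mergeable_segments(table4, segment_factor=2)
blocks_after_merge = len(table4) - sum(len(seg) - 1 for seg in mergeable)
```

With the values of Table 4, the only group that meets the condition is blocks 15 and 16, so the block count drops from 16 to 15.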
Thereafter, the process returns to step 2020 and waits for the next differential backup, reads the file using the data blocks specified by the new segment information, and repeats the above process, so that the change record table is continuously and dynamically updated, worn, and used to merge data blocks.
It should be noted that, in the change record table shown in Table 5, the 15th data block is formed by merging the original 15th and 16th data blocks, so it corresponds to two entries. When the change record table is updated because the 15th data block has changed, all entries corresponding to the 15th data block (that is, entry 15 and entry 16) must be obtained and updated in sequence.
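A minimal sketch of this multi-entry update follows. The mapping `block_entries` from a data block number to its table entries, and the function name `update_block`, are hypothetical; the text does not specify how the segment information is stored.

```python
# Segment information after the merge: blocks 1..14 own one entry each,
# while merged block 15 owns entries 15 and 16.
block_entries = {n: [n] for n in range(1, 15)}
block_entries[15] = [15, 16]

def update_block(table, block_entries, block_no):
    """Add 1 to every entry that belongs to the changed data block."""
    for entry_no in block_entries[block_no]:
        table[entry_no - 1] += 1

# Entries copied from Table 5.
table5 = [0, 1, 1, 0,
          0, 1, 2, 0,
          0, 1, 1, 0,
          1, 1, 0, 0]

update_block(table5, block_entries, block_no=15)
```

Keeping both entries in step means the merged block continues to satisfy, or fail, the aggregation condition as a unit in later wear and merge rounds.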
It should be noted that the specific implementation flow of the data fingerprint detection method described above is only an exemplary illustration; an implementer may adopt any suitable implementation, according to the implementation conditions, to carry out the data fingerprint detection method of the embodiments of the present invention.
According to a second aspect of the embodiments of the present invention, there is provided a data fingerprint detection apparatus 30 which, as shown in Fig. 3, includes: a first condition detecting module 301, configured to detect whether a first condition is met; a change record table obtaining module 302, configured to obtain a first change record table of all data blocks; a wear module 303, configured to perform wear processing on the first change record table according to a preset wear factor to obtain a second change record table; an aggregation section factor detection module 304, configured to detect whether the second change record table contains a mergeable entry that corresponds to a preset aggregation section factor and meets an aggregation condition; and an aggregating module 305, configured to merge at least two data blocks corresponding to the mergeable entry into one data block.
According to an embodiment of the present invention, the apparatus 30 further includes: the differential file acquisition module is used for acquiring files to be differentially processed; the blocking module is used for blocking the file to obtain data blocks; and the change record table creating module is used for creating a first change record table, wherein each table entry of the first change record table corresponds to each data block and is used for recording the history information of the change of each data block in a period of time.
According to an embodiment of the present invention, the apparatus 30 further includes: the first data fingerprint acquisition module is used for acquiring first data fingerprints of all the data blocks, wherein the first data fingerprints are the latest data fingerprints corresponding to each data block; the second data fingerprint acquisition module is used for acquiring second data fingerprints of all the data blocks, wherein the second data fingerprints are the last stored data fingerprints corresponding to each data block; the data fingerprint detection module is used for detecting whether the first data fingerprint and the second data fingerprint of each data block are the same or not; and the change record table updating module is used for updating the table entry corresponding to the corresponding data block in the first change record table.
According to an embodiment of the present invention, the change record table updating module includes: a table entry obtaining submodule, configured to obtain all table entries corresponding to the corresponding data block; and a table entry updating submodule, configured to update each of those table entries in sequence.
According to an embodiment of the present invention, each table entry of the first change record table being used to record history information of the changes of each data block over a period of time includes: each table entry of the first change record table being used to record the number of times each data block has changed over a period of time; correspondingly, the change record table updating module is specifically configured to increase the number of times recorded by the table entry by one.
According to an embodiment of the present invention, the first condition detecting module 301 is specifically configured to detect whether a predetermined time has been reached.
According to an embodiment of the present invention, the apparatus 30 further includes a wear factor and aggregate segment factor determining module for determining the wear factor and the aggregate segment factor.
According to a third aspect of embodiments of the present invention, there is provided a storage medium having stored thereon program instructions, wherein the program instructions are operable when executed to perform any of the above-mentioned data fingerprint detection methods.
It should be noted here that the above descriptions of the data fingerprint detection apparatus embodiment and the computer storage medium embodiment are similar to the description of the foregoing method embodiments and have similar beneficial effects, so they are not repeated. For technical details not disclosed in the apparatus and storage medium embodiments of the present invention, refer to the description of the foregoing method embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another device, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage medium, a Read Only Memory (ROM), a magnetic disk, and an optical disk.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage medium, a ROM, a magnetic disk, an optical disk, or the like, which can store the program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of data fingerprint detection, the method comprising, when a first condition is met:
acquiring a first change record table of all data blocks, wherein each table entry of the first change record table records history information of each data block changing within a period of time, and each data block corresponds to a data fingerprint;
performing wear processing on the first change record table according to a preset wear factor to obtain a second change record table;
and detecting whether the second change record table contains a mergeable table entry that corresponds to a preset aggregation section factor and meets an aggregation condition, and if so, merging at least two data blocks corresponding to the mergeable table entry into one data block.
2. The method of claim 1, prior to the meeting of the first condition, the method further comprising:
acquiring a file to be differentially processed;
partitioning the file to obtain data partitions;
and creating the first change record table, wherein each table entry of the first change record table corresponds to a data block and is used for recording history information of the changes of that data block within a period of time.
3. The method of claim 2, after said creating said first change record table, further comprising performing the following operations while differencing said file:
acquiring first data fingerprints of all data blocks, wherein the first data fingerprints are the latest data fingerprints corresponding to each data block;
acquiring second data fingerprints of all data blocks, wherein the second data fingerprints are the data fingerprints stored last time corresponding to each data block;
and detecting whether the first data fingerprint and the second data fingerprint of each data block are the same, and if they are not the same, updating the table entry corresponding to the corresponding data block in the first change record table.
4. The method of claim 3, wherein updating the entry corresponding to the corresponding data chunk in the first change record table comprises:
acquiring all table entries corresponding to the corresponding data blocks;
and updating each table entry in all the table entries in sequence.
5. The method of any of claims 1 to 4, wherein each table entry of the first change record table being used for recording history information of the changes of each data block within a period of time comprises:
each table entry of the first change record table being used for recording the number of times each data block has changed within a period of time;
correspondingly, the updating of the table entry corresponding to the data block comprises:
adding one to the number of times recorded by the table entry.
6. The method of claim 1, the first condition comprising reaching a predetermined time.
7. The method of claim 1, the first change record table comprising a change record table using a bitmap as a storage structure.
8. The method of claim 1, prior to said performing wear processing on the first change record table according to a preset wear factor to obtain a second change record table, the method further comprising:
determining the wear factor and the aggregation section factor.
9. A data fingerprint detection apparatus, the apparatus comprising:
the first condition detection module is used for detecting whether a first condition is met;
the change record table acquisition module is used for acquiring a first change record table of all data blocks;
the wear module is used for performing wear processing on the first change record table according to a preset wear factor to obtain a second change record table;
the aggregation section factor detection module is used for detecting whether the second change record table contains a mergeable table entry that corresponds to a preset aggregation section factor and meets the aggregation condition;
and the aggregation module is used for merging at least two data blocks corresponding to the mergeable table entry into one data block.
10. A storage medium having stored thereon program instructions for performing, when executed, a data fingerprint detection method according to any one of claims 1 to 8.
CN202010576124.XA 2020-06-22 2020-06-22 Data fingerprint detection method and device and storage medium Pending CN111949652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576124.XA CN111949652A (en) 2020-06-22 2020-06-22 Data fingerprint detection method and device and storage medium

Publications (1)

Publication Number Publication Date
CN111949652A true CN111949652A (en) 2020-11-17

Family

ID=73337157

Country Status (1)

Country Link
CN (1) CN111949652A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination