CN111124279B - Storage deduplication processing method and device based on host

Publication number
CN111124279B
Authority
CN
China
Prior art keywords
fingerprint
host
data
heat
memory
Prior art date
Legal status
Active
Application number
CN201911200383.6A
Other languages
Chinese (zh)
Other versions
CN111124279A (en
Inventor
陈东河
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN201911200383.6A
Publication of CN111124279A
Application granted
Publication of CN111124279B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0613 Improving I/O performance in relation to throughput
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a host-based storage deduplication processing method comprising the following steps: adding a host tag to the data block fingerprints on a hard disk, grouping the fingerprints of the same host into a set, and calculating the fingerprint heat and the host set heat for each host set; loading the fingerprints of the host set with the highest heat into memory, and/or sorting all fingerprints by heat from high to low and loading the highest-ranked fingerprints into memory; in response to data being written to the hard disk, calculating the fingerprint of the data and looking it up in memory, so as to deduplicate the data whose fingerprint is found; and in response to the fingerprint not being found in memory, looking it up on the hard disk and updating the fingerprint heat and the host set heat of the writing host. Compared with looking up fingerprint data on the hard disk, the method improves fingerprint lookup efficiency, which in turn improves throughput and storage deduplication efficiency.

Description

Storage deduplication processing method and device based on host
Technical Field
The present invention relates to the field of computers, and in particular to a host-based storage deduplication processing method and apparatus.
Background
Data deduplication is a key data reduction technology in enterprise storage. Deduplication is achieved by storing only one copy of identical data; every other duplicate data block keeps only an address that refers to the single stored block. The data is split into blocks of a specified size, a fingerprint is calculated for each block, and the fingerprints are used to determine whether two blocks hold the same data. For workloads with a large amount of redundant data, deduplication saves a large amount of storage space and thereby reduces an enterprise's storage costs.
The effectiveness of data deduplication can be summarized by two indicators: the deduplication rate and the throughput rate. The higher the deduplication rate, the more pronounced the data reduction effect and the less storage space is occupied. The higher the throughput rate, the more efficient the deduplication processing and the smaller its impact on the latency of the host's business applications. Many studies have shown that the smaller the data slice, the higher the deduplication rate but the lower the throughput rate; conversely, the larger the data slice, the lower the deduplication rate but the higher the throughput rate. Efficiently querying existing data slice indexes, or building new ones, in a fingerprint-based data slice management system is the key to improving throughput. How to optimize the deduplication rate and the throughput rate is the problem addressed by the invention.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a storage deduplication processing method and apparatus based on a host, so as to optimize and improve fingerprint search efficiency and improve enterprise storage deduplication throughput.
Based on the above object, an aspect of the embodiments of the present invention provides a storage deduplication processing method based on a host, including the following steps:
adding a host tag to the data block fingerprints on a hard disk, grouping the fingerprints of the same host into a set, and calculating the fingerprint heat and the host set heat for each host set;
loading the fingerprints of the host set with the highest heat into memory, and/or sorting all fingerprints by heat from high to low and loading the highest-ranked fingerprints into memory;
in response to data being written to the hard disk, calculating the fingerprint of the data and looking up the fingerprint in memory, so as to deduplicate the data whose fingerprint is found;
and in response to the fingerprint not being found in memory, looking up the fingerprint on the hard disk, and updating the fingerprint heat and the host set heat of the writing host.
In some embodiments, adding a host tag to the data block fingerprints in the hard disk and using the fingerprints of the same host as a set, and calculating the fingerprint heat and the host set heat in each host set comprises:
calculating the hash value of the contents of each data block using the SHA-1 hash algorithm to obtain the fingerprint of the data block.
In some embodiments, adding a host tag to the fingerprints of the data blocks in the hard disk and using the fingerprints of the same host as one set, and calculating the heat of the fingerprints and the heat of the host set in each host set further comprises:
calculating and sorting the heat of the fingerprints in each host set using the least recently used (LRU) algorithm, and calculating the host set heat by weighting the heat of the fingerprints of the same host.
In some embodiments, loading the fingerprint in the host with the highest heat of host set into memory and/or ordering all the fingerprints from high to low to load the highest-ordered plurality of fingerprints into memory comprises:
and selecting the number of the loaded fingerprints according to the size of the memory capacity.
In some embodiments, the method further comprises:
three data areas are maintained in the hard disk, including a metadata area, a data area, and a fingerprint area, wherein,
the metadata area comprises a logical address of the data block and a corresponding fingerprint;
the data area comprises non-repeated data blocks left after the deduplication processing;
the fingerprint area includes key-value pairs of a fingerprint and data chunk metadata.
In some embodiments, in response to writing data into the hard disk, computing a fingerprint of the data and searching the fingerprint from the memory to perform deduplication processing on the found fingerprint includes:
and in response to finding the fingerprint in the memory, not writing the data into a data area of the hard disk, and only updating the logical address and the fingerprint in the metadata area.
In some embodiments, said in response to not finding said fingerprint in said memory, finding said fingerprint from said hard disk, and updating said fingerprint heat and host set heat written to a host comprises:
looking up the fingerprint on the hard disk and, in response to the fingerprint being found on the hard disk, not writing the data into the data area of the hard disk and only updating the logical address and the fingerprint in the metadata area.
In some embodiments, said in response to not finding said fingerprint in said memory, finding said fingerprint from said hard disk, and updating said fingerprint heat and host set heat written to a host further comprises:
and searching the fingerprint in the hard disk, responding to the condition that the fingerprint is not found in the hard disk, adding the fingerprint into the fingerprint area, storing data corresponding to the fingerprint into the data area, and updating the metadata area.
In some embodiments, the method further comprises:
querying the heat of the host sets and/or of the highest-ranked fingerprints at preset time intervals, so as to reload fingerprints into memory in response to a change in heat.
Another aspect of the embodiments of the present invention provides a storage deduplication processing apparatus based on a host, including:
at least one processor; and
a memory storing program code executable by the processor, the program code implementing the method of any one of the above when executed by the processor.
The invention has the following beneficial technical effects: when the host-based storage deduplication processing method and device provided by the embodiment of the invention are used for performing deduplication data fingerprint management and searching, a host fingerprint set is introduced, data fingerprints of different hosts are stored in different sets, and the fingerprint of the host set with the highest heat degree is loaded into a memory.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other embodiments from these drawings without creative effort.
FIG. 1 is a flow chart of a host-based storage deduplication processing method according to the present invention;
FIG. 2 is a flow diagram illustrating a multi-service host write store data deduplication process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of associating a data fingerprint with a heat according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a corresponding relationship among a metadata area, a data area, and a fingerprint area according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a hardware structure of a host-based storage deduplication processing apparatus according to the present invention.
Detailed Description
Embodiments of the present invention are described below. However, it is to be understood that the disclosed embodiments are merely examples and that other embodiments may take various and alternative forms. The figures are not necessarily to scale; certain features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention. As one of ordinary skill in the art will appreciate, various features illustrated and described with reference to any one of the figures may be combined with features illustrated in one or more other figures to produce embodiments that are not explicitly illustrated or described. The combination of features shown provides a representative embodiment for a typical application. However, various combinations and modifications of the features consistent with the teachings of the present invention may be desired for certain specific applications or implementations.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
In practice, each host attached to enterprise storage typically runs a single business application, and the reads and writes of a given business application have their own characteristics. The read/write characteristics of each host therefore reflect those of the business it runs, and the deduplication of a host's read/write data exhibits locality: the usage frequencies of fingerprints belonging to the same host are correlated, so if one fingerprint of a host is used frequently, other fingerprints of that host are also likely to be used frequently. Based on this characteristic, the embodiments of the invention use the host as a set tag and store the deduplication fingerprints of the same host together. The fingerprint data under each host tag is sorted by an algorithm and loaded into memory as hot data, instead of being read and searched from the hard disk, which improves fingerprint lookup efficiency and, in turn, deduplication efficiency.
Based on the above object, an aspect of the embodiments of the present invention provides a storage deduplication processing method based on a host, as shown in fig. 1, including the following steps:
step S101: adding a host tag to the data block fingerprints on a hard disk, grouping the fingerprints of the same host into a set, and calculating the fingerprint heat and the host set heat for each host set;
step S102: loading the fingerprints of the host set with the highest heat into memory, and/or sorting all fingerprints by heat from high to low and loading the highest-ranked fingerprints into memory;
step S103: in response to data being written to the hard disk, calculating the fingerprint of the data and looking up the fingerprint in memory, so as to deduplicate the data whose fingerprint is found;
step S104: in response to the fingerprint not being found in memory, looking up the fingerprint on the hard disk, and updating the fingerprint heat and the host set heat of the writing host.
In some embodiments, adding host tags to the data block fingerprints in the hard disk and using the fingerprints of the same host as a set, and calculating the fingerprint heat and the host set heat in each host set comprises: calculating the hash value of the contents of each data block using the SHA-1 hash algorithm to obtain the fingerprint of the data block.
In an embodiment according to the present invention, a typical enterprise storage scenario is shown in FIG. 2: multiple hosts running different services read and write different volumes on the storage system at the same time. In a compression-enabled scenario, the data written by each host passes through several steps (data blocking, fingerprint calculation and lookup, data deduplication, and data writing) before it is finally written to the storage hard disk. The most common way to calculate the fingerprint of a data block is to compute the SHA-1 hash of the block's contents; the resulting value is called the data block fingerprint and serves as the unique identifier used to index the block. Any data blocks with the same fingerprint are considered identical.
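The blocking and fingerprint calculation steps described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the 4096-byte block size and the helper names `split_blocks` and `fingerprint` are assumptions, since the source only states that data is split by a specified size and hashed with SHA-1.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed value; the source only says "a specified size"


def split_blocks(data, block_size=BLOCK_SIZE):
    """Split a write payload into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block):
    """Return the SHA-1 digest of the block contents as a hex string."""
    return hashlib.sha1(block).hexdigest()


# Two identical 4 KiB blocks followed by a different one.
payload = b"A" * 8192 + b"B" * 4096
fingerprints = [fingerprint(b) for b in split_blocks(payload)]

# Blocks with the same contents yield the same fingerprint,
# so only one copy of the first two blocks would need to be stored.
unique_fingerprints = set(fingerprints)
```

Here three blocks produce two distinct fingerprints, which is the condition the deduplication step relies on.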
In some embodiments, adding a host tag to the data block fingerprints on the hard disk, grouping the fingerprints of the same host into a set, and calculating the fingerprint heat and the host set heat for each host set further comprises: calculating and sorting the heat of the fingerprints in each host set using the least recently used (LRU) algorithm, and calculating the host set heat by weighting the heat of the fingerprints of the same host, as shown in FIG. 3. The LRU algorithm evicts data according to its access history; its core idea is that "if data has been accessed recently, it is more likely to be accessed again in the future."
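One way to picture the per-host heat bookkeeping is the sketch below. The source does not give the weighting formula, so the one used here (recency rank multiplied by hit count, summed over the set) is purely an assumption, as are the class and method names.

```python
from collections import OrderedDict


class HostFingerprintSet:
    """Fingerprints of one host, kept in least-to-most recently used order."""

    def __init__(self, host):
        self.host = host
        self._order = OrderedDict()  # fingerprint -> hit count

    def touch(self, fp):
        """Record a use of fp and move it to the most-recently-used end."""
        self._order[fp] = self._order.get(fp, 0) + 1
        self._order.move_to_end(fp)

    def ranked(self):
        """Fingerprints from hottest (most recently used) to coldest."""
        return list(reversed(self._order))

    def set_heat(self):
        """Assumed weighting: recency rank times hit count, summed."""
        return sum(rank * hits
                   for rank, hits in enumerate(self._order.values(), start=1))


host_a = HostFingerprintSet("A")
for fp in ["f1", "f2", "f1", "f3"]:
    host_a.touch(fp)

hottest_first = host_a.ranked()   # most recently used fingerprint comes first
heat_of_set = host_a.set_heat()
```

Any monotonic weighting would serve the same purpose; the point is only that the set heat is derived from the heat of its member fingerprints.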
In some embodiments, loading the fingerprints of the host set with the highest heat into memory, and/or sorting all fingerprints by heat from high to low and loading the highest-ranked fingerprints into memory comprises: selecting the number of fingerprints to load according to the memory capacity. In principle, as many fingerprints as possible should be loaded; for example, all fingerprints in the host set with the highest heat are loaded into memory.
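As a rough sketch of sizing the load to the available memory, the function below truncates a heat-ranked list to what fits in a byte budget. The 20-byte digest size is a property of SHA-1; the per-entry overhead and the function name are hypothetical values chosen for illustration.

```python
SHA1_BYTES = 20       # a SHA-1 digest is 20 bytes long
ENTRY_OVERHEAD = 44   # hypothetical bookkeeping cost per cached entry


def fingerprints_to_load(ranked_fps, memory_budget_bytes):
    """Return as many of the hottest fingerprints as the budget allows."""
    entry_bytes = SHA1_BYTES + ENTRY_OVERHEAD
    capacity = memory_budget_bytes // entry_bytes
    return ranked_fps[:capacity]


# The list is assumed to be sorted hottest first already.
ranked = ["fp%d" % i for i in range(10)]
loaded = fingerprints_to_load(ranked, memory_budget_bytes=256)
```

With a generous budget the whole list is returned, which corresponds to loading the entire hottest host set.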
Because the write I/O of different services has different characteristics, and each host runs a specific business application, the write I/O of each host has its own characteristics. As an illustration, suppose that the fingerprints of all data written by host A, which runs a database service, fall into the class of uppercase letters A, B, C, and so on; the fingerprints of all data written by host B, which runs a backup service, fall into the class of numbers 1, 2, 3, and so on; and the fingerprints of all data written by host C, which runs other services, fall into the class of lowercase letters a, b, c, and so on. Based on this characteristic, in one embodiment according to the present invention, the fingerprints of the same host are stored on the hard disk as a set. When a fingerprint in some host set is looked up and used, all host sets are sorted by heat using the LRU algorithm, and all fingerprints in a host set are loaded into memory together according to that heat (if memory capacity allows). According to the locality principle, once one data fingerprint in a host set has been looked up and used, the other data fingerprints in that set are likely to be looked up and used as well, so loading all data fingerprints of the host set into memory improves fingerprint lookup efficiency. The fingerprints within a host set can also be sorted by heat using the LRU algorithm: fingerprints that have not been hit by a lookup for a long time are removed from memory, and the fingerprint heat is updated dynamically.
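The removal of long-unused fingerprints from memory can be illustrated with a small LRU cache. This is a generic LRU sketch under assumed names (`FingerprintCache`, `load`, `lookup`) and a fixed entry capacity; the source does not describe the in-memory structure.

```python
from collections import OrderedDict


class FingerprintCache:
    """In-memory fingerprint index that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # fingerprint -> block metadata

    def load(self, fp, meta):
        """Insert or refresh an entry, evicting cold entries if over capacity."""
        self._entries[fp] = meta
        self._entries.move_to_end(fp)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used

    def lookup(self, fp):
        """Return the metadata for fp, or None on a miss; a hit refreshes it."""
        meta = self._entries.get(fp)
        if meta is not None:
            self._entries.move_to_end(fp)
        return meta


cache = FingerprintCache(capacity=2)
cache.load("a", 1)
cache.load("b", 2)
cache.lookup("a")    # "a" becomes the most recently used entry
cache.load("c", 3)   # over capacity, so "b" is evicted
```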
In some embodiments, the method further comprises: maintaining three data areas in the hard disk, including a metadata area, a data area and a fingerprint area, as shown in fig. 4, wherein the metadata area includes a logical address of data and corresponding fingerprint information; the data area comprises non-repeated data blocks left after the deduplication processing; the fingerprint area comprises a series of key value pairs of data block fingerprints and data block metadata, wherein the data block metadata comprises addresses, sizes and the like of data blocks in the hard disk, and the fingerprints in the fingerprint area and the data blocks in the data area are in one-to-one correspondence.
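The relationship among the three areas can be modelled as follows. The field and class names are illustrative assumptions; the source only specifies what each area contains, not how it is laid out.

```python
from dataclasses import dataclass, field


@dataclass
class BlockMeta:
    address: int  # offset of the block inside the data area
    size: int     # length of the block in bytes


@dataclass
class DedupStore:
    # metadata area: logical address -> fingerprint
    metadata_area: dict = field(default_factory=dict)
    # fingerprint area: fingerprint -> BlockMeta (one entry per stored block)
    fingerprint_area: dict = field(default_factory=dict)
    # data area: only the unique blocks left after deduplication
    data_area: bytearray = field(default_factory=bytearray)

    def read(self, logical_address):
        """Resolve a logical address through its fingerprint to the block."""
        fp = self.metadata_area[logical_address]
        meta = self.fingerprint_area[fp]
        return bytes(self.data_area[meta.address:meta.address + meta.size])


store = DedupStore()
store.data_area += b"hello"
store.fingerprint_area["fp1"] = BlockMeta(address=0, size=5)
store.metadata_area[0] = "fp1"
store.metadata_area[8] = "fp1"  # a second logical address sharing the block
```

Two logical addresses map to the same fingerprint, and the fingerprint maps to exactly one stored block.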
In some embodiments, in response to writing data into the hard disk, computing a fingerprint of the data and searching the fingerprint from the memory to perform deduplication processing on the found fingerprint includes: and in response to finding the fingerprint in the memory, not writing the data into a data area of the hard disk, and only updating the logical address and the fingerprint in the metadata area.
In some embodiments, said in response to not finding said fingerprint in said memory, finding said fingerprint from said hard disk, and updating said fingerprint heat and host set heat written to a host comprises: looking up the fingerprint on the hard disk and, in response to the fingerprint being found on the hard disk, not writing the data into the data area of the hard disk and only updating the logical address and the fingerprint in the metadata area.
In some embodiments, said in response to not finding said fingerprint in said memory, finding said fingerprint from said hard disk, and updating said fingerprint heat and host set heat written to a host further comprises: and searching the fingerprint in the hard disk, responding to the situation that the fingerprint is not found in the hard disk, adding the fingerprint into the fingerprint area, storing the data corresponding to the fingerprint into the data area, and updating the metadata area.
In one embodiment according to the present invention, host A (whose set has the highest heat and has been loaded into memory) is about to write data with fingerprint B to the storage hard disk. The storage system looks up fingerprint B in the fingerprint data in memory. If fingerprint B is found, the data with fingerprint B is deduplicated, and the data logical address and fingerprint information in the metadata are updated. If fingerprint B is not found in memory, it is looked up in the fingerprint area on the hard disk; if it is found there, the data with fingerprint B is deduplicated, and the fingerprint heat and fingerprint set heat of host A are updated using the LRU algorithm. If fingerprint B is not found in the fingerprint area on the hard disk either, fingerprint B is added to the fingerprint area, the data corresponding to fingerprint B is stored in the data area, the metadata area is updated, the fingerprint heat and fingerprint set heat of host A are updated using the LRU algorithm, and all fingerprints in the fingerprint set of host A are loaded into memory.
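The three outcomes of the write path described above (memory hit, hard disk hit, new block) can be sketched as a single function. Everything here is a simplified assumption: the on-disk areas are plain dictionaries and a list, the heat is a simple counter rather than an LRU ranking, and the name `write_block` and its parameters are hypothetical.

```python
import hashlib


def write_block(host, logical_address, block, memory_fps, store, host_sets, heat):
    """Handle one block write and return which path served it."""
    fp = hashlib.sha1(block).hexdigest()
    if fp in memory_fps:
        outcome = "memory-hit"            # duplicate found in memory
    else:
        if fp in store["fingerprints"]:
            outcome = "disk-hit"          # duplicate found in the fingerprint area
        else:
            # New data: record the fingerprint and store the block once.
            store["fingerprints"][fp] = len(store["data"])
            store["data"].append(block)
            outcome = "new-block"
        # Update the heat of the writing host (simplified to a counter).
        host_sets.setdefault(host, set()).add(fp)
        heat[host] = heat.get(host, 0) + 1
        if outcome == "new-block":
            memory_fps.update(host_sets[host])  # load the host's set into memory
    # In every case the metadata area maps the logical address to the fingerprint.
    store["metadata"][logical_address] = fp
    return outcome


store = {"metadata": {}, "fingerprints": {}, "data": []}
memory_fps, host_sets, heat = set(), {}, {}
block = b"x" * 4096

first = write_block("A", 0, block, memory_fps, store, host_sets, heat)
second = write_block("A", 1, block, memory_fps, store, host_sets, heat)
memory_fps.clear()  # simulate the fingerprint having been evicted from memory
third = write_block("A", 2, block, memory_fps, store, host_sets, heat)
```

Only the first write stores data; the later two update the metadata area alone.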
In some embodiments, the method further comprises: querying the heat of the host sets and/or of the highest-ranked fingerprints at preset time intervals, so as to reload fingerprints into memory in response to a change in heat. Because users continuously add and delete data on the hard disk, the heat of the corresponding fingerprints inevitably changes over time, and the fingerprints in memory may no longer be the hottest. The fingerprint heat and/or host set heat therefore needs to be checked periodically, and the fingerprint data in memory updated in a timely manner.
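A periodic check of this kind might look like the sketch below, which reloads memory only when a different host set has become the hottest. The function name and arguments are assumptions, and the preset interval itself is not specified in the source.

```python
def refresh_memory(loaded_host, set_heat, host_sets, memory_fps):
    """Reload memory if another host set is now hotter; return the loaded host."""
    hottest = max(set_heat, key=set_heat.get)
    if hottest != loaded_host:
        memory_fps.clear()
        memory_fps.update(host_sets[hottest])
    return hottest


host_sets = {"A": {"a1", "a2"}, "B": {"b1"}}
memory_fps = set(host_sets["A"])

# First check: host A is still the hottest, so nothing is reloaded.
loaded = refresh_memory("A", {"A": 5, "B": 2}, host_sets, memory_fps)
unchanged = set(memory_fps)

# Later check: host B has overtaken host A, so its set replaces A's in memory.
loaded = refresh_memory(loaded, {"A": 1, "B": 7}, host_sets, memory_fps)
```

In a running system a scheduler, for example Python's `threading.Timer`, could invoke such a check at the preset interval.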
In some embodiments, fingerprints in the host set loaded into memory may also be dynamically updated according to the LRU algorithm, and fingerprints that have not been looked up for a long time may be removed from memory.
Where technically feasible, the technical features listed above for different embodiments may be combined with each other or changed, added, omitted, etc. to form further embodiments within the scope of the invention.
It can be seen from the foregoing embodiments that, in the storage deduplication processing method based on a host provided in the embodiments of the present invention, when performing deduplication data fingerprint management and lookup, a host fingerprint set is introduced, data fingerprints of different hosts are stored in different sets, and a fingerprint of a host set with the highest heat is loaded into a memory.
In view of the foregoing, another aspect of the embodiments of the present invention provides an embodiment of a host-based storage deduplication processing apparatus.
The storage deduplication processing device based on the host comprises a memory and at least one processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes any one of the methods when executing the computer program.
Fig. 5 is a schematic diagram of a hardware structure of an embodiment of a host-based storage deduplication processing apparatus according to the present invention.
Taking the computer device shown in fig. 5 as an example, the computer device includes a processor 501 and a memory 502, and may further include: an input device 503 and an output device 504.
The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.
The memory 502, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the host-based storage deduplication processing method in the embodiments of the present application. The processor 501 executes various functional applications of the server and data processing, i.e., the host-based storage deduplication processing method of the above-described method embodiment, by running the nonvolatile software program, instructions, and modules stored in the memory 502.
The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to a host-based storage deduplication processing method, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 502 may optionally include memory located remotely from processor 501, which may be connected to local modules over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer device that performs the host-based storage deduplication processing method. The output device 504 may include a display device such as a display screen.
Program instructions/modules corresponding to the one or more host-based storage deduplication processing methods are stored in the memory 502, and when executed by the processor 501, perform the host-based storage deduplication processing methods in any of the above-described method embodiments.
Any embodiment of the computer device executing the host-based storage deduplication processing method may achieve the same or similar effects as any corresponding method embodiment described above.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
In addition, the apparatuses, devices and the like disclosed in the embodiments of the present invention may be various electronic terminal devices, such as a mobile phone, a Personal Digital Assistant (PDA), a tablet computer (PAD), a smart television and the like, or may be a large terminal device, such as a server and the like, and therefore the scope of protection disclosed in the embodiments of the present invention should not be limited to a specific type of apparatus, device. The client disclosed in the embodiment of the present invention may be applied to any one of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of both.
Furthermore, the method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions described herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments of the present invention disclosed above are merely for description and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above-described embodiments are possible examples of implementations and are presented merely for a clear understanding of the principles of the invention. Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A storage deduplication processing method based on a host is characterized by comprising the following steps:
adding host labels to data block fingerprints in a hard disk, taking the fingerprints of the same host as a set, and calculating the fingerprint heat and host set heat in each host set;
loading the fingerprint in the host with the highest heat degree of the host set into a memory;
responding to data to be written into the hard disk, calculating a fingerprint of the data and searching the fingerprint from the memory to perform deduplication processing on the searched fingerprint;
and in response to the fact that the fingerprint is not found in the memory, searching the fingerprint from the hard disk, and updating the fingerprint heat and the host set heat written into the host.
2. The method of claim 1, wherein adding a host tag to fingerprints of data blocks in the hard disk and using fingerprints of the same host as a set, and wherein calculating the heat of fingerprints and the heat of host sets in each host set comprises:
calculating a hash value of the content of each data block by using the SHA-1 hash algorithm to obtain the fingerprint of the data block.
3. The method of claim 2, wherein adding a host tag to the data block fingerprints in the hard disk and using the fingerprints of the same host as a set, and wherein calculating the heat of fingerprints in each host set and the heat of the host set further comprises:
and calculating the heat degree of the fingerprints in the host set by using a least recently used algorithm, sequencing the heat degrees, and calculating the heat degree of the host set by performing a weighting algorithm on the heat degree of the fingerprints of the same host.
4. The method of claim 1, wherein loading the fingerprint of the host with the highest heat in the host set into memory comprises:
and selecting the number of the loaded fingerprints according to the size of the memory capacity.
5. The method of claim 1, further comprising:
three data areas are maintained in the hard disk, including a metadata area, a data area, and a fingerprint area, wherein,
the metadata area comprises a logical address of the data block and a corresponding fingerprint;
the data area comprises non-repeated data blocks left after the deduplication processing;
the fingerprint area includes key-value pairs of a fingerprint and data chunk metadata.
6. The method of claim 5, wherein in response to writing data to the hard disk, computing a fingerprint of the data and searching the fingerprint from the memory for deduplication processing of the searched fingerprint comprises:
and in response to finding the fingerprint in the memory, not writing the data into the data area of the hard disk, and only updating the logical address and the fingerprint in the metadata area.
7. The method of claim 5, wherein the responsive to not finding the fingerprint in the memory, finding the fingerprint from the hard disk, and updating the fingerprint heat and the host set heat for writing to the host comprises:
and searching for the fingerprint in the hard disk and, in response to the fingerprint being found in the hard disk, not writing the data into the data area of the hard disk, and only updating the logical address and the fingerprint in the metadata area.
8. The method of claim 7, wherein the responsive to not finding the fingerprint in the memory, finding the fingerprint from the hard disk, and updating the fingerprint heat and the host set heat for writing to the host further comprises:
and searching the fingerprint in the hard disk, responding to the condition that the fingerprint is not found in the hard disk, adding the fingerprint into the fingerprint area, storing data corresponding to the fingerprint into the data area, and updating the metadata area.
9. The method of claim 1, further comprising:
and inquiring the heat of the host set at preset time intervals so as to reload the fingerprint into the memory in response to the change of the heat.
10. A host-based storage deduplication processing apparatus, comprising:
at least one processor; and
a memory storing program code executable by the processor, the program code implementing the method of any one of claims 1-9 when executed by the processor.
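
The flow recited in claims 1 to 9 can be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class name `DedupStore`, its method names, the dictionaries standing in for the three hard disk areas, and the heat formula (a recency score summed per host) are all hypothetical. The claims only state that the fingerprint heat is calculated with a least recently used algorithm and that the host set heat is obtained by weighting the fingerprint heats of the same host.

```python
import hashlib
from collections import defaultdict


def fingerprint(block: bytes) -> str:
    """Fingerprint of a data block: SHA-1 hash of its content (claim 2)."""
    return hashlib.sha1(block).hexdigest()


class DedupStore:
    """Hypothetical in-process model of the host-based deduplication flow."""

    def __init__(self, cache_capacity: int):
        self.cache_capacity = cache_capacity  # fingerprints that fit in memory (claim 4)
        self.metadata_area = {}               # (host, logical address) -> fingerprint
        self.data_area = {}                   # fingerprint -> unique data block
        self.fingerprint_area = {}            # fingerprint -> data block metadata
        self.host_sets = defaultdict(dict)    # host tag -> {fingerprint: last-access tick}
        self.memory = set()                   # fingerprints currently loaded in memory
        self.tick = 0                         # logical clock used for recency

    def host_heat(self, host: str) -> float:
        # Assumption: each fingerprint contributes 1 / (1 + age), so recently
        # used fingerprints weigh more. The source does not give the weights.
        return sum(1.0 / (1 + self.tick - last)
                   for last in self.host_sets[host].values())

    def reload_memory(self) -> None:
        """Load the most recently used fingerprints of the hottest host set."""
        if not self.host_sets:
            return
        hottest = max(self.host_sets, key=self.host_heat)
        by_recency = sorted(self.host_sets[hottest].items(),
                            key=lambda item: item[1], reverse=True)
        self.memory = {fp for fp, _ in by_recency[:self.cache_capacity]}

    def write(self, host: str, logical_address: int, block: bytes) -> str:
        """Deduplicating write; returns 'memory-hit', 'disk-hit', or 'new'."""
        self.tick += 1
        fp = fingerprint(block)
        if fp in self.memory:
            outcome = "memory-hit"            # duplicate found in memory (claim 6)
        elif fp in self.fingerprint_area:
            outcome = "disk-hit"              # duplicate found on disk (claim 7)
        else:
            # Unseen block: store fingerprint and data (claim 8).
            self.fingerprint_area[fp] = {"length": len(block)}
            self.data_area[fp] = block
            outcome = "new"
        # The metadata area is updated in every case.
        self.metadata_area[(host, logical_address)] = fp
        # Assumption: heat is refreshed on every write, not only on a memory miss.
        self.host_sets[host][fp] = self.tick
        return outcome


if __name__ == "__main__":
    store = DedupStore(cache_capacity=2)
    print(store.write("hostA", 0, b"alpha"))
    print(store.write("hostB", 0, b"beta"))
    store.reload_memory()
    print(store.write("hostA", 1, b"beta"))
```

In this sketch, calling `reload_memory` from a timer would correspond to the periodic heat query of claim 9, and `cache_capacity` plays the role of the memory-dependent fingerprint count of claim 4.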
CN201911200383.6A 2019-11-29 2019-11-29 Storage deduplication processing method and device based on host Active CN111124279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911200383.6A CN111124279B (en) 2019-11-29 2019-11-29 Storage deduplication processing method and device based on host

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911200383.6A CN111124279B (en) 2019-11-29 2019-11-29 Storage deduplication processing method and device based on host

Publications (2)

Publication Number Publication Date
CN111124279A (en) 2020-05-08
CN111124279B (en) 2022-07-26

Family

ID=70497066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911200383.6A Active CN111124279B (en) 2019-11-29 2019-11-29 Storage deduplication processing method and device based on host

Country Status (1)

Country Link
CN (1) CN111124279B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106610790B (en) * 2015-10-26 2020-01-03 华为技术有限公司 Method and device for deleting repeated data
CN105573669A (en) * 2015-12-11 2016-05-11 上海爱数信息技术股份有限公司 IO read speeding cache method and system of storage system
CN108415669A (en) * 2018-03-15 2018-08-17 深信服科技股份有限公司 The data duplicate removal method and device of storage system, computer installation and storage medium

Also Published As

Publication number Publication date
CN111124279A (en) 2020-05-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant