CN104270412A - Three-level caching method based on Hadoop distributed file system - Google Patents

Three-level caching method based on Hadoop distributed file system Download PDF

Info

Publication number
CN104270412A
CN104270412A (application CN201410455411.XA)
Authority
CN
China
Prior art keywords
data
data block
memory
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410455411.XA
Other languages
Chinese (zh)
Inventor
孙知信
谢怡
宫婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201410455411.XA
Publication of CN104270412A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a three-level caching method based on the Hadoop distributed file system. The method comprises, first, task scheduling guided by data-local processing; second, locality-based access to data in local memory; and third, reuse of data already resident in local memory. The method improves the data hit rate, reduces the volume of data transmitted over the network, and raises MapReduce execution efficiency.

Description

A three-level caching method based on the Hadoop distributed file system
Technical field
The present invention relates to the field of data storage, and more particularly to a three-level caching method based on the Hadoop distributed file system.
Background technology
Apache Hadoop (usually referred to simply as Hadoop) is an open-source distributed data processing platform. It consists mainly of the Hadoop Distributed File System (HDFS) and the MapReduce computation module.
Apache Hadoop is an open-source software framework, released under the Apache 2.0 license, that supports data-intensive distributed applications. It supports applications running on large clusters built from commodity hardware. Hadoop is an independent implementation of MapReduce and the Google File System, built according to the papers published by Google.
The Hadoop framework transparently provides applications with both reliability and data motion. It implements the programming paradigm named MapReduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, Hadoop provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Together, MapReduce and the distributed file system are designed so that the framework handles node failures automatically, enabling applications to work with thousands of independent computers and petabytes of data. The whole Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects, among them Apache Hive and Apache HBase.
HDFS meets the storage demands of large-scale data well, but it has many shortcomings when reading data for real-time processing. Executing MapReduce tasks involves a large volume of data reads, which puts heavy pressure on network transmission and I/O (Input/Output) bandwidth. A caching system should therefore be built on top of HDFS to reduce the volume of data transferred and improve the execution efficiency of MapReduce.
MapReduce divides data processing into two stages, Map and Reduce, corresponding to the two processing functions mapper and reducer. In the Map stage, raw data is fed into the mapper for filtering and transformation, and the intermediate results obtained serve as input to the reducer, which produces the final result. Over the whole MapReduce process, reading the raw data from HDFS takes the most time. To improve MapReduce execution efficiency, therefore, one must start with the reading of the raw data: an appropriate caching mechanism raises the data hit rate and shortens the time spent reading raw data in the Map stage.
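For orientation, the sketch below shows a minimal mapper and reducer written against the stock org.apache.hadoop.mapreduce API (a generic word-count style example, not code from the patent); the map() calls are exactly where the raw HDFS reads targeted by this method take place.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each call to map() processes one record read from HDFS.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE); // emit intermediate (key, value) pair
        }
    }
}

// Reduce stage: consumes the intermediate pairs and writes the final result.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```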
Memcached and RAMCloud (a memory cloud) are two typical memory-level caching systems. Memcached provides a relatively centralized, non-cooperative caching service, not transparent to the user, in front of a back end that stores mass data on disk. RAMCloud uses distributed shared memory instead of disk to store and manage data, and proposes scattering backups of cached data across many disks so that repairs can be carried out quickly in parallel.
Both, however, are essentially designed for architectures in which data still resides relatively centrally on disk and compute resources are separated from storage resources, so they are difficult to apply directly to a MapReduce platform. Moreover, because they support different application types, neither takes into account the data-local processing that characterizes MapReduce applications.
The temporal-locality-based cache replacement policy LAC (Locality-Aware Cooperative Caching) quantifies the temporal locality of data access by counting the accesses to other data blocks that fall in the interval between two accesses to the same block. Factors such as a block's transmission cost, its size, and its most recent access time are combined into a cache replacement cost model.
These mechanisms all target conventional data-center platform architectures. On a MapReduce platform, however, compute and storage resources are deployed tightly coupled and data is processed locally, so block-level access statistics are disturbed by the compute resource allocation strategy and by real-time load, and have difficulty fully reflecting true data access behavior.
For the large volume of data that must be read during MapReduce task execution, which puts heavy pressure on network transmission and I/O bandwidth, the prior art offers no good solution.
Summary of the invention
To solve the technical problems described above, the present invention adopts the following technical scheme:
A three-level caching method based on the Hadoop distributed file system, implemented with Apache Hadoop, proceeds as follows:
Step 1: task scheduling guided by data-local processing, comprising the sub-steps:
(1) The user submits a job request to the JobTracker; the JobTracker determines the range of data the job will read and decomposes the job into a number of Map tasks and Reduce tasks;
(2) For the data each Map task will read, the JobTracker queries the NameNode's metadata to obtain the locations of the DataNodes storing that data;
(3) Idle TaskTracker nodes periodically report their status to the JobTracker; from these idle TaskTracker nodes the JobTracker selects a DataNode holding the target data and assigns the corresponding Map task to that node;
Step 2: locality-based access to data in local memory, comprising the sub-steps:
(1) The memory space of the server is divided into a number of equal-sized storage regions, each of which is called a page frame;
(2) The page is the basic unit of memory allocation. Bytes reserved at the bottom of each page hold either a pointer to the address of the next page or a marker that the data block ends there, so each data block is represented in memory as a linked list of pages.
(3) A block load table is maintained in memory. When the data blocks in memory have reached the space limit and a new block must be loaded, a Least Recently Used (LRU) replacement algorithm is executed to replace a block. In addition, a bitmap of the memory pages is maintained in memory.
Step 3: reuse of data in local memory
(1) A global cache information management table is maintained on the Master server, recording each Slave node's cache contents and whether the node has enough slot resources. Each Slave server periodically sends a message to the Master server reporting its own cache contents and slot resources.
(2) When the JobTracker schedules tasks, it first checks the global cache information table; if the cache of some node is found to hold the data a Map task requires, that task is assigned to it preferentially. Otherwise, the data-local scheduling strategy of Step 1 is followed.
The page frame size in sub-step (1) of Step 2 is 64 KB.
The number of reserved bytes in sub-step (2) of Step 2 is 4.
The LRU replacement algorithm in sub-step (3) of Step 2 selects for replacement the data block that has gone unvisited for the longest time in the recent past.
The block load table in sub-step (3) of Step 2 records the block sequence number, the block's starting frame number, and the block's access time.
The bitmap of memory pages maintained in sub-step (3) of Step 2 occupies a fixed region set aside in memory, in which each bit represents one page: the bit is set to 1 if the page has been allocated and cleared to 0 otherwise, so that it can be determined whether free pages exist in memory.
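Purely as an illustration (the patent fixes the bitmap's semantics but gives no code), the page bitmap of sub-step (3) could be sketched in Java with java.util.BitSet; the class and method names below are invented:

```java
import java.util.BitSet;

// Hypothetical sketch of the page bitmap from Step 2, sub-step (3):
// one bit per 64 KB page frame; 1 = allocated, 0 = free.
class PageBitmap {
    private final BitSet bits;
    private final int totalFrames;

    PageBitmap(int totalFrames) {
        this.totalFrames = totalFrames;
        this.bits = new BitSet(totalFrames);
    }

    // Find a free frame, mark it allocated, and return its number (-1 if full).
    int allocateFrame() {
        int frame = bits.nextClearBit(0);
        if (frame >= totalFrames) return -1;
        bits.set(frame);   // set to 1: allocated
        return frame;
    }

    void freeFrame(int frame) {
        bits.clear(frame); // set to 0: free
    }

    int allocatedCount() {
        return bits.cardinality(); // total number of allocated pages
    }
}
```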
The three-level caching mechanism based on HDFS proposed by the invention improves the data hit rate, reduces the volume of data transmitted, and raises the execution efficiency of MapReduce.
Brief description of the drawings
Fig. 1 shows the architecture of the three-level cache based on HDFS.
Embodiment
The three-level caching mechanism proposed by the invention runs on HDFS, the distributed file system of the Hadoop platform, and comprises three levels, as shown in Fig. 1: (1) task scheduling guided by data-local processing; (2) locality-based access to data in local memory; (3) reuse of data in local memory.
In general, HDFS comprises one NameNode, deployed on the master server, and a large number of DataNodes, deployed on the slave servers; the NameNode is in charge of the metadata of user files. The metadata consists of three parts: the file system directory tree, the correspondence between files and the data blocks into which they are split, and the locations of the data blocks on the DataNodes. A file stored in HDFS is split into equal-sized data blocks (normally 64 MB), and these blocks are replicated and stored on multiple data storage servers. Each data storage server keeps its blocks on local disk under the Linux file system and performs block reads and writes.
The MapReduce computation model is a standard functional-style programming model. The client user implements operations on files by programming. Each user program can be regarded as a job, and a job is decomposed by the JobTracker (the job scheduler) into a number of Map and Reduce tasks. Map tasks read data from HDFS, and Reduce tasks write the processed data back to HDFS, so the three-level caching system proposed by the invention serves mainly the Map tasks, ensuring that a Map task can find the data it wants in the shortest time.
Step 1: task scheduling guided by data-local processing
Because HDFS has a Master/Slave structure, the NameNode and the JobTracker can be deployed on the Master server, and the DataNodes and TaskTrackers on the Slave servers. If the data a Map task is to process happens to be kept on its own server, the Map task can read the data directly from local disk, cutting the time spent transmitting data over the network. When the JobTracker schedules tasks, it preferentially assigns each Map task to a DataNode that contains the data block the task is to process. To achieve this, the input split size is made equal to the data block size, so the host list in the InputSplit metadata contains only one node and fully data-local processing can be realized.
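In stock Hadoop this one-block-per-split arrangement can be expressed roughly as follows (an illustrative sketch, not code from the patent; the property and method names are from the standard Hadoop API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class OneBlockPerSplit {
    public static void main(String[] args) throws Exception {
        long blockSize = 64L * 1024 * 1024; // 64 MB HDFS block, as in the patent

        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", blockSize); // block size for new files

        Job job = Job.getInstance(conf, "locality-demo");
        // Pin the split size to the block size so each InputSplit covers
        // exactly one block and its host list names a single DataNode.
        FileInputFormat.setMinInputSplitSize(job, blockSize);
        FileInputFormat.setMaxInputSplitSize(job, blockSize);
        FileInputFormat.addInputPath(job, new Path(args[0]));
    }
}
```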
The task scheduling process guided by data-local processing is as follows:
1. The user writes a MapReduce program, which creates a new JobClient instance and submits a job request to the JobTracker. The JobTracker receives the JobClient's request and replies; it then determines the range of data the job will read and decomposes the job into a number of Map and Reduce tasks, each Map task processing one portion of the data, that is, one split.
2. For the data each Map task will read, the JobTracker queries the NameNode's metadata to obtain the locations of the DataNodes storing that data, including the locations of the backup replicas.
3. Idle TaskTracker nodes periodically report their status to the JobTracker; from these idle TaskTracker nodes the JobTracker selects a DataNode holding the target data and assigns the corresponding Map task to that node, as sketched below.
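A minimal sketch of this selection step follows. Every name in it (MapTask, IdleNode, LocalityScheduler) is hypothetical shorthand for the JobTracker's bookkeeping, not a real Hadoop class:

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the JobTracker's bookkeeping.
record MapTask(String taskId, String targetBlockId) {}
record IdleNode(String host) {}

class LocalityScheduler {
    // blockId -> hosts of the DataNodes storing it (from NameNode metadata).
    private final Map<String, List<String>> blockLocations;

    LocalityScheduler(Map<String, List<String>> blockLocations) {
        this.blockLocations = blockLocations;
    }

    // Among the idle TaskTrackers, pick one whose DataNode stores the
    // task's target block; null means no data-local assignment exists.
    IdleNode assign(MapTask task, List<IdleNode> idleNodes) {
        List<String> holders = blockLocations.get(task.targetBlockId());
        if (holders == null) return null;
        for (IdleNode node : idleNodes) {
            if (holders.contains(node.host())) {
                return node; // data-local: the Map task reads from local disk
            }
        }
        return null; // caller falls back to a non-local (network) read
    }
}
```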
Step 2: locality-based access to data in local memory
Clearly, disk read speed falls far short of CPU processing speed, so a memory buffer is arranged to bridge the gap between the two: part of the data on disk is read into memory in advance, and when the CPU encounters a read instruction it fetches the corresponding data directly from memory. This shortens the time a Map task spends reading data from local disk. The in-memory data buffer scheduling proposed by the invention is based on locality of access: although any two Map tasks are mutually independent, Map tasks are all decomposed from user programs, so successive tasks obey the principle of locality of access. Data can therefore be loaded into memory in advance, or data read the previous time can be kept in memory rather than being evicted immediately, so that when the next Map task arrives it can read the data directly.
The locality-based in-memory data buffer scheduling process is as follows:
1. The memory space of the server is divided into a number of equal-sized storage regions, each called a page frame. The page frame size is 64 KB; since each data block in HDFS is 64 MB, one data block needs 1024 page frames. Page frames in memory are numbered serially from 0.
2. The page is the basic unit of memory allocation. Four bytes are reserved at the bottom of each page to hold a pointer to the next page or to mark the end of the data block, so each data block is represented in memory as a linked list of pages.
3. To make it easy to find a required data block in memory, a block load table is maintained in the server's memory, showing which data blocks have been loaded into memory, where in memory they sit, and when they were loaded. The entries recorded in this table are therefore: the block sequence number, the block's starting frame number, and the block's access time. When the data blocks in memory have reached the space limit and a new block must be loaded, the Least Recently Used (LRU) replacement algorithm is executed: the block that has gone unvisited for the longest time in the recent past is chosen for replacement, since by the principle of locality of reference a block that has not been accessed for some time is unlikely to be accessed in the near future either; this is why the block load table must record each block's load-in time. In addition, in preparation for running the LRU algorithm, a bitmap of the memory pages is maintained: a fixed region is set aside in memory in which each bit represents one page, set to 1 if the page is allocated and cleared to 0 otherwise, so that the presence of free pages in memory can be determined. The bitmap thus indicates which pages in memory have been allocated and gives the total number of allocated pages. As noted above, memory is divided into identically sized page frames for ease of management, and a data block loaded into memory must be stored page by page, so the bitmap is what manages the allocation of page frames: a set bit means allocated and a cleared bit means free. Before a data block is loaded, the bitmap is checked to see whether enough page frames are free for it; if not, the LRU algorithm is executed first, an earlier data block is evicted from memory, and the vacated page frames are used to load the new block.
4. Suppose the memory is 8 GB in size and 2 GB is reserved for other purposes, so the space genuinely usable as a memory cache is 6 GB. Each HDFS data block is 64 MB, so at most 96 data blocks can be resident in the memory cache at once, and the block load table therefore never grows very large.
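Pulling sub-steps 1 to 4 together, here is one compact illustrative sketch. It is not the patent's implementation: the names are invented, the PageBitmap class from the earlier sketch is reused, and Java's LinkedHashMap in access order stands in for the access timestamps of the block load table:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// One cached HDFS block: its sequence number plus the chain of page
// frames holding it (the in-memory "linked list of pages").
record CachedBlock(long blockSeq, List<Integer> frameChain) {}

class BlockCache {
    static final int FRAMES_PER_BLOCK = 1024; // 64 MB block / 64 KB frame
    static final int MAX_BLOCKS = 96;         // 6 GB cache / 64 MB block

    private final PageBitmap bitmap = new PageBitmap(MAX_BLOCKS * FRAMES_PER_BLOCK);

    // accessOrder=true makes iteration order least- to most-recently used,
    // playing the role of the block load table's access times.
    private final LinkedHashMap<Long, CachedBlock> loadTable =
            new LinkedHashMap<>(MAX_BLOCKS, 0.75f, true);

    // Load a block, evicting the least recently used block if needed.
    CachedBlock load(long blockSeq) {
        CachedBlock hit = loadTable.get(blockSeq);
        if (hit != null) return hit;             // already resident in memory

        while (loadTable.size() >= MAX_BLOCKS) { // LRU eviction
            Map.Entry<Long, CachedBlock> lru = loadTable.entrySet().iterator().next();
            for (int frame : lru.getValue().frameChain()) bitmap.freeFrame(frame);
            loadTable.remove(lru.getKey());
        }

        List<Integer> chain = new ArrayList<>(FRAMES_PER_BLOCK);
        for (int i = 0; i < FRAMES_PER_BLOCK; i++) {
            chain.add(bitmap.allocateFrame());   // claim 1024 free frames
        }
        CachedBlock block = new CachedBlock(blockSeq, chain);
        loadTable.put(blockSeq, block);          // record in the load table
        return block;
    }
}
```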
Step 3: reuse of data in local memory
After a Map task finishes executing, the data it read can be retained in memory. So that the next Map task assigned to this DataNode avoids the needless time spent on memory replacement, the JobTracker should, in the next round of task assignment, give priority to how well the data in a node's memory matches the data a Map task needs to process. The invention therefore proposes task scheduling driven by the reuse of in-memory data.
The task scheduling process driven by in-memory data reuse is as follows:
1. A global cache information management table is maintained on the Master server, recording each Slave node's cache contents and whether the node has enough slot resources. Each Slave server periodically sends a message to the Master server reporting its own cache contents and slot resources. A TaskTracker uses slots to represent the share of resources apportioned on its node; a slot stands for computational resources (CPU, memory and so on), and a Map task only gets the chance to run after it has obtained a slot.
2. When the JobTracker schedules tasks, it first checks the global cache information table; if the cache of some node is found to hold the data a Map task requires, that task is assigned to it preferentially, as sketched below. Otherwise, the data-local scheduling strategy is followed.
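An illustrative sketch of this cache-first decision, under the same caveats as before (invented names, heartbeat plumbing omitted):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// One row of the global cache information table on the Master:
// which blocks a Slave has cached and whether it has free slots.
record SlaveStatus(Set<Long> cachedBlockSeqs, int freeSlots) {}

class GlobalCacheTable {
    private final Map<String, SlaveStatus> table = new ConcurrentHashMap<>();

    // Called when a Slave's periodic report arrives at the Master.
    void report(String host, SlaveStatus status) {
        table.put(host, status);
    }

    // Cache-first scheduling: prefer a node that already holds the block
    // in memory and has a free slot; null means fall back to the
    // data-local strategy of Step 1.
    String pickNodeFor(long blockSeq) {
        for (Map.Entry<String, SlaveStatus> e : table.entrySet()) {
            SlaveStatus s = e.getValue();
            if (s.freeSlots() > 0 && s.cachedBlockSeqs().contains(blockSeq)) {
                return e.getKey();
            }
        }
        return null;
    }
}
```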
Thus, under the three-level caching mechanism based on HDFS proposed by the invention, the JobTracker follows this process for task scheduling and data loading: when assigning Map tasks, it first checks the global cache information table and preferentially assigns a task to a node that has already loaded the target data block into memory; it then follows the data-local strategy and assigns the remaining Map tasks to nodes that store the target blocks. When a Map task executes, the data block it needs to read is loaded into memory, and memory replacement follows the locality-based in-memory data buffer scheduling.
The three-level caching mechanism of the invention, based on HDFS, thus comprises task scheduling guided by data-local processing, in-memory data buffer scheduling based on locality of access, and task scheduling driven by in-memory data reuse. In the locality-based buffer scheduling, data blocks are loaded into memory and stored page by page, each block being represented in memory as a linked list of pages, which makes memory space easy to reuse; memory replacement is carried out through the block load table.

Claims (6)

1. A three-level caching method based on the Hadoop distributed file system, implemented with Apache Hadoop, the method being as follows:
Step 1: task scheduling guided by data-local processing, comprising the sub-steps:
(1) the user submits a job request to the JobTracker, and the JobTracker determines the range of data the job will read and decomposes the job into a number of Map tasks and Reduce tasks;
(2) for the data each Map task will read, the JobTracker queries the NameNode's metadata to obtain the locations of the DataNodes storing that data;
(3) idle TaskTracker nodes periodically report their status to the JobTracker, and from these idle TaskTracker nodes the JobTracker selects a DataNode holding the target data and assigns the corresponding Map task to that node;
Step 2: locality-based access to data in local memory, comprising the sub-steps:
(1) the memory space of the server is divided into a number of equal-sized storage regions, each of which is called a page frame;
(2) the page is the basic unit of memory allocation, bytes reserved at the bottom of each page hold a pointer to the address of the next page or mark the end of the data block, and each data block is represented in memory as a linked list of pages;
(3) a block load table is maintained in memory, and when the data blocks in memory have reached the space limit and a new block must be loaded, a Least Recently Used (LRU) replacement algorithm is executed to replace a block; in addition, a bitmap of the memory pages is maintained in memory;
Step 3: reuse of data in local memory, comprising the sub-steps:
(1) a global cache information management table is maintained on the Master server, recording each Slave node's cache contents and whether the node has enough slot resources, and each Slave server periodically sends a message to the Master server reporting its own cache contents and slot resources;
(2) when the JobTracker schedules tasks, it first checks the global cache information table, and if the cache of some node is found to hold the data a Map task requires, that task is assigned to it preferentially; otherwise, the data-local scheduling strategy is followed.
2. The three-level caching method based on the Hadoop distributed file system according to claim 1, wherein the page frame size in sub-step (1) of Step 2 is 64 KB.
3. The three-level caching method based on the Hadoop distributed file system according to claim 1, wherein the number of reserved bytes in sub-step (2) of Step 2 is 4.
4. The three-level caching method based on the Hadoop distributed file system according to claim 1, wherein the LRU replacement algorithm in sub-step (3) of Step 2 selects for replacement the data block that has gone unvisited for the longest time in the recent past.
5. The three-level caching method based on the Hadoop distributed file system according to claim 1, wherein the block load table in sub-step (3) of Step 2 records the block sequence number, the block's starting frame number, and the block's access time.
6. The three-level caching method based on the Hadoop distributed file system according to claim 1, wherein the bitmap of memory pages in sub-step (3) of Step 2 is maintained by setting aside a fixed region in memory in which each bit represents one page, the bit being set to 1 if the page has been allocated and cleared to 0 otherwise, so that it can be determined whether free pages exist in memory.
CN201410455411.XA 2014-06-24 2014-09-09 Three-level caching method based on Hadoop distributed file system Pending CN104270412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410455411.XA CN104270412A (en) 2014-06-24 2014-09-09 Three-level caching method based on Hadoop distributed file system

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201410287652.8 2014-06-24
CN201410287652 2014-06-24
CN201410455411.XA CN104270412A (en) 2014-06-24 2014-09-09 Three-level caching method based on Hadoop distributed file system

Publications (1)

Publication Number Publication Date
CN104270412A true CN104270412A (en) 2015-01-07

Family

ID=52161901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410455411.XA Pending CN104270412A (en) 2014-06-24 2014-09-09 Three-level caching method based on Hadoop distributed file system

Country Status (1)

Country Link
CN (1) CN104270412A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663117A * 2012-04-18 2012-09-12 中国人民大学 OLAP (On-Line Analytical Processing) query processing method for a hybrid database and Hadoop platform
CN103414761A * 2013-07-23 2013-11-27 北京工业大学 Mobile terminal cloud resource scheduling method based on the Hadoop framework
CN103530387A * 2013-10-22 2014-01-22 浪潮电子信息产业股份有限公司 Improved method for small files in HDFS
CN103617087A * 2013-11-25 2014-03-05 华中科技大学 MapReduce optimization method suitable for iterative computation
CN103761146A * 2014-01-06 2014-04-30 浪潮电子信息产业股份有限公司 Method for dynamically setting the number of slots for MapReduce

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808160B * 2016-02-24 2019-02-05 鄞州浙江清华长三角研究院创新中心 mpCache hybrid storage system based on SSD
CN105808160A * 2016-02-24 2016-07-27 鄞州浙江清华长三角研究院创新中心 mpCache hybrid storage system based on SSD (Solid State Disk)
CN106843073A * 2017-03-23 2017-06-13 云南工商学院 Device for electrical automation control under cloud control
CN106843073B * 2017-03-23 2019-07-30 云南工商学院 Device for electrical automation control under cloud control
CN107229673A * 2017-04-20 2017-10-03 努比亚技术有限公司 Data writing method for the Hbase database, Hbase terminal and storage medium
CN106961670B * 2017-05-02 2019-03-12 千寻位置网络有限公司 Geo-fencing system and working method based on distributed architecture
CN106961670A * 2017-05-02 2017-07-18 千寻位置网络有限公司 Geo-fencing system and working method based on distributed architecture
CN107368608A * 2017-08-07 2017-11-21 杭州电子科技大学 HDFS small-file cache management method based on the ARC replacement algorithm
CN107480071A * 2017-08-25 2017-12-15 深圳大学 Cached data migration method and device
CN107562926B * 2017-09-14 2023-09-26 丙申南京网络技术有限公司 Multi-hadoop distributed file system for big data analysis
CN108984617A * 2018-06-13 2018-12-11 西安交通大学 Metadata directory structure implementation method oriented to the memory cloud
CN112070062A * 2020-09-23 2020-12-11 南京工业职业技术大学 Hadoop-based crop waterlogging image classification detection and implementation method
CN113641648A * 2021-08-18 2021-11-12 山东省计算中心(国家超级计算济南中心) Distributed cloud security storage method, system and storage medium
CN113641648B * 2021-08-18 2023-04-21 山东省计算中心(国家超级计算济南中心) Distributed cloud secure storage method, system and storage medium

Similar Documents

Publication Publication Date Title
CN104270412A (en) Three-level caching method based on Hadoop distributed file system
CN107169083B (en) Mass vehicle data storage and retrieval method and device for public security card port and electronic equipment
CN105144121B Caching content-addressable data chunks for storage virtualization
CN103812939B (en) Big data storage system
CN108885582A (en) Multi-tenant memory services for memory pool architecture
US11561930B2 (en) Independent evictions from datastore accelerator fleet nodes
US20140358977A1 (en) Management of Intermediate Data Spills during the Shuffle Phase of a Map-Reduce Job
WO2019085769A1 (en) Tiered data storage and tiered query method and apparatus
EP3443471B1 (en) Systems and methods for managing databases
US9305112B2 (en) Select pages implementing leaf nodes and internal nodes of a data set index for reuse
CN106716409A (en) Method and system for adaptively building and updating column store database from row store database based on query demands
CN105701219B Implementation method of distributed caching
CN105138679B (en) A kind of data processing system and processing method based on distributed caching
CN104679898A (en) Big data access method
US11080207B2 (en) Caching framework for big-data engines in the cloud
CN103455577A (en) Multi-backup nearby storage and reading method and system of cloud host mirror image file
CN103366016A (en) Electronic file concentrated storing and optimizing method based on HDFS
CN111737168B (en) Cache system, cache processing method, device, equipment and medium
US10191663B1 (en) Using data store accelerator intermediary nodes and write control settings to identify write propagation nodes
CN109697016A (en) Method and apparatus for improving the storage performance of container
CN103491155A (en) Cloud computing method and system for achieving mobile computing and obtaining mobile data
CN103795801A (en) Metadata group design method based on real-time application group
CN103595799A (en) Method for achieving distributed shared data bank
CN104572505A (en) System and method for ensuring eventual consistency of mass data caches
CN106302659A Fast storage method based on a cloud storage system for improving data access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150107